From 835fadbcdb327825948cd58ca2514376aed06b1a Mon Sep 17 00:00:00 2001 From: Ben Cassell <98852248+benc-db@users.noreply.github.com> Date: Thu, 7 Sep 2023 15:28:06 -0700 Subject: [PATCH 1/8] Update databricks-configs.md to include MV/ST * Expanded configuration table to indicate that we now support all of the main table config options in Python as well as SQL, with the exception of Liquid Clustering, which is not yet part of PySpark. * Added section describing support for materialized_view and streaming_table. --- .../resource-configs/databricks-configs.md | 55 ++++++++++++++++--- 1 file changed, 47 insertions(+), 8 deletions(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index dc7f0cd53e3..9dfeb9ef32a 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -7,14 +7,14 @@ id: "databricks-configs" When materializing a model as `table`, you may include several optional configs that are specific to the dbt-databricks plugin, in addition to the standard [model configs](/reference/model-configs). -| Option | Description | Required? | Example | -|---------|------------------------------------------------------------------------------------------------------------------------------------|-------------------------|--------------------------| -| file_format | The file format to use when creating tables (`parquet`, `delta`, `hudi`, `csv`, `json`, `text`, `jdbc`, `orc`, `hive` or `libsvm`). | Optional | `delta`| -| location_root | The created table uses the specified directory to store its data. The table alias is appended to it. | Optional | `/mnt/root` | -| partition_by | Partition the created table by the specified columns. A directory is created for each partition. | Optional | `date_day` | -| liquid_clustered_by | Cluster the created table by the specified columns. Clustering method is based on [Delta's Liquid Clustering feature](https://docs.databricks.com/en/delta/clustering.html). Available since dbt-databricks 1.6.2. | Optional | `date_day` | -| clustered_by | Each partition in the created table will be split into a fixed number of buckets by the specified columns. | Optional | `country_code` | -| buckets | The number of buckets to create while clustering | Required if `clustered_by` is specified | `8` | +| Option | Description | Required? | Model Support | Example | +|---------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------|---------------|----------------| +| file_format | The file format to use when creating tables (`parquet`, `delta`, `hudi`, `csv`, `json`, `text`, `jdbc`, `orc`, `hive` or `libsvm`). | Optional | SQL, Python | `delta` | +| location_root | The created table uses the specified directory to store its data. The table alias is appended to it. | Optional | SQL, Python | `/mnt/root` | +| partition_by | Partition the created table by the specified columns. A directory is created for each partition. | Optional | SQL, Python | `date_day` | +| liquid_clustered_by | Cluster the created table by the specified columns. Clustering method is based on [Delta's Liquid Clustering feature](https://docs.databricks.com/en/delta/clustering.html). Available since dbt-databricks 1.6.2. | Optional | SQL | `date_day` | +| clustered_by | Each partition in the created table will be split into a fixed number of buckets by the specified columns. | Optional | SQL, Python | `country_code` | +| buckets | The number of buckets to create while clustering | Required if `clustered_by` is specified | SQL, Python | `8` | ## Incremental models @@ -281,3 +281,42 @@ snapshots: ``` + + +## Advanced Materializations + +Starting with version 1.6.0, we are rolling out support for [materialized views](https://docs.databricks.com/en/sql/user/materialized-views.html) and [streaming tables](https://docs.databricks.com/en/sql/load-data-streaming-table.html), alternatives to incremental tables that are powered by (Delta Live Tables)[https://docs.databricks.com/en/delta-live-tables/index.html]. +Read their entries [here](https://docs.databricks.com/en/delta-live-tables/index.html#what-are-delta-live-tables-datasets) to understand their use cases. +These features are still in preview, and the support in the dbt-databricks adapter should, for now, be considered experimental. +In order to adopt these materialization strategies, you will need a workspace that is enabled for Unity Catalog and serverless SQL Warehouses. + + + +```sql +{{ config( + materialized = 'materialized_view' + ) }} +``` + + + +or + + + +```sql +{{ config( + materialized = 'streaming_table' + ) }} +``` + + + +When a pre-existing relation of these types is detected, a `REFRESH` [command](https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-refresh-full.html) is issued. + +### Limitations + +As mentioned above, support for these materializations should be considered experimental. +At this time we do not support specifying a refresh schedule for these materializations. +Nor do we monitor changes to respect the `on_configuration_change` settings. +Both of these configuration settings are expected to be supported in the near future. From 28140e2a0d73f8de4a806f30a95233428bf6e582 Mon Sep 17 00:00:00 2001 From: Ben Cassell <98852248+benc-db@users.noreply.github.com> Date: Thu, 7 Sep 2023 15:37:34 -0700 Subject: [PATCH 2/8] Small changes to conform better to the style guide. --- .../reference/resource-configs/databricks-configs.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index 9dfeb9ef32a..d0c050c9aee 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -287,7 +287,7 @@ snapshots: Starting with version 1.6.0, we are rolling out support for [materialized views](https://docs.databricks.com/en/sql/user/materialized-views.html) and [streaming tables](https://docs.databricks.com/en/sql/load-data-streaming-table.html), alternatives to incremental tables that are powered by (Delta Live Tables)[https://docs.databricks.com/en/delta-live-tables/index.html]. Read their entries [here](https://docs.databricks.com/en/delta-live-tables/index.html#what-are-delta-live-tables-datasets) to understand their use cases. -These features are still in preview, and the support in the dbt-databricks adapter should, for now, be considered experimental. +These features are still in preview, and the support in the dbt-databricks adapter should, for now, be considered _experimental_. In order to adopt these materialization strategies, you will need a workspace that is enabled for Unity Catalog and serverless SQL Warehouses. @@ -317,6 +317,9 @@ When a pre-existing relation of these types is detected, a `REFRESH` [command](h ### Limitations As mentioned above, support for these materializations should be considered experimental. -At this time we do not support specifying a refresh schedule for these materializations. -Nor do we monitor changes to respect the `on_configuration_change` settings. +At this time the following configuration options are not available: + +* Specifying a refresh schedule for these materializations +* Specifying `on_configuration_change` settings. + Both of these configuration settings are expected to be supported in the near future. From 06c8f1f504d664dd130b6418b7bd88341e652b46 Mon Sep 17 00:00:00 2001 From: Ben Cassell <98852248+benc-db@users.noreply.github.com> Date: Thu, 7 Sep 2023 15:44:38 -0700 Subject: [PATCH 3/8] Using more active voice --- .../docs/reference/resource-configs/databricks-configs.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index d0c050c9aee..885bc88b2a5 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -285,7 +285,7 @@ snapshots: ## Advanced Materializations -Starting with version 1.6.0, we are rolling out support for [materialized views](https://docs.databricks.com/en/sql/user/materialized-views.html) and [streaming tables](https://docs.databricks.com/en/sql/load-data-streaming-table.html), alternatives to incremental tables that are powered by (Delta Live Tables)[https://docs.databricks.com/en/delta-live-tables/index.html]. +Starting with version 1.6.0, the dbt-databricks adapter supports [materialized views](https://docs.databricks.com/en/sql/user/materialized-views.html) and [streaming tables](https://docs.databricks.com/en/sql/load-data-streaming-table.html), as alternatives to incremental tables that are powered by (Delta Live Tables)[https://docs.databricks.com/en/delta-live-tables/index.html]. Read their entries [here](https://docs.databricks.com/en/delta-live-tables/index.html#what-are-delta-live-tables-datasets) to understand their use cases. These features are still in preview, and the support in the dbt-databricks adapter should, for now, be considered _experimental_. In order to adopt these materialization strategies, you will need a workspace that is enabled for Unity Catalog and serverless SQL Warehouses. @@ -312,14 +312,14 @@ or -When a pre-existing relation of these types is detected, a `REFRESH` [command](https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-refresh-full.html) is issued. +When dbt detects a pre-existing relation of one of these types, it issues a `REFRESH` [command](https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-refresh-full.html). ### Limitations -As mentioned above, support for these materializations should be considered experimental. +As mentioned above, support for these materializations in the Databricks adapter is still limited. At this time the following configuration options are not available: * Specifying a refresh schedule for these materializations * Specifying `on_configuration_change` settings. -Both of these configuration settings are expected to be supported in the near future. +These configuration settings will be supported in the near future. From ba10bed55211987ce5378ea5d6aaa19a9977812b Mon Sep 17 00:00:00 2001 From: Ben Cassell <98852248+benc-db@users.noreply.github.com> Date: Thu, 7 Sep 2023 15:48:28 -0700 Subject: [PATCH 4/8] Be clearer about a link destination --- website/docs/reference/resource-configs/databricks-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index 885bc88b2a5..6518a316652 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -286,7 +286,7 @@ snapshots: ## Advanced Materializations Starting with version 1.6.0, the dbt-databricks adapter supports [materialized views](https://docs.databricks.com/en/sql/user/materialized-views.html) and [streaming tables](https://docs.databricks.com/en/sql/load-data-streaming-table.html), as alternatives to incremental tables that are powered by (Delta Live Tables)[https://docs.databricks.com/en/delta-live-tables/index.html]. -Read their entries [here](https://docs.databricks.com/en/delta-live-tables/index.html#what-are-delta-live-tables-datasets) to understand their use cases. +Read their entries in the [Delta Live Tables documentation site](https://docs.databricks.com/en/delta-live-tables/index.html#what-are-delta-live-tables-datasets) to understand their use cases. These features are still in preview, and the support in the dbt-databricks adapter should, for now, be considered _experimental_. In order to adopt these materialization strategies, you will need a workspace that is enabled for Unity Catalog and serverless SQL Warehouses. From 2b54b817bd2adae6ac49bf0bf7eba05b54126ee6 Mon Sep 17 00:00:00 2001 From: Ben Cassell <98852248+benc-db@users.noreply.github.com> Date: Fri, 8 Sep 2023 16:18:41 -0700 Subject: [PATCH 5/8] Apply suggestions from code review Co-authored-by: Anders --- .../docs/reference/resource-configs/databricks-configs.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index 6518a316652..bb987d5352b 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -283,10 +283,10 @@ snapshots: -## Advanced Materializations +## Materialized views and streaming stables -Starting with version 1.6.0, the dbt-databricks adapter supports [materialized views](https://docs.databricks.com/en/sql/user/materialized-views.html) and [streaming tables](https://docs.databricks.com/en/sql/load-data-streaming-table.html), as alternatives to incremental tables that are powered by (Delta Live Tables)[https://docs.databricks.com/en/delta-live-tables/index.html]. -Read their entries in the [Delta Live Tables documentation site](https://docs.databricks.com/en/delta-live-tables/index.html#what-are-delta-live-tables-datasets) to understand their use cases. +Starting with version 1.6.0, the dbt-databricks adapter supports [materialized views](https://docs.databricks.com/en/sql/user/materialized-views.html) and [streaming tables](https://docs.databricks.com/en/sql/load-data-streaming-table.html), as alternatives to incremental tables that are powered by [Delta Live Tables](https://docs.databricks.com/en/delta-live-tables/index.html). +See [What are Delta Live Tables?](https://docs.databricks.com/en/delta-live-tables/index.html#what-are-delta-live-tables-datasets) for more information and use cases. These features are still in preview, and the support in the dbt-databricks adapter should, for now, be considered _experimental_. In order to adopt these materialization strategies, you will need a workspace that is enabled for Unity Catalog and serverless SQL Warehouses. From cb6b26c99d8a4f351e11e29c8989af5d82249a7c Mon Sep 17 00:00:00 2001 From: Ben Cassell <98852248+benc-db@users.noreply.github.com> Date: Mon, 11 Sep 2023 09:33:50 -0700 Subject: [PATCH 6/8] Updating MV/ST section Updating to indicate this is a 1.6 feature with limitations that we plan to address in the 1.7 timeframe. --- .../reference/resource-configs/databricks-configs.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index 5f3875d03be..bbccabecc7a 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -282,7 +282,7 @@ snapshots: - + ## Materialized views and streaming stables Starting with version 1.6.0, the dbt-databricks adapter supports [materialized views](https://docs.databricks.com/en/sql/user/materialized-views.html) and [streaming tables](https://docs.databricks.com/en/sql/load-data-streaming-table.html), as alternatives to incremental tables that are powered by [Delta Live Tables](https://docs.databricks.com/en/delta-live-tables/index.html). @@ -321,5 +321,8 @@ At this time the following configuration options are not available: * Specifying a refresh schedule for these materializations * Specifying `on_configuration_change` settings. - -These configuration settings will be supported in the near future. + +Additionally, if you change the model definition of your materialized view or streaming table, you will need to drop the materialization in your warehouse directly before running dbt again; otherwise, you will get a refresh error. + +We plan to address these limitations during the 1.7.x timeframe. + From f38f3367b962f9a9637fb5fbb7186c2a6cfc9481 Mon Sep 17 00:00:00 2001 From: Ben Cassell <98852248+benc-db@users.noreply.github.com> Date: Wed, 13 Sep 2023 14:50:18 -0700 Subject: [PATCH 7/8] Update databricks-configs.md to indicate that 1.6 brought expanded python config support --- .../resource-configs/databricks-configs.md | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index bbccabecc7a..0d1b8044111 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -7,6 +7,21 @@ id: "databricks-configs" When materializing a model as `table`, you may include several optional configs that are specific to the dbt-databricks plugin, in addition to the standard [model configs](/reference/model-configs). + + +| Option | Description | Required? | Example | +|---------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------|----------------| +| file_format | The file format to use when creating tables (`parquet`, `delta`, `hudi`, `csv`, `json`, `text`, `jdbc`, `orc`, `hive` or `libsvm`). | Optional | `delta` | +| location_root | The created table uses the specified directory to store its data. The table alias is appended to it. | Optional | `/mnt/root` | +| partition_by | Partition the created table by the specified columns. A directory is created for each partition. | Optional | `date_day` | +| liquid_clustered_by | Cluster the created table by the specified columns. Clustering method is based on [Delta's Liquid Clustering feature](https://docs.databricks.com/en/delta/clustering.html). Available since dbt-databricks 1.6.2. | Optional | `date_day` | +| clustered_by | Each partition in the created table will be split into a fixed number of buckets by the specified columns. | Optional | `country_code` | +| buckets | The number of buckets to create while clustering | Required if `clustered_by` is specified | `8` | + + + + + | Option | Description | Required? | Model Support | Example | |---------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------|---------------|----------------| | file_format | The file format to use when creating tables (`parquet`, `delta`, `hudi`, `csv`, `json`, `text`, `jdbc`, `orc`, `hive` or `libsvm`). | Optional | SQL, Python | `delta` | @@ -16,6 +31,8 @@ When materializing a model as `table`, you may include several optional configs | clustered_by | Each partition in the created table will be split into a fixed number of buckets by the specified columns. | Optional | SQL, Python | `country_code` | | buckets | The number of buckets to create while clustering | Required if `clustered_by` is specified | SQL, Python | `8` | + + ## Incremental models dbt-databricks plugin leans heavily on the [`incremental_strategy` config](/docs/build/incremental-models#about-incremental_strategy). This config tells the incremental materialization how to build models in runs beyond their first. It can be set to one of three values: From 78110579169f5a8b5e0d59175d71b87cada8f015 Mon Sep 17 00:00:00 2001 From: Ben Cassell <98852248+benc-db@users.noreply.github.com> Date: Fri, 15 Sep 2023 16:25:35 -0700 Subject: [PATCH 8/8] Update databricks-configs.md to add documentation for replace_where --- .../resource-configs/databricks-configs.md | 95 ++++++++++++++++++- 1 file changed, 93 insertions(+), 2 deletions(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index 0d1b8044111..5d86b437274 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -35,10 +35,11 @@ When materializing a model as `table`, you may include several optional configs ## Incremental models -dbt-databricks plugin leans heavily on the [`incremental_strategy` config](/docs/build/incremental-models#about-incremental_strategy). This config tells the incremental materialization how to build models in runs beyond their first. It can be set to one of three values: +dbt-databricks plugin leans heavily on the [`incremental_strategy` config](/docs/build/incremental-models#about-incremental_strategy). This config tells the incremental materialization how to build models in runs beyond their first. It can be set to one of four values: - **`append`** (default): Insert new records without updating or overwriting any existing data. - **`insert_overwrite`**: If `partition_by` is specified, overwrite partitions in the with new data. If no `partition_by` is specified, overwrite the entire table with new data. - - **`merge`** (Delta and Hudi file format only): Match records based on a `unique_key`; update old records, insert new ones. (If no `unique_key` is specified, all new data is inserted, similar to `append`.) + - **`merge`** (Delta and Hudi file format only): Match records based on a `unique_key`, updating old records, and inserting new ones. (If no `unique_key` is specified, all new data is inserted, similar to `append`.) + - **`replace_where`** (Delta file format only): Match records based on `incremental_predicates`, replacing all records that match the predicates from the existing table with records matching the predicates from the new data. (If no `incremental_predicates` are specified, all new data is inserted, similar to `append`.) Each of these strategies has its pros and cons, which we'll discuss below. As with any model config, `incremental_strategy` may be specified in `dbt_project.yml` or within a model file's `config()` block. @@ -265,6 +266,96 @@ merge into analytics.merge_incremental as DBT_INTERNAL_DEST +### The `replace_where` strategy + +The `replace_where` incremental strategy requires: +- `file_format: delta` +- Databricks Runtime 12.0 and above + +dbt will run an [atomic `replace where` statement](https://docs.databricks.com/en/delta/selective-overwrite.html#arbitrary-selective-overwrite-with-replacewhere) which selectively overwrites data matching one or more `incremental_predicates` specified as a string or array. Only rows matching the predicates will be inserted. If no `incremental_predicates` are specified, dbt will perform an atomic insert, as with `append`. + +:::caution + +`replace_where` inserts data into columns in the order provided, rather than by column name. If you reorder columns and the data is compatible with the existing schema, you may silently insert values into an unexpected column. If the incoming data is incompatible with the existing schema, you will instead receive an error. + +::: + + + + + + +```sql +{{ config( + materialized='incremental', + file_format='delta', + incremental_strategy = 'replace_where' + incremental_predicates = 'user_id >= 10000' # Never replace users with ids < 10000 +) }} + +with new_events as ( + + select * from {{ ref('events') }} + + {% if is_incremental() %} + where date_day >= date_add(current_date, -1) + {% endif %} + +) + +select + user_id, + max(date_day) as last_seen + +from events +group by 1 +``` + + + + + + + +```sql +create temporary view replace_where__dbt_tmp as + + with new_events as ( + + select * from analytics.events + + + where date_day >= date_add(current_date, -1) + + + ) + + select + user_id, + max(date_day) as last_seen + + from events + group by 1 + +; + +insert into analytics.replace_where_incremental + replace where user_id >= 10000 + table `replace_where__dbt_tmp` +``` + + + + + + + ## Persisting model descriptions Relation-level docs persistence is supported in dbt v0.17.0. For more