Merge branch 'current' into teradata
amychen1776 authored Oct 4, 2024
2 parents 3d28feb + 78004ee commit cbc7ea2
Showing 8 changed files with 107 additions and 22 deletions.
@@ -102,12 +102,14 @@ We’ve focused heavily thus far on the primary area of action in our dbt projec

### Project splitting

One important, growing consideration in the analytics engineering ecosystem is how and when to split a codebase into multiple dbt projects. Our present stance on this for most projects, particularly for teams starting out, is straightforward: you should avoid it unless you have no other option or it saves you from an even more complex workaround. If you do have the need to split up your project, it’s completely possible through the use of private packages, but the added complexity and separation is, for most organizations, a hindrance, not a help, at present. That said, this is very likely subject to change! [We want to create a world where it’s easy to bring lots of dbt projects together into a cohesive lineage](https://github.com/dbt-labs/dbt-core/discussions/5244). In a world where it’s simple to break up monolithic dbt projects into multiple connected projects, perhaps inside of a modern mono repo, the calculus will be different, and the below situations we recommend against may become totally viable. So watch this space!
One important, growing consideration in the analytics engineering ecosystem is how and when to split a codebase into multiple dbt projects. Currently, our advice for most teams, especially those just starting, is fairly simple: in most cases, we recommend doing so with [dbt Mesh](/best-practices/how-we-mesh/mesh-1-intro)! dbt Mesh allows organizations to handle complexity by connecting several dbt projects rather than relying on one big, monolithic project. This approach is designed to speed up development while maintaining governance.

- ❌ **Business groups or departments.** Conceptual separations within the project are not a good reason to split up your project. Splitting up, for instance, marketing and finance modeling into separate projects will not only add unnecessary complexity but destroy the unifying effect of collaborating across your organization on cohesive definitions and business logic.
- ❌ **ML vs Reporting use cases.** Similarly to the point above, splitting a project up based on different use cases, particularly more standard BI versus ML features, is a common idea. We tend to discourage it for the time being. As with the previous point, a foundational goal of implementing dbt is to create a single source of truth in your organization. The features you’re providing to your data science teams should be coming from the same marts and metrics that serve reports on executive dashboards.
As it becomes easier to break up monolithic dbt projects into smaller, connected projects, potentially within a modern monorepo, the scenarios we currently advise against may soon become feasible. So watch this space!

- ✅ **Business groups or departments.** Conceptual separations within the project are the primary reason to split up your project. This allows your business domains to own their own data products and still collaborate using dbt Mesh. For more information about dbt Mesh, please refer to our [dbt Mesh FAQs](/best-practices/how-we-mesh/mesh-5-faqs).
- ✅ **Data governance.** Structural, organizational needs — such as data governance and security — are one of the few worthwhile reasons to split up a project. If, for instance, you work at a healthcare company with only a small team cleared to access raw data with PII in it, you may need to split out your staging models into their own projects to preserve those policies. In that case, you would import your staging project into the project that builds on those staging models as a [private package](https://docs.getdbt.com/docs/build/packages/#private-packages) (see the sketch after this list).
- ✅ **Project size.** At a certain point, your project may grow to have simply too many models to present a viable development experience. If you have 1000s of models, it absolutely makes sense to find a way to split up your project.
- ❌ **ML vs Reporting use cases.** Similarly to the point above, splitting a project up based on different use cases, particularly more standard BI versus ML features, is a common idea. We tend to discourage it for the time being. As with the previous point, a foundational goal of implementing dbt is to create a single source of truth in your organization. The features you’re providing to your data science teams should be coming from the same marts and metrics that serve reports on executive dashboards.
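
For the data governance case above, the private package import is declared in the downstream project's `packages.yml`. The following is only an illustrative sketch: the repository URL, environment variable, and revision are hypothetical placeholders.

```yaml
# packages.yml in the downstream project (hypothetical repo and credentials)
packages:
  - git: "https://{{ env_var('DBT_ENV_SECRET_GIT_CREDENTIAL') }}@github.com/your-org/staging-project.git"
    revision: 1.0.0  # pin to a tag or commit so upstream changes are adopted deliberately
```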

## Final considerations

32 changes: 20 additions & 12 deletions website/docs/docs/build/snapshots.md
@@ -52,20 +52,25 @@ It is not possible to "preview data" or "compile sql" for snapshots in dbt Cloud

<VersionBlock firstVersion="1.9">

In dbt Cloud Versionless and dbt Core v1.9 and later, snapshots are configurations defined in YAML files (typically in your snapshots directory). You'll configure your snapshot to tell dbt how to detect record changes.
Configure your snapshots in YAML files to tell dbt how to detect record changes. Defining snapshot configurations in YAML, alongside your models, makes for a cleaner, faster, and more consistent setup.

<File name='snapshots/orders_snapshot.yml'>

```yaml
# Before this change (concrete example):
snapshots:
  - name: orders_snapshot
    relation: source('jaffle_shop', 'orders')
    config:
      schema: snapshots
      database: analytics
      unique_key: id
      strategy: timestamp
      updated_at: updated_at

# After this change (full specification):
snapshots:
  - name: string
    relation: relation # source('my_source', 'my_table') or ref('my_model')
    config:
      [database](/reference/resource-configs/database): string
      [schema](/reference/resource-configs/schema): string
      [alias](/reference/resource-configs/alias): string
      [strategy](/reference/resource-configs/strategy): timestamp | check
      [unique_key](/reference/resource-configs/unique_key): column_name_or_expression
      [check_cols](/reference/resource-configs/check_cols): [column_name] | all
      [updated_at](/reference/resource-configs/updated_at): column_name
      [invalidate_hard_deletes](/reference/resource-configs/invalidate_hard_deletes): true | false
      [snapshot_meta_column_names](/reference/resource-configs/snapshot_meta_column_names): dictionary
```

</File>
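
As an illustrative sketch of the specification filled in (the snapshot, model, and column names here are hypothetical), a snapshot using the `check` strategy might look like:

```yaml
snapshots:
  - name: products_snapshot        # hypothetical snapshot name
    relation: ref('stg_products')  # hypothetical staging model
    config:
      unique_key: product_id
      strategy: check
      check_cols: [status, price]   # compare these columns to detect changes
      invalidate_hard_deletes: true # close out records deleted from the source
```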
@@ -82,6 +87,7 @@ The following table outlines the configurations available for snapshots:
| [check_cols](/reference/resource-configs/check_cols) | If using the `check` strategy, then the columns to check | Only if using the `check` strategy | ["status"] |
| [updated_at](/reference/resource-configs/updated_at) | If using the `timestamp` strategy, the timestamp column to compare | Only if using the `timestamp` strategy | updated_at |
| [invalidate_hard_deletes](/reference/resource-configs/invalidate_hard_deletes) | Find hard deleted records in source and set `dbt_valid_to` to current time if the record no longer exists | No | True |
| [snapshot_meta_column_names](/reference/resource-configs/snapshot_meta_column_names) | Customize the names of the snapshot meta fields | No | dictionary |

- In versions prior to v1.9, the `target_schema` (required) and `target_database` (optional) configurations defined a single schema or database to build a snapshot across users and environment. This created problems when testing or developing a snapshot, as there was no clear separation between development and production environments. In v1.9, `target_schema` became optional, allowing snapshots to be environment-aware. By default, without `target_schema` or `target_database` defined, snapshots now use the `generate_schema_name` or `generate_database_name` macros to determine where to build. Developers can still set a custom location with [`schema`](/reference/resource-configs/schema) and [`database`](/reference/resource-configs/database) configs, consistent with other resource types.
- A number of other configurations are also supported (for example, `tags` and `post-hook`). For the complete list, refer to [Snapshot configurations](/reference/snapshot-configs).
@@ -160,7 +166,7 @@ To add a snapshot to your project follow these steps. For users on versions 1.8

### Configuration best practices

<Expandable alt_header="Use thetimestamp strategy where possible">
<Expandable alt_header="Use the timestamp strategy where possible">

This strategy handles column additions and deletions better than the `check` strategy.

@@ -188,9 +194,9 @@ Snapshots can't be rebuilt. Because of this, it's a good idea to put snapshots i

</Expandable>

<Expandable alt_header="Use ephemeral model to clean or tranform data before snapshotting">
<Expandable alt_header="Use ephemeral model to clean or transform data before snapshotting">

If you need to clean or transform your data before snapshotting, create an ephemeral model (or a staging model) that applies the necessary transformations. Then, reference this model in your snapshot configuration. This approach keeps your snapshot definitions clean and allows you to test and run transformations separately.
If you need to clean or transform your data before snapshotting, create an ephemeral model or a staging model that applies the necessary transformations. Then, reference this model in your snapshot configuration. This approach keeps your snapshot definitions clean and allows you to test and run transformations separately.
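
A minimal sketch of this pattern, with hypothetical model and column names: the staging model is configured as ephemeral in its properties file, and the snapshot references it with `ref` instead of snapshotting the raw source directly.

```yaml
# models/staging/_stg_models.yml -- hypothetical staging model that cleans the raw data
models:
  - name: stg_orders
    config:
      materialized: ephemeral  # compiled into the snapshot query as a CTE, not built in the warehouse

# snapshots/orders_snapshot.yml -- snapshot built on top of the cleaned model
snapshots:
  - name: orders_snapshot
    relation: ref('stg_orders')
    config:
      unique_key: order_id
      strategy: timestamp
      updated_at: updated_at
```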

</Expandable>
</VersionBlock>
@@ -203,6 +209,8 @@ When you run the [`dbt snapshot` command](/reference/commands/snapshot):
- The `dbt_valid_to` column will be updated for any existing records that have changed
- The updated record and any new records will be inserted into the snapshot table. These records will now have `dbt_valid_to = null`

Note that these column names can be customized to your team's or organization's conventions using the [snapshot_meta_column_names](#snapshot-meta-fields) config.
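
As a sketch of that customization (the custom names here are hypothetical), the config maps each default meta column to the name your conventions prefer:

```yaml
snapshots:
  - name: orders_snapshot
    relation: source('jaffle_shop', 'orders')
    config:
      unique_key: id
      strategy: timestamp
      updated_at: updated_at
      snapshot_meta_column_names:
        dbt_valid_from: valid_from   # hypothetical custom names
        dbt_valid_to: valid_to
        dbt_scd_id: scd_id
        dbt_updated_at: last_updated
```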

Snapshots can be referenced in downstream models the same way as referencing models — by using the [ref](/reference/dbt-jinja-functions/ref) function.

## Detecting row changes
2 changes: 1 addition & 1 deletion website/docs/docs/dbt-versions/release-notes.md
@@ -26,7 +26,7 @@ Release notes are grouped by month for both multi-tenant and virtual private clo
- **New**: In dbt Cloud Versionless, [Snapshots](/docs/build/snapshots) have been updated to use YAML configuration files instead of SQL snapshot blocks. This new feature simplifies snapshot management and improves performance, and will soon be released in dbt Core 1.9.
- Who does this affect? New users on Versionless can define snapshots using the new YAML specification. Users upgrading to Versionless who use snapshots can keep their existing configuration or can choose to migrate their snapshot definitions to YAML.
- Users on dbt 1.8 and earlier: No action is needed; existing snapshots will continue to work as before. However, we recommend upgrading to Versionless to take advantage of the new snapshot features.
- **Behavior change:** Set [`state_modified_compare_more_unrendered`](/reference/global-configs/behavior-changes#source-definitions-for-state) to true to reduce false positives for `state:modified` when configs differ between `dev` and `prod` environments.
- **Behavior change:** Set [`state_modified_compare_more_unrendered_values`](/reference/global-configs/behavior-changes#source-definitions-for-state) to true to reduce false positives for `state:modified` when configs differ between `dev` and `prod` environments.
- **Behavior change:** Set the [`skip_nodes_if_on_run_start_fails`](/reference/global-configs/behavior-changes#failures-in-on-run-start-hooks) flag to `True` to skip all selected resources from running if there is a failure on an `on-run-start` hook.
- **Enhancement**: In dbt Cloud Versionless, snapshots defined in SQL files can now use `config` defined in `schema.yml` YAML files. This update resolves the previous limitation that required snapshot properties to be defined exclusively in `dbt_project.yml` and/or a `config()` block within the SQL file. This will also be released in dbt Core 1.9.
- **New**: In dbt Cloud Versionless, the `snapshot_meta_column_names` config allows for customizing the snapshot metadata columns. This feature allows an organization to align these automatically-generated column names with their conventions, and will be included in the upcoming dbt Core 1.9 release.
4 changes: 2 additions & 2 deletions website/docs/reference/global-configs/behavior-changes.md
@@ -69,7 +69,7 @@ When we use dbt Cloud in the following table, we're referring to accounts that h
| source_freshness_run_project_hooks | 2024.03 | TBD* | 1.8.0 | 1.9.0 |
| [Redshift] [restrict_direct_pg_catalog_access](/reference/global-configs/redshift-changes#the-restrict_direct_pg_catalog_access-flag) | 2024.09 | TBD* | dbt-redshift v1.9.0 | 1.9.0 |
| skip_nodes_if_on_run_start_fails | 2024.10 | TBD* | 1.9.0 | TBD* |
| state_modified_compare_more_unrendered | 2024.10 | TBD* | 1.9.0 | TBD* |
| state_modified_compare_more_unrendered_values | 2024.10 | TBD* | 1.9.0 | TBD* |
When the dbt Cloud Maturity is "TBD," it means we have not yet determined the exact date when these flags' default values will change. Affected users will see deprecation warnings and will receive emails providing advance warning ahead of the maturity date. In the meantime, if you are seeing a deprecation warning, you can either:
- Migrate your project to support the new behavior, and then set the flag to `True` to stop seeing the warnings.
@@ -85,7 +85,7 @@ Set the `skip_nodes_if_on_run_start_fails` flag to `True` to skip all selected r

The flag is `False` by default.

Set `state_modified_compare_more_unrendered` to `True` to reduce false positives during `state:modified` checks (especially when configs differ by target environment like `prod` vs. `dev`).
Set `state_modified_compare_more_unrendered_values` to `True` to reduce false positives during `state:modified` checks (especially when configs differ by target environment like `prod` vs. `dev`).

Setting the flag to `True` changes the `state:modified` comparison from using rendered values to unrendered values instead. It accomplishes this by persisting `unrendered_config` during model parsing and `unrendered_database` and `unrendered_schema` configs during source parsing.
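
As a minimal sketch, behavior-change flags like these are set under the top-level `flags:` key in `dbt_project.yml`:

```yaml
# dbt_project.yml
flags:
  state_modified_compare_more_unrendered_values: true
  skip_nodes_if_on_run_start_fails: true
```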

@@ -46,15 +46,15 @@ dbt test -s "state:modified" --exclude "test_name:relationships"

<VersionBlock firstVersion="1.9">

To reduce false positives during `state:modified` selection due to env-aware logic, you can set the `state_modified_compare_more_unrendered` [behavior flag](/reference/global-configs/behavior-changes#behavior-change-flags) to `True`.
To reduce false positives during `state:modified` selection due to env-aware logic, you can set the `state_modified_compare_more_unrendered_values` [behavior flag](/reference/global-configs/behavior-changes#behavior-change-flags) to `True`.

</VersionBlock>

<VersionBlock lastVersion="1.8">
State comparison works by identifying discrepancies between two manifests. Those discrepancies could be the result of:

1. Changes made to a project in development
2. Env-aware logic that causes different behavior based on the `target`, env vars, etc., which can be avoided if you upgrade to dbt Core 1.9 and set the `state_modified_compare_more_unrendered` [behavior flag](/reference/global-configs/behavior-changes#behavior-change-flags) to `True`.
2. Env-aware logic that causes different behavior based on the `target`, env vars, etc., which can be avoided if you upgrade to dbt Core 1.9 and set the `state_modified_compare_more_unrendered_values` [behavior flag](/reference/global-configs/behavior-changes#behavior-change-flags) to `True`.

State comparison detects env-aware config in `dbt_project.yml`. This target-based config won't register as a modification:

3 changes: 3 additions & 0 deletions website/docs/reference/resource-configs/snowflake-configs.md
@@ -9,6 +9,8 @@ To-do:
- use the reference doc structure for this article / split into separate articles
--->

<VersionBlock firstVersion="1.9">

## Iceberg table format <Lifecycle status="beta"/>

The dbt-snowflake adapter supports the Iceberg table format. It is available for three of the Snowflake materializations:
@@ -95,6 +97,7 @@ There are some limitations to the implementation you need to be aware of:
- When you use Iceberg tables with dbt, your final query results are materialized in Iceberg format. However, dbt often creates intermediary objects as temporary and transient tables for certain materializations, such as incremental ones. It is not possible to configure these temporary objects to also be Iceberg-formatted. You may see non-Iceberg tables created in the logs to support specific materializations, but they will be dropped after usage.
- You cannot incrementally update a preexisting incremental model to be an Iceberg table. To do so, you must fully rebuild the table with the `--full-refresh` flag.
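
As an illustrative sketch (the model name and external volume are hypothetical, and the exact parameter set may vary by dbt-snowflake version), an Iceberg materialization is typically configured along these lines:

```yaml
models:
  - name: my_iceberg_model            # hypothetical model
    config:
      materialized: table
      table_format: iceberg
      external_volume: my_external_volume  # Snowflake external volume backing the Iceberg data
```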

</VersionBlock>

## Dynamic tables
