# rfc_1_orchestration
This document proposes a new top-level construct in dbt: the `job`. The `job` block will be responsible for "orchestrating" runs of dbt projects.
Presently, dbt runs like a firecracker: It picks a starting point based on the shape of the graph and runs everything in its path until there's nothing left to run. While dbt projects contain models, tests, hooks, and so on, the order and manner in which these constructs are run is largely dictated by dbt.
dbt projects are unduly restrictive in a few different ways.
First, modifiers like `--full-refresh` and `--non-destructive` are all-or-nothing: they are either applied to all models, or no models at all. Ideally, these configurations could be applied to one or more models, selectively, depending on the needs of the user. These configurations might differ between dev and prod, or between hourly and nightly runs of dbt, for instance.
Second, arbitrary (non-model) sql is confined to running in a few different places in dbt:
- Before the entire run (`on-run-start`)
- Before a given model (`pre-hook`)
- After a given model (`post-hook`)
- After the entire run (`on-run-end`)
dbt users should be able to inject arbitrary sql into their dbt runs between individual models, or between subgraphs of models. This might look like vacuuming the tables in the `snowplow` source data schema before a project's `snowplow` models execute in production.

This could also take the shape of inserting records like `(run_id, model_name, start_time, end_time)` into an audit table before and after each model runs, but only in production. The "in production" qualifier here, while minor, makes this task unduly difficult in dbt ~0.9.0. Tasks like this should be simple and straightforward to accomplish in dbt.
Third, "resources" cannot currently be mixed within an invocation of dbt. The commands:

```
$ dbt seed
$ dbt archive
$ dbt run
$ dbt test
```

will load seed data and run archives, models, and tests, respectively. Instead of running four different commands, it should be possible to "orchestrate" the execution of these tasks within a single run of dbt. This will make it possible to, for example, run and test your `snowplow` models in a single command. While saving keystrokes is good and noble, it also makes complex deployments of dbt easier to manage.
Finally, dbt's current yaml-based approach to configuration is unwieldy. The `dbt_project.yml` configuration is tied to the folder hierarchy of models on disk. Moving or renaming files can silently break dbt projects. This makes configuring models inside of packages more difficult than it should be. Further, some configuration options accept jinja code and some do not. This configuration should be at once simpler and more powerful.
While these items are not an exhaustive list of the shortcomings of dbt as it exists today, they do frame the limitations of the existing programming model. The key takeaway is that, in short: right now dbt runs you, but we think that you should be running dbt.
Models, Tests, Operations, and Archives are all represented internally as "Resources" by dbt. The term "Resource" will be used below to refer to objects like tests, models, operations, and archives, as the following principles are not specific to any one dbt construct.
By introducing a new Jinja block, the `job`, dbt can accomplish everything listed above, and more. These `job` blocks will be responsible for 1) selecting resources, 2) applying configuration, and 3) invoking resources.

Jobs can be defined in any source file in the `source-paths` directory of a dbt project.
Job blocks will look like this:

```
{% job default %}
.... code ....
{% endjob %}
```

Job blocks must be named, but require no other configuration. This name allows the job to be invoked from the command line:

```
$ dbt run default
```

Or, more simply:

```
# implicitly run the job named "default"
$ dbt run
```
This new command-line structure means that top-level dbt commands like `dbt test`, `dbt seed`, and `dbt archive` will go away. Instead, all invocations of dbt will go through the `dbt run` command.

Here are two job blocks: one for `dev` and one for `prod`, for a project built using the `snowplow` package.
```
----------------------------------------------
-- A simple development job to run all models
----------------------------------------------
{% job dev %}

-- Run all models in the project
{% do _.select("models").run() %}

{% endjob %}


----------------------------------------------
-- A complex production deployment
----------------------------------------------
{% job prod %}

-- Vacuum all of the source tables in the `snowplow` schema (using the `vacuum_tables_in_schema` macro)
{% do vacuum_tables_in_schema('snowplow') %}

-- Reconfigure `snowplow_sessions` to run in full-refresh mode
{% do _.select('models[name=snowplow_sessions]').config({"full_refresh": True}) %}

-- Select all of the models in the Snowplow package
{% set snowplow_models = _.select("models[package=snowplow]") %}

-- Add a post-hook to vacuum any incremental models in the Snowplow package (using the `vacuum_table` macro)
{% do snowplow_models.select('[materialized=incremental]').onComplete(vacuum_table) %}

-- Run all of the snowplow models (and their parents and children)
{% do snowplow_models.run(parents=True, children=True) %}

-- Insert audit records for each of the previously run models using a macro
{% do snowplow_models.onComplete(insert_audit_records) %}

{% endjob %}
```
Because the `job` block requires users to explicitly invoke resources, dbt must provide a mechanism for selecting resources to run. This selection mechanism must be simple, unambiguous, and comprehensive.

**Simple:** These selectors should be easy and intuitive to write -- constantly trawling through the docs to find the correct syntax would be unpleasant for dbt users. This selection syntax will also likely make its way into the CLI, so it should be reasonably compact and comprehensible.
**Unambiguous:** The existing `--models` selection syntax on the dbt command line is ambiguous. The following command can mean three different things:

```
$ dbt run --models snowplow
```

- Run a model named `snowplow`
- Run the models in the `models/snowplow` directory
- Run the models in the `snowplow` package

A viable resource selection syntax will be totally unambiguous.
**Comprehensive:** The selector syntax should make it easy to select common groupings of resources, and possible to select complex groups of resources. There should never be a class of resources that is impossible to select. If the resources can't be selected, then they can't be run!
The resource selection syntax shown in this document is inspired by jQuery and underscore.js. Both of these libraries are used to select, filter, and operate on complex data structures, so they serve as useful starting points for dbt's graph-selection syntax.
When a `job` block is parsed, a variable will be added to the block's context representing the entire set of defined resources in the project. This document uses an underscore (`_`) for the variable name, but it could equivalently be named `dbt` or `graph` or any other valid Python variable name. This variable is a `Selection` object.
`Selection` objects provide a number of useful functions for interacting with a given selection. Each of these functions returns another `Selection` object, so these selectors can be easily chained! This document proposes a single `select` function, but others like `exclude`, `intersect`, `union`, etc. are both possible and compelling.
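To make the chaining behavior concrete, here is a minimal Python sketch of a `Selection`-like object in which every operation returns a new `Selection`. The internal representation (a list of resource dicts) and the predicate-based filtering are assumptions of this sketch; a real implementation would filter on the string selector syntax instead.

```python
# Sketch only: the class name and the idea that every operation returns
# a fresh Selection come from this proposal; everything else is assumed.

class Selection:
    def __init__(self, resources):
        self.resources = list(resources)

    def select(self, predicate):
        # Return a NEW Selection containing only the matching resources
        return Selection(r for r in self.resources if predicate(r))

    def exclude(self, predicate):
        # The complement of select()
        return Selection(r for r in self.resources if not predicate(r))

    def names(self):
        return [r["name"] for r in self.resources]

# Because every call returns another Selection, calls chain naturally:
graph = Selection([
    {"name": "snowplow_sessions", "package": "snowplow", "materialized": "incremental"},
    {"name": "snowplow_events", "package": "snowplow", "materialized": "table"},
    {"name": "orders", "package": "internal", "materialized": "table"},
])

chained = (graph
    .select(lambda r: r["package"] == "snowplow")
    .exclude(lambda r: r["materialized"] == "incremental"))
print(chained.names())  # ['snowplow_events']
```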
The `select` function will accept one or more string arguments. Each string argument should be in the format:

```
'<resource>[<attribute><qualifier><value>, ...]'
```
- Resource can be one of: {`models`, `tests`, `archives`, `seeds`, `operations`}
- Attribute can be one of:
  - `name`: the name of a model
  - `package`: the name of a package
  - `tags`: the tags attached to a model
  - any configuration option provided to a model, eg. `materialized`
- Qualifier can be one of: {`=`, `!=`, `*=`}
- Value can be any string
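The grammar above is small enough to parse with a regular expression plus a qualifier scan. The following Python sketch shows one way to do it; the function name `parse_selector` and the returned `(resource, filters)` shape are assumptions of this sketch, not part of the proposal.

```python
import re

# Parse a selector string like 'models[package=snowplow]' into a
# resource type plus a list of (attribute, qualifier, value) filters.
SELECTOR_RE = re.compile(r"^(\w+)(?:\[(.*)\])?$")
QUALIFIERS = ("!=", "*=", "=")  # check two-character qualifiers first

def parse_selector(selector):
    match = SELECTOR_RE.match(selector.strip())
    if not match:
        raise ValueError("invalid selector: %s" % selector)
    resource, body = match.groups()
    if resource not in {"models", "tests", "archives", "seeds", "operations"}:
        raise ValueError("unknown resource type: %s" % resource)
    filters = []
    for clause in (body.split(",") if body else []):
        clause = clause.strip()
        for qualifier in QUALIFIERS:
            if qualifier in clause:
                attribute, value = clause.split(qualifier, 1)
                filters.append((attribute.strip(), qualifier, value.strip()))
                break
        else:
            raise ValueError("invalid clause: %s" % clause)
    return resource, filters

print(parse_selector("models[package=snowplow, materialized=incremental]"))
# ('models', [('package', '=', 'snowplow'), ('materialized', '=', 'incremental')])
print(parse_selector("tests[tags*=base-model]"))
# ('tests', [('tags', '*=', 'base-model')])
```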
Here are some examples of valid selectors:

```
# Select a single model
'models[name=snowplow_sessions]'

# Select all of the models in a package
'models[package=snowplow]'

# Select models by an attribute
'models[materialized=table]'

# Select models by multiple attributes
'models[package=snowplow, materialized=incremental]'

# Select all tests
'tests'

# Select all tests containing the tag `base-model`
'tests[tags*=base-model]'

# Select all archives
'archives'
```
These selectors can be used to select nodes using the `select` function:

```
# Select all models in the snowplow package OR materialized as tables
_.select('models[package=snowplow]', 'models[materialized=table]')
```
Finally, `Selection` objects should provide the following methods:

- `children()`: get all children nodes of the selected nodes
- `parents()`: get all parent nodes of the selected nodes

These methods also return `Selection` objects.
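Conceptually, `parents()` and `children()` are transitive walks over the project DAG. Here is a minimal Python sketch of that idea, assuming the graph is stored as a mapping from each node to its direct parents (the storage format and model names here are assumptions).

```python
# Sketch: each node maps to its list of direct parents.
PARENTS = {
    "snowplow_sessions": ["snowplow_events"],
    "snowplow_events": ["base_events"],
    "base_events": [],
}

def parents(selected, edges=PARENTS):
    # Walk upward from every selected node, collecting all ancestors
    found, stack = set(), list(selected)
    while stack:
        for parent in edges.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def children(selected, edges=PARENTS):
    # Invert the edges, then reuse the same upward walk
    inverted = {node: [] for node in edges}
    for node, node_parents in edges.items():
        for parent in node_parents:
            inverted[parent].append(node)
    return parents(selected, inverted)

print(sorted(parents({"snowplow_sessions"})))  # ['base_events', 'snowplow_events']
print(sorted(children({"base_events"})))       # ['snowplow_events', 'snowplow_sessions']
```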
The `Selection` object returned by calls to `select()` will provide a function, `config()`, intended to configure resources. This `config` function will work just like the existing `config()` implementation, with the notable exception that it can be called more than one time. Subsequent calls to `config` will override previous configuration settings. Note that `config` is called on a set of resources -- even if that set only contains one element.
```
# Configure all models to be materialized as tables
_.select('models').config(materialized='table')
# OR:
_.select('models').config({'materialized': 'table'})

# Configure a specific model
_.select('models[name=snowplow_sessions]').config(materialized='table')
```
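The "later calls override earlier calls" semantics can be sketched as a shallow layered merge, where each `config` call updates whatever was set before. The shallow-merge strategy is an assumption of this sketch; the proposal only specifies that later calls win.

```python
# Sketch: each config() call layers its settings over the current ones.
def apply_config(existing, **settings):
    merged = dict(existing)
    merged.update(settings)  # later calls win
    return merged

config = {}
config = apply_config(config, materialized="view", enabled=True)
config = apply_config(config, materialized="table")  # overrides "view"
print(config)  # {'materialized': 'table', 'enabled': True}
```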
If a `Selection` object contains a single model, that model can be "replaced" with another model. This is useful for augmenting models defined in packages, for instance. This syntax looks like:

```
{% set local_model = _.select('models[name=snowplow_sessions_local, package=internal_analytics]') %}
{% set package_model = _.select('models[name=snowplow_sessions, package=snowplow]') %}
{% do package_model.replace_with(local_model) %}
{% do package_model.run() %}
```
Or:

```
{% set local_model = _.find('snowplow_sessions_local') %}
{% do _.find('snowplow_sessions').replace_with(local_model) %}
{% do _.find('snowplow_sessions').run() %}
```

Here, the `find` function is like `select`, except it returns a `Selection` containing a single model uniquely identified by its name.
Hooks can be added to resources using the `onStart` and `onComplete` functions of a `Selection` object. In this way, hooks can be applied to subsets of models, but they can also vary across jobs. This might look like running vacuum hooks in production, but not in development, for instance. The interface for these methods looks like:

```
Selection.onStart(sql_or_macro)
Selection.onComplete(sql_or_macro)
```
The `onStart` and `onComplete` functions can be called either with a SQL string or with a macro. Macros provided to `onStart` should accept one argument: a resource object. Macros provided to `onComplete` should accept either one or two arguments: a `resource` object (required) and a `result` object (optional). If a macro is used, it should return runnable SQL that will be executed by dbt.
In practice, this code will look like:

```
_.select('models').onComplete("grant select on table {{ this }} to BI_USER")
```

Or, better, use a macro:

```
{% macro grant_model(this) %}
  grant select on table {{ this }} to BI_USER;
{% endmacro %}

_.select('models').onComplete(grant_model)
```

Finally, arbitrary SQL can be executed using the `sql` function:

```
sql('grant select on all tables in schema {{ target.schema }} to BI_USER')
```
In addition to being configurable, `Selection` objects are also executable. All of the resources selected by a `Selection` can be executed using the `.run()` function. The signature for this function looks like:

```
Selection.run(parents=False, children=False)
```

This function is supported for all resource types. Whereas calling `run` on a model will execute that model, calling `run` on a test will run the test and report on results.
The `run` function will store the results of the executed resources in memory. These `Result` objects will contain information about the execution of the selected resources, including the execution status and start/end/elapsed time. They can be accessed through `onComplete` hooks, or via a global variable.
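A sketch of what such a `Result` object might look like, assuming fields for status and timing. The exact attribute names and the `run_with_result` wrapper are assumptions of this sketch, not the proposed API.

```python
import time
from dataclasses import dataclass

@dataclass
class Result:
    # Assumed fields: the proposal only says results carry status and timing.
    model_name: str
    status: str
    start_time: float
    end_time: float

    @property
    def elapsed(self):
        return self.end_time - self.start_time

def run_with_result(model_name, execute):
    # Wrap an execution callable and record its outcome as a Result.
    start = time.time()
    try:
        execute()
        status = "success"
    except Exception:
        status = "error"
    return Result(model_name, status, start, time.time())

result = run_with_result("snowplow_sessions", lambda: None)
print(result.status)  # success
```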
While `job`s are the core construct responsible for orchestrating dbt runs, resources can also be configured outside of a job. This is useful for 1) applying configs to groups of models across all `job`s and 2) configuring models contained within packages. This syntax would look like:

```
{% do _.select('models[package=snowplow]').config({"vars": {"events_table": ref('base_events')}}) %}
{% do _.select('models[package=mailchimp]').config(enabled=False) %}

{% job default %}
{% do _.select('models').run() %}
{% endjob %}
```
Code inside of the `job` block should be executed lazily. Functions like `Selection.run` or `Selection.onComplete` should translate to a set of instructions to execute, but they should not immediately execute themselves. This will enable features like `dbt run --dry` or `dbt info [model-name]` to statically understand the entirety of the job without needing to actually run it.
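The lazy-evaluation idea can be sketched as follows: selection methods append instructions to a plan instead of executing anything, so a dry-run mode can inspect the plan without touching the warehouse. All class and attribute names here are assumptions of this sketch.

```python
# Sketch: method calls record instructions; nothing executes eagerly.
class LazySelection:
    def __init__(self, selector, plan):
        self.selector = selector
        self.plan = plan  # shared list of (instruction, selector, args)

    def run(self, parents=False, children=False):
        self.plan.append(("run", self.selector,
                          {"parents": parents, "children": children}))

    def onComplete(self, hook):
        self.plan.append(("on_complete", self.selector, hook))

plan = []
models = LazySelection("models[package=snowplow]", plan)
models.onComplete("vacuum_table")
models.run(parents=True)

# Nothing has executed yet -- the plan can be printed (a dry run)
# or handed to an executor that walks it step by step.
for step in plan:
    print(step)
```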