diff --git a/website/docs/guides/airflow-and-dbt-cloud.md b/website/docs/guides/airflow-and-dbt-cloud.md index a3ff59af14e..e7f754ef02d 100644 --- a/website/docs/guides/airflow-and-dbt-cloud.md +++ b/website/docs/guides/airflow-and-dbt-cloud.md @@ -11,50 +11,29 @@ recently_updated: true ## Introduction -In some cases, [Airflow](https://airflow.apache.org/) may be the preferred orchestrator for your organization over working fully within dbt Cloud. There are a few reasons your team might be considering using Airflow to orchestrate your dbt jobs: - -- Your team is already using Airflow to orchestrate other processes -- Your team needs to ensure that a [dbt job](https://docs.getdbt.com/docs/dbt-cloud/cloud-overview#schedule-and-run-dbt-jobs-in-production) kicks off before or after another process outside of dbt Cloud -- Your team needs flexibility to manage more complex scheduling, such as kicking off one dbt job only after another has completed -- Your team wants to own their own orchestration solution -- You need code to work right now without starting from scratch - -### Prerequisites - -- [dbt Cloud Teams or Enterprise account](https://www.getdbt.com/pricing/) (with [admin access](https://docs.getdbt.com/docs/cloud/manage-access/enterprise-permissions)) in order to create a service token. Permissions for service tokens can be found [here](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens#permissions-for-service-account-tokens). -- A [free Docker account](https://hub.docker.com/signup) in order to sign in to Docker Desktop, which will be installed in the initial setup. -- A local digital scratchpad for temporarily copy-pasting API keys and URLs - -### Airflow + dbt Core - -There are [so many great examples](https://gitlab.com/gitlab-data/analytics/-/blob/master/dags/transformation/dbt_snowplow_backfill.py) from GitLab through their open source data engineering work. This is especially appropriate if you are well-versed in Kubernetes, CI/CD, and docker task management when building your airflow pipelines. If this is you and your team, you’re in good hands reading through more details [here](https://about.gitlab.com/handbook/business-technology/data-team/platform/infrastructure/#airflow) and [here](https://about.gitlab.com/handbook/business-technology/data-team/platform/dbt-guide/). - -### Airflow + dbt Cloud API w/Custom Scripts - -This has served as a bridge until the fabled Astronomer + dbt Labs-built dbt Cloud provider became generally available [here](https://registry.astronomer.io/providers/dbt%20Cloud/versions/latest). - -There are many different permutations of this over time: - -- [Custom Python Scripts](https://github.com/sungchun12/airflow-dbt-cloud/blob/main/archive/dbt_cloud_example.py): This is an airflow DAG based on [custom python API utilities](https://github.com/sungchun12/airflow-dbt-cloud/blob/main/archive/dbt_cloud_utils.py) -- [Make API requests directly through the BashOperator based on the docs](https://docs.getdbt.com/dbt-cloud/api-v2-legacy#operation/triggerRun): You can make cURL requests to invoke dbt Cloud to do what you want -- For more options, check out the [official dbt Docs](/docs/deploy/deployments#airflow) on the various ways teams are running dbt in airflow - -These solutions are great, but can be difficult to trust as your team grows and management for things like: testing, job definitions, secrets, and pipelines increase past your team’s capacity. Roles become blurry (or were never clearly defined at the start!). Both data and analytics engineers start digging through custom logging within each other’s workflows to make heads or tails of where and what the issue really is. Not to mention that when the issue is found, it can be even harder to decide on the best path forward for safely implementing fixes. This complex workflow and unclear delineation on process management results in a lot of misunderstandings and wasted time just trying to get the process to work smoothly! +Many organization already use [Airflow](https://airflow.apache.org/) to orchestrate their data workflows. dbt Cloud works great with Airflow, letting you execute your dbt code in dbt Cloud while keeping orchestration duties with Airflow. This ensures your project's metadata (important for tools like dbt Explorer) is available and up-to-date, while still enabling you to use Airflow for general tasks such as: +- Scheduling other processes outside of dbt runs +- Ensuring that a [dbt job](/docs/deploy/job-scheduler) kicks off before or after another process outside of dbt Cloud +- Triggering a dbt job only after another has completed In this guide, you'll learn how to: -1. Creating a working local Airflow environment -2. Invoking a dbt Cloud job with Airflow (with proof!) -3. Reusing tested and trusted Airflow code for your specific use cases +1. Create a working local Airflow environment +2. Invoke a dbt Cloud job with Airflow +3. Reuse tested and trusted Airflow code for your specific use cases You’ll also gain a better understanding of how this will: - Reduce the cognitive load when building and maintaining pipelines - Avoid dependency hell (think: `pip install` conflicts) -- Implement better recoveries from failures -- Define clearer workflows so that data and analytics engineers work better, together ♥️ +- Define clearer handoff of workflows between data engineers and analytics engineers + +## Prerequisites +- [dbt Cloud Teams or Enterprise account](https://www.getdbt.com/pricing/) (with [admin access](/docs/cloud/manage-access/enterprise-permissions)) in order to create a service token. Permissions for service tokens can be found [here](/docs/dbt-cloud-apis/service-tokens#permissions-for-service-account-tokens). +- A [free Docker account](https://hub.docker.com/signup) in order to sign in to Docker Desktop, which will be installed in the initial setup. +- A local digital scratchpad for temporarily copy-pasting API keys and URLs 🙌 Let’s get started! 🙌 @@ -72,7 +51,7 @@ brew install astro ## Install and start Docker Desktop -Docker allows us to spin up an environment with all the apps and dependencies we need for the example. +Docker allows us to spin up an environment with all the apps and dependencies we need for this guide. Follow the instructions [here](https://docs.docker.com/desktop/) to install Docker desktop for your own operating system. Once Docker is installed, ensure you have it up and running for the next steps. @@ -80,7 +59,7 @@ Follow the instructions [here](https://docs.docker.com/desktop/) to install Dock ## Clone the airflow-dbt-cloud repository -Open your terminal and clone the [airflow-dbt-cloud repository](https://github.com/sungchun12/airflow-dbt-cloud.git). This contains example Airflow DAGs that you’ll use to orchestrate your dbt Cloud job. Once cloned, navigate into the `airflow-dbt-cloud` project. +Open your terminal and clone the [airflow-dbt-cloud repository](https://github.com/sungchun12/airflow-dbt-cloud). This contains example Airflow DAGs that you’ll use to orchestrate your dbt Cloud job. Once cloned, navigate into the `airflow-dbt-cloud` project. ```bash git clone https://github.com/sungchun12/airflow-dbt-cloud.git @@ -91,12 +70,9 @@ cd airflow-dbt-cloud ## Start the Docker container -You can initialize an Astronomer project in an empty local directory using a Docker container, and then run your project locally using the `start` command. - -1. Run the following commands to initialize your project and start your local Airflow deployment: +1. From the `airflow-dbt-cloud` directory you cloned and opened in the prior step, run the following command to start your local Airflow deployment: ```bash - astro dev init astro dev start ``` @@ -110,10 +86,10 @@ You can initialize an Astronomer project in an empty local directory using a Doc Airflow Webserver: http://localhost:8080 Postgres Database: localhost:5432/postgres The default Airflow UI credentials are: admin:admin - The default Postrgres DB credentials are: postgres:postgres + The default Postgres DB credentials are: postgres:postgres ``` -2. Open the Airflow interface. Launch your web browser and navigate to the address for the **Airflow Webserver** from your output in Step 1. +2. Open the Airflow interface. Launch your web browser and navigate to the address for the **Airflow Webserver** from your output above (for us, `http://localhost:8080`). This will take you to your local instance of Airflow. You’ll need to log in with the **default credentials**: @@ -126,15 +102,15 @@ You can initialize an Astronomer project in an empty local directory using a Doc ## Create a dbt Cloud service token -Create a service token from within dbt Cloud using the instructions [found here](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens). Ensure that you save a copy of the token, as you won’t be able to access this later. In this example we use `Account Admin`, but you can also use `Job Admin` instead for token permissions. +[Create a service token](/docs/dbt-cloud-apis/service-tokens) with `Job Admin` privileges from within dbt Cloud. Ensure that you save a copy of the token, as you won’t be able to access this later. ## Create a dbt Cloud job -In your dbt Cloud account create a job, paying special attention to the information in the bullets below. Additional information for creating a dbt Cloud job can be found [here](/guides/bigquery). +[Create a job in your dbt Cloud account](/docs/deploy/deploy-jobs#create-and-schedule-jobs), paying special attention to the information in the bullets below. -- Configure the job with the commands that you want to include when this job kicks off, as Airflow will be referring to the job’s configurations for this rather than being explicitly coded in the Airflow DAG. This job will run a set of commands rather than a single command. +- Configure the job with the full commands that you want to include when this job kicks off. This sample code has Airflow triggering the dbt Cloud job and all of its commands, instead of explicitly identifying individual models to run from inside of Airflow. - Ensure that the schedule is turned **off** since we’ll be using Airflow to kick things off. - Once you hit `save` on the job, make sure you copy the URL and save it for referencing later. The url will look similar to this: @@ -144,77 +120,59 @@ https://cloud.getdbt.com/#/accounts/{account_id}/projects/{project_id}/jobs/{job -## Add your dbt Cloud API token as a secure connection - - +## Connect dbt Cloud to Airflow -Now you have all the working pieces to get up and running with Airflow + dbt Cloud. Let’s dive into make this all work together. We will **set up a connection** and **run a DAG in Airflow** that kicks off a dbt Cloud job. +Now you have all the working pieces to get up and running with Airflow + dbt Cloud. It's time to **set up a connection** and **run a DAG in Airflow** that kicks off a dbt Cloud job. -1. Navigate to Admin and click on **Connections** +1. From the Airflow interface, navigate to Admin and click on **Connections** ![Airflow connections menu](/img/guides/orchestration/airflow-and-dbt-cloud/airflow-connections-menu.png) 2. Click on the `+` sign to add a new connection, then click on the drop down to search for the dbt Cloud Connection Type - ![Create connection](/img/guides/orchestration/airflow-and-dbt-cloud/create-connection.png) - ![Connection type](/img/guides/orchestration/airflow-and-dbt-cloud/connection-type.png) 3. Add in your connection details and your default dbt Cloud account id. This is found in your dbt Cloud URL after the accounts route section (`/accounts/{YOUR_ACCOUNT_ID}`), for example the account with id 16173 would see this in their URL: `https://cloud.getdbt.com/#/accounts/16173/projects/36467/jobs/65767/` -![https://lh3.googleusercontent.com/sRxe5xbv_LYhIKblc7eiY7AmByr1OibOac2_fIe54rpU3TBGwjMpdi_j0EPEFzM1_gNQXry7Jsm8aVw9wQBSNs1I6Cyzpvijaj0VGwSnmVf3OEV8Hv5EPOQHrwQgK2RhNBdyBxN2](https://lh3.googleusercontent.com/sRxe5xbv_LYhIKblc7eiY7AmByr1OibOac2_fIe54rpU3TBGwjMpdi_j0EPEFzM1_gNQXry7Jsm8aVw9wQBSNs1I6Cyzpvijaj0VGwSnmVf3OEV8Hv5EPOQHrwQgK2RhNBdyBxN2) - -## Add your `job_id` and `account_id` config details to the python file + ![Connection type](/img/guides/orchestration/airflow-and-dbt-cloud/connection-type-configured.png) - Add your `job_id` and `account_id` config details to the python file: [dbt_cloud_provider_eltml.py](https://github.com/sungchun12/airflow-dbt-cloud/blob/main/dags/dbt_cloud_provider_eltml.py). +## Update the placeholders in the sample code -1. You’ll find these details within the dbt Cloud job URL, see the comments in the code snippet below for an example. + Add your `account_id` and `job_id` to the python file [dbt_cloud_provider_eltml.py](https://github.com/sungchun12/airflow-dbt-cloud/blob/main/dags/dbt_cloud_provider_eltml.py). - ```python - # dbt Cloud Job URL: https://cloud.getdbt.com/#/accounts/16173/projects/36467/jobs/65767/ - # account_id: 16173 - #job_id: 65767 +Both IDs are included inside of the dbt Cloud job URL as shown in the following snippets: - # line 28 - default_args={"dbt_cloud_conn_id": "dbt_cloud", "account_id": 16173}, +```python +# For the dbt Cloud Job URL https://cloud.getdbt.com/#/accounts/16173/projects/36467/jobs/65767/ +# The account_id is 16173 - trigger_dbt_cloud_job_run = DbtCloudRunJobOperator( - task_id="trigger_dbt_cloud_job_run", - job_id=65767, # line 39 - check_interval=10, - timeout=300, - ) - ``` - -2. Turn on the DAG and verify the job succeeded after running. Note: screenshots taken from different job runs, but the user experience is consistent. - - ![https://lh6.googleusercontent.com/p8AqQRy0UGVLjDGPmcuGYmQ_BRodyL0Zis-eQgSmp69EHbKW51o4S-bCl1fXHlOmwpYEBxD0A-O1Q1hwt-VDVMO1wWH-AIeaoelBx06JXRJ0m1OcHaPpFKH0xDiduIhNlQhhbLiy](https://lh6.googleusercontent.com/p8AqQRy0UGVLjDGPmcuGYmQ_BRodyL0Zis-eQgSmp69EHbKW51o4S-bCl1fXHlOmwpYEBxD0A-O1Q1hwt-VDVMO1wWH-AIeaoelBx06JXRJ0m1OcHaPpFKH0xDiduIhNlQhhbLiy) - - ![Airflow DAG](/img/guides/orchestration/airflow-and-dbt-cloud/airflow-dag.png) - - ![Task run instance](/img/guides/orchestration/airflow-and-dbt-cloud/task-run-instance.png) - - ![https://lh6.googleusercontent.com/S9QdGhLAdioZ3x634CChugsJRiSVtTTd5CTXbRL8ADA6nSbAlNn4zV0jb3aC946c8SGi9FRTfyTFXqjcM-EBrJNK5hQ0HHAsR5Fj7NbdGoUfBI7xFmgeoPqnoYpjyZzRZlXkjtxS](https://lh6.googleusercontent.com/S9QdGhLAdioZ3x634CChugsJRiSVtTTd5CTXbRL8ADA6nSbAlNn4zV0jb3aC946c8SGi9FRTfyTFXqjcM-EBrJNK5hQ0HHAsR5Fj7NbdGoUfBI7xFmgeoPqnoYpjyZzRZlXkjtxS) - -## How do I rerun the dbt Cloud job and downstream tasks in my pipeline? - -If you have worked with dbt Cloud before, you have likely encountered cases where a job fails. In those cases, you have likely logged into dbt Cloud, investigated the error, and then manually restarted the job. - -This section of the guide will show you how to restart the job directly from Airflow. This will specifically run *just* the `trigger_dbt_cloud_job_run` and downstream tasks of the Airflow DAG and not the entire DAG. If only the transformation step fails, you don’t need to re-run the extract and load processes. Let’s jump into how to do that in Airflow. - -1. Click on the task +# Update line 28 +default_args={"dbt_cloud_conn_id": "dbt_cloud", "account_id": 16173}, +``` - ![Task DAG view](/img/guides/orchestration/airflow-and-dbt-cloud/task-dag-view.png) +```python +# For the dbt Cloud Job URL https://cloud.getdbt.com/#/accounts/16173/projects/36467/jobs/65767/ +# The job_id is 65767 + +# Update line 39 +trigger_dbt_cloud_job_run = DbtCloudRunJobOperator( + task_id="trigger_dbt_cloud_job_run", + job_id=65767, + check_interval=10, + timeout=300, + ) +``` -2. Clear the task instance + - ![Clear task instance](/img/guides/orchestration/airflow-and-dbt-cloud/clear-task-instance.png) +## Run the Airflow DAG - ![Approve clearing](/img/guides/orchestration/airflow-and-dbt-cloud/approve-clearing.png) +Turn on the DAG and trigger it to run. Verify the job succeeded after running. -3. Watch it rerun in real time +![Airflow DAG](/img/guides/orchestration/airflow-and-dbt-cloud/airflow-dag.png) - ![Re-run](/img/guides/orchestration/airflow-and-dbt-cloud/re-run.png) +Click Monitor Job Run to open the run details in dbt Cloud. +![Task run instance](/img/guides/orchestration/airflow-and-dbt-cloud/task-run-instance.png) ## Cleaning up @@ -224,9 +182,9 @@ At the end of this guide, make sure you shut down your docker container. When y $ astrocloud dev stop [+] Running 3/3 - ⠿ Container airflow-dbt-cloud_e3fe3c-webserver-1 Stopped 7.5s - ⠿ Container airflow-dbt-cloud_e3fe3c-scheduler-1 Stopped 3.3s - ⠿ Container airflow-dbt-cloud_e3fe3c-postgres-1 Stopped 0.3s + ⠿ Container airflow-dbt-cloud_e3fe3c-webserver-1 Stopped 7.5s + ⠿ Container airflow-dbt-cloud_e3fe3c-scheduler-1 Stopped 3.3s + ⠿ Container airflow-dbt-cloud_e3fe3c-postgres-1 Stopped 0.3s ``` To verify that the deployment has stopped, use the following command: @@ -244,37 +202,29 @@ airflow-dbt-cloud_e3fe3c-scheduler-1 exited airflow-dbt-cloud_e3fe3c-postgres-1 exited ``` - + ## Frequently asked questions ### How can we run specific subsections of the dbt DAG in Airflow? -Because of the way we configured the dbt Cloud job to run in Airflow, you can leave this job to your analytics engineers to define in the job configurations from dbt Cloud. If, for example, we need to run hourly-tagged models every hour and daily-tagged models daily, we can create jobs like `Hourly Run` or `Daily Run` and utilize the commands `dbt run -s tag:hourly` and `dbt run -s tag:daily` within each, respectively. We only need to grab our dbt Cloud `account` and `job id`, configure it in an Airflow DAG with the code provided, and then we can be on your way. See more node selection options: [here](/reference/node-selection/syntax) - -### How can I re-run models from the point of failure? - -You may want to parse the dbt DAG in Airflow to get the benefit of re-running from the point of failure. However, when you have hundreds of models in your DAG expanded out, it becomes useless for diagnosis and rerunning due to the overhead that comes along with creating an expansive Airflow DAG. +Because the Airflow DAG references dbt Cloud jobs, your analytics engineers can take responsibility for configuring the jobs in dbt Cloud. -You can’t re-run from failure natively in dbt Cloud today (feature coming!), but you can use a custom rerun parser. +For example, to run some models hourly and others daily, there will be jobs like `Hourly Run` or `Daily Run` using the commands `dbt run --select tag:hourly` and `dbt run --select tag:daily` respectively. Once configured in dbt Cloud, these can be added as steps in an Airflow DAG as shown in this guide. Refer to our full [node selection syntax docs here](/reference/node-selection/syntax). -Using a simple python script coupled with the dbt Cloud provider, you can: - -- Avoid managing artifacts in a separate storage bucket(dbt Cloud does this for you) -- Avoid building your own parsing logic -- Get clear logs on what models you're rerunning in dbt Cloud (without hard coding step override commands) - -Watch the video below to see how it works! +### How can I re-run models from the point of failure? - +You can trigger re-run from point of failure with the `rerun` API endpoint. See the docs on [retrying jobs](/docs/deploy/retry-jobs) for more information. ### Should Airflow run one big dbt job or many dbt jobs? -Overall we recommend being as purposeful and minimalistic as you can. This is because dbt manages all of the dependencies between models and the orchestration of running those dependencies in order, which in turn has benefits in terms of warehouse processing efforts. +dbt jobs are most effective when a build command contains as many models at once as is practical. This is because dbt manages the dependencies between models and coordinates running them in order, which ensures that your jobs can run in a highly parallelized fashion. It also streamlines the debugging process when a model fails and enables re-run from point of failure. + +As an explicit example, it's not recommended to have a dbt job for every single node in your DAG. Try combining your steps according to desired run frequency, or grouping by department (finance, marketing, customer success...) instead. ### We want to kick off our dbt jobs after our ingestion tool (such as Fivetran) / data pipelines are done loading data. Any best practices around that? -Our friends at Astronomer answer this question with this example: [here](https://registry.astronomer.io/dags/fivetran-dbt-cloud-census) +Astronomer's DAG registry has a sample workflow combining Fivetran, dbt Cloud and Census [here](https://registry.astronomer.io/dags/fivetran-dbt_cloud-census/versions/3.0.0). ### How do you set up a CI/CD workflow with Airflow? @@ -285,12 +235,12 @@ Check out these two resources for accomplishing your own CI/CD pipeline: ### Can dbt dynamically create tasks in the DAG like Airflow can? -We prefer to keep models bundled vs. unbundled. You can go this route, but if you have hundreds of dbt models, it’s more effective to let the dbt Cloud job handle the models and dependencies. Bundling provides the solution to clear observability when things go wrong - we've seen more success in having the ability to clearly see issues in a bundled dbt Cloud job than combing through the nodes of an expansive Airflow DAG. If you still have a use case for this level of control though, our friends at Astronomer answer this question [here](https://www.astronomer.io/blog/airflow-dbt-1/)! +As discussed above, we prefer to keep jobs bundled together and containing as many nodes as are necessary. If you must run nodes one at a time for some reason, then review [this article](https://www.astronomer.io/blog/airflow-dbt-1/) for some pointers. -### Can you trigger notifications if a dbt job fails with Airflow? Is there any way to access the status of the dbt Job to do that? +### Can you trigger notifications if a dbt job fails with Airflow? -Yes, either through [Airflow's email/slack](https://www.astronomer.io/guides/error-notifications-in-airflow/) functionality by itself or combined with [dbt Cloud's notifications](/docs/deploy/job-notifications), which support email and slack notifications. +Yes, either through [Airflow's email/slack](https://www.astronomer.io/guides/error-notifications-in-airflow/) functionality, or [dbt Cloud's notifications](/docs/deploy/job-notifications), which support email and Slack notifications. You could also create a [webhook](/docs/deploy/webhooks). -### Are there decision criteria for how to best work with dbt Cloud and airflow? +### How should I plan my dbt Cloud + Airflow implementation? -Check out this deep dive into planning your dbt Cloud + Airflow implementation [here](https://www.youtube.com/watch?v=n7IIThR8hGk)! +Check out [this recording](https://www.youtube.com/watch?v=n7IIThR8hGk) of a dbt meetup for some tips. diff --git a/website/static/img/guides/orchestration/airflow-and-dbt-cloud/connection-type-configured.png b/website/static/img/guides/orchestration/airflow-and-dbt-cloud/connection-type-configured.png new file mode 100644 index 00000000000..f77d3e188fc Binary files /dev/null and b/website/static/img/guides/orchestration/airflow-and-dbt-cloud/connection-type-configured.png differ