Merge branch 'current' into er/update-versionless-version-docs
emmyoop authored Oct 28, 2024
2 parents 2cedcab + 42beca2 commit b9b678d
Showing 4 changed files with 127 additions and 3 deletions.
121 changes: 121 additions & 0 deletions website/docs/docs/cloud/connect-data-platform/connnect-bigquery.md
@@ -52,6 +52,123 @@ As an end user, if your organization has set up BigQuery OAuth, you can link a p

To learn how to optimize performance with data platform-specific configurations in dbt Cloud, refer to [BigQuery-specific configuration](/reference/resource-configs/bigquery-configs).

### Optional configurations

In BigQuery, optional configurations let you tailor settings for tasks such as query priority, dataset location, job timeout, and more. These options give you greater control over how BigQuery functions behind the scenes to meet your requirements.

To customize your optional configurations in dbt Cloud:

1. Click your name in the bottom left-hand sidebar menu in dbt Cloud
2. Select **Your profile** from the menu
3. From there, click **Projects** and select your BigQuery project
4. Go to **Development Connection** and select BigQuery
5. Click **Edit** and then scroll down to **Optional settings**

<Lightbox src="/img/bigquery/bigquery-optional-config.png" width="70%" title="BigQuery optional configuration"/>

The following are the optional configurations you can set in dbt Cloud:

| Configuration | <div style={{width:'250px'}}>Information</div> | Type | <div style={{width:'150px'}}>Example</div> |
|---------------------------|-----------------------------------------|---------|--------------------|
| [Priority](#priority) | Sets the priority for BigQuery jobs (either `interactive` or queued for `batch` processing) | String | `batch` or `interactive` |
| [Retries](#retries) | Specifies the number of retries for failed jobs due to temporary issues | Integer | `3` |
| [Location](#location) | Location for creating new datasets | String | `US`, `EU`, `us-west2` |
| [Maximum bytes billed](#maximum-bytes-billed) | Limits the maximum number of bytes that can be billed for a query | Integer | `1000000000` |
| [Execution project](#execution-project) | Specifies the project ID to bill for query execution | String | `my-project-id` |
| [Impersonate service account](#impersonate-service-account) | Allows users authenticated locally to access BigQuery resources under a specified service account | String | `service-account@project.iam.gserviceaccount.com` |
| [Job retry deadline seconds](#job-retry-deadline-seconds) | Sets the maximum number of seconds BigQuery will spend retrying a failed job | Integer | `600` |
| [Job creation timeout seconds](#job-creation-timeout-seconds) | Specifies the maximum timeout for the job creation step | Integer | `120` |
| [Google Cloud Storage bucket](#google-cloud-storage-bucket) | Location for storing objects in Google Cloud Storage | String | `my-bucket` |
| [Dataproc region](#dataproc-region) | Specifies the cloud region for running data processing jobs | String | `us-central1`, `europe-west1`, `asia-northeast1` |
| [Dataproc cluster name](#dataproc-cluster-name) | Assigns a unique identifier to a group of virtual machines in Dataproc | String | `my-cluster` |


<Expandable alt_header="Priority">

The `priority` for the BigQuery jobs that dbt executes can be configured with the `priority` configuration in your BigQuery profile. The priority field can be set to either `batch` or `interactive`. For more information on query priority, consult the [BigQuery documentation](https://cloud.google.com/bigquery/docs/running-queries).
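
If you configure this outside the dbt Cloud UI, it maps to a single key on your BigQuery target. A minimal `profiles.yml` sketch, with illustrative profile, project, and dataset names:

```yaml
# profiles.yml (sketch; names are illustrative)
my-bigquery-profile:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-project-id
      dataset: my_dataset
      priority: batch   # queued; use `interactive` to run as soon as possible
```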

</Expandable>

<Expandable alt_header="Retries">

Retries in BigQuery help to ensure that jobs complete successfully by trying again after temporary failures, making your operations more robust and reliable.
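
In profile terms, this is the `retries` key on the same BigQuery target (a sketch):

```yaml
retries: 3   # retry jobs that fail with transient errors up to 3 times
```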

</Expandable>

<Expandable alt_header="Location">

The `location` of BigQuery datasets can be set using the `location` setting in a BigQuery profile. As per the [BigQuery documentation](https://cloud.google.com/bigquery/docs/locations), `location` may be either a multi-regional location (for example, `EU`, `US`), or a regional location (like `us-west2`).
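
For example, in a profile (sketch):

```yaml
location: EU           # a multi-regional location
# location: us-west2   # or a regional location
```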

</Expandable>

<Expandable alt_header="Maximum bytes billed">

Configuring a `maximum_bytes_billed` value for a BigQuery profile limits how much data your query can process. It’s a safeguard to prevent a query from accidentally processing more data than you expect, which could lead to higher costs. Queries executed by dbt will fail if they exceed the configured maximum bytes threshold. This configuration should be supplied as an integer number of bytes.

If your `maximum_bytes_billed` is 1000000000, you would enter that value in the **Maximum bytes billed** field in dbt Cloud.
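
The equivalent profile key, as a sketch (1000000000 bytes is roughly 1 GB):

```yaml
maximum_bytes_billed: 1000000000   # fail any query that would bill more than this
```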


</Expandable>

<Expandable alt_header="Execution project">

By default, dbt will use the specified `project`/`database` as both:

1. The location to materialize resources (models, seeds, snapshots, and so on), unless they specify a custom project/database config
2. The GCP project that receives the bill for query costs or slot usage

Optionally, you may specify an execution project to bill for query execution, instead of the project/database where you materialize most resources.
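
A sketch of how the two project settings differ in a profile; `my-billing-project` is an illustrative name:

```yaml
project: my-project-id                  # where dbt materializes models, seeds, and snapshots
execution_project: my-billing-project   # where query costs and slot usage are billed
```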

</Expandable>

<Expandable alt_header="Impersonate service account">

This feature allows users authenticating using local OAuth to access BigQuery resources based on the permissions of a service account.

For a general overview of this process, see the official docs for [Creating Short-lived Service Account Credentials](https://cloud.google.com/iam/docs/create-short-lived-credentials-direct).
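
In a profile, this is a single key pointing at the service account to impersonate (the address below is illustrative):

```yaml
impersonate_service_account: dbt-runner@my-project-id.iam.gserviceaccount.com
```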

</Expandable>

<Expandable alt_header="Job retry deadline seconds">

Job retry deadline seconds is the maximum amount of time BigQuery will spend retrying a job before it gives up.
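
The matching profile key, as a sketch:

```yaml
job_retry_deadline_seconds: 600   # stop retrying a failed job after 10 minutes
```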

</Expandable>

<Expandable alt_header="Job creation timeout seconds">

Job creation timeout seconds is the maximum time BigQuery will wait to start the job. If the job doesn’t start within that time, it times out.
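
And its profile counterpart (sketch):

```yaml
job_creation_timeout_seconds: 120   # fail if the job isn't created within 2 minutes
```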

</Expandable>

#### Run dbt python models on Google Cloud Platform

import BigQueryDataproc from '/snippets/_bigquery-dataproc.md';

<BigQueryDataproc />

<Expandable alt_header="Google cloud storage bucket">

Everything you store in Cloud Storage must be placed inside a [bucket](https://cloud.google.com/storage/docs/buckets). Buckets help you organize your data and manage access to it.
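
As a profile key (the bucket name is illustrative):

```yaml
gcs_bucket: my-bucket   # bucket dbt uses to stage Python model code
```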

</Expandable>

<Expandable alt_header="Dataproc region">

A designated location in the cloud where you can run your data processing jobs efficiently. If you use Dataproc with BigQuery, this region must match the location of your BigQuery dataset so that data doesn't move across regions, which can be inefficient and costly.
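
As a profile key (sketch; pick the region that matches your dataset):

```yaml
dataproc_region: us-central1   # must match the BigQuery dataset's location
```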

For more information on supported locations, refer to the [BigQuery documentation](https://cloud.google.com/bigquery/docs/locations).

</Expandable>

<Expandable alt_header="Dataproc cluster name">

A unique label you give to your group of virtual machines to help you identify and manage your data processing tasks in the cloud. When you integrate Dataproc with BigQuery, you need to provide the cluster name so BigQuery knows which specific set of resources (the cluster) to use for running the data jobs.

See Dataproc's [Create a cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) guide for an overview of how clusters work.
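
The Dataproc-related settings typically travel together in a profile. A sketch, with illustrative names:

```yaml
gcs_bucket: my-bucket
dataproc_region: us-central1
dataproc_cluster_name: my-cluster   # identifies the cluster that runs your Python models
```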

</Expandable>

### Account level connections and credential management

You can reuse connections across multiple projects with [global connections](/docs/cloud/connect-data-platform/about-connections#migration-from-project-level-connections-to-account-level-connections). Connections are attached at the environment level (formerly project level), so you can use multiple connections within a single project (to handle dev, staging, production, and so on).
@@ -147,3 +264,7 @@ For a project, you will first create an environment variable to store the secret
"extended_attributes_id": FFFFF
}'
```
@@ -390,9 +390,9 @@ my-profile:

### Running Python models on Dataproc

To run dbt Python models on GCP, dbt uses companion services, Dataproc and Cloud Storage, that offer tight integrations with BigQuery. You may use an existing Dataproc cluster and Cloud Storage bucket, or create new ones:
- https://cloud.google.com/dataproc/docs/guides/create-cluster
- https://cloud.google.com/storage/docs/creating-buckets
import BigQueryDataproc from '/snippets/_bigquery-dataproc.md';

<BigQueryDataproc />

Then, add the bucket name, cluster name, and cluster region to your connection profile:

3 changes: 3 additions & 0 deletions website/snippets/_bigquery-dataproc.md
@@ -0,0 +1,3 @@
To run dbt Python models on GCP, dbt uses companion services, Dataproc and Cloud Storage, that offer tight integrations with BigQuery. You may use an existing Dataproc cluster and Cloud Storage bucket, or create new ones:
- https://cloud.google.com/dataproc/docs/guides/create-cluster
- https://cloud.google.com/storage/docs/creating-buckets
