Skip to content

Latest commit

 

History

History
408 lines (300 loc) · 16.2 KB

README.md

File metadata and controls

408 lines (300 loc) · 16.2 KB

NOTE: This project is still very much a work in progress, with much of the larger model restructuring still to come, see the TODO file for more info.

dbt GA4 Project

First and foremost, this project is based off of the dbt GA4 Package by Velir, but has been modified and refactored for internal purposes. This project uses Google Analytics 4 BigQuery Exports as its source data, and offers useful base transformations to provide report-ready dimension & fact models that can be used for reporting purposes, blending with other data, and/or feature engineering for ML models.

Find more info about Google Analytics 4 BigQuery Exports here.

Features Overview:

  • Four final tables—ga4__events, ga4__pages, ga4__sesssions, and ga4__users—that are completed unnested to be wide & denomalized for easy querying by the end-user.
  • Conversion of the the day-shared events_YYYYMMDD & events_intraday_YYYYMMDD tables into singular date-partitionioned incremental base models.
  • Dynamically flattens event_params into their own individual columns.
  • Dynamically flattens user_props into their own individual columns.
  • Dynamically extracts & flattens URL query_params (e.g., gclid, fbclid, _ga) into their own individual columns.
  • Custom Variables. See here for more info.
  • Custom Marcros. See here for more info.

Style Guide:

This project and any future projects that may be based off of this intial dbt_ga4_project, will be following This Project's Style Guide...IN PROGRESS, which borrows ideals from the following Style Guides:

Models

DAG Overview

NOTE: This DAG Image is NOT current & will continue to CHANGE until all models are finalized.

DAG Overview

Mart Models

Model Name Description
ga4__events This is the table for event-level metrics & dimensions, that has been transformed to be wide & denomalized for easier quering.
ga4__pages This is the table for page-level metrics & dimensions, such as page_views, exits, and users. This table is grouped by page_title, event_date, and page_path.
ga4__sessions This is the table for session-level metrics & dimensions, such as is_engaged_session, engagement_duration, and page_views. This table is grouped by both session_key and user_key.
ga4__users This is the table for user-level metrics & dimensions, such as first & last_seen_date, geo, and traffic_source. This table is grouped by the hashed user_key dimension, which is based on user_id, or user_pseudo_id if one doesn't exist.

Staging & Intermediate Models

Model Name Description
stg_ga4__events Creates a table with event data that is enhanced with useful event_keys, page_keys, session keys, and user_keys.
stg_ga4__event_params Creates a table that unnests all of the event parameters specific to each event (e.g. page_view, click, or scroll), except for those marked in the dbt_project.yml file.
stg_ga4__traffic_sources Creates a table that designates a default_channel_grouping via the source, medium, campaign columns.
stg_ga4__user_props Creates a table that unnests the user_properties, except for those marked in the dbt_project.yml file.
stg_ga4__query_params Maps any and all query parameters (e.g. gclid, fbclid, etc.) contained in each event's page_location.
stg_ga4__conversions Creates a table for the events that you mark as a conversion_event in the dbt_project.yml file.
int_ga4__events_joined ...[TO DO]...
int_ga4__pages_grouped ...[TO DO]...
int_ga4__sessions_grouped ...[TO DO]...
int_ga4__users_grouped ...[TO DO]...

Macros

NOTE: These Macros are also not finalized & are likely to change.

get_first(by_column_name,from_column_name) source

This macro returns the FIRST position of a specified from_column_name, which is partioned by the by_column_name.

Args:

  • by_column_name (required): The name of the column which you want to partition your selction by.
  • from_column_name (required): The name of the column to get the first value of.

Usage:

{{ get_first('<by_column_name>', '<from_column_name>') }}

Example: Get the landing_page of a corresponding Session by selecting the first page_path using that Session's session_key.

SELECT
  {{ get_first('session_key', 'page_path') }} AS landing_page
  ...

get_last(by_column_name,from_column_name) source

This macro returns the LAST position of a specified from_column_name, which is partioned by the by_column_name.

Args:

  • by_column_name (required): The name of the column which you want to partition your selction by.
  • from_column_name (required): The name of the column to get the last value of.

Usage:

{{ get_last('<by_column_name>', '<from_column_name>') }}

Example: Get the last event_key for a corresponding Session using that Session's session_key.

SELECT
  {{ get_last('session_key', 'event_key') }} AS last_session_event_key,
  ...

extract_hostname_from_url(url) source

This macro extracts the hostname from a column containing a url.

Args:

  • url (required): The column containting URLs.

Usage:

{{ extract_hostname_from_url('<url>') }}

Example: Extract the hostname from the page_location column.

SELECT
  {{ extract_hostname_from_url('page_location') }} AS page_hostname,
  ...

extract_query_string_from_url(url) source

This macro extracts the query_string from a column containing a url.

Args:

  • url (required): The column containting URLs.

Usage:

{{ extract_query_string_from_url('<url>') }}

Example: Extract the query_string from the page_location column.

SELECT
  {{ extract_query_string_from_url('page_location') }} AS page_query_string,
  ...

remove_query_parameters(url, [parameters]) source

This macro removes the specified parameters from a column containing a url.

Args:

  • url (required): The column containting URLs.
  • parameters (required, default=[]): A list of query parameters to remove from the URL.

Usage:

{{ remove_query_parameters('<url>', '[parameters]')  }}

Example: Remove the parameters: gclid, fbclid, and _ga from the page_location column.

{% set parameters = ['gclid','fbclid','_ga'] %}

SELECT
  {{ remove_query_parameters('page_location', parameters) }} AS clean_page_location,
  ...

unnest_by_key(column_to_unnest, key_to_extract, value_type = "string") source

This macro unnests a single key's value from an array. This macro will dynamically alias the sub-query with the name of the column_to_unnest.

Args:

  • column_to_unnest (required): The array column to unnest the key's value from.
  • key_to_extract (required): The key by which to get the corresponding value for.
  • value_type (optional, default="string"): The data type of the key's value column.

Usage:

{{ unnest_by_key('<column_to_unnest>', '<key_to_extract>', '<value_type>') }}

Example: Unnest the corresponding values for the keys: page_location and ga_session_number from the nested event_params column.

SELECT
  -- Unnest the default STRING value type
  {{ unnest_by_key('event_params', 'page_location') }},
  -- Unnest the INT value type
  {{ unnest_by_key('event_params', 'ga_session_number',  'int') }},
  ...

unnest_by_key_alt(column_to_unnest, key_to_extract, value_type = "string") source

This macro unnests a single key's value from an array. This macro allows for a custom alias named sub-query.

Args:

  • column_to_unnest (required): The array column to unnest the key's value from.
  • key_to_extract (required): The key by which to get the corresponding value for.
  • value_type (optional, default="string"): The data type of the key's value column.

Usage:

{{ unnest_by_key_alt('<column_to_unnest>', '<key_to_extract>', '<value_type>') }} AS <custom_alias_name>,

Example: Unnest the corresponding values for the keys: page_location and ga_session_number from the nested event_params column.

SELECT
  -- Unnest the default STRING value type & use a custom alias
  {{ unnest_by_key_alt('event_params', 'page_location') }} AS url, 
  -- Unnest the INT value type & use a custom alias
  {{ unnest_by_key_alt('event_params', 'ga_session_number',  'int') }} AS session_number,
  ...

get_event_params() source

This macro will dynamically return all of the keys and their corresponding value_types found in the event_params array column.

  • This macro will exclude event_params added to the excluded_event_params variable, which is specified in the dbt_project.yml file.

Usage / Example:

SELECT
  {% for event_param in get_event_params() -%}

  {{ unnest_by_key('event_params', event_param['event_param_key'], event_param['event_param_value']) }}
    
  {{- "," if not loop.last }}
  {% endfor %}
  ...

default_channel_grouping(source, medium, source_category) source

This macro determines the default_channel_grouping and will result in one the following classifications:

  • Direct
  • Paid Social
  • Oraginc Social
  • Email
  • Affiliates
  • Paid Shopping
  • Paid Search
  • Display
  • Other Advertising
  • Organic Search
  • Organic Video
  • Organic Shopping
  • Audio
  • SMS
  • (Other)

Args:

  • source (required): The source column used in determining the default channel grouping.
  • medium (required): The medium column used in determining the default channel grouping.
  • source_category (required): The source category column used in determining the default channel grouping. These are desiganted in the ga4_source_categories.csv seed file.

Usage:

{{ default_channel_grouping('<source>', '<medium>', '<source_category>') }}

Example:

SELECT
  {{ default_channel_grouping('source', 'medium', 'source_category') }} AS default_channel_grouping,
  ...

Seeds

Seed File Description
ga4_source_categories.csv Google's mapping between source and source_category. More info and the download can be found here.

Make sure to run dbt seed before running dbt run.

Installation & Configuration

Setup

...[TO DO]...

Required Variables

This package assumes that you have an existing DBT project with a BigQuery profile and a BigQuery GCP instance available with GA4 event data loaded. Source data is located using the following variables which must be set in your dbt_project.yml file.

vars:
  project: '<gcp_project>' # Set your Project ID here.
  dataset: '<ga4_dataset>' # Set your Dataset name here.
  start_date: 'YYYYMMDD'   # Set the start date that you want to retrieve data from.
  frequency: 'daily'       # daily|streaming|daily+streaming Match to the type of export configured in GA4; daily+streaming appends today's intraday data to daily data.

If you don't have any GA4 data of your own, you can connect to Google's public data set with the following settings:

vars:
  project: 'bigquery-public-data'
  dataset: 'ga4_obfuscated_sample_ecommerce'
  start_date: '20210120'

Find more info about the GA4 obfuscated dataset here.

Optional Variables

NOTE: These Variables are also NOT finalized & are LIKELY to change.

Query Parameter Exclusions

Setting any query_parameter_exclusions will remove query string parameters from the page_location field for all downstream processing. Original parameters are captured in a new original_page_location field. Ex:

vars: 
  query_parameter_exclusions: ['gclid', 'fbclid', '_ga'] 

Conversion Events

Specific events can be set as conversions with the conversion_events variable in your dbt_project.yml file. These events will be counted against each session and included in the final mart models. Ex:

vars:
  conversion_events: ['purchase', 'download']

Consideration Events

Specific events can be set as considerations with the conversion_events variable in your dbt_project.yml file. These events will be counted against each session and included in the final mart models. Ex:

vars:
  consideration_events: ['cta_click', 'view_search_results']

Funnel Stages [TO DO]

Set specific events to be stages in a funnel.

vars:
  funnel_stages: ['begin_checkout', 'add_shipping_info', 'add_payment_info', 'purchase']

Excluded Events [TO DO]

Exclude specific events from the final tables.

vars:
  excluded__events: ['session_start']

Excluded Event parameters [TO DO]

Exclude specific event parameters from the final tables.

vars:
  excluded__event_params: ['ga_session_id', 'page_location', 'ga_session_number', 'session_engaged', 'engagement_time_msec', 'entrances', 'page_title', 'page_referrer', 'source', 'medium', 'campaign', 'debug_mode', 'term', 'clean_event', 'value', 'tax', 'coupon', 'promotion_name', 'transaction_id']

Excluded Columns [TO DO]

Exclude specific default columns from the final tables.

vars:
  excluded__columns: ['event_previous_timestamp', 'event_bundle_sequence_id', 'event_server_timestamp_offset', 'user_id', 'user_pseudo_id', 'stream_id', 'ga_session_id', 'privacy_info', 'event_dimensions', 'app_info']

Excluded User Properties [TO DO]

Exclude specific user properties from the final tables.

vars:
  excluded__user_props: ['logged_in']

Included Query Parameters [TO DO]

Include specific query parameters to be in the final tables.

vars:
  included__query_params: ['utm_source', 'utm_medium', 'utm_campaign', 'utm_content', 'utm_term', 'gclid', 'fbclid', 'gclsrc', '_ga']

Resources & References: