feat(ingest/dremio): Dremio Source Ingestion #11598

Open
wants to merge 43 commits into
base: master
Changes from 27 commits
Commits
38e5946
feat: Dremio Source Ingestion
sagar-salvi-apptware Oct 11, 2024
a5b8c45
refactor: improvement to metadata gathering
sagar-salvi-apptware Oct 11, 2024
4a81b19
fix: Update dremio_entity.py
sagar-salvi-apptware Oct 11, 2024
51c2443
fix: Update dremio_sql_queries.py
sagar-salvi-apptware Oct 11, 2024
84227b6
refactor: Update dremio_source.py
sagar-salvi-apptware Oct 11, 2024
e462be6
feat: Dremio Source Ingestion
sagar-salvi-apptware Oct 11, 2024
703a3f8
refactor: improvement to metadata gathering
sagar-salvi-apptware Oct 11, 2024
4283cb6
fix: Update dremio_entity.py
sagar-salvi-apptware Oct 11, 2024
f897b9e
fix: Update dremio_sql_queries.py
sagar-salvi-apptware Oct 11, 2024
ac56777
refactor: Update dremio_source.py
sagar-salvi-apptware Oct 11, 2024
3ce57b9
test: add integration test for dremio
sagar-salvi-apptware Oct 17, 2024
e1ee817
fix: added minor changes + fix testcase
sagar-salvi-apptware Oct 17, 2024
b23e003
fix: PR Comments
sagar-salvi-apptware Oct 21, 2024
7247ac1
Merge branch 'feat/dremio-connector-source' of https://github.com/sag…
acrylJonny Oct 21, 2024
b9d7b8a
switch to drill dialect
acrylJonny Oct 21, 2024
230fbd7
Merge branch 'master' into feat/dremio-connector-source
acrylJonny Oct 21, 2024
d979d31
Update dremio_entities.py
acrylJonny Oct 21, 2024
9244e4c
Update dremio_entities.py
acrylJonny Oct 21, 2024
44853b8
Update datahub-web-react/src/app/ingest/source/builder/sources.json
acrylJonny Oct 21, 2024
fac705c
Update metadata-ingestion/docs/sources/dremio/README.md
acrylJonny Oct 21, 2024
29ba440
reafactor: Dremio Authetication
sagar-salvi-apptware Oct 21, 2024
6574b63
Cite Dremio Docs for SchemaFieldTypeMapper
acrylJonny Oct 21, 2024
9835e31
test: fix ci test
sagar-salvi-apptware Oct 21, 2024
a6a9a91
fix: PR comments
sagar-salvi-apptware Oct 22, 2024
b0c8c08
fix: dataset_pattern changes for tables and views
sagar-salvi-apptware Oct 23, 2024
ff211af
add warnings when unable to parse sql query
acrylJonny Oct 23, 2024
459cfc8
add view definition aspect
acrylJonny Oct 23, 2024
224ec10
bug fix - external urls. Improve Dremio Cloud API support for projects
acrylJonny Oct 24, 2024
0982a1b
fix: ci test
sagar-salvi-apptware Oct 24, 2024
3ce5e02
docs: minor docs changes per pr comments
sagar-salvi-apptware Oct 25, 2024
9c1e3fb
fix: PR comments
sagar-salvi-apptware Oct 25, 2024
7b938b2
docs: minor changes
sagar-salvi-apptware Oct 25, 2024
6914f75
build: added dependacy for sql
sagar-salvi-apptware Oct 25, 2024
3176e71
test: fix ci test
sagar-salvi-apptware Oct 25, 2024
9c1c556
fix: add minor change in report failure
sagar-salvi-apptware Oct 25, 2024
6670a98
fix: statefull ingestion error
sagar-salvi-apptware Oct 25, 2024
3e2eae7
fix: PR Comments and added mysql as a source to test
sagar-salvi-apptware Oct 27, 2024
195f78c
fix: PR Comments
sagar-salvi-apptware Oct 28, 2024
9de8be0
test: Updated the test for platform instance
sagar-salvi-apptware Oct 28, 2024
102b55f
Merge branch 'master' into feat/dremio-connector-source
sagar-salvi-apptware Oct 28, 2024
8efa349
fix: minor comments
sagar-salvi-apptware Oct 28, 2024
82988fd
test: fix ci test
sagar-salvi-apptware Oct 28, 2024
7433b8b
fix: added fixes for dremio cloud apis
sagar-salvi-apptware Oct 28, 2024
4 changes: 4 additions & 0 deletions datahub-web-react/src/app/ingest/source/builder/constants.ts
@@ -19,6 +19,7 @@ import clickhouseLogo from '../../../../images/clickhouselogo.png';
import cockroachdbLogo from '../../../../images/cockroachdblogo.png';
import trinoLogo from '../../../../images/trinologo.png';
import dbtLogo from '../../../../images/dbtlogo.png';
import dremioLogo from '../../../../images/dremiologo.png';
import druidLogo from '../../../../images/druidlogo.png';
import elasticsearchLogo from '../../../../images/elasticsearchlogo.png';
import feastLogo from '../../../../images/feastlogo.png';
@@ -52,6 +53,8 @@ export const COCKROACHDB = 'cockroachdb';
export const COCKROACHDB_URN = `urn:li:dataPlatform:${COCKROACHDB}`;
export const DBT = 'dbt';
export const DBT_URN = `urn:li:dataPlatform:${DBT}`;
export const DREMIO = 'dremio';
export const DREMIO_URN = `urn:li:dataPlatform:${DREMIO}`;
export const DRUID = 'druid';
export const DRUID_URN = `urn:li:dataPlatform:${DRUID}`;
export const DYNAMODB = 'dynamodb';
@@ -139,6 +142,7 @@ export const PLATFORM_URN_TO_LOGO = {
[CLICKHOUSE_URN]: clickhouseLogo,
[COCKROACHDB_URN]: cockroachdbLogo,
[DBT_URN]: dbtLogo,
[DREMIO_URN]: dremioLogo,
[DRUID_URN]: druidLogo,
[DYNAMODB_URN]: dynamodbLogo,
[ELASTICSEARCH_URN]: elasticsearchLogo,
8 changes: 8 additions & 0 deletions datahub-web-react/src/app/ingest/source/builder/sources.json
@@ -302,5 +302,13 @@
"description": "Configure a custom recipe using YAML.",
"docsUrl": "https://datahubproject.io/docs/metadata-ingestion/",
"recipe": "source:\n type: <source-type>\n config:\n # Source-type specifics config\n <source-configs>"
},
{
"urn": "urn:li:dataPlatform:dremio",
"name": "dremio",
"displayName": "Dremio",
"description": "Import Spaces, Sources, Tables and statistics from Dremio.",
"docsUrl": "https://datahubproject.io/docs/metadata-ingestion/",
"recipe": "source:\n type: dremio\n config:\n # Coordinates\n hostname: null\n port: null\n # Credentials\n authentication_method: password\n username: null\n password: null\n stateful_ingestion:\n enabled: true"
}
]
Binary file added datahub-web-react/src/images/dremiologo.png
1 change: 1 addition & 0 deletions docs/cli.md
@@ -705,6 +705,7 @@ Please see our [Integrations page](https://datahubproject.io/integrations) if yo
| [datahub-lineage-file](./generated/ingestion/sources/file-based-lineage.md) | _no additional dependencies_ | Lineage File source |
| [datahub-business-glossary](./generated/ingestion/sources/business-glossary.md) | _no additional dependencies_ | Business Glossary File source |
| [dbt](./generated/ingestion/sources/dbt.md) | _no additional dependencies_ | dbt source |
| [dremio](./generated/ingestion/sources/dremio.md) | `pip install 'acryl-datahub[dremio]'` | Dremio Source |
| [druid](./generated/ingestion/sources/druid.md) | `pip install 'acryl-datahub[druid]'` | Druid Source |
| [feast](./generated/ingestion/sources/feast.md) | `pip install 'acryl-datahub[feast]'` | Feast source (0.26.0) |
| [glue](./generated/ingestion/sources/glue.md) | `pip install 'acryl-datahub[glue]'` | AWS Glue source |
18 changes: 18 additions & 0 deletions metadata-ingestion/docs/sources/dremio/README.md
@@ -0,0 +1,18 @@
### Concept Mapping

The table below shows how entities and concepts in Dremio map to their counterparts in DataHub:

| Source Concept          | DataHub Concept          | Notes                                                                                                                       |
| ----------------------- | ------------------------ | --------------------------------------------------------------------------------------------------------------------------- |
| **Physical Dataset**    | `Dataset`                | A dataset directly queried from an external source without modifications. Subtype: `Table`                                  |
| **Virtual Dataset**     | `Dataset`                | A dataset built from SQL-based transformations on other datasets. Subtype: `View`                                           |
| **Spaces**              | `Container`              | Top-level organizational unit in Dremio, used to group datasets. Mapped to DataHub’s `Container` aspect. Subtype: `Space`   |
| **Folders**             | `Container`              | Substructure inside spaces, used for organizing datasets. Mapped as a `Container` in DataHub. Subtype: `Folder`             |
| **Sources**             | `Container`              | External data sources connected to Dremio (e.g., S3, databases). Represented as a `Container` in DataHub. Subtype: `Source` |
| **Column Lineage**      | `ColumnLineage`          | Lineage between columns in datasets, showing how individual columns are transformed across datasets.                        |
| **Dataset Lineage**     | `UpstreamLineage`        | Lineage between datasets, tracking the flow and transformations between different datasets.                                 |
| **Ownership (Dataset)** | `Ownership`              | Ownership information for datasets, representing the technical owner in DataHub’s `Ownership` aspect.                       |
| **Glossary Terms**      | `GlossaryTerms`          | Business terms associated with datasets, providing context. Mapped as `GlossaryTerms` in DataHub.                           |
| **Schema Metadata**     | `SchemaMetadata`         | Schema details (columns, data types) for datasets. Mapped to DataHub’s `SchemaMetadata` aspect.                             |
| **SQL Transformations** | `Dataset` (with lineage) | SQL queries in Dremio that transform datasets. Represented as `Dataset` in DataHub, with lineage showing dependency.        |
| **Queries**             | `Query` (if mapped)      | Historical SQL queries executed on Dremio datasets. These can be tracked for audit purposes in DataHub.                     |
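
The entity and subtype pairs in the table above can be read as a simple lookup. The following is a minimal sketch of that reading, not code from this PR: `DataHubTarget`, `CONCEPT_MAPPING`, and `map_concept` are hypothetical names introduced only for illustration, and only the rows that state an explicit subtype are included.

```python
from typing import NamedTuple


class DataHubTarget(NamedTuple):
    """Hypothetical pair describing where a Dremio concept lands in DataHub."""

    entity: str   # DataHub concept from the second table column
    subtype: str  # Subtype named in the Notes column


# Transcribed from the table rows that list an explicit subtype.
CONCEPT_MAPPING = {
    "Physical Dataset": DataHubTarget("Dataset", "Table"),
    "Virtual Dataset": DataHubTarget("Dataset", "View"),
    "Spaces": DataHubTarget("Container", "Space"),
    "Folders": DataHubTarget("Container", "Folder"),
    "Sources": DataHubTarget("Container", "Source"),
}


def map_concept(dremio_concept: str) -> DataHubTarget:
    """Return the documented DataHub target, or fail loudly for unknown concepts."""
    try:
        return CONCEPT_MAPPING[dremio_concept]
    except KeyError:
        raise ValueError(f"No mapping documented for {dremio_concept!r}") from None
```

Failing on an unknown concept, rather than falling back to a default, keeps an unmapped Dremio object from being silently emitted under the wrong subtype.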
34 changes: 34 additions & 0 deletions metadata-ingestion/docs/sources/dremio/dremio_pre.md
@@ -0,0 +1,34 @@
### Setup

This integration pulls metadata directly from the Dremio APIs.

You'll need to have a Dremio instance up and running with access to the necessary datasets, and API access should be enabled with a valid token.

**The Dremio instance can be one of the following**:

- Dremio Cloud (fully managed cloud SaaS)
  - Standard
  - Enterprise
- Dremio Software (self-managed on your own infrastructure / on-premise)
  - Community (OSS)
  - Enterprise

The API token should have the necessary permissions to **read metadata** and **retrieve lineage**.

#### Steps to Get the Required Information

1. **Generate an API Token**:

   - Log in to your Dremio instance.
   - Navigate to your user profile in the top-right corner.
   - Select **Generate API Token** to create an API token for programmatic access.

2. **Permissions**:

   - The token should have **read-only** or **admin** permissions that allow it to:
     - View all datasets (physical and virtual).
     - Access all spaces, folders, and sources.
     - Retrieve dataset and column-level lineage information.

3. **Verify External Data Source Permissions**:
   - If Dremio is connected to external data sources (e.g., AWS S3, relational databases), ensure that Dremio has access to the credentials required for querying those sources.
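
Once a token exists, a quick way to confirm it grants read access is to issue a catalog request against the instance. The sketch below only builds such a request and does not send it. It assumes a Dremio Software-style REST endpoint at `/api/v3/catalog` and a `Bearer` authorization header, neither of which is stated in this PR, so check both against the Dremio API documentation for your edition. `build_catalog_request` is a hypothetical helper name.

```python
import urllib.request


def build_catalog_request(
    hostname: str, port: int, token: str, tls: bool = True
) -> urllib.request.Request:
    """Build (but do not send) a GET request for the Dremio catalog root.

    Assumptions: the /api/v3/catalog path and the Bearer scheme are taken from
    general knowledge of the Dremio REST API, not from this PR.
    """
    scheme = "https" if tls else "http"
    url = f"{scheme}://{hostname}:{port}/api/v3/catalog"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Example: the request a local, non-TLS instance would receive.
request = build_catalog_request("localhost", 9047, "my-token", tls=False)
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen` would perform the call; an HTTP 401 or 403 response would suggest the token lacks the permissions listed in step 2.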
29 changes: 29 additions & 0 deletions metadata-ingestion/docs/sources/dremio/dremio_recipe.yml
@@ -0,0 +1,29 @@
source:
  type: dremio
  config:
    # Coordinates
    hostname: localhost
    port: 9047
    tls: true

    # Credentials with basic auth
    authentication_method: password
    username: user
    password: pass

    # Credentials with personal access token
    # (alternative to basic auth; commented out so the keys are not defined twice)
    # authentication_method: PAT
    # password: pass

    include_query_lineage: True

    source_mappings:
      - platform: s3
        platform_name: samples

    schema_pattern:
      allow:
        - ".*"

sink:
  # sink configs
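
The recipe above lists two mutually exclusive credential blocks. The sketch below shows one way to assemble the `source` section programmatically so that only one block is ever present. It is illustrative only: `build_dremio_source` is a hypothetical helper, and the rule that `username` is required only for `password` authentication is inferred from the recipe, where the PAT block carries no `username`.

```python
from typing import Any, Dict, Optional


def build_dremio_source(
    hostname: str,
    port: int,
    authentication_method: str,
    password: str,
    username: Optional[str] = None,
    tls: bool = True,
) -> Dict[str, Any]:
    """Return the `source` section of a recipe with exactly one credential block."""
    if authentication_method not in ("password", "PAT"):
        raise ValueError(f"Unsupported authentication_method: {authentication_method!r}")

    config: Dict[str, Any] = {
        "hostname": hostname,
        "port": port,
        "tls": tls,
        "authentication_method": authentication_method,
        # The recipe uses the `password` key in both blocks.
        "password": password,
    }
    if authentication_method == "password":
        # Assumption inferred from the recipe: basic auth also needs a username.
        if not username:
            raise ValueError("username is required for password authentication")
        config["username"] = username

    return {"type": "dremio", "config": config}
```

A dictionary built this way, together with a `sink` section, has the same shape as the YAML recipe and could be dumped to YAML or handed to DataHub's ingestion pipeline.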
3 changes: 3 additions & 0 deletions metadata-ingestion/setup.py
@@ -382,6 +382,7 @@
"delta-lake": {*data_lake_profiling, *delta_lake},
"dbt": {"requests"} | dbt_common | aws_common,
"dbt-cloud": {"requests"} | dbt_common,
"dremio": {"requests"} | sqlglot_lib,
"druid": sql_common | {"pydruid>=0.6.2"},
"dynamodb": aws_common | classification_lib,
# Starting with 7.14.0 python client is checking if it is connected to elasticsearch client. If its not it throws
@@ -592,6 +593,7 @@
"clickhouse-usage",
"cockroachdb",
"delta-lake",
"dremio",
"druid",
"elasticsearch",
"feast",
@@ -690,6 +692,7 @@
"s3 = datahub.ingestion.source.s3:S3Source",
"dbt = datahub.ingestion.source.dbt.dbt_core:DBTCoreSource",
"dbt-cloud = datahub.ingestion.source.dbt.dbt_cloud:DBTCloudSource",
"dremio = datahub.ingestion.source.dremio.dremio_source:DremioSource",
"druid = datahub.ingestion.source.sql.druid:DruidSource",
"dynamodb = datahub.ingestion.source.dynamodb.dynamodb:DynamoDBSource",
"elasticsearch = datahub.ingestion.source.elastic_search:ElasticsearchSource",
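
The `dremio = datahub.ingestion.source.dremio.dremio_source:DremioSource` line added above is a standard Python entry-point specification: the name before `=` is the recipe `type`, and the value names a module and a class. The sketch below parses such a line with the standard library's `importlib.metadata.EntryPoint`. The group name `datahub.ingestion.source.plugins` is an assumption, since the hunk does not show it, and `parse_entry_point` is a hypothetical helper.

```python
from importlib.metadata import EntryPoint

# Assumed group name; the diff hunk does not show which group the list belongs to.
ASSUMED_GROUP = "datahub.ingestion.source.plugins"


def parse_entry_point(spec: str, group: str = ASSUMED_GROUP) -> EntryPoint:
    """Split a 'name = module:attr' specification into an EntryPoint object."""
    name, value = (part.strip() for part in spec.split("=", 1))
    return EntryPoint(name=name, value=value, group=group)


entry_point = parse_entry_point(
    "dremio = datahub.ingestion.source.dremio.dremio_source:DremioSource"
)
# .module and .attr are available on Python 3.9 and later.
print(entry_point.name, entry_point.module, entry_point.attr)
```

Calling `entry_point.load()` would import the module and return the class, which only succeeds in an environment where the `dremio` extra from this PR is installed.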
Empty file.