[IBCDPE-947] GX Validation Record Keeping #135

Merged
BWMac merged 23 commits into dev from bwmac/IBCDPE-947/gx_validation_record_keeping on Jun 10, 2024

Conversation

BWMac
Contributor

@BWMac BWMac commented Jun 7, 2024

Problem:

As the agora-data-tools pipeline is run more and more, the versions of output files and GX reports are piling up. It is important that we are able to identify which files were produced during a particular run, and a record-keeping solution is therefore necessary.

Solution:
Design Doc

Implement a method of record keeping within ADT that will upload metadata about a processing run to a Synapse Table in the Agora Project. This way, we will keep a historical record of what files were produced during which ADT runs and we will be able to look up the specific GX report files for those runs.

The majority of logic for this new feature is contained within two new classes:

  • DatasetReport: Contains all of the fields needed to populate one row of the Synapse Table. There will be one of these per GX-enabled dataset from each ADT run.
  • ADTGXReporter: Contains the fields common to all DatasetReports in a run and updates the Synapse table at the end of the run.
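
As a rough illustration of how the two classes divide the work, here is a minimal sketch. It is not the code in src/agoradatatools/reporter.py: the field names, the RunReporter stand-in for ADTGXReporter, and the to_rows helper are all hypothetical, and the actual Synapse upload is left out.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class DatasetReport:
    """One table row per GX-enabled dataset (field names are hypothetical)."""

    dataset_name: str
    output_file: str      # hypothetical: identifier of the dataset's output file
    gx_report_file: str   # hypothetical: identifier of the GX report file
    gx_passed: bool


@dataclass
class RunReporter:
    """Hypothetical stand-in for ADTGXReporter: run-level fields plus the reports."""

    run_id: Optional[str]
    platform: str
    table_id: str
    reports: List[DatasetReport] = field(default_factory=list)

    def add_report(self, report: DatasetReport) -> None:
        """Collect one dataset's report for this run."""
        self.reports.append(report)

    def to_rows(self) -> List[dict]:
        """Merge the run-level fields into every dataset row.

        A real reporter would upload these rows to the Synapse table;
        that step is intentionally omitted from this sketch.
        """
        common = {"run_id": self.run_id, "platform": self.platform}
        return [{**common, **asdict(report)} for report in self.reports]


reporter = RunReporter(run_id="example-run", platform="LOCAL", table_id="syn000")
reporter.add_report(
    DatasetReport("example_dataset", "example_output.json", "example_report.html", True)
)
rows = reporter.to_rows()
```

Keeping the run-level fields on the reporter means each DatasetReport only carries what differs between datasets, and the rows are assembled once when the run finishes.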

Notes:

  • Some refactoring was needed to surface the metadata contained within ADT runs.
  • A new optional parameter run_id has been added to help keep track of specific GH Actions and Nextflow Tower runs (see the nf-agora PR).
  • A gx_table parameter has been added to the config files so that testing runs and live runs have their records stored separately.
  • The Platform enum has been moved to its own module to prevent circular import issues.
  • Tests are added and updated as needed.
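
The sketch below shows one way the pieces in these notes could fit together. It is illustrative only: the Platform member names, the placeholder table IDs, and the run_metadata helper are assumptions and are not taken from the ADT codebase.

```python
from enum import Enum
from typing import Optional


class Platform(Enum):
    """Hypothetical members; the real enum lives in its own module in ADT."""

    LOCAL = "LOCAL"
    GITHUB = "GITHUB"
    NEXTFLOW = "NEXTFLOW"


# Placeholder configs: each config file names its own record-keeping table,
# so test runs never write into the live table.
TEST_CONFIG = {"gx_table": "syn_test_placeholder"}
LIVE_CONFIG = {"gx_table": "syn_live_placeholder"}


def run_metadata(
    config: dict, platform: Platform, run_id: Optional[str] = None
) -> dict:
    """Gather the run-level values that would accompany every table row.

    run_id is optional: a local run may not have one, while a GH Actions
    or Nextflow Tower run can pass its own identifier.
    """
    return {
        "gx_table": config["gx_table"],
        "platform": platform.value,
        "run_id": run_id,
    }


local_run = run_metadata(TEST_CONFIG, Platform.LOCAL)
tower_run = run_metadata(LIVE_CONFIG, Platform.NEXTFLOW, run_id="example-tower-run")
```

Because the enum sits in a module that imports nothing else from the package, both the processing code and the reporter can import it without importing each other.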

Future Work:

While working on this feature, a couple of issues came up that I have created tickets to track:

  • Recording information about ADT input files/configuration files used (ticket).
  • Leveraging the GX mostly parameter to create flexible expectations and report warnings about how much data has or has not met an expectation (ticket).
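
For context on the second item, the idea behind mostly is that an expectation passes when at least that fraction of values meets it. The following is a simplified, library-free sketch of that idea with a warning attached; evaluate_with_mostly is a hypothetical helper, it does not call GX, and it ignores details such as null handling.

```python
from typing import Callable, Optional, Sequence


def evaluate_with_mostly(
    values: Sequence, predicate: Callable[[object], bool], mostly: float = 1.0
) -> dict:
    """Pass when at least a `mostly` fraction of values satisfies the predicate."""
    if not 0.0 <= mostly <= 1.0:
        raise ValueError("mostly must be between 0 and 1")

    total = len(values)
    met = sum(1 for value in values if predicate(value))
    # Treat an empty input as trivially meeting the expectation.
    fraction_met = met / total if total else 1.0
    success = fraction_met >= mostly

    # Warn when the expectation passed only because of the tolerance.
    warning: Optional[str] = None
    if success and met < total:
        warning = f"{total - met} of {total} values did not meet the expectation"

    return {"success": success, "fraction_met": fraction_met, "warning": warning}


lenient = evaluate_with_mostly([1, 2, 3, -1], lambda v: v > 0, mostly=0.7)
strict = evaluate_with_mostly([1, 2, 3, -1], lambda v: v > 0, mostly=0.9)
```

Here lenient passes with a warning about the one failing value, while strict fails outright, which is the kind of distinction the ticket proposes to report.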

@BWMac BWMac marked this pull request as ready for review June 7, 2024 21:48
Contributor

@jaclynbeck-sage jaclynbeck-sage left a comment


Looks good, nice work!

Member

@thomasyu888 thomasyu888 left a comment


Nice work here! Just some comments

src/agoradatatools/run_platform.py (outdated, resolved)
src/agoradatatools/reporter.py (resolved)
src/agoradatatools/reporter.py (resolved)
src/agoradatatools/reporter.py (resolved)
src/agoradatatools/reporter.py (resolved)

sonarcloud bot commented Jun 10, 2024

Quality Gate passed

Issues
47 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
25.4% Duplication on New Code

See analysis details on SonarCloud

Member

@thomasyu888 thomasyu888 left a comment


🔥 LGTM!

@BWMac BWMac merged commit 4d3da56 into dev Jun 10, 2024
9 checks passed
@BWMac BWMac deleted the bwmac/IBCDPE-947/gx_validation_record_keeping branch June 10, 2024 21:40
5 participants