Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IBCDPE-769] Adds gene_info Expectation Suite #139

Merged
merged 6 commits into from
Jul 2, 2024

Conversation

BWMac
Copy link
Contributor

@BWMac BWMac commented Jun 27, 2024

Description:

This PR implements a GX expectation suite for the gene_info dataset.

The expectations chosen and limit values for the expectations (where applicable) come from the rules defined in the R scripts here, with a few exceptions.

  1. total_nominations was expected to not be null according to the rules in the R script, but when I checked with expect_column_values_to_not_be_null this was not the case ~97.5% of the time so I left that expectation off.
  2. Fields such as name, summary, and hgnc_symbol were also expected to not be null, but I found that they were to varying degrees. For these fields I introduced the mostly keyword which sets a percentage threshold for the expectation passing rather than the expectation failing if even one record does not satisfy it. The mostly keyword will be used in future work to add "warning" functionality to our data validation.
  3. In the case of the target_nominations field, there were approximately 1.2% of the nested values which did not meet my JSON schema expectation (based on the rules here). This is caused by a combination of the data_used_to_support_target_selection, predicted_therapeutic_direction, and target_choice_justification fields containing invalid unicode characters, and null values in the data_synapseid and validation_study_details. So I set the mostly value for this expectation to 0.98.

Please let me know if there are any questions, suggestions for adjustments to expectations used/expectation configurations, or if any further discussion is needed with regards to decisions I have made here. This is by far the most complex GX suite we have implemented.

Notes:

  • A sample HTML report for this expectation suite can be found here.
  • Due to the complexity of this GX validation, the overall time of ADT data processing has increased: Elapsed time: 00:10:43 for gene_info dataset.

@BWMac BWMac marked this pull request as ready for review June 27, 2024 20:05
Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 LGTM! I'll leave it to the Agora team to provide feedback on the expectations themselves!

Copy link
Contributor

@jaclynbeck-sage jaclynbeck-sage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just had a few questions before approving.

gx_suite_definitions/gene_info.ipynb Show resolved Hide resolved
gx_suite_definitions/gene_info.ipynb Outdated Show resolved Hide resolved
gx_suite_definitions/gene_info.ipynb Outdated Show resolved Hide resolved
gx_suite_definitions/gene_info.ipynb Outdated Show resolved Hide resolved
Copy link

sonarcloud bot commented Jul 1, 2024

Copy link
Contributor

@jaclynbeck-sage jaclynbeck-sage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Copy link
Contributor

@JessterB JessterB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

@BWMac BWMac merged commit 20c550a into dev Jul 2, 2024
9 checks passed
@BWMac BWMac deleted the bwmac/IBCDPE-769/gene_info_gx_suite branch July 2, 2024 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants