Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Biodomains semicolon update #124

Closed

Conversation

jaclynbeck-sage
Copy link
Contributor

@jaclynbeck-sage jaclynbeck-sage commented Feb 21, 2024

This PR addresses AG-1366 and AG-1371.

The new biodomains file changed the format of their ensembl_id column, so that instead of a single Ensembl ID per row, some rows have strings like this: "ENSG0001;ENSG0002;ENSG0003" while most still have a single ID per row. To account for this change I did the following:

  1. Bumped the version of the genes_biodomains file from 1 to 4 in the configs
  2. Created a function in utils.py called split_delimited_field_to_multiple_rows, which will take each row with multiple Ensembl IDs, duplicate it so there's enough rows for all the IDs in the list, and then assign a single Ensembl ID per row. This function is written to be generic.
  3. Wrote tests for this function in test_utils.py
  4. Had both affected transforms (genes_biodomains and gene_info) call this util function to expand out the biodomains dataframe before doing more processing on it
  5. Updated the integration test for genes_biodomains to have some input with semicolon-delimited lists
  6. Updated the gene annotation pre-processing notebook to also use the util function in order to get all the Ensembl IDs used in the biodomains data set.
  7. Ran the gene annotation pre-processing notebook to generate a new gene_metadata file and bumped the version of it from 10 to 11 in the configs.

There are also several places in the code where I did de-linting to make SonarCloud happy.

@@ -110,7 +110,6 @@ def rename_columns(df: pd.DataFrame, column_map: dict) -> pd.DataFrame:
df.rename(columns=column_map, inplace=True)
except TypeError:
print("Column mapping must be a dictionary")
return df

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deleted this "return df" statement because SonarCloud complained about it: the function always returns df whether there's an error or not.

" df=df, split_field=\"ensembl_id\", delim=\";\"\n",
" )\n",
" file_ensembl_ids = df[\"ensembl_id\"].drop_duplicates()\n",
"\n",
Copy link
Contributor Author

@jaclynbeck-sage jaclynbeck-sage Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chunk of code above is the only relevant code change in this notebook. Everything else is either formatting or different output being displayed.

@jaclynbeck-sage jaclynbeck-sage marked this pull request as ready for review February 21, 2024 18:50
Copy link
Contributor

@BWMac BWMac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a couple very minor comments. I confirmed that everything runs and genes_biodomains passes GX validation.

src/agoradatatools/etl/transform/gene_info.py Outdated Show resolved Hide resolved
src/agoradatatools/etl/transform/gene_info.py Outdated Show resolved Hide resolved
Copy link

sonarcloud bot commented Feb 21, 2024

Quality Gate Passed Quality Gate passed

Issues
0 New issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@jaclynbeck-sage
Copy link
Contributor Author

Closing this PR without merging. They changed the biodomains file format back to the old format (one Ensembl ID per row), so this update is no longer needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants