Biodomains semicolon update #124
…embl IDs, and edited the tests to test that case
…ile, made it more generic, updated its test, plus sonarcloud delinting
… bumped version of gene_table_merged to the new file
@@ -110,7 +110,6 @@ def rename_columns(df: pd.DataFrame, column_map: dict) -> pd.DataFrame:
        df.rename(columns=column_map, inplace=True)
    except TypeError:
        print("Column mapping must be a dictionary")
        return df
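For context, here is a minimal sketch of how rename_columns might read once the hunk above is applied. Only the four lines shown in the hunk come from the PR. The try: line, the final return df, and the assumption that the removed line was the return df inside the except branch are all inferred, not confirmed by the diff.

```python
import pandas as pd


def rename_columns(df: pd.DataFrame, column_map: dict) -> pd.DataFrame:
    # Assumed shape of the function after the change; the real body may differ.
    try:
        df.rename(columns=column_map, inplace=True)
    except TypeError:
        print("Column mapping must be a dictionary")
    # Single exit point: df is returned whether or not the rename succeeded,
    # so a second "return df" inside the except branch would be redundant.
    return df


# Hypothetical inputs, for illustration only.
ok = rename_columns(pd.DataFrame({"a": [1]}), {"a": "b"})
bad = rename_columns(pd.DataFrame({"a": [1]}), ["not", "a", "dict"])
```

With a single return at the end, both the success path and the error path hand back the same dataframe object, which is the behavior the redundant statement was duplicating.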
I deleted this "return df" statement because SonarCloud complained about it: the function always returns df whether there's an error or not.
" df=df, split_field=\"ensembl_id\", delim=\";\"\n",
" )\n",
" file_ensembl_ids = df[\"ensembl_id\"].drop_duplicates()\n",
"\n",
The chunk of code above is the only relevant code change in this notebook. Everything else is either formatting or different output being displayed.
LGTM! Just a couple very minor comments. I confirmed that everything runs and genes_biodomains passes GX validation.
Quality Gate passed
Closing this PR without merging. They changed the biodomains file format back to the old format (one Ensembl ID per row), so this update is no longer needed.
This PR addresses AG-1366 and AG-1371.

The new biodomains file changed the format of its ensembl_id column, so that instead of a single Ensembl ID per row, some rows have strings like this: "ENSG0001;ENSG0002;ENSG0003" while most still have a single ID per row. To account for this change I did the following:

- A function in utils.py called split_delimited_field_to_multiple_rows, which takes each row with multiple Ensembl IDs, duplicates it so there are enough rows for all the IDs in the list, and then assigns a single Ensembl ID per row. This function is written to be generic.
- A test for it in test_utils.py
- genes_biodomains and gene_info call this util function to expand out the biodomains dataframe before doing more processing on it
- The tests for genes_biodomains now have some input with semicolon-delimited lists
- The gene_metadata file's version was bumped from 10 to 11 in the configs.

There are also several places in the code where I did de-linting to make SonarCloud happy.
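The row expansion described above can be sketched as follows. This is an assumed implementation, not the code from the PR: only the function name and the df, split_field, and delim parameters appear in the notebook diff, while the body (pandas str.split followed by explode) and the biodomain column with its values are hypothetical.

```python
import pandas as pd


def split_delimited_field_to_multiple_rows(
    df: pd.DataFrame, split_field: str, delim: str
) -> pd.DataFrame:
    # Assumed body; the PR's actual implementation is not shown in this page.
    result = df.copy()
    # "A;B;C" becomes ["A", "B", "C"]; a single ID becomes a one-element list.
    # regex=False treats delim as a literal string, which keeps this generic.
    result[split_field] = result[split_field].str.split(delim, regex=False)
    # explode() emits one row per list element and repeats the other columns.
    return result.explode(split_field).reset_index(drop=True)


# Hypothetical input mirroring the mixed format described in the PR.
df = pd.DataFrame(
    {
        "ensembl_id": ["ENSG0001;ENSG0002;ENSG0003", "ENSG0004"],
        "biodomain": ["Synapse", "Autophagy"],
    }
)
expanded = split_delimited_field_to_multiple_rows(
    df=df, split_field="ensembl_id", delim=";"
)
print(expanded)
```

Rows that already hold a single ID pass through unchanged, so the same call works on both the old and the new file format.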