Biodomains semicolon update #124

jaclynbeck-sage · 2024-02-21T04:23:37Z

This PR addresses AG-1366 and AG-1371.

The new biodomains file changed the format of its ensembl_id column: instead of a single Ensembl ID per row, some rows now contain semicolon-delimited strings like "ENSG0001;ENSG0002;ENSG0003", while most still have a single ID. To account for this change I did the following:

- Bumped the version of the genes_biodomains file from 1 to 4 in the configs.
- Created a function in utils.py called split_delimited_field_to_multiple_rows, which takes each row with multiple Ensembl IDs, duplicates it so there are enough rows for all the IDs in the list, and then assigns a single Ensembl ID per row. This function is written to be generic.
- Wrote tests for this function in test_utils.py.
- Had both affected transforms (genes_biodomains and gene_info) call this util function to expand the biodomains dataframe before doing more processing on it.
- Updated the integration test for genes_biodomains to include some input with semicolon-delimited lists.
- Updated the gene annotation pre-processing notebook to also use the util function in order to get all the Ensembl IDs used in the biodomains data set.
- Ran the gene annotation pre-processing notebook to generate a new gene_metadata file and bumped its version from 10 to 11 in the configs.
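
For readers who have not opened the diff, the row expansion described above can be sketched roughly as follows. This is a minimal illustration and not the code from utils.py: only the function name and its keyword arguments (df, split_field, delim) appear in this PR, while the body, the "biodomain" column, and the sample values are assumptions made for the example.

```python
import pandas as pd


def split_delimited_field_to_multiple_rows(
    df: pd.DataFrame, split_field: str, delim: str
) -> pd.DataFrame:
    """Hypothetical sketch: one output row per item in a delimited field."""
    out = df.copy()  # leave the caller's dataframe untouched
    # "A;B;C" becomes ["A", "B", "C"]; a single ID becomes a one-item list,
    # and missing values stay missing.
    out[split_field] = out[split_field].str.split(delim)
    # explode() repeats the other columns once per list item.
    return out.explode(split_field).reset_index(drop=True)


# Made-up sample data in the shape described above.
biodomains = pd.DataFrame(
    {
        "ensembl_id": ["ENSG0001;ENSG0002;ENSG0003", "ENSG0004"],
        "biodomain": ["Synapse", "Apoptosis"],
    }
)
expanded = split_delimited_field_to_multiple_rows(
    df=biodomains, split_field="ensembl_id", delim=";"
)
print(expanded)
```

In this sketch the first input row becomes three output rows that share the same "biodomain" value, and rows that already hold a single ID pass through unchanged.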

There are also several places in the code where I did de-linting to make SonarCloud happy.

jaclynbeck-sage · 2024-02-21T04:28:26Z

src/agoradatatools/etl/utils.py

@@ -110,7 +110,6 @@ def rename_columns(df: pd.DataFrame, column_map: dict) -> pd.DataFrame:
        df.rename(columns=column_map, inplace=True)
    except TypeError:
        print("Column mapping must be a dictionary")
-        return df



I deleted this "return df" statement because SonarCloud flagged it as redundant: the function always returns df whether there's an error or not.
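
To make the reasoning concrete, here is a sketch of how the function might look after the change. The signature and the try/except body are taken from the diff context above; the single trailing "return df" is an assumption inferred from the comment, since the diff does not show the end of the function.

```python
import pandas as pd


def rename_columns(df: pd.DataFrame, column_map: dict) -> pd.DataFrame:
    try:
        df.rename(columns=column_map, inplace=True)
    except TypeError:
        print("Column mapping must be a dictionary")
    # Assumed: one return after the try/except covers both the success
    # path and the error path, so a second return inside `except` adds nothing.
    return df


renamed = rename_columns(pd.DataFrame({"a": [1]}), {"a": "b"})
# A list is neither a mapping nor callable, so rename() raises TypeError
# and the dataframe comes back with its original columns.
unchanged = rename_columns(pd.DataFrame({"a": [1]}), ["b"])
```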

jaclynbeck-sage · 2024-02-21T18:33:56Z

data_analysis/agora/notebooks/preprocessing/AG-896_Preprocess_Gene_Annotations.ipynb

+    "            df=df, split_field=\"ensembl_id\", delim=\";\"\n",
+    "        )\n",
+    "        file_ensembl_ids = df[\"ensembl_id\"].drop_duplicates()\n",
+    "\n",


The chunk of code above is the only relevant code change in this notebook. Everything else is either formatting or different output being displayed.
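
As a rough illustration of what that chunk computes, the same list of unique IDs can be obtained with plain pandas calls. This sketch inlines the split instead of calling the util function, and the sample dataframe is made up for the example.

```python
import pandas as pd

# Made-up input with one semicolon-delimited row and one repeated ID.
df = pd.DataFrame({"ensembl_id": ["ENSG0001;ENSG0002", "ENSG0002", "ENSG0003"]})

file_ensembl_ids = (
    df["ensembl_id"]
    .str.split(";")    # each cell becomes a list of IDs
    .explode()         # one ID per row
    .drop_duplicates() # keep the first occurrence of each ID
    .reset_index(drop=True)
)
print(file_ensembl_ids.tolist())
```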

BWMac

LGTM! Just a couple very minor comments. I confirmed that everything runs and genes_biodomains passes GX validation.

src/agoradatatools/etl/transform/gene_info.py

sonarcloud · 2024-02-21T20:35:18Z

Quality Gate passed

Issues
0 New issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

jaclynbeck-sage · 2024-02-22T20:51:47Z

Closing this PR without merging. They changed the biodomains file format back to the old format (one Ensembl ID per row), so this update is no longer needed.

jaclynbeck-sage added 4 commits February 16, 2024 18:25

Modified genes_biodomains transform to handle semicolon-separated Ensembl IDs, and edited the tests to test that case

303786e

Moved split_ensembl_ids function from biodomains transform to utils file, made it more generic, updated its test, plus sonarcloud delinting

ca08eff

Updated gene annotation pre-processing to handle new biodomains file, bumped version of gene_table_merged to the new file

04014dd

Moved biodomain split operation before grouping operation in gene_info transform

699defe

jaclynbeck-sage commented Feb 21, 2024

Added clarification to the docstring for the new util function

4d5baef

jaclynbeck-sage commented Feb 21, 2024

Added test for null values for the split delimited function

2327fbc

jaclynbeck-sage marked this pull request as ready for review February 21, 2024 18:50

jaclynbeck-sage requested review from BWMac, JessterB and thomasyu888 February 21, 2024 18:50

BWMac approved these changes Feb 21, 2024

src/agoradatatools/etl/transform/gene_info.py (outdated)

src/agoradatatools/etl/transform/gene_info.py (outdated)

Changed resource url variables to uppercase to indicate constants

e6c30d4

jaclynbeck-sage closed this Feb 22, 2024
