
Biomarkers transform for ModelAD #148

Merged (41 commits, Oct 4, 2024)
Changes from 11 commits

Commits
f348d68
Added biomarker files and functions to necessary locations, none are …
Sep 18, 2024
2d43820
Added biomarker transform for the Model-AD project. The transform out…
Sep 18, 2024
b19a529
Biomarkers input and output test files
Sep 19, 2024
84355f7
Added tests for biomarkers
Sep 19, 2024
1297757
Ran black formatter
Sep 20, 2024
537605c
Biomarkers test passes when it should
Sep 20, 2024
597bfca
Biomarkers transform working, need to remove custom_transform from yaml
Sep 23, 2024
b796474
Correct use of the custom_transformations parameter in yaml config file
Sep 23, 2024
d6a7d19
Added fake test data made by hand for testing biomarkers transform
Sep 24, 2024
46feee2
Added testing for duplicate data
Sep 25, 2024
4ac1f23
Formatting with black
Sep 25, 2024
bb8cb5d
Addressing PR comment about process_dataset() error message.
Sep 26, 2024
1a09560
Reformatting process.py
Sep 26, 2024
2e4c792
Addressing PR comment about TypeError for biomarkers dataset.
Sep 26, 2024
200f068
Addressing PR comment: Improved docstring and typing for the transfor…
Sep 26, 2024
dd8d422
PR comment: Reverting back to using standard typing hints to prevent …
Sep 26, 2024
3fae0ae
Removed unused import that caused pre-commit to fail.
Sep 26, 2024
f75638b
Removed unnecessary formatting from ADTDataProcessingError message.
Sep 26, 2024
8016df4
Using typing library to add more specific type hints to the transform…
Sep 26, 2024
5e3dfeb
PR comment - using preferred context managed open for converting a li…
Sep 26, 2024
8ee844d
Reverting change to see if it fixes CI: pre-commit fail
Sep 27, 2024
8ca5cc9
Maybe now the CI pre-commit will pass?
Sep 27, 2024
44cbdb2
What about now? Will the CI pre-commit pass now?
Sep 27, 2024
653bede
I think the problem was just formatting
Sep 27, 2024
f1c8e93
Added test for none/NA/nan values and using pd fillna for missing or …
Sep 30, 2024
5e1046b
Added test for missing data
Sep 30, 2024
da793a3
Added a fail test case for missing columns
Sep 30, 2024
d882457
Added passing test for datasets with extra/unknown columns
Sep 30, 2024
88d3f9d
PR comment: removed unnecessary type checking in transform_biomarkers…
Sep 30, 2024
d2b1ba0
Removed biomarkers test with real data, we have lots of test with fak…
Sep 30, 2024
e73210a
Using warning instead of raise exception to allow datasets without a …
Sep 30, 2024
31c5806
Simplifying transform_biomarkers() to make it more readable and maint…
Sep 30, 2024
e663961
Removing unnecessary warning and explicit isinstance(df,DataFrame) in…
Sep 30, 2024
ae0d9ef
Added output type hint to apply_custom_transformations()
Sep 30, 2024
97f8c54
Removing unnecessary comment
Sep 30, 2024
2c36981
Improved docstring for list_to_json()
Sep 30, 2024
50c2f5c
Outputting biomarkers transform as pd.DataFrame, removing unnecessary…
Oct 2, 2024
b464333
Removed unused import statements
Oct 2, 2024
ce25c12
PR comments - removed unused warnings import and added None as part o…
Oct 2, 2024
7848b96
Removing extra typing import, pre-commit passes
Oct 2, 2024
cc782c6
Changing ageDead to age_death in all processing and test files
Oct 3, 2024
17 changes: 17 additions & 0 deletions modelad_test_config.yaml
@@ -0,0 +1,17 @@
destination: &dest syn51498092
staging_path: ./staging
gx_folder: none
gx_table: none
datasets:
- biomarkers:
files:
- name: biomarkers
id: syn61250724.1
format: csv
final_format: json
provenance:
- syn61250724.1
destination: *dest
custom_transformations: 1
column_rename:
agedeath: ageDeath
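The custom_transformations flag and the &dest/*dest anchor pair drive how process.py handles this dataset. A minimal sketch of loading such a config, assuming PyYAML; the config text below is a trimmed copy of the file above:

```python
# Sketch: loading a config like the one above with PyYAML (assumed dependency).
# The &dest anchor and *dest alias resolve to the same Synapse ID on load.
import yaml

config_text = """
destination: &dest syn51498092
staging_path: ./staging
datasets:
  - biomarkers:
      files:
        - name: biomarkers
          id: syn61250724.1
          format: csv
      final_format: json
      destination: *dest
      custom_transformations: 1
"""

config = yaml.safe_load(config_text)
biomarkers = config["datasets"][0]["biomarkers"]

# The alias resolves to the anchored value:
assert biomarkers["destination"] == config["destination"]

# A truthy custom_transformations value is what routes this dataset
# through apply_custom_transformations() in process.py.
assert biomarkers["custom_transformations"] == 1
```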
18 changes: 18 additions & 0 deletions src/agoradatatools/etl/load.py
@@ -160,3 +160,21 @@ def dict_to_json(df: dict, staging_path: str, filename: str) -> str:
json.dump(df_as_dict, temp_json, cls=NumpyEncoder, indent=2)
temp_json.close()
return temp_json.name


def list_to_json(df: list, staging_path: str, filename: str) -> str:
"""Converts a list into a JSON file.

Args:
df (list): List to be converted to a JSON file
staging_path (str): Path to staging directory
filename (str): name of JSON file to be created

Returns:
str: Returns a string containing the name of the new JSON file
"""

temp_json = open(os.path.join(staging_path, filename), "w+")
json.dump(df, temp_json, cls=NumpyEncoder, indent=2)
temp_json.close()
return temp_json.name
Contributor

Generally, a context managed open is preferred like:

    with open(os.path.join(staging_path, filename), "w+") as temp_json:
          json.dump(df, temp_json, cls=NumpyEncoder, indent=2)
          return temp_json.name

This is so you don't need to worry about calling .close(). Explicitly closing the file is also a valid way of accomplishing this, but if you take that approach, the .close() should be inside a finally block so it's guaranteed to execute.
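The two patterns discussed here can be sketched side by side; list_to_json_finally and list_to_json_ctx are hypothetical names for illustration, and NumpyEncoder is omitted so the sketch stays self-contained:

```python
import json
import os
import tempfile

def list_to_json_finally(data, staging_path, filename):
    # Explicit close: safe only because .close() sits in a finally
    # block, so it runs even if json.dump raises.
    temp_json = open(os.path.join(staging_path, filename), "w+")
    try:
        json.dump(data, temp_json, indent=2)
        return temp_json.name
    finally:
        temp_json.close()

def list_to_json_ctx(data, staging_path, filename):
    # Context manager: the file is closed automatically when the
    # with-block exits, even on an exception.
    with open(os.path.join(staging_path, filename), "w+") as temp_json:
        json.dump(data, temp_json, indent=2)
        return temp_json.name

staging = tempfile.mkdtemp()
path_a = list_to_json_finally([1, 2], staging, "a.json")
path_b = list_to_json_ctx([1, 2], staging, "b.json")
assert json.load(open(path_a)) == [1, 2]
assert json.load(open(path_b)) == [1, 2]
```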

Member Author (@beatrizsaldana), Sep 26, 2024

Hmm, I do like this approach better than what I was doing. I was trying to copy what the other functions are doing. Feedback please: Should I...

  1. Update just this one function with the preferred context managed open
  2. Update all of the X to json functions with the preferred context managed open
  3. Leave things as they are and create a Jira ticket for updating the functions to use the preferred context managed open

Thoughts? @BryanFauble

Contributor

I would:

  1. Update any of the code you are already touching to following this approach
  2. Log a tech debt ticket to go back and look at the other areas of the code

Generally, the mantra I follow is: "Leave the code in a better place than when I started". That needs to be balanced with the scope of the change, the time you have to make the changes, and the time it's going to take to validate the change. Some minor things are probably not worth fixing if it means there is a significant effort required to test the change.

Contributor

I agree, update your own code and make a ticket for anything else you notice. I'm not sure who to assign the issue to so it doesn't get lost in the ether, maybe Jess?

Member Author

Thank you both for the feedback! I'll update the function I wrote, create a tech debt Jira ticket and assign it to Jess :)

Member Author

@JessterB I need help figuring out where to create this Jira ticket 🙃

Contributor

I'll jump in: Go to JIRA (https://sagebionetworks.jira.com/), click on "Create" in the top bar, for project select "Model AD Explorer (MG)", Issue type = "Task" (Jess will change it if she wants something different), assign it to Jess, don't worry about filling in Team or Validator.

Contributor

Thank you @jaclynbeck-sage, I missed this while I was out last week. What Jaclyn said works fine, once the ticket exists I can take it from there.

2 changes: 2 additions & 0 deletions src/agoradatatools/etl/transform/__init__.py
@@ -16,6 +16,7 @@
)
from agoradatatools.etl.transform.team_info import transform_team_info
from agoradatatools.etl.transform.proteomics import transform_proteomics
from agoradatatools.etl.transform.biomarkers import transform_biomarkers

__all__ = [
"transform_distribution_data",
@@ -28,4 +29,5 @@
"transform_rnaseq_differential_expression",
"transform_team_info",
"transform_proteomics",
"transform_biomarkers",
]
71 changes: 71 additions & 0 deletions src/agoradatatools/etl/transform/biomarkers.py
@@ -0,0 +1,71 @@
"""
This module contains the transformation logic for the biomarkers dataset.
This is for the Model AD project.
"""

import pandas as pd


def transform_biomarkers(datasets: dict) -> list:
"""
Takes dictionary of dataset DataFrames, extracts the biomarkers
DataFrame, and transforms it into a list of dictionaries grouped by
'model', 'type', 'ageDeath', 'tissue', and 'units'.

Args:
datasets (dict[str, pd.DataFrame]): dictionary of dataset names mapped to their DataFrame

Returns:
dict: a dictionary of biomarkers data modeled after intended final JSON structure
"""
biomarkers_dataset = datasets["biomarkers"]

# Check that the dataset looks like what we expect
if not isinstance(biomarkers_dataset, pd.DataFrame):
Contributor

See my comment on your test function; this error check probably isn't necessary.

Member Author (@beatrizsaldana), Sep 26, 2024

Yes, I was thinking about this earlier. The type hints should catch this :)

I'll remove it. Thank you for the validation!

raise ValueError("Biomarker dataset is not a pandas DataFrame")
if (
not list(biomarkers_dataset.columns).sort()
== [
"model",
"type",
"ageDeath",
"tissue",
"units",
"genotype",
"measurement",
"sex",
].sort()
):
raise ValueError(
f"Biomarker dataset does not contain expected columns. Columns found: {list(biomarkers_dataset.columns)}"
Contributor

It might be worth changing this check from == to checking that biomarkers contains those columns, so that the data set has to have those columns in it but can have extra columns we don't care about.

Member Author

So true! I was trying to be strict with the error handling, but if there is a possibility for extra columns that we could just ignore, then I'll use isin or something like that instead of the ==. Thank you!
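The containment check suggested here could look like the sketch below (REQUIRED and check_columns are illustrative names). Note, as an aside, that list.sort() sorts in place and returns None, so the ==-of-.sort() comparison in the diff above compares None to None and can never fire; sorted() or a set difference avoids that pitfall:

```python
# Sketch of a containment check: required columns must be present,
# extra columns are ignored.
import pandas as pd

REQUIRED = {"model", "type", "ageDeath", "tissue", "units",
            "genotype", "measurement", "sex"}

def check_columns(df: pd.DataFrame) -> None:
    missing = REQUIRED - set(df.columns)
    if missing:
        raise ValueError(
            f"Biomarker dataset is missing expected columns: {sorted(missing)}"
        )

df = pd.DataFrame(columns=sorted(REQUIRED) + ["extra_column"])
check_columns(df)  # extra columns are fine

try:
    check_columns(df.drop(columns=["model"]))
except ValueError:
    pass  # a missing required column is caught
else:
    raise AssertionError("expected ValueError")
```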

)

data_as_list = []
grouped = biomarkers_dataset.groupby(
["model", "type", "ageDeath", "tissue", "units"]
)

for (model, type_, ageDeath, tissue, units), group in grouped:
# Create the base structure for each group
entry = {
"model": model,
"type": type_,
"ageDeath": ageDeath,
"tissue": tissue,
"units": units,
"points": [],
}

# Append the measurement, genotype, and sex for each row
for _, row in group.iterrows():
point = {
"genotype": row["genotype"],
"measurement": row["measurement"],
"sex": row["sex"],
}
entry["points"].append(point)

# Add the entry to the list
data_as_list.append(entry)

return data_as_list
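A minimal, self-contained sketch of the groupby step above, run on a tiny hand-made frame shaped like the test CSVs in this PR:

```python
import pandas as pd

# Two rows that share the same (model, type, ageDeath, tissue, units)
# key but differ in genotype/measurement.
df = pd.DataFrame({
    "model": ["ModelA", "ModelA"],
    "type": ["TypeA", "TypeA"],
    "ageDeath": [1, 1],
    "tissue": ["TissueA", "TissueA"],
    "units": ["A", "A"],
    "genotype": ["genotype1", "genotype2"],
    "measurement": [1, 2],
    "sex": ["male", "male"],
})

data_as_list = []
for (model, type_, age_death, tissue, units), group in df.groupby(
    ["model", "type", "ageDeath", "tissue", "units"]
):
    entry = {"model": model, "type": type_, "ageDeath": age_death,
             "tissue": tissue, "units": units, "points": []}
    for _, row in group.iterrows():
        entry["points"].append({
            "genotype": row["genotype"],
            "measurement": row["measurement"],
            "sex": row["sex"],
        })
    data_as_list.append(entry)

assert len(data_as_list) == 1           # one group key -> one entry
assert len(data_as_list[0]["points"]) == 2  # both rows become points
```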
14 changes: 13 additions & 1 deletion src/agoradatatools/process.py
@@ -59,6 +59,8 @@ def apply_custom_transformations(datasets: dict, dataset_name: str, dataset_obj:
if dataset_name in ["proteomics", "proteomics_tmt", "proteomics_srm"]:
df = datasets[dataset_name]
return transform.transform_proteomics(df=df)
if dataset_name == "biomarkers":
return transform.transform_biomarkers(datasets=datasets)
else:
return None

@@ -123,12 +125,22 @@ def process_dataset(
staging_path=staging_path,
filename=dataset_name + "." + dataset_obj[dataset_name]["final_format"],
)
else:
elif isinstance(df, list):
json_path = load.list_to_json(
df=df,
staging_path=staging_path,
filename=dataset_name + "." + dataset_obj[dataset_name]["final_format"],
)
elif isinstance(df, DataFrame):
json_path = load.df_to_json(
df=df,
staging_path=staging_path,
filename=dataset_name + "." + dataset_obj[dataset_name]["final_format"],
)
else:
raise ADTDataProcessingError(
f"Data processing failed for {dataset_name}. Dataframe type is not supported."
)
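The type dispatch added here can be sketched in isolation; serialize is a hypothetical stand-in for the list_to_json/df_to_json calls:

```python
import json
import os
import tempfile
from pandas import DataFrame

class ADTDataProcessingError(Exception):
    pass

def serialize(df, staging_path, filename):
    path = os.path.join(staging_path, filename)
    if isinstance(df, list):
        # list_to_json path: biomarkers-style nested output
        with open(path, "w") as f:
            json.dump(df, f, indent=2)
    elif isinstance(df, DataFrame):
        # df_to_json path: tabular output
        df.to_json(path, orient="records")
    else:
        raise ADTDataProcessingError(
            f"Data processing failed for {filename}. Dataframe type is not supported."
        )
    return path

staging = tempfile.mkdtemp()
serialize([{"a": 1}], staging, "list.json")
serialize(DataFrame({"a": [1]}), staging, "df.json")

try:
    serialize("not supported", staging, "bad.json")
except ADTDataProcessingError:
    pass  # unsupported types fail loudly instead of silently
```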

gx_enabled = dataset_obj[dataset_name].get("gx_enabled", False)

@@ -0,0 +1,5 @@
model,type,measurement,units,ageDeath,tissue,sex,genotype
ModelA,TypeA,1,A,1,TissueA,male,genotype1
ModelA,TypeA,1,A,1,TissueA,male,genotype1
ModelA,TypeA,1,A,1,TissueA,male,genotype2
ModelA,TypeA,1,A,1,TissueA,male,genotype2
39 changes: 39 additions & 0 deletions tests/test_assets/biomarkers/input/biomarkers_good_input.csv
@@ -0,0 +1,39 @@
genotype,measurement,sex,model,type,ageDeath,tissue,units
3xTG-AD,1.887403509,male,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,2.507680422,male,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,3.095873882,male,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,3.639017544,male,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,0.74988673,female,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,11.98119457,female,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,6.393679577,female,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,3.098566667,male,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,3.452296527,male,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,2.644899123,male,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,0.110659787,male,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,1.899942325,male,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,2.397684799,male,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,6.048395918,female,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,5.886637838,female,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,3.368940156,female,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,0.234726268,female,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,6.193847515,female,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,7.122192076,female,3xTG-AD,Insoluble Abeta40,4,cerebral cortex,pg/mg
3xTG-AD,4.853281065,male,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,37.12325,male,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,15.62036898,male,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,37.12734722,male,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,75.55092784,female,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,63.56345,female,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,60.22938,female,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,23.23818056,male,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,93.87042718,male,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,107.5141702,male,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,130.5238413,male,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,13.75984343,male,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,34.07885294,male,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,80.33369231,female,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,84.89822857,female,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,92.64340206,female,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,133.3285882,female,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,130.4151077,female,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
3xTG-AD,255.0758333,female,3xTG-AD,Insoluble Abeta40,4,hippocampus,pg/mg
@@ -0,0 +1,7 @@
model,type,measurement,units,ageDeath,tissue,sex,genotype
ModelA,TypeA,1,A,1,TissueA,male,genotype1
ModelA,TypeA,2,A,1,TissueA,male,genotype1
ModelA,TypeA,3,A,2,TissueA,male,genotype2
ModelA,TypeB,4,A,2,TissueA,male,genotype1
ModelA,TypeB,5,A,3,TissueA,male,genotype1
ModelA,TypeB,6,A,3,TissueA,male,genotype2
@@ -0,0 +1,31 @@
[
{
"model": "ModelA",
"type": "TypeA",
"ageDeath": 1,
"tissue": "TissueA",
"units": "A",
"points": [
{
"genotype": "genotype1",
"measurement": 1,
"sex": "male"
},
{
"genotype": "genotype1",
"measurement": 1,
"sex": "male"
},
{
"genotype": "genotype2",
"measurement": 1,
"sex": "male"
},
{
"genotype": "genotype2",
"measurement": 1,
"sex": "male"
}
]
}
]