Biomarkers transform for ModelAD #148

beatrizsaldana · 2024-09-25T23:45:14Z

Creates a new transform for the biomarkers dataset. The transform will re-structure data as described in this jira ticket.

This is my first PR in this repo. Please review carefully and be as brutally honest as is necessary. It's better for me to learn things now than for us to have to go back and fix or add things later because nobody wanted to tell me I was doing something sub-optimally.

Expected Changes

Added modelad_test_config.yaml
Added a biomarkers transform function
Added test cases and test data

Unexpected Changes

The transform_biomarkers() function outputs the transform as a list instead of dict or pd.DataFrame as is expected.
I added a list_to_json() function in src/agoradatatools/etl/load.py to accommodate the new output type.
I added elif isinstance(df, list): in the process_dataset() function in src/agoradatatools/process.py.

@BWMac what do you think about the Unexpected Changes? Would it be better for the transform_biomarkers() function to output a dict or pd.DataFrame and prevent any of these extra changes? All feedback is welcome.
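For readers following the thread, the trade-off raised above can be sketched roughly as follows. This is only an illustration, not the repository's code: stage_output(), dict_to_json(), list_to_json() and _dump() here are simplified, hypothetical stand-ins (the real helpers live in src/agoradatatools/etl/load.py, have their own signatures, and also deal with pd.DataFrame and NumpyEncoder). The point is that every extra output type a transform may return needs its own isinstance branch in the dispatcher.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def _dump(data: Any, staging_path: str, filename: str) -> str:
    """Shared helper (hypothetical): write JSON and return the file path."""
    path = os.path.join(staging_path, filename)
    with open(path, "w") as handle:
        json.dump(data, handle, indent=2)
    return path


def dict_to_json(data: Dict[str, Any], staging_path: str, filename: str) -> str:
    """Simplified stand-in for a dict writer."""
    return _dump(data, staging_path, filename)


def list_to_json(data: List[Any], staging_path: str, filename: str) -> str:
    """Simplified stand-in for the list writer proposed in this PR."""
    return _dump(data, staging_path, filename)


def stage_output(data: Any, staging_path: str, filename: str) -> str:
    """Dispatch on the transform's output type (hypothetical dispatcher)."""
    if isinstance(data, dict):
        return dict_to_json(data, staging_path, filename)
    elif isinstance(data, list):
        # The extra branch that a list-returning transform requires.
        return list_to_json(data, staging_path, filename)
    raise TypeError(f"Unsupported transform output type: {type(data).__name__}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as staging:
        print(stage_output([{"model": "A"}], staging, "biomarkers.json"))
```

If the transform returned a dict or pd.DataFrame instead, the list branch and its writer would not be needed, which is the alternative the question above asks about.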

UPDATE

No unexpected changes were implemented.

src/agoradatatools/process.py

src/agoradatatools/etl/transform/biomarkers.py

BryanFauble · 2024-09-26T16:59:09Z

src/agoradatatools/etl/load.py

+    temp_json = open(os.path.join(staging_path, filename), "w+")
+    json.dump(df, temp_json, cls=NumpyEncoder, indent=2)
+    temp_json.close()
+    return temp_json.name


Generally, a context managed open is preferred like:

    with open(os.path.join(staging_path, filename), "w+") as temp_json:
        json.dump(df, temp_json, cls=NumpyEncoder, indent=2)
        return temp_json.name

With a context manager you don't need to worry about calling .close(). Calling .close() explicitly is also a valid way of accomplishing this, but if that is the approach you want to take, the .close() call should be inside a finally block so it's guaranteed to execute.
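A minimal sketch of the two options, side by side. The function names list_to_json_with() and list_to_json_finally() are made up for this comparison, and the cls=NumpyEncoder argument from the diff is left out so the snippet runs with only the standard library; everything else mirrors the lines under review.

```python
import json
import os
import tempfile


def list_to_json_with(data: list, staging_path: str, filename: str) -> str:
    """Context-managed open: the file is closed when the block exits,
    even if json.dump() raises or the function returns early."""
    with open(os.path.join(staging_path, filename), "w+") as temp_json:
        json.dump(data, temp_json, indent=2)
        return temp_json.name


def list_to_json_finally(data: list, staging_path: str, filename: str) -> str:
    """Explicit close: the finally block guarantees close() runs
    whether or not json.dump() raises."""
    temp_json = open(os.path.join(staging_path, filename), "w+")
    try:
        json.dump(data, temp_json, indent=2)
    finally:
        temp_json.close()
    # The name attribute is still readable after the file is closed.
    return temp_json.name


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as staging:
        print(list_to_json_with([{"a": 1}], staging, "with.json"))
        print(list_to_json_finally([{"a": 1}], staging, "finally.json"))
```

Both versions write the same file; the first is shorter and harder to get wrong.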

Hmm, I do like this approach better than what I was doing. I was trying to copy what the other functions are doing. Feedback please: Should I...

Update just this one function with the preferred context managed open

Update all of the X to json functions with the preferred context managed open

Leave things as they are and create a Jira ticket for updating the functions to use the preferred context managed open

Thoughts? @BryanFauble

I would:

Update any of the code you are already touching to follow this approach

Log a tech debt ticket to go back and look at the other areas of the code

Generally, the mantra I follow is: "Leave the code in a better place than when I started". That needs to be balanced with the scope of the change, the time you have to make the changes, and the time it's going to take to validate the change. Some minor things are probably not worth fixing if it means there is a significant effort required to test the change.

I agree, update your own code and make a ticket for anything else you notice. I'm not sure who to assign the issue to so it doesn't get lost in the ether, maybe Jess?

Thank you both for the feedback! I'll update the function I wrote, create a tech debt Jira ticket and assign it to Jess :)

@JessterB I need help figuring out where to create this Jira ticket 🙃

I'll jump in: Go to JIRA (https://sagebionetworks.jira.com/), click on "Create" in the top bar, for project select "Model AD Explorer (MG)", Issue type = "Task" (Jess will change it if she wants something different), assign it to Jess, don't worry about filling in Team or Validator.

Thank you @jaclynbeck-sage, I missed this while I was out last week. What Jaclyn said works fine, once the ticket exists I can take it from there.

src/agoradatatools/etl/transform/biomarkers.py

jaclynbeck-sage · 2024-09-26T19:34:53Z

tests/transform/test_biomarkers.py

+    pass_test_data = [
+        (  # Pass with good real data
+            "biomarkers_good_input.csv",
+            "biomarkers_good_output.json",


I'm not sure I see a need to test both real data and fake data if they're both good input. Usually for my tests I just subset to a small number of rows from the real data as my test input, and then tweak a few things from there if I need to check what happens with missing values or duplicates.

Very true. I removed the "real" data test since it is hardest for us to visually validate.
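The approach described above (start from a few good rows, then tweak copies of them for the edge cases) could look roughly like this. Everything here is hypothetical: the column names, the rows, and toy_transform() are invented for illustration and are not the real biomarkers data or transform_biomarkers(). The snippet uses only the standard library; in the repo's test suite the same cases would more likely be fed to pytest.mark.parametrize.

```python
import copy

# Hypothetical rows standing in for a small subset of the real input.
GOOD_ROWS = [
    {"model": "A", "type": "x", "measurement": 1.0},
    {"model": "A", "type": "x", "measurement": 2.0},
    {"model": "B", "type": "y", "measurement": 3.0},
]


def toy_transform(rows):
    """Toy stand-in for a transform: drop exact duplicate rows, then
    group measurements by (model, type)."""
    unique = []
    for row in rows:
        if row not in unique:
            unique.append(row)
    grouped = {}
    for row in unique:
        key = (row["model"], row["type"])
        grouped.setdefault(key, []).append(row["measurement"])
    return [{"model": m, "type": t, "points": pts} for (m, t), pts in grouped.items()]


def with_duplicate(rows):
    """Copy of the good rows with the first row repeated at the end."""
    out = copy.deepcopy(rows)
    out.append(copy.deepcopy(out[0]))
    return out


def with_missing(rows):
    """Copy of the good rows with one measurement blanked out."""
    out = copy.deepcopy(rows)
    out[1]["measurement"] = None
    return out


GOOD_OUTPUT = [
    {"model": "A", "type": "x", "points": [1.0, 2.0]},
    {"model": "B", "type": "y", "points": [3.0]},
]

CASES = [
    ("good", GOOD_ROWS, GOOD_OUTPUT),
    ("duplicate", with_duplicate(GOOD_ROWS), GOOD_OUTPUT),
    (
        "missing",
        with_missing(GOOD_ROWS),
        [
            {"model": "A", "type": "x", "points": [1.0, None]},
            {"model": "B", "type": "y", "points": [3.0]},
        ],
    ),
]


def run_cases():
    """Check every case and return how many were checked."""
    for name, rows, expected in CASES:
        assert toy_transform(rows) == expected, name
    return len(CASES)


if __name__ == "__main__":
    print(run_cases(), "cases passed")
```

Because every edge case is derived from the same small set of good rows, the expected outputs stay short enough to validate by eye.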

src/agoradatatools/process.py

src/agoradatatools/etl/load.py

modelad_test_config.yaml

BWMac · 2024-09-30T23:35:09Z

@beatrizsaldana Check this out if you haven't seen it yet for using pre-commit locally.

src/agoradatatools/etl/load.py

jaclynbeck-sage · 2024-09-30T23:44:09Z

src/agoradatatools/process.py

@@ -1,5 +1,6 @@
 import logging
 import typing
+import warnings


This needs to get removed per pre-commit.

src/agoradatatools/process.py

… functions that were being used to convert a list to a json

…f the apply_custom_transformations() output types

thomasyu888

🚀 LGTM! I'll defer to others for final review, but the code looks great. Good looking tests!

sonarcloud · 2024-10-03T22:00:21Z

Quality Gate passed

Issues
11 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

JessterB

LGTM!

jaclynbeck-sage

Excellent! Everything looks great.

BWMac

🔥 LGTM

Beatriz Saldana added 11 commits September 18, 2024 11:13

Added biomarker files and functions to necessary locations, none are functional at the moment

f348d68

Added biomarker transform for the Model-AD project. The transform outputs a list and so a new list_to_json() function was added to the load module and logic to handle this was added to the process_dataset function

2d43820

Biomarkers input and output test files

b19a529

Added tests for biomarkers

84355f7

Ran black formatter

1297757

Biomarkers test passes when it should

537605c

Biomarkers transform working, need to remove custom_transform from yaml

597bfca

Correct use of the custom_transformations parameter in yaml config file

b796474

Added fake test data made by hand for testing biomarkers transform

d6a7d19

Added testing for duplicate data

46feee2

Formatting with black

4ac1f23

thomasyu888 requested a review from BWMac September 25, 2024 23:56

beatrizsaldana commented Sep 25, 2024

src/agoradatatools/process.py (outdated)

beatrizsaldana commented Sep 25, 2024

src/agoradatatools/process.py (outdated)

beatrizsaldana requested review from JessterB and jaclynbeck-sage September 25, 2024 23:59

beatrizsaldana self-assigned this Sep 25, 2024

beatrizsaldana marked this pull request as draft September 26, 2024 00:01

beatrizsaldana added the enhancement New feature or request label Sep 26, 2024

BryanFauble reviewed Sep 26, 2024

src/agoradatatools/etl/transform/biomarkers.py (outdated)

BryanFauble reviewed Sep 26, 2024

src/agoradatatools/etl/transform/biomarkers.py (outdated)

Beatriz Saldana added 7 commits September 26, 2024 11:18

Addressing PR comment about process_dataset() error message.

bb8cb5d

Reformatting process.py

1a09560

Addressing PR comment about TypeError for biomarkers dataset.

2e4c792

Addressing PR comment: Improved docstring and typing for the transform_biomarkers() function.

200f068

PR comment: Reverting back to using standard typing hints to prevent problems with Python 3.8

dd8d422

Removed unused import that caused pre-commit to fail.

3fae0ae

Removed unnecessary formatting from ADTDataProcessingError message.

f75638b

jaclynbeck-sage reviewed Sep 26, 2024

Beatriz Saldana added 2 commits September 30, 2024 15:20

Simplifying transform_biomarkers() to make it more readable and maintainable

31c5806

Removing unnecessary warning and explicit isinstance(df,DataFrame) in process_dataset() for converting to json

e663961

beatrizsaldana requested review from BryanFauble, BWMac, jaclynbeck-sage and thomasyu888 September 30, 2024 23:22