This repository has been archived by the owner on Jun 30, 2023. It is now read-only.

Fetch nhdv2 attributes from ScienceBase #82

Merged
lekoenig merged 9 commits into USGS-R:main from fetch-nhdv2-attributes on Feb 17, 2022

Conversation

@lekoenig (Collaborator) commented Feb 7, 2022

Addresses #16 and #77.

The code here adds a target to download all of the data associated with our VarsOfInterest table. The function to download the data also filters the output to only include COMIDs specified by the user, and saves a csv of the filtered output. The function also contains an argument to choose between retaining or deleting the original downloaded zipped and unzipped files.
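Roughly, the interface looks like the minimal sketch below. This is illustrative only, not the implementation in this PR: the vars_item argument name and the function body are assumptions (the unzip/read step is simplified to a single file per ScienceBase item), while save_dir, comids, and delete_local_copies match the call shown later in 1_fetch.R. It assumes the sbtools, readr, and dplyr packages.

library(sbtools)
library(readr)
library(dplyr)

fetch_nhdv2_attributes_from_sb <- function(vars_item, save_dir, comids,
                                           delete_local_copies = TRUE) {
  sb_id <- unique(vars_item$sb_id)

  # Download the zipped attribute tables attached to the ScienceBase item
  zip_files <- item_file_download(sb_id, dest_dir = save_dir, overwrite_file = TRUE)

  # Unzip, read in, and filter the attribute table to the requested COMIDs
  unzipped_files <- unlist(lapply(zip_files, unzip, exdir = save_dir))
  dat_filtered <- read_csv(unzipped_files[1], show_col_types = FALSE) %>%
    filter(COMID %in% comids)

  # Save the filtered csv and (optionally) delete the raw downloads
  out_file <- file.path(save_dir, paste0(sb_id, ".csv"))
  write_csv(dat_filtered, out_file)
  if (delete_local_copies) file.remove(c(zip_files, unzipped_files))

  out_file
}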

I'm tagging @msleckman for a code review and @jds485 for a more cursory review. A few questions I have on this PR:

  • We could add the NLCD years beyond 2011 to VarsOfInterest and download them in a similar manner. Would this be helpful for addressing Fetch NHD attributes : NLCD Land Cover 2001-2019 #56? I know we currently download NLCD data separately (in fact, 'Land Cover' variables have been filtered out of the target for p1_vars_of_interest), so I'm happy to leave that alone and fetch these data in a similar manner to NLCD2011. @msleckman, I'm interested in your thoughts on that since I'm not too familiar with the existing NLCD download functions.

  • There are some variables in our VarsOfInterest table that seem like they could be duplicates (i.e., slightly different GIS layer name but same sb id). I haven't looked at the data sources too closely, so I'd appreciate if anyone else can clarify whether these are distinct variables. From my notes:

There are variables labeled "ET" and "AET" (rows 78 and 62 in the VarsOfInterest csv) that point to the same sb link. Are these the same variable that got duplicated, or were we expecting different data for "ET" and "AET"? Similar question regarding variables RUN and RUN7100 (rows 115 and 112) and RH and RH6190 (rows 61 and 77).

  • In the vars of interest file, I added a column to indicate the sb items we are interested in so as to exclude bulky rasters that are included in some of these datasets. I used the entry in that column to limit the National Inventory of Dams (NID) dataset downloads to start at 1960, even though there are also sb items available for 1930, 1940, and 1950. It seemed like we were cutting off other metrics at 1960 (e.g., FORE-SCE?), so I figured I'd be consistent here. What do you think? Should we retain all of the years available?

@jds485 (Member) commented Feb 7, 2022

Duplicate variables: Thanks for finding these! I just looked at the SB metadata to see if they differ, and I think these are exact duplicates with different names. Here are my thoughts on which names to retain:

  • ET and AET: AET to be specific that it's actual instead of potential.
  • RUN and RUN7100: RUN7100 to be specific to years used.
  • RH and RH6190: RH6190, likewise to be specific to the years used.

@jds485 (Member) commented Feb 7, 2022

Should we retain all of the years available?

Cutting off at 1960 sounds good to me to capture ~20 year historical information from our start date. That applies to all datasets with multiple years available.

@jds485 (Member) commented Feb 7, 2022

I looked at the column you added in the VarsOfInterest table. The only variables that do not have zipped files listed are the mean monthly precip variables that I added, so that makes me think I may have specified column names incorrectly. I thought I used the names listed in their metadata file. Let me know if you need help getting the data for those variables

@lekoenig (Collaborator Author) commented Feb 7, 2022

I looked at the column you added in the VarsOfInterest table. The only variables that do not have zipped files listed are the mean monthly precip variables that I added, so that makes me think I may have specified column names incorrectly. I thought I used the names listed in their metadata file. Let me know if you need help getting the data for those variables

@jds485, thanks for that offer! There are missing names in that column for the precip variables because I was lazy and didn't want to copy all the folder names (there are many!) 😄 I instead added special handling to define those item names in lines 23-37 of 1_fetch/src/fetch_nhdv2_attributes_from_sb.R.


# Parse name(s) of unzipped files
file_name <- basename(out_file)
file_name_sans_ext <- substr(file_name,1,nchar(file_name)-4)
Member:

Could get the number of characters in the extension from a strsplit that takes the last element. But I do expect that data downloads would have 4-character extensions.

Collaborator Author:

Thanks for this suggestion - like you said, it works for downloading our nhd attributes, but it'd be nice to make that file-extension bit robust to various file types. I had originally thought to use str_split with pattern = "." and just take everything before the period to define the file name. But then I wasn't sure how often "." would be used within the file name itself (not just to demarcate the extension), so I went with this approach.

Member:

I was thinking something like:
ext_nchar <- str_split(file_name, pattern = fixed('.'))[[1]] %>% nchar() %>% last()  # fixed('.') so the dot is treated literally rather than as a regex wildcard

Collaborator Author:

Oh, good idea to use last() like that! I also implemented this right after posting my first comment:

str_split(file_name,".[[:alnum:]]+$")[[1]][1]

It should split the string on the final ".ext" (regardless of the length of the file extension) and then take the first part.
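For example (the file name here is just an illustration):

library(stringr)

# ".[[:alnum:]]+$" matches the trailing ".ext" chunk, so the first piece returned
# by the split is the file name without its extension
str_split("example_file.zip", ".[[:alnum:]]+$")[[1]][1]
#> [1] "example_file"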

@jds485 (Member) commented Feb 7, 2022

I instead added special handling to define those item names

Sounds good. I didn't realize that you added the items in that column manually!

I finished looking over the code and it looks good to me. Thanks for implementing the fetch-unzip-clip approach that we talked about last week.

@lekoenig (Collaborator Author) commented Feb 8, 2022

Should we retain all of the years available?

Cutting off at 1960 sounds good to me to capture ~20 year historical information from our start date. That applies to all datasets with multiple years available.

Thanks, @jds485. As a recap, here's what I've implemented for datasets with multiple years based on discussion in this PR:

  • I will exclude two available years (1940, 1950) from the HDENS dataset so that our gathered data starts at 1960 (and goes to 2010).
  • The monthly PPT data are available from 1946-2014, but I'll only pull 1971-2000 to be consistent with our other met drivers.
  • For the NID data, I think I'll actually exclude the data tagged "1960" since those data would reflect the decade spanning the 1950s (see below). By starting our gathered data at 1970, we'll capture the dam dataset starting with the period spanning 1961-1970.

Number of dams built on or before YYYY per NHDPlus version 2 catchment, where YYYY is the last year for the decade of record (for example 1960 spans 1951 - 1960).

@lekoenig (Collaborator Author) commented Feb 8, 2022

Duplicate variables: Thanks for finding these! I just looked at the SB metadata to see if they differ, and I think these are exact duplicates with different names. Here are my thoughts on which names to retain:

ET and AET: AET to be specific that it's actual instead of potential.
RUN and RUN7100: RUN7100 to be specific to years used.
RH and RH6190: RH6190, likewise to be specific to the years used.

We may have to retain ET (over AET) and RUN (over RUN7100). If I download the linked data associated with AET and RUN7100, there are no column names that match those descriptors, and so they must just be identical to ET and RUN?

@jds485 (Member) commented Feb 8, 2022

We may have to retain ET (over AET) and RUN (over RUN7100)

Okay, sounds good!

@msleckman (Collaborator) left a comment

Comments:

In addition to my inline comments, I read in all the downloaded files to skim how they each look. All are correctly read in (I did not get a chance to do further df checks such as NA, min, max - let me know if that is desired). Downloaded tables have the correct structure organized by COMID. I suggest testing them with the xwalk; they seem ready for that. Also, most have readable column headers that match the file name (ex: ACC_PT for PT.csv), so they will be easy to keep track of as we work through them either in bulk or individually.

Some files are more complex (not following the ACC*, TOT*, CAT* structure) and therefore take longer to grasp without pulling out the original metadata file in sb (keeping an eye on this one: "1_fetch/out/NDAMS.csv").

Suggestion: grouping certain downloads. Could we create a subfolder for these specific downloads?

  • folder name: STATSGO for the following 3 csvs:
"1_fetch/out/STATSGO_HYDGRP.csv"
"1_fetch/out/STATSGO_TEXT.csv"
"1_fetch/out/STATSGO_LAYER.csv"

  • folder name: PPT for the following 3 csvs:
"1_fetch/out/PPT_ACC.csv"
"1_fetch/out/PPT_TOT.csv"
"1_fetch/out/PPT_CAT.csv"

If easier, these could also be combined into 1 df.

1_fetch.R Outdated
p1_vars_of_interest_downloaded_csvs,
p1_vars_of_interest %>%
split(.,.$sb_id) %>%
lapply(.,fetch_nhdv2_attributes_from_sb,save_dir = "1_fetch/out",comids=p1_nhdv2reaches_sf$COMID,delete_local_copies=TRUE) %>%
@msleckman (Collaborator) commented Feb 9, 2022

Oh, I always default to using function(x) in apply() because I think that it doesn't like it when there are multiple arguments in the given function. Nice to see it works!

1_fetch.R Outdated
Comment on lines 240 to 244
p1_vars_of_interest %>%
split(.,.$sb_id) %>%
lapply(.,fetch_nhdv2_attributes_from_sb,save_dir = "1_fetch/out",comids=p1_nhdv2reaches_sf$COMID,delete_local_copies=TRUE) %>%
do.call('c',.),
format = "file"
Collaborator:

Can you describe what is happening here with a couple of comments? In particular, why are you using split() (see comment suggestion above)? And what does do.call('c', .) do?

Collaborator Author:

Good call, thanks! I've added comments to briefly explain what each line is doing within the p1_vars_of_interest_downloaded_csvs target. In particular, fetch_nhdv2_attributes_from_sb returns the file output path for each unique sb_id, so do.call('c',.) concatenates those strings into one vector of class chr.
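For illustration (with made-up names), the flattening step behaves like this:

# lapply() returns a named list of output file paths, one element per sb_id;
# do.call('c', .) collapses that list into a single character vector
paths_list <- list(sb_id_A = "1_fetch/out/A.csv",
                   sb_id_B = "1_fetch/out/B.csv")
do.call('c', paths_list)
# a named character vector of length 2, suitable for a format = "file" target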

1_fetch.R Outdated (resolved)
@lekoenig (Collaborator Author):

Could we create a subfolder for these specific downloads?

foldername: STATSGO for the following 3 csvs:
"1_fetch/out/STATSGO_HYDGRP.csv"
"1_fetch/out/STATSGO_TEXT.csv"
"1_fetch/out/STATSGO_LAYER.csv”

folder name: PPT for the following 3 csvs:
"1_fetch/out/PPT_ACC.csv"
"1_fetch/out/PPT_TOT.csv"
"1_fetch/out/PPT_CAT.csv”

I agree that because these are thematically similar, it might be more satisfying if they were combined somehow, either in a separate folder or a single data frame. For the STATSGO tables, these csvs reflect the format of the input datasets that are contained on ScienceBase, and they at least follow the expected structure (columns for "ACC", "CAT", and "TOT"). The PPT files on ScienceBase are a little different because the ACC/CAT/TOT files actually represent separate sb_id's altogether. I initially thought about combining those PPT tables but opted not to because it would make the data download code more complex and thus harder to read (i.e., by adding special handling code), and because I wasn't sure it'd be necessary. In the data processing steps, I envision that we'll subset any of these tables to only include the columns containing "CAT" in order to apply various functions (e.g. sum, area-weighted mean). So the PPT_TOT and PPT_ACC csvs just wouldn't get passed through that data processing code. Would you agree with my characterization here (if it makes sense, that is!)?
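As a rough sketch of that subsetting step (the file and column selection below are just examples, not code from this PR):

library(dplyr)
library(readr)

# Keep the COMID key plus only the catchment-scale ("CAT") columns before
# applying summary functions (e.g. sum, area-weighted mean); ACC/TOT columns drop out
ppt_cat <- read_csv("1_fetch/out/PPT_CAT.csv", show_col_types = FALSE) %>%
  select(COMID, contains("CAT"))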

So I think I'm still leaning toward leaving the downloaded files as is to keep the code as simple as possible and to keep fetch tasks separate from other processing tasks we might need/want. Do you think that makes sense? Perhaps we can revisit in the data processing step?

@lekoenig (Collaborator Author):

Thanks for your review, @msleckman, and thank you for skimming all those downloaded files! If, after skimming the downloaded files, you have notes for processing steps we'll need to add or at least consider, it'd be useful to add those to issue #80. I think I've addressed all of your comments for the fetch step here, so let me know if we should add anything else or whether we can merge.

@lekoenig linked an issue (Gather NHDv2 segment/catchment attributes) Feb 14, 2022 that may be closed by this pull request.
@lekoenig self-assigned this Feb 14, 2022.
@lekoenig (Collaborator Author):

Based on Margaux's review (and looking forward to processing the downloaded Wieczorek datasets), I realized that it would make the code cleaner if I used dynamic branching instead of a multi-step lapply process to create the p1_vars_of_interest_downloaded_csvs target.

So, fyi @msleckman, @jds485 I went ahead and made that change here. @msleckman, would you mind checking that this still builds locally for you?

@msleckman (Collaborator) left a comment

@lekoenig I've run this locally with your changes and had no issues 😄 . I like the dynamic branching approach a lot. Nice job incorporating that targets pattern!
I did feel like it took longer to download the files than last time (>40 min), although I did not time it precisely.

One idea/suggestion - is it possible to put a Sys.time() wrapper around the p1_vars_of_interest_downloaded_csvs target to print the time it takes for all the vars-of-interest files to download? (I have no idea if that is feasible to do within 1_fetch.R)

@jds485 (Member) commented Feb 16, 2022

Looks good!

is it possible to put a Sys.time() wrapper

targets tracks the time it takes to build every target (and every branch) in its metadata file. You can access this information with tar_meta().
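For example, something along these lines should work after a tar_make() (the target name below is from this pipeline; the exact fields you pull are up to you):

library(targets)

# Build time, in seconds, recorded for every target and dynamic branch
tar_meta(fields = seconds)

# Or narrow it to the download target and its branches
tar_meta(names = tidyselect::starts_with("p1_vars_of_interest_downloaded_csvs"),
         fields = seconds)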


# Parse name(s) of unzipped files
file_name <- basename(out_file)
file_name_sans_ext <- str_split(file_name,".[[:alnum:]]+$")[[1]][1]
Member:

[randomly skimming github stuff and this caught my eye] - do you have a reason not to use the built-in tools::file_path_sans_ext() here? This str_split approach works for 1 file but not multiple files, so it would need to be modified in some way if you're serious about the plural option implied by name(s) 3 lines above.

Collaborator Author:

Thanks for that suggestion, Alison! In short, I don't have a good reason not to use the built-in function you reference. My brain got going on regex, I guess 😄. We map over this function for individual files, so I don't expect that the str_split approach would cause issues, but it's good to simplify the code (and to address the plural option implied in the comment above). I've edited this line to use tools::file_path_sans_ext() instead.
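For reference, the built-in helper is also vectorized, so it would cover the multi-file case implied by "name(s)" (the file names below are just examples):

# tools is part of base R, so this adds no new dependency
tools::file_path_sans_ext(c("STATSGO_TEXT.zip", "PPT_CAT.csv"))
#> [1] "STATSGO_TEXT" "PPT_CAT"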

@lekoenig (Collaborator Author):

I did feel like it took longer to download the files than last time (>40 min), although I did not time it precisely.

targets's first instinct is to map over all rows of a data frame. When I first implemented the branching approach for p1_vars_of_interest_downloaded_csvs, I noticed that some sets of files were being downloaded multiple times (e.g. the PPT_CAT) because that sb_id is repeated 12 times in our VarsOfInterest table to reflect the fact that the dataset includes columns for 12 different months. To instead map over row subsets delineated by unique sb_id's, I added the following lines to the upstream target p1_vars_of_interest:

... %>%
group_by(sb_id) %>%
tar_group()

This worked for me to ensure that PPT_CAT, PPT_ACC, PPT_TOT, and others were only being downloaded and unpacked once. @msleckman do you know if p1_vars_of_interest got rebuilt when you ran the pipeline with dynamic branching? 40 min seems long, but I was probably working on other things in the meantime and may not have noticed if it took longer than before.
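Put together, the relevant targets look roughly like the sketch below. This is simplified, not verbatim from 1_fetch.R: p1_vars_of_interest_csv is an assumed upstream target holding the VarsOfInterest table, while p1_nhdv2reaches_sf and the fetch function arguments come from the code shown earlier in this PR.

library(targets)
library(dplyr)

list(
  # Group the variables-of-interest table by ScienceBase item so each dynamic
  # branch receives all rows for one sb_id (e.g., the 12 monthly PPT rows map
  # to a single branch instead of triggering 12 separate downloads)
  tar_target(
    p1_vars_of_interest,
    p1_vars_of_interest_csv %>%
      group_by(sb_id) %>%
      tar_group(),
    iteration = "group"
  ),
  # Download, filter to our COMIDs, and write one csv per sb_id group
  tar_target(
    p1_vars_of_interest_downloaded_csvs,
    fetch_nhdv2_attributes_from_sb(p1_vars_of_interest,
                                   save_dir = "1_fetch/out",
                                   comids = p1_nhdv2reaches_sf$COMID,
                                   delete_local_copies = TRUE),
    pattern = map(p1_vars_of_interest),
    format = "file"
  )
)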

@lekoenig merged commit 1714e46 into USGS-R:main Feb 17, 2022
@lekoenig deleted the fetch-nhdv2-attributes branch February 17, 2022 21:21