This repository has been archived by the owner on Jun 30, 2023. It is now read-only.

Fetch nhdv2 attributes from ScienceBase #82

Merged
lekoenig merged 9 commits into USGS-R:main from fetch-nhdv2-attributes on Feb 17, 2022

Conversation

@lekoenig (Collaborator) commented Feb 7, 2022

Addresses #16 and #77.

The code here adds a target to download all of the data associated with our VarsOfInterest table. The function to download the data also filters the output to only include COMIDs specified by the user, and saves a csv of the filtered output. The function also contains an argument to choose between retaining or deleting the original downloaded zipped and unzipped files.
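Roughly, the interface looks like the minimal sketch below. This is illustrative only, not the implementation in this PR: the vars_item argument name and the function body are assumptions (the unzip/read step is simplified to a single file per ScienceBase item), while save_dir, comids, and delete_local_copies match the call shown later in 1_fetch.R. It assumes the sbtools, readr, and dplyr packages.

library(sbtools)
library(readr)
library(dplyr)

fetch_nhdv2_attributes_from_sb <- function(vars_item, save_dir, comids,
                                           delete_local_copies = TRUE) {
  sb_id <- unique(vars_item$sb_id)

  # Download the zipped attribute tables attached to the ScienceBase item
  zip_files <- item_file_download(sb_id, dest_dir = save_dir, overwrite_file = TRUE)

  # Unzip, read in, and filter the attribute table to the requested COMIDs
  unzipped_files <- unlist(lapply(zip_files, unzip, exdir = save_dir))
  dat_filtered <- read_csv(unzipped_files[1], show_col_types = FALSE) %>%
    filter(COMID %in% comids)

  # Save the filtered csv and (optionally) delete the raw downloads
  out_file <- file.path(save_dir, paste0(sb_id, ".csv"))
  write_csv(dat_filtered, out_file)
  if (delete_local_copies) file.remove(c(zip_files, unzipped_files))

  out_file
}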

I'm tagging @msleckman for a code review and @jds485 for a more cursory review. A few questions I have on this PR:

  • We could add the NLCD years beyond 2011 to VarsOfInterest and download them in a similar manner. Would this be helpful for addressing Fetch NHD attributes : NLCD Land Cover 2001-2019 #56? I know we currently download NLCD data separately (in fact, 'Land Cover' variables have been filtered out of the target for p1_vars_of_interest), so I'm happy to leave that alone and fetch these data in a similar manner to NLCD2011. @msleckman, I'm interested in your thoughts on that since I'm not too familiar with the existing NLCD download functions.

  • There are some variables in our VarsOfInterest table that seem like they could be duplicates (i.e., slightly different GIS layer name but same sb id). I haven't looked at the data sources too closely, so I'd appreciate if anyone else can clarify whether these are distinct variables. From my notes:

There are variables labeled "ET" and "AET" (rows 78 and 62 in the VarsOfInterest csv) that point to the same sb link. Are these the same variable that got duplicated, or were we expecting different data for "ET" and "AET"? Similar question regarding variables RUN and RUN7100 (rows 115 and 112) and RH and RH6190 (rows 61 and 77).

  • In the vars of interest file, I added a column to indicate the sb items we are interested in so as to exclude bulky rasters that are included in some of these datasets. I used the entry in that column to limit the National Inventory of Dams (NID) dataset downloads to start at 1960, even though there are also sb items available for 1930, 1940, and 1950. It seemed like we were cutting off other metrics at 1960 (e.g., FORE-SCE?), so I figured I'd be consistent here. What do you think? Should we retain all of the years available?

@jds485 (Member) commented Feb 7, 2022

Duplicate variables: Thanks for finding these! I just looked at the SB metadata to see if they differ, and I think these are exact duplicates with different names. Here are my thoughts on which names to retain:

  • ET and AET: AET to be specific that it's actual instead of potential.
  • RUN and RUN7100: RUN7100 to be specific to years used.
  • RH and RH6190: RH6190, likewise to be specific to the years used.

@jds485 (Member) commented Feb 7, 2022

Should we retain all of the years available?

Cutting off at 1960 sounds good to me to capture ~20 year historical information from our start date. That applies to all datasets with multiple years available.

@jds485 (Member) commented Feb 7, 2022

I looked at the column you added in the VarsOfInterest table. The only variables that do not have zipped files listed are the mean monthly precip variables that I added, so that makes me think I may have specified column names incorrectly. I thought I used the names listed in their metadata file. Let me know if you need help getting the data for those variables

@lekoenig (Collaborator Author) commented Feb 7, 2022

I looked at the column you added in the VarsOfInterest table. The only variables that do not have zipped files listed are the mean monthly precip variables that I added, so that makes me think I may have specified column names incorrectly. I thought I used the names listed in their metadata file. Let me know if you need help getting the data for those variables

@jds485, thanks for that offer! There are missing names in that column for the precip variables because I was lazy and didn't want to copy all the folder names (there are many!) 😄 I instead added special handling to define those item names in lines 23-37 of 1_fetch/src/fetch_nhdv2_attributes_from_sb.R.


# Parse name(s) of unzipped files
file_name <- basename(out_file)
file_name_sans_ext <- substr(file_name,1,nchar(file_name)-4)
Member:

Could get the number of characters in the extension from a strsplit that takes the last element. But I do expect that data downloads would have 4-character extensions.

Collaborator Author:

Thanks for this suggestion - like you said, it works for downloading our nhd attributes, but it'd be nice to make that file-extension bit robust to various file types. I had originally thought to use str_split with pattern = "." and just take everything before the period to define the file name. But then I wasn't sure how often "." would be used within the file name itself (not just to demarcate the extension), so I went with this approach.

Member:

I was thinking something like:
ext_nchar <- str_split(file_name, pattern = fixed('.'))[[1]] %>% nchar() %>% last()  # fixed('.') so the dot is treated literally rather than as a regex wildcard

Collaborator Author:

Oh, good idea to use last() like that! I also implemented this right after posting my first comment:

str_split(file_name,".[[:alnum:]]+$")[[1]][1]

It should split the string on the final ".ext" (regardless of the length of the file extension) and then take the first part.
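For example (the file name here is just an illustration):

library(stringr)

# ".[[:alnum:]]+$" matches the trailing ".ext" chunk, so the first piece returned
# by the split is the file name without its extension
str_split("example_file.zip", ".[[:alnum:]]+$")[[1]][1]
#> [1] "example_file"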

@jds485 (Member) commented Feb 7, 2022

I instead added special handling to define those item names

Sounds good. I didn't realize that you added the items in that column manually!

I finished looking over the code and it looks good to me. Thanks for implementing the fetch-unzip-clip approach that we talked about last week.

@lekoenig (Collaborator Author) commented Feb 8, 2022

Should we retain all of the years available?

Cutting off at 1960 sounds good to me to capture ~20 year historical information from our start date. That applies to all datasets with multiple years available.

Thanks, @jds485. As a recap, here's what I've implemented for datasets with multiple years based on discussion in this PR:

  • I will exclude two available years (1940, 1950) from the HDENS dataset so that our gathered data starts at 1960 (and goes to 2010).
  • The monthly PPT data are available from 1946-2014, but I'll only pull 1971-2000 to be consistent with our other met drivers.
  • For the NID data, I think I'll actually exclude the data tagged "1960" since those data would reflect the decade spanning the 1950s (see below). By starting our gathered data at 1970, we'll capture the dam dataset starting with the period spanning 1961-1970.

Number of dams built on or before YYYY per NHDPlus version 2 catchment, where YYYY is the last year for the decade of record (for example 1960 spans 1951 - 1960).

@lekoenig (Collaborator Author) commented Feb 8, 2022

Duplicate variables: Thanks for finding these! I just looked at the SB metadata to see if they differ, and I think these are exact duplicates with different names. Here are my thoughts on which names to retain:

ET and AET: AET to be specific that it's actual instead of potential.
RUN and RUN7100: RUN7100 to be specific to years used.
RH and RH6190: RH6190, likewise to be specific to the years used.

We may have to retain ET (over AET) and RUN (over RUN7100). If I download the linked data associated with AET and RUN7100, there are no column names that match those descriptors, and so they must just be identical to ET and RUN?

@jds485 (Member) commented Feb 8, 2022

We may have to retain ET (over AET) and RUN (over RUN7100)

Okay, sounds good!

@msleckman (Collaborator) left a comment

Comments:

In addition to my inline comments, I read in all the downloaded files to skim how they each look. All are correctly read in (I did not get a chance to do further df checks such as NA, min, max - let me know if that is desired). Downloaded tables have the correct structure organized by COMID. I suggest testing them with the xwalk; they seem ready for that. Also, most have readable column headers that match the file name (ex: ACC_PT for PT.csv), so they will be easy to keep track of as we work through them either in bulk or individually.

Some files are more complex (not following the ACC*, TOT*, CAT* structure) and therefore take longer to grasp without pulling out the original metadata file in sb (keeping an eye on this one: "1_fetch/out/NDAMS.csv").

Suggestion: grouping certain downloads. Could we create a subfolder for these specific downloads?

  • folder name: STATSGO for the following 3 csvs:
"1_fetch/out/STATSGO_HYDGRP.csv"
"1_fetch/out/STATSGO_TEXT.csv"
"1_fetch/out/STATSGO_LAYER.csv"

  • folder name: PPT for the following 3 csvs:
"1_fetch/out/PPT_ACC.csv"
"1_fetch/out/PPT_TOT.csv"
"1_fetch/out/PPT_CAT.csv"

If easier, these could also be combined into 1 df.

1_fetch.R Outdated
p1_vars_of_interest_downloaded_csvs,
p1_vars_of_interest %>%
split(.,.$sb_id) %>%
lapply(.,fetch_nhdv2_attributes_from_sb,save_dir = "1_fetch/out",comids=p1_nhdv2reaches_sf$COMID,delete_local_copies=TRUE) %>%
@msleckman (Collaborator) commented Feb 9, 2022

Oh, I always default to using function(x) in apply() because I think that it doesn't like it when there are multiple arguments in the given function. Nice to see it works!

1_fetch.R Outdated
Comment on lines 240 to 244
p1_vars_of_interest %>%
split(.,.$sb_id) %>%
lapply(.,fetch_nhdv2_attributes_from_sb,save_dir = "1_fetch/out",comids=p1_nhdv2reaches_sf$COMID,delete_local_copies=TRUE) %>%
do.call('c',.),
format = "file"
Collaborator:

Can you describe what is happening here with a couple of comments? In particular, why are you using split() (see comment suggestion above)? And what does do.call('c', .) do?

Collaborator Author:

Good call, thanks! I've added comments to briefly explain what each line is doing within the p1_vars_of_interest_downloaded_csvs target. In particular, fetch_nhdv2_attributes_from_sb returns the file output path for each unique sb_id, so do.call('c',.) concatenates those strings into one vector of class chr.
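For illustration (with made-up names), the flattening step behaves like this:

# lapply() returns a named list of output file paths, one element per sb_id;
# do.call('c', .) collapses that list into a single character vector
paths_list <- list(sb_id_A = "1_fetch/out/A.csv",
                   sb_id_B = "1_fetch/out/B.csv")
do.call('c', paths_list)
# a named character vector of length 2, suitable for a format = "file" target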

1_fetch.R Outdated (resolved)
@lekoenig (Collaborator Author):

Could we create a subfolder for these specific downloads?

foldername: STATSGO for the following 3 csvs:
"1_fetch/out/STATSGO_HYDGRP.csv"
"1_fetch/out/STATSGO_TEXT.csv"
"1_fetch/out/STATSGO_LAYER.csv”

folder name: PPT for the following 3 csvs:
"1_fetch/out/PPT_ACC.csv"
"1_fetch/out/PPT_TOT.csv"
"1_fetch/out/PPT_CAT.csv”

I agree that because these are thematically similar, it might be more satisfying if they were combined somehow, either in a separate folder or a single data frame. For the STATSGO tables, these csvs reflect the format of the input datasets that are contained on ScienceBase, and they at least follow the expected structure (columns for "ACC", "CAT", and "TOT"). The PPT files on ScienceBase are a little different because the ACC/CAT/TOT files actually represent separate sb_id's altogether. I initially thought about combining those PPT tables but opted not to because it would make the data download code more complex and thus harder to read (i.e., by adding special handling code), and because I wasn't sure it'd be necessary. In the data processing steps, I envision that we'll subset any of these tables to only include the columns containing "CAT" in order to apply various functions (e.g. sum, area-weighted mean). So the PPT_TOT and PPT_ACC csvs just wouldn't get passed through that data processing code. Would you agree with my characterization here (if it makes sense, that is!)?
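As a rough sketch of that subsetting step (the file and column selection below are just examples, not code from this PR):

library(dplyr)
library(readr)

# Keep the COMID key plus only the catchment-scale ("CAT") columns before
# applying summary functions (e.g. sum, area-weighted mean); ACC/TOT columns drop out
ppt_cat <- read_csv("1_fetch/out/PPT_CAT.csv", show_col_types = FALSE) %>%
  select(COMID, contains("CAT"))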

So I think I'm still leaning toward leaving the downloaded files as is to keep the code as simple as possible and to keep fetch tasks separate from other processing tasks we might need/want. Do you think that makes sense? Perhaps we can revisit in the data processing step?

@lekoenig (Collaborator Author):

Thanks for your review, @msleckman, and thank you for skimming all those downloaded files! If, after skimming the downloaded files, you have notes for processing steps we'll need to add or at least consider, it'd be useful to add those to issue #80. I think I've addressed all of your comments for the fetch step here, so let me know if we should add anything else or whether we can merge.

@lekoenig linked an issue (Gather NHDv2 segment/catchment attributes) Feb 14, 2022 that may be closed by this pull request.
@lekoenig self-assigned this Feb 14, 2022.
@lekoenig (Collaborator Author):

Based on Margaux's review (and looking forward to processing the downloaded Wieczorek datasets), I realized that it would make the code cleaner if I used dynamic branching instead of a multi-step lapply process to create the p1_vars_of_interest_downloaded_csvs target.

So, fyi @msleckman, @jds485 I went ahead and made that change here. @msleckman, would you mind checking that this still builds locally for you?

@msleckman (Collaborator) left a comment

@lekoenig I've run this locally with your changes and had no issues 😄 . I like the dynamic branching approach a lot. Nice job incorporating that targets pattern!
I did feel like it took longer to download the files than last time (>40 min), although I did not time it precisely.

One idea/suggestion - is it possible to put a Sys.time() wrapper around the p1_vars_of_interest_downloaded_csvs target to print the time it takes for all the vars-of-interest files to download? (I have no idea if that is feasible to do within 1_fetch.R)

@jds485 (Member) commented Feb 16, 2022

Looks good!

is it possible to put a Sys.time() wrapper

targets tracks the time it takes to build every target (and every branch) in its metadata file. You can access this information with tar_meta().
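For example, something along these lines should work after a tar_make() (the target name below is from this pipeline; the exact fields you pull are up to you):

library(targets)

# Build time, in seconds, recorded for every target and dynamic branch
tar_meta(fields = seconds)

# Or narrow it to the download target and its branches
tar_meta(names = tidyselect::starts_with("p1_vars_of_interest_downloaded_csvs"),
         fields = seconds)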


# Parse name(s) of unzipped files
file_name <- basename(out_file)
file_name_sans_ext <- str_split(file_name,".[[:alnum:]]+$")[[1]][1]
Member:

[randomly skimming github stuff and this caught my eye] - do you have a reason not to use the built-in tools::file_path_sans_ext() here? This str_split approach works for 1 file but not multiple files, so it would need to be modified in some way if you're serious about the plural option implied by name(s) 3 lines above.

Collaborator Author:

Thanks for that suggestion, Alison! In short, I don't have a good reason not to use the built-in function you reference. My brain got going on regex, I guess 😄. We map over this function for individual files, so I don't expect that the str_split approach would cause issues, but it's good to simplify the code (and to address the plural option implied in the comment above). I've edited this line to use tools::file_path_sans_ext() instead.
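For reference, the built-in helper is also vectorized, so it would cover the multi-file case implied by "name(s)" (the file names below are just examples):

# tools is part of base R, so this adds no new dependency
tools::file_path_sans_ext(c("STATSGO_TEXT.zip", "PPT_CAT.csv"))
#> [1] "STATSGO_TEXT" "PPT_CAT"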

@lekoenig (Collaborator Author):

I did feel like it took longer to download the files than last time (>40 min), although I did not time it precisely.

targets's first instinct is to map over all rows of a data frame. When I first implemented the branching approach for p1_vars_of_interest_downloaded_csvs, I noticed that some sets of files were being downloaded multiple times (e.g. the PPT_CAT) because that sb_id is repeated 12 times in our VarsOfInterest table to reflect the fact that the dataset includes columns for 12 different months. To instead map over row subsets delineated by unique sb_id's, I added the following lines to the upstream target p1_vars_of_interest:

... %>%
group_by(sb_id) %>%
tar_group()

This worked for me to ensure that PPT_CAT, PPT_ACC, PPT_TOT, and others were only being downloaded and unpacked once. @msleckman do you know if p1_vars_of_interest got rebuilt when you ran the pipeline with dynamic branching? 40 min seems long, but I was probably working on other things in the meantime and may not have noticed if it took longer than before.
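Put together, the relevant targets look roughly like the sketch below. This is simplified, not verbatim from 1_fetch.R: p1_vars_of_interest_csv is an assumed upstream target holding the VarsOfInterest table, while p1_nhdv2reaches_sf and the fetch function arguments come from the code shown earlier in this PR.

library(targets)
library(dplyr)

list(
  # Group the variables-of-interest table by ScienceBase item so each dynamic
  # branch receives all rows for one sb_id (e.g., the 12 monthly PPT rows map
  # to a single branch instead of triggering 12 separate downloads)
  tar_target(
    p1_vars_of_interest,
    p1_vars_of_interest_csv %>%
      group_by(sb_id) %>%
      tar_group(),
    iteration = "group"
  ),
  # Download, filter to our COMIDs, and write one csv per sb_id group
  tar_target(
    p1_vars_of_interest_downloaded_csvs,
    fetch_nhdv2_attributes_from_sb(p1_vars_of_interest,
                                   save_dir = "1_fetch/out",
                                   comids = p1_nhdv2reaches_sf$COMID,
                                   delete_local_copies = TRUE),
    pattern = map(p1_vars_of_interest),
    format = "file"
  )
)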

@lekoenig merged commit 1714e46 into USGS-R:main Feb 17, 2022
@lekoenig deleted the fetch-nhdv2-attributes branch February 17, 2022 21:21