Conversation
river_dl/preproc_utils.py
:return: [numpy array] batched data with dims [nbatches, nseg, seq_len
    (batch_size), nfeat]
"""
if offset>1:
    period = int(offset)
else:
    period = int(offset*seq_len)
num_batches = data_array.shape[1]//period

ndays = data_array.shape[1]
Would it be worth calling this nsteps instead of ndays? Or maybe the convention of daily timesteps is hardcoded elsewhere such that it doesn't matter?
Yes, that's a good catch. I was trying to be generic with this code, but missed a few spots. I think I hard-coded 'D' for daily timesteps somewhere, and I can make that user-specified.
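For illustration, a rough sketch of what a user-specified timestep could look like (the fill_dates helper and its freq argument are assumptions for this sketch, not the actual river_dl code):

import numpy as np
import pandas as pd

def fill_dates(last_date, n_fill, freq="D"):
    """Generate n_fill timesteps that continue past last_date.

    freq is a pandas offset alias ('D' for daily, 'H' for hourly, ...);
    passing it as an argument avoids hard-coding daily timesteps.
    """
    # start one step after last_date and continue for n_fill steps
    new_dates = pd.date_range(start=last_date, periods=n_fill + 1, freq=freq)[1:]
    return np.array(new_dates)

# example: pad 3 daily steps after 2015-12-31 -> 2016-01-01 ... 2016-01-03
print(fill_dates(np.datetime64("2015-12-31"), 3))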
river_dl/preproc_utils.py
:param fill_time: [bool] When True, filled in data are time indices that
    follow in sequence from the previous timesteps. When False, filled in data
    are replicates of the previous timesteps.
    When False, filled in data are replicates of the previous timesteps.
looks like this line is accidentally duplicated? ("When False . . . previous timesteps.")
Yes, thanks! I'll fix that
    fill_dates_array),
    axis = 1)
else:
    data_array = np.concatenate((data_array,
How important is it to maintain the temporal structure of the original dataset in the padded values? (It seems important, but I can also convince myself that it isn't.) With this approach to padding the data array, the temporal structure isn't continuous. For example, if the data_array has 545 steps and a period length of 365 (so 1 1/2 years of data with 1 yr batches), the number of repeated steps is 185 and the repeated steps would be 360 - 545. If the original dataset starts in Jan, then it would go through the following June and we'd need to pad for July - Dec, but the repeated steps are Jan - June, so the last batch would have winter-spring-winter-spring (instead of winter-spring-summer-fall). With annual periods, you could pull the needed data from the last timesteps of the prior batch, but that breaks down if the sequence length is other than annual. Thoughts?
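To make the arithmetic in that example concrete, a small sketch (variable names are illustrative only):

import numpy as np

nseg, nsteps, nfeat = 4, 545, 3             # 1.5 years of daily data
period = 365                                # 1-year batches
num_rep_steps = period - (nsteps % period)  # 365 - 180 = 185 steps to pad

data_array = np.random.rand(nseg, nsteps, nfeat)
# replicate the last num_rep_steps timesteps to fill out the final batch
data_array = np.concatenate(
    (data_array, data_array[:, -num_rep_steps:, :]), axis=1
)
print(data_array.shape)  # (4, 730, 3): two full 365-step batches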
How important is it to maintain the temporal structure of the original dataset in the padded values?
If I'm thinking about this correctly, the padding only affects the last batch. All of the observations for the filled in timesteps in that batch are set to np.nan, so I think it does not matter which data are used to pad the timeseries.
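One way to spot-check that assumption (a sketch; the function and argument names are made up for illustration):

import numpy as np

def padded_obs_are_nan(y_obs, num_rep_steps):
    """Return True if the observations for the padded timesteps at the end
    of the array are all NaN, so the x data used for padding cannot
    contribute to the loss."""
    return bool(np.isnan(y_obs[:, -num_rep_steps:, :]).all())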
It definitely only affects the last batch (or at least on the run I did). I compared the input data for the test partition with and without padding and they were identical through the full shape of the un-padded array.
And the models shouldn't be using any future data. Is it worth checking that? (doing a run or 2 pulling different data to pad the series? could even test the extremes or something)
And the models shouldn't be using any future data. Is it worth checking that?
I checked the shapes of the observation data in the training, validation and testing partitions for spatial and temporal methods in the repo. The shapes matched the x data array (and ids and times arrays), and the observations were nan for the padded times. Is that the kind of check you were thinking of?
doing a run or 2 pulling different data to pad the series?
Like program a random sample to pad the timeseries? I can try that
The shapes matched the x data array (and ids and times arrays), and the observations were nan for the padded times. Is that the kind of check you were thinking of?
I was thinking more something to test the assumption that it doesn't really matter the x data that's used to for the padding. So from a temperature perspective, trying the padding with warmer air temps or colder air temps and checking to make sure the simulated temps are essentially the same. (and if not, then we need to think more carefully about what we use for the x data for padding)
Okay, let me try with the random sample. I have set the random seeds and verified that results are the same when those are specified. So using a random sample should provide the same results as I've obtained from a previous run
I got identical results using a random sample and filling in by replicating the last timesteps.
Code for random sampling of timesteps:
import random
import numpy as np

# pad by appending a random sample of timesteps instead of the last ones
sample_inds = random.sample(range(nsteps), num_rep_steps)
data_array = np.concatenate((data_array,
                             data_array[:, sample_inds, :]),
                            axis = 1)
Results for replicating timesteps (top), random sampling (middle), and the difference (bottom):
How should we pad discontinuous partitions?
I believe
I'm super rusty on this repo as a project workflow (i.e., beyond its general modeling methods); do I understand the original problem correctly:
If I understand it well enough, I think option 2 sounds better because then you're not potentially connecting mid-winter and mid-summer together; instead you'll just have two
Yes, that's correct.
I think option 1 would also achieve this, after dropping any sequences that only have nan data. For a partition with [data1, gap, data2], we would fill in the gap with observed x data and set y to nan in that gap. The difference would be that the nans in data2 could appear at the start of the first and end of the last sequence for option 1, and only at the end of the last sequence for option 2. So, I think there could be fewer overall nans using option 2. If cell and hidden states are not carried over from batch to batch, that's also a good case for option 2.
Ah, okay. It sounds fairly inconsequential then from my outsider point of view. I think
Thanks for your perspectives, @jdiaz4302! I can implement option 2 for this PR
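As an illustration of option 2, a rough sketch (helper name and details are assumptions, not the PR's actual implementation) that splits a partition at its gaps and pads each continuous chunk separately:

import numpy as np

def split_and_pad(data_array, continuous_start_inds, period):
    """Split [nseg, nsteps, nfeat] data at the gap boundaries and pad each
    continuous chunk to a multiple of the batch period (option 2)."""
    chunks = np.split(data_array, continuous_start_inds, axis=1)
    padded = []
    for chunk in chunks:
        n_rep = (-chunk.shape[1]) % period  # steps needed to reach a full batch
        if n_rep:
            # replicate the chunk's last timesteps (assumes the chunk has at
            # least n_rep steps); observations for these steps are set to
            # NaN downstream so they do not affect training
            chunk = np.concatenate((chunk, chunk[:, -n_rep:, :]), axis=1)
        padded.append(chunk)
    return padded  # one padded array per continuous period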
gap_ind = np.where(time_diff == t)[0]
#there can be multiple gaps of the same length
for i in gap_ind:
    #determine if the gap is longer than the sequence length
    date_before_gap = dataset[time_idx_name][i].values
    next_date = dataset[time_idx_name][i+1].values
    #gap length in the same unit as the timestep
    gap_length = int((next_date - date_before_gap)/timestep_1)
    if gap_length < seq_len:
        #I originally had this as a sys.exit, but did not want
        # to force this condition.
        print("The gap between this partition's continuous time periods is less than the sequence length")

    #get the start date indices. These are used to split the
    # dataset before creating batches
    continuous_start_inds.append(i+1)
This might run counter to our previous discussion re: starting batches after gaps, but would we ever want to fill short gaps and not restart the batch? (e.g., partitioning our data so we test on July and train on Aug through June, with 365-day batches) In that case we might want to fill the gap in July but keep going and not restart the batch. That's not the best example, and maybe since that's not a current use we should just note it as an issue and keep this as you have it?
would we ever want to fill short gaps and not restart the batch?
Maybe. The check for gap_length < seq_len could be used to identify which indices to fill instead of restarting the batch after the gap. seq_len could be changed to a function argument that specifies the maximum gap length that should be filled. I think that approach would also require an edit to the array to fill in the missing dates.
I initially thought the example you gave could be solved by specifying the start dates of training and testing as Aug 1 and July 1, but that will not work after leap years, so filling in small gaps would be preferable.
I don't currently need that functionality, so I'd vote to convert to an issue.
An issue sounds good to me 🙂
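For the eventual issue, a rough sketch of how a maximum fill-gap argument might work (max_fill_gap and the other names are hypothetical; this is only the gap-classification step, not the array edits a full implementation would need):

def classify_gaps(gap_lengths, gap_inds, max_fill_gap):
    """Decide which gaps to fill in place and which should restart batching.

    gap_lengths: gap sizes in timestep units, aligned with gap_inds (the
    index of the last timestep before each gap).
    max_fill_gap: gaps up to this length are filled; longer gaps start a
    new continuous period after the gap (the current behavior).
    """
    fill_inds, restart_inds = [], []
    for gap_len, ind in zip(gap_lengths, gap_inds):
        if gap_len <= max_fill_gap:
            fill_inds.append(ind)          # fill these missing dates
        else:
            restart_inds.append(ind + 1)   # new continuous period starts here
    return fill_inds, restart_inds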
I ran the updates for padding the discontinuous partitions; they ran fine and the padding looks good. When I looked at the predicted temps, however, I saw that there were predictions for the added days. Looking at the updated predict function, I see there's an option to specify the last days, but I hadn't noticed that (and therefore hadn't updated my snakemake file to account for it), and as written it doesn't remove padded days in the middle of the partition (like with discontinuous times). What would you think of addressing both of these issues (the "whoops, I forgot to specify the padding dates in multiple places" and the
That's a great suggestion! I can look into adding that.
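For reference, a minimal sketch of dropping padded timesteps from the predictions using a padded-data indicator (the column names are assumptions about the output format, not necessarily river_dl's):

import pandas as pd

def drop_padded_preds(preds_df, padded_df):
    """Remove predictions made for padded timesteps, including padding in
    the middle of a discontinuous partition.

    preds_df: predictions with 'seg_id_nat' and 'date' columns.
    padded_df: the same id/date columns plus an 'is_padded' flag (1 = padded).
    """
    merged = preds_df.merge(padded_df, on=["seg_id_nat", "date"], how="left")
    return merged[merged["is_padded"] != 1].drop(columns="is_padded")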
@janetrbarclay This is ready to be tested. I tried with several discontinuous partitions and it worked for me. This is a plot of one of the prediction timeseries; the horizontal line is where there are no data. I did this with latest_time = None (default). I decided to leave that as a function argument in case someone wants to use it. And this is the corresponding indicator of padded data (1 = padded). The x-axis is the index in the array, not the date, so the full length of the data gap is not included in this figure.
Looks good! Thanks for sticking with this through many slow reviews!
This PR adds function arguments that allow a user to pad the training, validation, and/or testing datasets so that data are not trimmed. For the inland salinity dataset, this resulted in 5000-20000 more observations used.
I tested this for cases where the train/val/test is a continuous time period. I think if a partition is a discontinuous time period (e.g., training is 2000-01-01 to 2005-01-01 and 2010-01-01 to 2015-01-01), that would require a different coding approach. For example, pad each of the continuous periods within the discontinuous blocks so that batches are defined for each of the continuous periods. That might also address #127. Curious about your thoughts on how best to approach padding for discontinuous time periods.
Closes: #217