Update cache.py #326
Conversation
The netcdf cache function validates the cache by comparing the ds argument and the other function arguments to the pickled arguments. If they match, the cache can be used. Previously, only the coordinates of the ds argument and of the output ds had to match, which introduced two errors:
- If the data_vars differ and are used by the function, the cache is falsely considered valid.
- The coordinates of the ds argument have to match the coordinates of the output ds, which limits the use of the cache function.
This PR compares the hashes of the coords and data_vars of the ds argument to those that were stored in the pickle together with the cached output ds. Ideally, cache.cache_netcdf() accepts arguments that specify exactly which data_vars and coords need to be included in the validation check. Originally considered beyond the scope of this PR, this is now part of it.
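For illustration, a minimal sketch of how such a hash check could work. This is an assumption about the mechanics, not necessarily the code in this PR, and _hash_parts is a hypothetical name:

    import hashlib
    import pickle

    def _hash_parts(ds):
        # Hash coords and data_vars separately, so each can be compared to
        # the hashes stored in the pickle next to the cached output ds.
        def _hash(mapping):
            h = hashlib.sha256()
            for name in sorted(mapping):
                h.update(name.encode())
                h.update(pickle.dumps(mapping[name].values))
            return h.hexdigest()
        return {"coords": _hash(ds.coords), "data_vars": _hash(ds.data_vars)}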
Thanks @bdestombe, the new setup looks a lot cleaner than the previous one. It does change the behavior of the cache function a bit. Because the caching can become complex, I have tried to list the changes and our previous solutions, to make sure we are on the same page before we decide on implementing this.

Changes in behavior of the cache function

I think these are the changes in behavior of the cache function:
Issues in old behavior

The new behavior solves some issues:
An example of this would be:
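A hypothetical reconstruction of such an example (decorator arguments such as the cache directory are omitted for brevity):

    # Under the old check only coordinates were compared, so a changed
    # data variable did not invalidate the cache.
    @cache_netcdf()
    def mean_ahn(ds):
        return ds["ahn"].mean().to_dataset(name="ahn_mean")

    result1 = mean_ahn(ds)       # computed and cached
    ds["ahn"] = ds["ahn"] + 1.0  # same coords, different data_var
    result2 = mean_ahn(ds)       # cache falsely valid: stale result returned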
The notebook on caching gives a description of how we previously tackled this:
Note that "it would simply take too much time to check all the variables in the dataset" is not so relevant anymore, because your method of hashing is pretty fast, I think. However, I do like the idea of 'Northsea' as a separate argument, because it makes it more explicit which data is used in the function. At the same time, you really have to know this when you create a function you want to apply the cache to. I am not even 100% sure that we use this consistently in nlmod.
If I understand this correctly, the use of the cache function is limited when the cached dataset has coordinates that are irrelevant to the function. An example of this limitation would be:
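A hypothetical illustration of that limitation:

    import pandas as pd
    import xarray as xr

    @cache_netcdf()
    def get_ahn(ds, identifier="AHN4_DTM_5m"):
        # pretend this only needs the x and y coordinates of ds
        return xr.Dataset(coords={"x": ds.x, "y": ds.y})

    ahn = get_ahn(ds)  # computed and cached
    # An unrelated coordinate changes the coord hash of the ds argument...
    ds2 = ds.assign_coords(time=pd.date_range("2024-01-01", periods=2))
    ahn = get_ahn(ds2)  # ...so the cache is needlessly invalidated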
This is related to issue #181. Our solution back then, which is not very well documented, is to modify the …

Advantages/disadvantages

I tried to summarize the above by listing the advantages and disadvantages of the new behavior:

New behavior regarding checking dataset coordinates
New behavior regarding checking dataset data_vars
I hope this whole story was somewhat understandable and complete. Please let me know if I made any mistakes or forgot anything. I think that if we both agree on the changes and implications, we can discuss this shortly with @dbrakenhoff and @rubencalje and decide on implementing it.
Hi Onno,

So this approach is a bit pickier, but rather too picky than falsely validating the arguments and using the cache. As I cryptically proposed in the last sentence, we could introduce a keyword argument to cache_netcdf() that allows you to set explicitly which data_vars and coordinates are checked, so that it is picky for the right reasons. It would solve your Northsea case. If this argument is not provided, it might be a bit too picky, but rightfully so, as continuing with a falsely valid cache is really hard to debug. If the provided list of data_vars is not exhaustive, a clear error is raised, as the input ds only contains the data_vars of the provided list.

Take care,
Thus we need to include the following logic somewhere:

    def ds_contains(ds, coords_2d=False, coords_3d=False, datavars=None, coords=None):
        datavars, coords = list(datavars or []), list(coords or [])
        if coords_2d:
            coords += ["x", "y"]
        return xr.Dataset(data_vars={name: ds[name] for name in datavars},
                          coords={name: ds.coords[name] for name in coords},
                          attrs=ds.attrs)

But where do we put that logic? Case 1 or 2?

Case 1

    @cache_netcdf(coords_2d=True, coords_3d=False, datavars=[], coords=[])
    def get_ahn(ds, identifier="AHN4_DTM_5m"):
        ...
        return ahn

In one of the first lines of cache_netcdf(), the ds argument is reduced with ds_contains().

Case 2

Or use ds_contains as a separate decorator:

    @ds_contains(coords_2d=True, coords_3d=False, datavars=[], coords=[])
    @cache_netcdf()
    def get_ahn(ds, identifier="AHN4_DTM_5m"):
        ...
        return ahn

No changes to cache_netcdf() are needed.
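For illustration, a minimal sketch of what Case 1 could look like inside the decorator, assuming a standard functools.wraps wrapper; the caching internals are elided:

    import functools

    def cache_netcdf(coords_2d=False, coords_3d=False, datavars=None, coords=None):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(ds, *args, **kwargs):
                # Reduce ds to the declared subset; only this subset takes
                # part in the hash comparison with the pickled hashes.
                checked = ds_contains(ds, coords_2d=coords_2d,
                                      coords_3d=coords_3d,
                                      datavars=datavars, coords=coords)
                # ... hash `checked`, compare with the stored pickle, then
                # either load the cached netcdf or call func and store ...
                return func(ds, *args, **kwargs)
            return wrapper
        return decorator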
    def ds_contains(ds, coords_2d=False, coords_3d=False, datavars=None, coords=None):
        datavars, coords = list(datavars or []), list(coords or [])
        if coords_2d:
            coords += ["x", "y"]
        if "northsea" in datavars and "northsea" not in ds.data_vars:
            raise ValueError("Northsea not in dataset. Run nlmod.read.rws.add_northsea() first.")
        return xr.Dataset(data_vars={name: ds[name] for name in datavars},
                          coords={name: ds.coords[name] for name in coords},
                          attrs=ds.attrs)
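Hypothetical usage of this version:

    # Succeeds only if ds has x/y coordinates and the northsea data_var;
    # otherwise the error above points to nlmod.read.rws.add_northsea().
    subset = ds_contains(ds, coords_2d=True, datavars=["northsea"])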
Nice idea, I really like your solution. Just to check whether I understand your approach correctly: when you use the cache decorator on a function, you have the option to list specific datavars/coordinates/attributes. If you choose to do so and you call the function, a hash representation of those specific datavars/coordinates/attributes is created and saved together with the cache (as a pickle). When you call the function again, the hash from the previous call is compared to a hash of the current call, and if they differ the cache becomes invalid. Is this correct?

What do we do if no datavars/coordinates/attributes are specified: do we check everything, nothing, or maybe only the coordinates? I think checking everything would make sense, to be more foolproof.

Finally, we have to choose between the two cases you mention. Personally I prefer case 1, because I think we have to add most of the logic to the cache module anyway. Do you have time to add this to the PR? If not, I can also give it a try.
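That description matches a validation flow like the following sketch (the file layout and names are assumptions):

    import pickle

    def cache_is_valid(current_hashes, pickle_path):
        # Compare hashes of the current call with those saved alongside the
        # cached netcdf during the previous call.
        with open(pickle_path, "rb") as f:
            stored_hashes = pickle.load(f)
        return stored_hashes == current_hashes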
Ah indeed, that comes down to the default arguments of ds_contains(). I'll come up with something.
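One possible default, as an assumption rather than the merged behavior: when nothing is specified, validate against the entire dataset:

    def ds_contains(ds, coords_2d=False, coords_3d=False, datavars=None, coords=None):
        if datavars is None and coords is None and not (coords_2d or coords_3d):
            return ds  # nothing specified: check everything, the foolproof option
        ...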
Hi Onno, this seems to be working. Could you have a look at the code?
I like the new approach a lot! I just added one comment on the code. There are also still some tests failing. If it is up to me, the PR can be merged once these are fixed, but maybe @dbrakenhoff and @rubencalje also have an opinion on this?
Thanks for the review
Looks amazing!