New features/ideas for future releases (e.g., CDAT, other libraries) #271
Replies: 7 comments 34 replies
-
Do we need to do something more sophisticated with masking? I think the answer is no (at least for now), but I am opening this stub in case someone thinks we do. CDAT users frequently needed to use
I think that numpy should be able to fill in for MV2 for these operations.
-
I went through Probably Implemented/Available (in some form)
Planned
Consider Implementing
-
I did a pass through of ** Potentially useful **
** Potentially useful but perhaps outside scope **
** Statistics functionality **
-
I hate NaNs, because they will always represent some kind of numerical error for me, and are a kind of crutch for lazy beginners/coders. That's why I think using masks is The clean way to do things! But you have to be careful to explicitly use masks all the time because of numpy/numpy#18675
-
Stephen
I use masks for 4 things. Others may have more
1) Mask out ocean to calculate land quantities
2) Mask out land to calculate ocean quantities
3) Mask out above or below a threshold to calculate extremes.
4) Mask out missing data in observational products
Masks are a critical part of my analysis. Without it, I am stuck.
Michael
… On Dec 8, 2022, at 9:18 AM, Stephen Po-Chedley ***@***.***> wrote:
I think this is a continuation of this thread <#271 (comment)>.
I always found masks in CDAT dangerous, because I never fully trusted that the separate mask array was actually being accounted for. If you have a NaN or 1E20 in your array, you'll figure out pretty quickly whether you've accounted for masking (or not). Also – I don't think xarray really generates a masked-array-like object when it opens data. Does it? I would worry that if we generated a masked array when opening datasets with xcdat that we'd have the same situation: when is this mask being taken into account and when is it not? Last, for large datasets there is some memory penalty to carry around an extra array (maybe it is small for a Boolean array)?
With that said, why do people like masks?
I thought I had read that they speed up operations [but right now I am finding examples of masked arrays slowing things down here <https://currents.soest.hawaii.edu/ocn_data_analysis/_static/masked_arrays.html> and here <https://stackoverflow.com/questions/55987642/why-are-numpy-masked-arrays-useful>].
Masked arrays may simplify code in some cases (examples here <https://currents.soest.hawaii.edu/ocn_data_analysis/_static/masked_arrays.html>) [though this probably isn't the case for xcdat/xarray, because the functions were built to handle NaN values]
Maybe you care about different types of masks and a vanilla NaN complicates things for you (e.g., missing versus masked versus land/ocean/region) [though this case could probably be easily handled by explicitly creating masks, e.g., here <https://geohackweek.github.io/nDarrays/09-masking/>?]
What am I missing? It seems like NaNs are safer, faster, and already implemented for the things I care about (and probably do not prevent others from converting to a masked array?).
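A small sketch of the comparison discussed above, using plain numpy. The array values and the 1e20 fill value are made up for illustration; the point is only that an unhandled fill value or NaN shows up immediately in the result, and that NaN-based data can still be converted to a masked array when someone prefers one.

```python
import numpy as np

# Hypothetical data with one missing point stored as the fill value 1e20.
fill = 1e20
raw = np.array([1.0, 2.0, fill, 4.0])

# Forgetting about the fill value gives an obviously wrong answer.
naive_mean = raw.mean()

# Replace the fill value with NaN; a plain mean now propagates NaN,
# which also flags that missing data was not accounted for.
with_nan = np.where(raw == fill, np.nan, raw)
unhandled_mean = with_nan.mean()

# NaN-aware reduction ignores the missing point.
nan_mean = np.nanmean(with_nan)

# Anyone who prefers masked arrays can still build one from the NaN data.
masked = np.ma.masked_invalid(with_nan)
masked_mean = masked.mean()

print(naive_mean, unhandled_mean, nan_mean, masked_mean)
```

The NaN-aware mean and the masked-array mean agree, so the choice between the two is mostly about which representation the surrounding library expects.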
-
Stephen
Thanks. Looks straightforward enough.
Michael
… On Dec 8, 2022, at 10:47 AM, Stephen Po-Chedley ***@***.***> wrote:
@mfwehner <https://github.com/mfwehner> – I think all of this can be done in xarray / xcdat. Here is an example <https://geohackweek.github.io/nDarrays/09-masking/>.
For observational products, missing values will be read in as NaN values (assuming the dataset sets a missing_value attribute).
-
Tom
Maybe I am missing something as I am just starting on xcdat. But can’t we just set missing_value=1.e20 like we did for cdat and be done with it?
The beauty of that missing value is that if things don’t work, the answer makes no sense, alerting you to an error.
Michael
… On Dec 9, 2022, at 10:58 AM, Tom Vo ***@***.***> wrote:
Masks are not just here for cleanly handling missing values, as pointed out above! They are really useful for selecting any type of region (not just land vs ocean),
We all seem to be on the same page in regards to the usefulness of masking. I think the details we are trying to figure out are:
How to represent missing values
How to operate on missing values (e.g., masking)
Should handle cases such as divide by zero
CDAT implements np.ma.masked_array directly. Libraries such as xarray/pandas represent missing values with NaN and provide APIs to operate on them. I think the main difference here is the implementation for similar functionality.
and the nice thing is that you can use np.logical_XXX functions to combine regions and/or select some part of the data based on some fancy tests. And the tests can be based on the data array itself, or on the values of other arrays that are not just land-sea masks (e.g. get the temperature on land where the precipitation has a specific value)
If I understand correctly, xarray supports these types of masking operations using .where().
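As a rough illustration of that kind of masking, here is a minimal sketch that selects temperature on land where precipitation exceeds a threshold. The variable names (`tas`, `pr`, `land_frac`), the grid, and the thresholds are all hypothetical and only meant to show how boolean conditions combine inside `.where()`.

```python
import xarray as xr

# Hypothetical toy fields on a 2x3 grid (illustrative values only).
dims = ("lat", "lon")
tas = xr.DataArray([[280.0, 285.0, 290.0], [275.0, 295.0, 300.0]], dims=dims)
pr = xr.DataArray([[2.0, 2.5, 5.0], [1.0, 0.0, 7.5]], dims=dims)
land_frac = xr.DataArray([[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]], dims=dims)

# Build boolean conditions from different arrays and combine them.
is_land = land_frac > 0.5
is_wet = pr > 1.0

# Points that fail the combined test become NaN.
tas_wet_land = tas.where(is_land & is_wet)

# Reductions skip NaN by default for floating-point data.
n_selected = int(tas_wet_land.count())
mean_tas = float(tas_wet_land.mean())
print(n_selected, mean_tas)
```

The conditions here play the role of the `np.logical_XXX` combinations mentioned above; `|` and `~` work the same way for "or" and "not".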
How Xarray handles missing values and masking operations:
1) Read in datasets that have missing values with xarray
   Xarray replaces all missing values represented by _FillValue with NaN
2) Perform masking operations as needed with .where(). Examples below:
   https://foundations.projectpythia.org/core/xarray/computation-masking.html#using-where-with-multiple-conditions
   https://geohackweek.github.io/nDarrays/09-masking/
3) Write datasets back out
   xarray sets NaN back to _FillValue
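The steps above can be sketched in memory without touching a file. This is an approximation under stated assumptions: `xr.decode_cf` stands in for what happens when a file is opened, the dataset contents are invented, and the final `fillna` call only mimics by hand what xarray's encoding does when writing.

```python
import numpy as np
import xarray as xr

# Hypothetical "on-disk" view: missing points stored as the fill value.
fill = 1e20
raw = xr.Dataset(
    {"tas": ("time", np.array([280.0, fill, 290.0]), {"_FillValue": fill})}
)

# 1) Decoding (done automatically by xr.open_dataset) turns the fill value into NaN
#    and remembers the original fill value in the variable's encoding.
decoded = xr.decode_cf(raw)
tas = decoded["tas"]
n_missing = int(tas.isnull().sum())
stored_fill = tas.encoding["_FillValue"]

# 2) Mask as needed; here values above a made-up threshold are also dropped.
tas_cool = tas.where(tas < 285.0)

# 3) On write, xarray restores the fill value; mimicked here by hand.
restored = tas.fillna(stored_fill)
print(n_missing, stored_fill, restored.values)
```

In normal use steps 1 and 3 are handled by `xr.open_dataset` and `Dataset.to_netcdf`, so only the `.where()` call in step 2 would appear in analysis code.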
I’m aware that what I have written above can probably be done with NaN values, but in my opinion NaN should only be used when there is some kind of numerical error, and not because MATLAB users don’t know how to use their data. Besides, if I find some NaN values instead of missing values in a NetCDF file, I will probably find it very suspicious. It’s some kind of level 0 Quality Control.
I understand what you’re saying here, those are good points.
Xarray replaces missing values represented by _FillValue with NaN purely for internal operations. When you write the NetCDF file back out, the NaN values get set back to _FillValue.
Note, users can represent missing values with something other than NaN if they really want to.
https://docs.xarray.dev/en/stable/generated/xarray.Dataset.fillna.html
https://docs.xarray.dev/en/stable/generated/xarray.DataArray.to_masked_array
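A short sketch of those two options, assuming a small made-up array: `fillna` swaps NaN for a sentinel such as 1e20, and `to_masked_array` returns a numpy masked array with the NaN positions masked.

```python
import numpy as np
import xarray as xr

# Hypothetical data with one missing value.
da = xr.DataArray([1.0, np.nan, 3.0], dims="x")

# Option 1: represent missing values with a sentinel instead of NaN.
filled = da.fillna(1e20)

# Option 2: convert to a numpy masked array; NaN positions become masked.
masked = da.to_masked_array()

print(filled.values, masked.mask, masked.mean())
```

Note that once the sentinel is in place, xarray reductions no longer skip it, so this form is mainly useful when handing data to code that expects a fill value.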
Key Takeaways
My key takeaway is that we should experiment more with Xarray and masking data using the examples and resources to get a better understanding of their approach.
Realistically, we don't have the resources to implement the management of another data structure (masked arrays) within the existing Xarray data structures given the complexities and trade-offs. In my opinion, I think the way Xarray handles missing data and masking should be sufficient in most cases and it might not make sense to implement masked arrays internally.
We can consider positive attributes for using masked arrays and figure out if/how to include functionality that xarray doesn't already cover. If you prefer to use masked arrays, Xarray provides an API for it here <https://docs.xarray.dev/en/stable/generated/xarray.DataArray.to_masked_array>.
-
The PCMDI group is interested in providing feedback and feature suggestions based on their most used features in CDAT.
Describe the solution you'd like
xarray
with relative ease xarray
or are already implemented in xCDAT
GH Discussions now has a Polls feature -> https://github.com/xCDAT/xcdat/discussions/categories/polls
Things to consider: