New features/ideas for future releases (e.g., CDAT, other libraries) #271

tomvothecoder · 2022-05-04T15:48:10Z

tomvothecoder
May 4, 2022
Maintainer

The PCMDI group is interested in providing feedback and feature suggestions based on their most used features in CDAT.

Describe the solution you'd like

Use this post to gather and discuss ideas. Each comment has its own thread.
Once ready, discuss the ideas at our meeting.
Polling
1. Create a survey to gather a list of features that are most used by CDAT and PCMDI group members
2. Aggregate the responses by count
  - Check if the features can already be accomplished in xarray with relative ease
  - Check if some of the features are already implemented or can be extended in xCDAT
3. Create a poll for the top 10 responses and send out to the CDAT/PCMDI community
  - Add a note for the features that can be accomplished in xarray or are already implemented in xCDAT
4. Discuss the results with the xCDAT team
  - Follow the feature request criteria to narrow down list
  - How valuable is the feature request for the broader community?
5. Send out follow up email of the results

GitHub Discussions now has a Polls feature: https://github.com/xCDAT/xcdat/discussions/categories/polls

Things to consider:

Potential for feature bloat and scope-creep! Figure out what is/is not generalizable and maintainable.
In addition to CDAT users, the target audience for xCDAT should also include new/existing xarray users. We should aim to serve the needs of the broader climate community.

pochedls · 2022-08-05T22:59:52Z

pochedls
Aug 5, 2022
Collaborator

Do we need to do something more sophisticated with masking? I think the answer is no (at least for now), but I am opening this stub in case someone thinks we do.

CDAT users frequently needed to use MV2.mask* functionality, because cdms transient variables were represented as masked arrays. In xarray, missing values are filled with np.nan.

MV2.mask* includes: mask_or(), masked_equal(), masked_inside(), masked_not_equal(), masked_values(), masked, masked_greater(), masked_less(), masked_object(), masked_where(), masked_array(), masked_greater_equal(), masked_less_equal(), masked_outside

np.mask* includes: masked_all(), masked_greater(), masked_less(), masked_outside(), masked_where(), masked_all_like(), masked_greater_equal(), masked_less_equal(), masked_print_option, masked_array, masked_inside(), masked_not_equal(), masked_singleton, masked_equal(), masked_invalid(), masked_object(), masked_values()

I think that numpy should be able to fill in for MV2 for these operations.
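As a minimal sketch of how `numpy.ma` could stand in for a few `MV2.mask*` calls (the sample values and the 300.0 threshold are made up for illustration, and this assumes the numpy functions behave closely enough to their MV2 counterparts for your use case):

```python
import numpy as np

# Hypothetical sample data; 1e20 plays the role of a legacy fill value.
data = np.array([271.3, 1e20, 285.0, 301.2, np.nan])

# Mask NaN/inf entries.
valid = np.ma.masked_invalid(data)

# Also mask the fill value (masked_values compares with a floating-point tolerance).
no_fill = np.ma.masked_values(valid, 1e20)

# Mask by a condition, analogous to MV2.masked_greater.
cool = np.ma.masked_greater(no_fill, 300.0)

n_valid = cool.count()    # number of unmasked elements
mean_valid = cool.mean()  # mean over unmasked elements only
print(cool, n_valid, mean_valid)
```

Each call keeps the mask produced by the previous one, so the three masking criteria accumulate.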

15 replies

pochedls Aug 18, 2022
Collaborator

@taylor13 - I don't know why I wrote that: I just corrected my comment to say that 0.0*np.nan = np.nan instead of 0.0*np.nan = 0.0. Anything convolved with np.nan is a NaN unless you use an operator that ignores NaNs (e.g., np.nansum).

I helped Ben write some code to transform netCDF files (including to months since 1800 and to replace NaN values with 1E20). But...you can also save files from xarray/xcdat to fill missing values with 1E20.

taylor13 Aug 18, 2022

Reading #319, there is still something unexplained, I think regarding how NaNs are handled in python. Why did WA = sum(T*W)/sum(W) result in WA = -0.0032786884513057645 ? I would have thought you'd get NaN since 10 of the 12 elements of T are NaNs. As I understand it, you only get the result you did when WA = np.nansum(T*W)/sum(W). I'm probably missing some obvious explanation.

pochedls Aug 18, 2022
Collaborator

@taylor13 - The code was using a .sum operator from xarray, which ignores NaN values. The issue here is that the denominator did not zero out the weights that corresponded to a NaN value in the T matrix. This conversation should probably be moved to that thread since this is specific to that issue (and there is actually more nuance than I want to get into here).
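A minimal numpy sketch of the effect described here (the values of `T` and `W` are made up and are not the data from #319): a NaN-skipping sum in the numerator combined with an unadjusted denominator gives a finite but biased weighted average.

```python
import numpy as np

# Hypothetical monthly values with ten missing months, plus day-count weights.
T = np.array([np.nan] * 10 + [1.0, 3.0])
W = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31], dtype=float)

# Plain sums propagate NaN.
plain = np.sum(T * W) / np.sum(W)

# NaN-skipping numerator, but the denominator still counts the
# weights of the missing months, so the result is biased toward zero.
biased = np.nansum(T * W) / np.sum(W)

# Only count weights where T is actually present.
valid = ~np.isnan(T)
correct = np.nansum(T * W) / np.sum(W[valid])

print(plain, biased, correct)
```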

durack1 Aug 19, 2022
Collaborator

@pochedls is it possible to select the function (e.g. .sum vs .nansum) IF nan values exist on the dataset - or alternatively, just use nansum as the default?

pochedls Aug 19, 2022
Collaborator

There are different ways of summing data and the results depend on the context and exact function you are using. In general, it is possible to use sum or nansum. xcdat doesn't implement these things (they come from other libraries like numpy or xarray).
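For reference, a small numpy-only sketch of the difference between the two behaviors (xarray's own reductions are not shown here):

```python
import numpy as np

x = np.array([1.0, np.nan, 2.0])

zero_times_nan = 0.0 * np.nan   # NaN propagates through arithmetic
total = np.sum(x)               # NaN, because one element is NaN
total_skipna = np.nansum(x)     # NaNs are treated as zero in the sum

# Caveat: an all-NaN input sums to 0.0 with nansum, not NaN.
all_missing = np.nansum(np.array([np.nan, np.nan]))

print(zero_times_nan, total, total_skipna, all_missing)
```

The last case is one reason a blanket "always use nansum" default can hide problems: a fully missing series silently becomes zero.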

pochedls · 2022-08-05T23:30:35Z

pochedls
Aug 5, 2022
Collaborator

I went through cdutil.* and tried to categorize some of the useful-looking functions. This thread represents things that I missed or mis-classified.

Probably Implemented/Available (in some form)

JJA, YEAR, ...
averager, getAxisWeight, ...
cdtime
setAxisTimeBounds*
setAxisTimeBounds*
setTimeBounds*
switchCalendars

Planned

vertical
- linearInterpolation
- logLinearInterpolation
- sigma2Pressure
- reconstructPressureFromHybrid

Consider Implementing

region
centroid, generalCriteria
generateLandSeaMask
generateSurfaceTypeByRegionMask
sftbyrgn

5 replies

durack1 Aug 9, 2022
Collaborator

Consider Implementing
* region
* centroid, generalCriteria
* generateLandSeaMask
* generateSurfaceTypeByRegionMask
* sftbyrgn

@pochedls I would be happy to assist with this. I have used this a fair bit in the past, including updating the inputs to a higher-res grid in cdutil. As a spec, the sftbyrgn.nc file has a 240x480 (lat x lon) resolution, with the navy_land.nc having a higher 1080x2160. It's also worth noting the existence of the CF Standardized Region List, which is a subset of the Global Change Master Directory (GCMD) Keywords, which has a size of 550 (compared to 72 in CF).

If I was to add to the wishlist, the ability to subset variables using arbitrary indexes (ala CDAT/cdat#1288) would be right up the top, not sure if this is already part of xarray?

pochedls Aug 9, 2022
Collaborator

Thanks for commenting @durack1 – do you know if similar functionality like this exists anywhere else? I guess the other question is whether xcdat is the right place for all of this functionality; if so, where should it go (maybe in the functionality around xcdat.spatial?).

durack1 Aug 9, 2022
Collaborator

@pochedls nope, I don't believe this exists anywhere else. I have heard that Iris is being deprecated, so aside from the Pangeo ecosystem, not sure what other libraries have momentum behind them?

RE: where, sure, if it was to be wedged in, xcdat.spatial would make sense to me. It seems to be the best fit out of the current choices: axis, bounds, dataset, logger, spatial, temporal, utils. utils might be the other option, if that is a dumping ground for functionality. Is there anywhere that data (I'm thinking the netcdf data that defines regions) resides in the repo at present?

durack1 Aug 9, 2022
Collaborator

Ok, so to correct myself regionmask (and docs) also exists - last update March 2022 - I was looking for the AR6 regions, and found it. From the contributors, I am guessing it plays nicely with xarray

pochedls Aug 9, 2022
Collaborator

It sounds like much of this functionality does exist. Also note that we do not currently have datasets in the package (see this thread).

pochedls · 2022-08-05T23:59:09Z

pochedls
Aug 5, 2022
Collaborator

I did a pass through of genutil functionality. There are some potentially useful features to implement. It would also be nice to know if there are statistical functions (or options for those functions) that do not exist in scipy or other libraries:

** Potentially useful **

genutil.grower
genutil.picker

** Potentially useful but perhaps outside scope **

genutil.filters.runningaverage
genutil.filters.smooth121
genutil.minmax

** Statistics functionality **

genutil.statistics.autocorrelation
genutil.statistics.autocovariance
genutil.statistics.correlation
genutil.statistics.covariance
genutil.statistics.geometricmean
genutil.statistics.laggedcorrelation
genutil.statistics.laggedcovariance
genutil.statistics.linearregression
genutil.statistics.meanabsdiff
genutil.statistics.median
genutil.statistics.percentiles
genutil.statistics.rank
genutil.statistics.rms
genutil.statistics.std
genutil.statistics.variance

5 replies

durack1 Aug 9, 2022
Collaborator

** Potentially useful but perhaps outside scope **
genutil.filters.runningaverage genutil.filters.smooth121 genutil.minmax

It seems numpy has a number of windowing/running average functions already - see Numpy Window functions
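As a rough sketch of what a running average and a 1-2-1 smoother could look like with plain numpy (the function names are hypothetical, and the edge handling via `mode="valid"` is an assumption that may not match what genutil.filters does at the ends of the series):

```python
import numpy as np

def running_average(x, window):
    """Unweighted running mean; output is shorter than x by window - 1."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

def smooth_121(x):
    """1-2-1 weighted smoother; output is shorter than x by 2."""
    kernel = np.array([1.0, 2.0, 1.0]) / 4.0
    return np.convolve(x, kernel, mode="valid")

series = np.array([0.0, 0.0, 4.0, 0.0, 0.0])
print(running_average(series, 3))  # each window contains the single 4.0
print(smooth_121(series))          # the spike is spread with 1-2-1 weights
```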

** Statistics functionality **
genutil.statistics.autocorrelation genutil.statistics.autocovariance ...

Looks like much of this is in numpy Numpy Statistics in addition to numpy.linalg Numpy Linear algebra and Numpy Polynomials. This is great, as scipy isn't an xCDAT dependency at the moment, and isn't planned to be, right?
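A few of the listed statistics can be sketched with numpy alone. The helper names below are hypothetical, and genutil-specific options (weights, centering, bias corrections) are deliberately not reproduced:

```python
import numpy as np

def lagged_correlation(x, y, lag):
    """Pearson correlation of x(t) with y(t + lag), for lag >= 0."""
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    return np.corrcoef(x, y)[0, 1]

def rms_difference(x, y):
    """Root-mean-square difference between two equally shaped arrays."""
    return np.sqrt(np.mean((x - y) ** 2))

t = np.arange(20, dtype=float)
x = np.sin(0.5 * t)
y = np.sin(0.5 * (t - 3.0))          # y is x delayed by 3 steps

r_lag3 = lagged_correlation(x, y, 3)  # y(t + 3) equals x(t)
rms = rms_difference(x, x + 1.0)      # constant offset of 1

# Ordinary least-squares linear fit: returns [slope, intercept].
slope, intercept = np.polyfit(t, 2.0 * t + 1.0, 1)
print(r_lag3, rms, slope, intercept)
```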

pochedls Aug 9, 2022
Collaborator

Thanks @durack1. When I mentioned scipy I was thinking about whether the functionality exists in other modern libraries that can easily be used. If so, we probably don't need to reproduce that work in xcdat (someone can just import these other libraries and use them). I don't think they need to be a dependency of xcdat, though.

tomvothecoder Dec 5, 2022
Maintainer Author

Related discussion for genutil.statistics.linearregression: #395

gleckler1 Dec 7, 2022

I'm just catching up with this thread. We'll be testing the implications of using/not NaN's in several metrics applications and preparation of obs4MIPs products. We'll report on this soon.

tomvothecoder May 24, 2023
Maintainer Author

Our decision based on feedback from our xCDAT Developer Day on 5/4/23 and our team meeting on 5/10/23:

The @xCDAT/core-developers team recognizes that statistical functionalities are valuable and convenient for users. However, we decided that statistical functionalities are not within the scope of xCDAT at this time. We found that supporting statistical functionalities carries a large risk of duplicating existing solutions/packages. The team also lacks members with statistical expertise. The downstream cost to implement and maintain these functions is beyond the time/budget currently allocated to the project.

As an alternative, the following options can help fill the gap of genutil:

Investigate whether xarray statistical packages exist and whether they meet user requirements
- Ask the Pangeo Forum and Xarray forum which statistical packages they use for climate science
- Some packages of interest
  - xskillscore
  - xarray.polyfit
    - Doesn't take into account Ben Santer's logic for regression, which can be found in genutil.statistics.linearregression
    - Is it possible to regress on another field?
  - sklearn-xarray
    - The latest version is 0.4.0, which was released on 6/18/2020. Is this package well-maintained?
If no existing packages meet requirements, PCMDI users can create a separate package (e.g., pcmdi_utils) or @lee1043 will consider adding these functionalities within pcmdi_metrics
- Paul Durack created a similar package based on CDAT called durolib
- Need to consider that PCMDI Metrics has a lot of dependencies that users might not want/need in their environment if they only need statistical functionality

jypeter · 2022-12-07T23:12:14Z

jypeter
Dec 7, 2022

I hate NaNs, because they will always represent some kind of numerical error for me, and are a kind of crutch for lazy beginners/coders.

That's why I think using masks is The clean way to do things! But you have to be careful to explicitly use masks all the time because of numpy/numpy#18675

2 replies

pochedls Dec 8, 2022
Collaborator

I think this is a continuation of this thread.

I always found masks in CDAT dangerous, because I never fully trusted that the separate mask array was actually being accounted for. If you have a NaN or 1E20 in your array, you'll figure out pretty quickly whether you've accounted for masking (or not). Also – I don't think xarray really generates a masked-array-like object when it opens data. Does it? I would worry that if we generated a masked array when opening datasets with xcdat that we'd have the same situation: when is this mask being taken into account and when is it not? Last, for large datasets there is some memory penalty to carry around an extra array (maybe it is small for a Boolean array)?
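To put a rough number on the memory question, here is a small sketch (the 180x360 grid is an arbitrary example) comparing the size of a float64 data array with its explicit Boolean mask:

```python
import numpy as np

# Hypothetical 1-degree global grid stored as float64.
data = np.zeros((180, 360), dtype=np.float64)

# An explicit, full-size Boolean mask (one byte per element in numpy).
masked = np.ma.masked_array(data, mask=np.zeros(data.shape, dtype=bool))

data_bytes = masked.data.nbytes
mask_bytes = masked.mask.nbytes
overhead = mask_bytes / data_bytes

print(data_bytes, mask_bytes, overhead)
```

Under these assumptions the mask adds one byte per eight bytes of data; the relative cost would be larger for float32 or integer data.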

With that said, why do people like masks?

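I thought I had read that they speed up operations [but right now I am finding examples of masked arrays slowing things down here (https://currents.soest.hawaii.edu/ocn_data_analysis/_static/masked_arrays.html) and here (https://stackoverflow.com/questions/55987642/why-are-numpy-masked-arrays-useful)].
Masked arrays may simplify code in some cases (examples here: https://currents.soest.hawaii.edu/ocn_data_analysis/_static/masked_arrays.html) [though this probably isn't the case for xcdat/xarray, because the functions were built to handle NaN values]
Maybe you care about different types of masks and a vanilla NaN complicates things for you (e.g., missing versus masked versus land/ocean/region) [though this case could probably be easily handled by explicitly creating masks, e.g., as in https://geohackweek.github.io/nDarrays/09-masking/ ?]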

What am I missing? It seems like NaNs are safer, faster, and already implemented for the things I care about (and probably do not prevent others from converting to a masked array?).

tomvothecoder Dec 8, 2022
Maintainer Author

The main gripe I'm sensing is that NaN has some quirks that users have to be aware of. Additionally, there are specific use-cases for using NaN vs. masked_array and trade-offs (performance, implementation complexity, usage complexity, consistency in representing missing values, etc.).

Background on `NaN` in numpy:

Source: https://numpy.org/doc/stable/user/misc.html#ieee-754-floating-point-special-values

NaNs can be used as a poor-man’s mask (if you don’t care what the original value was)

They state that NaNs cannot be used to test equality:

Note: cannot use equality to test NaNs.

But provide some special value functions, examples:

isinf(): True if value is inf
isfinite(): True if not nan or inf
nan_to_num(): Map nan to 0, inf to max float, -inf to min float
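A short sketch of these special-value behaviors in numpy (the sample array is made up):

```python
import numpy as np

x = np.array([1.0, np.nan, np.inf, -np.inf])

# Equality cannot be used to find NaNs: NaN is not equal to itself.
nan_equals_itself = (np.nan == np.nan)

is_nan = np.isnan(x)        # the supported way to test for NaN
is_inf = np.isinf(x)        # True for +inf and -inf
is_finite = np.isfinite(x)  # True only for ordinary numbers

# Replace NaN with 0 and +/-inf with the largest/smallest finite float.
cleaned = np.nan_to_num(x)

print(nan_equals_itself, is_nan, is_inf, is_finite, cleaned)
```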

Why Pandas uses `NaN`:

For lack of NA (missing) support from the ground up in NumPy and Python in general, we were given the difficult choice between either:

A masked array solution: an array of data and an array of boolean values indicating whether a value is there or is missing.

Using a special sentinel value, bit pattern, or set of sentinel values to denote NA across the dtypes.

For many reasons we chose the latter. After years of production use it has proven, at least in my opinion, to be the best decision given the state of affairs in NumPy and Python in general. The special value NaN (Not-A-Number) is used everywhere as the NA value, and there are API functions DataFrame.isna() and DataFrame.notna() which can be used across the dtypes to detect NA values.
-- https://pandas.pydata.org/pandas-docs/dev/user_guide/gotchas.html#choice-of-na-representation

An alternate approach is that of using masked arrays. A masked array is an array of data with an associated boolean mask denoting whether each value should be considered NA or not. I am personally not in love with this approach as I feel that overall it places a fairly heavy burden on the user and the library implementer. Additionally, it exacts a fairly high performance cost when working with numerical data compared with the simple approach of using NaN. Thus, I have chosen the Pythonic “practicality beats purity” approach and traded integer NA capability for a much simpler approach of using a special value in float and object arrays to denote NA, and promoting integer arrays to floating when NAs must be introduced.
-- https://pandas.pydata.org/pandas-docs/dev/user_guide/gotchas.html#why-not-make-numpy-like-r

NaN is used as a placeholder for missing data consistently in pandas, consistency is good. I usually read/translate NaN as "missing". Also see the 'working with missing data' section in the docs.
-- https://stackoverflow.com/a/17534682

What is the difference between `np.nan` vs. `np.ma.masked_array` and when to use them:

Source: https://stackoverflow.com/a/30529713

The difference resides in the data held by the two structures.

Using a regular array with np.nan, there is no data behind invalid values.

Using a masked array, you can initialize a full array, and then apply a mask over it so that certain values appear invalid. The numpy.ma module provides methods so that you don't have to deal with np.nan behavior (for example, np.nan == np.nan is always False, etc.)

If you have an array where you'll never need values placed in invalid cells, use the former. You can always replicate complex operations using np.nan and some indexing techniques, but that's what masked arrays are for.
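A minimal sketch of that difference, using made-up values: overwriting with NaN discards the original number, while a masked array hides it but keeps it available.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0])

# NaN approach: the original 20.0 is overwritten and cannot be recovered.
with_nan = values.copy()
with_nan[1] = np.nan
nan_mean = np.nanmean(with_nan)

# Masked approach: the value is hidden from computations but still stored.
masked = np.ma.masked_array(values, mask=[False, True, False])
masked_mean = masked.mean()
hidden_value = masked.data[1]

print(nan_mean, masked_mean, hidden_value)
```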

mfwehner · 2022-12-08T18:40:22Z

mfwehner
Dec 8, 2022

Stephen,

I use masks for 4 things. Others may have more.

1) Mask out ocean to calculate land quantities
2) Mask out land to calculate ocean quantities
3) Mask out above or below a threshold to calculate extremes
4) Mask out missing data in observational products

Masks are a critical part of my analysis. Without it, I am stuck.

Michael
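For concreteness, the four uses listed above might look like this with `numpy.ma` (a sketch only; the field values, the land/ocean layout, and the threshold of 10.0 are invented for illustration):

```python
import numpy as np

# Hypothetical 2x3 field with one missing observation, and a land/ocean flag.
field = np.array([[1.0, 2.0, np.nan],
                  [4.0, 50.0, 6.0]])
is_land = np.array([[True, True, False],
                    [False, False, True]])

# 4) Mask out missing data first, so later steps inherit that mask.
obs = np.ma.masked_invalid(field)

# 1) Mask out ocean to calculate land quantities.
land = np.ma.masked_where(~is_land, obs)

# 2) Mask out land to calculate ocean quantities.
ocean = np.ma.masked_where(is_land, obs)

# 3) Mask out values below a threshold to look at extremes.
extremes = np.ma.masked_less(obs, 10.0)

print(land.mean(), ocean.mean(), extremes.count())
```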


5 replies

pochedls Dec 8, 2022
Collaborator

@mfwehner – I think all of this can be done in xarray / xcdat. Here is an example: https://geohackweek.github.io/nDarrays/09-masking/

For observational products, missing values will be read in as NaN values (assuming the dataset sets a missing_value attribute).

jypeter Dec 9, 2022

Masks are not just here for cleanly handling missing values, as pointed out above! They are really useful for selecting any type of region (not just land vs ocean), and the nice thing is that you can use np.logical_XXX functions to combine regions and/or select some part of the data based on some fancy tests. And the tests can be based on the data array itself, or on the values of other arrays that are not just land-sea masks (e.g. get the temperature on land where the precipitation has a specific value)

I have added below a copy from an old (2007) tutorial I wrote (back when we were using MV/Numeric and not MV2/numpy), that shows how masks can be combined

I'm afraid most users are not aware of lots of test based operations (np.[ma.]choose, np.[ma.]where, ...) or operations that use masks explicitly or implicitly to remove loops and tests from their python code and are key to data heavy lifting.

I also like the A.compressed() and A.count() functions
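As an illustration of combining masks with logical functions (this is a fresh sketch with invented arrays, not the 2007 tutorial mentioned above), selecting temperature on land where precipitation exceeds a threshold could look like:

```python
import numpy as np

# Hypothetical 2x2 fields on the same grid.
temperature = np.array([[280.0, 290.0],
                        [300.0, 310.0]])
precipitation = np.array([[0.0, 5.0],
                          [12.0, 1.0]])
is_land = np.array([[True, True],
                    [True, False]])

# Combine a region test with a test on another variable.
wet_land = np.logical_and(is_land, precipitation > 2.0)

# Mask everything that is NOT wet land.
selected = np.ma.masked_where(np.logical_not(wet_land), temperature)

n_selected = selected.count()     # how many points survive
survivors = selected.compressed() # 1-D array of the unmasked values
print(n_selected, survivors)
```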

Last week, I noticed by chance that one of our 2nd year PhD students had a python script suspiciously using 100% CPU. It turned out that he had an explicit loop on all the points of an array, and created a new array based on some tests. Yuck! Correctly using tests/masks reduced the code size and allowed him to do the same stuff with array syntax, and the script execution time dropped from 15 mn to 3s. I'm not even mentioning people who would try to solve this by using some parallel stuff...
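The student's script is not available here, so the following is only a sketch of the general pattern: an explicit point-by-point loop and the equivalent array expression, which produce identical results.

```python
import numpy as np

def classify_loop(values, threshold):
    """Explicit loop over every grid point (slow for large arrays)."""
    out = np.empty_like(values)
    for i in range(values.shape[0]):
        for j in range(values.shape[1]):
            out[i, j] = values[i, j] if values[i, j] > threshold else 0.0
    return out

def classify_vectorized(values, threshold):
    """Same test expressed with array syntax."""
    return np.where(values > threshold, values, 0.0)

rng = np.random.default_rng(0)
values = rng.normal(size=(200, 300))

same = np.array_equal(classify_loop(values, 0.5),
                      classify_vectorized(values, 0.5))
print(same)
```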

I'm aware that what I have written above can probably be done with NaN values, but in my opinion NaN should only be used when there is some kind of numerical error, and not because matlab users don't know how to use their data. Besides, if I find some NaN values instead of missing values in a NetCDF file, I will probably find it very suspicious. It's some kind of level 0 Quality Control

Lots of modern users have no idea what it means when we talk of 1-byte boolean data, or 4-byte or 8-byte reals, or what precision means and when to upgrade/downgrade the data type to increase precision or decrease memory usage, and we end up with overloaded servers because of clueless notebooks/spyder/vs-code users. They have probably never heard of IEEE 754 and even what NaN means (hey, it's probably some kind of magical string displayed on the screen!) and that you should not rely on an invalid value to create valid results

I have just checked what goes on these days when you don't take precautions when dividing by zero. Note that you don't get nan by default with numpy, but you get something cleaner (or at least closer to what I would expect by default) with numpy.ma !

>>> import numpy as np
>>> a = np.identity(3)
>>> a.dtype
dtype('float64')
>>> a
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
>>> a_div = 1/a
<stdin>:1: RuntimeWarning: divide by zero encountered in true_divide
>>> a_div
array([[ 1., inf, inf],
       [inf,  1., inf],
       [inf, inf,  1.]])
>>> a_div.sum()
inf
>>> a_div_cleaner = np.where(a != 0, 1/a, np.nan)
>>> a_div_cleaner
array([[ 1., nan, nan],
       [nan,  1., nan],
       [nan, nan,  1.]])
>>> a_div_cleaner.sum()
nan
>>> b = np.ma.identity(3)
>>> b
masked_array(
  data=[[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]],
  mask=False,
  fill_value=1e+20)
>>> b_div = 1/b
>>> b_div
masked_array(
  data=[[1.0, --, --],
        [--, 1.0, --],
        [--, --, 1.0]],
  mask=[[False,  True,  True],
        [ True, False,  True],
        [ True,  True, False]],
  fill_value=1e+20)
>>> b_div.sum()
3.0
>>> b_div.mean()
1.0
>>> b_div.count()
3
>>> b_div.compressed()
array([1., 1., 1.])

My apologies if I got carried away and this became kind of philosophical :-), but I'm so fed up with users who don't know how to correctly use their numerical tools (because their advisors did not spend enough time teaching them) and have sub-optimal code that prevents other codes from running correctly on shared systems. And who probably think that a notebook is a text editor.

I think that masks are also a nice teaching tool for using array syntax and working efficiently (with our kind of big climate data)

tomvothecoder Dec 9, 2022
Maintainer Author

Masks are not just here for cleanly handling missing values, as pointed out above! They are really useful for selecting any type of region (not just land vs ocean),

We all seem to be on the same page in regards to the usefulness of masking. I think the details we are trying to figure out are:

How to represent missing values
How to operate on missing values (e.g., masking)
- Should handle cases such as divide by zero

CDAT implements np.ma.masked_array directly. Libraries such as xarray/pandas represent missing values with NaN and provide APIs to operate on them. I think the main difference here is the implementation for similar functionality.

and the nice thing is that you can use np.logical_XXX functions to combine regions and/or select some part of the data based on some fancy tests. And the tests can be based on the data array itself, or on the values of other arrays that are not just land-sea masks (e.g. get the temperature on land where the precipitation has a specific value)

If I understand correctly, xarray supports these types of masking operations using .where().

How Xarray handles missing values and masking operations:

Read in datasets that have missing values with xarray
- Xarray replaces all missing values represented by _FillValue with NaN
Perform masking operations as needed with .where(). Examples below:
- https://foundations.projectpythia.org/core/xarray/computation-masking.html#using-where-with-multiple-conditions
- https://geohackweek.github.io/nDarrays/09-masking/
Write datasets back out
- Xarray sets NaN back to _FillValue
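A minimal in-memory sketch of the masking step (no file I/O, so the `_FillValue` round trip is not shown; the variable names and values are invented, and `xarray` must be installed):

```python
import numpy as np
import xarray as xr

# Hypothetical fields; the NaN stands in for a value that was missing on read.
temperature = xr.DataArray(
    np.array([[280.0, 290.0], [np.nan, 310.0]]),
    dims=("lat", "lon"),
    name="tas",
)
precipitation = xr.DataArray(
    np.array([[0.0, 5.0], [12.0, 1.0]]),
    dims=("lat", "lon"),
    name="pr",
)

# Keep temperature only where both conditions hold; elsewhere it becomes NaN.
selected = temperature.where((precipitation > 2.0) & (temperature < 300.0))

# Reductions skip NaN by default for floating-point data.
selected_mean = float(selected.mean())

# Optional conversions for users who prefer other representations.
as_masked = selected.to_masked_array()  # numpy masked array, NaNs masked
filled = selected.fillna(1e20)          # sentinel value instead of NaN

print(selected_mean, as_masked.count(), float(filled.max()))
```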

I’m aware that what I have written above can probably be done with NaN values, but in my opinion NaN should only be used when there is some kind of numerical error, and not because matlab users don’t know how to use their data. Besides, if I find some NaN values instead of missing values in a NetCDF file, I will probably find it very suspicious. It’s some kind of level 0 Quality Control

I understand what you’re saying here, those are good points.

Xarray replaces missing values represented by _FillValue with NaN purely for internal operations. When you write the NetCDF back out, the NaN values get set back to _FillValue.

Note, users can represent missing values with something other than NaN if they really want to.

Key Takeaways

My key takeaway is that we should experiment more with Xarray and masking data using the examples and resources to get a better understanding of their approach.

Realistically, we don't have the resources to implement the management of another data structure (masked arrays) within the existing Xarray data structures given the complexities and trade-offs. In my opinion, I think the way Xarray handles missing data and masking should be sufficient in most cases and it might not make sense to implement masked arrays internally.

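We can consider positive attributes for using masked arrays and figure out if/how to include functionality that xarray doesn't already cover. If you prefer to use masked arrays, Xarray provides an API for it: xarray.DataArray.to_masked_array (https://docs.xarray.dev/en/stable/generated/xarray.DataArray.to_masked_array).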

jypeter Dec 21, 2022

I hope I can spend more time on learning xarray and testing xcdat in 2023!

I guess most users don't need to know what takes place behind the scenes, as long as data handling feels natural, and the defaults give the expected results, instead of side effects. I don't know yet if what cdms2 users consider default behavior is the same as for xarray users

We just have to be sure that we don't run into scary things (when you are not aware) like dividing integers in Python 2 gives an integer result instead of a float

tomvothecoder Dec 21, 2022
Maintainer Author

Totally on the same page as you.

Thank you @jypeter! We value and appreciate your contributions.

mfwehner · 2022-12-08T23:23:22Z

mfwehner
Dec 8, 2022

Stephen Thanks. Looks straightforward enough. Michael


0 replies

mfwehner · 2022-12-09T19:20:54Z

mfwehner
Dec 9, 2022

Tom Maybe I am missing something as I am just starting on xcdat. But can’t we just set missing_value=1.e20 like we did for cdat and be done with it? The beauty of that missing value is that if things don’t work, the answer makes no sense, alerting you to an error. Michael
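A small sketch of the "answer makes no sense" property of a 1e20 sentinel, using invented values and plain numpy rather than the xarray/xcdat encoding machinery:

```python
import numpy as np

values = np.array([1.0, 2.0, np.nan, 4.0])

# Represent the missing value with a 1e20 sentinel instead of NaN.
sentinel = np.where(np.isnan(values), 1e20, values)

# Forgetting to account for the sentinel gives an absurdly large mean,
# which is the warning sign described above.
naive_mean = sentinel.mean()

# Accounting for it (here by masking the sentinel) gives the expected mean.
masked_mean = np.ma.masked_values(sentinel, 1e20).mean()

print(naive_mean, masked_mean)
```

With NaN as the missing value, the unguarded `values.mean()` would return NaN rather than a huge number, which is a different but similarly visible signal.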


2 replies

tomvothecoder Dec 12, 2022
Maintainer Author

Note, users can represent missing values with something other than NaN if they really want to.
[docs.xarray.dev/en/stable/generated/xarray.Dataset.fillna.html](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.fillna.html)

jypeter Dec 13, 2022

Like np.ma.masked ? ;-)
