
VTools3 Development


Major design considerations

  1. Move to pandas/xarray
  2. Focus on netcdf and csv as data storage. De-emphasize dss.
  3. Possibly reduce the functionality if idiomatic pandas does certain things better (but at the expense of backward compatibility).
  4. Move to python 3.

Definition of terms

  • vtools2 refers to the existing implementation of vtools, including its hand-rolled time series data structures. It is implemented in python 2, although Nicky has some conversion work going toward python 3. See the final entry below for discussion of this.

  • vtools3 refers to an implementation of vtools that switches entirely to pandas/xarray data structures as the main implementation for time series. We will simply re-implement our most important capabilities and convenience functions. Existing vtools2 scripts run against vtools3 will break, at the very least because pandas lacks the simple .times and .data attributes. It is possible that we could manually add these attributes to the pandas/xarray structures returned by our functions, possibly using a decorator (a sketch appears after this list). This could be done either to simplify retrieval and make fewer scripts break, or to print an informational deprecation warning ... or both. The intention is that other tools like pyschism will be migrated to this when it is ready, although they might make an intermediate stop at the "vtools2, python3" variant described below. There are two possible end goals:

  1. Emphasize parsimony and idiomatic pandas over backward compatibility: we would go through our API and keep only the things that are unique or make life easier.
  2. Emphasize backward compatibility: we keep most of our API, translating it to idiomatic pandas for our implementation.
  • vtools2, python3 (v2py3) refers to an implementation of current vtools that has been migrated to python3. Such an implementation would still be based on traditional vtools time series data structures. Some possible extensions and open questions include:
  1. Can we easily include DSS in such a move?
  2. Addition of a to_pd and/or to_xray helper for coercing to pandas
  3. Addition of a pd_vtools(ts) function to coerce from pandas

I think there is some demand for this, particularly with HEC-DSS. However, it keeps vtools2 alive longer, doesn't help much with vtools3, and we don't want to rely much on the to_pd() shim in pyschism. For pyschism, I prefer we make a fuller break to pandas and vtools3, even if that is not an incremental coding effort.

  • Delegation: this is an idea where vtools time series wrap and delegate to a pandas data structure. This would be a lot of work, and it would be hard to maintain and to make interoperable with other tools that return vanilla pandas data structures.
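A minimal sketch of the compatibility shim mentioned under vtools3, assuming a decorator applied to functions that return pandas objects; the decorator name and the warning text are placeholders, not an agreed API:

```python
import functools
import warnings

import pandas as pd


def legacy_attrs(func):
    """Hypothetical decorator: attach vtools2-style .times and .data to the
    pandas object returned by a vtools3 function so that old scripts keep
    working. Names and warning text are illustrative only."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        if isinstance(result, (pd.Series, pd.DataFrame)):
            warnings.warn(
                ".times/.data are a compatibility shim; use .index/.values",
                DeprecationWarning, stacklevel=2)
            # bypass pandas' __setattr__ so no column gets created
            object.__setattr__(result, "times", result.index)
            object.__setattr__(result, "data", result.values)
        return result
    return wrapper
```

Whether the warning should fire when the function returns or only when .times/.data is actually accessed is open; access-time warnings would require a property, which points back toward the wrapping/delegation idea above.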

Data Structures

Time

  • Contenders are numpy.datetime64 and pandas Timestamp
  • From what I have seen, pd.Timestamp is very flexible and easy to convert to dtm.datetime and np.datetime64.
  1. Works well with our nearly-ISO format and also works well constructed from integers: pd.Timestamp(2009, 4, 11, 2, 15) (see the sketch after this list). datetime64 is awkward this way.
  2. Should we create a vtime() factory function to hide the back-end implementation better? Probably not
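A quick illustration of that flexibility (a sketch; the particular date is arbitrary):

```python
import datetime as dtm

import numpy as np
import pandas as pd

# Construction from our nearly-ISO strings and from integer fields
t1 = pd.Timestamp("2009-04-11 02:15")
t2 = pd.Timestamp(2009, 4, 11, 2, 15)   # no leading zeros in python 3
assert t1 == t2

# Conversion to the other candidate representations is a single call
assert isinstance(t1.to_pydatetime(), dtm.datetime)
assert isinstance(t1.to_datetime64(), np.datetime64)
```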

Deltas and intervals

  • Maintain the constructor functions for deltas (seconds(30), ..., years(10)), again to hide the implementation; a sketch follows the usage example below.
  • Parsed strings should follow the pd.tseries.offsets alias standards (15min, 1H, 1D, 1M).
  • The key sticking point here is the concept of a day. pd.tseries.offsets.Day is fine, and it is the basis for the days(nday) factory function. Importantly, a generic DateOffset is not equivalent, because of daylight saving time quirks that we don't care about. This also means that this kind of offset doesn't seem to support math (division of one interval by another). For this reason, legal "intervals" for functions based on regular time, like tidal filtration, will be based on pd.tseries.offsets.Day. Even better would simply be to avoid this entirely and use minutes and hours.
  • xarray and pd.DataFrames get built with pd.tseries.offsets as their "freq", which isn't really a frequency in signal processing terms.
  • I have not determined the interoperability of any of this with pandas Timedeltas.

ts2 = resample(ts1, interval)  # interval will become a DateOffset
ts2 = cosine_lanczos(ts1, hours(40))  # should hours(40) return a delta or a DateOffset?
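One way the delta factory functions could look, as a sketch that simply wraps pd.tseries.offsets; the choice of Day over a generic DateOffset follows the discussion above, and nothing here is settled:

```python
from pandas.tseries import offsets


def seconds(n):
    return offsets.Second(n)


def minutes(n):
    return offsets.Minute(n)


def hours(n):
    return offsets.Hour(n)


def days(n):
    # based on pd.tseries.offsets.Day rather than a generic DateOffset,
    # following the discussion above
    return offsets.Day(n)


# Calendar-dependent units (months, years) would need calendar-aware offsets
# and are left out of this sketch.

interval = minutes(15)   # could be passed to resample(ts1, interval)
cutoff = hours(40)       # could be passed to cosine_lanczos(ts1, cutoff)
```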

Time Series

Contenders for the main backend are pandas DataFrame (and Series) and xarray.DataArray.

  • pd.DataFrame seems natural for csv tables (data that has been reduced to 2D) and DataArray for multidimensional data.
  • The functions API should work interoperably on these, provided that the data structure has a "time" dimension, giving good multidimensional results as far as scaling will allow.
  • Mostly we should follow CF conventions for things like units. It seems like only xarray has an adequate metadata slot?
  • The CF conventions leave a couple of items not well defined for our purposes:
  1. sampling interval for data that are advertised as "regular"
  2. the CF way of expressing cell averages with cell_methods is a bit clunky for our main case, which is averages over prescribed regular intervals. In this case it is nicer to just say whether the stamp is at the beginning or end of the interval and skip the cell boundaries attribute. We need to figure out how longer averages are typically stamped ... I tend to favor stamping January data on January 1 (see the resampling sketch after this list), but we have to test converters carefully because this could cause a lot of issues with DSS data.
  • Data sources: the main data sources will be:
  1. CSV, which often converts nicely to a pd.DataFrame and doesn't have metadata.
  2. netCDF, which often converts nicely to an xarray Dataset or DataArray and does have metadata.
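To make the stamping question concrete, a small sketch using plain pandas resampling (the series here is synthetic; nothing about it is prescriptive):

```python
import numpy as np
import pandas as pd

# Synthetic 15-minute series covering January and February 2009
idx = pd.date_range("2009-01-01", "2009-02-28 23:45", freq="15min")
ts = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Stamp each monthly average at the start of its interval:
# the January mean lands on 2009-01-01.
monthly_start = ts.resample("MS").mean()

# The month-end convention stamps the same average on 2009-01-31 instead.
monthly_end = ts.resample("M").mean()
```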

Writing 2D data arrays to csv using pandas is trivial and we don't have to micromanage this; we should just make it convenient for some of the more popular cases. We should make sure the time formats are standardized for our own work, but maybe this isn't something we need to dictate. Should we have a way of describing name-value pair or units metadata for csv, or just agree to drop it? There is a W3C convention for this.
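A sketch of what a convenience writer might look like; the write_ts_csv name, the ISO-like date format, and the commented units header are all placeholders rather than a proposed convention:

```python
import pandas as pd


def write_ts_csv(df, path, units=None):
    """Hypothetical convenience writer: ISO-like time stamps plus an optional
    commented units line. The header convention here is only a placeholder."""
    with open(path, "w") as f:
        if units:
            f.write("# units: {}\n".format(units))
        df.to_csv(f, date_format="%Y-%m-%dT%H:%M")
```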

Use cases

  1. DSS data: derived from dumped dss files. Decide handling of period ops and time stamp conventions.
  2. SCHISM station output
  3. All the read_ts() formats (NOAA, USGS, CDEC, etc.); most can be re-handled with pd.read_csv (see the sketch after this list). Should we continue with the format sniffers or just expect the reader to know what they are loading?
  4. DSM2 outputs?
  5. netCDF files:
     a. SCHISM atmospheric
     b. UGRID???? Or should we punt to other people's tools?
     c. Univariate series written using templated metadata so that we don't have to write every little detail concerning, say, DWR- or USGS-generated metadata.
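For use case 3, a sketch of how a simple read_ts() format might be re-handled with pd.read_csv; the column name and comment character are assumptions about a particular file, not a general format:

```python
import pandas as pd


def read_station_csv(path, comment="#"):
    """Hypothetical replacement for a simple read_ts() case: one station per
    file, a 'datetime' column, '#' comment lines. All of these are assumptions
    about the file, not a general format."""
    return pd.read_csv(path, comment=comment, parse_dates=["datetime"],
                       index_col="datetime")
```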