✨ feat: nan handling #59

Merged
merged 28 commits on Feb 2, 2024

Changes from 25 commits (28 commits total)

Commits
8052806
✨ feat: add nan implementation of m4 algorithm
NielsPraet Aug 2, 2023
4209d6e
✨ feat: add nan implementation of minmax algorithm
NielsPraet Aug 2, 2023
5199d87
✨ feat: add nan implementation of minmaxlttb algorithm
NielsPraet Aug 2, 2023
496def7
💩 feat: update lib script to incorporate nan-handling functions
NielsPraet Aug 3, 2023
3b51b15
🚧 feat: add new nan downsampler
NielsPraet Aug 3, 2023
33c52e3
✅ tests: add new nan functions to rust mod tests
NielsPraet Aug 3, 2023
591bdd7
🎨 chore: format code
NielsPraet Aug 3, 2023
acf1ff3
✨ feat: expose new nan downsamplers to api
NielsPraet Aug 3, 2023
585a889
✅ tests: update tsdownsample tests to support nan downsamplers
NielsPraet Aug 3, 2023
224c0d9
🎨 chore: format code
NielsPraet Aug 3, 2023
4486894
✅ tests: add test for nan downsamplers
NielsPraet Aug 3, 2023
32b6a26
✨ feat: add python counterparts of Rust downsamplers
NielsPraet Aug 3, 2023
f8e3581
✅ tests: re-enable commented out tests
NielsPraet Aug 3, 2023
f0af701
🎨 chore: format code
NielsPraet Aug 3, 2023
49ed8ab
🔥 chore: remove commented code
NielsPraet Aug 4, 2023
b6fa372
📝 docs: update README.md
NielsPraet Aug 4, 2023
df8db7f
📝 docs: update NaN descriptions
NielsPraet Aug 7, 2023
e285c1d
Merge branch 'main' into feat/nan-support
jvdd Nov 24, 2023
45fc0d3
Merge branch 'main' into feat/nan-support
jvdd Jan 23, 2024
3c6d9d9
:broom: remove threaded
jvdd Jan 23, 2024
6916dbb
:tada: cleanup code
jvdd Jan 24, 2024
910c788
:see_no_evil: fix typo in NaNMinMaxDownsampler
jvdd Jan 24, 2024
62b0489
:detective: benchmark NaN downsamplers
jvdd Jan 24, 2024
401338c
:broom:
jvdd Jan 25, 2024
6aca12a
:broom:
jvdd Jan 25, 2024
d9c9e73
:broom: limit duplicate code
jvdd Jan 25, 2024
39a7ab7
:see_no_evil: fix linting
jvdd Jan 25, 2024
7a7cfd3
:broom:
jvdd Feb 2, 2024
37 changes: 28 additions & 9 deletions README.md
@@ -6,13 +6,14 @@
[![CodeQL](https://github.com/predict-idlab/tsdownsample/actions/workflows/codeql.yml/badge.svg)](https://github.com/predict-idlab/tsdownsample/actions/workflows/codeql.yml)
[![Testing](https://github.com/predict-idlab/tsdownsample/actions/workflows/ci-downsample_rs.yml/badge.svg)](https://github.com/predict-idlab/tsdownsample/actions/workflows/ci-downsample_rs.yml)
[![Testing](https://github.com/predict-idlab/tsdownsample/actions/workflows/ci-tsdownsample.yml/badge.svg)](https://github.com/predict-idlab/tsdownsample/actions/workflows/ci-tsdownsample.yml)

<!-- TODO: codecov -->

Extremely fast **time series downsampling 📈** for visualization, written in Rust.

## Features ✨

* **Fast**: written in rust with PyO3 bindings
- **Fast**: written in rust with PyO3 bindings
- leverages optimized [argminmax](https://github.com/jvdd/argminmax) - which is SIMD accelerated with runtime feature detection
- scales linearly with the number of data points
<!-- TODO check if it scales sublinearly -->
@@ -25,21 +26,21 @@ Extremely fast **time series downsampling 📈** for visualization, written in Rust
</blockquote>
In Rust - which is a compiled language - there is no GIL, so CPU-bound tasks can be parallelized (with <a href="https://github.com/rayon-rs/rayon">Rayon</a>) with little to no overhead.
</details>
* **Efficient**: memory efficient
- **Efficient**: memory efficient
- works on views of the data (no copies)
- no intermediate data structures are created
* **Flexible**: works on any type of data
- supported datatypes are
- for `x`: `f32`, `f64`, `i16`, `i32`, `i64`, `u16`, `u32`, `u64`, `datetime64`, `timedelta64`
- for `y`: `f16`, `f32`, `f64`, `i8`, `i16`, `i32`, `i64`, `u8`, `u16`, `u32`, `u64`, `datetime64`, `timedelta64`, `bool`
- **Flexible**: works on any type of data
- supported datatypes are
- for `x`: `f32`, `f64`, `i16`, `i32`, `i64`, `u16`, `u32`, `u64`, `datetime64`, `timedelta64`
- for `y`: `f16`, `f32`, `f64`, `i8`, `i16`, `i32`, `i64`, `u8`, `u16`, `u32`, `u64`, `datetime64`, `timedelta64`, `bool`
<details>
<summary><i>!! 🚀 <code>f16</code> <a href="https://github.com/jvdd/argminmax">argminmax</a> is 200-300x faster than numpy</i></summary>
In contrast with all other data types above, <code>f16</code> is *not* hardware supported (i.e., no instructions for f16) by most modern CPUs!! <br>
🐌 Programming languages facilitate support for this datatype by either (i) upcasting to <u>f32</u> or (ii) using a software implementation. <br>
💡 Since argminmax only needs comparisons - and thus no arithmetic operations - a <u>symmetrical ordinal mapping from <code>f16</code> to <code>i16</code></u> is sufficient. This mapping makes it possible to use the hardware-supported scalar and SIMD <code>i16</code> instructions - while not producing any memory overhead 🎉 <br>
<i>More details are described in <a href="https://github.com/jvdd/argminmax/pull/1">argminmax PR #1</a>.</i>
</details>
* **Easy to use**: simple & flexible API
- **Easy to use**: simple & flexible API

## Install

@@ -83,6 +84,7 @@ downsample([x], y, n_out, **kwargs) -> ndarray[uint64]
```

**Arguments**:

- `x` is optional
- `x` and `y` are both positional arguments
- `n_out` is a mandatory keyword argument that defines the number of output values<sup>*</sup>
@@ -93,7 +95,8 @@ downsample([x], y, n_out, **kwargs) -> ndarray[uint64]

**Returns**: a `ndarray[uint64]` of indices that can be used to index the original data.

<sup>*</sup><i>When there are gaps in the time series, fewer than `n_out` indices may be returned.</i>
<sup>\*</sup><i>When there are gaps in the time series, fewer than `n_out` indices may be returned.</i>
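
For readers who want to see this signature in use, here is a minimal sketch based only on the call signature and return type documented above (array sizes and variable names are illustrative):

```python
import numpy as np
from tsdownsample import MinMaxLTTBDownsampler

y = np.random.randn(10_000_000).astype(np.float32)  # some large time series

# x is optional; when omitted, the samples are treated as equidistant.
# n_out is a mandatory keyword argument (here: 1_000 output indices).
s_ds = MinMaxLTTBDownsampler().downsample(y, n_out=1_000)

# The result is a ndarray[uint64] of indices into the original data.
y_ds = y[s_ds]
```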

### Downsampling algorithms 📈

The following downsampling algorithms (classes) are implemented:
@@ -107,12 +110,28 @@ The following downsampling algorithms (classes) are implemented:

<sup>*</sup><i>Default value for `minmax_ratio` is 4, which is empirically proven to be a good default. More details here: https://arxiv.org/abs/2305.00332</i>
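
A short sketch of passing `minmax_ratio` explicitly (assuming it is forwarded via the `**kwargs` of `downsample`, as the signature above suggests; 4 is the documented default and is shown only for clarity):

```python
import numpy as np
from tsdownsample import MinMaxLTTBDownsampler

y = np.random.randn(1_000_000)

# minmax_ratio=4 pre-selects 4 * n_out min/max candidates before the LTTB step.
s_ds = MinMaxLTTBDownsampler().downsample(y, n_out=1_000, minmax_ratio=4)
```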

### Handling NaNs

This library supports two `NaN`-policies:

1. Omit `NaN`s (`NaN`s are ignored during downsampling).
2. Return the index of the first `NaN` in every bin that contains at least one `NaN`.

| Omit `NaN`s | Return `NaN`s |
| ----------------------: | :------------------------- |
| `MinMaxDownsampler` | `NaNMinMaxDownsampler` |
| `M4Downsampler` | `NaNM4Downsampler` |
| `MinMaxLTTBDownsampler` | `NaNMinMaxLTTBDownsampler` |
| `LTTBDownsampler` | |

> Note that `NaN`s are not supported in the `x`-data.
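
A short sketch contrasting the two policies (the class names come from the table above; the data, sizes, and the expected outcomes in the comments are illustrative):

```python
import numpy as np
from tsdownsample import MinMaxDownsampler, NaNMinMaxDownsampler

y = np.random.randn(1_000_000)
y[::1_000] = np.nan  # sprinkle NaNs into the signal

# Policy 1 (omit): NaNs are ignored, so the selected indices point to finite values.
idx_omit = MinMaxDownsampler().downsample(y, n_out=1_000)

# Policy 2 (return): a bin that contains a NaN contributes the index of its first NaN.
idx_nan = NaNMinMaxDownsampler().downsample(y, n_out=1_000)

print(np.isnan(y[idx_omit]).any())  # expected: False, per the omit policy
print(np.isnan(y[idx_nan]).any())   # expected: True, since every bin contains NaNs here
```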

## Limitations & assumptions 🚨

Assumes:

1. `x`-data is (non-strictly) monotonic increasing (i.e., sorted)
2. no `NaNs` in the data
2. no `NaN`s in `x`-data

---

95 changes: 61 additions & 34 deletions downsample_rs/src/m4.rs
@@ -1,4 +1,4 @@
use argminmax::ArgMinMax;
use argminmax::{ArgMinMax, NaNArgMinMax};
use num_traits::{AsPrimitive, FromPrimitive};
use rayon::iter::IndexedParallelIterator;
use rayon::prelude::*;
@@ -13,55 +13,82 @@ use super::POOL;

// ----------- WITH X

pub fn m4_with_x<Tx, Ty>(x: &[Tx], arr: &[Ty], n_out: usize) -> Vec<usize>
where
    for<'a> &'a [Ty]: ArgMinMax,
    Tx: Num + FromPrimitive + AsPrimitive<f64>,
    Ty: Copy + PartialOrd,
{
    assert_eq!(n_out % 4, 0);
    let bin_idx_iterator = get_equidistant_bin_idx_iterator(x, n_out / 4);
    m4_generic_with_x(arr, bin_idx_iterator, n_out, |arr| arr.argminmax())
macro_rules! m4_with_x {
    ($func_name:ident, $trait:path, $f_argminmax:expr) => {
        pub fn $func_name<Tx, Ty>(x: &[Tx], arr: &[Ty], n_out: usize) -> Vec<usize>
        where
            for<'a> &'a [Ty]: $trait,
            Tx: Num + FromPrimitive + AsPrimitive<f64>,
            Ty: Copy + PartialOrd,
        {
            assert_eq!(n_out % 4, 0);
            let bin_idx_iterator = get_equidistant_bin_idx_iterator(x, n_out / 4);
            m4_generic_with_x(arr, bin_idx_iterator, n_out, $f_argminmax)
        }
    };
}

m4_with_x!(m4_with_x, ArgMinMax, |arr| arr.argminmax());
m4_with_x!(m4_with_x_nan, NaNArgMinMax, |arr| arr.nanargminmax());

// ----------- WITHOUT X

pub fn m4_without_x<T: Copy + PartialOrd>(arr: &[T], n_out: usize) -> Vec<usize>
where
    for<'a> &'a [T]: ArgMinMax,
{
    assert_eq!(n_out % 4, 0);
    m4_generic(arr, n_out, |arr| arr.argminmax())
macro_rules! m4_without_x {
    ($func_name:ident, $trait:path, $f_argminmax:expr) => {
        pub fn $func_name<T: Copy + PartialOrd>(arr: &[T], n_out: usize) -> Vec<usize>
        where
            for<'a> &'a [T]: $trait,
        {
            assert_eq!(n_out % 4, 0);
            m4_generic(arr, n_out, $f_argminmax)
        }
    };
}

m4_without_x!(m4_without_x, ArgMinMax, |arr| arr.argminmax());
m4_without_x!(m4_without_x_nan, NaNArgMinMax, |arr| arr.nanargminmax());

// ------------------------------------- PARALLEL --------------------------------------

// ----------- WITH X

pub fn m4_with_x_parallel<Tx, Ty>(x: &[Tx], arr: &[Ty], n_out: usize) -> Vec<usize>
where
    for<'a> &'a [Ty]: ArgMinMax,
    Tx: Num + FromPrimitive + AsPrimitive<f64> + Send + Sync,
    Ty: Copy + PartialOrd + Send + Sync,
{
    assert_eq!(n_out % 4, 0);
    let bin_idx_iterator = get_equidistant_bin_idx_iterator_parallel(x, n_out / 4);
    m4_generic_with_x_parallel(arr, bin_idx_iterator, n_out, |arr| arr.argminmax())
macro_rules! m4_with_x_parallel {
    ($func_name:ident, $trait:path, $f_argminmax:expr) => {
        pub fn $func_name<Tx, Ty>(x: &[Tx], arr: &[Ty], n_out: usize) -> Vec<usize>
        where
            for<'a> &'a [Ty]: $trait,
            Tx: Num + FromPrimitive + AsPrimitive<f64> + Send + Sync,
            Ty: Copy + PartialOrd + Send + Sync,
        {
            assert_eq!(n_out % 4, 0);
            let bin_idx_iterator = get_equidistant_bin_idx_iterator_parallel(x, n_out / 4);
            m4_generic_with_x_parallel(arr, bin_idx_iterator, n_out, $f_argminmax)
        }
    };
}

m4_with_x_parallel!(m4_with_x_parallel, ArgMinMax, |arr| arr.argminmax());
m4_with_x_parallel!(m4_with_x_parallel_nan, NaNArgMinMax, |arr| arr
    .nanargminmax());

// ----------- WITHOUT X

pub fn m4_without_x_parallel<T: Copy + PartialOrd + Send + Sync>(
    arr: &[T],
    n_out: usize,
) -> Vec<usize>
where
    for<'a> &'a [T]: ArgMinMax,
{
    assert_eq!(n_out % 4, 0);
    m4_generic_parallel(arr, n_out, |arr| arr.argminmax())
macro_rules! m4_without_x_parallel {
    ($func_name:ident, $trait:path, $f_argminmax:expr) => {
        pub fn $func_name<T: Copy + PartialOrd + Send + Sync>(arr: &[T], n_out: usize) -> Vec<usize>
        where
            for<'a> &'a [T]: $trait,
        {
            assert_eq!(n_out % 4, 0);
            m4_generic_parallel(arr, n_out, $f_argminmax)
        }
    };
}

m4_without_x_parallel!(m4_without_x_parallel, ArgMinMax, |arr| arr.argminmax());
m4_without_x_parallel!(m4_without_x_parallel_nan, NaNArgMinMax, |arr| arr
    .nanargminmax());

// TODO: check for duplicate data in the output array
// -> In the current implementation we always add 4 datapoints per bin (if of
// course the bin has >= 4 datapoints). However, the argmin and argmax might
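
For readers less familiar with `macro_rules!`, the invocation `m4_with_x!(m4_with_x_nan, NaNArgMinMax, |arr| arr.nanargminmax());` above expands to roughly the following function. This is a hand-expanded sketch for illustration, not output copied from the compiler; only the trait bound and the argminmax closure differ from the non-NaN variant:

```rust
// Hand-expanded sketch of the macro invocation above.
pub fn m4_with_x_nan<Tx, Ty>(x: &[Tx], arr: &[Ty], n_out: usize) -> Vec<usize>
where
    for<'a> &'a [Ty]: NaNArgMinMax,
    Tx: Num + FromPrimitive + AsPrimitive<f64>,
    Ty: Copy + PartialOrd,
{
    assert_eq!(n_out % 4, 0);
    let bin_idx_iterator = get_equidistant_bin_idx_iterator(x, n_out / 4);
    m4_generic_with_x(arr, bin_idx_iterator, n_out, |arr| arr.nanargminmax())
}
```
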
96 changes: 62 additions & 34 deletions downsample_rs/src/minmax.rs
@@ -1,7 +1,7 @@
use rayon::iter::IndexedParallelIterator;
use rayon::prelude::*;

use argminmax::ArgMinMax;
use argminmax::{ArgMinMax, NaNArgMinMax};
use num_traits::{AsPrimitive, FromPrimitive};

use super::searchsorted::{
@@ -14,55 +14,83 @@ use super::POOL;

// ----------- WITH X

pub fn min_max_with_x<Tx, Ty>(x: &[Tx], arr: &[Ty], n_out: usize) -> Vec<usize>
where
    for<'a> &'a [Ty]: ArgMinMax,
    Tx: Num + FromPrimitive + AsPrimitive<f64>,
    Ty: Copy + PartialOrd,
{
    assert_eq!(n_out % 2, 0);
    let bin_idx_iterator = get_equidistant_bin_idx_iterator(x, n_out / 2);
    min_max_generic_with_x(arr, bin_idx_iterator, n_out, |arr| arr.argminmax())
macro_rules! min_max_with_x {
    ($func_name:ident, $trait:path, $f_argminmax:expr) => {
        pub fn $func_name<Tx, Ty>(x: &[Tx], arr: &[Ty], n_out: usize) -> Vec<usize>
        where
            for<'a> &'a [Ty]: $trait,
            Tx: Num + FromPrimitive + AsPrimitive<f64>,
            Ty: Copy + PartialOrd,
        {
            assert_eq!(n_out % 2, 0);
            let bin_idx_iterator = get_equidistant_bin_idx_iterator(x, n_out / 2);
            min_max_generic_with_x(arr, bin_idx_iterator, n_out, $f_argminmax)
        }
    };
}

min_max_with_x!(min_max_with_x, ArgMinMax, |arr| arr.argminmax());
min_max_with_x!(min_max_with_x_nan, NaNArgMinMax, |arr| arr.nanargminmax());

// ----------- WITHOUT X

pub fn min_max_without_x<T: Copy + PartialOrd>(arr: &[T], n_out: usize) -> Vec<usize>
where
    for<'a> &'a [T]: ArgMinMax,
{
    assert_eq!(n_out % 2, 0);
    min_max_generic(arr, n_out, |arr| arr.argminmax())
macro_rules! min_max_without_x {
    ($func_name:ident, $trait:path, $f_argminmax:expr) => {
        pub fn $func_name<T: Copy + PartialOrd>(arr: &[T], n_out: usize) -> Vec<usize>
        where
            for<'a> &'a [T]: $trait,
        {
            assert_eq!(n_out % 2, 0);
            min_max_generic(arr, n_out, $f_argminmax)
        }
    };
}

min_max_without_x!(min_max_without_x, ArgMinMax, |arr| arr.argminmax());
min_max_without_x!(min_max_without_x_nan, NaNArgMinMax, |arr| arr
    .nanargminmax());

// ------------------------------------- PARALLEL --------------------------------------

// ----------- WITH X

pub fn min_max_with_x_parallel<Tx, Ty>(x: &[Tx], arr: &[Ty], n_out: usize) -> Vec<usize>
where
    for<'a> &'a [Ty]: ArgMinMax,
    Tx: Num + FromPrimitive + AsPrimitive<f64> + Send + Sync,
    Ty: Copy + PartialOrd + Send + Sync,
{
    assert_eq!(n_out % 2, 0);
    let bin_idx_iterator = get_equidistant_bin_idx_iterator_parallel(x, n_out / 2);
    min_max_generic_with_x_parallel(arr, bin_idx_iterator, n_out, |arr| arr.argminmax())
macro_rules! min_max_with_x_parallel {
    ($func_name:ident, $trait:path, $f_argminmax:expr) => {
        pub fn $func_name<Tx, Ty>(x: &[Tx], arr: &[Ty], n_out: usize) -> Vec<usize>
        where
            for<'a> &'a [Ty]: $trait,
            Tx: Num + FromPrimitive + AsPrimitive<f64> + Send + Sync,
            Ty: Copy + PartialOrd + Send + Sync,
        {
            assert_eq!(n_out % 2, 0);
            let bin_idx_iterator = get_equidistant_bin_idx_iterator_parallel(x, n_out / 2);
            min_max_generic_with_x_parallel(arr, bin_idx_iterator, n_out, $f_argminmax)
        }
    };
}

min_max_with_x_parallel!(min_max_with_x_parallel, ArgMinMax, |arr| arr.argminmax());
min_max_with_x_parallel!(min_max_with_x_parallel_nan, NaNArgMinMax, |arr| arr
    .nanargminmax());

// ----------- WITHOUT X

pub fn min_max_without_x_parallel<T: Copy + PartialOrd + Send + Sync>(
    arr: &[T],
    n_out: usize,
) -> Vec<usize>
where
    for<'a> &'a [T]: ArgMinMax,
{
    assert_eq!(n_out % 2, 0);
    min_max_generic_parallel(arr, n_out, |arr| arr.argminmax())
macro_rules! min_max_without_x_parallel {
    ($func_name:ident, $trait:path, $f_argminmax:expr) => {
        pub fn $func_name<T: Copy + PartialOrd + Send + Sync>(arr: &[T], n_out: usize) -> Vec<usize>
        where
            for<'a> &'a [T]: $trait,
        {
            assert_eq!(n_out % 2, 0);
            min_max_generic_parallel(arr, n_out, $f_argminmax)
        }
    };
}

min_max_without_x_parallel!(min_max_without_x_parallel, ArgMinMax, |arr| arr.argminmax());
min_max_without_x_parallel!(min_max_without_x_parallel_nan, NaNArgMinMax, |arr| arr
    .nanargminmax());

// ----------------------------------- GENERICS ------------------------------------

// --------------------- WITHOUT X