NVIDIA · cliffburdick · Oct 2, 2023
diff --git a/CITATION.cff b/CITATION.cff
@@ -11,6 +11,6 @@ authors:
   given-names: "Adam"
   orcid: "https://orcid.org/0000-0001-9690-6357"
 title: "MatX Primitives Library for GPU-Accelerated Numerical Computing in C++"
-version: 0.1.0
-date-released: 2021-10-26
+version: 0.6.0
+date-released: 2023-10-02
 url: "https://github.com/NVIDIA/matx"
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -55,7 +55,7 @@ endif()
 project(MATX
         LANGUAGES CUDA CXX
         DESCRIPTION "A modern and efficient header-only C++ library for numerical computing on GPU"
-        VERSION 0.5.0
+        VERSION 0.6.0
         HOMEPAGE_URL "https://github.com/NVIDIA/MatX")
 
 if (NOT CMAKE_CUDA_ARCHITECTURES)

diff --git a/README.md b/README.md
@@ -193,6 +193,17 @@ We provide a variety of training materials and examples to quickly learn the Mat
 - Finally, for new MatX developers, browsing the [example applications](examples) can provide familiarity with the API and best practices.
 
 ## Release Major Features
+*v0.6.0*:
+- Breaking changes
+    * This is the first release to support "transforms as operators". Transforms can now be used in any operator expression, whereas previous releases required them to be on separate lines. For an example, please see: https://nvidia.github.io/MatX/basics/fusion.html. This is also a breaking change to transform usage. Converting to the new format is as simple as moving the function parameters. For example: `matmul(C, A, B, stream);` becomes `(C = matmul(A, B)).run(stream);`.
+- Features
+    * Polyphase channelizer
+    * Many new operators, including upsample, downsample, pwelch, overlap, at, etc.
+    * Added more lvalue semantics for operators based on view manipulation
+- Bug fixes
+    * Fixed cache issues
+    * Fixed stride = 0 in matmul
+
 *v0.5.0*:
 * Polyphase resampler
 * Documentation overhaul with examples for each function
@@ -205,15 +216,6 @@ We provide a variety of training materials and examples to quickly learn the Mat
 * 16-bit float reductions
 * Output iterator support in CUB
 
-*v0.3.0*:
-* Many new operators, including `flatten`, `remap`, `lcollapse`. `rcollapse`, `fmod`, `clone`, `slice`
-* Extended N-D tensor support to more functions
-* Allow operators on reduction inputs
-* g++11 support
-* NVTX support
-* Many, many bug fixes
-
-
 ## Discussions
 We have an open discussions board [here](https://github.com/NVIDIA/MatX/discussions). We encourage any questions about the library to be posted here for other users to learn from and read through.
 

diff --git a/docs/_sources/api/creation/tensors/make.rst b/docs/_sources/api/creation/tensors/make.rst
@@ -16,13 +16,11 @@ Return by Value
 .. doxygenfunction:: make_tensor( TensorType &tensor, const index_t (&shape)[TensorType::Rank()], matxMemorySpace_t space = MATX_MANAGED_MEMORY, cudaStream_t stream = 0)
 .. doxygenfunction:: make_tensor( ShapeType &&shape, matxMemorySpace_t space = MATX_MANAGED_MEMORY, cudaStream_t stream = 0)
 .. doxygenfunction:: make_tensor( TensorType &tensor, ShapeType &&shape,  matxMemorySpace_t space = MATX_MANAGED_MEMORY, cudaStream_t stream = 0)
-.. doxygenfunction:: make_tensor( matxMemorySpace_t space = MATX_MANAGED_MEMORY, cudaStream_t stream = 0)
 .. doxygenfunction:: make_tensor( TensorType &tensor, matxMemorySpace_t space = MATX_MANAGED_MEMORY, cudaStream_t stream = 0)
 .. doxygenfunction:: make_tensor( T *data, const index_t (&shape)[RANK], bool owning = false)
 .. doxygenfunction:: make_tensor( TensorType &tensor, typename TensorType::scalar_type *data, const index_t (&shape)[TensorType::Rank()], bool owning = false)
 .. doxygenfunction:: make_tensor( T *data, ShapeType &&shape, bool owning = false)
 .. doxygenfunction:: make_tensor( TensorType &tensor, typename TensorType::scalar_type *data, typename TensorType::shape_container &&shape, bool owning = false)
-.. doxygenfunction:: make_tensor( T *ptr, bool owning = false)
 .. doxygenfunction:: make_tensor( TensorType &tensor, typename TensorType::scalar_type *ptr, bool owning = false)
 .. doxygenfunction:: make_tensor( Storage &&s, ShapeType &&shape)
 .. doxygenfunction:: make_tensor( TensorType &tensor, typename TensorType::storage_type &&s, typename TensorType::shape_container &&shape)
@@ -38,5 +36,4 @@ Return by Pointer
 .. doxygenfunction:: make_tensor_p( const index_t (&shape)[RANK],  matxMemorySpace_t space = MATX_MANAGED_MEMORY, cudaStream_t stream = 0)
 .. doxygenfunction:: make_tensor_p( ShapeType &&shape, matxMemorySpace_t space = MATX_MANAGED_MEMORY, cudaStream_t stream = 0)
 .. doxygenfunction:: make_tensor_p( TensorType &tensor, typename TensorType::shape_container &&shape, matxMemorySpace_t space = MATX_MANAGED_MEMORY, cudaStream_t stream = 0)
-.. doxygenfunction:: make_tensor_p( matxMemorySpace_t space = MATX_MANAGED_MEMORY, cudaStream_t stream = 0)
 .. doxygenfunction:: make_tensor_p( T *const data, ShapeType &&shape, bool owning = false)
diff --git a/docs/_sources/api/dft/fft/fft.rst b/docs/_sources/api/dft/fft/fft.rst
@@ -9,8 +9,8 @@ Perform a 1D FFT
    These functions are currently not supported with host-based executors (CPU)
 
 
-.. doxygenfunction:: fft(OpA &&a, uint64_t fft_size = 0)
-.. doxygenfunction:: fft(OpA &&a, const int32_t (&axis)[1], uint64_t fft_size = 0)
+.. doxygenfunction:: fft(OpA &&a, uint64_t fft_size = 0, FFTNorm norm = FFTNorm::BACKWARD)
+.. doxygenfunction:: fft(OpA &&a, const int32_t (&axis)[1], uint64_t fft_size = 0, FFTNorm norm = FFTNorm::BACKWARD)
 
 Examples
 ~~~~~~~~
@@ -25,7 +25,7 @@ Examples
   :language: cpp
   :start-after: example-begin fft-2
   :end-before: example-end fft-2
-  :dedent:  
+  :dedent:
 
 .. literalinclude:: ../../../../test/00_transform/FFT.cu
   :language: cpp
@@ -43,4 +43,4 @@ Examples
   :language: cpp
   :start-after: example-begin fft-5
   :end-before: example-end fft-5
-  :dedent:  
+  :dedent:
diff --git a/docs/_sources/api/dft/fft/ifft.rst b/docs/_sources/api/dft/fft/ifft.rst
@@ -9,8 +9,8 @@ Perform a 1D inverse FFT
    These functions are currently not supported with host-based executors (CPU)
 
 
-.. doxygenfunction:: ifft(OpA &&a, uint64_t fft_size = 0)
-.. doxygenfunction:: ifft(OpA &&a, const int32_t (&axis)[1], uint64_t fft_size = 0)
+.. doxygenfunction:: ifft(OpA &&a, uint64_t fft_size = 0, FFTNorm norm = FFTNorm::BACKWARD)
+.. doxygenfunction:: ifft(OpA &&a, const int32_t (&axis)[1], uint64_t fft_size = 0, FFTNorm norm = FFTNorm::BACKWARD)
 
 Examples
 ~~~~~~~~

diff --git a/docs/_sources/api/logic/comparison/isclose.rst b/docs/_sources/api/logic/comparison/isclose.rst
@@ -0,0 +1,21 @@
+.. _isclose_func:
+
+isclose
+=======
+
+Determine the closeness of values across two operators using absolute and relative tolerances. The output
+from isclose is an ``int`` value since it's commonly used for reductions and ``bool`` reductions using
+atomics are not available in hardware.
+
+
+.. doxygenfunction:: isclose
+
+Examples
+~~~~~~~~
+
+.. literalinclude:: ../../../../test/00_operators/OperatorTests.cu
+   :language: cpp
+   :start-after: example-begin isclose-test-1
+   :end-before: example-end isclose-test-1
+   :dedent:
+
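The closeness test described in isclose.rst can be sketched in plain C++ without MatX. This is a minimal illustration that assumes the NumPy-style rule `|a - b| <= atol + rtol * |b|`; the function names `is_close` and `all_close` are hypothetical and are not the MatX API. The result is an `int`, mirroring the documented output type so the flags can feed a reduction.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Element-wise closeness, assuming the NumPy-style rule
// |a - b| <= atol + rtol * |b|. Returns 1 or 0 as an int so the
// flags can be combined by an ordinary integer reduction.
inline int is_close(double a, double b, double rtol, double atol) {
    return std::fabs(a - b) <= atol + rtol * std::fabs(b) ? 1 : 0;
}

// Hypothetical helper: reduce the per-element flags to a single int,
// which is the role the allclose reduction plays in the docs.
inline int all_close(const std::vector<double>& a, const std::vector<double>& b,
                     double rtol, double atol) {
    assert(a.size() == b.size());
    int result = 1;
    for (std::size_t i = 0; i < a.size(); ++i) {
        result &= is_close(a[i], b[i], rtol, atol);
    }
    return result;
}
```

Calling `all_close({1.0, 2.0}, {1.0, 2.5}, 1e-5, 1e-8)` yields 0 because the second pair differs by far more than either tolerance allows.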
diff --git a/docs/_sources/api/logic/truth/allclose.rst b/docs/_sources/api/logic/truth/allclose.rst
@@ -0,0 +1,20 @@
+.. _allclose_func:
+
+allclose
+========
+
+Reduce the closeness of two operators to a single scalar (0D) output. The output
+from allclose is an ``int`` value since boolean reductions are not available in hardware.
+
+
+.. doxygenfunction:: allclose(OutType dest, const InType1 &in1, const InType2 &in2, double rtol, double atol, SingleThreadHostExecutor exec)
+.. doxygenfunction:: allclose(OutType dest, const InType1 &in1, const InType2 &in2, double rtol, double atol, cudaExecutor exec = 0)
+
+Examples
+~~~~~~~~
+
+.. literalinclude:: ../../../../test/00_operators/ReductionTests.cu
+   :language: cpp
+   :start-after: example-begin allclose-test-1
+   :end-before: example-end allclose-test-1
+   :dedent:
diff --git a/docs/_sources/api/manipulation/rearranging/overlap.rst b/docs/_sources/api/manipulation/rearranging/overlap.rst
@@ -0,0 +1,34 @@
+.. _overlap_func:
+
+overlap
+#######
+
+Create an overlapping view of an input operator, giving a higher-rank view of the input.
+
+For example, the following 1D tensor [1 2 3 4 5] could be cloned into a 2D tensor with a
+window size of 2 and overlap of 1, resulting in::
+
+  [1 2
+   2 3
+   3 4
+   4 5]
+
+Currently this only works on 1D tensors going to 2D, but may be expanded
+for higher dimensions in the future. Note that if the window size does not
+divide evenly into the existing column dimension, the view may chop off the
+end of the data to make the tensor rectangular.
+
+.. note::
+    Only 1D input operators are accepted at this time
+
+.. doxygenfunction:: overlap( const OpType &op, const index_t (&windows)[N], const index_t (&strides)[N])
+.. doxygenfunction:: overlap( const OpType &op, const std::array<index_t, N> &windows, const std::array<index_t, N> &strides)
+
+Examples
+~~~~~~~~
+
+.. literalinclude:: ../../../../test/00_operators/OperatorTests.cu
+   :language: cpp
+   :start-after: example-begin overlap-test-1
+   :end-before: example-end overlap-test-1
+   :dedent:
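The windowing described in overlap.rst can be sketched in plain C++. This is an illustration only: `overlap_rows` is a hypothetical name, it copies data where a real view would not, and it assumes the stride is the distance between the starts of consecutive windows, so a window of 2 with a stride of 1 gives an overlap of 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build the rows of an overlapping view over a 1D input. `window` is the row
// length and `stride` is the distance between the starts of consecutive rows,
// so neighbouring rows share window - stride elements. Trailing elements that
// cannot fill a whole row are dropped, which keeps the result rectangular.
inline std::vector<std::vector<int>> overlap_rows(const std::vector<int>& in,
                                                  std::size_t window,
                                                  std::size_t stride) {
    assert(window > 0 && stride > 0);
    std::vector<std::vector<int>> rows;
    for (std::size_t start = 0; start + window <= in.size(); start += stride) {
        const auto first = in.begin() + static_cast<std::ptrdiff_t>(start);
        rows.emplace_back(first, first + static_cast<std::ptrdiff_t>(window));
    }
    return rows;
}
```

With the input `{1, 2, 3, 4, 5}`, a window of 2, and a stride of 1, this produces the four rows shown in the documentation. With a stride of 2 the last element is dropped, matching the note about the view chopping off the end of the data.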
diff --git a/docs/_sources/api/manipulation/selecting/at.rst b/docs/_sources/api/manipulation/selecting/at.rst
@@ -0,0 +1,31 @@
+.. _at_func:
+
+at
+==
+
+Selects a single value from an operator. Since `at` is a lazily-evaluated operator, it should be used
+in situations where `operator()` cannot be used. For instance:
+
+.. code-block:: cpp
+
+    (a = b(5)).run();
+
+The code above creates a race condition where `b(5)` is evaluated on the host before launch, but the value may
+not yet have been computed by a previous operation. Instead, the `at()` operator can be used to defer the load until
+the operation is launched:
+
+.. code-block:: cpp
+
+    (a = at(b, 5)).run();
+
+.. doxygenfunction:: at(const Op op, Is... indices)
+
+Examples
+~~~~~~~~
+
+.. literalinclude:: ../../../../test/00_operators/OperatorTests.cu
+   :language: cpp
+   :start-after: example-begin at-test-1
+   :end-before: example-end at-test-1
+   :dedent:
+
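The difference between an eager read and a deferred load in at.rst can be modelled in plain C++. This sketch does not use MatX: the `Queue` type and both `assign_*` functions are hypothetical stand-ins, with a list of queued closures playing the role of an asynchronous stream.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Work is queued now and executed later, standing in for an asynchronous
// stream. All names in this sketch are hypothetical.
struct Queue {
    std::vector<std::function<void()>> work;
    void run_all() {
        for (auto& task : work) task();
        work.clear();
    }
};

// Eager read: b[idx] is loaded while the expression is being built, before
// any previously queued task that writes b has executed.
inline void assign_eager(Queue& q, int& a, const std::vector<int>& b, std::size_t idx) {
    assert(idx < b.size());
    const int value_now = b[idx];
    q.work.push_back([&a, value_now] { a = value_now; });
}

// Deferred read: only the location is captured. The load happens when the
// task runs, after earlier queued writes to b have completed.
inline void assign_deferred(Queue& q, int& a, const std::vector<int>& b, std::size_t idx) {
    assert(idx < b.size());
    q.work.push_back([&a, &b, idx] { a = b[idx]; });
}
```

If a task that writes `b[5]` is queued first, `assign_eager` still captures the stale value, while `assign_deferred` observes the written value once the queue runs. That is the same ordering problem the `at()` operator is described as avoiding.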
diff --git a/docs/_sources/api/signalimage/convolution/conv1d.rst b/docs/_sources/api/signalimage/convolution/conv1d.rst
@@ -5,7 +5,13 @@ conv1d
 
 1D convolution
 
-.. doxygenfunction:: conv1d(const In1Type &i1, const In2Type &i2, matxConvCorrMode_t mode)
+Performs a convolution operation of two inputs. Three convolution modes are available: full, same, and valid. The
+mode controls how much (if any) of the output is truncated to remove filter ramps. The method parameter allows
+either direct or FFT-based convolution. Direct performs the typical sliding-window dot product approach, whereas
+FFT uses the convolution theorem. The FFT method may be faster for large inputs, but both methods should be tested
+for the target input sizes.
+
+.. doxygenfunction:: conv1d(const In1Type &i1, const In2Type &i2, matxConvCorrMode_t mode, matxConvCorrMethod_t method)
 
 Examples
 ~~~~~~~~
@@ -22,7 +28,7 @@ Examples
    :end-before: example-end conv1d-test-2
    :dedent:
 
-.. doxygenfunction:: conv1d(const In1Type &i1, const In2Type &i2, const int32_t (&axis)[1], matxConvCorrMode_t mode)   
+.. doxygenfunction:: conv1d(const In1Type &i1, const In2Type &i2, const int32_t (&axis)[1], matxConvCorrMode_t mode = MATX_C_MODE_FULL, matxConvCorrMethod_t method = MATX_C_METHOD_DIRECT)
 
 Examples
 ~~~~~~~~
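
The three modes described in conv1d.rst can be sketched with a direct, sliding-window convolution in plain C++. This is an illustration under assumptions: `ConvMode` and `conv1d_direct` are hypothetical names, and the "same" slice is centred using the NumPy convention, which may differ from MatX for even-length filters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class ConvMode { Full, Same, Valid };  // hypothetical stand-in for the mode enum

inline std::vector<double> conv1d_direct(const std::vector<double>& x,
                                         const std::vector<double>& h,
                                         ConvMode mode) {
    assert(!x.empty() && !h.empty());
    const std::size_t n = x.size();
    const std::size_t m = h.size();

    // Full convolution: every partial overlap is kept, length n + m - 1.
    std::vector<double> full(n + m - 1, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            full[i + j] += x[i] * h[j];
        }
    }
    if (mode == ConvMode::Full) return full;

    const std::size_t longer = std::max(n, m);
    const std::size_t shorter = std::min(n, m);
    // Same: centred slice as long as the longer input.
    // Valid: only outputs where the shorter input fully overlaps the longer one.
    const std::size_t start = (mode == ConvMode::Same) ? (shorter - 1) / 2 : shorter - 1;
    const std::size_t count = (mode == ConvMode::Same) ? longer : longer - shorter + 1;
    const auto first = full.begin() + static_cast<std::ptrdiff_t>(start);
    return std::vector<double>(first, first + static_cast<std::ptrdiff_t>(count));
}
```

For inputs of length 3 and 3, the output lengths are 5 (full), 3 (same), and 1 (valid), which shows how the mode trims the filter ramps at each end.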

diff --git a/docs/_sources/api/signalimage/filtering/channelize_poly.rst b/docs/_sources/api/signalimage/filtering/channelize_poly.rst
@@ -0,0 +1,18 @@
+.. _channelize_poly_func:
+
+channelize_poly
+===============
+
+Polyphase channelizer with a configurable number of channels
+
+.. doxygenfunction:: matx::channelize_poly(const InType &in, const FilterType &f, index_t num_channels, index_t decimation_factor)
+
+Examples
+~~~~~~~~
+
+.. literalinclude:: ../../../../test/00_transform/ChannelizePoly.cu
+   :language: cpp
+   :start-after: example-begin channelize_poly-test-1
+   :end-before: example-end channelize_poly-test-1
+   :dedent:
+
diff --git a/docs/_sources/api/signalimage/general/pwelch.rst b/docs/_sources/api/signalimage/general/pwelch.rst
@@ -0,0 +1,23 @@
+.. _pwelch_func:
+
+pwelch
+======
+
+Estimate the power spectral density of a signal using Welch's method [1]_
+
+.. doxygenfunction:: pwelch(const xType& x, const wType& w, index_t nperseg, index_t noverlap, index_t nfft)
+.. doxygenfunction:: pwelch(const xType& x, index_t nperseg, index_t noverlap, index_t nfft)
+
+Examples
+~~~~~~~~
+
+.. literalinclude:: ../../../../test/00_operators/PWelch.cu
+   :language: cpp
+   :start-after: example-begin pwelch-test-1
+   :end-before: example-end pwelch-test-1
+   :dedent:
+
+References
+~~~~~~~~~~
+
+  .. [1] \ P. Welch, "The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms," in IEEE Transactions on Audio and Electroacoustics, vol. 15, no. 2, pp. 70-73, June 1967, doi: 10.1109/TAU.1967.1161901.
diff --git a/docs/_sources/api/stats/index.rst b/docs/_sources/api/stats/index.rst
@@ -5,7 +5,8 @@ Statistics
 
 .. toctree::
    :maxdepth: 2
-   
+
    avgvar/index.rst
    corr/index.rst
    hist/index.rst
+   misc/index.rst
diff --git a/docs/_sources/api/stats/misc/index.rst b/docs/_sources/api/stats/misc/index.rst
@@ -0,0 +1,11 @@
+.. _misc_stats:
+
+Misc
+####
+
+
+.. toctree::
+   :maxdepth: 1
+   :glob:
+
+   *
diff --git a/docs/_sources/api/stats/misc/percentile.rst b/docs/_sources/api/stats/misc/percentile.rst
@@ -0,0 +1,24 @@
+.. _percentile_func:
+
+percentile
+##########
+
+Find the q-th percentile of an input sequence. ``q`` is a value between 0 and 100 representing the percentile. A value
+of 0 is equivalent to min, 100 is max, and 50 is the median when using the ``LINEAR`` method.
+
+.. note::
+    Multiple q values are not supported yet
+
+Supported methods for interpolation are: LINEAR, HAZEN, WEIBULL, LOWER, HIGHER, MIDPOINT, NEAREST, MEDIAN_UNBIASED, and NORMAL_UNBIASED
+
+.. doxygenfunction:: percentile(const InType &in, unsigned char q, PercentileMethod method = PercentileMethod::LINEAR)
+.. doxygenfunction:: percentile(const InType &in, unsigned char q, const int (&dims)[D], PercentileMethod method = PercentileMethod::LINEAR)
+
+Examples
+~~~~~~~~
+
+.. literalinclude:: ../../../../test/00_operators/ReductionTests.cu
+   :language: cpp
+   :start-after: example-begin percentile-test-1
+   :end-before: example-end percentile-test-1
+   :dedent:
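The linear interpolation method mentioned in percentile.rst can be sketched in plain C++. This is a minimal illustration: `percentile_linear` is a hypothetical name, and it assumes the common definition in which the fractional position `q / 100 * (n - 1)` in the sorted data is interpolated between its two neighbours. Under that definition, q = 0 returns the smallest element, q = 100 the largest, and q = 50 the median.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// q-th percentile with linear interpolation between the two nearest ranks.
// The input is taken by value so it can be sorted without modifying the caller's data.
inline double percentile_linear(std::vector<double> data, unsigned char q) {
    assert(!data.empty() && q <= 100);
    std::sort(data.begin(), data.end());
    // Fractional position of the percentile within the sorted data.
    const double pos = (static_cast<double>(q) / 100.0) *
                       static_cast<double>(data.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(std::floor(pos));
    const std::size_t hi = std::min(lo + 1, data.size() - 1);
    const double frac = pos - static_cast<double>(lo);
    return data[lo] + frac * (data[hi] - data[lo]);
}
```

The other listed methods differ mainly in how the fractional position is computed or rounded, so they would replace the `pos` and interpolation lines rather than the sort.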
diff --git a/docs/_sources/basics/concepts.rst b/docs/_sources/basics/concepts.rst
@@ -66,23 +66,25 @@ require no memory.
 Transform
 ---------
 
-Some functions in MatX can only be executed on a single line without any other operators. For example, an fft is executed by:
+Transforms are operators that take one or more inputs and call a backend library or kernel. Transforms usually change one or
+more properties of the input, but that is not always the case. An fft may change the input type or shape, but a sort transform
+does not. Depending on the context used, a transform may asynchronously allocate temporary memory if the expression requires it.
+
+For example:
 
 .. code-block:: cpp
 
-    fft(A, A);
+    (b = fft(A)).run();
 
-It is currently not valid to do something like the following:
+The expression above performs an out-of-place FFT by taking the input ``A`` and storing the result in output ``b``. Transforms may also be used
+in larger expressions:
 
 .. code-block:: cpp
 
-    (C = B * fft(A, A)).run();
-
-The reason this is invalid is because functions that are classified as transforms launch CUDA kernels to perform a single function,
-and many times they call a CUDA library. Transforms are not operators and cannot be used in operator expressions as shown above.
-Since ``fft`` is not an operator the compiler will give an error.
+    (C = B * fft(A)).run();
 
-This behavior may change in the future or be relaxed for certain transforms.
+In this case ``fft(A)`` may need somewhere to store the output of the FFT, and could asynchronously allocate memory to do so. However,
+MatX may also perform fusion on the expression if possible.
 
 Since some transforms rely on CUDA math library backends, not all of them are available with every executor. Please see the
 documentation for the individual function to check compatibility.
@@ -107,4 +109,4 @@ Shape is used to describe the size of each dimension of an operator.
 Stride
 ------
 
-Stride is used to describe the spacing between elements in each dimension of an operator
+Stride is used to describe the spacing between elements in each dimension of an operator