Version v0.9.0 adds comprehensive support for more host CPU transforms such as BLAS and LAPACK, including multi-threaded versions.
Beyond the CPU support, there are many more minor improvements:
- Added several new operators, including `vector_norm`, `matrix_norm`, `frexp`, `diag`, and more
- Many compiler fixes to support a wider range of older and newer compilers
- Performance improvements to avoid overhead of permutation operators when unnecessary
- Much more!
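
For readers unfamiliar with the new operators, here is a rough sketch of the math behind a few of them. It is plain Python, not MatX code, and does not show MatX's C++ API. The helper names (`vector_norm_l2`, `matrix_norm_frobenius`, `diag_from_vector`) are hypothetical, and the choice of the L2 and Frobenius norms is an assumption made for illustration; these notes do not say which norm orders MatX supports or defaults to.

```python
import math


def vector_norm_l2(v):
    """Euclidean (L2) norm of a 1D sequence (assumed norm order, for illustration)."""
    return math.sqrt(sum(x * x for x in v))


def matrix_norm_frobenius(m):
    """Frobenius norm: square root of the sum of squared entries (assumed norm order)."""
    return math.sqrt(sum(x * x for row in m for x in row))


def diag_from_vector(v):
    """Build a square 2D matrix with v on the main diagonal and zeros elsewhere."""
    n = len(v)
    return [[v[i] if i == j else 0 for j in range(n)] for i in range(n)]


# frexp splits x into a mantissa in [0.5, 1) and an integer exponent
# such that x == mantissa * 2**exponent.
mantissa, exponent = math.frexp(8.0)

if __name__ == "__main__":
    print(vector_norm_l2([3.0, 4.0]))
    print(matrix_norm_frobenius([[1.0, 2.0], [2.0, 4.0]]))
    print(diag_from_vector([1, 2]))
    print(mantissa, exponent)
```

The `diag_from_vector` helper mirrors the "1D operator to 2D operator" behavior described for `diag` in the changelist below; MatX itself evaluates these as tensor operators rather than on Python lists.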
A full changelist is below.
What's Changed
- Update pybind11 to v2.12.0. Fixes issue #591. by @tmartin-gh in #604
- Change print macro to matx namespaced function by @tmartin-gh in #607
- Added frexp() operator by @cliffburdick in #609
- Disable CUTLASS compile option by @cliffburdick in #610
- Created dimensionless versions of ones() and zeros() by @cliffburdick in #611
- Add smem-based polyphase channelizer kernel by @tbensonatl in #613
- Eigen guide by @tylera-nvidia in #612
- Multithreaded docs build Fix by @tylera-nvidia in #614
- Fixed issues with static tensor unit tests compiling by @cliffburdick in #615
- Implement csqrt by @tylera-nvidia in #619
- Automatic Enumeration of NVTX Range IDs by @tylera-nvidia in #616
- Fixing Clang errors to compile with clang-17 by @cliffburdick in #621
- Update to CCCL 2.4.0 and fix CMake to not use system includes by @cliffburdick in #623
- Remove options that nvc++ doesn't support by @cliffburdick in #624
- Fixing some warnings on certain compilers by @cliffburdick in #625
- More nvc++ warning fixes. Increase minimum supported CUDA to 11.5 by @cliffburdick in #627
- More nvc++ fixes + code coverage generation by @cliffburdick in #628
- fixed printing 0D tensors by @tylera-nvidia in #618
- Remove conversion for double to half by @cliffburdick in #631
- Add NVTX Tests for Code Coverage by @tylera-nvidia in #632
- Feature/add complex cast operators by @tbensonatl in #633
- Avoid array indices passthrough in matxOpTDKernel by @tbensonatl in #634
- Add mixed precision support for channelize_poly by @tbensonatl in #640
- Add test cases for stride kernels by @cliffburdick in #641
- Basic synchronization support with sync() by @aayushg55 in #642
- Converting old std:: types to cuda::std:: types by @cliffburdick in #629
- Fix pybind iterator bug on newer g++ by @cliffburdick in #643
- Initialize NVTX variable by @cliffburdick in #644
- Fixed remaining nvc++ warnings by @cliffburdick in #645
- Change cmake option/project order by @raplonu in #649
- Change check on build type to avoid short circuiting by @cliffburdick in #647
- Add complex cast operators for split inputs by @tbensonatl in #650
- Added `norm()` operator by @cliffburdick in #620
- Add zero-copy interface from MatX to NumPy by @cliffburdick in #653
- Added host multithreading support for FFTW by @aayushg55 in #652
- Fixed OpenMP compiler flags by @aayushg55 in #654
- Fixed issue with operator types used as both lvalue/rvalue not assigning by @cliffburdick in #655
- Smaller FFT test sizes for faster CI/CD by @aayushg55 in #656
- Docs for matrix/vector norm by @cliffburdick in #657
- Change matmul to use tensor_t temp until issue with impl is fixed by @cliffburdick in #658
- Added plan caching for FFTW host plans by @aayushg55 in #659
- Fixed fftw guards and temp allocation by @aayushg55 in #660
- Fixed fftw guards to be fine-grained by @aayushg55 in #661
- Enabled FFT conv for host by @aayushg55 in #662
- NVPL BLAS Support by @aayushg55 in #665
- Change supported CUDA to 11.8 by @cliffburdick in #670
- enh: add macro to define cuda functions accessible at global scope by @mfzmullen in #668
- Add workaround for pre-11.8 CTK smem init errors by @tbensonatl in #673
- Fix to ConvCorr tests to skip host tests when host not enabled by @aayushg55 in #674
- Expanded Host BLAS support by @aayushg55 in #675
- Update README.md by @HugoPhibbs in #676
- Improved the error messages when sizes are incompatible by @cliffburdick in #682
- Added toeplitz operator by @cliffburdick in #683
- Simplified cmake file so no definitions are required by default by @cliffburdick in #684
- fix type for permuted ops in norm. by @luitjens in #696
- Fix c++20 warning by @cliffburdick in #698
- Update Cub Cache Creation to new Method by @tylera-nvidia in #694
- Fixed base operator types by @cliffburdick in #703
- Update slice.rst by @HugoPhibbs in #704
- Fixed issues with host compiler with C++17 and C++20 modes by @cliffburdick in #706
- NVPL LAPACK Solver Support on ARM by @aayushg55 in #701
- Add detail:: namespace to CUB struct by @cliffburdick in #708
- OpenBLAS LAPACK Solver Support for x86 by @aayushg55 in #709
- Exclude examples/cmake_sample_project/build* from doxygen search by @tmartin-gh in #711
- Fixed random pre/post run signature by @cliffburdick in #715
- Rapids cmake 24 06 package by @cliffburdick in #716
- Add support for UINT Generation by @tylera-nvidia in #695
- Update svd docstring by @cliffburdick in #717
- Solver SVD Optimizations and Improved cuSolver batching by @aayushg55 in #721
- MATX_EN_CUTENSOR / MATX_ENABLE_CUTENSOR Unified Variable by @tylera-nvidia in #720
- mtie should output the correct rank and size for the output operator. by @luitjens in #726
- Update bug_report.md by @HugoPhibbs in #729
- eliminate auto spills in permute by @luitjens in #731
- Revert accidental commit to main by @cliffburdick in #734
- Host Solver workspace query fix by @aayushg55 in #733
- Add in-place transform support for inv() by @tbensonatl in #736
- Allow access to Data() pointer from device by @tmartin-gh in #738
- Use cublasmatinvBatched() for N <= 32 by @tbensonatl in #739
- Added new pinv() operator and updated Reduced SVD by @aayushg55 in #740
- optimize our iterator to avoid an unnecessary constructor call by @luitjens in #741
- Updated Solver documentation by @aayushg55 in #742
- Updated documentation for CPU support by @aayushg55 in #743
- Slice optimizations to reduce spills by @cliffburdick in #732
- Fixing shadow declaration by @cliffburdick in #745
- Workaround for constexpr bug inside lambda in CUDA 11.8 by @cliffburdick in #671
- Added diag operator taking 1D operator to generate 2D operator by @cliffburdick in #746
- Add normcdf docs by @cliffburdick in #747
- Refactor template arguments to reductions to force no permutes when unnecessary by @cliffburdick in #749
- Adding workarounds for false positives on gcc14 by @cliffburdick in #751
- Visibility fix for cache static deinit issue by @nvjonwong in #752
- Don't allow in-place make_tensor to change ownership by @cliffburdick in #753
- Fix for erroneous errors on gcc14.1 by @cliffburdick in #755
- Create temp contiguous tensors if needed for sort by @tbensonatl in #757
- Fix regression in slice by @cliffburdick in #758
- Allow printing const pointers by @cliffburdick in #761
- Switch CMake warnings by author warnings to allow user to disable them by @jjomier in #754
- Major refactoring of the code to better handle tensor_t usage by @cliffburdick in #756
- Fixing sort allocation by @cliffburdick in #764
- Added print_shape for printing shape of operators by @cliffburdick in #763
New Contributors
- @aayushg55 made their first contribution in #642
- @raplonu made their first contribution in #649
- @mfzmullen made their first contribution in #668
- @jjomier made their first contribution in #754
Full Changelog: v0.8.0...v0.9.0