Releases: NVIDIA/MatX
v0.9.0
Version v0.9.0 adds comprehensive host (CPU) support for more transforms, including multi-threaded BLAS and LAPACK backends.
Beyond the CPU support, there are many more minor improvements:
- Added several new operators, including `vector_norm`, `matrix_norm`, `frexp`, `diag`, and more (see the sketch below)
- Many compiler fixes to support a wider range of older and newer compilers
- Performance improvements to avoid the overhead of permutation operators when unnecessary
- Much more!
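As context for the new operators, here is a minimal sketch in MatX's usual `(lhs = op).run()` expression style. The `NormOrder` enum values are assumptions based on the norm documentation, so treat the exact signatures as illustrative:

```cpp
#include "matx.h"

int main() {
  using namespace matx;

  auto A = make_tensor<float>({4, 4});
  auto v = make_tensor<float>({4});
  auto s = make_tensor<float>({});              // 0D (scalar) output tensor

  (A = ones()).run();                           // dimensionless ones() from this release
  (v = ones()).run();

  (s = vector_norm(v, NormOrder::L2)).run();    // L2 norm of a vector (enum name assumed)
  (s = matrix_norm(A, NormOrder::FROB)).run();  // Frobenius norm of a matrix

  auto d = make_tensor<float>({4});
  (d = diag(A)).run();                          // extract the main diagonal

  cudaDeviceSynchronize();
  return 0;
}
```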
A full changelist is below.
What's Changed
- Update pybind to v2.12.0. Fixes issue #591. by @tmartin-gh in #604
- Change print macro to matx namespaced function by @tmartin-gh in #607
- Added frexp() operator by @cliffburdick in #609
- Disable CUTLASS compile option by @cliffburdick in #610
- Created dimensionless versions of ones() and zeros() by @cliffburdick in #611
- Add smem-based polyphase channelizer kernel by @tbensonatl in #613
- Eigen guide by @tylera-nvidia in #612
- Multithreaded docs build Fix by @tylera-nvidia in #614
- Fixed issues with static tensor unit tests compiling by @cliffburdick in #615
- Implement csqrt by @tylera-nvidia in #619
- Automatic Enumeration of NVTX Range IDs by @tylera-nvidia in #616
- Fixing Clang errors to compile with clang-17 by @cliffburdick in #621
- Update to CCCL 2.4.0 and fix CMake to not use system includes by @cliffburdick in #623
- Remove options that nvc++ doesn't support by @cliffburdick in #624
- Fixing some warnings on certain compilers by @cliffburdick in #625
- More nvc++ warning fixes. Increase minimum supported CUDA to 11.5 by @cliffburdick in #627
- More nvc++ fixes + code coverage generation by @cliffburdick in #628
- fixed printing 0D tensors by @tylera-nvidia in #618
- Remove conversion for double to half by @cliffburdick in #631
- Add NVTX Tests for Code Coverage by @tylera-nvidia in #632
- Feature/add complex cast operators by @tbensonatl in #633
- Avoid array indices passthrough in matxOpTDKernel by @tbensonatl in #634
- Add mixed precision support for channelize_poly by @tbensonatl in #640
- Add test cases for stride kernels by @cliffburdick in #641
- Basic synchronization support with sync() by @aayushg55 in #642
- Converting old std:: types to cuda::std:: types by @cliffburdick in #629
- Fix pybind iterator bug on newer g++ by @cliffburdick in #643
- Initialize NVTX variable by @cliffburdick in #644
- Fixed remaining nvc++ warnings by @cliffburdick in #645
- Change cmake option/project order by @raplonu in #649
- Change check on build type to avoid short circuiting by @cliffburdick in #647
- Add complex cast operators for split inputs by @tbensonatl in #650
- Added `norm()` operator by @cliffburdick in #620
- Add zero-copy interface from MatX to NumPy by @cliffburdick in #653
- Added host multithreading support for FFTW by @aayushg55 in #652
- Fixed OpenMP compiler flags by @aayushg55 in #654
- Fixed issue with operator types used as both lvalue/rvalue not assigning by @cliffburdick in #655
- Smaller FFT test sizes for faster CI/CD by @aayushg55 in #656
- Docs for matrix/vector norm by @cliffburdick in #657
- Change matmul to use tensor_t temp until issue with impl is fixed by @cliffburdick in #658
- Added plan caching for FFTW host plans by @aayushg55 in #659
- Fixed fftw guards and temp allocation by @aayushg55 in #660
- Fixed fftw guards to be fine-grained by @aayushg55 in #661
- Enabled FFT conv for host by @aayushg55 in #662
- NVPL BLAS Support by @aayushg55 in #665
- Change supported CUDA to 11.8 by @cliffburdick in #670
- enh: add macro to define cuda functions accessible at global scope by @mfzmullen in #668
- Add workaround for pre-11.8 CTK smem init errors by @tbensonatl in #673
- Fix to ConvCorr tests to skip host tests when host not enabled by @aayushg55 in #674
- Expanded Host BLAS support by @aayushg55 in #675
- Update README.md by @HugoPhibbs in #676
- Improved the error messages when sizes are incompatible by @cliffburdick in #682
- Added toeplitz operator by @cliffburdick in #683
- Simplified cmake file so no definitions are required by default by @cliffburdick in #684
- fix type for permuted ops in norm. by @luitjens in #696
- Fix c++20 warning by @cliffburdick in #698
- Update Cub Cache Creation to new Method by @tylera-nvidia in #694
- Fixed base operator types by @cliffburdick in #703
- Update slice.rst by @HugoPhibbs in #704
- Fixed issues with host compiler with C++17 and C++20 modes by @cliffburdick in #706
- NVPL LAPACK Solver Support on ARM by @aayushg55 in #701
- Add detail:: namespace to CUB struct by @cliffburdick in #708
- OpenBLAS LAPACK Solver Support for x86 by @aayushg55 in #709
- Exclude examples/cmake_sample_project/build* from doxygen search by @tmartin-gh in #711
- Fixed random pre/post run signature by @cliffburdick in #715
- Rapids cmake 24 06 package by @cliffburdick in #716
- Add support for UINT Generation by @tylera-nvidia in #695
- Update svd docstring by @cliffburdick in #717
- Solver SVD Optimizations and Improved cuSolver batching by @aayushg55 in #721
- MATX_EN_CUTENSOR / MATX_ENABLE_CUTENSOR Unified Variable by @tylera-nvidia in #720
- mtie should output the correct rank and size for the output operator. by @luitjens in #726
- Update bug_report.md by @HugoPhibbs in #729
- eliminate auto spills in permute by @luitjens in #731
- Revert accidental commit to main by @cliffburdick in #734
- Host Solver workspace query fix by @aayushg55 in #733
- Add in-place transform support for inv() by @tbensonatl in #736
- Allow access to Data() pointer from device by @tmartin-gh in #738
- Use cublasmatinvBatched() for N <= 32 by @tbensonatl in #739
- Added new pinv() operator and updated Reduced SVD by @aayushg55 in #740
- optimize our iterator to avoid an unnecessary constructor call by @luitjens in #741
- Updated Solver documentation by @aayushg55 in #742
- Updated documentation for CPU support by @aayushg55 in #743
- Slice optimizations to reduce spills by @cliffburdick in #732
- Fixing shadow declaration by @cliffburdick in #745
- Workaround for constexpr bug inside lambda in CUDA 11.8 by @cliffburdick in #671
- Added diag operator taking 1D operator to generate 2D operator by @cliffburdick in #746
- Add normcdf docs by @cliffburdick in #747
- Refactor template arguments to reductions to force no permutes when unnecessary by @cliffburdick in #749
- Adding workarounds for false positives on gcc14 by @cliffburdick in #751
- Visibility fix for cache static deinit issue by @nvjonwong in #752
- Don't allow in-place make_tensor to change ownership by @cliffburdick in #753
- Fix for erroneous errors on gcc14.1 by @cliffburdick in #755
- Create temp contiguous tensors if needed for sor...
v0.8.0
Release highlights:
- Features
- Updated cuTENSOR and cuTensorNet versions
- Added configurable print formatting
- ARM FFT support via NVPL
- New operators: abs2(), outer(), isnan(), isinf() (see the sketch below)
- Many more unit tests for CPU execution
- Bug fixes for matmul on Hopper, 2D FFTs, and more
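A quick sketch of the highlighted operators, assuming MatX's standard `(lhs = op).run()` pattern; the output element types chosen here (e.g. `int` for the boolean tests) are illustrative assumptions:

```cpp
#include "matx.h"

int main() {
  using namespace matx;

  auto x = make_tensor<cuda::std::complex<float>>({8});
  auto y = make_tensor<float>({8});
  auto m = make_tensor<int>({8});

  (y = abs2(x)).run();   // squared magnitude, avoids the sqrt inside abs()
  (m = isnan(y)).run();  // elementwise NaN test (1 where NaN, else 0)
  (m = isinf(y)).run();  // elementwise infinity test

  // outer(a, b) forms the outer product of two vectors
  auto a = make_tensor<float>({4});
  auto b = make_tensor<float>({8});
  auto o = make_tensor<float>({4, 8});
  (o = outer(a, b)).run();

  cudaDeviceSynchronize();
  return 0;
}
```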
Full changelist:
What's Changed
- Increase cublas workspace to 32 MiB for Hopper+ by @tbensonatl in #545
- matmul bug fixes. by @luitjens in #547
- Added missing synchronization by @luitjens in #552
- Refine some file I/O functions' doxygen comments by @AtomicVar in #549
- Update docs by @tmartin-gh in #551
- Export used environment variables in sphinx config by @tmartin-gh in #553
- Import os by @tmartin-gh in #554
- Add version info by @tmartin-gh in #555
- Fix typo by @tmartin-gh in #556
- Adds IsNan and IsInf Operators by @nvjonwong in #557
- Use cmake project version info in sphinx config by @tmartin-gh in #560
- outer() operator for outer product by @cliffburdick in #559
- Fix nans in QR and SVD. by @luitjens in #558
- Update CMakeLists.txt by @cliffburdick in #548
- Fix CMake to allow multiple rapids-cmake to coexist by @cliffburdick in #562
- Return 0D arrays for 0D shape in operators by @cliffburdick in #561
- Fix NVTX3 include path by @AtomicVar in #564
- Add .npy File I/O by @AtomicVar in #565
- SVD & QR improvements by @luitjens in #563
- chore: Fix typo s/whereever/wherever/ by @hugo-syn in #566
- Add rapids-cmake-dir, if defined, to CMAKE_MODULE_PATH by @tbensonatl in #567
- Add abs2() operator for squared abs() by @tbensonatl in #568
- Fixed issue on g++13 with nullptr dereference that cannot happen at r… by @cliffburdick in #571
- Force max(min) size of direct convolution dimension to be < 1024 by @cliffburdick in #573
- Remove incorrect warning check for any compiler other than gcc by @cliffburdick in #577
- stream memory cleanup by @cliffburdick in #579
- Update reshape indices by @cliffburdick in #580
- Update matlabpython.rst by @cliffburdick in #583
- Prevent potential oob read in matxOpTDKernel by @tbensonatl in #586
- Broadcast lower-rank tensors during batched matmul by @tbensonatl in #585
- Fix bugs in 2D FFTs and add tests by @benbarsdell in #587
- Added ARM FFT Support by @cliffburdick in #576
- Various bug fixes for older compilers by @cliffburdick in #588
- Renamed rmin/rmax functions to min/max and element-wise are now minimum/maximum to match Python by @cliffburdick in #589
- Fix clang macro by @cliffburdick in #592
- Fix misplaced sentence in README by @lucifer1004 in #594
- Add configurable print formatting types by @tmartin-gh in #593
- Fixing return types to allow either prvalue or lvalue in operator() by @cliffburdick in #598
- Rework einsum for new cache style. Fix for issue #597 by @tmartin-gh in #599
- Updated cutensornet to 24.03 and cutensor to 2.0.1 by @cliffburdick in #600
- adding file name and line number to ease debug by @bhaskarrakshit in #601
- Updating versions and notes for v0.8.0 by @cliffburdick in #602
New Contributors
- @hugo-syn made their first contribution in #566
- @benbarsdell made their first contribution in #587
- @lucifer1004 made their first contribution in #594
- @bhaskarrakshit made their first contribution in #601
Full Changelog: v0.7.0...v0.8.0
v0.7.0
Features
- Convert libcudacxx to CCCL by @cliffburdick in #501
- Add PreRun and tests for at/clone/diag operators by @tbensonatl in #502
- Add explicit FFT length to fft_conv example by @tbensonatl in #503
- Add Pre/PostRun support for collapse, concat ops by @tbensonatl in #506
- polyval operator by @cliffburdick in #508
- Optimize resample poly kernels by @tbensonatl in #512
- Allow negative indexing on slices by @cliffburdick in #516 (see the sketch after this list)
- Automatically publish docs to GH Pages on merge to main by @tmartin-gh in #520
- Add configurable precision support of `print()` by @AtomicVar in #521
- Make matxHalf trivially copyable by @tbensonatl in #513
- Added operator for matvec by @cliffburdick in #514
- New rapids and nvbench by @cliffburdick in #529
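To illustrate two of the features above, here is a minimal sketch of negative slice indexing (#516) and the `polyval` operator (#508); the highest-order-first coefficient ordering is an assumption:

```cpp
#include "matx.h"

int main() {
  using namespace matx;

  auto t = make_tensor<float>({10});
  (t = range<0>(t.Shape(), 0.0f, 1.0f)).run();  // 0, 1, ..., 9

  // A negative start index counts from the end: this views the last 3 elements
  auto tail = slice(t, {-3}, {matxEnd});

  // polyval evaluates a polynomial at each element of the input
  auto coeffs = make_tensor<float>({3});
  coeffs.SetVals({1.0f, -2.0f, 3.0f});          // assumed to mean x^2 - 2x + 3
  auto out = make_tensor<float>({10});
  (out = polyval(t, coeffs)).run();

  cudaDeviceSynchronize();
  return 0;
}
```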
Fixes
- Add FFT1D tensor size checks by @tbensonatl in #499
- Fix errors which caused some unit tests to fail to compile. by @AtomicVar in #504
- Fix upsample output size by @cliffburdick in #507
- removing print characters accidentally left behind by @tylera-nvidia in #510
- Renamed host executor and prepared for multi-threaded additions by @cliffburdick in #511
- removing old hardcoded limit for repmat rank size by @tylera-nvidia in #515
- Avoid async alloc in some Cholesky decomp cases by @tbensonatl in #517
- Workaround for maybe_unused parse bug in old gcc by @tbensonatl in #522
- Fix matvec output dims to match A rather than B by @tbensonatl in #523
- Remove CUDA system include by @cliffburdick in #525
- Zero-initialize batches field in CUB params by @tbensonatl in #527
- Fixing host include guard on resample poly by @cliffburdick in #528
- Update device.h for host compiler by @cliffburdick in #530
- Made allocator an inline function by @cliffburdick in #532
- Build and publish documentation on merge to main by @tmartin-gh in #533
- Remove doxygen parameter to match tensor_t constructor signature by @tmartin-gh in #534
- Update iterator.h by @cliffburdick in #536
- Update Bug Report Issue Template by @AtomicVar in #539
- Fix CCCL libcudacxx path by @cliffburdick in #537
- Check matmul types and error at compile-time if the backend doesn't support them by @cliffburdick in #540
- Fix batched cov transform by @tbensonatl in #541
- Update caching for transforms to fix all leaks reported by compute-sanitizer by @cliffburdick in #542
- Update docs for v0.7.0 by @cliffburdick in #544
Full Changelog: v0.6.0...v0.7.0
v0.6.0
Notable Updates
- Transforms as operators by @cliffburdick in #452 (see the sketch below)
- resample_poly optimizations and operator support by @tbensonatl in #465
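The transforms-as-operators change lets transforms such as `matmul()` appear on the right-hand side of an expression like any other operator, rather than being standalone calls with a preallocated output. A minimal sketch, assuming the operator form composes with elementwise expressions:

```cpp
#include "matx.h"

int main() {
  using namespace matx;

  auto A = make_tensor<float>({16, 8});
  auto B = make_tensor<float>({8, 4});
  auto C = make_tensor<float>({16, 4});

  (A = ones(A.Shape())).run();
  (B = ones(B.Shape())).run();

  // Previously a direct call such as matmul(C, A, B, stream); as an operator,
  // the transform can be fused into a larger expression:
  (C = matmul(A, B) * 2.0f).run();

  cudaDeviceSynchronize();
  return 0;
}
```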
Full changelog below:
What's Changed
- Added upsample and downsample operators by @cliffburdick in #442
- Added lvalue semantics to operators that needed it by @cliffburdick in #443
- Added operator support to solver functions by @cliffburdick in #444
- Added shapeless version of diag() and eye() by @cliffburdick in #445
- Deprecated random interface by @cliffburdick in #446
- Updated cuTENSOR/cuTensorNet and added example for trace by @cliffburdick in #447
- Fixing host compilation where device code snuck in by @cliffburdick in #453
- Added Protections for Shift Operator inputs and fixed issues with size/Shape returns for certain input sizes by @tylera-nvidia in #454
- Added isclose and allclose functions by @cliffburdick in #448
- Adds normalization options for `fft` and `ifft` by @nvjonwong in #456
- Updated 0D tensor syntax and expanded simple radar pipeline by @cliffburdick in #458
- Add initial polyphase channelizer operator by @tbensonatl in #459
- Fixed inverse from stomping on input by @cliffburdick in #461
- Fix cache issue with strides by @cliffburdick in #460
- Added const to Pre/PostRun by @cliffburdick in #462
- Revert inv by @cliffburdick in #463
- Added proper LHS handling for transforms by @cliffburdick in #464
- Updated incorrect license by @cliffburdick in #466
- Use device mem instead of managed for fft workbuf by @tbensonatl in #467
- Added at() and percentile() operators by @cliffburdick in #471
- Add overlap operator by @cliffburdick in #472
- Support stride 0 A/B batches for GEMMs by @cliffburdick in #473
- Added FFT-based convolution to conv1d() by @cliffburdick in #475
- Documentation cleanup by @tmartin-gh in #477
- Adding FFT convolution benchmarks by @cliffburdick in #476
- Fixed rank of output in matmul operator when A/B had 0 stride by @cliffburdick in #478
- Updating header image by @cliffburdick in #480
- Add pwelch operator by @tmartin-gh in #479
- Docs cleanup. Enforce warning-as-error for doxygen and sphinx. by @tmartin-gh in #481
- Fixes for CUDA 12.3 compiler by @cliffburdick in #483
- Update pwelch.h by @cliffburdick in #486
- Fixes for new compiler issues by @cliffburdick in #488
- Fixing sample Cmake Project by @tylera-nvidia in #489
- Update base_operator.h by @cliffburdick in #490
- Add window operator input to pwelch by @tmartin-gh in #491
- Add PreRun methods for slice/fftshift operators by @tbensonatl in #493
- PreRun support for r2c and other fft related fixes by @tbensonatl in #494
New Contributors
- @tmartin-gh made their first contribution in #477
Full Changelog: v0.5.0...v0.6.0
v0.5.0
Notable Updates
- Documentation rewritten to include working examples for every function based on unit tests
- Polyphase resampler based on SciPy/cuSignal's `resample_poly` (see the sketch below)
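A minimal sketch of the polyphase resampler, which mirrors `scipy.signal.resample_poly`. The `resample_poly(input, filter, up, down)` signature and the operator-expression form are assumptions (full operator support landed in v0.6.0):

```cpp
#include "matx.h"

int main() {
  using namespace matx;

  constexpr index_t N = 1000, up = 3, down = 2;

  auto in     = make_tensor<float>({N});
  auto filter = make_tensor<float>({31});                        // FIR prototype filter
  auto out    = make_tensor<float>({(N * up + down - 1) / down}); // ceil(N*up/down)

  (out = resample_poly(in, filter, up, down)).run();

  cudaDeviceSynchronize();
  return 0;
}
```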
Full changelog below:
What's Changed
- Modifies TensorViewToNumpy and NumpyToTensorView for rank = 5 by @nvjonwong in #427
- NumpyToTensorView overload which returns new TensorView by @nvjonwong in #428
- Added fftfreq() generator by @cliffburdick in #430
- Latest NumpyToTensorView function requires complex conversion for complex types by @nvjonwong in #431
- Fixed print function to work on device in certain cases by @cliffburdick in #436
- Fixed unused variable warning by @cliffburdick in #435
- Adding initial polyphase resampler transform by @tbensonatl in #437
- Revamped documentation by @cliffburdick in #438
- Fixing typo in Cholesky docs by @cliffburdick in #439
- Added broadcasting documentation by @cliffburdick in #440
- Broadcast docs by @cliffburdick in #441
New Contributors
- @nvjonwong made their first contribution in #427
Full Changelog: v0.4.1...v0.5.0
v0.4.1
This is a minor release mostly focused on bug fixes for different compilers and CUDA versions. One major addition is that all reductions are now supported on the host using a single-threaded executor (a sketch follows); multi-threaded executor support is coming soon.
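A minimal sketch of a host-side reduction using the function-style reduction API of this era. The executor type name is an assumption, since the host executor was renamed in later releases:

```cpp
#include "matx.h"
#include <cstdio>

int main() {
  using namespace matx;

  auto a = make_tensor<float>({100});  // managed memory: host- and device-visible
  auto s = make_tensor<float>({});     // 0D output holding the scalar result

  for (index_t i = 0; i < a.Size(0); i++) {
    a(i) = static_cast<float>(i);      // fill directly on the host
  }

  sum(s, a, SingleThreadHostExecutor{});  // runs on the CPU (executor name assumed)
  printf("sum = %f\n", s());
  return 0;
}
```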
What's Changed
- Host reductions by @cliffburdick in #385
- Reduced cuBLASLt workspace size by @cliffburdick in #404
- Fix benchmarks that broke with new executors by @cliffburdick in #405
- All operator tests converted to use host and device, and improved 16b by @cliffburdick in #403
- Add single argument copy() and copy() tests by @tbensonatl in #407
- Add rank0 tensor remap support by @tbensonatl in #408
- Add Mutex to support multithread NVTX markers by @tylera-nvidia in #406
- Fix a few issues highlighted by linters/clang by @tbensonatl in #409
- Fixed compilation for Pascal by @cliffburdick in #412
- Fixed issue with constructor when passing strides and sizes by @cliffburdick in #413
- CMake fixes found by user by @cliffburdick in #416
- Update libcudacxx to 2.1.0 by @cliffburdick in #417
- Fixed cupy check for unit tests, default constructors, and file IO by @cliffburdick in #419
- Added delta degrees of freedom on var() to mimic Python by @cliffburdick in #421
- Adding correct license on files that were wrong by @cliffburdick in #423
- Fixed two issues with release mode and DLPack and reductions on the host by @cliffburdick in #424
Full Changelog: v0.4.0...v0.4.1
v0.4.0
New Features
- slice optimization to use builtin tensor function when possible by @luitjens in #360
- Slice support for std::array shapes by @luitjens in #363
- svd power iteration example, benchmark and unit tests. by @luitjens in #366
- matmul: support real/complex tensors by @kshitij12345 in #362
- Adding sign/index operators: by @luitjens in #369
- optimized cast and conj op to return a tensor view when possible. by @luitjens in #371
- implement QR for small batched matrices. by @luitjens in #373
- Implement block power iteration (qr iterations) for svd by @luitjens in #375
- Added output iterator support for CUB sums, and converted all sum() by @cliffburdick in #380
- Removing inheritance from std::iterator by @cliffburdick in #381
- DLPack support by @cliffburdick in #392
- Adding ref-count for DLPack by @cliffburdick in #394
- updating cub optimization selection for >= 2.0 by @tylera-nvidia in #395
- Refactored make_tensor to allow lvalue init by @cliffburdick in #397
- Updated notebook documentation and refactored some code by @cliffburdick in #398
- Allow 0-stride dimensions for cublas input/output by @tbensonatl in #400
- 16-bit float reductions + updated softmax by @cliffburdick in #399
Bug Fixes
- Fix Duplicate Print and remove member prints by @tylera-nvidia in #364
- cublasLT col major detection fix. by @luitjens in #368
- Fixes for 32b mode by @cliffburdick in #388
- Fixed a bogus maybe-uninitialized warning/error in release mode by @cliffburdick in #389
- Fixed issue with using const pointers by @cliffburdick in #393
- Generator Printing Patch by @tylera-nvidia in #370
New Contributors
- @kshitij12345 made their first contribution in #362
- @tbensonatl made their first contribution in #400
Full Changelog: v0.3.0...v0.4.0
v0.3.0
v0.3.0 marks a major release with over 100 features and bug fixes. Releases will now occur more frequently to support users not living at HEAD.
What's Changed
- Added squeeze operator by @cliffburdick in #163
- Change name of squeeze to flatten by @cliffburdick in #164
- Updated version of cuTENSOR and fixed paths by @cliffburdick in #166
- Added reduction example with einsum by @cliffburdick in #168
- Fixed bug with wrong type on argmin/max by @cliffburdick in #170
- Fixed missing return on operator() for sum by @cliffburdick in #171
- Fixed error with reduction with invalid indices. Only shows up on Jetson by @cliffburdick in #172
- Fixed bug with matmul use-after-free by @cliffburdick in #173
- Added test for batched GEMMs by @cliffburdick in #174
- Throw an exception if using SetVals on non-managed pointer by @cliffburdick in #176
- Added missing assert in release mode by @cliffburdick in #178
- Fixed einsum in release mode by @cliffburdick in #179
- Updates to docs by @cliffburdick in #180
- Added unit test for transpose and fixed bug with grid size by @cliffburdick in #181
- Fix grid dimensions for transpose. by @galv in #182
- Added missing include by @cliffburdick in #184
- Remove CUB from sum reduction while bug is being investigated by @cliffburdick in #186
- Fix for cub reductions by @luitjens in #187
- Reenable CUB tests by @cliffburdick in #188
- Fixing incorrect parameter to CUB sort for 2D tensors by @cliffburdick in #190
- Remove 4D restriction on Clone by @cliffburdick in #191
- Added support for N-D convolutions by @cliffburdick in #189
- Download RAPIDS.cmake only if it does not exist. by @cwharris in #192
- Fix 11.4 compilation issues by @cliffburdick in #195
- Improve FFT batching by @cliffburdick in #196
- Fixed argmax initialization value by @cliffburdick in #198
- Fix issue #199 by @pkestene in #200
- Fix type on concatenate by @cliffburdick in #201
- Fix documentation typo by @dagardner-nv in #202
- Missing host annotation on some generators by @cliffburdick in #203
- Fixed TotalSize on cub operators by @cliffburdick in #204
- Implementing remap operator. by @luitjens in #205
- Update reverse/shift APIs by @luitjens in #207
- batching conv1d across filters. by @luitjens in #208
- Added Print for operators by @cliffburdick in #211
- Complex div by @cliffburdick in #213
- Added lcollapse and rcollapse operator by @luitjens in #212
- Baseops by @luitjens in #214
- Only allow View() on contiguous tensors. by @luitjens in #215
- Remove caching on some CUB types temporarily by @cliffburdick in #216
- Fixed convolution mode SAME and added unit tests by @cliffburdick in #217
- Added convolution VALID support by @cliffburdick in #218
- Allow operators on cumsum by @cliffburdick in #219
- Using async allocation in median() by @cliffburdick in #220
- Various CUB fixes -- got rid of offset pointers (async allocation + copy), allowed operators on more types, and fixed caching on sort by @cliffburdick in #222
- Fixed memory leak on CUB cache bypass by @cliffburdick in #223
- Update to pipe type through for scalars on set operation by @tylera-nvidia in #225
- Added complex version of mean and variance by @cliffburdick in #227
- Fixed FFT batching for non-contiguous tensors by @cliffburdick in #228
- Added fmod operator by @cliffburdick in #230
- Fmod by @cliffburdick in #231
- Changing name to fmod by @cliffburdick in #232
- Cloneop by @luitjens in #233
- Making the shift parameter in shift an operator by @luitjens in #234
- Change sign of shift to match python/matlab. by @luitjens in #235
- Changing output operator type to by value to allow temporary operators to be used as an output type. by @luitjens in #236
- Adding slice() operator. by @luitjens in #237
- Fix cuTensorNet workspace size by @leofang in #241
- adding permute operator by @luitjens in #239
- Cleaning up operators/transforms. by @luitjens in #243
- Rapids cmake no fetch by @cliffburdick in #245
- Cleanup of include directory by @luitjens in #246
- Fixed conv SAME mode by @cliffburdick in #248
- Use singleton on GIL interpreter by @cliffburdick in #249
- make owning a runtime parameter by @luitjens in #247
- Fixed bug with batched 1D convolution size by @cliffburdick in #250
- Adding 2d convolution tests by @luitjens in #251
- Properly initialize pybind object by @cliffburdick in #252
- Fixed sum() using wrong iterator type by @cliffburdick in #253
- g++11 fixes by @cliffburdick in #254
- Fixed size on conv and added benchmarks by @cliffburdick in #256
- Adding unit tests for collapse with remap by @luitjens in #255
- Collapse tests by @luitjens in #257
- adding madd function to improve convolution throughput by @luitjens in #258
- Conv opt by @luitjens in #259
- Fixed compiler errors in release mode by @cliffburdick in #261
- Add streaming make_tensor APIs. by @luitjens in #262
- adding random benchmark by @luitjens in #264
- remove deprecated APIs in make_tensor by @luitjens in #266
- Host unit tests by @luitjens in #267
- Fixed bug with FFT size shorter than length of tensor by @cliffburdick in #270
- removing unused pybind call made before pybind initialize by @tylera-nvidia in #271
- Fixed visualization tests by @cliffburdick in #275
- Fix cmake function check_python_libs. by @pkestene in #274
- Support CubSortSegmented by @tylera-nvidia in #272
- Executor cleanup. by @luitjens in #277
- Transpose operators changes by @luitjens in #278
- Remove Deprecated Shape and add metadata to Print by @tylera-nvidia in #280
- Update Documentation by @tylera-nvidia in #282
- NVTX Macros by @tylera-nvidia in #276
- Adding throw to file reading by @tylera-nvidia in #281
- Adding str() function to generators and operators by @luitjens in #283
- Added reshape op by @luitjens in #287
- 0D tensor printing was broken since 0D tensors don't have a stride by @cliffburdick in #289
- Allow hermitian to take any rank by @cliffburdick in #292
- Hermitian nd by @cliffburdick in #293
- Fixed batched inverse by @cliffburdick in #294
- Added 4D matmul unit test and fixed batching bug by @cliffburdick in #297
- Fixing batched half precision complex GEMM by @cliffburdick in #298
- Rename simple_pipeline to simple_radar_pipeline for added clarity by @awthomp in #299
- Remove cuda::std::min/max by @cliffburdick in #301
- Fixed chained concatenations by @cliffburdick ...
v0.2.5: Minor fix on name collision
Changed MAX name to not collide with other libraries (#162)
Minor fix
Fixed argmin initialization issue that sometimes gave wrong results