Add safe methods `set_pointer_mode` and `get_pointer_mode` #291
Adds two safe methods to `CudaBlas`:

- `set_pointer_mode`: wraps https://docs.nvidia.com/cuda/cublas/#cublassetpointermode
- `get_pointer_mode`: wraps https://docs.nvidia.com/cuda/cublas/#cublasgetpointermode

There is also a test to ensure it works as expected.
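To make the shape of such a safe wrapper concrete, here is a minimal, self-contained sketch. It does not use the real bindings: `mock_sys`, `MockBlas`, `PointerMode`, and `BlasError` are hypothetical stand-ins invented for illustration, and the actual signatures in this PR may differ. The only detail taken from cuBLAS itself is that the host and device pointer modes are the enum values 0 and 1.

```rust
// Hypothetical stand-in for the raw bindings. A real build would call
// cublasSetPointerMode / cublasGetPointerMode through FFI instead.
mod mock_sys {
    // Mirrors CUBLAS_POINTER_MODE_HOST = 0 and CUBLAS_POINTER_MODE_DEVICE = 1.
    pub const POINTER_MODE_HOST: u32 = 0;
    pub const POINTER_MODE_DEVICE: u32 = 1;
    // Mock status codes; only "zero means success" matters for this sketch.
    pub const STATUS_SUCCESS: u32 = 0;
    pub const STATUS_INVALID_VALUE: u32 = 7;

    /// Mock of the opaque library handle; it only remembers the pointer mode.
    pub struct Handle {
        pub mode: u32,
    }

    pub fn set_pointer_mode(handle: &mut Handle, mode: u32) -> u32 {
        if mode != POINTER_MODE_HOST && mode != POINTER_MODE_DEVICE {
            return STATUS_INVALID_VALUE;
        }
        handle.mode = mode;
        STATUS_SUCCESS
    }

    pub fn get_pointer_mode(handle: &Handle, out: &mut u32) -> u32 {
        *out = handle.mode;
        STATUS_SUCCESS
    }
}

/// Typed view of the pointer mode, so callers never pass raw integers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PointerMode {
    Host,
    Device,
}

/// Carries the raw status code of a failed call.
#[derive(Debug, PartialEq, Eq)]
pub struct BlasError(pub u32);

/// Hypothetical safe wrapper that owns the mock handle.
pub struct MockBlas {
    handle: mock_sys::Handle,
}

impl MockBlas {
    /// New handles start in host pointer mode, matching the cuBLAS default.
    pub fn new() -> Self {
        Self {
            handle: mock_sys::Handle {
                mode: mock_sys::POINTER_MODE_HOST,
            },
        }
    }

    /// Converts the typed mode to its raw value and maps the status to a Result.
    pub fn set_pointer_mode(&mut self, mode: PointerMode) -> Result<(), BlasError> {
        let raw = match mode {
            PointerMode::Host => mock_sys::POINTER_MODE_HOST,
            PointerMode::Device => mock_sys::POINTER_MODE_DEVICE,
        };
        match mock_sys::set_pointer_mode(&mut self.handle, raw) {
            mock_sys::STATUS_SUCCESS => Ok(()),
            status => Err(BlasError(status)),
        }
    }

    /// Reads the raw mode back and converts it to the typed enum.
    pub fn get_pointer_mode(&self) -> Result<PointerMode, BlasError> {
        let mut raw = 0u32;
        match mock_sys::get_pointer_mode(&self.handle, &mut raw) {
            mock_sys::STATUS_SUCCESS => {}
            status => return Err(BlasError(status)),
        }
        match raw {
            mock_sys::POINTER_MODE_HOST => Ok(PointerMode::Host),
            mock_sys::POINTER_MODE_DEVICE => Ok(PointerMode::Device),
            other => Err(BlasError(other)),
        }
    }
}

fn main() {
    let mut blas = MockBlas::new();
    println!("initial mode: {:?}", blas.get_pointer_mode());
    blas.set_pointer_mode(PointerMode::Device).unwrap();
    println!("after set:    {:?}", blas.get_pointer_mode());
}
```

The point of the pattern is that callers work with a typed enum and a `Result`, while raw status codes and integer constants stay behind the wrapper.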
This is important to have, as some cuBLAS functions require `CUBLAS_POINTER_MODE_DEVICE` to be set when device memory is passed in as the result buffer. If the `cublasPointerMode_t` is left at the default, `CUBLAS_POINTER_MODE_HOST`, in that case, the call crashes with `SIGSEGV: invalid memory reference`.

I discovered this behavior while trying to use the `cublas<t>asum()` function (https://docs.nvidia.com/cuda/cublas/index.html#cublas-t-asum).

Here is an example that illustrates the importance of setting the value properly:
Happy to include the test example in the PR as well if desired.
PS: Setting `CUBLAS_POINTER_MODE_DEVICE` also increases performance by 50%, as I can show with a benchmark. I'm not sure why there is such a big gain, but it happens for the `dot()` function as well, even though the inputs and outputs use device memory both times. The `dot()` function shows a decrease in execution time from 9 µs to 4 µs across a whole range of slice lengths, including (but not limited to) 8192 elements. I assume it has to do with asynchronous memory copies.