How to get a slice of a tensor from tl.load()? #1313

zw2326 · 2023-03-11T00:23:39Z

Noob question - say I tl.load() a 2D tensor A of shape [M, N], how do I get the 2nd row?

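import triton
import triton.language as tl

@triton.jit
def test_kernel(a_ptr, BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr):
    # Element offsets relative to a_ptr (the base pointer is added once, in tl.load)
    offsets = tl.arange(0, BLOCK_SIZE_M * BLOCK_SIZE_N)
    a_data = tl.load(a_ptr + offsets)
    # How to reference 2nd row from a_data?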

Answered by tristanheywood

tristanheywood · 2023-03-12T23:51:04Z

Unfortunately, Triton does not currently support indexing, so there is no good way to access the second row of an already-loaded tensor. For your test kernel, you could get around this by calling tl.load() again to load the 2nd row independently. In general, you can use a combination of tl.store() and tl.load() to perform indexing; however, this will likely result in poor performance.

For some context around why indexing isn't supported: from what I understand, each kernel instance (i.e. each program id) will be automatically parallelized across multiple GPU threads. So you write your kernel to parallelize the given task between a grid of kernel instances, and Triton further parallelizes each kernel instance. My impression is that this automatic parallelization relies on each block/tensor of data being indivisible, and supporting indexing would make the parallelization process much more difficult.
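
To make the "load the 2nd row independently" suggestion concrete: it comes down to offset arithmetic. Below is a minimal plain-Python sketch (not a Triton kernel) of the flat offsets that such a second load would read, assuming a row-major layout. The helper `row_offsets` and the parameters `stride_m` and `stride_n` are hypothetical names introduced here for illustration.

```python
def row_offsets(row, n_cols, stride_m, stride_n=1):
    """Flat element offsets of one row of a strided 2D tensor."""
    return [row * stride_m + col * stride_n for col in range(n_cols)]


# Hypothetical 3x4 row-major tensor stored as a flat buffer:
# element (i, j) lives at flat index i * N + j.
M, N = 3, 4
flat = list(range(M * N))

# Offsets of the 2nd row (row index 1); stride_m equals N for a contiguous tensor.
offsets = row_offsets(row=1, n_cols=N, stride_m=N)

# Reading the buffer at those offsets yields the 2nd row.
second_row = [flat[o] for o in offsets]
```

Inside a kernel, the analogous expression would be along the lines of tl.load(a_ptr + 1 * stride_m + tl.arange(0, BLOCK_SIZE_N) * stride_n), assuming the strides are passed in as extra kernel arguments and BLOCK_SIZE_N is a tl.constexpr.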

brandus1 · 2023-07-24T08:18:15Z

Hi,

Based on this reply, I take it there is no efficient way to do something like this, right?

ex -= (hy[:, :, 1:] - hy[:, :, :-1]) - (hz[:, 1:, :] - hz[:, :-1, :])
ey -= (hz[1:, :, :] - hz[:-1, :, :]) - (hx[:, :, 1:] - hx[:, :, :-1])
ez -= (hx[:, 1:, :] - hx[:, :-1, :]) - (hy[1:, :, :] - hy[:-1, :, :])
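
Applying the "load again" suggestion from the answer above, a shifted difference such as hy[:, :, 1:] - hy[:, :, :-1] does not have to slice an already-loaded block: each output element is the difference of two reads of the same buffer, one at an offset and one at that offset plus the stride of the shifted axis. The following is a plain-Python sketch of that idea on a 2D array, not a Triton kernel; `diff_last_axis` is a hypothetical helper, and whether the corresponding two-load kernel performs well is not established in this thread.

```python
def diff_last_axis(flat, n_rows, n_cols):
    """Compute a[:, 1:] - a[:, :-1] for a row-major 2D array stored flat."""
    out = []
    for i in range(n_rows):
        for j in range(n_cols - 1):
            base = i * n_cols + j
            # Two reads of the same buffer, one element apart along the last axis.
            out.append(flat[base + 1] - flat[base])
    return out


# Hypothetical 2x4 input; the result has shape 2x3, stored flat.
a = [0, 1, 4, 9,
     16, 25, 36, 49]
d = diff_last_axis(a, n_rows=2, n_cols=4)
```

In a kernel, the two reads would become two tl.load() calls on the same base pointer, with a mask guarding the last column so that the shifted read stays in bounds.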
