Example Conv2D in Triton #591
Replies: 4 comments, 4 replies
-
I think you can try starting from here. BTW, regarding conv performance, do you have any previous empirical data to share? @ptillet
-
Two years have passed. Are there any updates?
-
@sebastienwood @ptillet You wrote: "The associated paper proposed a Conv2d implementation in C-Triton." I implemented one in Python, but its performance is poor and I do not know how to improve it. Could you tell me which paper you mean? I would like to refer to it to improve performance. https://github.com/l1351868270/implicit_gemm.triton/blob/main/triton_implicit_gemm.py
-
I found out why the performance is poor: when data is loaded from global memory to shared memory, the PTX code does not use the cp.async feature. Can I force tl.load to compile to cp.async? PS: command:
-
Hi!
I'm interested in implementing a Conv-Nd-like operation in Triton. I'm pretty new to GPU programming and Triton. The associated paper proposed a Conv2d implementation in C-Triton. Is there a Python equivalent available?
My main technical questions lie in the computation of the offsets and the general optimization strategy. Should there be two or more program ids? (I'm not sure what `tl.program_id` is used for, and the documentation is not very clear on it.) My end goal is not amenable to unfolding. To be precise, I only need to compute the following equation (+ batched):
$Z_{a, b, c, a, d, e} = \sum_g \sum_i \sum_j \gamma_{g, b+i, c+j, g, d+i, e+j}$
Let me know if I can provide further information!
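To make the offset question concrete, here is a minimal CPU sketch in NumPy of how a 2D grid of program ids could tile a plain Conv2d written as an implicit GEMM (stride 1, no padding, NCHW layout; these choices are assumptions, not something stated in the thread). It is not Triton code and not the implementation from the paper or the linked repository: the two Python loops stand in for the launch grid, `pid_m` and `pid_n` stand in for `tl.program_id(0)` and `tl.program_id(1)`, `np.arange` stands in for `tl.arange`, and the boolean masks play the role of the `mask=` argument of `tl.load` / `tl.store`. All function names and block sizes are hypothetical.

```python
import numpy as np

def conv2d_tile(x_flat, w_flat, y_flat, pid_m, pid_n, dims,
                BLOCK_M=16, BLOCK_N=8, BLOCK_K=8):
    """Work done by ONE program: a BLOCK_M x BLOCK_N tile of the output GEMM."""
    N, C, H, W, K, R, S = dims
    P, Q = H - R + 1, W - S + 1          # output height / width
    M, KRED = N * P * Q, C * R * S       # GEMM rows, reduction length

    # Rows of the GEMM are output pixels (n, p, q); columns are output channels.
    offs_m = pid_m * BLOCK_M + np.arange(BLOCK_M)
    offs_n = pid_n * BLOCK_N + np.arange(BLOCK_N)
    n = offs_m // (P * Q)
    p = (offs_m % (P * Q)) // Q
    q = offs_m % Q

    acc = np.zeros((BLOCK_M, BLOCK_N), dtype=x_flat.dtype)
    for k0 in range(0, KRED, BLOCK_K):
        # The reduction index decomposes into (input channel c, filter row r, filter col s).
        offs_k = k0 + np.arange(BLOCK_K)
        c = offs_k // (R * S)
        r = (offs_k % (R * S)) // S
        s = offs_k % S

        # Gather the input tile directly from x (no unfolded buffer is materialized).
        a_idx = ((n[:, None] * C + c[None, :]) * H * W
                 + (p[:, None] + r[None, :]) * W
                 + (q[:, None] + s[None, :]))
        a_mask = (offs_m[:, None] < M) & (offs_k[None, :] < KRED)
        a = np.where(a_mask, x_flat[np.where(a_mask, a_idx, 0)], 0.0)

        # Weights are (K, C, R, S) contiguous, so (c, r, s) is already the flat offs_k.
        b_idx = offs_n[None, :] * KRED + offs_k[:, None]
        b_mask = (offs_k[:, None] < KRED) & (offs_n[None, :] < K)
        b = np.where(b_mask, w_flat[np.where(b_mask, b_idx, 0)], 0.0)

        acc += a @ b

    # Masked store into y[n, k, p, q].
    y_idx = (n[:, None] * K + offs_n[None, :]) * P * Q + (p * Q + q)[:, None]
    y_mask = (offs_m[:, None] < M) & (offs_n[None, :] < K)
    y_flat[y_idx[y_mask]] = acc[y_mask]

def conv2d_implicit_gemm(x, w, BLOCK_M=16, BLOCK_N=8, BLOCK_K=8):
    N, C, H, W = x.shape
    K, _, R, S = w.shape
    P, Q = H - R + 1, W - S + 1
    y = np.zeros((N, K, P, Q), dtype=x.dtype)
    x_flat = np.ascontiguousarray(x).reshape(-1)
    w_flat = np.ascontiguousarray(w).reshape(-1)
    y_flat = y.reshape(-1)               # view: writes land in y
    grid = (-(-(N * P * Q) // BLOCK_M), -(-K // BLOCK_N))  # ceil-divide
    for pid_m in range(grid[0]):         # on a GPU these two loops are the launch grid
        for pid_n in range(grid[1]):
            conv2d_tile(x_flat, w_flat, y_flat, pid_m, pid_n,
                        (N, C, H, W, K, R, S), BLOCK_M, BLOCK_N, BLOCK_K)
    return y

def conv2d_reference(x, w):
    """Direct cross-correlation, used only to check the tiled version."""
    N, C, H, W = x.shape
    K, _, R, S = w.shape
    P, Q = H - R + 1, W - S + 1
    y = np.zeros((N, K, P, Q), dtype=x.dtype)
    for r in range(R):
        for s in range(S):
            y += np.einsum("nchw,kc->nkhw", x[:, :, r:r + P, s:s + Q], w[:, :, r, s])
    return y
```

In this sketch two program ids are enough: one indexes tiles of output pixels and the other indexes tiles of output channels. A single flattened program id would also work, at the cost of an extra division to recover the two tile coordinates. The same index decomposition idea should carry over to other contractions, including ones that cannot be unfolded, by changing how `offs_m` and `offs_k` are split into tensor coordinates.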