-
Hello, this RFC is interesting, but I think I disagree with its entire premise. A couple of specific points:
- I don't think this is true. While TritonGPU -> LLVM is indeed abrupt when it comes to shared memory management, the rest actually just implements the semantics of
- I don't think that this is a reasonable assumption. The Triton memory model opens up a lot of interesting possibilities (e.g., linked lists, tensors of trees, hashmaps, etc.) that are just fundamentally incompatible with

I can see how a Triton -> Linalg conversion pass makes sense if you have some hardware whose compiler already relies on
-
@sethbrin you may be interested in #1797. There is no plan to merge this into the main branch soon (it would probably live in a fork), but it should solve a lot of your problems.
-
Hi all,
This is an RFC for adding a conversion from the Triton dialect to the Linalg dialect.
Background
The current lowering of TritonGPU -> LLVM is too direct and abrupt, which may make it hard to debug or analyze when there is a codegen bug or a performance issue.
The MLIR community provides a progressive lowering path for codegen, which decomposes the full codegen pipeline into the
Linalg -> Vector -> GPU -> LLVM
dialects, and defines various transform dialect mechanisms for codegen on the Linalg dialect, including tiling, reduction, promotion, padding, interchange, fusion, etc. If we can build a bridge between the Triton dialect and the Linalg dialect, we can make use of the MLIR community infrastructure, reduce the complexity of adding a new backend, and let every backend reuse the backend-independent optimization passes.
Proposal
Triton currently supports very flexible pointer operations, but as far as we can tell, the Triton repository only uses pointers to element types (please correct us if this is wrong). If we encounter pointers of pointers, it is hard for the backend to do contiguity analysis, and performance is difficult to optimize.
We think pointer-of-pointer use cases are rare, so the following design only targets pointers to element types. The last section also briefly discusses how a Linalg program can deal with pointers of pointers.
Auxiliary Dialect
A pointer type can be thought of as an offset relative to a base address, so it can be expressed on tensors. For this purpose we introduce an auxiliary dialect, which contains the `tensor_view` and `store` operators.
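As a minimal sketch of what we have in mind (the op syntax and the `aux.` prefix below are illustrative, not finalized definitions):

```mlir
// Illustrative only: reinterpret the memory behind %base, starting at element
// offset %off, as a 1-D tensor of 128 contiguous f32 elements.
%view = aux.tensor_view %base[%off] : (!tt.ptr<f32>, index) -> tensor<128xf32>

// Illustrative only: write %result back to the same memory region.
aux.store %result, %base[%off] : tensor<128xf32>, !tt.ptr<f32>
```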
LinalgExt Dialect
Given the offset corresponding to a pointer, the `tt.load` operator semantically takes data from a set of pointer addresses and assembles it into a tensor, which is essentially `tensor.gather`. However, the official `tensor.gather` does not support mask/other, so we define a LinalgExt dialect as a supplement to the Linalg dialect. (The main reason we do not introduce this operator in the Tensor dialect is that the infrastructure on Linalg, such as promotion, fusion, etc., is currently not available on the Tensor dialect.)
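As an illustration (the exact syntax below is a sketch, not a finalized op definition), the gather we have in mind carries the mask and the `other` value directly on the op:

```mlir
// Illustrative sketch: gather elements of %src at %indices; %mask selects the
// valid lanes and %other supplies the value for masked-off lanes.
%res = linalg_ext.gather
         ins(%src, %indices, %mask, %other
             : tensor<1024xf32>, tensor<128xi32>, tensor<128xi1>, f32)
         outs(%init : tensor<128xf32>) -> tensor<128xf32>
```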
Conversion from Triton to Linalg
We give a simple example to describe the conversion algorithm.
Dim Mapping
A Triton kernel expresses the computational logic inside a single block and uses `tt.get_program_id`/`tt.get_num_programs` to obtain the index information. In order to have the same expression on all backends and to reuse the kernel outlining pass in the GPU backend, we recover this structure by carrying the dim mapping information on an `scf.forall` op.
For example:
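The sketch below illustrates the intended structure (op syntax abbreviated; the `mapping` attribute shown is one possible choice):

```mlir
// Before: the block index is read explicitly inside the kernel body.
%pid = tt.get_program_id x : i32

// After (sketch): the kernel body is wrapped in an scf.forall whose induction
// variable plays the role of the program id, so the GPU outlining pass can
// later turn it back into a launch dimension.
scf.forall (%pid) in (%num_programs) {
  // ... converted kernel body, indexed by %pid ...
} {mapping = [#gpu.block<x>]}
```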
Optimization
In the previous example, we can reuse `AxisInfoAnalysis` to easily determine that the access is contiguous along the whole dim, in which case we can use `tensor.extract` instead of `linalg_ext.gather` and simply compute the offset of the first address as a scalar.
After conversion:
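A rough sketch of the intended result (not exact IR; `tensor.extract_slice` here stands in for whatever contiguous view the backend prefers):

```mlir
// Sketch: once AxisInfoAnalysis proves the offsets are contiguous, only the
// first offset is needed as a scalar ...
%first = tensor.extract %offsets[%c0] : tensor<128xi32>
%off   = arith.index_cast %first : i32 to index
// ... and the gather collapses into a contiguous slice of the base view.
%vals  = tensor.extract_slice %base_view[%off] [128] [1]
           : tensor<?xf32> to tensor<128xf32>
```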
At this point, we are done converting Triton to the Linalg dialect, and we can use the transform ops defined on Linalg to optimize the operator.
The previous example deals with the one-dimensional contiguous case. For matrix multiplication, which actually has two dimensions, we need to enhance `AxisInfoAnalysis` by introducing strides information, mapping the `d`-th dimension to the length of the shortest sequence with the same stride. Below is part of a matmul's tt IR snippet.
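The snippet below is a hand-abbreviated reconstruction of the usual 2-D pointer arithmetic for one operand, not an exact listing (types and attributes shortened):

```mlir
// Row offsets are scaled by %stride_am, column offsets have stride 1, so the
// innermost dimension is contiguous while the outer one is strided.
%rm    = tt.make_range {start = 0 : i32, end = 128 : i32} : tensor<128xi32>
%rk    = tt.make_range {start = 0 : i32, end = 64 : i32}  : tensor<64xi32>
%rm2   = tt.expand_dims %rm {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
%sam   = tt.splat %stride_am : i32 -> tensor<128x1xi32>
%rowo  = arith.muli %rm2, %sam : tensor<128x1xi32>
%rk2   = tt.expand_dims %rk {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
%rowb  = tt.broadcast %rowo : tensor<128x1xi32> -> tensor<128x64xi32>
%colb  = tt.broadcast %rk2  : tensor<1x64xi32>  -> tensor<128x64xi32>
%offs  = arith.addi %rowb, %colb : tensor<128x64xi32>
%abase = tt.splat %a_ptr : !tt.ptr<f16> -> tensor<128x64x!tt.ptr<f16>>
%aptrs = tt.addptr %abase, %offs : tensor<128x64x!tt.ptr<f16>>, tensor<128x64xi32>
%a     = tt.load %aptrs : tensor<128x64xf16>
```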
Pointer of Pointer Discussion
Since each pointer actually corresponds to a large global tensor, we represent the second-level pointer with an attribute, so that the specific value can be indexed directly according to the offset.
Take the following code as an example:
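A minimal, hypothetical sketch of such a kernel, using a tensor of second-level pointers (the nested `!tt.ptr` form here is our illustration, not code from the repository):

```mlir
// Sketch: %tbl_base points into a table of pointers; each loaded value is
// itself a pointer that is dereferenced again to fetch the actual data.
%tbl_ptrs = tt.addptr %tbl_base, %offs : tensor<128x!tt.ptr<!tt.ptr<f32>>>, tensor<128xi32>
%ptrs     = tt.load %tbl_ptrs : tensor<128x!tt.ptr<f32>>
%vals     = tt.load %ptrs     : tensor<128xf32>
```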
A possible form of `linalg_ext.gather` for this case is sketched below.
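The attribute name `indirect` below is only an illustration of how the second level of indirection could be recorded on the op:

```mlir
// Sketch: %tbl holds the second-level offsets; the gather first reads %tbl at
// %indices and then follows the resulting offsets into the global data tensor
// %data. Masked-off lanes take %other, as before.
%res = linalg_ext.gather {indirect = true}
         ins(%data, %tbl, %indices, %mask, %other
             : tensor<?xf32>, tensor<?xi64>, tensor<128xi32>, tensor<128xi1>, f32)
         outs(%init : tensor<128xf32>) -> tensor<128xf32>
```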