This is an archive of the discontinued LLVM Phabricator instance.

[WIP] Make Linalg vectorizer lower affine.apply
ClosedPublic

Authored by awarzynski on Jan 23 2023, 8:31 AM.

Details

Summary

Uploaded to facilitate the discussion for
https://github.com/iree-org/iree/issues/10876. It is yet to be decided
whether this is the right approach.

Diff Detail

Event Timeline

awarzynski created this revision.Jan 23 2023, 8:31 AM
Herald added a project: Restricted Project.
awarzynski requested review of this revision.Jan 23 2023, 8:31 AM

For context, I would like to vectorize the following snippet:

#map0 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>

func.func @generic_interchanged_transpose(%arg0: tensor<128x12x32xf32>, %arg3: index) -> tensor<128x12x32xf32> {
  %0 = tensor.empty() : tensor<128x12x32xf32>
  %1 = linalg.generic {indexing_maps = [#map0, #map0],
                       iterator_types = ["parallel", "parallel", "parallel"]}
    ins(%arg0 : tensor<128x12x32xf32>)
    outs(%0 : tensor<128x12x32xf32>) {
  ^bb0(%arg1: f32, %arg2: f32):

    %2 = linalg.index 2 : index
    %12 = affine.apply affine_map<(d0, d1) -> (d0 + d1)>(%2, %arg3)
    %3 = arith.index_cast %12 : index to i32
    %4 = arith.uitofp %3 : i32 to f32
    %5 = arith.mulf %4, %arg1 : f32
    linalg.yield %5 : f32
  } -> tensor<128x12x32xf32>
  return %1 : tensor<128x12x32xf32>
}

transform.sequence failures(propagate) {
 ^bb1(%arg1: !pdl.operation):
   %0 = transform.structured.match ops{["linalg.generic"]} in %arg1
   %1 = get_closest_isolated_parent %0 : (!pdl.operation) -> !pdl.operation
   %2 = transform.structured.vectorize %1 { vectorize_nd_extract }
 }

So far I am doing it wrong:

mlir-opt -test-transform-dialect-interpreter -split-input-file input.mlir
within split at input.mlir:1 offset :11:10: error: 'linalg.index' op operation destroyed but still has uses
    %2 = linalg.index 2 : index
         ^
within split at input.mlir:1 offset :11:10: note: see current operation: %0 = "linalg.index"() {dim = 2 : i64} : () -> index

@dcaballe I've just noticed that you've recently added this comment - is that what's happening here?

awarzynski planned changes to this revision.Jan 23 2023, 8:36 AM

Fix the crash, add a test

Thanks for adding support for this! It looks great! A few comments

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
1089

nit: ii typo

1095

When we have to change the insertion point, we use OpBuilder::InsertionGuard for RAII. I'd suggest moving this to a utility function so that the guard's lifetime matches the scope of the function. You can look at other examples in MLIR; just search for OpBuilder::InsertionGuard.

This IR change could be part of a larger set of "linalgOp pre-processing" transformations that happen right before vectorization starts, but after we know we can vectorize the op.
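The RAII pattern being suggested can be sketched generically. This stand-in Builder/InsertionGuard pair is hypothetical and purely illustrative (in MLIR the real helper is OpBuilder::InsertionGuard, which saves and restores the builder's insertion point the same way):

```cpp
#include <cassert>

// Hypothetical stand-in for an IR builder with a movable insertion point.
struct Builder {
  int insertionPoint = 0;
};

// RAII guard: saves the insertion point on construction and restores it on
// scope exit, mirroring the utility-function pattern suggested above.
class InsertionGuard {
public:
  explicit InsertionGuard(Builder &b) : builder(b), saved(b.insertionPoint) {}
  ~InsertionGuard() { builder.insertionPoint = saved; }

private:
  Builder &builder;
  int saved;
};

int emitAtTemporaryPoint(Builder &b) {
  InsertionGuard guard(b); // save the current insertion point
  b.insertionPoint = 42;   // move it for the duration of this scope only
  return b.insertionPoint;
} // guard's destructor restores the original insertion point here
```

Because the guard restores the state in its destructor, the insertion point is reset no matter how the function returns.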

1101–1107

Couldn't we just do rewriter.replaceOp(op, expanded) and avoid the manual use-def chain update?

mlir/test/Dialect/Linalg/vectorization.mlir
2039

This looks more like a new feature to me than a regression. I think we should match the decomposed ops and make sure they are vectorized accordingly.

Move the new code into a dedicated hook, convertAffineApply, and extend the test.

awarzynski added inline comments.Jan 26 2023, 9:09 AM
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
1095

This IR change could be part of a larger set of "linalgOp pre-processing" transformations that happen right before vectorization starts, but after we know we can vectorize the op.

I like this idea :) Just to double-check - that set is yet to be created, right?

1101–1107

Perhaps I'm being daft, but things go horribly wrong when I do that. I assume that's because rewriter.replaceOp invalidates the iterators in the surrounding for loop.
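The suspected failure mode, erasing an element while iterating over its container, can be sketched generically (nothing here is MLIR-specific; std::list stands in for the op list, and the fix shown is the classic erase-safe iteration, where erase hands back the next valid iterator):

```cpp
#include <cassert>
#include <list>

// Erase elements from a list while iterating over it without invalidating
// the loop iterator: erase() returns the next valid iterator, so we never
// advance a dangling one. Naively writing ++it after an erase is the kind
// of invalidation bug suspected with replaceOp inside the loop.
void eraseEvens(std::list<int> &ops) {
  for (auto it = ops.begin(); it != ops.end();) {
    if (*it % 2 == 0)
      it = ops.erase(it); // erase and resume at the next element
    else
      ++it;
  }
}
```

An alternative, which MLIR code often uses, is to first collect the ops to rewrite into a separate vector and only then mutate the IR.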

Simplify the hook

dcaballe accepted this revision.Jan 26 2023, 10:29 AM

Awesome!

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
1052

nit: // -> /// and . at the end per coding guidelines.

This revision is now accepted and ready to land.Jan 26 2023, 10:29 AM
This revision was landed with ongoing or failed builds.Jan 27 2023, 12:31 AM
This revision was automatically updated to reflect the committed changes.

We are seeing some correctness issues when integrating this commit. We attempted a fix in https://reviews.llvm.org/D143243, but it looks like something else is still not working.
For the following affine.apply:

func.func @_iota_dim0_dispatch_0_generic_2x3() {
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %c3 = arith.constant 3 : index
  %c2 = arith.constant 2 : index
  %c64 = arith.constant 64 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c64) : !flow.dispatch.tensor<writeonly:tensor<2x3xf32>>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %workgroup_id_y = hal.interface.workgroup.id[1] : index
  %workgroup_count_y = hal.interface.workgroup.count[1] : index
  %1 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_id_y]
  %2 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_count_y]
  %3 = affine.apply affine_map<()[s0] -> (s0 * 3)>()[%workgroup_id_x]
  %4 = affine.apply affine_map<()[s0] -> (s0 * 3)>()[%workgroup_count_x]
  scf.for %arg0 = %1 to %c2 step %2 {
    scf.for %arg1 = %3 to %c3 step %4 {
      %5 = flow.dispatch.tensor.load %0, offsets = [%arg0, %arg1], sizes = [2, 3], strides = [1, 1] : !flow.dispatch.tensor<writeonly:tensor<2x3xf32>> -> tensor<2x3xf32>
      %6 = scf.for %arg2 = %c0 to %c2 step %c1 iter_args(%arg3 = %5) -> (tensor<2x3xf32>) {
        %extracted_slice = tensor.extract_slice %arg3[%arg2, 0] [1, 3] [1, 1] : tensor<2x3xf32> to tensor<1x3xf32>
        %7 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} outs(%extracted_slice : tensor<1x3xf32>) attrs =  {__internal_linalg_transform__ = "1", lowering_config = #iree_codegen.lowering_config<tile_sizes = [[2, 3], [1, 16], [0, 0]]>} {
        ^bb0(%out: f32):
          %8 = linalg.index 0 : index
          %9 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 + d2)>(%arg0, %8, %arg2)
          %10 = arith.index_cast %9 : index to i32
          %11 = arith.sitofp %10 : i32 to f32
          linalg.yield %11 : f32
        } -> tensor<1x3xf32>
        %inserted_slice = tensor.insert_slice %7 into %arg3[%arg2, 0] [1, 3] [1, 1] : tensor<1x3xf32> into tensor<2x3xf32>
        scf.yield %inserted_slice : tensor<2x3xf32>
      }
      flow.dispatch.tensor.store %6, %0, offsets = [%arg0, %arg1], sizes = [2, 3], strides = [1, 1] : tensor<2x3xf32> -> !flow.dispatch.tensor<writeonly:tensor<2x3xf32>>
    }
  }
  return
}

it looks like some dimensions are missing after the expansion:

func.func @_iota_dim1_dispatch_0_generic_2x3() {
  %cst = arith.constant dense<[0, 1, 2]> : vector<3xindex>
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %c3 = arith.constant 3 : index
  %c2 = arith.constant 2 : index
  %c64 = arith.constant 64 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c64) : !flow.dispatch.tensor<writeonly:tensor<2x3xf32>>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %workgroup_id_y = hal.interface.workgroup.id[1] : index
  %workgroup_count_y = hal.interface.workgroup.count[1] : index
  %1 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_id_y]
  %2 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_count_y]
  %3 = affine.apply affine_map<()[s0] -> (s0 * 3)>()[%workgroup_id_x]
  %4 = affine.apply affine_map<()[s0] -> (s0 * 3)>()[%workgroup_count_x]
  scf.for %arg0 = %1 to %c2 step %2 {
    scf.for %arg1 = %3 to %c3 step %4 {
      %5 = flow.dispatch.tensor.load %0, offsets = [%arg0, %arg1], sizes = [2, 3], strides = [1, 1] : !flow.dispatch.tensor<writeonly:tensor<2x3xf32>> -> tensor<2x3xf32>
      %6 = scf.for %arg2 = %c0 to %c2 step %c1 iter_args(%arg3 = %5) -> (tensor<2x3xf32>) {
        %7 = vector.broadcast %arg1 : index to vector<3xindex>
        %8 = arith.addi %7, %cst : vector<3xindex>
        %9 = arith.index_cast %8 : vector<3xindex> to vector<3xi32>
        %10 = arith.sitofp %9 : vector<3xi32> to vector<3xf32>
        %11 = vector.broadcast %10 : vector<3xf32> to vector<1x3xf32>
        %12 = vector.transfer_write %11, %arg3[%arg2, %c0] {in_bounds = [true, true]} : vector<1x3xf32>, tensor<2x3xf32>
        scf.yield %12 : tensor<2x3xf32>
      }
      flow.dispatch.tensor.store %6, %0, offsets = [%arg0, %arg1], sizes = [2, 3], strides = [1, 1] : tensor<2x3xf32> -> !flow.dispatch.tensor<writeonly:tensor<2x3xf32>>
    }
  }
  return
}

I didn't have time to look into the details, but I'm going to revert this for now so that we have a healthy ToT and nobody hits the same problem.
Sorry for the inconvenience.

awarzynski added a comment.EditedFeb 7 2023, 3:58 AM

@dcaballe It actually seems fine to me 🤔. I reduced your example to:

module {
  func.func @_iota_dim0_dispatch_0_generic_2x3(%1: index, %2: index, %3: index, %4: index, %5: tensor<2x3xf32>) -> tensor<2x3xf32>{
    %c3 = arith.constant 3 : index
    %6 = scf.for %arg1 = %3 to %c3 step %4 iter_args(%arg3 = %5) -> (tensor<2x3xf32>) {
      %extracted_slice = tensor.extract_slice %arg3[%arg1, 0] [1, 3] [1, 1] : tensor<2x3xf32> to tensor<1x3xf32>
      %7 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} outs(%extracted_slice : tensor<1x3xf32>) {
      ^bb0(%out: f32):
        %8 = linalg.index 0 : index
        %9 = affine.apply affine_map<(d0, d1) -> (d0 + d1)>(%arg1, %8)
        %10 = arith.index_cast %9 : index to i32
        %11 = arith.sitofp %10 : i32 to f32
        linalg.yield %11 : f32
      } -> tensor<1x3xf32>
      %inserted_slice = tensor.insert_slice %7 into %arg3[%arg1, 0] [1, 3] [1, 1] : tensor<1x3xf32> into tensor<2x3xf32>
      scf.yield %inserted_slice : tensor<2x3xf32>
    }
    return %6: tensor<2x3xf32>
  }

  transform.sequence  failures(propagate) {
  ^bb0(%arg0: !pdl.operation):
    %0 = transform.structured.match ops{["linalg.generic"]} in %arg0 : (!pdl.operation) -> !pdl.operation
    %1 = get_closest_isolated_parent %0 : (!pdl.operation) -> !pdl.operation
    %2 = transform.structured.vectorize %1 {vectorize_nd_extract}
  }
}

Here's the output:

$ bin/mlir-opt -test-transform-dialect-interpreter file.mlir
module {
  func.func @_iota_dim0_dispatch_0_generic_2x3(%arg0: index, %arg1: index, %arg2: index, %arg3: index, %arg4: tensor<2x3xf32>) -> tensor<2x3xf32> {
    %c0 = arith.constant 0 : index
    %c3 = arith.constant 3 : index
    %0 = scf.for %arg5 = %arg2 to %c3 step %arg3 iter_args(%arg6 = %arg4) -> (tensor<2x3xf32>) {
      %1 = arith.index_cast %arg5 : index to i32
      %2 = arith.sitofp %1 : i32 to f32
      %3 = vector.broadcast %2 : f32 to vector<1x3xf32>
      %4 = vector.transfer_write %3, %arg6[%arg5, %c0] {in_bounds = [true, true]} : vector<1x3xf32>, tensor<2x3xf32>
      scf.yield %4 : tensor<2x3xf32>
    }
    return %0 : tensor<2x3xf32>
  }
  transform.sequence  failures(propagate) {
  ^bb0(%arg0: !pdl.operation):
    %0 = transform.structured.match ops{["linalg.generic"]} in %arg0 : (!pdl.operation) -> !pdl.operation
    %1 = get_closest_isolated_parent %0 : (!pdl.operation) -> !pdl.operation
    %2 = transform.structured.vectorize %1 {vectorize_nd_extract}
  }
}

In this reduced example, linalg.index 0 : index yields a single value, 0, so that "element" of the affine.apply is not present in the output after the expansion. The same thing happens in your example (that's why there's only one arith.addi even though the affine.apply op adds three values).
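The folding in the reduced example can be replayed as a plain scalar loop nest. This is a hypothetical re-play, not MLIR code: the iteration space of the tensor<1x3xf32> region is 1x3, so linalg.index 0 only ever takes the value 0 and the affine.apply addend contributes nothing:

```cpp
#include <array>
#include <cassert>

// Scalar replay of the reduced linalg.generic body over a 1x3 region.
// Dim 0 has extent 1, so linalg.index 0 is always 0 and the result
// depends only on the loop-invariant operand arg1.
std::array<float, 3> iotaRow(int arg1) {
  std::array<float, 3> out{};
  for (int i = 0; i < 1; ++i) {   // linalg.index 0: the only value is 0
    for (int j = 0; j < 3; ++j) {
      int idx = arg1 + i;         // affine.apply (d0, d1) -> (d0 + d1)
      out[j] = static_cast<float>(idx); // arith.index_cast + arith.sitofp
    }
  }
  return out;
}
```

Every element of the row equals arg1, which matches the vectorized output above: a single index_cast/sitofp of the scalar followed by a broadcast, with no addition left.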

Does this make sense to you?