Uploaded to facilitate the discussion for
https://github.com/iree-org/iree/issues/10876. It is yet to be decided
whether this is the right approach.
For context, I would like to vectorize the following snippet:
```mlir
#map0 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>

func.func @generic_interchanged_transpose(%arg0: tensor<128x12x32xf32>, %arg3: index) -> tensor<128x12x32xf32> {
  %0 = tensor.empty() : tensor<128x12x32xf32>
  %1 = linalg.generic {indexing_maps = [#map0, #map0],
                       iterator_types = ["parallel", "parallel", "parallel"]}
      ins(%arg0 : tensor<128x12x32xf32>)
      outs(%0 : tensor<128x12x32xf32>) {
  ^bb0(%arg1: f32, %arg2: f32):
    %2 = linalg.index 2 : index
    %12 = affine.apply affine_map<(d0, d1) -> (d0 + d1)>(%2, %arg3)
    %3 = arith.index_cast %12 : index to i32
    %4 = arith.uitofp %3 : i32 to f32
    %5 = arith.mulf %4, %arg1 : f32
    linalg.yield %5 : f32
  } -> tensor<128x12x32xf32>
  return %1 : tensor<128x12x32xf32>
}

transform.sequence failures(propagate) {
^bb1(%arg1: !pdl.operation):
  %0 = transform.structured.match ops{["linalg.generic"]} in %arg1
  %1 = get_closest_isolated_parent %0 : (!pdl.operation) -> !pdl.operation
  %2 = transform.structured.vectorize %1 { vectorize_nd_extract }
}
```
So far I am clearly doing something wrong:
```
$ mlir-opt -test-transform-dialect-interpreter -split-input-file input.mlir
within split at input.mlir:1 offset :11:10: error: 'linalg.index' op operation destroyed but still has uses
  %2 = linalg.index 2 : index
       ^
within split at input.mlir:1 offset :11:10: note: see current operation: %0 = "linalg.index"() {dim = 2 : i64} : () -> index
```
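For reference, this diagnostic is MLIR's use-after-erase check: an operation was destroyed while other operations still referenced its results. Below is a minimal sketch of the failure mode and the usual fix; the helper and op names are hypothetical, not taken from the patch:

```cpp
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

/// Sketch only: `indexOp` stands for an op (e.g. the linalg.index above)
/// whose result is still used further down in the block.
static void sketch(RewriterBase &rewriter, Operation *indexOp,
                   Value replacement) {
  // Buggy order: erasing while uses remain triggers
  // "operation destroyed but still has uses" in debug builds.
  //   rewriter.eraseOp(indexOp);
  // Fix: rewire the uses first; replaceOp does both steps safely.
  rewriter.replaceOp(indexOp, replacement);
}
```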
@dcaballe I've just noticed that you've recently added this comment - is that what's happening here?
Thanks for adding support for this! It looks great! A few comments:
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp:

- Line 1089: nit: `ii` typo.
- Line 1095: When we have to change the insertion point, we use `OpBuilder::InsertionGuard` for RAII. I'd suggest moving this into a utility function so that the guard's RAII scope is the function itself; you can look at other examples in MLIR, just search for `OpBuilder::InsertionGuard`. This IR change could be part of a larger set of "linalgOp pre-processing" transformations that happen right before vectorization starts, but after we know we can vectorize the op. (A sketch of the pattern follows this list.)
- Lines 1101–1107: Couldn't we just do `rewriter.replaceOp(op, expanded)` and avoid the manual use-def chain update?

mlir/test/Dialect/Linalg/vectorization.mlir:

- Line 2039: This looks like a new feature to me more than a regression. I think we should match the decomposed ops and make sure they are vectorized accordingly.
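To make the suggestion for line 1095 concrete, here is a minimal sketch of the RAII pattern, assuming a hypothetical helper name and body; this is not the actual patch:

```cpp
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/IR/Builders.h"

using namespace mlir;

/// Hypothetical pre-processing helper. The InsertionGuard saves the
/// builder's insertion point on entry and restores it via RAII when the
/// function returns, so callers never observe a stale insertion point.
static void preProcessLinalgOp(OpBuilder &builder, linalg::GenericOp op) {
  OpBuilder::InsertionGuard guard(builder);
  builder.setInsertionPointToStart(&op.getRegion().front());
  // ... materialize the expanded linalg.index / affine.apply computations ...
}
```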
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp:

- Line 1095: I like this idea :) Just to double-check: that set of pre-processing transformations is yet to be created, right?
- Lines 1101–1107: Perhaps I'm being daft, but things go horribly wrong when I do that. I assume that's because `rewriter.replaceOp` invalidates the iterators in the surrounding `for` loop (see the sketch below).
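To illustrate the concern, a minimal sketch of the usual workaround, with a hypothetical helper name (this is not the actual patch): `llvm::make_early_inc_range` advances the iterator before the loop body runs, so `rewriter.replaceOp` can erase the current op safely:

```cpp
#include "llvm/ADT/STLExtras.h"
#include "mlir/Dialect/Affine/IR/AffineOps.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

/// Hypothetical helper: replace each affine.apply in `block` with
/// pre-computed `expanded` values. replaceOp erases the matched op, which
/// would invalidate a plain block iterator mid-loop; make_early_inc_range
/// steps past the current op before the body executes.
static void replaceApplies(RewriterBase &rewriter, Block &block,
                           ValueRange expanded) {
  for (Operation &op : llvm::make_early_inc_range(block))
    if (isa<AffineApplyOp>(&op))
      rewriter.replaceOp(&op, expanded);
}
```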
Awesome!
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp:

- Line 1052: nit: `//` -> `///` and `.` at the end, per the coding guidelines.
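For reference, the nit asks for a change along these lines; the comment text is made up for illustration, the actual text at line 1052 differs:

```cpp
// Before: plain comment, no trailing period
// expand linalg index ops

/// After: Doxygen-style comment with a trailing period, per the LLVM
/// coding standards.
/// Expand linalg index ops.
```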
We are seeing some correctness issues when integrating this commit. We attempted a fix in https://reviews.llvm.org/D143243, but it looks like something else is still not working.

For the following input (note the `affine.apply` inside the `linalg.generic` body):
```mlir
func.func @_iota_dim0_dispatch_0_generic_2x3() {
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %c3 = arith.constant 3 : index
  %c2 = arith.constant 2 : index
  %c64 = arith.constant 64 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c64) : !flow.dispatch.tensor<writeonly:tensor<2x3xf32>>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %workgroup_id_y = hal.interface.workgroup.id[1] : index
  %workgroup_count_y = hal.interface.workgroup.count[1] : index
  %1 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_id_y]
  %2 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_count_y]
  %3 = affine.apply affine_map<()[s0] -> (s0 * 3)>()[%workgroup_id_x]
  %4 = affine.apply affine_map<()[s0] -> (s0 * 3)>()[%workgroup_count_x]
  scf.for %arg0 = %1 to %c2 step %2 {
    scf.for %arg1 = %3 to %c3 step %4 {
      %5 = flow.dispatch.tensor.load %0, offsets = [%arg0, %arg1], sizes = [2, 3], strides = [1, 1] : !flow.dispatch.tensor<writeonly:tensor<2x3xf32>> -> tensor<2x3xf32>
      %6 = scf.for %arg2 = %c0 to %c2 step %c1 iter_args(%arg3 = %5) -> (tensor<2x3xf32>) {
        %extracted_slice = tensor.extract_slice %arg3[%arg2, 0] [1, 3] [1, 1] : tensor<2x3xf32> to tensor<1x3xf32>
        %7 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} outs(%extracted_slice : tensor<1x3xf32>) attrs = {__internal_linalg_transform__ = "1", lowering_config = #iree_codegen.lowering_config<tile_sizes = [[2, 3], [1, 16], [0, 0]]>} {
        ^bb0(%out: f32):
          %8 = linalg.index 0 : index
          %9 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 + d2)>(%arg0, %8, %arg2)
          %10 = arith.index_cast %9 : index to i32
          %11 = arith.sitofp %10 : i32 to f32
          linalg.yield %11 : f32
        } -> tensor<1x3xf32>
        %inserted_slice = tensor.insert_slice %7 into %arg3[%arg2, 0] [1, 3] [1, 1] : tensor<1x3xf32> into tensor<2x3xf32>
        scf.yield %inserted_slice : tensor<2x3xf32>
      }
      flow.dispatch.tensor.store %6, %0, offsets = [%arg0, %arg1], sizes = [2, 3], strides = [1, 1] : tensor<2x3xf32> -> !flow.dispatch.tensor<writeonly:tensor<2x3xf32>>
    }
  }
  return
}
```
It looks like some dimensions are missing after the expansion:
```mlir
func.func @_iota_dim1_dispatch_0_generic_2x3() {
  %cst = arith.constant dense<[0, 1, 2]> : vector<3xindex>
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %c3 = arith.constant 3 : index
  %c2 = arith.constant 2 : index
  %c64 = arith.constant 64 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c64) : !flow.dispatch.tensor<writeonly:tensor<2x3xf32>>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %workgroup_id_y = hal.interface.workgroup.id[1] : index
  %workgroup_count_y = hal.interface.workgroup.count[1] : index
  %1 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_id_y]
  %2 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_count_y]
  %3 = affine.apply affine_map<()[s0] -> (s0 * 3)>()[%workgroup_id_x]
  %4 = affine.apply affine_map<()[s0] -> (s0 * 3)>()[%workgroup_count_x]
  scf.for %arg0 = %1 to %c2 step %2 {
    scf.for %arg1 = %3 to %c3 step %4 {
      %5 = flow.dispatch.tensor.load %0, offsets = [%arg0, %arg1], sizes = [2, 3], strides = [1, 1] : !flow.dispatch.tensor<writeonly:tensor<2x3xf32>> -> tensor<2x3xf32>
      %6 = scf.for %arg2 = %c0 to %c2 step %c1 iter_args(%arg3 = %5) -> (tensor<2x3xf32>) {
        %7 = vector.broadcast %arg1 : index to vector<3xindex>
        %8 = arith.addi %7, %cst : vector<3xindex>
        %9 = arith.index_cast %8 : vector<3xindex> to vector<3xi32>
        %10 = arith.sitofp %9 : vector<3xi32> to vector<3xf32>
        %11 = vector.broadcast %10 : vector<3xf32> to vector<1x3xf32>
        %12 = vector.transfer_write %11, %arg3[%arg2, %c0] {in_bounds = [true, true]} : vector<1x3xf32>, tensor<2x3xf32>
        scf.yield %12 : tensor<2x3xf32>
      }
      flow.dispatch.tensor.store %6, %0, offsets = [%arg0, %arg1], sizes = [2, 3], strides = [1, 1] : tensor<2x3xf32> -> !flow.dispatch.tensor<writeonly:tensor<2x3xf32>>
    }
  }
  return
}
```
I didn't have time to look into the details, but I'm going to revert this for now so that we have a healthy ToT and nobody else hits the same problem.
Sorry for the inconvenience.
@dcaballe It actually seems fine to me 🤔. I reduced your example to:
```mlir
module {
  func.func @_iota_dim0_dispatch_0_generic_2x3(%1: index, %2: index, %3: index, %4: index, %5: tensor<2x3xf32>) -> tensor<2x3xf32> {
    %c3 = arith.constant 3 : index
    %6 = scf.for %arg1 = %3 to %c3 step %4 iter_args(%arg3 = %5) -> (tensor<2x3xf32>) {
      %extracted_slice = tensor.extract_slice %arg3[%arg1, 0] [1, 3] [1, 1] : tensor<2x3xf32> to tensor<1x3xf32>
      %7 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} outs(%extracted_slice : tensor<1x3xf32>) {
      ^bb0(%out: f32):
        %8 = linalg.index 0 : index
        %9 = affine.apply affine_map<(d0, d1) -> (d0 + d1)>(%arg1, %8)
        %10 = arith.index_cast %9 : index to i32
        %11 = arith.sitofp %10 : i32 to f32
        linalg.yield %11 : f32
      } -> tensor<1x3xf32>
      %inserted_slice = tensor.insert_slice %7 into %arg3[%arg1, 0] [1, 3] [1, 1] : tensor<1x3xf32> into tensor<2x3xf32>
      scf.yield %inserted_slice : tensor<2x3xf32>
    }
    return %6 : tensor<2x3xf32>
  }

  transform.sequence failures(propagate) {
  ^bb0(%arg0: !pdl.operation):
    %0 = transform.structured.match ops{["linalg.generic"]} in %arg0 : (!pdl.operation) -> !pdl.operation
    %1 = get_closest_isolated_parent %0 : (!pdl.operation) -> !pdl.operation
    %2 = transform.structured.vectorize %1 {vectorize_nd_extract}
  }
}
```
Here's the output:
```
$ bin/mlir-opt -test-transform-dialect-interpreter file.mlir
module {
  func.func @_iota_dim0_dispatch_0_generic_2x3(%arg0: index, %arg1: index, %arg2: index, %arg3: index, %arg4: tensor<2x3xf32>) -> tensor<2x3xf32> {
    %c0 = arith.constant 0 : index
    %c3 = arith.constant 3 : index
    %0 = scf.for %arg5 = %arg2 to %c3 step %arg3 iter_args(%arg6 = %arg4) -> (tensor<2x3xf32>) {
      %1 = arith.index_cast %arg5 : index to i32
      %2 = arith.sitofp %1 : i32 to f32
      %3 = vector.broadcast %2 : f32 to vector<1x3xf32>
      %4 = vector.transfer_write %3, %arg6[%arg5, %c0] {in_bounds = [true, true]} : vector<1x3xf32>, tensor<2x3xf32>
      scf.yield %4 : tensor<2x3xf32>
    }
    return %0 : tensor<2x3xf32>
  }

  transform.sequence failures(propagate) {
  ^bb0(%arg0: !pdl.operation):
    %0 = transform.structured.match ops{["linalg.generic"]} in %arg0 : (!pdl.operation) -> !pdl.operation
    %1 = get_closest_isolated_parent %0 : (!pdl.operation) -> !pdl.operation
    %2 = transform.structured.vectorize %1 {vectorize_nd_extract}
  }
}
```
Similarly to your example, `linalg.index 0 : index` yields a single value, 0. Hence, that "element" of the `affine.apply` is not present in the output after the expansion. The same thing happens in your example; that's why there's only one `arith.addi` despite the `affine.apply` op adding three values.
Does this make sense to you?