This change adds the ability to tile an operation that implements the
TilingInterface using scf.foreach_thread. The implementation is
made to be similar to the scf.for tiling in the following sense: The
transform accepts a set of tile sizes. Before creating the
scf.foreach_thread operation, it calculates the required number of
threads/tiles using ceilDiv(ub-lb, tileSize). Within the body of the
scf.foreach_thread operation, it then materializes the values
corresponding to the tile offset and tile size. Finally, the tiled
operation is inserted and the parallel_insert_slice operations are
inserted for each destination operand.
Details
- Reviewers
- nicolasvasilache - mravishankar - ThomasRaoux 
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
If someone else has a different implementation coming, I'm happy to scrap this. This is just what I've been working with for the past couple weeks as I have been trying to utilize this operation. I saw what is currently available in linalg_ext in IREE is slightly different in terms of how the loop parameters are calculated. This form would be lowered post-bufferization in a follow on diff to
%num_tiles_0 = affine.apply ...
%num_tiles_1 = affine.apply ...
scf.foreach_thread (%tid_0, %tid_1) in (%num_tiles_0, %num_tiles_1) { ... }to
%tid_0 = [operation to get threadId for threadIndexMap[0]]
%tid_1 = [operation to get threadId for threadIndexMap[0]]
%num_threads_0 = [operation to get number of processing elements]
%num_threads_1 = [operation to get number of processing elements]
scf.for %iv_0 = %tid_0 to %num_tiles_0 step %num_threads_0 {
scf.for %iv_1 = %tid_1 to %num_tiles_1 step %num_threads_1 {...}
}Where thread id/num_threads is given by a callback. The generation of the scf.for can be made optional if caller knows by convention that the number of threads will always be sufficient. It can try to fold "num_tiles" in the static case and return that information to the caller for configuration management. Otherwise, this form is sufficient for the dynamic shape case. Further, you can detect that the only uses of %iv_0 and %iv_1 should be one affine.min (for the tile size bound) and one affine.apply (to calculate %tid * %tile_size), and then use that to recover the scf.for form generated by existing distribution transformations in order to apply existing analysis/simplification paths.
Nice! Overall looks fine to me. Have a few nits. Ill probably do a second round of review later to catch things I missed in this first round.
| mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp | ||
|---|---|---|
| 417 | Nit: I think convention is to make anything "induction variable" like a dimension, and anything else a symbol. I'd expect the loopLB to be a symbol. | |
| 421 | I think having AffineMap return multiple results this way is not accepted practice in MLIR (even though affine maps can have multiple values). I think things will compose better if you just generate different affine maps, and return a SmallVector<Value> for each of the offsets. Or better yet, have a utility function that just returns/generates the affine.apply op for a given loopLB, tileSize and tileIdx and call it multiple times in the caller. | |
| 436 | Nit: I havent looked at the use cases yet but I think here too making this just return a single value for a given tileSize, ub and tileOffset makes it easier to read. | |
| 457 | Same comment as above. Modify to just handle a single loop dimension? | |
| 516 | Nit: replace with tileOffsets = llvm::to_vector(llvm::map_range(....). Same for the others. | |
| 596 | Instead of this, I think it is better to use the getResultTilePosition method of the TilingInterface to get the offsets/sizes you want? | |
| 648 | Nit: Make this a utility function for this file. I think a similar thing is used in other patterns in this file. | |
Thanks for looking into this!
In principle this is great and connecting to a bunch of pieces we have in flight at the right time.
From a practical perspective, can we make the logic related to calculate the required number of threads/tiles using ceilDiv(ub-lb, tileSize) separate and more composable? 
This should then compose with the similar transform that exists in IREE (feel free to upstream that transform and rebase on top to make sure things compose or if you are stuck by it please lmk and I can prioritize).
If we cannot connect the pieces as suggested, it would be a sign of some missing abstraction that we want to get right.
For a gradient of where this is going, see Alex's work on adding dynamic dim support to the tiling transform: https://reviews.llvm.org/D129217 and composing it to perform more aggressive transforms: https://reviews.llvm.org/D129287
Where thread id/num_threads is given by a callback...
I would go as far as trying to encore that information directly on the scf.foreach_thread op to make the folding from virtual to physical thread ids explicit in the IR.
Technically we could lower that on tensors as scf.foreach_thread + nested scf.for that are not subject to the racy tensor semantics anymore; but it may not be worth it short-term so after bufferization is perfectly fine too.
| mlir/test/Interfaces/TilingInterface/tile-using-interface.mlir | ||
|---|---|---|
| 335 ↗ | (On Diff #443092) | nit | 
| mlir/test/lib/Interfaces/TilingInterface/TestTilingInterface.cpp | ||
| 113 ↗ | (On Diff #443092) | Please add a new op to the transform interface and not avoid adding more pattern/filter-based test cases, we want to retire those. | 
I gave a first stab at extracting the transform from IREE here: https://reviews.llvm.org/D129577
From a practical perspective, can we make the logic related to calculate the required number of threads/tiles using ceilDiv(ub-lb, tileSize) separate and more composable?
If we cannot connect the pieces as suggested, it would be a sign of some missing abstraction that we want to get right.
I gave a first stab at extracting the transform from IREE here: https://reviews.llvm.org/D129577
Here's the summary of the differences:
// S1: materialize SSA value for "%numThreads" and %tileSize"
// Case 1: tile_size determines numThreads
%tileSize= arith.constant ... 
%numThreads = ceilDiv(rangeUpperBound - rangeLowerbound, %tileSize)  
                         = dialectA.virtual_num_threads lb %rangeLowerBound to %rangeUpperbound, %tileSize
// Case 2: num_threads determines tile size
%numThreads = dialectB.virtual_num_threads : index
%tileSize= ceilDiv(%rangeUpperbound-%rangeLowerBound, %numThreads)
scf.foreach_thread %i in (%numThreads) ... {
  // Determine offset
  %lb = %i * %tileSize
  // S3: Determine corrected tile size.
  // Case 1: Since we derived numThreads, we are guaranteed that %lb is not out of bounds.
  %size = min ( -%lb+ %rangeUpperBound, %tile_size)
  // Case 2: We need a max as well as a min, because we could have excessive threads
  %size = max(0, min(-%lb+ %rangeUpperbound, %tile_size)
   ...
}In both cases it would be great if there's a way to directly insert simplified expressions when a user doesn't want to create the conservative min/max ops.
I'm not sure if this is what you have in mind:
We could have an interface "TileParameterInterface" that has some methods:
- getNumThreadsOrTiles: (range) -> Value
- getTileSize: (range) -> Value
- getCorrectedTileSize(range, offset, tileSize) -> Value
- getDistributedId -> () -> Value [optional]
Both dialectA.virtual_num_threads and dialectB.virtual_num_threads can implement the interface. The last method might be optional or part of a separate distribution interface.
numThreads may or may not be produced by an op that implements this interface. If not, fall back to your implementation.
Otherwise, create %numThreads, %tileSize, and %size from the interface.
This also gives users an easy way to bypass creation of affine min/max ops for a downstream user's custom op.
A virtual thread id operation can implement both paths based on an optional tile size argument:
// S1: materialize SSA value for "%numThreads" and %tileSize"
// Case 1: tile_size determines numThreads
%tileSize= arith.constant ... 
%numThreads = gpu.virtual_num_blocks %tileSize
// Case 2: num_threads determines tile size
%numThreads = gpu.virtual_num_blocks
%tileSize= ceilDiv(%rangeUpperbound-%rangeLowerBound, %numThreads)
scf.foreach_thread %i in (%numThreads) ... {
  // Determine offset
  %lb = %i * %tileSize
  // S3: Determine corrected tile size.
  // Case 1: Since we derived numThreads, we are guaranteed that %lb is not out of bounds.
  %size = min ( -%lb+ %rangeUpperBound, %tile_size)
  // Case 2: We need a max as well as a min, because we could have excessive threads
  %size = max(0, min(-%lb+ %rangeUpperbound, %tile_size)
   ...
}It's then easy to interact with these interface ops for extracting/inserting the final value or number that needs to replace virtual_num_X ops.
Thanks for your analysis!
My interpretation of your explanation is that there are 2 things:
- computing tile_size + num_threads, with the differences you mention.
- depending on how 1. is done, generate an extra max or not.
If the logic can really be contained to those 2 changes, I think we have a nice and easy path forward.
We can add an in_bounds attribute to the tiling transform itself to avoid generating the max, it is the responsibility of the user to activate or not.
We can also add transformation ops for step 1. that encodes the "tile-first" or "thread-first" aspect and return the sizes of interest (see https://reviews.llvm.org/D129287 for an example of such a transform that returns dynamic sizes).
With both ops expressed in MLIR, there are opportunities for new things to appear: a canonicalization is now possible directly in the transform dialect where tile_to_foreach_thread automatically gains the in_bounds attribute when the proper op form is used for step 1.
If the logic can really be contained to those 2 changes, I think we have a nice and easy path forward.
Yes, I think those two things are the only differences.
We can add an in_bounds attribute to the tiling transform itself to avoid generating the max, it is the responsibility of the user to activate or not.
We can also add transformation ops for step 1. that encodes the "tile-first" or "thread-first" aspect and return the sizes of interest (see https://reviews.llvm.org/D129287 for an example of such a transform that returns dynamic sizes).
With both ops expressed in MLIR, there are opportunities for new things to appear: a canonicalization is now possible directly in the transform dialect where tile_to_foreach_thread automatically gains the in_bounds attribute when the proper op form is used for step 1.
OK, how do you want to proceed here? I can go ahead and prototype these things unless you want to take it.
I worked on it and have it mostly working. Need to add the "in_bounds" canonicalization, then I'll push it up early next week.
Add Transform ops: structured.get_num_threads and structured.tile_to_foreach.
Remove non-TransformOp based test pathways.
Add an inBounds variable to the transform. The inBounds is determined 
right before applying the transform by structured.tile_to_foreach by determining 
if the dynamic transform op handles for num_threads are produced by structured.get_num_threads.
Note that in the new Transform ops that I added, the get_tile_size op is missing. If the way I did get_num_threads is OK, I can do get_tile_size the same way.
I also still need to address some of the earlier comments.
I was out Friday and Monday, hence the delay, I have addressed and landed https://reviews.llvm.org/D129577.
Now that I see the implementation, I fear I sent you in an unnecessarily complex direction, I apologize..
We should be able to simplify this, let me deconstruct:
- the inbounds discussion is only useful when using the num_threads form and we have extra static information about the number of threads. I think this is an optimization we can put aside for now.
- the transform itself feels like it should have either a num_threads form (with the extra conditional which is what I landed above) or a tile_sizes spec but not both. In the case of the tile size spec, the num_threads is computed like you do on the fly and we know we don't need the conditional by construction.
- the get_num_threads op may be useful in the future but for what you are doing here, it should be redundant with the previous bulletpoint, I'd take it out of this revision for now.
- the dynamic dim support is nice, I would prob. split it up in a followup commit to better isolate.
Bottom-line, I think it would be easy to rebase on the PR above and land bullet 2. independently (feel free to modify the custom assembly format however you see fit), then a followup could add bullet 4.
The rest can be kept for later on a per-need basis.
Thoughts?
| mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td | ||
|---|---|---|
| 604 | typo: number | |
| mlir/test/Dialect/Linalg/transform-ops-tile-to-foreach.mlir | ||
| 108 | we shouldn't need these attributes anymore, here and above/below. | |
Sounds good, I'll look at adding changes for bullet #2 on top of your PR in a separate diff later today.
Approved https://reviews.llvm.org/D130139.
Note: head ups, if you have cycles to slice out bulletpoint 4. soon that would be great to extend the transform for @springerm to further play with, I will be out for 2 weeks starting tomorrow pm CEST but others can review too.
Thanks!
Sounds good, it shouldn't be too much work over the next couple days to port that over. I'll ping @springerm on the diff.
typo: number