The patch optimizes the tiling of parallel loops with static bounds if the number of loop iterations is an integer multiple of the tile size.
Why do you dislike the llvm::zip?
This optimization makes sense here, as it is hard to recover this out of the generated affine min expression. But the comment could explain this better. Something like:
If we statically know the size of the outer loops iteration and it is divisible by the tiling factor, we can use a static bound for the inner loop. Otherwise, we have to dynamically compute the bound for each iteration of the outer loop.
I would rather have this as a separate canonicalization pattern that removes parallel loops with trip-count 1.
- I started from the fixed tiling pass and tried to apply minimal changes (drop the minimum operation if the loop sizes is a multiple of the tile size)
- I added a canonicalization to drop single iteration loops
- I use CX / VX instead of VAL_X
I may missunderstand the thing but in mlir/test/Transforms/parallel-loop-collapsing.mlir the computation of I3 directly uses NEW_I0 and multiplies it by 10 and adds 9. Since NEW_I0 takes the values 0 and 1 this means we get the value 19 for I3 but we have the loop range 9 to 11 in the original loop. I guess NEW_I0 should be divided by two before computing the index I3?
Yes, you are right, there is a div missing there. Do you want to fix it or file a bug?
Also, it would be super awesome if you could split this into two, one adding the canonicalization and one changing the tiling.
You also need to check whether upperBound - lowerBound > 0. In the other case, the loop has no iterations while the rewritten loop has one.
Why not zip over op.step and tileSizeConstants to get these two? That avoids the pattern matching. You can then assign all of them at the beginning of the loop to give them local names. Something like
Value lowerBound, upperBound, originalStep, newStep, index; std::tie...
You have assigned names to these std::get previously by assigning to locals. Why not use them here?
Please rephrase this: "size of the outer loops iteration" isn't meaningful. Also, you are only considering the case of lb, ub, step being constant, which isn't the same of trip counts being statically known/constant (for eg., %i = %N to %N + 16).
- addressed the review comments
- use ceil division by the step to compute the number of loop iterations (to support the cases where the number of loop iterations is known but not a multiple of the step)
- adapted one of the test cases to test this scenario
- removed some unnecessary headers