This extends transform.structured.tile_reduction_using_forall to
operations with multiple reduction dimensions as implied by the thread
counts. This enables reduction splitting strategies for operations with
Can we write this with LinalgOp::getReductionDims and a follow-up filter?
The logic both gets the reduction dims and identifies which thread counts in the scf.forall correspond to reduction dimensions. I can do this, but the logic here will look quite similar. I'll also add a comment.
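For readers following the thread, here is a minimal standalone sketch of that combined check, deliberately written without any MLIR dependency. The `IteratorKind` enum and the `getTiledReductionDims` helper are hypothetical names for illustration only, not the code in this PR. The sketch also assumes that a thread count of 0, or a missing trailing entry, means "do not tile this dimension".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the iterator type of each loop of an operation.
enum class IteratorKind { Parallel, Reduction };

// Returns the loop positions that are reduction dimensions AND have a
// nonzero thread count, i.e. the reductions that would actually be split.
// Assumption: a count of 0 (or a missing trailing entry) leaves the
// dimension untiled.
std::vector<std::size_t>
getTiledReductionDims(const std::vector<IteratorKind> &kinds,
                      const std::vector<std::int64_t> &numThreads) {
  assert(numThreads.size() <= kinds.size() &&
         "more thread counts than loops");
  std::vector<std::size_t> tiled;
  for (std::size_t i = 0; i < kinds.size(); ++i) {
    // Trailing dimensions without an explicit entry are treated as untiled.
    std::int64_t count = i < numThreads.size() ? numThreads[i] : 0;
    if (kinds[i] == IteratorKind::Reduction && count != 0)
      tiled.push_back(i);
  }
  return tiled;
}
```

Under these assumptions, a [par, red, red] operation with thread counts [0, 4, 2] yields positions 1 and 2, while an interleaved [red, par, red] operation with [4, 2] yields only position 0, since the parallel dimension is filtered out and the last reduction has no entry.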
Thanks for generalizing this transform!
Can we extract this in a meaningfully named helper function?
It seems odd to me that you only need to specify 2 entries in num_threads here; I would have expected you'd need [0, 4, 2] (as in your second test below).
It would be good to also have a [red, par, red] test for the "interleaved parallel" case.
This is intentional, as I'm trying to tile parallel dimensions as well as reductions here. As far as I could tell, this was never explicitly prohibited by the pattern, and I find it convenient to be able to tile both at the same time (and otherwise avoid nested foralls, which interact poorly with distribution later on). The interleaved parallel case is a good idea though; I will add a test for it.
In terms of forcing the rank to align: unlike the scf.for version of this pattern, additional tile sizes require corresponding entries in the mapping, which restricts the mapping options for distribution. For example, now I need to distribute explicitly along gpu.thread<x> in addition to gpu.thread<z> and gpu.thread<y> if I want to tile only the parallel and first reduction dimensions. Adding more dimensions requires going to linearized thread indices, which don't work well when we are intentionally avoiding distribution along a specific dimension (e.g. x, for later use with warp distribution patterns).
Re tiling parallel and reduction dimensions at once: this is a great idea indeed, thanks for pushing on this.