This is the first step to support software pipelining for scf.for loops.
This patch only contains the transformation to create the pipelined kernel and
prologue/epilogue.
The schedule needs to be given by the user, since many different algorithms
and heuristics could be applied.
This currently doesn't handle loop arguments; those will be added in a
follow-up patch.
Repository: rG LLVM Github Monorepo
mlir/include/mlir/Dialect/SCF/Transforms.h:70: PipeliningOption?
fix PipeliningOption name
mlir/include/mlir/Dialect/SCF/Transforms.h:70: Oops, thanks for catching it.
Did a preliminary pass. I understand the core transformation is mostly generating the code, and the actual dependence analysis is left to the caller through the callback. So in that way it mostly seems OK. Some questions, though, to get a better idea of where this is heading:
- If you want to do pipelining automatically, what analysis would you need? It is also not clear to me what the test_pipelining_cycle marker in the test pass is for.
- It almost seems like it might just be easier to create a build method for scf.for that creates the "pipelined" implementation while lowering into loops. I can see how you would lower linalg ops to this, etc. Creating the pipelined loops while lowering seems to be more straight-forward than creating stages after the fact.
For now "request revision" to indicate that I will come back for a further review.
mlir/lib/Dialect/SCF/Transforms/LoopPipelining.cpp:84: Do we really need this? We could just generate the guards to make sure this works even for dynamic cases (which will then just canonicalize away for static cases).
mlir/test/lib/Dialect/SCF/TestSCFUtils.cpp:83: Not really clear what the cycle is for?
Correct, the scheduling itself is left outside, as it will require a lot more heuristics. Pipeliners are usually split into two pieces: the scheduler, which picks a schedule based on latencies, dependencies, encoding, etc., and the expander, which is the mechanical part that generates the loops.
If you want to do pipelining automatically what analysis would you need.
For fully automatic pipelining we would need some information about the latency of the different operations. In general, when done at a high level, pipelining mostly means deciding how to overlap different parts of the loop. This could be derived from analysis, but it is most likely to be something hardcoded for some given ops, as doing accurate latency analysis at this stage is going to be hard. Usually pipelining is done late in the compilation flow and requires exact latency information.
Since the scheduling part will be heavy on heuristics, I don't think it can live in the core transformation for a while.
It almost seems like it might just be easier to create a build method for scf.for that creates the "pipelined" implementation while lowering into loops. I can see how you would lower linalg ops to this, etc. Creating the pipelined loops while lowering seems to be more straight-forward than creating stages after the fact.
I don't understand this comment. In general, pipelining can be applied at many different levels (it could be applied on linalg on tensors, at the vector level, etc.), so having a generic solution for it seems better. Overall this is just a first draft, and in my experience pipelining can become quite complex, so I would think that separating this logic from other lowerings is going to be the way to go.
mlir/lib/Dialect/SCF/Transforms/LoopPipelining.cpp:84: The dynamic case is much harder, since we need to be sure that the number of iterations is greater than the number of stages. If we want to handle a fully generic case we would potentially need special control flow in the prologue/epilogue, which would make the implementation significantly more complicated.
mlir/test/lib/Dialect/SCF/TestSCFUtils.cpp:83: I was trying to match the naming used in typical pipeliners, although I agree it is a bit confusing in this case. I'll change it to something like test_pipelining_op_order if that sounds clearer.
Rename the test marker to __test_pipelining_op_order__
mlir/test/lib/Dialect/SCF/TestSCFUtils.cpp:83: Renamed this marker.
Coming from a caller's perspective, you need to look at the sequence of operations within the scf.for, carefully partition it into different stages, and then specify those stages to the transformation. Recovering the different stages after lowering to loops (especially if we, say, run canonicalizations before pipelining) is going to be very tricky and cumbersome. What is easier is to reason about the different stages while lowering to the loop: it is easier to build the pipelined implementation directly while creating the scf.for. If the build method takes multiple lambdas, each of which represents the IR to be used for one stage of the loop, then I can generate the pipelined implementation directly. I can't see a robust way of discovering these stages after the fact; it will invariably be very specific to the exact sequence of operations in the loop being pipelined, and would be hard to maintain.
I don't think the fact that pipelining can be applied at different stages (tensors, vectors, etc.) matters here. You start with a "higher level" representation (like linalg, etc.), and when you lower it to loops you explicitly reason about the different stages and pipeline them. During lowering you have enough information about the loop body (since you are creating the body) to be able to reason about the different stages you want to use.
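The "build method with one lambda per stage" idea could be sketched roughly like this (a hypothetical API modeled in Python; the function and parameter names are made up for illustration):

```python
# The caller supplies one callable per stage; the builder emits the
# prologue, kernel, and epilogue directly instead of recovering stages.
def build_pipelined_loop(n, stage0, stage1):
    results = []
    if n == 0:
        return results
    carried = stage0(0)                  # prologue: stage 0 of iter 0
    for i in range(n - 1):               # kernel: overlap two iterations
        results.append(stage1(carried))  # stage 1 of iteration i
        carried = stage0(i + 1)          # stage 0 of iteration i+1
    results.append(stage1(carried))      # epilogue: stage 1 of last iter
    return results

print(build_pipelined_loop(3, lambda i: i + 10, lambda v: v * 2))
# [20, 22, 24]
```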
If you do want a mechanism to take an scf.for that is already lowered and convert it to a "pipelined" version, then you can use the same mechanism in this change to do a pattern rewrite from the old scf.for to a pipelined implementation. Though I think such cases should be really limited.
To clarify, I think the transformation itself is sound AFAICS. I am just not sure the mechanism to specify stages is very usable.
I suspect the documentation in the patch itself could be nicely improved with some of the description you provided here :)
I'm confused about what loop lowering you are talking about. The loop could come from different kinds of transformations: it could come from tiling, from linalg to loops, or from tiled_op to loops. I don't understand how you would add a hook to the scf.for creation in every single case. Also, it needs to create the prologue and epilogue, which are going to be awkward to add in the middle of such transformations. I guess the main thing is that I'm not able to picture how the mechanism you are suggesting would look in practice. What would it look like?
Coming from a callers perspective, you need to look at the sequence of operations within the scf.for, carefully partition it into different stages, and then specify those stages to the transformation. Recovering the different stages after lowering to loops (especially if we say run canonicalizations before pipelining) is going to be very tricky and cumbersome.
I don't think it is that bad. If you know the operations with high latency within your loop, it should be easy to come up with an approximate schedule for the loop to hide latency. Typically it would be a DMA kind of operation, and it should be easy to pick a schedule around it. What would you use from the higher level information? In general the stages are picked based on simple scheduling of the instructions in the loop.
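Picking a schedule around a known high-latency op could look roughly like this sketch (the latency numbers, threshold, and op names are made-up assumptions, not anything from the patch):

```python
# Made-up latencies; only the long-latency op (e.g. a DMA load) needs
# to start a new stage so that later ops overlap with it.
LATENCY = {"dma_load": 100, "mul": 1, "store": 1}

def assign_stages(ops):
    stages = {}
    stage = 0
    for op in ops:
        stages[op] = stage
        if LATENCY[op] > 10:  # crude "high latency" threshold
            stage += 1        # consumers go to the next stage
    return stages

print(assign_stages(["dma_load", "mul", "store"]))
# {'dma_load': 0, 'mul': 1, 'store': 1}
```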
mlir/include/mlir/Dialect/SCF/Transforms.h:94: While correct, I was confused by this way of expressing the stages as letters; wouldn't S0 S1 S2 be clearer? You could still write S0', S0'' etc. if you want to express prologue stages.
mlir/include/mlir/Dialect/SCF/Transforms.h:94: You're right, I changed the example; hopefully it is clearer.
It is awkward for sure, but to me doing it in a callback is even more awkward. I am apprehensive about whether this approach will scale from a caller's side; I can only see it working when the loop is generated in a very specific context. You can always make things work :), but to me this falls into the "hero optimization" category, which almost always can be solved by better abstractions. In this case, for example, you could just create a different operation, scf.pipelined_for, where the op has multiple scf.pipelined_stage operations that explicitly specify each stage, and you can use standard SSA values to represent dependencies between the different stages. Trying to recover the stages is what is traditionally done, but that is because until MLIR there wasn't a way to represent a "pipelined loop". You can easily lower such an operation to an scf.for in the exact form this patch is doing. Then you have to put in the work of having a lowering to scf.pipelined_for, but I think that composes better.
I agree having a higher level abstraction without having to do the transformation would be better. scf.pipelined_for is actually the first thing I tried; however, I couldn't find a way to express all the information we need while keeping the IR reasonably high level. For instance, splitting into stages is not enough, as the order of the operations matters for the generated IR to be correct, and retrieving that order later is likely to be hard.
If we can find such a representation, then we can replace this with a lowering of scf.pipelined_for to scf.for. However, so far I don't see a way to both encapsulate all the information and keep the representation high level enough to compose with other transformations.
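A toy Python illustration (not from the patch) of why the order matters beyond the stage split: in the kernel, both stages of different iterations coexist, and swapping two ops that have the same stage assignment changes which iteration's value is consumed.

```python
def kernel_correct(data):
    out = []
    loaded = data[0]
    for i in range(len(data) - 1):
        out.append(loaded * 2)   # stage 1 of iteration i runs first...
        loaded = data[i + 1]     # ...then stage 0 of iteration i+1
    out.append(loaded * 2)
    return out

def kernel_wrong_order(data):
    out = []
    loaded = data[0]
    for i in range(len(data) - 1):
        loaded = data[i + 1]     # stage 0 overwrites too early
        out.append(loaded * 2)   # stage 1 now reads the wrong value
    out.append(loaded * 2)
    return out

print(kernel_correct([1, 2, 3]))      # [2, 4, 6]
print(kernel_wrong_order([1, 2, 3]))  # [4, 6, 6]
```

Both versions put the load in stage 0 and the compute in stage 1; only the textual order inside the kernel differs, yet one of them is wrong.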
Thanks for engaging with me on this discussion. I am not going to block this patch from landing (let me know how I can remove my "request changes" marker, or just feel free to ignore it if someone else can stamp this). It is interesting to hear that it was harder to specify the scf.pipeline_for with all the information needed than to recover it from the generated scf.for loop. I obviously haven't thought deeply about it here, so this is more my ignorance. Though I would be interested in learning; if you have time to describe the issues somewhere, I think it would make a good discourse post, and a very valuable thing to record anyway. Like I said earlier, the core changes here look mostly fine. I can take a pass at a more detailed review as well, but I would like Stephan/Alex to take a look here since they probably have more opinions on this as well.
Did a pass. Overall looks fine. Just one main comment, and a nit below. I will "accept" to remove my blocker, but would probably be better to get someone else to review as well. I am on the fence about the structure here.
mlir/include/mlir/Dialect/SCF/Transforms.h:73: I am still not convinced about this callback. It is returning the instructions per stage, but also how these instructions are ordered globally across stages. That still seems strange to me, but I'm not going to block on it (just noting it). Thomas and I started designing a scf.pipeline_for loop that hopefully will make this better. We will probably post an RFC for it.
mlir/lib/Dialect/SCF/Transforms/LoopPipelining.cpp:58: Nit: in general I avoid having private methods. It's much easier to just make this class a struct that carries data, and have static functions that either take the struct by reference or take the fields of the struct needed by the operation.
Thanks for looking at it Mahesh! Nicolas mentioned he wouldn't have time to look at it, so if the code looks fine to you I'd like to move forward with it. I can always address concerns in a follow-up pass. This will allow me to make progress.
mlir/include/mlir/Dialect/SCF/Transforms.h:73: I agree, this is a good design point to discuss. If we can come up with scf.pipeline_for it will allow us to have a higher level representation that composes better with other transformations, however a couple of things are still unclear to me.
That being said, I think scf.pipeline_for will fit nicely on top of this code. The transformation needs the operation order no matter what; once we have the design for scf.pipeline_for, it will decide the order during lowering and pass it to the transformation code. So I think this is the right interface for the mechanical transformation needed to create the scf.for code.
mlir/lib/Dialect/SCF/Transforms/LoopPipelining.cpp:58: This method is not private; are you talking about setValueMapping? Passing the struct by reference seems equivalent to me, and unfortunately these transformations have a lot of state. I wasn't able to reduce it further without passing a long list of arguments, so most of the functions would take the whole structure as an argument. Therefore it seems simpler this way.