This is the first step to support software pipelining for scf.for loops.
This patch only contains the transformation to create the pipelined kernel and
prologue/epilogue.
The schedule needs to be given by the user, since many different algorithms
and heuristics could be applied.
This currently doesn't handle loop arguments; those will be added in a
follow-up patch.
Repository: rG LLVM Github Monorepo
mlir/include/mlir/Dialect/SCF/Transforms.h:70: PipeliningOption?
fix PipeliningOption name
mlir/include/mlir/Dialect/SCF/Transforms.h:70: Oops, thanks for catching it.
Did a preliminary pass. I understand the core transformation is mostly generating the code, and the actual dependence analysis is left to the caller through the callback. So in that way it mostly seems OK. Some questions, though, to get a better idea of where this is heading:
- If you want to do pipelining automatically, what analysis would you need? It is also not clear to me what the test_pipelining_cycle marker in the test pass is for.
- It almost seems like it might just be easier to create a build method for scf.for that creates the "pipelined" implementation while lowering into loops. I can see how you would lower linalg ops to this, etc. Creating the pipelined loops while lowering seems to be more straight-forward than creating stages after the fact.
For now "request revision" to indicate that I will come back for a further review.
mlir/lib/Dialect/SCF/Transforms/LoopPipelining.cpp:84: Do we really need this? We could just generate the guards to make sure this works even for dynamic cases (which will then just canonicalize away for static cases).
mlir/test/lib/Dialect/SCF/TestSCFUtils.cpp:83: Not really clear what the cycle is for?
Correct, the scheduling itself is left outside, as it will require a lot more heuristics. Pipeliners are usually split into two pieces: the scheduler, which picks a schedule based on latencies, dependencies, encoding, etc., and the expander, which is the mechanical part that generates the loops.
If you want to do pipelining automatically what analysis would you need.
For fully automatic pipelining we would need some information about the latency of the different operations. In general, when done at a high level, pipelining mostly means deciding how to overlap different parts of the loop. This could be derived from analysis, but it is most likely to be something hardcoded for some given ops, as doing accurate latency analysis at this stage is going to be hard. Usually pipelining is done late in the compilation flow and requires exact latency information.
Since the scheduling part will be heavy on heuristics, I don't think it can live in the core transformation for a while.
It almost seems like it might just be easier to create a build method for scf.for that creates the "pipelined" implementation while lowering into loops. I can see how you would lower linalg ops to this, etc. Creating the pipelined loops while lowering seems to be more straight-forward than creating stages after the fact.
I don't understand this comment. In general, pipelining can be applied at many different levels (it could be applied on linalg on tensors, at the vector level, etc.), so having a generic solution for it seems better. Overall this is just a first draft, and in my experience pipelining can become quite complex, so I would think that separating this logic from other lowerings is going to be the way to go.
mlir/lib/Dialect/SCF/Transforms/LoopPipelining.cpp:84: The dynamic case is much harder, since we need to be sure that the number of iterations is greater than the number of stages. If we want to handle a fully generic case we would potentially need special control flow in the prologue/epilogue, which would make the implementation significantly more complicated.
mlir/test/lib/Dialect/SCF/TestSCFUtils.cpp:83: I was trying to match the naming used in typical pipeliners, although I agree it is a bit confusing in this case. I'll change it to something like test_pipelining_op_order if that sounds clearer.
Rename the test marker to __test_pipelining_op_order__
mlir/test/lib/Dialect/SCF/TestSCFUtils.cpp:83: Renamed this marker.
Coming from a caller's perspective, you need to look at the sequence of operations within the scf.for, carefully partition it into different stages, and then specify those stages to the transformation. Recovering the different stages after lowering to loops (especially if we, say, run canonicalizations before pipelining) is going to be very tricky and cumbersome. What is easier is to reason about the different stages while lowering to the loop: it is easier to build the pipelined implementation directly while creating the scf.for. If the build method takes multiple lambdas, each of which represents the IR to be used for one stage of the loop, then I can generate the pipelined implementation directly. I can't see a robust way of discovering these stages after the fact; it will invariably be very specific to the exact sequence of operations in the loop being pipelined, and would be hard to maintain.
I don't think the fact that pipelining can be applied at different stages (tensors, vectors, etc.) matters here. You start with a "higher level" representation (like linalg, etc.), and when you lower it to loops you explicitly reason about the different stages and pipeline them. During lowering you have enough information about the loop body (since you are creating the body) to be able to reason about the different stages you want to use.
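The "build method with one lambda per stage" idea could be sketched roughly like this (a hypothetical API modeled in Python; the function and parameter names are made up for illustration):

```python
# The caller supplies one callable per stage; the builder emits the
# prologue, kernel, and epilogue directly instead of recovering stages.
def build_pipelined_loop(n, stage0, stage1):
    results = []
    if n == 0:
        return results
    carried = stage0(0)                  # prologue: stage 0 of iter 0
    for i in range(n - 1):               # kernel: overlap two iterations
        results.append(stage1(carried))  # stage 1 of iteration i
        carried = stage0(i + 1)          # stage 0 of iteration i+1
    results.append(stage1(carried))      # epilogue: stage 1 of last iter
    return results

print(build_pipelined_loop(3, lambda i: i + 10, lambda v: v * 2))
# [20, 22, 24]
```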
If you do want a mechanism to take an scf.for that is already lowered and convert it to a "pipelined" version, then you can use the same mechanism in this change to do a pattern rewrite from the old scf.for to a pipelined implementation. Though I think such cases should be really limited.
To clarify, I think the transformation itself is sound AFAICS. I am just not sure the mechanism to specify stages is very usable.
I suspect the documentation in the patch itself could be nicely improved with some of the description you provided here :)
I'm confused about what loop lowering you are talking about. The loop could come from different kinds of transformations: it could come from tiling, from linalg to loops, or from tiled_op to loops. I don't understand how you would add a hook to the scf.for creation in every single case. Also, it needs to create the prologue and epilogue, which are going to be awkward to add in the middle of such transformations. I guess the main thing is that I'm not able to picture how the mechanism you are suggesting would look in practice. What would it look like?
Coming from a callers perspective, you need to look at the sequence of operations within the scf.for, carefully partition it into different stages, and then specify those stages to the transformation. Recovering the different stages after lowering to loops (especially if we say run canonicalizations before pipelining) is going to be very tricky and cumbersome.
I don't think it is that bad. If you know the operations with high latency within your loop, it should be easy to come up with an approximate schedule for the loop to hide latency. Typically it would be a DMA kind of operation, and it should be easy to pick a schedule around it. What would you use from the higher level information? In general the stages are picked based on simple scheduling of the instructions in the loop.
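Picking a schedule around a known high-latency op could look roughly like this sketch (the latency numbers, threshold, and op names are made-up assumptions, not anything from the patch):

```python
# Made-up latencies; only the long-latency op (e.g. a DMA load) needs
# to start a new stage so that later ops overlap with it.
LATENCY = {"dma_load": 100, "mul": 1, "store": 1}

def assign_stages(ops):
    stages = {}
    stage = 0
    for op in ops:
        stages[op] = stage
        if LATENCY[op] > 10:  # crude "high latency" threshold
            stage += 1        # consumers go to the next stage
    return stages

print(assign_stages(["dma_load", "mul", "store"]))
# {'dma_load': 0, 'mul': 1, 'store': 1}
```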
mlir/include/mlir/Dialect/SCF/Transforms.h:94: While correct, I was confused by this way of expressing the stages as letters; wouldn't S0 S1 S2 be clearer? You could still write S0', S0'' etc. if you want to express prologue stages.
mlir/include/mlir/Dialect/SCF/Transforms.h:94: You're right, I changed the example; hopefully it is clearer.
It is awkward for sure, but to me doing it in a callback is even more awkward. I am apprehensive about whether this approach will scale from a caller's side; I can only see it working when the loop is generated in a very specific context. You can always make things work :), but to me this falls into the "hero optimization" category, which almost always can be solved by better abstractions. In this case, for example, you could just create a different operation, scf.pipelined_for, where the op has multiple scf.pipelined_stage operations that explicitly specify each stage, and you can use standard SSA values to represent dependencies between the different stages. Trying to recover the stages is what is traditionally done, but that is because until MLIR there wasn't a way to represent a "pipelined loop". You can easily lower such an operation to an scf.for in the exact form this patch is doing. Then you have to put in the work of having a lowering to scf.pipelined_for, but I think that composes better.
I agree having a higher level abstraction without having to do the transformation would be better. scf.pipelined_for is actually the first thing I tried; however, I couldn't find a way to express all the information we need while keeping the IR reasonably high level. For instance, splitting into stages is not enough, as the order of the operations matters for the generated IR to be correct, and retrieving that order later is likely to be hard.
If we can find such a representation, then we can replace this with a lowering of scf.pipelined_for to scf.for. However, so far I don't see a way to both encapsulate all the information and keep the representation high level enough to compose with other transformations.
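A toy Python illustration (not from the patch) of why the order matters beyond the stage split: in the kernel, both stages of different iterations coexist, and swapping two ops that have the same stage assignment changes which iteration's value is consumed.

```python
def kernel_correct(data):
    out = []
    loaded = data[0]
    for i in range(len(data) - 1):
        out.append(loaded * 2)   # stage 1 of iteration i runs first...
        loaded = data[i + 1]     # ...then stage 0 of iteration i+1
    out.append(loaded * 2)
    return out

def kernel_wrong_order(data):
    out = []
    loaded = data[0]
    for i in range(len(data) - 1):
        loaded = data[i + 1]     # stage 0 overwrites too early
        out.append(loaded * 2)   # stage 1 now reads the wrong value
    out.append(loaded * 2)
    return out

print(kernel_correct([1, 2, 3]))      # [2, 4, 6]
print(kernel_wrong_order([1, 2, 3]))  # [4, 6, 6]
```

Both versions put the load in stage 0 and the compute in stage 1; only the textual order inside the kernel differs, yet one of them is wrong.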
Thanks for engaging with me on this discussion. I am not going to block this patch from landing (let me know how I can remove my "request changes" marker, or just feel free to ignore it if someone else can stamp this). It is interesting to hear that it was harder to specify the scf.pipeline_for with all the information needed than to recover it from the generated scf.for loop. I obviously haven't thought deeply about it here, so this is more my ignorance. Though I would be interested in learning; if you have time to describe the issues somewhere, I think it would make a good discourse post, and a very valuable thing to record anyway. Like I said earlier, the core changes here look mostly fine. I can take a pass at a more detailed review as well, but I would like Stephan/Alex to take a look here since they probably have more opinions on this as well.
Did a pass. Overall looks fine. Just one main comment, and a nit below. I will "accept" to remove my blocker, but would probably be better to get someone else to review as well. I am on the fence about the structure here.
mlir/include/mlir/Dialect/SCF/Transforms.h:73: I am still not convinced about this callback. It is returning the instructions per stage, but also how these instructions are ordered globally across stages. That still seems strange to me, but I'm not going to block on it (just noting it). Thomas and I started designing a scf.pipeline_for loop that hopefully will make this better. We will probably post an RFC for it.
mlir/lib/Dialect/SCF/Transforms/LoopPipelining.cpp:58: Nit: in general I avoid having private methods. It's much easier to just make this class a struct that carries data, and have static functions that either take the struct by reference or take the fields of the struct needed by the operation.
Thanks for looking at it Mahesh! Nicolas mentioned he wouldn't have time to look at it, so if the code looks fine to you I'd like to move forward with it. I can always address concerns in a follow-up pass. This will allow me to make progress.
mlir/include/mlir/Dialect/SCF/Transforms.h:73: I agree, this is a good design point to discuss. If we can come up with scf.pipeline_for it will allow us to have a higher level representation that composes better with other transformations, however a couple of things are still unclear to me.
That being said, I think scf.pipeline_for will fit nicely on top of this code. The transformation needs the operation order no matter what; once we have the design for scf.pipeline_for, it will decide the order during lowering and pass it to the transformation code. So I think this is the right interface for the mechanical transformation needed to create the scf.for code.
mlir/lib/Dialect/SCF/Transforms/LoopPipelining.cpp:58: This method is not private; are you talking about setValueMapping? Passing the struct by reference seems equivalent to me, and unfortunately these transformations have a lot of state. I wasn't able to reduce it further without passing a long list of arguments, so most of the functions would take the whole structure as an argument. Therefore it seems simpler this way.