To unblock other work, this implements basic lowering based on mapping
attributes that have to be provided on all loop.parallel. The lowering
does not yet support reduce.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I started implementing the lowering of parallel loops to gpu launch code. This code uses mapping annotations in the form of attributes. Currently, one can only map to a single block/thread id. Also, upper bounds for the launch can be defined based on the number of iterations of the mapped iteration of the parallel loop. This is probably not what we want in the long run but a good starting point to inform the design we take.
Feedback highly appreciated.
Unit tests: unknown.
clang-tidy: fail. clang-tidy found 2 errors and 2 warnings. 0 of them are added as review comments below (why?).
clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.
Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt
Pre-merge checks is in beta. Report issue. Please join beta or enable it for your project.
Thanks Stephan for pushing on this concretely!
I am very interested in this representation and I would welcome a (future) refactoring that will make this as reusable as possible so we can also use it directly in Linalg.
As you may or may not know, other people internally are using a very similar representation in a very successful fashion ;) ;)
Won't have time to dig into details but a few comments:
- made a cursory glance, it's great to start with this and refactor as we go when others need this too
- one of the problems you solve here is to go from type (the affine maps for mapping) to values (tidx, bidy, etc)). This is a recurrent transition that would be great to try and make retargetable (e.g. OpenMP or other runtimes), I would love to see this separated from GPU. This would also force separating concerns a bit more re outlining of kernel vs mapping of loops.
- Along the lines of point 2., I would have hoped to see mapLoopToProcessorIds be templatized and retargeted to this use case, is this possible as a followup?
- Depending on how much it is possible to go towards 3. or not, I'll comment that the Builder API is really underwhelming.. I understand the context behind not using EDSC atm, I'll post something on discourse to reopen that discussion (was too swamped until now).
In any case, great stuff and thanks!
mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir | ||
---|---|---|
165 | Could we use capture names such as TIDX, BIDY or something similar that would help "see" the mapping better? |
I suppose you want high-level feedback on this, so I'm not nitpicking in the code.
I think this goes into the right direction, but you are right that this should not be the final design. Using pattern rewrites looks like the right thing to do. I would consider having a pattern and/or a utility function that a future driver can call and have a parallel loop mapped to GPU. We can have test passes where the application is controlled by attributes, but we can also have more automated approaches where we would, e.g., greedily map the outermost parallel loop to GPUs. My thinking is that we should eventually have a tree-like loop structure on which we can decide where to map to blocks/threads/do promotions. That being said, I am not 100% sure we actually want to materialize the decisions as attributes on the operations, or at least as attributes that operations know about.
One thing to keep in mind, nested parallel loops may be tricky, in particular you may need synchronizations inside the outermost parallel loop and make sure they are called by all threads in a "convergent" fashion. For the first take, I'd map one parallel loop construct or perfectly nested constructs and extend from there.
mlir/include/mlir/Dialect/LoopOps/LoopOps.td | ||
---|---|---|
16 ↗ | (On Diff #242069) | It's a bit reversed to have loops depend on affine loops. Is this necessary for AffineMapAttr? |
I use attributes here instead of baking the mapping into the lowering part to enable reuse. My idea was that we can have different implementations/strategies for mapping that inform a single code generation. I'd be happy to get to a shared "mapping language" that can be reused but it has to start somewhere.
- Along the lines of point 2., I would have hoped to see mapLoopToProcessorIds be templatized and retargeted to this use case, is this possible as a followup?
There is no code at all yet to produce the mapping attributes. But map LoopToProcessorIds could be one producer indeed.
Indeed. Thanks for the feedback.
I think this goes into the right direction, but you are right that this should not be the final design. Using pattern rewrites looks like the right thing to do. I would consider having a pattern and/or a utility function that a future driver can call and have a parallel loop mapped to GPU.
I have not written code yet to produce the mapping attributes but a greedy mapper will probably be the first thing I'll do.
We can have test passes where the application is controlled by attributes, but we can also have more automated approaches where we would, e.g., greedily map the outermost parallel loop to GPUs. My thinking is that we should eventually have a tree-like loop structure on which we can decide where to map to blocks/threads/do promotions.
I agree. To do any meaningful mapping one needs to see the whole tree. Also, the heuristics that does the tiling might already know how it wants things to be mapped. Likewise, if you explicit loads into shared memory, for example, you would know that this has to be mapped to thread groups (like warps). So maybe one want to specify the mapping at that point already, which is why I though attributes are a good way to model this.
That being said, I am not 100% sure we actually want to materialize the decisions as attributes on the operations, or at least as attributes that operations know about.
I am strongly of the opinion that the mapping should be attributes so that the producer and consumer of the mapping decisions is decoupled. I agree that they should be invisible to the op and I just added them to the ,td file temporarily for documentation. They attribute is already optional.
One thing to keep in mind, nested parallel loops may be tricky, in particular you may need synchronizations inside the outermost parallel loop and make sure they are called by all threads in a "convergent" fashion. For the first take, I'd map one parallel loop construct or perfectly nested constructs and extend from there.
There is a TODO in there that it only supports nesting of the instructions up to the innermost nest are sideeffect free. Otherwise we will need predication to have only a single hardware thread materialize the sideeffects and, as you state, a barrier if other iterations of nested loops depend on that code. Let's start with the simple case, though.
mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir | ||
---|---|---|
165 | Will do once this has stabilized a bit. For now, the tests are fully auto-generated. |
Unit tests: pass. 62417 tests passed, 0 failed and 839 were skipped.
clang-tidy: fail. clang-tidy found 0 errors and 2 warnings. 0 of them are added as review comments below (why?).
clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.
Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml
Pre-merge checks is in beta. Report issue. Please join beta or enable it for your project.
mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp | ||
---|---|---|
503 | I would suggest creating a struct with named fields here for better readability. And document the function plz | |
505 | .cast here and below, otherwise you may be accessing a null pointer | |
507 | Nit: can we factor out these names into constants? | |
517 | Nit: mlir uses camelBack names | |
521 | Should we use string attributes instead? E.g. having mapping = ["thread-0", "thread-1", "thread-2", "block-0", "block-1", "block-2", "seq"] ? | |
538 | dyn_cast_or_null since val is not guaranteed to have a defining op | |
582 | We should have patterns to fold constants into affine maps (in canonicalization), eg %0 = constant 42 : index should fold into %0 = affine.min affine_map<(d0)->(d0,42)>(%0) LMK if it's not the case. |
(Sorry for jumping on this late)
This is now adding three different lowering of loops to GPU
- From perfectly nested loops to GPU
- From imperfectly nested loops to GPU
- Parallel loops to GPU
I confess I contributed to one of them (2). We should probably have a plan to consolidate these. This is too much duplication of functionality. We should probably deprecate (2) in favor of this change, and adapt all uses accordingly.
mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp | ||
---|---|---|
503 | Add some comments please. | |
504 | If this is really a dictionary_attribute, this lookup might be simplified using a StructAttr | |
516 | Please add some comments about this method, preconditions, etc. | |
522 | I am just coming upto speed on implementation of loop.parallel, but in the ODS I dont see any attribute for "mapping". When is this added? | |
524 | nit: Please remove this line. Didnt make the immediate leap to reductions and parallelop.getNumResults() != 0 | |
671 | This comment seems outdated now. | |
703 | It would be useful to also expose a method to populate this pattern (and also other patterns that might be needed in the future) in a method like populateParallelLoopToGPUPatterns . Then a pass can just import these patterns without having to run a separate pass to do this lowering. |
All but 3 should probably go away unless someone invests into the dependency analysis to make 1 and 2 safe.
mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp | ||
---|---|---|
504 | I had a struct attribute before but I did not want to specify the attribute on the loop.parallel operation, as it is GPU dialect specific. So I went with an unspecified optional attribute with custom accessor. | |
521 | I wanted to stay with processor ids for now, as that is also what @nicolasvasilache was using. | |
522 | This is added by producers of mappings that want to lower GPU. I have a GreedyMapper (I will send that out shortly) that essentially implements what the existing helper functions did. | |
552 | A lot of it is comments but I have moved some things out. | |
582 | So you are saying that by running with -canonicalize I should see the affine.min change? I could not reproduce this. | |
671 | No, it is unfortunately still true. If there are nested loop.parallel the code up to them has to be side-effect free, as we are essentially replicating it per thread. I will add some checking in a later version. |
Thanks Stephan! I see overall where this is headed. This is nice to have indeed.
mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp | ||
---|---|---|
504 | You can add it as an attribute to GPU dialect. There is no harm in adding this attribute "while" lowering to GPU dialect, or a pre-pass before lowering to GPU dialect right? |
Looks okay for the first take at it.
In general, I wouldn't bother with sequential loops here and implement a separate rewrite that we could run before this that splits the parallel loop into two nested parallel ops, and then use parallel-to-sequential transformation on the inner op.
mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp | ||
---|---|---|
576 | Nit: why not nullptr as a sentinel? | |
611 | Readablity nit: can we assign std::get<0>, etc to variables with more explicative names? The body of this loop is 3 screens long (consider refactoring in a follow-up :)) and it's hard to keep track of what was zip'ed. | |
627 | Nit: you can use OpRewriter::InsertionGuard with a C++ scope | |
701 | For a future revision, I'd consider having a transformation function that is controlled by actual arguments rather than by an attached attribute, and have the pattern read the attribute and call that function. This way, we can call the transformation programmatically without modifying the IR. | |
748 | Can we just walk recursively and check for the no side effects trait, returning matchFailure if there are side effects? | |
750 | The comment above says "it is only correct if there either is no further loop.parallel", which I cannot match with this code. |
mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp | ||
---|---|---|
576 | I dislike nullptr as sentinel because it too easily sneaks in and then produces hard to debug errors. The gpu.launch on the other hand definitely won't. I would have created an explicit sentinel but that seemed a bit overkill. | |
748 | I added some testing code that is conservative but ensures correctness. |
mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp | ||
---|---|---|
751 | Remove this debugging. |
I would suggest creating a struct with named fields here for better readability. And document the function plz