Page MenuHomePhabricator

[MLIR][GPU] Implement initial mapping from loop.parallel to gpu.launch.
ClosedPublic

Authored by herhut on Mon, Feb 3, 7:24 AM.

Details

Summary

To unblock other work, this implements basic lowering based on mapping
attributes that have to be provided on all loop.parallel. The lowering
does not yet support reduce.

Diff Detail

Event Timeline

herhut created this revision.Mon, Feb 3, 7:24 AM

I started implementing the lowering of parallel loops to gpu launch code. This code uses mapping annotations in the form of attributes. Currently, one can only map to a single block/thread id. Also, upper bounds for the launch can be defined based on the number of iterations of the mapped iteration of the parallel loop. This is probably not what we want in the long run but a good starting point to inform the design we take.

Feedback highly appreciated.

Unit tests: unknown.

clang-tidy: fail. clang-tidy found 2 errors and 2 warnings. 0 of them are added as review comments below (why?).

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt

Pre-merge checks is in beta. Report issue. Please join beta or enable it for your project.

Thanks Stephan for pushing on this concretely!
I am very interested in this representation and I would welcome a (future) refactoring that will make this as reusable as possible so we can also use it directly in Linalg.
As you may or may not know, other people internally are using a very similar representation in a very successful fashion ;) ;)

Won't have time to dig into details but a few comments:

  1. made a cursory glance, it's great to start with this and refactor as we go when others need this too
  2. one of the problems you solve here is to go from type (the affine maps for mapping) to values (tidx, bidy, etc)). This is a recurrent transition that would be great to try and make retargetable (e.g. OpenMP or other runtimes), I would love to see this separated from GPU. This would also force separating concerns a bit more re outlining of kernel vs mapping of loops.
  3. Along the lines of point 2., I would have hoped to see mapLoopToProcessorIds be templatized and retargeted to this use case, is this possible as a followup?
  4. Depending on how much it is possible to go towards 3. or not, I'll comment that the Builder API is really underwhelming.. I understand the context behind not using EDSC atm, I'll post something on discourse to reopen that discussion (was too swamped until now).

In any case, great stuff and thanks!

mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir
165

Could we use capture names such as TIDX, BIDY or something similar that would help "see" the mapping better?

ftynse added a comment.Mon, Feb 3, 1:45 PM

I suppose you want high-level feedback on this, so I'm not nitpicking in the code.

I think this goes into the right direction, but you are right that this should not be the final design. Using pattern rewrites looks like the right thing to do. I would consider having a pattern and/or a utility function that a future driver can call and have a parallel loop mapped to GPU. We can have test passes where the application is controlled by attributes, but we can also have more automated approaches where we would, e.g., greedily map the outermost parallel loop to GPUs. My thinking is that we should eventually have a tree-like loop structure on which we can decide where to map to blocks/threads/do promotions. That being said, I am not 100% sure we actually want to materialize the decisions as attributes on the operations, or at least as attributes that operations know about.

One thing to keep in mind, nested parallel loops may be tricky, in particular you may need synchronizations inside the outermost parallel loop and make sure they are called by all threads in a "convergent" fashion. For the first take, I'd map one parallel loop construct or perfectly nested constructs and extend from there.

mlir/include/mlir/Dialect/LoopOps/LoopOps.td
16 ↗(On Diff #242069)

It's a bit reversed to have loops depend on affine loops. Is this necessary for AffineMapAttr?

herhut added a comment.Tue, Feb 4, 2:24 AM
  1. one of the problems you solve here is to go from type (the affine maps for mapping) to values (tidx, bidy, etc)). This is a recurrent transition that would be great to try and make retargetable (e.g. OpenMP or other runtimes), I would love to see this separated from GPU. This would also force separating concerns a bit more re outlining of kernel vs mapping of loops.

I use attributes here instead of baking the mapping into the lowering part to enable reuse. My idea was that we can have different implementations/strategies for mapping that inform a single code generation. I'd be happy to get to a shared "mapping language" that can be reused but it has to start somewhere.

  1. Along the lines of point 2., I would have hoped to see mapLoopToProcessorIds be templatized and retargeted to this use case, is this possible as a followup?

There is no code at all yet to produce the mapping attributes. But map LoopToProcessorIds could be one producer indeed.

herhut updated this revision to Diff 242294.Tue, Feb 4, 4:46 AM
herhut marked 2 inline comments as done.

Remove attribute from td file and minor cleanup.

herhut added a comment.Tue, Feb 4, 4:47 AM

I suppose you want high-level feedback on this, so I'm not nitpicking in the code.

Indeed. Thanks for the feedback.

I think this goes into the right direction, but you are right that this should not be the final design. Using pattern rewrites looks like the right thing to do. I would consider having a pattern and/or a utility function that a future driver can call and have a parallel loop mapped to GPU.

I have not written code yet to produce the mapping attributes but a greedy mapper will probably be the first thing I'll do.

We can have test passes where the application is controlled by attributes, but we can also have more automated approaches where we would, e.g., greedily map the outermost parallel loop to GPUs. My thinking is that we should eventually have a tree-like loop structure on which we can decide where to map to blocks/threads/do promotions.

I agree. To do any meaningful mapping one needs to see the whole tree. Also, the heuristics that does the tiling might already know how it wants things to be mapped. Likewise, if you explicit loads into shared memory, for example, you would know that this has to be mapped to thread groups (like warps). So maybe one want to specify the mapping at that point already, which is why I though attributes are a good way to model this.

That being said, I am not 100% sure we actually want to materialize the decisions as attributes on the operations, or at least as attributes that operations know about.

I am strongly of the opinion that the mapping should be attributes so that the producer and consumer of the mapping decisions is decoupled. I agree that they should be invisible to the op and I just added them to the ,td file temporarily for documentation. They attribute is already optional.

One thing to keep in mind, nested parallel loops may be tricky, in particular you may need synchronizations inside the outermost parallel loop and make sure they are called by all threads in a "convergent" fashion. For the first take, I'd map one parallel loop construct or perfectly nested constructs and extend from there.

There is a TODO in there that it only supports nesting of the instructions up to the innermost nest are sideeffect free. Otherwise we will need predication to have only a single hardware thread materialize the sideeffects and, as you state, a barrier if other iterations of nested loops depend on that code. Let's start with the simple case, though.

mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir
165

Will do once this has stabilized a bit. For now, the tests are fully auto-generated.

Unit tests: pass. 62417 tests passed, 0 failed and 839 were skipped.

clang-tidy: fail. clang-tidy found 0 errors and 2 warnings. 0 of them are added as review comments below (why?).

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Pre-merge checks is in beta. Report issue. Please join beta or enable it for your project.

herhut updated this revision to Diff 242937.Thu, Feb 6, 10:20 AM

Also support linalg tiled loops

herhut updated this revision to Diff 243135.Fri, Feb 7, 4:32 AM

Correct tests and some minor cleanup.

ftynse added inline comments.Fri, Feb 7, 5:30 AM
mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
503

I would suggest creating a struct with named fields here for better readability. And document the function plz

505

.cast here and below, otherwise you may be accessing a null pointer

507

Nit: can we factor out these names into constants?

517

Nit: mlir uses camelBack names

521

Should we use string attributes instead? E.g. having mapping = ["thread-0", "thread-1", "thread-2", "block-0", "block-1", "block-2", "seq"] ?

538

dyn_cast_or_null since val is not guaranteed to have a defining op

582

We should have patterns to fold constants into affine maps (in canonicalization), eg

%0 = constant 42 : index
%1 = afifne.min affine_map<(d0,d1)->(d0,d1)>(%0, %arg0)

should fold into

%0 = affine.min affine_map<(d0)->(d0,42)>(%0)

LMK if it's not the case.

ftynse added a comment.Fri, Feb 7, 5:52 AM

I'm missing some higher-level documentation on what each function is supposed to do.

rriddle added inline comments.Fri, Feb 7, 10:21 AM
mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
503

Generally only classes should be within anonmyous namespaces. static functions should be in global scope and marked 'static'.

552

This function is quite large. Can you split it up a bit?

mravishankar requested changes to this revision.Sat, Feb 8, 10:52 AM

(Sorry for jumping on this late)

This is now adding three different lowering of loops to GPU

  1. From perfectly nested loops to GPU
  2. From imperfectly nested loops to GPU
  3. Parallel loops to GPU

I confess I contributed to one of them (2). We should probably have a plan to consolidate these. This is too much duplication of functionality. We should probably deprecate (2) in favor of this change, and adapt all uses accordingly.

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
503

Add some comments please.

504

If this is really a dictionary_attribute, this lookup might be simplified using a StructAttr

516

Please add some comments about this method, preconditions, etc.

522

I am just coming upto speed on implementation of loop.parallel, but in the ODS I dont see any attribute for "mapping". When is this added?

524

nit: Please remove this line. Didnt make the immediate leap to reductions and parallelop.getNumResults() != 0

671

This comment seems outdated now.

703

It would be useful to also expose a method to populate this pattern (and also other patterns that might be needed in the future) in a method like populateParallelLoopToGPUPatterns . Then a pass can just import these patterns without having to run a separate pass to do this lowering.

This revision now requires changes to proceed.Sat, Feb 8, 10:52 AM
herhut updated this revision to Diff 243835.Tue, Feb 11, 5:40 AM
herhut marked 19 inline comments as done.

Comments, comments, comments :-)

(Sorry for jumping on this late)

This is now adding three different lowering of loops to GPU

  1. From perfectly nested loops to GPU
  2. From imperfectly nested loops to GPU
  3. Parallel loops to GPU I confess I contributed to one of them (2). We should probably have a plan to consolidate these. This is too much duplication of functionality. We should probably deprecate (2) in favor of this change, and adapt all uses accordingly.

All but 3 should probably go away unless someone invests into the dependency analysis to make 1 and 2 safe.

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
504

I had a struct attribute before but I did not want to specify the attribute on the loop.parallel operation, as it is GPU dialect specific. So I went with an unspecified optional attribute with custom accessor.

521

I wanted to stay with processor ids for now, as that is also what @nicolasvasilache was using.

522

This is added by producers of mappings that want to lower GPU. I have a GreedyMapper (I will send that out shortly) that essentially implements what the existing helper functions did.

552

A lot of it is comments but I have moved some things out.

582

So you are saying that by running with -canonicalize I should see the affine.min change? I could not reproduce this.

671

No, it is unfortunately still true. If there are nested loop.parallel the code up to them has to be side-effect free, as we are essentially replicating it per thread. I will add some checking in a later version.

mravishankar resigned from this revision.Tue, Feb 11, 1:05 PM

Thanks Stephan! I see overall where this is headed. This is nice to have indeed.

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
504

You can add it as an attribute to GPU dialect. There is no harm in adding this attribute "while" lowering to GPU dialect, or a pre-pass before lowering to GPU dialect right?

ftynse accepted this revision.Wed, Feb 12, 12:12 PM

Looks okay for the first take at it.

In general, I wouldn't bother with sequential loops here and implement a separate rewrite that we could run before this that splits the parallel loop into two nested parallel ops, and then use parallel-to-sequential transformation on the inner op.

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
576

Nit: why not nullptr as a sentinel?

611

Readablity nit: can we assign std::get<0>, etc to variables with more explicative names? The body of this loop is 3 screens long (consider refactoring in a follow-up :)) and it's hard to keep track of what was zip'ed.

627

Nit: you can use OpRewriter::InsertionGuard with a C++ scope

701

For a future revision, I'd consider having a transformation function that is controlled by actual arguments rather than by an attached attribute, and have the pattern read the attribute and call that function. This way, we can call the transformation programmatically without modifying the IR.

748

Can we just walk recursively and check for the no side effects trait, returning matchFailure if there are side effects?

750

The comment above says "it is only correct if there either is no further loop.parallel", which I cannot match with this code.

This revision is now accepted and ready to land.Wed, Feb 12, 12:12 PM
sri added a subscriber: sri.Wed, Feb 12, 8:08 PM
herhut updated this revision to Diff 244413.Thu, Feb 13, 6:05 AM
herhut marked 7 inline comments as done.

Some more comments addressed.

herhut marked 3 inline comments as done.Thu, Feb 13, 6:08 AM
herhut added inline comments.
mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
576

I dislike nullptr as sentinel because it too easily sneaks in and then produces hard to debug errors. The gpu.launch on the other hand definitely won't. I would have created an explicit sentinel but that seemed a bit overkill.

748

I added some testing code that is conservative but ensures correctness.

This revision was automatically updated to reflect the committed changes.
herhut marked an inline comment as done.
rriddle added inline comments.Thu, Feb 13, 11:25 AM
mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
751

Remove this debugging.