Page MenuHomePhabricator

[mlir] Add in-dialect lowering of gpu.all_reduce.
ClosedPublic

Authored by csigg on Jan 3 2020, 12:53 AM.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Unit tests: pass. 61127 tests passed, 0 failed and 728 were skipped.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: pass.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

ftynse marked an inline comment as done.Jan 6 2020, 2:53 AM

Wrong clang-tidy checks are annoying here, will make another pass later.

mlir/include/mlir/Dialect/GPU/GPUOps.td
167

Style nit: add whitespace around "+"

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
30

Nit: explicit is unnecessary for multi-argument constructors.

40

Nit: can we use "neutral" terminology: shared memory -> workgroup memory, block -> workgroup ?

48

Should this be store %warp_reduce, %workgroup_buffer[%warp_id]?
Same below for loads and stores

53

Should this be cond_br %is_valid_warp instead?

68

Big and mechanical change: Value is now value-based, please use it without pointers.

82

Style nit: add a comment /*bitwidth=*/ for 32

89

Not sure I follow here. numThreadsWithSmallerWarpId is the equivalent of warp_id * warp_size, or floor(linear_thread_id, warp_size) * warp_size. Subtracting that from blockSize will give you the number of threads in all warps starting from the current warp. From skimming through the consumer (createWarpReduce), it looks like what it expects is the number of threads in the _current_ warp.

105

This looks like you want EDSC :)

+ @nicolasvasilache

139

Maybe you can store the location and pass it as first argument? It's the same for all operations you create above.

151

Nit: let's have a named constant for the address space

202

Why only F32? Could you have isa<FloatType>() instead?

244

Is this lambda worse it?

mlir/test/Dialect/GPU/all-reduce.mlir
8

Is there a reason why you can't use .* as a regexp for names? Seeing the full with ranges makes it hard to read the test. I think we also support capital letters in the names.

8–177
csigg updated this revision to Diff 236427.Jan 6 2020, 11:31 AM
csigg marked 20 inline comments as done.

Update reflecting review comments from ftynse.

csigg added a comment.Jan 6 2020, 11:31 AM

Thanks a lot for the review, Alex!

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
30

Removed.

Nit the nit: explicit does prevent copy-initialization from initializer list.

89

Correct. The activeWidth does not need to be clamped to subgroup width though. I added two comments.

151

Punted by just adding a local variable.

244

Haha, I like your comment. Very subtle ;-)

mlir/test/Dialect/GPU/all-reduce.mlir
8

Replaced with generate-test-checks.py result.

8–177

Works very well, I wish I knew about this (well, Mehdi mentioned it but I couldn't find it).

Unit tests: pass. 61127 tests passed, 0 failed and 728 were skipped.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

rriddle added inline comments.Jan 6 2020, 12:08 PM
mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
163

auto op = reduceOp.op() to avoid recomputing below.

301

Remove llvm:: from most of the things in this file. They are re-exported in the mlir namespace already.

353

When does this ever fail?

csigg updated this revision to Diff 236553.Jan 7 2020, 5:03 AM

Addressing rriddle's review comments.

Unit tests: unknown.

clang-tidy: unknown.

clang-format: unknown.

Build artifacts: diff.json, console-log.txt

csigg marked 4 inline comments as done.Jan 7 2020, 5:08 AM
csigg added inline comments.
mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
353

See two lines above: every time the callback is invoked. This makes sure that all occurrences of gpu.all_reduce in the same gpu.function are replaced.

herhut added a subscriber: herhut.Jan 7 2020, 7:38 AM

I somehow only see a subset of changes now...

csigg updated this revision to Diff 236769.Jan 7 2020, 11:23 PM
csigg marked an inline comment as done.

Updating again, hopefully with all changes this time.

Unit tests: pass. 61127 tests passed, 0 failed and 728 were skipped.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

ftynse requested changes to this revision.Jan 8 2020, 6:07 AM
ftynse marked an inline comment as done.

Almost there, only minor things necessary to improve understanding.

The approach you took for building code is very similar to our motivation for declarative builders (aka EDSC), maybe it's time to reconsider those again.

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
151

I'll add that later anyway :)

269

I'd add an assertion that predicatedOpsFactory() does not return any values that it might expect to be passed to the continuation.

355

Could you please add a comment on why this approach is necessary? I understand that you need to operate on the GPUFuncOp level because you modify the GPUFuncOp itself, which would be incorrect from a nested operation (AllReduceOp). I don't understand why do you need to interrupt the walk after each rewrite. Is it because of some state invalidation?

This revision now requires changes to proceed.Jan 8 2020, 6:07 AM
csigg updated this revision to Diff 237293.Jan 10 2020, 5:34 AM
csigg marked 4 inline comments as done.

Address ftynse's review comments.

csigg added a comment.Jan 10 2020, 5:34 AM

The approach you took for building code is very similar to our motivation for declarative builders (aka EDSC), maybe it's time to reconsider those again.

Yes, I'm happy to change this to EDSC as a follow-up. For now I think it is easier to keep it similar to the existing lowering to NVVM.

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
151

Thanks!

355

Correct, the walk iterators get invalidated from the replace. Comment added.

Unit tests: unknown.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt

What is the status re EDSC discussion?
Were the pointers I sent offline enough to give a good picture / do you see how to followup on this to use (and maybe extend) EDSCs?

What is the status re EDSC discussion?
Were the pointers I sent offline enough to give a good picture / do you see how to followup on this to use (and maybe extend) EDSCs?

Yes, from scanning the doc and code it looks like this this shouldn't be too difficult. But I haven't actually tried.

Unit tests: unknown.

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt

csigg updated this revision to Diff 237993.Jan 14 2020, 8:33 AM

Fix build error after Value* -> Value change.

csigg updated this revision to Diff 237994.Jan 14 2020, 8:36 AM

Apply clang-format.

Unit tests: fail. 61801 tests passed, 1 failed and 781 were skipped.

failed: MLIR.Dialect/GPU/all-reduce.mlir

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Unit tests: fail. 61801 tests passed, 1 failed and 781 were skipped.

failed: MLIR.Dialect/GPU/all-reduce.mlir

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

What is the status re EDSC discussion?
Were the pointers I sent offline enough to give a good picture / do you see how to followup on this to use (and maybe extend) EDSCs?

Could you keep me in the loop on that? I have an even simpler proposal in mind that could reconcile EDSC with imperative builders.

I pointed Christian to the EDSC doc (https://mlir.llvm.org/docs/EDSC/) and the builder-api-test but otherwise nothing else was discussed.
I didn't look in the details of what would be involved in making the transition.
Can you share what the simpler proposal, than just using the declarative builders, is?

What is the status re EDSC discussion?
Were the pointers I sent offline enough to give a good picture / do you see how to followup on this to use (and maybe extend) EDSCs?

Could you keep me in the loop on that? I have an even simpler proposal in mind that could reconcile EDSC with imperative builders.

I'm interested to see some movement on this, at this point EDSC isn't totally endorsed by the whole team, and the uncontrolled use in the codebase isn't a great situation.

herhut accepted this revision.Jan 15 2020, 3:01 AM

I have reviewed this before (the basic approach has not changed). Thanks for adding more comments and tests.

LGTM.

Regarding the EDSC vs. use of locally grown alternative in this patch: I try to avoid doing things like the template for adding location. While it saves some characters, it makes code look unfamiliar, which is also a cost. Adding helpers to emit conditionals etc. on the other hand reduces repetition, so I see that as a benefit. Still, I am fine with landing this as it seems to be the plan to evolve it to some EDSC like thing once there is an endorsed one.

@mehdi_amini @nicolasvasilache @herhut Let's take some time and discuss builder APIs outside this diff (also involving @rriddle). My basic observations are that (1) writing structured IR, as in "with nested regions", looks unnecessarily complicated with builders, arguments are the same as those against goto-style programming; (2) a lot of IR construction internally happens in rewrite patterns, where location almost always remains the same, that of the matched operation root; (3) current EDSC APIs are contentious partly because it is unclear when reading the code when the function call creates the IR vs. when it's just a function call.

@csigg I'm fine accepting this, but the test is currently broken. Please fix and we'll be ready to land.

@herhut @ftynse @mehdi_amini sounds good, it's high time to rediscuss this and to make MLIR play nicely with metaprogramming in a way that feels comfortable to everyone.

csigg updated this revision to Diff 238844.Jan 17 2020, 12:05 PM

Do not wrap temporaries in ValueRange.

Unit tests: unknown.

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt

ftynse accepted this revision.Jan 20 2020, 1:16 AM
This revision is now accepted and ready to land.Jan 20 2020, 1:16 AM
csigg updated this revision to Diff 239074.Jan 20 2020, 4:29 AM

clang-format.

This revision was automatically updated to reflect the committed changes.

Unit tests: pass. 62015 tests passed, 0 failed and 783 were skipped.

clang-tidy: unknown.

clang-format: pass.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml