This is an archive of the discontinued LLVM Phabricator instance.

Differential D72129

[mlir] Add in-dialect lowering of gpu.all_reduce.
Closed, Public

Authored by csigg on Jan 3 2020, 12:53 AM.

Details

Reviewers

ftynse
nicolasvasilache
herhut

Commits

rG8b2eb7c494b2: [mlir] Add in-dialect lowering of gpu.all_reduce.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

csigg created this revision. Jan 3 2020, 12:53 AM

Herald added a project: Restricted Project. Jan 3 2020, 12:53 AM

Herald added subscribers: llvm-commits, lucyrfox, mgester and 9 others.

lebedev.ri retitled this revision from "Add in-dialect lowering of gpu.all_reduce." to "[mlir] Add in-dialect lowering of gpu.all_reduce.". Jan 3 2020, 1:00 AM

csigg added a reviewer: ftynse. Jan 3 2020, 1:02 AM

csigg marked an inline comment as done.

csigg added inline comments.

mlir/test/Dialect/GPU/all-reduce.mlir, lines 7–176 (On Diff #236000)

These CHECKs were generated from the output with:

sed -r \
-e 's|\t+//.*||' \
-e 's|%([a-z_0-9]+) = |%[[v\1:[a-z_0-9]+]] = |' \
-e 's|\(%([a-z_0-9]+): ([a-z_0-9]+)\):|(%[[v\1:[a-z_0-9]+]]: \2):|' \
-e 's|%([a-z_0-9]+)|%[[v\1]]|g' \
-e 's|bb([0-9]+)|bb[[#b\1]]|g' \
-e 's|^ |    // CHECK:|'

and manual edits from there.

merge_guards_bot added a subscriber: merge_guards_bot. Jan 3 2020, 1:03 AM

Unit tests: pass. 61127 tests passed, 0 failed and 728 were skipped.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: pass.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B43216: Diff 236000! Jan 3 2020, 1:03 AM

Wrong clang-tidy checks are annoying here, will make another pass later.

mlir/include/mlir/Dialect/GPU/GPUOps.td
163 ↗	(On Diff #236000)	Style nit: add whitespace around "+"
mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
29	Nit: `explicit` is unnecessary for multi-argument constructors.
39	Nit: can we use "neutral" terminology: shared memory -> workgroup memory, block -> workgroup ?
47	Should this be `store %warp_reduce, %workgroup_buffer[%warp_id]`? Same below for loads and stores
52	Should this be `cond_br %is_valid_warp` instead?
67	Big and mechanical change: `Value` is now value-based, please use it without pointers.
81	Style nit: add a comment `/*bitwidth=*/` for `32`
88	Not sure I follow here. `numThreadsWithSmallerWarpId` is the equivalent of `warp_id * warp_size`, or `floor(linear_thread_id, warp_size) * warp_size`. Subtracting that from `blockSize` will give you the number of threads in all warps starting from the current warp. From skimming through the consumer (createWarpReduce), it looks like what it expects is the number of threads in the _current_ warp.
104	This looks like you want EDSC :) + @nicolasvasilache
138	Maybe you can store the location and pass it as first argument? It's the same for all operations you create above.
150	Nit: let's have a named constant for the address space
201	Why only `F32`? Could you have `isa<FloatType>()` instead?
243	Is this lambda worse it?
mlir/test/Dialect/GPU/all-reduce.mlir
7 ↗	(On Diff #236000)	Is there a reason why you can't use `.*` as a regexp for names? Seeing the full regexp with ranges makes it hard to read the test. I think we also support capital letters in the names.
7–176 ↗	(On Diff #236000)	We have https://github.com/llvm/llvm-project/blob/master/mlir/utils/generate-test-checks.py, have you tried it?
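
For readers following the line 88 exchange about `activeWidth`, a small CPU-side simulation may help. The sketch below is not code from the patch. It replays, in plain C++ with integer sums, the index arithmetic quoted in the diff (`laneId`, `numThreadsWithSmallerSubgroupId`, `activeWidth`) and the two-level reduction described in the patch's doc comment. It assumes that a shuffle result is valid only when the source lane is below the given width, and it uses a hypothetical sentinel `kGarbage` for lanes that hold no data.

```cpp
#include <array>
#include <cassert>
#include <vector>

constexpr int kSubgroupSize = 32;
constexpr int kGarbage = 1000000; // hypothetical stand-in for undefined lane data
using Lanes = std::array<int, kSubgroupSize>;

// Lane id: position of an invocation within its subgroup.
int laneId(int invocationIdx) { return invocationIdx & (kSubgroupSize - 1); }

// Number of active invocations counted from the start of the current
// subgroup. Deliberately not clamped to kSubgroupSize.
int activeWidth(int invocationIdx, int workgroupSize) {
  int numThreadsWithSmallerSubgroupId = invocationIdx - laneId(invocationIdx);
  return workgroupSize - numThreadsWithSmallerSubgroupId;
}

// Butterfly (xor-shuffle) sum over one subgroup; returns the value in lane 0.
// Assumption: a shuffle is valid only if its source lane is below `width`.
int subgroupReduceSum(Lanes lanes, int width) {
  for (int offset = 1; offset < kSubgroupSize; offset <<= 1) {
    Lanes next = lanes;
    for (int lane = 0; lane < kSubgroupSize; ++lane) {
      int source = lane ^ offset;
      if (source < width) // skip reads from outside the active range
        next[lane] = lanes[lane] + lanes[source];
    }
    lanes = next;
  }
  return lanes[0];
}

// Two-level all-reduce: one partial result per subgroup goes into a buffer
// (standing in for workgroup memory), then the partials are reduced again.
int workgroupAllReduceSum(const std::vector<int> &values) {
  int workgroupSize = static_cast<int>(values.size());
  int numSubgroups = (workgroupSize + kSubgroupSize - 1) / kSubgroupSize;
  assert(numSubgroups >= 1 && numSubgroups <= kSubgroupSize);
  Lanes buffer;
  buffer.fill(kGarbage);
  for (int subgroup = 0; subgroup < numSubgroups; ++subgroup) {
    int first = subgroup * kSubgroupSize;
    Lanes lanes;
    lanes.fill(kGarbage);
    for (int lane = 0; lane < kSubgroupSize && first + lane < workgroupSize;
         ++lane)
      lanes[lane] = values[first + lane];
    buffer[subgroup] =
        subgroupReduceSum(lanes, activeWidth(first, workgroupSize));
  }
  // Only the first numSubgroups buffer entries hold partial results.
  return subgroupReduceSum(buffer, numSubgroups);
}
```

With a workgroup of 70 invocations, `activeWidth` evaluates to 70, 38 and 6 for the three subgroups. Only the last value is below the subgroup size, and that is the only case in which the validity check masks out any lane, so clamping the larger values would not change the result in this model.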

nicolasvasilache added a reviewer: nicolasvasilache. Jan 6 2020, 10:13 AM

Update reflecting review comments from ftynse.

Thanks a lot for the review, Alex!

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
29	Removed. Nit the nit: `explicit` does prevent copy-initialization from initializer list.
88	Correct. The activeWidth does not need to be clamped to subgroup width though. I added two comments.
150	Punted by just adding a local variable.
243	Haha, I like your comment. Very subtle ;-)
mlir/test/Dialect/GPU/all-reduce.mlir
7 ↗	(On Diff #236000)	Replaced with generate-test-checks.py result.
7–176 ↗	(On Diff #236000)	Works very well, I wish I knew about this (well, Mehdi mentioned it but I couldn't find it).
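
The remark on line 29 that `explicit` still matters for multi-argument constructors can be checked with a short example. The types `ImplicitPair` and `ExplicitPair` and the helper functions are hypothetical and unrelated to the patch. The two commented-out lines are the ones a conforming compiler is expected to reject.

```cpp
#include <cassert>

// Hypothetical stand-ins; not the classes from the patch.
struct ImplicitPair {
  ImplicitPair(int width, int lanes) : width(width), lanes(lanes) {}
  int width;
  int lanes;
};

struct ExplicitPair {
  explicit ExplicitPair(int width, int lanes) : width(width), lanes(lanes) {}
  int width;
  int lanes;
};

int implicitBits(ImplicitPair p) { return p.width * p.lanes; }
int explicitBits(ExplicitPair p) { return p.width * p.lanes; }

int demo() {
  ImplicitPair a = {32, 4};              // OK: copy-list-initialization
  int viaBraces = implicitBits({32, 2}); // OK: argument built from braces
  // ExplicitPair b = {32, 4};           // error: selected constructor is explicit
  // explicitBits({32, 2});              // error for the same reason
  ExplicitPair b{32, 4};                 // OK: direct-list-initialization
  return implicitBits(a) + viaBraces + explicitBits(b);
}
```

Whether that restriction is wanted is a style decision. In the thread the keyword was removed as requested.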

Unit tests: pass. 61127 tests passed, 0 failed and 728 were skipped.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B43364: Diff 236427! Jan 6 2020, 11:55 AM

rriddle added inline comments. Jan 6 2020, 12:08 PM

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
162	`auto op = reduceOp.op()` to avoid recomputing below.
300	Remove `llvm::` from most of the things in this file. They are re-exported in the mlir namespace already.
352	When does this ever fail?

Addressing rriddle's review comments.

Unit tests: unknown.

clang-tidy: unknown.

clang-format: unknown.

Build artifacts: diff.json, console-log.txt

Harbormaster failed remote builds in B43416: Diff 236553! Jan 7 2020, 5:05 AM

csigg marked 4 inline comments as done. Jan 7 2020, 5:08 AM

csigg added inline comments.

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
352	See two lines above: every time the callback is invoked. This makes sure that all occurrences of gpu.all_reduce in the same gpu.function are replaced.

herhut added a subscriber: herhut. Jan 7 2020, 7:38 AM

I somehow only see a subset of changes now...

Updating again, hopefully with all changes this time.

Unit tests: pass. 61127 tests passed, 0 failed and 728 were skipped.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B43489: Diff 236769! Jan 7 2020, 11:47 PM

Almost there, only minor things necessary to improve understanding.

The approach you took for building code is very similar to our motivation for declarative builders (aka EDSC), maybe it's time to reconsider those again.

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
150	I'll add that later anyway :)
268	I'd add an assertion that `predicatedOpsFactory()` does not return any values that it might expect to be passed to the continuation.
354	Could you please add a comment on why this approach is necessary? I understand that you need to operate on the GPUFuncOp level because you modify the GPUFuncOp itself, which would be incorrect from a nested operation (AllReduceOp). I don't understand why you need to interrupt the walk after each rewrite. Is it because of some state invalidation?

This revision now requires changes to proceed. Jan 8 2020, 6:07 AM

Address ftynse's review comments.

> The approach you took for building code is very similar to our motivation for declarative builders (aka EDSC), maybe it's time to reconsider those again.

Yes, I'm happy to change this to EDSC as a follow-up. For now I think it is easier to keep it similar to the existing lowering to NVVM.

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
150	Thanks!
354	Correct, the walk iterators get invalidated from the replace. Comment added.
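
The restart-after-rewrite approach discussed for line 354 can be illustrated without MLIR. The sketch below is an analogy only: a `std::vector` of op names stands in for the function body, and replacing one entry with several invalidates iterators in much the same way the thread says the walk is invalidated by the replacement. The names `OpList`, `rewriteFirstAllReduce` and `rewriteAllReduces` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

using OpList = std::vector<std::string>;

// Rewrites the first "gpu.all_reduce" in `ops` and returns true, or returns
// false if none is left. The search stops right after the first rewrite,
// mirroring an interrupted walk.
bool rewriteFirstAllReduce(OpList &ops) {
  auto it = std::find(ops.begin(), ops.end(), std::string("gpu.all_reduce"));
  if (it == ops.end())
    return false;
  const OpList lowered = {"gpu.shuffle", "gpu.barrier", "load"};
  it = ops.erase(it);                             // older iterators are now stale
  ops.insert(it, lowered.begin(), lowered.end()); // may reallocate the vector
  return true;
}

// Restarts the search from the beginning until no occurrence remains, so
// that every occurrence is replaced without reusing a stale iterator.
int rewriteAllReduces(OpList &ops) {
  int numRewritten = 0;
  while (rewriteFirstAllReduce(ops))
    ++numRewritten;
  return numRewritten;
}
```

Restarting costs a repeated scan per rewrite, which is usually acceptable when occurrences are few. It trades some efficiency for not having to reason about which iterators survive a mutation.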

Unit tests: unknown.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt

Harbormaster failed remote builds in B43673: Diff 237293! Jan 10 2020, 5:47 AM

What is the status re EDSC discussion?
Were the pointers I sent offline enough to give a good picture / do you see how to followup on this to use (and maybe extend) EDSCs?

Herald added a subscriber: aartbik. Jan 10 2020, 4:07 PM

In D72129#1815174, @nicolasvasilache wrote:

> What is the status re EDSC discussion?
> Were the pointers I sent offline enough to give a good picture / do you see how to followup on this to use (and maybe extend) EDSCs?

Yes, from scanning the doc and code it looks like this shouldn't be too difficult. But I haven't actually tried.

Rebase.

Herald added a subscriber: liufengdb. Jan 14 2020, 12:44 AM

csigg added a reviewer: herhut. Jan 14 2020, 12:46 AM

Unit tests: unknown.

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt

Harbormaster failed remote builds in B43902: Diff 237871! Jan 14 2020, 12:55 AM

Fix build error after Value* -> Value change.

Apply clang-format.

Unit tests: fail. 61801 tests passed, 1 failed and 781 were skipped.

failed: MLIR.Dialect/GPU/all-reduce.mlir

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B43947: Diff 237993! Jan 14 2020, 8:50 AM

Unit tests: fail. 61801 tests passed, 1 failed and 781 were skipped.

failed: MLIR.Dialect/GPU/all-reduce.mlir

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B43948: Diff 237994! Jan 14 2020, 8:50 AM

> What is the status re EDSC discussion?
> Were the pointers I sent offline enough to give a good picture / do you see how to followup on this to use (and maybe extend) EDSCs?

Could you keep me in the loop on that? I have an even simpler proposal in mind that could reconcile EDSC with imperative builders.

I pointed Christian to the EDSC doc (https://mlir.llvm.org/docs/EDSC/) and the builder-api-test but otherwise nothing else was discussed.
I didn't look into the details of what would be involved in making the transition.
Can you share what the proposal is that would be simpler than just using the declarative builders?

In D72129#1820630, @ftynse wrote:

>> What is the status re EDSC discussion?
>> Were the pointers I sent offline enough to give a good picture / do you see how to followup on this to use (and maybe extend) EDSCs?
>
> Could you keep me in the loop on that? I have an even simpler proposal in mind that could reconcile EDSC with imperative builders.

I'm interested to see some movement on this, at this point EDSC isn't totally endorsed by the whole team, and the uncontrolled use in the codebase isn't a great situation.

I have reviewed this before (the basic approach has not changed). Thanks for adding more comments and tests.

LGTM.

Regarding the EDSC vs. use of locally grown alternative in this patch: I try to avoid doing things like the template for adding location. While it saves some characters, it makes code look unfamiliar, which is also a cost. Adding helpers to emit conditionals etc. on the other hand reduces repetition, so I see that as a benefit. Still, I am fine with landing this as it seems to be the plan to evolve it to some EDSC like thing once there is an endorsed one.

@mehdi_amini @nicolasvasilache @herhut Let's take some time and discuss builder APIs outside this diff (also involving @rriddle). My basic observations are that (1) writing structured IR, as in "with nested regions", looks unnecessarily complicated with builders, arguments are the same as those against goto-style programming; (2) a lot of IR construction internally happens in rewrite patterns, where location almost always remains the same, that of the matched operation root; (3) current EDSC APIs are contentious partly because it is unclear when reading the code when the function call creates the IR vs. when it's just a function call.
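
Observation (2), that the location usually stays the same within one rewrite, is what the patch's local `create` shortcut relies on. A minimal stand-alone sketch of that idea follows. `Location`, `AddOp`, `Builder` and `LocBoundBuilder` are hypothetical mock types written for this illustration, not MLIR classes, and the forwarding style shown is one possible choice rather than the patch's exact code.

```cpp
#include <cassert>
#include <utility>

// Hypothetical mock types; none of these are MLIR classes.
struct Location {
  int line;
};

struct AddOp {
  Location loc;
  int lhs;
  int rhs;
};

// Mimics a builder whose create<OpT>() takes the location first.
struct Builder {
  template <typename OpT, typename... Args>
  OpT create(Location loc, Args &&...args) {
    return OpT{loc, std::forward<Args>(args)...};
  }
};

// Binds one location so that callers do not have to repeat it.
class LocBoundBuilder {
public:
  LocBoundBuilder(Builder &builder, Location loc)
      : builder(builder), loc(loc) {}

  template <typename OpT, typename... Args>
  OpT create(Args &&...args) {
    return builder.create<OpT>(loc, std::forward<Args>(args)...);
  }

private:
  Builder &builder;
  Location loc;
};
```

The trade-off raised in the thread still applies: a wrapper like this saves repetition, but a call such as `bound.create<AddOp>(1, 2)` no longer shows where the location comes from.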

@csigg I'm fine accepting this, but the test is currently broken. Please fix and we'll be ready to land.

@herhut @ftynse @mehdi_amini sounds good, it's high time to rediscuss this and to make MLIR play nicely with metaprogramming in a way that feels comfortable to everyone.

Do not wrap temporaries in ValueRange.

Unit tests: unknown.

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt

Harbormaster failed remote builds in B44300: Diff 238844! Jan 17 2020, 12:18 PM

ftynse accepted this revision. Jan 20 2020, 1:16 AM

This revision is now accepted and ready to land. Jan 20 2020, 1:16 AM

clang-format.

Closed by commit rG8b2eb7c494b2: [mlir] Add in-dialect lowering of gpu.all_reduce. (authored by csigg). Jan 20 2020, 4:44 AM

This revision was automatically updated to reflect the committed changes.

Unit tests: pass. 62015 tests passed, 0 failed and 783 were skipped.

clang-tidy: unknown.

clang-format: pass.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster completed remote builds in B44390: Diff 239074. Jan 20 2020, 4:54 AM

Revision Contents

Path: mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
Size: 20 lines

Diff 236553

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp

[20 lines not shown]

using namespace mlir;		using namespace mlir;

namespace {		namespace {

struct GpuAllReduceRewriter {		struct GpuAllReduceRewriter {
using AccumulatorFactory = std::function<Value (Value , Value )>;		using AccumulatorFactory = std::function<Value (Value , Value )>;

GpuAllReduceRewriter(gpu::GPUFuncOp funcOp_,		GpuAllReduceRewriter(gpu::GPUFuncOp funcOp_,
		ftynse: Nit: `explicit` is unnecessary for multi-argument constructors.
		csigg (author): Removed. Nit the nit: `explicit` does prevent copy-initialization from initializer list.
gpu::AllReduceOp reduceOp_,		gpu::AllReduceOp reduceOp_,
PatternRewriter &rewriter_)		PatternRewriter &rewriter_)
: funcOp(funcOp_), reduceOp(reduceOp_), rewriter(rewriter_),		: funcOp(funcOp_), reduceOp(reduceOp_), rewriter(rewriter_),
loc(reduceOp.getLoc()), valueType(reduceOp.value()->getType()),		loc(reduceOp.getLoc()), valueType(reduceOp.value()->getType()),
indexType(IndexType::get(reduceOp.getContext())),		indexType(IndexType::get(reduceOp.getContext())),
int32Type(IntegerType::get(/*width=*/ 32, reduceOp.getContext())) {}		int32Type(IntegerType::get(/*width=*/ 32, reduceOp.getContext())) {}

/// Creates an all_reduce across the workgroup.		/// Creates an all_reduce across the workgroup.
///		///
/// First reduce the elements within a subgroup. The first invocation of each subgroup		/// First reduce the elements within a subgroup. The first invocation of each subgroup
		ftynse: Nit: can we use "neutral" terminology: shared memory -> workgroup memory, block -> workgroup ?
/// writes the intermediate result to workgroup memory. After synchronizing the		/// writes the intermediate result to workgroup memory. After synchronizing the
/// workgroup, the first subgroup reduces the values from workgroup memory. The result		/// workgroup, the first subgroup reduces the values from workgroup memory. The result
/// is broadcasted to all invocations through workgroup memory.		/// is broadcasted to all invocations through workgroup memory.
///		///
/// %subgroup_reduce = `createSubgroupReduce(%operand)`		/// %subgroup_reduce = `createSubgroupReduce(%operand)`
/// cond_br %is_first_lane, ^then1, ^continue1		/// cond_br %is_first_lane, ^then1, ^continue1
/// ^then1:		/// ^then1:
/// store %subgroup_reduce, %workgroup_buffer[%subgroup_id]		/// store %subgroup_reduce, %workgroup_buffer[%subgroup_id]
		ftynse: Should this be `store %warp_reduce, %workgroup_buffer[%warp_id]`? Same below for loads and stores
/// br ^continue1		/// br ^continue1
/// ^continue1:		/// ^continue1:
/// gpu.barrier		/// gpu.barrier
/// %is_valid_subgroup = cmpi "slt" %invocation_idx, %num_subgroups		/// %is_valid_subgroup = cmpi "slt" %invocation_idx, %num_subgroups
/// cond_br %is_valid_subgroup, ^then2, ^continue2		/// cond_br %is_valid_subgroup, ^then2, ^continue2
		ftynse: Should this be `cond_br %is_valid_warp` instead?
/// ^then2:		/// ^then2:
/// %partial_reduce = load %workgroup_buffer[%invocation_idx]		/// %partial_reduce = load %workgroup_buffer[%invocation_idx]
/// %all_reduce = `createSubgroupReduce(%partial_reduce)`		/// %all_reduce = `createSubgroupReduce(%partial_reduce)`
/// store %all_reduce, %workgroup_buffer[%zero]		/// store %all_reduce, %workgroup_buffer[%zero]
/// llvm.br ^continue2		/// llvm.br ^continue2
/// ^continue2:		/// ^continue2:
/// gpu.barrier		/// gpu.barrier
/// %result = load %workgroup_buffer[%zero]		/// %result = load %workgroup_buffer[%zero]
/// return %result		/// return %result
///		///
void rewrite() {		void rewrite() {
rewriter.setInsertionPoint(reduceOp);		rewriter.setInsertionPoint(reduceOp);

// Compute linear invocation index and workgroup size.		// Compute linear invocation index and workgroup size.
Value dimX = getDimOp<gpu::BlockDimOp>("x");		Value dimX = getDimOp<gpu::BlockDimOp>("x");
		ftynse: Big and mechanical change: `Value` is now value-based, please use it without pointers.
Value dimY = getDimOp<gpu::BlockDimOp>("y");		Value dimY = getDimOp<gpu::BlockDimOp>("y");
Value dimZ = getDimOp<gpu::BlockDimOp>("z");		Value dimZ = getDimOp<gpu::BlockDimOp>("z");
Value tidX = getDimOp<gpu::ThreadIdOp>("x");		Value tidX = getDimOp<gpu::ThreadIdOp>("x");
Value tidY = getDimOp<gpu::ThreadIdOp>("y");		Value tidY = getDimOp<gpu::ThreadIdOp>("y");
Value tidZ = getDimOp<gpu::ThreadIdOp>("z");		Value tidZ = getDimOp<gpu::ThreadIdOp>("z");
Value tmp1 = create<MulIOp>(int32Type, tidZ, dimY);		Value tmp1 = create<MulIOp>(int32Type, tidZ, dimY);
Value tmp2 = create<AddIOp>(int32Type, tmp1, tidY);		Value tmp2 = create<AddIOp>(int32Type, tmp1, tidY);
Value tmp3 = create<MulIOp>(int32Type, tmp2, dimX);		Value tmp3 = create<MulIOp>(int32Type, tmp2, dimX);
Value tmp4 = create<MulIOp>(int32Type, dimX, dimY);		Value tmp4 = create<MulIOp>(int32Type, dimX, dimY);
Value invocationIdx = create<AddIOp>(int32Type, tmp3, tidX);		Value invocationIdx = create<AddIOp>(int32Type, tmp3, tidX);
Value workgroupSize = create<MulIOp>(int32Type, tmp4, dimZ);		Value workgroupSize = create<MulIOp>(int32Type, tmp4, dimZ);

// Compute lane id (invocation id within the subgroup).		// Compute lane id (invocation id within the subgroup).
Value subgroupMask = create<ConstantIntOp>(kSubgroupSize - 1, int32Type);		Value subgroupMask = create<ConstantIntOp>(kSubgroupSize - 1, int32Type);
		ftynse: Style nit: add a comment `/*bitwidth=*/` for `32`
Value laneId = create<AndOp>(invocationIdx, subgroupMask);		Value laneId = create<AndOp>(invocationIdx, subgroupMask);
Value isFirstLane = create<CmpIOp>(CmpIPredicate::eq, laneId,		Value isFirstLane = create<CmpIOp>(CmpIPredicate::eq, laneId,
create<ConstantIntOp>(0, int32Type));		create<ConstantIntOp>(0, int32Type));

Value numThreadsWithSmallerSubgroupId = create<SubIOp>(invocationIdx, laneId);		Value numThreadsWithSmallerSubgroupId = create<SubIOp>(invocationIdx, laneId);
// The number of active invocations starting from the current subgroup.		// The number of active invocations starting from the current subgroup.
// The consumers do not require the value to be clamped to the size of the		// The consumers do not require the value to be clamped to the size of the
		ftynse: Not sure I follow here. `numThreadsWithSmallerWarpId` is the equivalent of `warp_id * warp_size`, or `floor(linear_thread_id, warp_size) * warp_size`. Subtracting that from `blockSize` will give you the number of threads in all warps starting from the current warp. From skimming through the consumer (createWarpReduce), it looks like what it expects is the number of threads in the _current_ warp.
		csigg (author): Correct. The activeWidth does not need to be clamped to subgroup width though. I added two comments.
// subgroup.		// subgroup.
Value activeWidth =		Value activeWidth =
create<SubIOp>(workgroupSize, numThreadsWithSmallerSubgroupId);		create<SubIOp>(workgroupSize, numThreadsWithSmallerSubgroupId);

// Create factory for op which accumulates to values.		// Create factory for op which accumulates to values.
AccumulatorFactory accumFactory = getFactory();		AccumulatorFactory accumFactory = getFactory();
assert(accumFactory && "failed to create accumulator factory");		assert(accumFactory && "failed to create accumulator factory");

// Reduce elements within each subgroup to produce the intermediate results.		// Reduce elements within each subgroup to produce the intermediate results.
Value subgroupReduce =		Value subgroupReduce =
createSubgroupReduce(activeWidth, laneId, reduceOp.value(), accumFactory);		createSubgroupReduce(activeWidth, laneId, reduceOp.value(), accumFactory);

// Add workgroup buffer to parent function for intermediate result.		// Add workgroup buffer to parent function for intermediate result.
Value buffer = createWorkgroupBuffer();		Value buffer = createWorkgroupBuffer();

// Write the intermediate results to workgroup memory, using the first lane of		// Write the intermediate results to workgroup memory, using the first lane of
		ftynse: This looks like you want EDSC :) + @nicolasvasilache
// each subgroup.		// each subgroup.
createPredicatedBlock(isFirstLane, [&] {		createPredicatedBlock(isFirstLane, [&] {
Value subgroupId = getDivideBySubgroupSize(invocationIdx);		Value subgroupId = getDivideBySubgroupSize(invocationIdx);
Value index = create<IndexCastOp>(indexType, subgroupId);		Value index = create<IndexCastOp>(indexType, subgroupId);
create<StoreOp>(subgroupReduce, buffer, index);		create<StoreOp>(subgroupReduce, buffer, index);
});		});
create<gpu::BarrierOp>();		create<gpu::BarrierOp>();

[17 lines not shown]	void rewrite() {
// Synchronize workgroup and load result from workgroup memory.		// Synchronize workgroup and load result from workgroup memory.
create<gpu::BarrierOp>();		create<gpu::BarrierOp>();
Value result = create<LoadOp>(valueType, buffer, zero);		Value result = create<LoadOp>(valueType, buffer, zero);

rewriter.replaceOp(reduceOp, result);		rewriter.replaceOp(reduceOp, result);
}		}

private:		private:
// Shortcut to create an op from rewriter using loc as the first argument.		// Shortcut to create an op from rewriter using loc as the first argument.
		ftynse: Maybe you can store the location and pass it as first argument? It's the same for all operations you create above.
template <typename T, typename... Args> T create(Args... args) {		template <typename T, typename... Args> T create(Args... args) {
return rewriter.create<T>(loc, std::forward<Args>(args)...);		return rewriter.create<T>(loc, std::forward<Args>(args)...);
}		}

// Creates dimension op of type T, with the result casted to int32.		// Creates dimension op of type T, with the result casted to int32.
template <typename T> Value getDimOp(StringRef dimension) {		template <typename T> Value getDimOp(StringRef dimension) {
Value dim = create<T>(indexType, rewriter.getStringAttr(dimension));		Value dim = create<T>(indexType, rewriter.getStringAttr(dimension));
return create<IndexCastOp>(int32Type, dim);		return create<IndexCastOp>(int32Type, dim);
}		}

/// Adds type to funcOp's workgroup attributions.		/// Adds type to funcOp's workgroup attributions.
Value createWorkgroupBuffer() {		Value createWorkgroupBuffer() {
		ftynse: Nit: let's have a named constant for the address space
		csigg (author): Punted by just adding a local variable.
		ftynse: I'll add that later anyway :)
		csigg (author): Thanks!
int workgroupMemoryAddressSpace = 3;		int workgroupMemoryAddressSpace = 3;
auto bufferType =		auto bufferType =
MemRefType::get({kSubgroupSize}, valueType, ArrayRef<AffineMap>{}, workgroupMemoryAddressSpace);		MemRefType::get({kSubgroupSize}, valueType, ArrayRef<AffineMap>{}, workgroupMemoryAddressSpace);
return funcOp.addWorkgroupAttribution(bufferType);		return funcOp.addWorkgroupAttribution(bufferType);
}		}

/// Returns an accumulator factory using either the op attribute or the body		/// Returns an accumulator factory using either the op attribute or the body
/// region.		/// region.
AccumulatorFactory getFactory() {		AccumulatorFactory getFactory() {
if (!reduceOp.body().empty())		auto body = reduceOp.body();
return getFactory(reduceOp.body());		if (!body.empty())
if (reduceOp.op())		return getFactory(body);
		rriddle: `auto op = reduceOp.op()` to avoid recomputing below.
return getFactory(*reduceOp.op());		auto opAttr = reduceOp.op();
		if (opAttr)
		return getFactory(*opAttr);
return AccumulatorFactory();		return AccumulatorFactory();
}		}

/// Returns an accumulator factory that clones the body. The body's entry		/// Returns an accumulator factory that clones the body. The body's entry
/// block is expected to have 2 arguments. The gpu.yield return the		/// block is expected to have 2 arguments. The gpu.yield return the
/// accumulated value of the same type.		/// accumulated value of the same type.
AccumulatorFactory getFactory(Region &body) {		AccumulatorFactory getFactory(Region &body) {
return AccumulatorFactory([&](Value lhs, Value rhs) {		return AccumulatorFactory([&](Value lhs, Value rhs) {
[9 lines not shown]	return AccumulatorFactory([&](Value lhs, Value rhs) {

// Add branch before inserted body, into body.		// Add branch before inserted body, into body.
block = block->getNextNode();		block = block->getNextNode();
create<BranchOp>(block, ValueRange());		create<BranchOp>(block, ValueRange());

// Replace all gpu.yield ops with branch out of body.		// Replace all gpu.yield ops with branch out of body.
for (; block != split; block = block->getNextNode()) {		for (; block != split; block = block->getNextNode()) {
Operation *terminator = block->getTerminator();		Operation *terminator = block->getTerminator();
if (!llvm::isa<gpu::YieldOp>(terminator))		if (!isa<gpu::YieldOp>(terminator))
continue;		continue;
rewriter.setInsertionPointToEnd(block);		rewriter.setInsertionPointToEnd(block);
rewriter.replaceOpWithNewOp<BranchOp>(		rewriter.replaceOpWithNewOp<BranchOp>(
terminator, split, ValueRange(terminator->getOperand(0)));		terminator, split, ValueRange(terminator->getOperand(0)));
}		}

// Return accumulator result.		// Return accumulator result.
rewriter.setInsertionPointToStart(split);		rewriter.setInsertionPointToStart(split);
return split->addArgument(lhs->getType());		return split->addArgument(lhs->getType());
});		});
		ftynse: Why only `F32`? Could you have `isa<FloatType>()` instead?
}		}

/// Returns an accumulator factory that creates an op specified by opName.		/// Returns an accumulator factory that creates an op specified by opName.
AccumulatorFactory getFactory(StringRef opName) {		AccumulatorFactory getFactory(StringRef opName) {
bool isFloatingPoint = valueType.isa<FloatType>();		bool isFloatingPoint = valueType.isa<FloatType>();
if (opName == "add")		if (opName == "add")
return isFloatingPoint ? getFactory<AddFOp>() : getFactory<AddIOp>();		return isFloatingPoint ? getFactory<AddFOp>() : getFactory<AddIOp>();
if (opName == "mul")		if (opName == "mul")
[25 lines not shown]	void createIf(Value condition, ThenOpsFactory &&thenOpsFactory,
ElseOpsFactory &&elseOpsFactory) {		ElseOpsFactory &&elseOpsFactory) {
Block *currentBlock = rewriter.getInsertionBlock();		Block *currentBlock = rewriter.getInsertionBlock();
auto currentPoint = rewriter.getInsertionPoint();		auto currentPoint = rewriter.getInsertionPoint();

Block *thenBlock = rewriter.splitBlock(currentBlock, currentPoint);		Block *thenBlock = rewriter.splitBlock(currentBlock, currentPoint);
Block *elseBlock = rewriter.splitBlock(thenBlock, thenBlock->begin());		Block *elseBlock = rewriter.splitBlock(thenBlock, thenBlock->begin());
Block *continueBlock = rewriter.splitBlock(elseBlock, elseBlock->begin());		Block *continueBlock = rewriter.splitBlock(elseBlock, elseBlock->begin());

rewriter.setInsertionPointToEnd(currentBlock);		rewriter.setInsertionPointToEnd(currentBlock);
		ftynse: Is this lambda worse it?
		csigg (author): Haha, I like your comment. Very subtle ;-)
create<CondBranchOp>(condition, thenBlock,		create<CondBranchOp>(condition, thenBlock,
/*trueOperands=*/ArrayRef<Value >(), elseBlock,		/*trueOperands=*/ArrayRef<Value >(), elseBlock,
/*falseOperands=*/ArrayRef<Value >());		/*falseOperands=*/ArrayRef<Value >());

rewriter.setInsertionPointToStart(thenBlock);		rewriter.setInsertionPointToStart(thenBlock);
ValueRange thenOperands = thenOpsFactory();		ValueRange thenOperands = thenOpsFactory();
create<BranchOp>(continueBlock, thenOperands);		create<BranchOp>(continueBlock, thenOperands);

rewriter.setInsertionPointToStart(elseBlock);		rewriter.setInsertionPointToStart(elseBlock);
ValueRange elseOperands = elseOpsFactory();		ValueRange elseOperands = elseOpsFactory();
create<BranchOp>(continueBlock, elseOperands);		create<BranchOp>(continueBlock, elseOperands);

assert(thenOperands.size() == elseOperands.size());		assert(thenOperands.size() == elseOperands.size());
rewriter.setInsertionPointToStart(continueBlock);		rewriter.setInsertionPointToStart(continueBlock);
for (auto operand : thenOperands)		for (auto operand : thenOperands)
continueBlock->addArgument(operand->getType());		continueBlock->addArgument(operand->getType());
}		}

/// Shortcut for createIf with empty else block and no block operands.		/// Shortcut for createIf with empty else block and no block operands.
template <typename Factory>		template <typename Factory>
void createPredicatedBlock(Value condition, Factory &&predicatedOpsFactory) {		void createPredicatedBlock(Value condition, Factory &&predicatedOpsFactory) {
createIf(		createIf(
condition,		condition,
[&] {		[&] {
predicatedOpsFactory();		predicatedOpsFactory();
		ftynse: I'd add an assertion that `predicatedOpsFactory()` does not return any values that it might expect to be passed to the continuation.
return ArrayRef<Value >();		return ArrayRef<Value >();
},		},
[&] { return ArrayRef<Value >(); });		[&] { return ArrayRef<Value >(); });
}		}

/// Creates a reduction across the first activeWidth lanes of a subgroup, or		/// Creates a reduction across the first activeWidth lanes of a subgroup, or
/// the entire subgroup if activeWidth is larger than the subgroup width.		/// the entire subgroup if activeWidth is larger than the subgroup width.
/// The first lane returns the result, all others return values are undefined.		/// The first lane returns the result, all others return values are undefined.
Show All 15 Lines	createIf(
// in the first lane.		// in the first lane.
for (int i = 1; i < kSubgroupSize; i <<= 1) {		for (int i = 1; i < kSubgroupSize; i <<= 1) {
Value offset = create<ConstantIntOp>(i, int32Type);		Value offset = create<ConstantIntOp>(i, int32Type);
auto shuffleOp = create<gpu::ShuffleOp>(		auto shuffleOp = create<gpu::ShuffleOp>(
shuffleType, value, offset, activeWidth, xorAttr);		shuffleType, value, offset, activeWidth, xorAttr);
// Skip the accumulation if the shuffle op read from a lane outside		// Skip the accumulation if the shuffle op read from a lane outside
// of the active range.		// of the active range.
createIf(		createIf(
shuffleOp.getResult(1),		shuffleOp.getResult(1),
		rriddle: Remove `llvm::` from most of the things in this file. They are re-exported in the mlir namespace already.
[&] {		[&] {
return llvm::SmallVector<Value , 1>{		return SmallVector<Value , 1>{
accumFactory(value, shuffleOp.getResult(0))};		accumFactory(value, shuffleOp.getResult(0))};
},		},
[&] { return llvm::makeArrayRef(value); });		[&] { return llvm::makeArrayRef(value); });
value = rewriter.getInsertionBlock()->getArgument(0);		value = rewriter.getInsertionBlock()->getArgument(0);
}		}
return llvm::SmallVector<Value , 1>{value};		return SmallVector<Value , 1>{value};
},		},
// Generate a reduction over the entire subgroup. This is a specialization		// Generate a reduction over the entire subgroup. This is a specialization
// of the above reduction with unconditional accumulation.		// of the above reduction with unconditional accumulation.
[&] {		[&] {
Value value = operand;		Value value = operand;
for (int i = 1; i < kSubgroupSize; i <<= 1) {		for (int i = 1; i < kSubgroupSize; i <<= 1) {
Value offset = create<ConstantIntOp>(i, int32Type);		Value offset = create<ConstantIntOp>(i, int32Type);
auto shuffleOp = create<gpu::ShuffleOp>(shuffleType, value,		auto shuffleOp = create<gpu::ShuffleOp>(shuffleType, value,
offset, subgroupSize, xorAttr);		offset, subgroupSize, xorAttr);
value = accumFactory(value, shuffleOp.getResult(0));		value = accumFactory(value, shuffleOp.getResult(0));
}		}
return llvm::SmallVector<Value , 1>{value};		return SmallVector<Value , 1>{value};
});		});
return rewriter.getInsertionBlock()->getArgument(0);		return rewriter.getInsertionBlock()->getArgument(0);
}		}

/// Returns value divided by the subgroup size (i.e. 32).		/// Returns value divided by the subgroup size (i.e. 32).
Value getDivideBySubgroupSize(Value value) {		Value getDivideBySubgroupSize(Value value) {
Value subgroupSize = create<ConstantIntOp>(kSubgroupSize, int32Type);		Value subgroupSize = create<ConstantIntOp>(kSubgroupSize, int32Type);
return create<SignedDivIOp>(int32Type, value, subgroupSize);		return create<SignedDivIOp>(int32Type, value, subgroupSize);
Show All 12 Lines
};		};

struct GpuAllReduceConversion : public RewritePattern {		struct GpuAllReduceConversion : public RewritePattern {
explicit GpuAllReduceConversion(MLIRContext *context)		explicit GpuAllReduceConversion(MLIRContext *context)
: RewritePattern(gpu::GPUFuncOp::getOperationName(), 1, context) {}		: RewritePattern(gpu::GPUFuncOp::getOperationName(), 1, context) {}

PatternMatchResult matchAndRewrite(Operation *op,		PatternMatchResult matchAndRewrite(Operation *op,
PatternRewriter &rewriter) const override {		PatternRewriter &rewriter) const override {
auto funcOp = llvm::cast<gpu::GPUFuncOp>(op);		auto funcOp = cast<gpu::GPUFuncOp>(op);
auto callback = [&](gpu::AllReduceOp reduceOp) {		auto callback = [&](gpu::AllReduceOp reduceOp) {
GpuAllReduceRewriter(funcOp, reduceOp, rewriter).rewrite();		GpuAllReduceRewriter(funcOp, reduceOp, rewriter).rewrite();
return WalkResult::interrupt();		return WalkResult::interrupt();
		rriddle: When does this ever fail?
		csigg (author): See two lines above: every time the callback is invoked. This makes sure that all occurrences of gpu.all_reduce in the same gpu.function are replaced.
};		};
while (funcOp.walk(callback).wasInterrupted()) {		while (funcOp.walk(callback).wasInterrupted()) {
		ftynse: Could you please add a comment on why this approach is necessary? I understand that you need to operate on the GPUFuncOp level because you modify the GPUFuncOp itself, which would be incorrect from a nested operation (AllReduceOp). I don't understand why you need to interrupt the walk after each rewrite. Is it because of some state invalidation?
		csigg (author): Correct, the walk iterators get invalidated by the replace. Comment added.
}		}
return matchSuccess();		return matchSuccess();
}		}
};		};
} // namespace		} // namespace
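
Editor's note: the XOR shuffle loop in the subgroup reduction above can be hard to follow from the diff alone. The following is a minimal host-side sketch of the same butterfly pattern, not the MLIR lowering itself. It assumes a 32-lane subgroup, integer addition as the accumulation, and that a shuffle result counts as valid only when the source lane is below `activeWidth`; `simulateSubgroupReduce` is a hypothetical name used for illustration.

```cpp
#include <array>

// Assumed subgroup width, matching the "i.e. 32" in the diff's comments.
constexpr int kSubgroupSize = 32;

// Host-side model of the butterfly reduction: lanes[i] is the value held by
// lane i. In each step, lane i reads from lane (i ^ offset) and accumulates
// only if that source lane lies inside the active range (assumption: this is
// how the shuffle's validity flag behaves). Returns the value of lane 0.
int simulateSubgroupReduce(std::array<int, kSubgroupSize> lanes,
                           int activeWidth) {
  for (int offset = 1; offset < kSubgroupSize; offset <<= 1) {
    // All lanes shuffle simultaneously, so read from the old values and
    // write into a copy.
    std::array<int, kSubgroupSize> next = lanes;
    for (int lane = 0; lane < kSubgroupSize; ++lane) {
      int source = lane ^ offset;
      if (source < activeWidth)
        next[lane] = lanes[lane] + lanes[source];
    }
    lanes = next;
  }
  // Only lane 0 is guaranteed to hold the reduction over the active lanes.
  return lanes[0];
}
```

With lane i holding i + 1, an `activeWidth` of 5 leaves 15 in lane 0, and any `activeWidth` of 32 or more leaves 528. Lanes outside the active range may hold partial sums, which is consistent with the doc comment's statement that only the first lane returns the result.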

void mlir::populateGpuRewritePatterns(MLIRContext *context,		void mlir::populateGpuRewritePatterns(MLIRContext *context,
OwningRewritePatternList &patterns) {		OwningRewritePatternList &patterns) {
patterns.insert<GpuAllReduceConversion>(context);		patterns.insert<GpuAllReduceConversion>(context);
}		}
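
Editor's note: the review thread on `matchAndRewrite` explains that the walk is interrupted after every rewrite because the rewrite invalidates the walk's iterators, and the `while` loop restarts it until nothing is left to replace. The following is a minimal sketch of that restart-after-mutation pattern in plain C++, not MLIR's API. `Body`, `walkWasInterrupted`, `lowerAllReduces`, and the op names are hypothetical stand-ins.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class WalkResult { Advance, Interrupt };

// Hypothetical stand-in for a function body: a flat list of op names.
using Body = std::vector<std::string>;

// Visits ops in order and stops as soon as the callback interrupts.
// Returns true if the walk was interrupted.
template <typename Callback>
bool walkWasInterrupted(Body &body, Callback &&callback) {
  for (std::size_t i = 0; i < body.size(); ++i)
    if (callback(body, i) == WalkResult::Interrupt)
      return true;
  return false;
}

// Replaces every "all_reduce" with a three-op expansion and returns the
// number of rewrites performed.
int lowerAllReduces(Body &body) {
  int rewrites = 0;
  auto callback = [&](Body &ops, std::size_t i) {
    if (ops[i] != "all_reduce")
      return WalkResult::Advance;
    // The rewrite changes the container being walked, so the walk must not
    // continue from its current position.
    const std::vector<std::string> tail = {"accumulate", "barrier"};
    ops[i] = "shuffle";
    ops.insert(ops.begin() + static_cast<std::ptrdiff_t>(i) + 1,
               tail.begin(), tail.end());
    ++rewrites;
    return WalkResult::Interrupt;
  };
  // Restart the walk from scratch until a full pass finds nothing to rewrite.
  while (walkWasInterrupted(body, callback)) {
  }
  return rewrites;
}
```

Each restart rescans from the beginning, so the cost grows with the number of rewrites times the body size. That is a reasonable trade for not having to reason about which positions remain valid after a rewrite.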

Revision Contents

Diff 236553

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
