This is an archive of the discontinued LLVM Phabricator instance.

lebedev.ri retitled this revision from Add in-dialect lowering of gpu.all_reduce. to [mlir] Add in-dialect lowering of gpu.all_reduce.. Jan 3 2020, 1:00 AM

csigg added a reviewer: ftynse. Jan 3 2020, 1:02 AM

csigg marked an inline comment as done.

csigg added inline comments.

mlir/test/Dialect/GPU/all-reduce.mlir

8–177

These CHECKs were generated from the output with:

sed -r \
-e 's|\t+//.*||' \
-e 's|%([a-z_0-9]+) = |%[[v\1:[a-z_0-9]+]] = |' \
-e 's|\(%([a-z_0-9]+): ([a-z_0-9]+)\):|(%[[v\1:[a-z_0-9]+]]: \2):|' \
-e 's|%([a-z_0-9]+)|%[[v\1]]|g' \
-e 's|bb([0-9]+)|bb[[#b\1]]|g' \
-e 's|^ |    // CHECK:|'

and manual edits from there.
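The substitutions in the sed command above can be sketched in C++ with std::regex. This is only an illustration: `toCheckLine` is a hypothetical helper that is not part of the patch, it skips the block-argument rule, and it prepends the CHECK directive instead of replacing the leading indentation.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not part of the patch): turns one line of printed IR
// into a FileCheck pattern, mirroring most of the sed substitutions above.
std::string toCheckLine(std::string line) {
  assert(line.find('\n') == std::string::npos && "expects a single line");
  const auto firstOnly = std::regex_constants::format_first_only;

  // Drop a trailing tab-separated comment.
  line = std::regex_replace(line, std::regex(R"(\t+//.*)"), "", firstOnly);
  // The first "%name = " on the line is a definition: capture the name in a
  // FileCheck variable.
  line = std::regex_replace(line, std::regex(R"(%([a-z_0-9]+) = )"),
                            "%[[v$1:[a-z_0-9]+]] = ", firstOnly);
  // Every remaining "%name" is a use: refer back to the captured variable.
  line = std::regex_replace(line, std::regex(R"(%([a-z_0-9]+))"), "%[[v$1]]");
  // Block labels are rewritten the same way as in the sed command.
  line = std::regex_replace(line, std::regex(R"(bb([0-9]+))"), "bb[[#b$1]]");
  return "// CHECK: " + line;
}
```

The definition rule has to run before the use rule; once a name is wrapped in `[[...]]`, the use rule no longer matches it because `[` is not in the character class.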

merge_guards_bot added a subscriber: merge_guards_bot. Jan 3 2020, 1:03 AM

Unit tests: pass. 61127 tests passed, 0 failed and 728 were skipped.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: pass.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B43216: Diff 236000! Jan 3 2020, 1:03 AM

Wrong clang-tidy checks are annoying here, will make another pass later.

mlir/include/mlir/Dialect/GPU/GPUOps.td
167	Style nit: add whitespace around "+"
mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
30	Nit: `explicit` is unnecessary for multi-argument constructors.
40	Nit: can we use "neutral" terminology: shared memory -> workgroup memory, block -> workgroup ?
48	Should this be `store %warp_reduce, %workgroup_buffer[%warp_id]`? Same below for loads and stores
53	Should this be `cond_br %is_valid_warp` instead?
68	Big and mechanical change: `Value` is now value-based, please use it without pointers.
82	Style nit: add a comment `/*bitwidth=*/` for `32`
89	Not sure I follow here. `numThreadsWithSmallerWarpId` is the equivalent of `warp_id * warp_size`, or `floor(linear_thread_id, warp_size) * warp_size`. Subtracting that from `blockSize` will give you the number of threads in all warps starting from the current warp. From skimming through the consumer (createWarpReduce), it looks like what it expects is the number of threads in the _current_ warp.
105	This looks like you want EDSC :) + @nicolasvasilache
139	Maybe you can store the location and pass it as first argument? It's the same for all operations you create above.
151	Nit: let's have a named constant for the address space
202	Why only `F32`? Could you have `isa<FloatType>()` instead?
244	Is this lambda worse it?
mlir/test/Dialect/GPU/all-reduce.mlir
8	Is there a reason why you can't use `.*` as a regexp for names? Seeing the full with ranges makes it hard to read the test. I think we also support capital letters in the names.
8–177	We have https://github.com/llvm/llvm-project/blob/master/mlir/utils/generate-test-checks.py, have you tried it?
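The scheme these comments refer to (reduce within each subgroup, store one partial result per subgroup in workgroup memory, then let the first subgroup reduce the partial results) can be simulated on the CPU. The sketch below is an illustration only: `simulateAllReduce` is a hypothetical name, a subgroup size of 32 is assumed, and plain loops stand in for the shuffle-based subgroup reduction.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Assumed subgroup (warp) width for this sketch.
constexpr int kSubgroupSize = 32;

// CPU simulation of the two-stage scheme: values[i] is the operand of
// invocation i, accumulate plays the role of the accumulator.
int simulateAllReduce(const std::vector<int> &values,
                      const std::function<int(int, int)> &accumulate) {
  assert(!values.empty() && "workgroup needs at least one invocation");
  const int workgroupSize = static_cast<int>(values.size());
  const int numSubgroups = (workgroupSize + kSubgroupSize - 1) / kSubgroupSize;
  assert(numSubgroups <= kSubgroupSize && "partials must fit in one subgroup");

  // Stage 1: every subgroup reduces its active invocations; its first lane
  // stores the partial result at workgroupBuffer[subgroupId].
  std::vector<int> workgroupBuffer(numSubgroups);
  for (int subgroupId = 0; subgroupId < numSubgroups; ++subgroupId) {
    const int begin = subgroupId * kSubgroupSize;
    const int end = std::min(begin + kSubgroupSize, workgroupSize);
    int partial = values[begin];
    for (int i = begin + 1; i < end; ++i)
      partial = accumulate(partial, values[i]);
    workgroupBuffer[subgroupId] = partial;
  }

  // (barrier) Stage 2: the first numSubgroups invocations load the partial
  // results, reduce them, and the result is stored at index zero.
  int result = workgroupBuffer[0];
  for (int i = 1; i < numSubgroups; ++i)
    result = accumulate(result, workgroupBuffer[i]);
  workgroupBuffer[0] = result;

  // (barrier) Every invocation loads the final result from index zero.
  return workgroupBuffer[0];
}
```

With 100 invocations this uses four subgroups, the last of which has only four active lanes.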

nicolasvasilache added a reviewer: nicolasvasilache. Jan 6 2020, 10:13 AM

Update reflecting review comments from ftynse.

Thanks a lot for the review, Alex!

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
30	Removed. Nit the nit: `explicit` does prevent copy-initialization from initializer list.
89	Correct. The activeWidth does not need to be clamped to subgroup width though. I added two comments.
151	Punted by just adding a local variable.
244	Haha, I like your comment. Very subtle ;-)
mlir/test/Dialect/GPU/all-reduce.mlir
8	Replaced with generate-test-checks.py result.
8–177	Works very well, I wish I knew about this (well, Mehdi mentioned it but I couldn't find it).
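The arithmetic behind the comment on line 89 (activeWidth) can be checked with a small worked example. The sketch assumes a subgroup size of 32 (a power of two, so the lane id can be taken with a mask); `SubgroupInfo` and `computeSubgroupInfo` are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Assumed subgroup width; must be a power of two for the mask below.
constexpr int kSubgroupSize = 32;

struct SubgroupInfo {
  int laneId;                          // invocation id within the subgroup
  int numThreadsWithSmallerSubgroupId; // invocations in earlier subgroups
  int activeWidth;                     // invocations from this subgroup on
};

// Mirrors the index arithmetic discussed in the review.
SubgroupInfo computeSubgroupInfo(int invocationIdx, int workgroupSize) {
  assert(0 <= invocationIdx && invocationIdx < workgroupSize);
  const int subgroupMask = kSubgroupSize - 1;
  const int laneId = invocationIdx & subgroupMask;
  const int numThreadsWithSmallerSubgroupId = invocationIdx - laneId;
  // Not clamped to kSubgroupSize: any value >= kSubgroupSize simply means
  // that the whole subgroup is active.
  const int activeWidth = workgroupSize - numThreadsWithSmallerSubgroupId;
  return {laneId, numThreadsWithSmallerSubgroupId, activeWidth};
}
```

For a workgroup of 100 invocations, invocation 70 gets lane id 6 and an active width of 36 (a full subgroup), while invocation 97 gets lane id 1 and an active width of 4 (the partially filled last subgroup).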

Unit tests: pass. 61127 tests passed, 0 failed and 728 were skipped.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B43364: Diff 236427! Jan 6 2020, 11:55 AM

rriddle added inline comments. Jan 6 2020, 12:08 PM

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
163	`auto op = reduceOp.op()` to avoid recomputing below.
301	Remove `llvm::` from most of the things in this file. They are re-exported in the mlir namespace already.
353	When does this ever fail?

Addressing rriddle's review comments.

Unit tests: unknown.

clang-tidy: unknown.

clang-format: unknown.

Build artifacts: diff.json, console-log.txt

Harbormaster failed remote builds in B43416: Diff 236553! Jan 7 2020, 5:05 AM

csigg marked 4 inline comments as done. Jan 7 2020, 5:08 AM

csigg added inline comments.

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
353	See two lines above: every time the callback is invoked. This makes sure that all occurrences of gpu.all_reduce in the same gpu.function are replaced.

herhut added a subscriber: herhut. Jan 7 2020, 7:38 AM

I somehow only see a subset of changes now...

Updating again, hopefully with all changes this time.

Unit tests: pass. 61127 tests passed, 0 failed and 728 were skipped.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B43489: Diff 236769! Jan 7 2020, 11:47 PM

Almost there, only minor things necessary to improve understanding.

The approach you took for building code is very similar to our motivation for declarative builders (aka EDSC), maybe it's time to reconsider those again.

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
151	I'll add that later anyway :)
269	I'd add an assertion that `predicatedOpsFactory()` does not return any values that it might expect to be passed to the continuation.
355	Could you please add a comment on why this approach is necessary? I understand that you need to operate on the GPUFuncOp level because you modify the GPUFuncOp itself, which would be incorrect from a nested operation (AllReduceOp). I don't understand why you need to interrupt the walk after each rewrite. Is it because of some state invalidation?

This revision now requires changes to proceed. Jan 8 2020, 6:07 AM

Address ftynse's review comments.

The approach you took for building code is very similar to our motivation for declarative builders (aka EDSC), maybe it's time to reconsider those again.

Yes, I'm happy to change this to EDSC as a follow-up. For now I think it is easier to keep it similar to the existing lowering to NVVM.

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
151	Thanks!
355	Correct, the walk iterators get invalidated from the replace. Comment added.
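The "rewrite one occurrence, interrupt the walk, start over" pattern discussed for line 355 can be shown on a toy container. This is a sketch under stated assumptions, not the MLIR code: a std::vector of op names stands in for the function body, and `rewriteFirstAllReduce` and `rewriteAllReduces` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a function body: a flat list of op names.
using OpList = std::vector<std::string>;

// Walks the list and rewrites the first "gpu.all_reduce" it finds. Inserting
// into the vector invalidates the iterators of the walk, so the walk is
// interrupted right after the rewrite. Returns true if something was rewritten.
bool rewriteFirstAllReduce(OpList &ops) {
  static const OpList replacement = {"subgroup_reduce", "store", "gpu.barrier"};
  for (auto it = ops.begin(); it != ops.end(); ++it) {
    if (*it != "gpu.all_reduce")
      continue;
    *it = "load";
    ops.insert(it, replacement.begin(), replacement.end());
    return true; // `it` is invalid from here on; do not continue the walk.
  }
  return false;
}

// Restarts the walk until no occurrence is left, so that every
// gpu.all_reduce in the list gets replaced.
int rewriteAllReduces(OpList &ops) {
  int numRewrites = 0;
  while (rewriteFirstAllReduce(ops))
    ++numRewrites;
  assert(numRewrites >= 0);
  return numRewrites;
}
```

Restarting from the beginning costs extra scans, but it stays correct without having to reason about which iterators survive each rewrite.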

Unit tests: unknown.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt

Harbormaster failed remote builds in B43673: Diff 237293! Jan 10 2020, 5:47 AM

What is the status re EDSC discussion?
Were the pointers I sent offline enough to give a good picture / do you see how to followup on this to use (and maybe extend) EDSCs?

Herald added a subscriber: aartbik. Jan 10 2020, 4:07 PM

In D72129#1815174, @nicolasvasilache wrote:

What is the status re EDSC discussion?
Were the pointers I sent offline enough to give a good picture / do you see how to followup on this to use (and maybe extend) EDSCs?

Yes, from scanning the doc and code it looks like this shouldn't be too difficult. But I haven't actually tried.

Rebase.

Herald added a subscriber: liufengdb. Jan 14 2020, 12:44 AM

csigg added a reviewer: herhut. Jan 14 2020, 12:46 AM

Unit tests: unknown.

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt

Harbormaster failed remote builds in B43902: Diff 237871! Jan 14 2020, 12:55 AM

Fix build error after Value* -> Value change.

Apply clang-format.

Unit tests: fail. 61801 tests passed, 1 failed and 781 were skipped.

failed: MLIR.Dialect/GPU/all-reduce.mlir

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B43947: Diff 237993! Jan 14 2020, 8:50 AM

Unit tests: fail. 61801 tests passed, 1 failed and 781 were skipped.

failed: MLIR.Dialect/GPU/all-reduce.mlir

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B43948: Diff 237994! Jan 14 2020, 8:50 AM

What is the status re EDSC discussion?
Were the pointers I sent offline enough to give a good picture / do you see how to followup on this to use (and maybe extend) EDSCs?

Could you keep me in the loop on that? I have an even simpler proposal in mind that could reconcile EDSC with imperative builders.

I pointed Christian to the EDSC doc (https://mlir.llvm.org/docs/EDSC/) and the builder-api-test but otherwise nothing else was discussed.
I didn't look into the details of what would be involved in making the transition.
Can you share what the proposal is that would be simpler than just using the declarative builders?

In D72129#1820630, @ftynse wrote:

What is the status re EDSC discussion?
Were the pointers I sent offline enough to give a good picture / do you see how to followup on this to use (and maybe extend) EDSCs?

Could you keep me in the loop on that? I have an even simpler proposal in mind that could reconcile EDSC with imperative builders.

I'm interested to see some movement on this, at this point EDSC isn't totally endorsed by the whole team, and the uncontrolled use in the codebase isn't a great situation.

I have reviewed this before (the basic approach has not changed). Thanks for adding more comments and tests.

LGTM.

Regarding the EDSC vs. use of locally grown alternative in this patch: I try to avoid doing things like the template for adding location. While it saves some characters, it makes code look unfamiliar, which is also a cost. Adding helpers to emit conditionals etc. on the other hand reduces repetition, so I see that as a benefit. Still, I am fine with landing this as it seems to be the plan to evolve it to some EDSC like thing once there is an endorsed one.
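The "template for adding location" mentioned here is a thin wrapper that forwards a stored location as the first argument of every create call. A minimal sketch of the idea follows; `Location`, `ConstantOp`, `AddOp`, `ToyRewriter`, and `ToyLowering` are toy stand-ins invented for illustration and are not MLIR types.

```cpp
#include <cassert>
#include <utility>

// Toy stand-ins, not MLIR classes.
struct Location { int line; };
struct ConstantOp { Location loc; int value; };
struct AddOp { Location loc; int lhs, rhs; };

// Like a rewriter, expects the location as the first argument of create().
struct ToyRewriter {
  int numCreated = 0;

  template <typename T, typename... Args>
  T create(Location loc, Args &&...args) {
    assert(loc.line > 0 && "toy convention: valid locations are positive");
    ++numCreated;
    return T{loc, std::forward<Args>(args)...};
  }
};

// Stores the location once and supplies it on every create() call.
struct ToyLowering {
  ToyRewriter &rewriter;
  Location loc;

  template <typename T, typename... Args>
  T create(Args &&...args) {
    return rewriter.create<T>(loc, std::forward<Args>(args)...);
  }
};
```

A call such as `lowering.create<AddOp>(lhs, rhs)` is shorter than the rewriter form, but, as noted above, a reader has to know that the location is supplied implicitly.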

@mehdi_amini @nicolasvasilache @herhut Let's take some time and discuss builder APIs outside this diff (also involving @rriddle). My basic observations are that (1) writing structured IR, as in "with nested regions", looks unnecessarily complicated with builders, arguments are the same as those against goto-style programming; (2) a lot of IR construction internally happens in rewrite patterns, where location almost always remains the same, that of the matched operation root; (3) current EDSC APIs are contentious partly because it is unclear when reading the code when the function call creates the IR vs. when it's just a function call.

@csigg I'm fine accepting this, but the test is currently broken. Please fix and we'll be ready to land.

@herhut @ftynse @mehdi_amini sounds good, it's high time to rediscuss this and to make MLIR play nicely with metaprogramming in a way that feels comfortable to everyone.

Do not wrap temporaries in ValueRange.

Unit tests: unknown.

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt

Harbormaster failed remote builds in B44300: Diff 238844! Jan 17 2020, 12:18 PM

ftynse accepted this revision. Jan 20 2020, 1:16 AM

This revision is now accepted and ready to land. Jan 20 2020, 1:16 AM

clang-format.

Closed by commit rG8b2eb7c494b2: [mlir] Add in-dialect lowering of gpu.all_reduce. (authored by csigg). Jan 20 2020, 4:44 AM

This revision was automatically updated to reflect the committed changes.

Unit tests: pass. 62015 tests passed, 0 failed and 783 were skipped.

clang-tidy: unknown.

clang-format: pass.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster completed remote builds in B44390: Diff 239074. Jan 20 2020, 4:54 AM

Revision Contents

Path	Size
mlir/include/mlir/Dialect/GPU/GPUOps.td	10 lines
mlir/include/mlir/Dialect/GPU/Passes.h	6 lines
mlir/include/mlir/IR/Block.h	3 lines
mlir/lib/Dialect/GPU/CMakeLists.txt	1 line
mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp	376 lines
mlir/lib/IR/Block.cpp	7 lines
mlir/test/Dialect/GPU/all-reduce.mlir	183 lines
mlir/test/lib/Transforms/CMakeLists.txt	2 lines
mlir/test/lib/Transforms/TestAllReduceLowering.cpp	32 lines

Diff 237994

mlir/include/mlir/Dialect/GPU/GPUOps.td

let extraClassDeclaration = [{
[...]
     /// the workgroup memory
     ArrayRef<BlockArgument> getWorkgroupAttributions() {
       auto begin =
           std::next(getBody().front().args_begin(), getType().getNumInputs());
       auto end = std::next(begin, getNumWorkgroupAttributions());
       return {begin, end};
     }

+    // Adds a new block argument that corresponds to buffers located in
+    // workgroup memory.
+    BlockArgument addWorkgroupAttribution(Type type) {
+      auto attrName = getNumWorkgroupAttributionsAttrName();
+      auto attr = getAttrOfType<IntegerAttr>(attrName);
+      setAttr(attrName, IntegerAttr::get(attr.getType(), attr.getValue() + 1));
ftynse: Style nit: add whitespace around "+"
+      return getBody().front().insertArgument(
+          getType().getNumInputs() + attr.getInt(), type);
+    }

     /// Returns a list of block arguments that correspond to buffers located in
     /// the private memory.
     ArrayRef<BlockArgument> getPrivateAttributions() {
       auto begin =
           std::next(getBody().front().args_begin(),
                     getType().getNumInputs() + getNumWorkgroupAttributions());
       return {begin, getBody().front().args_end()};
     }
[...]

mlir/include/mlir/Dialect/GPU/Passes.h

[...]

 #ifndef MLIR_DIALECT_GPU_PASSES_H_
 #define MLIR_DIALECT_GPU_PASSES_H_

 #include <memory>

 namespace mlir {

+class MLIRContext;
 class ModuleOp;
 template <typename T> class OpPassBase;
+class OwningRewritePatternList;

 std::unique_ptr<OpPassBase<ModuleOp>> createGpuKernelOutliningPass();

+/// Collect a set of patterns to rewrite ops within the GPU dialect.
+void populateGpuRewritePatterns(MLIRContext *context,
+                                OwningRewritePatternList &patterns);

 } // namespace mlir

 #endif // MLIR_DIALECT_GPU_PASSES_H_

mlir/include/mlir/IR/Block.h

public:
[...]
   /// Insert one value to the position in the argument list indicated by the
   /// given iterator. The existing arguments are shifted. The block is expected
   /// not to have predecessors.
   BlockArgument insertArgument(args_iterator it, Type type);

   /// Add one argument to the argument list for each type specified in the list.
   iterator_range<args_iterator> addArguments(ArrayRef<Type> types);

+  // Add one value to the argument list at the specified position.
+  BlockArgument insertArgument(unsigned index, Type type);

   /// Erase the argument at 'index' and remove it from the argument list. If
   /// 'updatePredTerms' is set to true, this argument is also removed from the
   /// terminators of each predecessor to this block.
   void eraseArgument(unsigned index, bool updatePredTerms = true);

   unsigned getNumArguments() { return arguments.size(); }
   BlockArgument getArgument(unsigned i) { return arguments[i]; }

[...]

mlir/lib/Dialect/GPU/CMakeLists.txt

 add_llvm_library(MLIRGPU
   IR/GPUDialect.cpp
   IR/DialectRegistration.cpp
+  Transforms/AllReduceLowering.cpp
   Transforms/KernelOutlining.cpp
   Transforms/MemoryPromotion.cpp

   ADDITIONAL_HEADER_DIRS
   ${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/GPU
   )
 add_dependencies(MLIRGPU
   MLIRGPUOpsIncGen
[...]

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp

This file was added.

				//===- AllReduceLowering.cpp - Implementation of all-reduce lowering ------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This file implements in-dialect lowering of the all-reduce op to a block of
				// simpler instructions.
				//
				//===----------------------------------------------------------------------===//

				#include "mlir/Dialect/GPU/GPUDialect.h"
				#include "mlir/Dialect/GPU/Passes.h"
				#include "mlir/Dialect/StandardOps/Ops.h"
				#include "mlir/IR/BlockAndValueMapping.h"
				#include "mlir/IR/Builders.h"
				#include "mlir/IR/PatternMatch.h"
				#include "mlir/Pass/Pass.h"

				using namespace mlir;

				namespace {

				struct GpuAllReduceRewriter {
				using AccumulatorFactory = std::function<Value(Value, Value)>;

				GpuAllReduceRewriter(gpu::GPUFuncOp funcOp_, gpu::AllReduceOp reduceOp_,
				PatternRewriter &rewriter_)
				ftynse: Nit: `explicit` is unnecessary for multi-argument constructors.
				csigg: Removed. Nit the nit: `explicit` does prevent copy-initialization from initializer list.
				: funcOp(funcOp_), reduceOp(reduceOp_), rewriter(rewriter_),
				loc(reduceOp.getLoc()), valueType(reduceOp.value().getType()),
				indexType(IndexType::get(reduceOp.getContext())),
				int32Type(IntegerType::get(/*width=*/32, reduceOp.getContext())) {}

				/// Creates an all_reduce across the workgroup.
				///
				/// First reduce the elements within a subgroup. The first invocation of each
				/// subgroup writes the intermediate result to workgroup memory. After
				/// synchronizing the workgroup, the first subgroup reduces the values from
				ftynse: Nit: can we use "neutral" terminology: shared memory -> workgroup memory, block -> workgroup ?
				/// workgroup memory. The result is broadcasted to all invocations through
				/// workgroup memory.
				///
				/// %subgroup_reduce = `createSubgroupReduce(%operand)`
				/// cond_br %is_first_lane, ^then1, ^continue1
				/// ^then1:
				/// store %subgroup_reduce, %workgroup_buffer[%subgroup_id]
				/// br ^continue1
				ftynse: Should this be `store %warp_reduce, %workgroup_buffer[%warp_id]`? Same below for loads and stores
				/// ^continue1:
				/// gpu.barrier
				/// %is_valid_subgroup = cmpi "slt" %invocation_idx, %num_subgroups
				/// cond_br %is_valid_subgroup, ^then2, ^continue2
				/// ^then2:
				ftynse: Should this be `cond_br %is_valid_warp` instead?
				/// %partial_reduce = load %workgroup_buffer[%invocation_idx]
				/// %all_reduce = `createSubgroupReduce(%partial_reduce)`
				/// store %all_reduce, %workgroup_buffer[%zero]
				/// llvm.br ^continue2
				/// ^continue2:
				/// gpu.barrier
				/// %result = load %workgroup_buffer[%zero]
				/// return %result
				///
				void rewrite() {
				rewriter.setInsertionPoint(reduceOp);

				// Compute linear invocation index and workgroup size.
				Value dimX = getDimOp<gpu::BlockDimOp>("x");
				Value dimY = getDimOp<gpu::BlockDimOp>("y");
				ftynse: Big and mechanical change: `Value` is now value-based, please use it without pointers.
				Value dimZ = getDimOp<gpu::BlockDimOp>("z");
				Value tidX = getDimOp<gpu::ThreadIdOp>("x");
				Value tidY = getDimOp<gpu::ThreadIdOp>("y");
				Value tidZ = getDimOp<gpu::ThreadIdOp>("z");
				Value tmp1 = create<MulIOp>(int32Type, tidZ, dimY);
				Value tmp2 = create<AddIOp>(int32Type, tmp1, tidY);
				Value tmp3 = create<MulIOp>(int32Type, tmp2, dimX);
				Value tmp4 = create<MulIOp>(int32Type, dimX, dimY);
				Value invocationIdx = create<AddIOp>(int32Type, tmp3, tidX);
				Value workgroupSize = create<MulIOp>(int32Type, tmp4, dimZ);

				// Compute lane id (invocation id within the subgroup).
				Value subgroupMask = create<ConstantIntOp>(kSubgroupSize - 1, int32Type);
				Value laneId = create<AndOp>(invocationIdx, subgroupMask);
				ftynse: Style nit: add a comment `/*bitwidth=*/` for `32`
				Value isFirstLane = create<CmpIOp>(CmpIPredicate::eq, laneId,
				create<ConstantIntOp>(0, int32Type));

				Value numThreadsWithSmallerSubgroupId =
				create<SubIOp>(invocationIdx, laneId);
				// The number of active invocations starting from the current subgroup.
				// The consumers do not require the value to be clamped to the size of the
				ftynse: Not sure I follow here. `numThreadsWithSmallerWarpId` is the equivalent of `warp_id * warp_size`, or `floor(linear_thread_id, warp_size) * warp_size`. Subtracting that from `blockSize` will give you the number of threads in all warps starting from the current warp. From skimming through the consumer (createWarpReduce), it looks like what it expects is the number of threads in the _current_ warp.
				csigg: Correct. The activeWidth does not need to be clamped to subgroup width though. I added two comments.
				// subgroup.
				Value activeWidth =
				create<SubIOp>(workgroupSize, numThreadsWithSmallerSubgroupId);

				// Create factory for op which accumulates two values.
				AccumulatorFactory accumFactory = getFactory();
				assert(accumFactory && "failed to create accumulator factory");

				// Reduce elements within each subgroup to produce the intermediate results.
				Value subgroupReduce = createSubgroupReduce(activeWidth, laneId,
				reduceOp.value(), accumFactory);

				// Add workgroup buffer to parent function for intermediate result.
				Value buffer = createWorkgroupBuffer();

				// Write the intermediate results to workgroup memory, using the first lane
				ftynse: This looks like you want EDSC :) + @nicolasvasilache
				// of each subgroup.
				createPredicatedBlock(isFirstLane, [&] {
				Value subgroupId = getDivideBySubgroupSize(invocationIdx);
				Value index = create<IndexCastOp>(indexType, subgroupId);
				create<StoreOp>(subgroupReduce, buffer, index);
				});
				create<gpu::BarrierOp>();

				// Compute number of active subgroups.
				Value biasedBlockSize =
				create<AddIOp>(int32Type, workgroupSize, subgroupMask);
				Value numSubgroups = getDivideBySubgroupSize(biasedBlockSize);
				Value isValidSubgroup =
				create<CmpIOp>(CmpIPredicate::slt, invocationIdx, numSubgroups);

				// Use the first numSubgroups invocations to reduce the intermediate results
				// from workgroup memory. The final result is written to workgroup memory
				// again.
				Value zero = create<ConstantIndexOp>(0);
				createPredicatedBlock(isValidSubgroup, [&] {
				Value index = create<IndexCastOp>(indexType, invocationIdx);
				Value value = create<LoadOp>(valueType, buffer, index);
				Value result =
				createSubgroupReduce(numSubgroups, laneId, value, accumFactory);
				create<StoreOp>(result, buffer, zero);
				});

				// Synchronize workgroup and load result from workgroup memory.
				create<gpu::BarrierOp>();
				Value result = create<LoadOp>(valueType, buffer, zero);

				rewriter.replaceOp(reduceOp, result);
				}

				ftynse: Maybe you can store the location and pass it as first argument? It's the same for all operations you create above.
				private:
				// Shortcut to create an op from rewriter using loc as the first argument.
				template <typename T, typename... Args>
				Lint: Pre-merge checks: clang-format: please reformat the code
				- template <typename T, typename... Args>
				- T create(Args... args) {
				+ template <typename T, typename... Args> T create(Args... args) {
				T create(Args... args) {
				return rewriter.create<T>(loc, std::forward<Args>(args)...);
				}

				// Creates dimension op of type T, with the result casted to int32.
				template <typename T>
				Lint: Pre-merge checks: clang-format: please reformat the code
				- template <typename T>
				- Value getDimOp(StringRef dimension) {
				+ template <typename T> Value getDimOp(StringRef dimension) {
				Value getDimOp(StringRef dimension) {
				Value dim = create<T>(indexType, rewriter.getStringAttr(dimension));
				return create<IndexCastOp>(int32Type, dim);
				ftynse: Nit: let's have a named constant for the address space
				csigg: Punted by just adding a local variable.
				ftynse: I'll add that later anyway :)
				csigg: Thanks!
				}

				/// Adds type to funcOp's workgroup attributions.
				Value createWorkgroupBuffer() {
				int workgroupMemoryAddressSpace = 3;
				auto bufferType =
				MemRefType::get({kSubgroupSize}, valueType, ArrayRef<AffineMap>{},
				workgroupMemoryAddressSpace);
				return funcOp.addWorkgroupAttribution(bufferType);
				}

				/// Returns an accumulator factory using either the op attribute or the body
				rriddle: `auto op = reduceOp.op()` to avoid recomputing below.
				/// region.
				AccumulatorFactory getFactory() {
				auto &body = reduceOp.body();
				if (!body.empty())
				return getFactory(body);
				auto opAttr = reduceOp.op();
				if (opAttr)
				return getFactory(*opAttr);
				return AccumulatorFactory();
				}

				/// Returns an accumulator factory that clones the body. The body's entry
				/// block is expected to have 2 arguments. The gpu.yield returns the
				/// accumulated value of the same type.
				AccumulatorFactory getFactory(Region &body) {
				return AccumulatorFactory([&](Value lhs, Value rhs) {
				Block *block = rewriter.getInsertionBlock();
				Block *split = rewriter.splitBlock(block, rewriter.getInsertionPoint());

				// Insert accumulator body between split block.
				BlockAndValueMapping mapping;
				mapping.map(body.front().getArgument(0), lhs);
				mapping.map(body.front().getArgument(1), rhs);
				rewriter.cloneRegionBefore(body, *split->getParent(),
				split->getIterator(), mapping);

				// Add branch before inserted body, into body.
				block = block->getNextNode();
				create<BranchOp>(block, ValueRange());

				// Replace all gpu.yield ops with branch out of body.
				for (; block != split; block = block->getNextNode()) {
				Operation *terminator = block->getTerminator();
				if (!isa<gpu::YieldOp>(terminator))
				continue;
				rewriter.setInsertionPointToEnd(block);
				rewriter.replaceOpWithNewOp<BranchOp>(
				terminator, split, ValueRange(terminator->getOperand(0)));
				}
				ftynse: Why only `F32`? Could you have `isa<FloatType>()` instead?

				// Return accumulator result.
				rewriter.setInsertionPointToStart(split);
				return split->addArgument(lhs.getType());
				});
				}

				/// Returns an accumulator factory that creates an op specified by opName.
				AccumulatorFactory getFactory(StringRef opName) {
				bool isFloatingPoint = valueType.isa<FloatType>();
				if (opName == "add")
				return isFloatingPoint ? getFactory<AddFOp>() : getFactory<AddIOp>();
				if (opName == "mul")
				return isFloatingPoint ? getFactory<MulFOp>() : getFactory<MulIOp>();
				return AccumulatorFactory();
				}

				/// Returns an accumulator factory that creates an op of type T.
				template <typename T>
				Lint: Pre-merge checks: clang-format: please reformat the code
				- template <typename T>
				- AccumulatorFactory getFactory() {
				+ template <typename T> AccumulatorFactory getFactory() {
				AccumulatorFactory getFactory() {
				return [&](Value lhs, Value rhs) {
				return create<T>(lhs.getType(), lhs, rhs);
				};
				}

				/// Creates an if-block skeleton and calls the two factories to generate the
				/// ops in the `then` and `else` blocks.
				///
				/// llvm.cond_br %condition, ^then, ^else
				/// ^then:
				/// %then_operands = `thenOpsFactory()`
				/// llvm.br ^continue(%then_operands)
				/// ^else:
				/// %else_operands = `elseOpsFactory()`
				/// llvm.br ^continue(%else_operands)
				/// ^continue(%block_operands):
				///
				template <typename ThenOpsFactory, typename ElseOpsFactory>
				void createIf(Value condition, ThenOpsFactory &&thenOpsFactory,
				ElseOpsFactory &&elseOpsFactory) {
				Block *currentBlock = rewriter.getInsertionBlock();
				auto currentPoint = rewriter.getInsertionPoint();
				ftynse: Is this lambda worse it?
				csigg: Haha, I like your comment. Very subtle ;-)

				Block *thenBlock = rewriter.splitBlock(currentBlock, currentPoint);
				Block *elseBlock = rewriter.splitBlock(thenBlock, thenBlock->begin());
				Block *continueBlock = rewriter.splitBlock(elseBlock, elseBlock->begin());

				rewriter.setInsertionPointToEnd(currentBlock);
				create<CondBranchOp>(condition, thenBlock,
				/*trueOperands=*/ArrayRef<Value>(), elseBlock,
				/*falseOperands=*/ArrayRef<Value>());

				rewriter.setInsertionPointToStart(thenBlock);
				ValueRange thenOperands = thenOpsFactory();
				create<BranchOp>(continueBlock, thenOperands);

				rewriter.setInsertionPointToStart(elseBlock);
				ValueRange elseOperands = elseOpsFactory();
				create<BranchOp>(continueBlock, elseOperands);

				assert(thenOperands.size() == elseOperands.size());
				rewriter.setInsertionPointToStart(continueBlock);
				for (auto operand : thenOperands)
				continueBlock->addArgument(operand.getType());
				}

				/// Shortcut for createIf with empty else block and no block operands.
				ftynse: I'd add an assertion that `predicatedOpsFactory()` does not return any values that it might expect to be passed to the continuation.
				template <typename Factory>
				void createPredicatedBlock(Value condition, Factory &&predicatedOpsFactory) {
				static_assert(std::is_same<decltype(predicatedOpsFactory()), void>::value,
				"predicatedOpsFactory should not return any value");
				createIf(
				condition,
				[&] {
				predicatedOpsFactory();
				return ArrayRef<Value>();
				},
				[&] { return ArrayRef<Value>(); });
				}
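
The `static_assert` above rejects, at compile time, any factory whose call expression yields a value, which addresses the review request without a runtime check. A minimal sketch of the same idea outside MLIR, assuming a plain `bool` condition in place of the `Value` condition; `runPredicated` and `countPredicatedCalls` are hypothetical names used only for illustration:

```cpp
#include <type_traits>

// Runs `factory` only if `condition` holds. Passing a callable that
// returns anything other than void is a compile-time error.
template <typename Factory>
void runPredicated(bool condition, Factory &&factory) {
  static_assert(std::is_same<decltype(factory()), void>::value,
                "factory should not return any value");
  if (condition)
    factory();
}

// Invokes runPredicated once with a true and once with a false
// condition and reports how often the callable actually ran.
int countPredicatedCalls() {
  int calls = 0;
  runPredicated(true, [&] { ++calls; });
  runPredicated(false, [&] { ++calls; });
  return calls;
}
```

A call such as `runPredicated(true, [] { return 1; })` would fail to compile with the message given in the `static_assert`.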

				/// Creates a reduction across the first activeWidth lanes of a subgroup, or
				/// the entire subgroup if activeWidth is larger than the subgroup width.
				/// The first lane returns the result; the return values of all other lanes
				/// are undefined.
				Value createSubgroupReduce(Value activeWidth, Value laneId, Value operand,
				AccumulatorFactory &accumFactory) {
				Value subgroupSize = create<ConstantIntOp>(kSubgroupSize, int32Type);
				Value isPartialSubgroup =
				create<CmpIOp>(CmpIPredicate::slt, activeWidth, subgroupSize);
				SmallVector<Type, 2> shuffleType = {valueType, rewriter.getI1Type()};
				auto xorAttr = rewriter.getStringAttr("xor");

				createIf(
				isPartialSubgroup,
				// Generate reduction over a (potentially) partial subgroup.
				[&] {
				Value value = operand;
				// Repeatedly shuffle value from 'laneId ^ i' and accumulate if source
				// lane is within the active range. The accumulated value is available
				// in the first lane.
				rriddle: Remove `llvm::` from most of the things in this file. They are re-exported in the mlir namespace already.
				for (int i = 1; i < kSubgroupSize; i <<= 1) {
				Value offset = create<ConstantIntOp>(i, int32Type);
				auto shuffleOp = create<gpu::ShuffleOp>(shuffleType, value, offset,
				activeWidth, xorAttr);
				// Skip the accumulation if the shuffle op read from a lane outside
				// of the active range.
				createIf(
				shuffleOp.getResult(1),
				[&] {
				return SmallVector<Value, 1>{
				accumFactory(value, shuffleOp.getResult(0))};
				},
				[&] { return llvm::makeArrayRef(value); });
				value = rewriter.getInsertionBlock()->getArgument(0);
				}
				return SmallVector<Value, 1>{value};
				},
				// Generate a reduction over the entire subgroup. This is a
				// specialization of the above reduction with unconditional
				// accumulation.
				[&] {
				Value value = operand;
				for (int i = 1; i < kSubgroupSize; i <<= 1) {
				Value offset = create<ConstantIntOp>(i, int32Type);
				auto shuffleOp = create<gpu::ShuffleOp>(shuffleType, value, offset,
				subgroupSize, xorAttr);
				value = accumFactory(value, shuffleOp.getResult(0));
				}
				return SmallVector<Value, 1>{value};
				});
				return rewriter.getInsertionBlock()->getArgument(0);
				}
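
The xor-shuffle loop above can be checked on the host by modelling one subgroup as an array of lanes. The sketch below is a simulation under stated assumptions, not the lowering itself: it assumes a subgroup width of 32 and assumes that a shuffle read is valid exactly when the source lane `lane ^ i` is below `activeWidth`. `kLanes` and `simulateSubgroupReduce` are hypothetical names.

```cpp
#include <array>
#include <cassert>

constexpr int kLanes = 32; // assumed subgroup width

// Host-side model of the xor-shuffle reduction. lanes[l] is the operand
// held by lane l, activeWidth the number of participating lanes.
// Returns the value that lane 0 holds after the last step.
template <typename T, typename Accum>
T simulateSubgroupReduce(std::array<T, kLanes> lanes, int activeWidth,
                         Accum accum) {
  assert(activeWidth >= 1);
  for (int i = 1; i < kLanes; i <<= 1) {
    // All lanes shuffle in lock-step, so read from the previous state.
    std::array<T, kLanes> next = lanes;
    for (int lane = 0; lane < kLanes; ++lane) {
      int source = lane ^ i;
      // Skip the accumulation if the source lane is outside the
      // active range (assumed shuffle validity rule).
      if (source < activeWidth)
        next[lane] = accum(lanes[lane], lanes[source]);
    }
    lanes = next;
  }
  return lanes[0];
}
```

With lane values 1 through 32 and addition as the accumulator, lane 0 ends up with the sum of the first `activeWidth` values; when `activeWidth` is 32 or more every read is valid, which corresponds to the unconditional branch of the listing.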

				/// Returns value divided by the subgroup size (i.e. 32).
				Value getDivideBySubgroupSize(Value value) {
				Value subgroupSize = create<ConstantIntOp>(kSubgroupSize, int32Type);
				return create<SignedDivIOp>(int32Type, value, subgroupSize);
				}

				gpu::GPUFuncOp funcOp;
				gpu::AllReduceOp reduceOp;
				PatternRewriter &rewriter;

				Location loc;
				Type valueType;
				Type indexType;
				Type int32Type;

				static constexpr int kSubgroupSize = 32;
				};

				struct GpuAllReduceConversion : public RewritePattern {
				rriddle: When does this ever fail?
				csigg: See two lines above: every time the callback is invoked. This makes sure that all occurrences of gpu.all_reduce in the same gpu.function are replaced.
				explicit GpuAllReduceConversion(MLIRContext *context)
				: RewritePattern(gpu::GPUFuncOp::getOperationName(), 1, context) {}
				ftynse: Could you please add a comment on why this approach is necessary? I understand that you need to operate on the GPUFuncOp level because you modify the GPUFuncOp itself, which would be incorrect from a nested operation (AllReduceOp). I don't understand why you need to interrupt the walk after each rewrite. Is it because of some state invalidation?
				csigg: Correct, the walk iterators get invalidated from the replace. Comment added.

				PatternMatchResult matchAndRewrite(Operation *op,
				PatternRewriter &rewriter) const override {
				auto funcOp = cast<gpu::GPUFuncOp>(op);
				auto callback = [&](gpu::AllReduceOp reduceOp) {
				GpuAllReduceRewriter(funcOp, reduceOp, rewriter).rewrite();
				// Performing a rewrite invalidates the walk iterator. Report interrupt
				// so that we can start a new walk until all all_reduce ops are replaced.
				return WalkResult::interrupt();
				};
				while (funcOp.walk(callback).wasInterrupted()) {
				}
				return matchSuccess();
				}
				};
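
The interrupt-and-restart loop discussed in the comments above can be illustrated without MLIR. The sketch below is a simplified model under stated assumptions: a function body is a `std::vector` of op names, and rewriting one op replaces it with several others, which invalidates the positions of an ongoing traversal. `WalkResult`, `walkWasInterrupted` and `lowerAllReduces` are hypothetical stand-ins, not the MLIR API.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class WalkResult { Advance, Interrupt };

// Visits ops in order and stops at the first callback that asks for an
// interrupt. Returns true if the walk was interrupted.
template <typename Callback>
bool walkWasInterrupted(std::vector<std::string> &ops, Callback callback) {
  for (std::size_t i = 0; i < ops.size(); ++i)
    if (callback(ops, i) == WalkResult::Interrupt)
      return true;
  return false;
}

// Replaces every "all_reduce" entry, restarting the walk after each
// rewrite so that no stale position is ever reused.
std::vector<std::string> lowerAllReduces(std::vector<std::string> ops) {
  const std::vector<std::string> replacement = {"shuffle", "accumulate",
                                                "barrier"};
  auto callback = [&](std::vector<std::string> &body, std::size_t index) {
    if (body[index] != "all_reduce")
      return WalkResult::Advance;
    // The rewrite changes the container, so the walk must not continue.
    body.erase(body.begin() + index);
    body.insert(body.begin() + index, replacement.begin(),
                replacement.end());
    return WalkResult::Interrupt;
  };
  while (walkWasInterrupted(ops, callback)) {
  }
  return ops;
}
```

The loop terminates because each restart removes one `all_reduce` entry and the replacement never introduces a new one; the final walk finishes without an interrupt.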
				} // namespace

				void mlir::populateGpuRewritePatterns(MLIRContext *context,
				OwningRewritePatternList &patterns) {
				patterns.insert<GpuAllReduceConversion>(context);
				}

mlir/lib/IR/Block.cpp

auto Block::addArguments(ArrayRef<Type> types)
arguments.reserve(arguments.size() + types.size());
auto initialSize = arguments.size();
for (auto type : types) {
addArgument(type);
}
return {arguments.data() + initialSize, arguments.data() + arguments.size()};
}

		BlockArgument Block::insertArgument(unsigned index, Type type) {
		auto arg = BlockArgument::create(type, this);
		assert(index <= arguments.size());
		arguments.insert(arguments.begin() + index, arg);
		return arg;
		}

void Block::eraseArgument(unsigned index, bool updatePredTerms) {
assert(index < arguments.size());

// Delete the argument.
arguments[index].destroy();
arguments.erase(arguments.begin() + index);

// If we aren't updating predecessors, there is nothing left to do.

mlir/test/Dialect/GPU/all-reduce.mlir

This file was added.

				// RUN: mlir-opt -test-all-reduce-lowering %s | FileCheck %s

				// NOTE: Assertions have been autogenerated by utils/generate-test-checks.py
				// CHECK: module @kernels attributes {gpu.kernel_module} {
				module @kernels attributes {gpu.kernel_module} {

				// CHECK-LABEL: gpu.func @kernel(
				// CHECK-SAME: [[VAL_0:%.*]]: f32) workgroup([[VAL_1:%.*]] : memref<32xf32, 3>) kernel {
				ftynse: Is there a reason why you can't use `.*` as a regexp for names? Seeing the full with ranges makes it hard to read the test. I think we also support capital letters in the names.
				csigg: Replaced with generate-test-checks.py result.
				gpu.func @kernel(%arg0 : f32) attributes { gpu.kernel } {
				// CHECK: [[VAL_2:%.*]] = constant 31 : i32
				// CHECK: [[VAL_3:%.*]] = constant 0 : i32
				// CHECK: [[VAL_4:%.*]] = constant 0 : index
				// CHECK: [[VAL_5:%.*]] = constant 32 : i32
				// CHECK: [[VAL_6:%.*]] = constant 1 : i32
				// CHECK: [[VAL_7:%.*]] = constant 2 : i32
				// CHECK: [[VAL_8:%.*]] = constant 4 : i32
				// CHECK: [[VAL_9:%.*]] = constant 8 : i32
				// CHECK: [[VAL_10:%.*]] = constant 16 : i32
				// CHECK: [[VAL_11:%.*]] = "gpu.block_dim"() {dimension = "x"} : () -> index
				// CHECK: [[VAL_12:%.*]] = index_cast [[VAL_11]] : index to i32
				// CHECK: [[VAL_13:%.*]] = "gpu.block_dim"() {dimension = "y"} : () -> index
				// CHECK: [[VAL_14:%.*]] = index_cast [[VAL_13]] : index to i32
				// CHECK: [[VAL_15:%.*]] = "gpu.block_dim"() {dimension = "z"} : () -> index
				// CHECK: [[VAL_16:%.*]] = index_cast [[VAL_15]] : index to i32
				// CHECK: [[VAL_17:%.*]] = "gpu.thread_id"() {dimension = "x"} : () -> index
				// CHECK: [[VAL_18:%.*]] = index_cast [[VAL_17]] : index to i32
				// CHECK: [[VAL_19:%.*]] = "gpu.thread_id"() {dimension = "y"} : () -> index
				// CHECK: [[VAL_20:%.*]] = index_cast [[VAL_19]] : index to i32
				// CHECK: [[VAL_21:%.*]] = "gpu.thread_id"() {dimension = "z"} : () -> index
				// CHECK: [[VAL_22:%.*]] = index_cast [[VAL_21]] : index to i32
				// CHECK: [[VAL_23:%.*]] = muli [[VAL_22]], [[VAL_14]] : i32
				// CHECK: [[VAL_24:%.*]] = addi [[VAL_23]], [[VAL_20]] : i32
				// CHECK: [[VAL_25:%.*]] = muli [[VAL_24]], [[VAL_12]] : i32
				// CHECK: [[VAL_26:%.*]] = muli [[VAL_12]], [[VAL_14]] : i32
				// CHECK: [[VAL_27:%.*]] = addi [[VAL_25]], [[VAL_18]] : i32
				// CHECK: [[VAL_28:%.*]] = muli [[VAL_26]], [[VAL_16]] : i32
				// CHECK: [[VAL_29:%.*]] = and [[VAL_27]], [[VAL_2]] : i32
				// CHECK: [[VAL_30:%.*]] = cmpi "eq", [[VAL_29]], [[VAL_3]] : i32
				// CHECK: [[VAL_31:%.*]] = subi [[VAL_27]], [[VAL_29]] : i32
				// CHECK: [[VAL_32:%.*]] = subi [[VAL_28]], [[VAL_31]] : i32
				// CHECK: [[VAL_33:%.*]] = cmpi "slt", [[VAL_32]], [[VAL_5]] : i32
				// CHECK: cond_br [[VAL_33]], ^bb1, ^bb17
				// CHECK: ^bb1:
				// CHECK: [[VAL_34:%.*]], [[VAL_35:%.*]] = gpu.shuffle [[VAL_0]], [[VAL_6]], [[VAL_32]] xor : f32
				// CHECK: cond_br [[VAL_35]], ^bb2, ^bb3
				// CHECK: ^bb2:
				// CHECK: [[VAL_36:%.*]] = addf [[VAL_0]], [[VAL_34]] : f32
				// CHECK: br ^bb4([[VAL_36]] : f32)
				// CHECK: ^bb3:
				// CHECK: br ^bb4([[VAL_0]] : f32)
				// CHECK: ^bb4([[VAL_37:%.*]]: f32):
				// CHECK: [[VAL_38:%.*]], [[VAL_39:%.*]] = gpu.shuffle [[VAL_37]], [[VAL_7]], [[VAL_32]] xor : f32
				// CHECK: cond_br [[VAL_39]], ^bb5, ^bb6
				// CHECK: ^bb5:
				// CHECK: [[VAL_40:%.*]] = addf [[VAL_37]], [[VAL_38]] : f32
				// CHECK: br ^bb7([[VAL_40]] : f32)
				// CHECK: ^bb6:
				// CHECK: br ^bb7([[VAL_37]] : f32)
				// CHECK: ^bb7([[VAL_41:%.*]]: f32):
				// CHECK: [[VAL_42:%.*]], [[VAL_43:%.*]] = gpu.shuffle [[VAL_41]], [[VAL_8]], [[VAL_32]] xor : f32
				// CHECK: cond_br [[VAL_43]], ^bb8, ^bb9
				// CHECK: ^bb8:
				// CHECK: [[VAL_44:%.*]] = addf [[VAL_41]], [[VAL_42]] : f32
				// CHECK: br ^bb10([[VAL_44]] : f32)
				// CHECK: ^bb9:
				// CHECK: br ^bb10([[VAL_41]] : f32)
				// CHECK: ^bb10([[VAL_45:%.*]]: f32):
				// CHECK: [[VAL_46:%.*]], [[VAL_47:%.*]] = gpu.shuffle [[VAL_45]], [[VAL_9]], [[VAL_32]] xor : f32
				// CHECK: cond_br [[VAL_47]], ^bb11, ^bb12
				// CHECK: ^bb11:
				// CHECK: [[VAL_48:%.*]] = addf [[VAL_45]], [[VAL_46]] : f32
				// CHECK: br ^bb13([[VAL_48]] : f32)
				// CHECK: ^bb12:
				// CHECK: br ^bb13([[VAL_45]] : f32)
				// CHECK: ^bb13([[VAL_49:%.*]]: f32):
				// CHECK: [[VAL_50:%.*]], [[VAL_51:%.*]] = gpu.shuffle [[VAL_49]], [[VAL_10]], [[VAL_32]] xor : f32
				// CHECK: cond_br [[VAL_51]], ^bb14, ^bb15
				// CHECK: ^bb14:
				// CHECK: [[VAL_52:%.*]] = addf [[VAL_49]], [[VAL_50]] : f32
				// CHECK: br ^bb16([[VAL_52]] : f32)
				// CHECK: ^bb15:
				// CHECK: br ^bb16([[VAL_49]] : f32)
				// CHECK: ^bb16([[VAL_53:%.*]]: f32):
				// CHECK: br ^bb18([[VAL_53]] : f32)
				// CHECK: ^bb17:
				// CHECK: [[VAL_54:%.*]], [[VAL_55:%.*]] = gpu.shuffle [[VAL_0]], [[VAL_6]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_56:%.*]] = addf [[VAL_0]], [[VAL_54]] : f32
				// CHECK: [[VAL_57:%.*]], [[VAL_58:%.*]] = gpu.shuffle [[VAL_56]], [[VAL_7]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_59:%.*]] = addf [[VAL_56]], [[VAL_57]] : f32
				// CHECK: [[VAL_60:%.*]], [[VAL_61:%.*]] = gpu.shuffle [[VAL_59]], [[VAL_8]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_62:%.*]] = addf [[VAL_59]], [[VAL_60]] : f32
				// CHECK: [[VAL_63:%.*]], [[VAL_64:%.*]] = gpu.shuffle [[VAL_62]], [[VAL_9]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_65:%.*]] = addf [[VAL_62]], [[VAL_63]] : f32
				// CHECK: [[VAL_66:%.*]], [[VAL_67:%.*]] = gpu.shuffle [[VAL_65]], [[VAL_10]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_68:%.*]] = addf [[VAL_65]], [[VAL_66]] : f32
				// CHECK: br ^bb18([[VAL_68]] : f32)
				// CHECK: ^bb18([[VAL_69:%.*]]: f32):
				// CHECK: cond_br [[VAL_30]], ^bb19, ^bb20
				// CHECK: ^bb19:
				// CHECK: [[VAL_70:%.*]] = divi_signed [[VAL_27]], [[VAL_5]] : i32
				// CHECK: [[VAL_71:%.*]] = index_cast [[VAL_70]] : i32 to index
				// CHECK: store [[VAL_69]], [[VAL_1]]{{\[}}[[VAL_71]]] : memref<32xf32, 3>
				// CHECK: br ^bb21
				// CHECK: ^bb20:
				// CHECK: br ^bb21
				// CHECK: ^bb21:
				// CHECK: gpu.barrier
				// CHECK: [[VAL_72:%.*]] = addi [[VAL_28]], [[VAL_2]] : i32
				// CHECK: [[VAL_73:%.*]] = divi_signed [[VAL_72]], [[VAL_5]] : i32
				// CHECK: [[VAL_74:%.*]] = cmpi "slt", [[VAL_27]], [[VAL_73]] : i32
				// CHECK: cond_br [[VAL_74]], ^bb22, ^bb41
				// CHECK: ^bb22:
				// CHECK: [[VAL_75:%.*]] = index_cast [[VAL_27]] : i32 to index
				// CHECK: [[VAL_76:%.*]] = load [[VAL_1]]{{\[}}[[VAL_75]]] : memref<32xf32, 3>
				// CHECK: [[VAL_77:%.*]] = cmpi "slt", [[VAL_73]], [[VAL_5]] : i32
				// CHECK: cond_br [[VAL_77]], ^bb23, ^bb39
				// CHECK: ^bb23:
				// CHECK: [[VAL_78:%.*]], [[VAL_79:%.*]] = gpu.shuffle [[VAL_76]], [[VAL_6]], [[VAL_73]] xor : f32
				// CHECK: cond_br [[VAL_79]], ^bb24, ^bb25
				// CHECK: ^bb24:
				// CHECK: [[VAL_80:%.*]] = addf [[VAL_76]], [[VAL_78]] : f32
				// CHECK: br ^bb26([[VAL_80]] : f32)
				// CHECK: ^bb25:
				// CHECK: br ^bb26([[VAL_76]] : f32)
				// CHECK: ^bb26([[VAL_81:%.*]]: f32):
				// CHECK: [[VAL_82:%.*]], [[VAL_83:%.*]] = gpu.shuffle [[VAL_81]], [[VAL_7]], [[VAL_73]] xor : f32
				// CHECK: cond_br [[VAL_83]], ^bb27, ^bb28
				// CHECK: ^bb27:
				// CHECK: [[VAL_84:%.*]] = addf [[VAL_81]], [[VAL_82]] : f32
				// CHECK: br ^bb29([[VAL_84]] : f32)
				// CHECK: ^bb28:
				// CHECK: br ^bb29([[VAL_81]] : f32)
				// CHECK: ^bb29([[VAL_85:%.*]]: f32):
				// CHECK: [[VAL_86:%.*]], [[VAL_87:%.*]] = gpu.shuffle [[VAL_85]], [[VAL_8]], [[VAL_73]] xor : f32
				// CHECK: cond_br [[VAL_87]], ^bb30, ^bb31
				// CHECK: ^bb30:
				// CHECK: [[VAL_88:%.*]] = addf [[VAL_85]], [[VAL_86]] : f32
				// CHECK: br ^bb32([[VAL_88]] : f32)
				// CHECK: ^bb31:
				// CHECK: br ^bb32([[VAL_85]] : f32)
				// CHECK: ^bb32([[VAL_89:%.*]]: f32):
				// CHECK: [[VAL_90:%.*]], [[VAL_91:%.*]] = gpu.shuffle [[VAL_89]], [[VAL_9]], [[VAL_73]] xor : f32
				// CHECK: cond_br [[VAL_91]], ^bb33, ^bb34
				// CHECK: ^bb33:
				// CHECK: [[VAL_92:%.*]] = addf [[VAL_89]], [[VAL_90]] : f32
				// CHECK: br ^bb35([[VAL_92]] : f32)
				// CHECK: ^bb34:
				// CHECK: br ^bb35([[VAL_89]] : f32)
				// CHECK: ^bb35([[VAL_93:%.*]]: f32):
				// CHECK: [[VAL_94:%.*]], [[VAL_95:%.*]] = gpu.shuffle [[VAL_93]], [[VAL_10]], [[VAL_73]] xor : f32
				// CHECK: cond_br [[VAL_95]], ^bb36, ^bb37
				// CHECK: ^bb36:
				// CHECK: [[VAL_96:%.*]] = addf [[VAL_93]], [[VAL_94]] : f32
				// CHECK: br ^bb38([[VAL_96]] : f32)
				// CHECK: ^bb37:
				// CHECK: br ^bb38([[VAL_93]] : f32)
				// CHECK: ^bb38([[VAL_97:%.*]]: f32):
				// CHECK: br ^bb40([[VAL_97]] : f32)
				// CHECK: ^bb39:
				// CHECK: [[VAL_98:%.*]], [[VAL_99:%.*]] = gpu.shuffle [[VAL_76]], [[VAL_6]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_100:%.*]] = addf [[VAL_76]], [[VAL_98]] : f32
				// CHECK: [[VAL_101:%.*]], [[VAL_102:%.*]] = gpu.shuffle [[VAL_100]], [[VAL_7]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_103:%.*]] = addf [[VAL_100]], [[VAL_101]] : f32
				// CHECK: [[VAL_104:%.*]], [[VAL_105:%.*]] = gpu.shuffle [[VAL_103]], [[VAL_8]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_106:%.*]] = addf [[VAL_103]], [[VAL_104]] : f32
				// CHECK: [[VAL_107:%.*]], [[VAL_108:%.*]] = gpu.shuffle [[VAL_106]], [[VAL_9]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_109:%.*]] = addf [[VAL_106]], [[VAL_107]] : f32
				// CHECK: [[VAL_110:%.*]], [[VAL_111:%.*]] = gpu.shuffle [[VAL_109]], [[VAL_10]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_112:%.*]] = addf [[VAL_109]], [[VAL_110]] : f32
				// CHECK: br ^bb40([[VAL_112]] : f32)
				// CHECK: ^bb40([[VAL_113:%.*]]: f32):
				// CHECK: store [[VAL_113]], [[VAL_1]]{{\[}}[[VAL_4]]] : memref<32xf32, 3>
				// CHECK: br ^bb42
				// CHECK: ^bb41:
				// CHECK: br ^bb42
				// CHECK: ^bb42:
				// CHECK: gpu.barrier
				csigg: These CHECKs were generated from the output with:
				sed -r \
				-e 's|\t+//.*||' \
				-e 's|%([a-z_0-9]+) = |%[[v\1:[a-z_0-9]+]] = |' \
				-e 's|\(%([a-z_0-9]+): ([a-z_0-9]+)\):|(%[[v\1:[a-z_0-9]+]]: \2):|' \
				-e 's|%([a-z_0-9]+)|%[[v\1]]|g' \
				-e 's|bb([0-9]+)|bb[[#b\1]]|g' \
				-e 's|^ |    // CHECK:|'
				and manual edits from there.
				ftynse: We have https://github.com/llvm/llvm-project/blob/master/mlir/utils/generate-test-checks.py, have you tried it?
				csigg: Works very well, I wish I knew about this (well, Mehdi mentioned it but I couldn't find it).
				// CHECK: [[VAL_114:%.*]] = load [[VAL_1]]{{\[}}[[VAL_4]]] : memref<32xf32, 3>
				%sum = "gpu.all_reduce"(%arg0) ({}) {op = "add"} : (f32) -> (f32)
				gpu.return
				}

				}
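
The index arithmetic at the top of the CHECK lines (the `muli`/`addi` chain, `and` with 31, the `subi` pair, and the later `addi`/`divi_signed` by 32) can be followed with concrete numbers. The sketch below redoes the same integer computations on the host, assuming a subgroup size of 32; `Dim3`, `SubgroupInfo`, `kWidth` and `computeSubgroupInfo` are hypothetical names introduced for illustration.

```cpp
#include <cassert>

struct Dim3 {
  int x, y, z;
};

struct SubgroupInfo {
  int invocationIdx; // linear thread index within the workgroup
  int laneId;        // index of the thread within its subgroup
  int subgroupId;    // index of the subgroup within the workgroup
  int activeWidth;   // threads in this subgroup and all later ones
  int numSubgroups;  // subgroups needed to cover the workgroup
};

constexpr int kWidth = 32; // assumed subgroup size

SubgroupInfo computeSubgroupInfo(Dim3 blockDim, Dim3 threadId) {
  assert(blockDim.x > 0 && blockDim.y > 0 && blockDim.z > 0);
  int invocationIdx =
      (threadId.z * blockDim.y + threadId.y) * blockDim.x + threadId.x;
  int workgroupSize = blockDim.x * blockDim.y * blockDim.z;
  int laneId = invocationIdx & (kWidth - 1);
  // Threads that sit in subgroups before the current one.
  int threadsInEarlierSubgroups = invocationIdx - laneId;
  return {invocationIdx, laneId, invocationIdx / kWidth,
          workgroupSize - threadsInEarlierSubgroups,
          (workgroupSize + kWidth - 1) / kWidth};
}
```

For a workgroup of 40x1x1 threads, thread 35 gets lane 3 in subgroup 1 with an active width of 8, so it takes the partial-subgroup branch (`activeWidth < 32`), while every thread in subgroup 0 sees an active width of 40 and takes the unconditional branch.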

mlir/test/lib/Transforms/CMakeLists.txt

	add_llvm_library(MLIRTestTransforms
				TestAllReduceLowering.cpp
	TestCallGraph.cpp
	TestConstantFold.cpp
	TestLoopFusion.cpp
	TestGpuMemoryPromotion.cpp
	TestInlining.cpp
	TestLinalgTransforms.cpp
	TestLiveness.cpp
	TestLoopMapping.cpp
	add_dependencies(MLIRTestTransforms MLIRTestLinalgTransformPatternsIncGen)
	add_dependencies(MLIRTestTransforms MLIRTestVectorTransformPatternsIncGen)
	target_link_libraries(MLIRTestTransforms
	MLIRAffineOps
	MLIRAnalysis
	MLIREDSC
	MLIRGPU
	MLIRLoopOps
				MLIRGPU
	MLIRPass
	MLIRTestDialect
	MLIRVectorOps
	)

mlir/test/lib/Transforms/TestAllReduceLowering.cpp

This file was added.

				//===- TestAllReduceLowering.cpp - Test gpu.all_reduce lowering -----------===//
				//
				// Part of the MLIR Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This file contains test passes for lowering the gpu.all_reduce op.
				//
				//===----------------------------------------------------------------------===//

				#include "mlir/Dialect/GPU/Passes.h"
				#include "mlir/IR/PatternMatch.h"
				#include "mlir/Pass/Pass.h"

				using namespace mlir;

				namespace {
				struct TestAllReduceLoweringPass
				: public ModulePass<TestAllReduceLoweringPass> {
				void runOnModule() override {
				OwningRewritePatternList patterns;
				populateGpuRewritePatterns(&getContext(), patterns);
				applyPatternsGreedily(getModule(), patterns);
				}
				};
				} // namespace

				static PassRegistration<TestAllReduceLoweringPass>
				pass("test-all-reduce-lowering",
				"Lowers gpu.all-reduce ops within the GPU dialect.");

[mlir] Add in-dialect lowering of gpu.all_reduce. (Closed, Public)

Unit Tests: Failed

Diff 237994

mlir/include/mlir/Dialect/GPU/GPUOps.td

mlir/include/mlir/Dialect/GPU/Passes.h

mlir/include/mlir/IR/Block.h

mlir/lib/Dialect/GPU/CMakeLists.txt

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp

mlir/lib/IR/Block.cpp

mlir/test/Dialect/GPU/all-reduce.mlir

mlir/test/lib/Transforms/CMakeLists.txt

mlir/test/lib/Transforms/TestAllReduceLowering.cpp