This is an archive of the discontinued LLVM Phabricator instance.

Differential D72129

[mlir] Add in-dialect lowering of gpu.all_reduce.
Closed · Public

Authored by csigg on Jan 3 2020, 12:53 AM.


Details

Reviewers

ftynse
nicolasvasilache
herhut

Commits

rG8b2eb7c494b2: [mlir] Add in-dialect lowering of gpu.all_reduce.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

csigg created this revision. · Jan 3 2020, 12:53 AM

Herald added a project: Restricted Project. · Jan 3 2020, 12:53 AM

Herald added subscribers: llvm-commits, lucyrfox, mgester and 9 others.

lebedev.ri retitled this revision from "Add in-dialect lowering of gpu.all_reduce." to "[mlir] Add in-dialect lowering of gpu.all_reduce." · Jan 3 2020, 1:00 AM

csigg added a reviewer: ftynse. · Jan 3 2020, 1:02 AM

csigg marked an inline comment as done.

csigg added inline comments.

mlir/test/Dialect/GPU/all-reduce.mlir

7–176

These CHECKs were generated from the output with:

sed -r \
-e 's|\t+//.*||' \
-e 's|%([a-z_0-9]+) = |%[[v\1:[a-z_0-9]+]] = |' \
-e 's|\(%([a-z_0-9]+): ([a-z_0-9]+)\):|(%[[v\1:[a-z_0-9]+]]: \2):|' \
-e 's|%([a-z_0-9]+)|%[[v\1]]|g' \
-e 's|bb([0-9]+)|bb[[#b\1]]|g' \
-e 's|^ |    // CHECK:|'

and manual edits from there.
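
As a rough illustration of what those sed expressions do, the sketch below applies the same substitutions to a single line with `std::regex`. It is not part of the patch: the function name `toCheckLine` is hypothetical, and the block-argument and `bb` label expressions are left out to keep it short.

```cpp
#include <regex>
#include <string>

// Rewrites one line of printed IR into a FileCheck pattern, following the
// sed expressions quoted above. `toCheckLine` is a hypothetical name; the
// expressions for block arguments and bb labels are intentionally omitted.
std::string toCheckLine(std::string line) {
  using std::regex_constants::format_first_only;
  // Drop a trailing comment that is separated from the op by tabs.
  line = std::regex_replace(line, std::regex("\t+//.*"), "",
                            format_first_only);
  // The first definition `%x = ` becomes a capture: `%[[vx:[a-z_0-9]+]] = `.
  line = std::regex_replace(line, std::regex("%([a-z_0-9]+) = "),
                            "%[[v$1:[a-z_0-9]+]] = ", format_first_only);
  // Every remaining `%x` becomes a use of the captured variable `%[[vx]]`.
  line = std::regex_replace(line, std::regex("%([a-z_0-9]+)"), "%[[v$1]]");
  // One leading space is replaced by the CHECK prefix.
  line = std::regex_replace(line, std::regex("^ "), "    // CHECK:",
                            format_first_only);
  return line;
}
```

Calling `toCheckLine("  %0 = addi %arg0, %arg1 : i32")` yields a line that captures `v0` at its definition and refers to `varg0` and `varg1` as uses. The review later replaces this approach with generate-test-checks.py.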

merge_guards_bot added a subscriber: merge_guards_bot. · Jan 3 2020, 1:03 AM

Unit tests: pass. 61127 tests passed, 0 failed and 728 were skipped.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: pass.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B43216: Diff 236000! · Jan 3 2020, 1:03 AM

Wrong clang-tidy checks are annoying here, will make another pass later.

mlir/include/mlir/Dialect/GPU/GPUOps.td
163	Style nit: add whitespace around "+"
mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
29	Nit: `explicit` is unnecessary for multi-argument constructors.
39	Nit: can we use "neutral" terminology: shared memory -> workgroup memory, block -> workgroup ?
47	Should this be `store %warp_reduce, %workgroup_buffer[%warp_id]`? Same below for loads and stores
52	Should this be `cond_br %is_valid_warp` instead?
67	Big and mechanical change: `Value` is now value-based, please use it without pointers.
81	Style nit: add a comment `/*bitwidth=*/` for `32`
88	Not sure I follow here. `numThreadsWithSmallerWarpId` is the equivalent of `warp_id * warp_size`, or `floor(linear_thread_id, warp_size) * warp_size`. Subtracting that from `blockSize` will give you the number of threads in all warps starting from the current warp. From skimming through the consumer (createWarpReduce), it looks like what it expects is the number of threads in the _current_ warp.
104	This looks like you want EDSC :) + @nicolasvasilache
138	Maybe you can store the location and pass it as first argument? It's the same for all operations you create above.
150	Nit: let's have a named constant for the address space
201	Why only `F32`? Could you have `isa<FloatType>()` instead?
243	Is this lambda worse it?
mlir/test/Dialect/GPU/all-reduce.mlir
7	Is there a reason why you can't use `.*` as a regexp for names? Seeing the full with ranges makes it hard to read the test. I think we also support capital letters in the names.
7–176	We have https://github.com/llvm/llvm-project/blob/master/mlir/utils/generate-test-checks.py, have you tried it?
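
The pseudo-code discussed in the comments on lines 39 to 52 describes a two-stage reduction: reduce within each warp, store one partial result per warp in workgroup memory, then let the first warp reduce the partials and broadcast the result. The sketch below simulates that scheme sequentially on the CPU for a sum. It is only an illustration under stated assumptions (a warp size of 32, addition as the accumulator, the hypothetical name `simulateAllReduceSum`) and is not the code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kWarpSize = 32; // assumed subgroup width

// Sequentially simulates the two-stage scheme for a sum over one workgroup.
// Returns the value every thread would load after the final barrier.
int simulateAllReduceSum(const std::vector<int> &perThreadValue) {
  assert(!perThreadValue.empty());
  const std::size_t numThreads = perThreadValue.size();
  const std::size_t numWarps = (numThreads + kWarpSize - 1) / kWarpSize;
  assert(numWarps <= kWarpSize && "second stage is done by a single warp");

  // Stage 1: each warp reduces its own values; its first lane stores the
  // partial result at workgroupBuffer[warpId].
  std::vector<int> workgroupBuffer(numWarps, 0);
  for (std::size_t tid = 0; tid < numThreads; ++tid)
    workgroupBuffer[tid / kWarpSize] += perThreadValue[tid];

  // (barrier)

  // Stage 2: threads with index < numWarps each load one partial result,
  // the first warp reduces them and stores the total at slot 0.
  int allReduce = 0;
  for (std::size_t tid = 0; tid < numWarps; ++tid)
    allReduce += workgroupBuffer[tid];
  workgroupBuffer[0] = allReduce;

  // (barrier) Every thread then loads slot 0.
  return workgroupBuffer[0];
}
```

In this model the second-stage condition is `tid < numWarps`, which corresponds to the `%is_valid_warp` predicate that the comment on line 52 asks about.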

nicolasvasilache added a reviewer: nicolasvasilache. · Jan 6 2020, 10:13 AM

Update reflecting review comments from ftynse.

Thanks a lot for the review, Alex!

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
29	Removed. Nit the nit: `explicit` does prevent copy-initialization from initializer list.
88	Correct. The activeWidth does not need to be clamped to subgroup width though. I added two comments.
150	Punted by just adding a local variable.
243	Haha, I like your comment. Very subtle ;-)
mlir/test/Dialect/GPU/all-reduce.mlir
7	Replaced with generate-test-checks.py result.
7–176	Works very well, I wish I knew about this (well, Mehdi mentioned it but I couldn't find it).
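
The exchange about line 88 is easier to follow with concrete numbers. The sketch below reproduces the index arithmetic from Diff 236000 on plain integers; the struct and function names are hypothetical, and a warp size of 32 is assumed.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int32_t kWarpSize = 32; // assumed; must be a power of two

struct LaneInfo {
  std::int32_t laneId;                      // threadIdx & (kWarpSize - 1)
  std::int32_t numThreadsWithSmallerWarpId; // warp_id * warp_size
  std::int32_t activeWidth;                 // not clamped to kWarpSize
};

// Mirrors the and/sub/sub sequence that the patch builds as IR.
LaneInfo computeLaneInfo(std::int32_t threadIdx, std::int32_t blockSize) {
  assert(threadIdx >= 0 && threadIdx < blockSize);
  LaneInfo info;
  info.laneId = threadIdx & (kWarpSize - 1);
  info.numThreadsWithSmallerWarpId = threadIdx - info.laneId;
  info.activeWidth = blockSize - info.numThreadsWithSmallerWarpId;
  return info;
}
```

For a workgroup of 100 threads, thread 70 gets lane 6 and an active width of 36, while thread 97 in the last warp gets lane 1 and an active width of 4. One reading of the reply above is that a value of 32 or more simply means the whole warp is active, so only the last, partially filled warp produces a width that matters.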

Unit tests: pass. 61127 tests passed, 0 failed and 728 were skipped.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B43364: Diff 236427! · Jan 6 2020, 11:55 AM

rriddle added inline comments. · Jan 6 2020, 12:08 PM

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
163	`auto op = reduceOp.op()` to avoid recomputing below.
301	Remove `llvm::` from most of the things in this file. They are re-exported in the mlir namespace already.
353	When does this ever fail?

Addressing rriddle's review comments.

Unit tests: unknown.

clang-tidy: unknown.

clang-format: unknown.

Build artifacts: diff.json, console-log.txt

Harbormaster failed remote builds in B43416: Diff 236553! · Jan 7 2020, 5:05 AM

csigg marked 4 inline comments as done. · Jan 7 2020, 5:08 AM

csigg added inline comments.

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
353	See two lines above: every time the callback is invoked. This makes sure that all occurrences of gpu.all_reduce in the same gpu.function are replaced.

herhut added a subscriber: herhut. · Jan 7 2020, 7:38 AM

I somehow only see a subset of changes now...

Updating again, hopefully with all changes this time.

Unit tests: pass. 61127 tests passed, 0 failed and 728 were skipped.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B43489: Diff 236769! · Jan 7 2020, 11:47 PM

Almost there, only minor things necessary to improve understanding.

The approach you took for building code is very similar to our motivation for declarative builders (aka EDSC), maybe it's time to reconsider those again.

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
150	I'll add that later anyway :)
269	I'd add an assertion that `predicatedOpsFactory()` does not return any values that it might expect to be passed to the continuation.
355	Could you please add a comment on why this approach is necessary? I understand that you need to operate on the GPUFuncOp level because you modify the GPUFuncOp itself, which would be incorrect from a nested operation (AllReduceOp). I don't understand why do you need to interrupt the walk after each rewrite. Is it because of some state invalidation?

This revision now requires changes to proceed. · Jan 8 2020, 6:07 AM

Address ftynse's review comments.

The approach you took for building code is very similar to our motivation for declarative builders (aka EDSC), maybe it's time to reconsider those again.

Yes, I'm happy to change this to EDSC as a follow-up. For now I think it is easier to keep it similar to the existing lowering to NVVM.

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
150	Thanks!
355	Correct, the walk iterators get invalidated from the replace. Comment added.
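
The restart-after-rewrite pattern discussed for line 355 can be shown without MLIR. The sketch below is only an analogy: a `std::vector<std::string>` stands in for the operations of a function, inserting into it invalidates iterators much like a replacement invalidates the walk, and all names and the expansion list are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Replaces the first "gpu.all_reduce" entry by a hypothetical expansion.
// Returns true if a replacement happened. The insertion invalidates any
// iterator into `ops`, so the caller must not continue an ongoing scan.
bool rewriteFirstAllReduce(std::vector<std::string> &ops) {
  auto it = std::find(ops.begin(), ops.end(), "gpu.all_reduce");
  if (it == ops.end())
    return false;
  const std::vector<std::string> expansion = {"warp_reduce", "store",
                                              "gpu.barrier", "load"};
  it = ops.erase(it);
  ops.insert(it, expansion.begin(), expansion.end());
  return true;
}

// Restarts the scan from the beginning after every rewrite until no
// occurrence is left, and reports how many rewrites were applied.
int rewriteAll(std::vector<std::string> &ops) {
  int numRewrites = 0;
  while (rewriteFirstAllReduce(ops))
    ++numRewrites;
  return numRewrites;
}
```

The loop in `rewriteAll` plays the role of the interrupted and repeated walk: each pass finds at most one occurrence, rewrites it, and then starts over on the modified sequence.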

Unit tests: unknown.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt

Harbormaster failed remote builds in B43673: Diff 237293! · Jan 10 2020, 5:47 AM

What is the status re EDSC discussion?
Were the pointers I sent offline enough to give a good picture / do you see how to followup on this to use (and maybe extend) EDSCs?

Herald added a subscriber: aartbik. · Jan 10 2020, 4:07 PM

In D72129#1815174, @nicolasvasilache wrote:

What is the status re EDSC discussion?
Were the pointers I sent offline enough to give a good picture / do you see how to followup on this to use (and maybe extend) EDSCs?

Yes, from scanning the doc and code it looks like this shouldn't be too difficult. But I haven't actually tried.

Rebase.

Herald added a subscriber: liufengdb. · Jan 14 2020, 12:44 AM

csigg added a reviewer: herhut. · Jan 14 2020, 12:46 AM

Unit tests: unknown.

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt

Harbormaster failed remote builds in B43902: Diff 237871! · Jan 14 2020, 12:55 AM

Fix build error after Value* -> Value change.

Apply clang-format.

Unit tests: fail. 61801 tests passed, 1 failed and 781 were skipped.

failed: MLIR.Dialect/GPU/all-reduce.mlir

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B43947: Diff 237993! · Jan 14 2020, 8:50 AM

Unit tests: fail. 61801 tests passed, 1 failed and 781 were skipped.

failed: MLIR.Dialect/GPU/all-reduce.mlir

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B43948: Diff 237994! · Jan 14 2020, 8:50 AM

What is the status re EDSC discussion?
Were the pointers I sent offline enough to give a good picture / do you see how to followup on this to use (and maybe extend) EDSCs?

Could you keep me in the loop on that? I have an even simpler proposal in mind that could reconcile EDSC with imperative builders.

I pointed Christian to the EDSC doc (https://mlir.llvm.org/docs/EDSC/) and the builder-api-test but otherwise nothing else was discussed.
I didn't look in the details of what would be involved in making the transition.
Can you share what the simpler proposal, than just using the declarative builders, is?

In D72129#1820630, @ftynse wrote:

What is the status re EDSC discussion?
Were the pointers I sent offline enough to give a good picture / do you see how to followup on this to use (and maybe extend) EDSCs?

Could you keep me in the loop on that? I have an even simpler proposal in mind that could reconcile EDSC with imperative builders.

I'm interested to see some movement on this, at this point EDSC isn't totally endorsed by the whole team, and the uncontrolled use in the codebase isn't a great situation.

I have reviewed this before (the basic approach has not changed). Thanks for adding more comments and tests.

LGTM.

Regarding the EDSC vs. use of locally grown alternative in this patch: I try to avoid doing things like the template for adding location. While it saves some characters, it makes code look unfamiliar, which is also a cost. Adding helpers to emit conditionals etc. on the other hand reduces repetition, so I see that as a benefit. Still, I am fine with landing this as it seems to be the plan to evolve it to some EDSC like thing once there is an endorsed one.

@mehdi_amini @nicolasvasilache @herhut Let's take some time and discuss builder APIs outside this diff (also involving @rriddle). My basic observations are that (1) writing structured IR, as in "with nested regions", looks unnecessarily complicated with builders, arguments are the same as those against goto-style programming; (2) a lot of IR construction internally happens in rewrite patterns, where location almost always remains the same, that of the matched operation root; (3) current EDSC APIs are contentious partly because it is unclear when reading the code when the function call creates the IR vs. when it's just a function call.
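
Observation (2) and the earlier suggestion to store the location once can be illustrated with a small wrapper that remembers the location of the matched root operation and prepends it to every builder call. The sketch below uses mock types only; `Location`, `Op`, `MockBuilder` and `FixedLocBuilder` are hypothetical stand-ins and none of them are MLIR classes.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; these are not MLIR types.
struct Location {
  std::string file;
  int line;
};
struct Op {
  std::string name;
  Location loc;
};

// A mock builder whose create() expects the location as first argument.
class MockBuilder {
public:
  Op create(const Location &loc, std::string name) {
    ops.push_back(Op{std::move(name), loc});
    return ops.back();
  }
  std::vector<Op> ops;
};

// Remembers one location and forwards it to every create() call.
class FixedLocBuilder {
public:
  FixedLocBuilder(MockBuilder &builder, Location loc)
      : builder(builder), loc(std::move(loc)) {}

  template <typename... Args> Op create(Args &&...args) {
    return builder.create(loc, std::forward<Args>(args)...);
  }

private:
  MockBuilder &builder;
  Location loc;
};
```

With such a wrapper a call site shrinks from `builder.create(loc, "muli")` to `fixed.create("muli")`. As noted in the review, this saves some characters at the cost of code that looks less familiar to readers used to the plain builder.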

@csigg I'm fine accepting this, but the test is currently broken. Please fix and we'll be ready to land.

@herhut @ftynse @mehdi_amini sounds good, it's high time to rediscuss this and to make MLIR play nicely with metaprogramming in a way that feels comfortable to everyone.

Do not wrap temporaries in ValueRange.

Unit tests: unknown.

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt

Harbormaster failed remote builds in B44300: Diff 238844! · Jan 17 2020, 12:18 PM

ftynse accepted this revision. · Jan 20 2020, 1:16 AM

This revision is now accepted and ready to land. · Jan 20 2020, 1:16 AM

clang-format.

Closed by commit rG8b2eb7c494b2: [mlir] Add in-dialect lowering of gpu.all_reduce. (authored by csigg). · Jan 20 2020, 4:44 AM

This revision was automatically updated to reflect the committed changes.

Unit tests: pass. 62015 tests passed, 0 failed and 783 were skipped.

clang-tidy: unknown.

clang-format: pass.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster completed remote builds in B44390: Diff 239074. · Jan 20 2020, 4:54 AM

Revision Contents

Path                                                    Size
mlir/include/mlir/Dialect/GPU/GPUOps.td                 10 lines
mlir/include/mlir/Dialect/GPU/Passes.h                  6 lines
mlir/include/mlir/IR/Block.h                            3 lines
mlir/lib/Dialect/GPU/CMakeLists.txt                     1 line
mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp   362 lines
mlir/lib/IR/Block.cpp                                   7 lines
mlir/test/Dialect/GPU/all-reduce.mlir                   180 lines
mlir/test/lib/Transforms/CMakeLists.txt                 2 lines
mlir/test/lib/Transforms/TestAllReduceLowering.cpp      32 lines

Diff 236000

mlir/include/mlir/Dialect/GPU/GPUOps.td

... 149 lines hidden ... (context: let extraClassDeclaration = [{)
  /// the workgroup memory
  ArrayRef<BlockArgument> getWorkgroupAttributions() {
    auto begin =
        std::next(getBody().front().args_begin(), getType().getNumInputs());
    auto end = std::next(begin, getNumWorkgroupAttributions());
    return {begin, end};
  }

+ // Adds a new block argument that corresponds to buffers located in
+ // workgroup memory.
+ BlockArgument* addWorkgroupAttribution(Type type) {
+   auto attrName = getNumWorkgroupAttributionsAttrName();
+   auto attr = getAttrOfType<IntegerAttr>(attrName);
+   setAttr(attrName, IntegerAttr::get(attr.getType(), attr.getValue()+1));
ftynse: Style nit: add whitespace around "+"
+   return getBody().front().insertArgument(
+       getType().getNumInputs() + attr.getInt(), type);
+ }

  /// Returns a list of block arguments that correspond to buffers located in
  /// the private memory.
  ArrayRef<BlockArgument> getPrivateAttributions() {
    auto begin =
        std::next(getBody().front().args_begin(),
                  getType().getNumInputs() + getNumWorkgroupAttributions());
    return {begin, getBody().front().args_end()};
  }
... 422 lines hidden ...

mlir/include/mlir/Dialect/GPU/Passes.h

... 11 lines hidden ...

  #ifndef MLIR_DIALECT_GPU_PASSES_H_
  #define MLIR_DIALECT_GPU_PASSES_H_

  #include <memory>

  namespace mlir {

+ class MLIRContext;
  class ModuleOp;
  template <typename T> class OpPassBase;
+ class OwningRewritePatternList;

  std::unique_ptr<OpPassBase<ModuleOp>> createGpuKernelOutliningPass();

+ /// Collect a set of patterns to rewrite ops within the GPU dialect.
+ void populateGpuRewritePatterns(MLIRContext *context,
Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'context' [readability-identifier-naming]
+                                 OwningRewritePatternList &patterns);
Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'patterns' [readability-identifier-naming]

  } // namespace mlir

  #endif // MLIR_DIALECT_GPU_PASSES_H_

mlir/include/mlir/IR/Block.h

  //===- Block.h - MLIR Block Class -------------------------------*- C++ -*-===//
  //
  // Part of the MLIR Project, under the Apache License v2.0 with LLVM Exceptions.
  // See https://llvm.org/LICENSE.txt for license information.
  // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
  //
  //===----------------------------------------------------------------------===//
  //
  // This file defines the Block class.
  //
  //===----------------------------------------------------------------------===//

  #ifndef MLIR_IR_BLOCK_H
  #define MLIR_IR_BLOCK_H

  #include "mlir/IR/BlockSupport.h"
Lint: Pre-merge checks: clang-tidy: error: 'mlir/IR/BlockSupport.h' file not found [clang-diagnostic-error]
  #include "mlir/IR/Visitors.h"

  namespace mlir {
  /// `Block` represents an ordered list of `Operation`s.
  class Block : public IRObjectWithUseList,
                public llvm::ilist_node_with_parent<Block, Region> {
  public:
    explicit Block() {}
... 52 lines hidden ... (context: public:)
    bool args_empty() { return arguments.empty(); }

    /// Add one value to the argument list.
    BlockArgument addArgument(Type type);

    /// Add one argument to the argument list for each type specified in the list.
    iterator_range<args_iterator> addArguments(ArrayRef<Type> types);

+   // Add one value to the argument list at the specified position.
+   BlockArgument *insertArgument(unsigned index, Type type);

    /// Erase the argument at 'index' and remove it from the argument list. If
    /// 'updatePredTerms' is set to true, this argument is also removed from the
    /// terminators of each predecessor to this block.
    void eraseArgument(unsigned index, bool updatePredTerms = true);

    unsigned getNumArguments() { return arguments.size(); }
    BlockArgument getArgument(unsigned i) { return arguments[i]; }

... 243 lines hidden ...

mlir/lib/Dialect/GPU/CMakeLists.txt

  add_llvm_library(MLIRGPU
    IR/GPUDialect.cpp
    IR/DialectRegistration.cpp
    Transforms/KernelOutlining.cpp
+   Transforms/AllReduceLowering.cpp

    ADDITIONAL_HEADER_DIRS
    ${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/GPU
    )
  add_dependencies(MLIRGPU MLIRGPUOpsIncGen MLIRIR MLIRLLVMIR LLVMSupport)
  target_link_libraries(MLIRGPU MLIRIR MLIRLLVMIR MLIRStandardOps LLVMSupport)

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp

This file was added.

//===- AllReduceLowering.cpp - Implementation of all-reduce lowering ------===//
//
// Part of the MLIR Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// This file implements in-dialect lowering of the all-reduce op to a block of
// simpler instructions.
//
//===----------------------------------------------------------------------===//

#include "mlir/Dialect/GPU/GPUDialect.h"
Lint: Pre-merge checks: clang-tidy: error: 'mlir/Dialect/GPU/GPUDialect.h' file not found [clang-diagnostic-error]
#include "mlir/Dialect/GPU/Passes.h"
#include "mlir/Dialect/StandardOps/Ops.h"
#include "mlir/IR/BlockAndValueMapping.h"
#include "mlir/IR/Builders.h"
#include "mlir/IR/PatternMatch.h"
#include "mlir/Pass/Pass.h"

using namespace mlir;

namespace {

struct GpuAllReduceRewriter {
  using AccumulatorFactory = std::function<Value *(Value *, Value *)>;

  explicit GpuAllReduceRewriter(gpu::GPUFuncOp funcOp_,
Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'funcOp_' [readability-identifier-naming]
ftynse: Nit: `explicit` is unnecessary for multi-argument constructors.
csigg: Removed. Nit the nit: `explicit` does prevent copy-initialization from initializer list.
                                gpu::AllReduceOp reduceOp_,
Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'reduceOp_' [readability-identifier-naming]
                                PatternRewriter &rewriter_)
Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'rewriter_' [readability-identifier-naming]
      : funcOp(funcOp_), reduceOp(reduceOp_), rewriter(rewriter_),
        loc(reduceOp.getLoc()), valueType(reduceOp.value()->getType()),
        indexType(IndexType::get(reduceOp.getContext())),
        int32Type(IntegerType::get(32, reduceOp.getContext())) {}

  /// Creates an all_reduce across the block.
  ///
  /// First reduce the elements within a warp. The first thread of each warp
ftynse: Nit: can we use "neutral" terminology: shared memory -> workgroup memory, block -> workgroup ?
  /// writes the intermediate result to shared memory. After synchronizing the
  /// block, the first warp reduces the values from shared memory. The result
  /// is broadcasted to all threads through shared memory.
  ///
  /// %warp_reduce = `createWarpReduce(%operand)`
  /// cond_br %is_first_lane, ^then1, ^continue1
  /// ^then1:
  /// store %warp_reduce, %workgroup_buffer, %warp_id
ftynse: Should this be `store %warp_reduce, %workgroup_buffer[%warp_id]`? Same below for loads and stores
  /// br ^continue1
  /// ^continue1:
  /// gpu.barrier
  /// %is_valid_warp = cmpi "slt" %thread_idx, %num_warps
  /// cond_br %is_first_lane, ^then2, ^continue2
ftynse: Should this be `cond_br %is_valid_warp` instead?
  /// ^then2:
  /// %partial_reduce = load %workgroup_buffer, %thread_idx
  /// %all_reduce = `createWarpReduce(%partial_reduce)`
  /// store %all_reduce, %workgroup_buffer, %zero
  /// llvm.br ^continue2
  /// ^continue2:
  /// gpu.barrier
  /// %result = load %workgroup_buffer, %zero
  /// return %result
  ///
				void rewrite() {
				rewriter.setInsertionPoint(reduceOp);

				// Compute linear thread index and block size.
				Value *dimX = getDimOp<gpu::BlockDimOp>("x");
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'dimX' [readability-identifier-naming] Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'dimX' [readability-identifier-naming]
				ftynseUnsubmitted Done Reply Inline Actions Big and mechanical change: `Value` is now value-based, please use it without pointers. ftynse: Big and mechanical change: `Value` is now value-based, please use it without pointers.
				Value *dimY = getDimOp<gpu::BlockDimOp>("y");
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'dimY' [readability-identifier-naming] Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'dimY' [readability-identifier-naming]
				Value *dimZ = getDimOp<gpu::BlockDimOp>("z");
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'dimZ' [readability-identifier-naming] Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'dimZ' [readability-identifier-naming]
				Value *tidX = getDimOp<gpu::ThreadIdOp>("x");
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'tidX' [readability-identifier-naming] Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'tidX' [readability-identifier-naming]
				Value *tidY = getDimOp<gpu::ThreadIdOp>("y");
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'tidY' [readability-identifier-naming] Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'tidY' [readability-identifier-naming]
				Value *tidZ = getDimOp<gpu::ThreadIdOp>("z");
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'tidZ' [readability-identifier-naming] Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'tidZ' [readability-identifier-naming]
				Value *tmp1 = create<MulIOp>(loc, int32Type, tidZ, dimY);
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'tmp1' [readability-identifier-naming] Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'tmp1' [readability-identifier-naming]
				Value *tmp2 = create<AddIOp>(loc, int32Type, tmp1, tidY);
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'tmp2' [readability-identifier-naming] Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'tmp2' [readability-identifier-naming]
				Value *tmp3 = create<MulIOp>(loc, int32Type, tmp2, dimX);
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'tmp3' [readability-identifier-naming] Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'tmp3' [readability-identifier-naming]
				Value *tmp4 = create<MulIOp>(loc, int32Type, dimX, dimY);
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'tmp4' [readability-identifier-naming] Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'tmp4' [readability-identifier-naming]
				Value *threadIdx = create<AddIOp>(loc, int32Type, tmp3, tidX);
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'threadIdx' [readability-identifier-naming] Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'threadIdx' [readability-identifier-naming]
				Value *blockSize = create<MulIOp>(loc, int32Type, tmp4, dimZ);
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'blockSize' [readability-identifier-naming] Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'blockSize' [readability-identifier-naming]

				// Compute lane id (invocation id withing the subgroup).
				Value *warpMask = create<ConstantIntOp>(loc, kWarpSize - 1, 32);
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'warpMask' [readability-identifier-naming] Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'warpMask' [readability-identifier-naming]
				ftynseUnsubmitted Done Reply Inline Actions Style nit: add a comment `/bitwidth=/` for `32` ftynse: Style nit: add a comment `/bitwidth=/` for `32`
				Value *laneId = create<AndOp>(loc, threadIdx, warpMask);
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'laneId' [readability-identifier-naming] Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'laneId' [readability-identifier-naming]
				Value *isFirstLane = create<CmpIOp>(loc, CmpIPredicate::eq, laneId,
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'isFirstLane' [readability-identifier-naming] Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'isFirstLane' [readability-identifier…
				create<ConstantIntOp>(loc, 0, 32));

				Value *numThreadsWithSmallerWarpId = create<SubIOp>(loc, threadIdx, laneId);
				Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'numThreadsWithSmallerWarpId' [readability-identifier-naming] Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'numThreadsWithSmallerWarpId' [readability…
				// The number of active threads in the warp, not clamped to 32.
				Value *activeWidth =
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'activeWidth' [readability-identifier-naming]
				ftynse: Not sure I follow here. `numThreadsWithSmallerWarpId` is the equivalent of `warp_id * warp_size`, or `floor(linear_thread_id, warp_size) * warp_size`. Subtracting that from `blockSize` will give you the number of threads in all warps starting from the current warp. From skimming through the consumer (createWarpReduce), it looks like what it expects is the number of threads in the _current_ warp.
				csigg (author): Correct. The activeWidth does not need to be clamped to subgroup width though. I added two comments.
				create<SubIOp>(loc, blockSize, numThreadsWithSmallerWarpId);

				// Create factory for op which accumulates two values.
				AccumulatorFactory accumFactory = getFactory();
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'accumFactory' [readability-identifier-naming]
				assert(accumFactory && "failed to create accumulator factory");

				// Reduce elements within each warp to produce the intermediate results.
				Value *warpReduce =
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'warpReduce' [readability-identifier-naming]
				createWarpReduce(activeWidth, laneId, reduceOp.value(), accumFactory);

				// Add workgroup buffer to parent function for intermediate result.
				Value *buffer = createWorkgroupBuffer();
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'buffer' [readability-identifier-naming]

				// Write the intermediate results to shared memory, using the first lane of
				// each warp.
				createPredicatedBlock(isFirstLane, [&] {
				ftynse: This looks like you want EDSC :) + @nicolasvasilache
				Value *warpId = getDivideByWarpSize(threadIdx);
				Value *index = create<IndexCastOp>(loc, indexType, warpId);
				create<StoreOp>(loc, warpReduce, buffer, index);
				});
				create<gpu::BarrierOp>(loc);

				// Compute number of active warps.
				Value *biasedBlockSize =
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'biasedBlockSize' [readability-identifier-naming]
				create<AddIOp>(loc, int32Type, blockSize, warpMask);
				Value *numWarps = getDivideByWarpSize(biasedBlockSize);
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'numWarps' [readability-identifier-naming]
				Value *isValidWarp =
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'isValidWarp' [readability-identifier-naming]
				create<CmpIOp>(loc, CmpIPredicate::slt, threadIdx, numWarps);

				// Use the first numWarps threads to reduce the intermediate results from
				// shared memory. The final result is written to shared memory again.
				Value *zero = create<ConstantIndexOp>(loc, 0);
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'zero' [readability-identifier-naming]
				createPredicatedBlock(isValidWarp, [&] {
				Value *index = create<IndexCastOp>(loc, indexType, threadIdx);
				Value *value = create<LoadOp>(loc, valueType, buffer, index);
				Value *result = createWarpReduce(numWarps, laneId, value, accumFactory);
				create<StoreOp>(loc, result, buffer, zero);
				});

				// Synchronize workgroup and load result from shared memory.
				create<gpu::BarrierOp>(loc);
				Value *result = create<LoadOp>(loc, valueType, buffer, zero);
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'result' [readability-identifier-naming]

				rewriter.replaceOp(reduceOp, result);
				}
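
The integer arithmetic in `rewrite()` above, including the unclamped `activeWidth` raised in the inline thread, can be checked on the host. This is a minimal sketch, assuming a linear thread index and the fixed warp size of 32 used in this diff; `LaneInfo`, `computeLaneInfo`, and `computeNumWarps` are hypothetical names for illustration and are not part of the patch.

```cpp
// Host-side mirror of the index arithmetic in the lowering (illustrative only).
constexpr int kWarpSize = 32;

struct LaneInfo {
  int laneId;      // threadIdx & (kWarpSize - 1)
  int activeWidth; // threads in this warp and all later warps, not clamped to 32
};

// Mirrors: laneId = threadIdx & warpMask;
//          numThreadsWithSmallerWarpId = threadIdx - laneId;
//          activeWidth = blockSize - numThreadsWithSmallerWarpId.
LaneInfo computeLaneInfo(int threadIdx, int blockSize) {
  int laneId = threadIdx & (kWarpSize - 1);
  int numThreadsWithSmallerWarpId = threadIdx - laneId;
  return {laneId, blockSize - numThreadsWithSmallerWarpId};
}

// Mirrors: numWarps = (blockSize + warpMask) / kWarpSize, i.e. a ceiling division.
int computeNumWarps(int blockSize) {
  return (blockSize + kWarpSize - 1) / kWarpSize;
}
```

For a block of 80 threads, thread 70 sits in the last, partial warp and sees an `activeWidth` of 16, while thread 5 sees 80. Any value of 32 or more selects the full-warp path in `createWarpReduce`, which is why no clamping is needed.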

				private:
				// Shortcut to create op using rewriter.
				template <typename T, typename... Args> T create(Args... args) {
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'args' [readability-identifier-naming]
				return rewriter.create<T>(std::forward<Args>(args)...);
				ftynse: Maybe you can store the location and pass it as first argument? It's the same for all operations you create above.
				}

				// Creates dimension op of type T, with the result casted to int32.
				template <typename T> Value *getDimOp(StringRef dimension) {
				Value *dim = create<T>(loc, indexType, rewriter.getStringAttr(dimension));
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'dim' [readability-identifier-naming]
				return create<IndexCastOp>(loc, int32Type, dim);
				}

				/// Adds type to funcOp's workgroup attributions.
				Value *createWorkgroupBuffer() {
				auto bufferType =
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'bufferType' [readability-identifier-naming]
				MemRefType::get({kWarpSize}, valueType, ArrayRef<AffineMap>{}, 3);
				ftynse: Nit: let's have a named constant for the address space.
				csigg (author): Punted by just adding a local variable.
				ftynse: I'll add that later anyway :)
				csigg (author): Thanks!
				return funcOp.addWorkgroupAttribution(bufferType);
				}

				/// Returns an accumulator factory using either the op attribute or the body
				/// region.
				AccumulatorFactory getFactory() {
				if (!reduceOp.body().empty())
				return getFactory(reduceOp.body());
				if (reduceOp.op())
				return getFactory(*reduceOp.op());
				return AccumulatorFactory();
				}

				rriddle: `auto op = reduceOp.op()` to avoid recomputing below.
				/// Returns an accumulator factory that clones the body. The body's entry
				/// block is expected to have 2 arguments. The gpu.yield returns the
				/// accumulated value of the same type.
				AccumulatorFactory getFactory(Region &body) {
				return AccumulatorFactory([&](Value lhs, Value rhs) {
				Block *block = rewriter.getInsertionBlock();
				Block *split = rewriter.splitBlock(block, rewriter.getInsertionPoint());

				// Insert accumulator body between split block.
				BlockAndValueMapping mapping;
				mapping.map(body.front().getArgument(0), lhs);
				mapping.map(body.front().getArgument(1), rhs);
				rewriter.cloneRegionBefore(body, *split->getParent(),
				split->getIterator(), mapping);

				// Add branch before inserted body, into body.
				block = block->getNextNode();
				create<BranchOp>(loc, block, ValueRange());

				// Replace all gpu.yield ops with branch out of body.
				for (; block != split; block = block->getNextNode()) {
				Operation *terminator = block->getTerminator();
				if (!llvm::isa<gpu::YieldOp>(terminator))
				continue;
				rewriter.setInsertionPointToEnd(block);
				rewriter.replaceOpWithNewOp<BranchOp>(
				terminator, split, ValueRange(terminator->getOperand(0)));
				}

				// Return accumulator result.
				rewriter.setInsertionPointToStart(split);
				return split->addArgument(lhs->getType());
				});
				}

				/// Returns an accumulator factory that creates an op specified by opName.
				AccumulatorFactory getFactory(StringRef opName) {
				bool isFloatingPoint = valueType.isF32();
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'isFloatingPoint' [readability-identifier-naming]
				ftynse: Why only `F32`? Could you have `isa<FloatType>()` instead?
				if (opName == "add")
				return isFloatingPoint ? getFactory<AddFOp>() : getFactory<AddIOp>();
				if (opName == "mul")
				return isFloatingPoint ? getFactory<MulFOp>() : getFactory<MulIOp>();
				return AccumulatorFactory();
				}

				/// Returns an accumulator factory that creates an op of type T.
				template <typename T> AccumulatorFactory getFactory() {
				return [&](Value lhs, Value rhs) {
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'lhs' [readability-identifier-naming]; clang-tidy: warning: invalid case style for parameter 'rhs' [readability-identifier-naming]
				return create<T>(loc, lhs->getType(), lhs, rhs);
				};
				}

				/// Creates an if-block skeleton and calls the two factories to generate the
				/// ops in the `then` and `else` block.
				///
				/// llvm.cond_br %condition, ^then, ^continue
				/// ^then:
				/// %then_operands = `thenOpsFactory()`
				/// llvm.br ^continue(%then_operands)
				/// ^else:
				/// %else_operands = `elseOpsFactory()`
				/// llvm.br ^continue(%else_operands)
				/// ^continue(%block_operands):
				///
				template <typename ThenOpsFactory, typename ElseOpsFactory>
				void createIf(Value *condition, ThenOpsFactory &&thenOpsFactory,
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'condition' [readability-identifier-naming]; clang-tidy: warning: invalid case style for parameter 'thenOpsFactory' [readability-identifier-naming]
				ElseOpsFactory &&elseOpsFactory) {
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'elseOpsFactory' [readability-identifier-naming]
				Block *currentBlock = rewriter.getInsertionBlock();
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'currentBlock' [readability-identifier-naming]
				auto currentPoint = rewriter.getInsertionPoint();
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'currentPoint' [readability-identifier-naming]

				Block *thenBlock = rewriter.splitBlock(currentBlock, currentPoint);
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'thenBlock' [readability-identifier-naming]
				Block *elseBlock = rewriter.splitBlock(thenBlock, thenBlock->begin());
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'elseBlock' [readability-identifier-naming]
				Block *continueBlock = rewriter.splitBlock(elseBlock, elseBlock->begin());
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'continueBlock' [readability-identifier-naming]

				rewriter.setInsertionPointToEnd(currentBlock);
				create<CondBranchOp>(loc, condition, thenBlock,
				/*trueOperands=*/ArrayRef<Value *>(), elseBlock,
				/*falseOperands=*/ArrayRef<Value *>());

				auto addBranch = [&](ValueRange operands) {
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'addBranch' [readability-identifier-naming]; clang-tidy: warning: invalid case style for parameter 'operands' [readability-identifier-naming]
				ftynse: Is this lambda worse it?
				csigg (author): Haha, I like your comment. Very subtle ;-)
				create<BranchOp>(loc, continueBlock, operands);
				};

				rewriter.setInsertionPointToStart(thenBlock);
				auto thenOperands = thenOpsFactory();
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'thenOperands' [readability-identifier-naming]
				addBranch(thenOperands);

				rewriter.setInsertionPointToStart(elseBlock);
				auto elseOperands = elseOpsFactory();
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'elseOperands' [readability-identifier-naming]
				addBranch(elseOperands);

				assert(thenOperands.size() == elseOperands.size());
				rewriter.setInsertionPointToStart(continueBlock);
				for (auto *operand : thenOperands)
				continueBlock->addArgument(operand->getType());
				}

				/// Shortcut for createIf with empty else block and no block operands.
				template <typename Factory>
				void createPredicatedBlock(Value *condition, Factory &&predicatedOpsFactory) {
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'condition' [readability-identifier-naming]; clang-tidy: warning: invalid case style for parameter 'predicatedOpsFactory' [readability-identifier-naming]
				createIf(
				condition,
				[&] {
				predicatedOpsFactory();
				return ArrayRef<Value *>();
				},
				ftynse: I'd add an assertion that `predicatedOpsFactory()` does not return any values that it might expect to be passed to the continuation.
				[&] { return ArrayRef<Value *>(); });
				}

				/// Creates a reduction across the first activeWidth lanes of a warp.
				/// The first lane returns the result; the return values of all other lanes are undefined.
				Value createWarpReduce(Value activeWidth, Value laneId, Value operand,
				AccumulatorFactory &accumFactory) {
				Value *warpSize = create<ConstantIntOp>(loc, kWarpSize, 32);
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'warpSize' [readability-identifier-naming]
				Value *isPartialWarp =
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'isPartialWarp' [readability-identifier-naming]
				create<CmpIOp>(loc, CmpIPredicate::slt, activeWidth, warpSize);
				SmallVector<Type, 2> shuffleType = {valueType, rewriter.getI1Type()};
				auto xorAttr = rewriter.getStringAttr("xor");
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'xorAttr' [readability-identifier-naming]

				createIf(
				isPartialWarp,
				// Generate reduction over a (potentially) partial warp.
				[&] {
				Value *value = operand;
				// Repeatedly shuffle value from 'laneId ^ i' and accumulate if source
				// lane is within the active range. The accumulated value is available
				// in the first lane.
				for (int i = 1; i < kWarpSize; i <<= 1) {
				Value *offset = create<ConstantIntOp>(loc, i, 32);
				auto shuffleOp = create<gpu::ShuffleOp>(
				loc, shuffleType, value, offset, activeWidth, xorAttr);
				// Skip the accumulation if the shuffle op read from a lane outside
				// of the active range.
				createIf(
				shuffleOp.getResult(1),
				[&] {
				return llvm::SmallVector<Value *, 1>{
				accumFactory(value, shuffleOp.getResult(0))};
				rriddle: Remove `llvm::` from most of the things in this file. They are re-exported in the mlir namespace already.
				},
				[&] { return llvm::makeArrayRef(value); });
				value = rewriter.getInsertionBlock()->getArgument(0);
				}
				return llvm::SmallVector<Value *, 1>{value};
				},
				// Generate a reduction over the entire warp. This is a specialization
				// of the above reduction with unconditional accumulation.
				[&] {
				Value *value = operand;
				for (int i = 1; i < kWarpSize; i <<= 1) {
				Value *offset = create<ConstantIntOp>(loc, i, 32);
				auto shuffleOp = create<gpu::ShuffleOp>(loc, shuffleType, value,
				offset, warpSize, xorAttr);
				value = accumFactory(value, shuffleOp.getResult(0));
				}
				return llvm::SmallVector<Value *, 1>{value};
				});
				return rewriter.getInsertionBlock()->getArgument(0);
				}
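
The partial-warp branch of `createWarpReduce` above is a butterfly reduction: at each step a lane reads from `laneId ^ i` and accumulates only if that source lane is inside the active range. The following is a host-side simulation of that idea, not the GPU code itself. `simulateWarpReduce` is a hypothetical helper, and it assumes the active width is between 1 and 32, given by the size of the input vector.

```cpp
#include <functional>
#include <utility>
#include <vector>

constexpr int kWarpSize = 32;

// lanes[k] is the operand held by lane k; lanes.size() is the active width.
// Returns the value held by lane 0 after the last step. As in the lowering,
// only lane 0 is guaranteed to hold the full reduction.
float simulateWarpReduce(std::vector<float> lanes,
                         const std::function<float(float, float)> &accumulate) {
  const int activeWidth = static_cast<int>(lanes.size());
  for (int i = 1; i < kWarpSize; i <<= 1) {
    // All lanes shuffle simultaneously, so read from the previous step's values.
    std::vector<float> next = lanes;
    for (int lane = 0; lane < activeWidth; ++lane) {
      int source = lane ^ i;
      // Stand-in for the 'valid' result of gpu.shuffle: skip the accumulation
      // if the source lane is outside of the active range.
      if (source < activeWidth)
        next[lane] = accumulate(lanes[lane], lanes[source]);
    }
    lanes = std::move(next);
  }
  return lanes[0];
}
```

With three active lanes holding 1, 2, and 3 and an additive accumulator, lane 0 ends with 6, while lane 1 ends with 3 because its partner at offset 2 (lane 3) is out of range. This matches the doc comment that only the first lane's result is defined.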

				/// Returns value divided by the warp size (i.e. 32).
				Value getDivideByWarpSize(Value value) {
				Value *warpSize = create<ConstantIntOp>(loc, kWarpSize, 32);
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'warpSize' [readability-identifier-naming]
				return create<DivISOp>(loc, int32Type, value, warpSize);
				}

				gpu::GPUFuncOp funcOp;
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for member 'funcOp' [readability-identifier-naming]
				gpu::AllReduceOp reduceOp;
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for member 'reduceOp' [readability-identifier-naming]
				PatternRewriter &rewriter;
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for member 'rewriter' [readability-identifier-naming]

				Location loc;
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for member 'loc' [readability-identifier-naming]
				Type valueType;
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for member 'valueType' [readability-identifier-naming]
				Type indexType;
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for member 'indexType' [readability-identifier-naming]
				Type int32Type;
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for member 'int32Type' [readability-identifier-naming]

				static constexpr int kWarpSize = 32;
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'kWarpSize' [readability-identifier-naming]
				};

				struct GpuAllReduceConversion : public RewritePattern {
				explicit GpuAllReduceConversion(MLIRContext *context)
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'context' [readability-identifier-naming]
				: RewritePattern(gpu::GPUFuncOp::getOperationName(), 1, context) {}

				PatternMatchResult matchAndRewrite(Operation *op,
				PatternRewriter &rewriter) const override {
				auto funcOp = llvm::cast<gpu::GPUFuncOp>(op);
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'funcOp' [readability-identifier-naming]
				auto callback = [&](gpu::AllReduceOp reduceOp) {
				Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'callback' [readability-identifier-naming]; clang-tidy: warning: invalid case style for parameter 'reduceOp' [readability-identifier-naming]
				GpuAllReduceRewriter(funcOp, reduceOp, rewriter).rewrite();
				return WalkResult::interrupt();
				};
				while (funcOp.walk(callback).wasInterrupted()) {
				}
				rriddle: When does this ever fail?
				csigg (author): See two lines above: every time the callback is invoked. This makes sure that all occurrences of gpu.all_reduce in the same gpu.function are replaced.
				return matchSuccess();
				}
				ftynse: Could you please add a comment on why this approach is necessary? I understand that you need to operate on the GPUFuncOp level because you modify the GPUFuncOp itself, which would be incorrect from a nested operation (AllReduceOp). I don't understand why you need to interrupt the walk after each rewrite. Is it because of some state invalidation?
				csigg (author): Correct, the walk iterators get invalidated from the replace. Comment added.
				};
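
The interrupt-and-restart loop discussed in the thread above is a general way to cope with a traversal whose iterators are invalidated by each rewrite. Here is a minimal sketch of the same pattern on a plain `std::vector`, assuming nothing about MLIR's walk implementation; `Op`, `OpList`, `rewriteFirstAllReduce`, and `rewriteAll` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for the ops in a function body.
enum class Op { Other, AllReduce, Shuffle, Barrier, Load };
using OpList = std::vector<Op>;

// Rewrites the first AllReduce it finds and returns true, which plays the
// role of WalkResult::interrupt(). Returns false if nothing was rewritten.
bool rewriteFirstAllReduce(OpList &ops) {
  for (std::size_t i = 0; i < ops.size(); ++i) {
    if (ops[i] != Op::AllReduce)
      continue;
    // The insertion below may reallocate the vector, so the traversal state
    // is stale afterwards: stop here and let the caller restart the scan.
    ops[i] = Op::Load;
    ops.insert(ops.begin() + static_cast<std::ptrdiff_t>(i),
               {Op::Shuffle, Op::Barrier});
    return true;
  }
  return false;
}

// Restarts the scan until no AllReduce is left; returns the rewrite count.
int rewriteAll(OpList &ops) {
  int count = 0;
  while (rewriteFirstAllReduce(ops))
    ++count;
  return count;
}
```

The cost is one extra scan per rewritten op. That is usually acceptable when the number of matching ops per function is small.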
				} // namespace

				void mlir::populateGpuRewritePatterns(MLIRContext *context,
				OwningRewritePatternList &patterns) {
				patterns.insert<GpuAllReduceConversion>(context);
				}

mlir/lib/IR/Block.cpp

//===- Block.cpp - MLIR Block Class ---------------------------------------===//
//
// Part of the MLIR Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#include "mlir/IR/Block.h"
		Lint: Pre-merge checks: clang-tidy: error: 'mlir/IR/Block.h' file not found [clang-diagnostic-error]
#include "mlir/IR/Builders.h"
#include "mlir/IR/Operation.h"
using namespace mlir;

//===----------------------------------------------------------------------===//
// BlockArgument
//===----------------------------------------------------------------------===//

auto Block::addArguments(ArrayRef<Type> types)
arguments.reserve(arguments.size() + types.size());
auto initialSize = arguments.size();
for (auto type : types) {
addArgument(type);
}
return {arguments.data() + initialSize, arguments.data() + arguments.size()};
}

		BlockArgument *Block::insertArgument(unsigned index, Type type) {
		auto *arg = new BlockArgument(type, this);
		assert(index <= arguments.size());
		arguments.insert(arguments.begin() + index, arg);
		return arg;
		}

void Block::eraseArgument(unsigned index, bool updatePredTerms) {
assert(index < arguments.size());

// Delete the argument.
arguments[index].destroy();
arguments.erase(arguments.begin() + index);

// If we aren't updating predecessors, there is nothing left to do.
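
The `Block::insertArgument` added in this file inserts a new argument at a given position, and `index == size` is allowed so that it can also append. The sketch below shows the same positional-insert idea on a toy container. `ToyArgument` and `ToyBlock` are hypothetical types, and using `std::unique_ptr` for ownership is an assumption of this sketch, not how the patch manages `BlockArgument` lifetime.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy argument that only records a type name.
struct ToyArgument {
  std::string type;
};

class ToyBlock {
public:
  // Inserts a new argument before position 'index'; index == size appends.
  ToyArgument *insertArgument(unsigned index, std::string type) {
    assert(index <= arguments.size() && "index out of range");
    auto arg = std::make_unique<ToyArgument>(ToyArgument{std::move(type)});
    ToyArgument *raw = arg.get();
    arguments.insert(arguments.begin() + index, std::move(arg));
    return raw;
  }

  unsigned getNumArguments() const {
    return static_cast<unsigned>(arguments.size());
  }
  const ToyArgument &getArgument(unsigned i) const { return *arguments[i]; }

private:
  std::vector<std::unique_ptr<ToyArgument>> arguments;
};
```

Because the elements are heap-allocated, the pointer returned by `insertArgument` stays valid when later insertions shift or reallocate the vector.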

mlir/test/Dialect/GPU/all-reduce.mlir

This file was added.

				// RUN: mlir-opt -test-all-reduce-lowering %s | FileCheck %s

				module @kernels attributes {gpu.kernel_module} {

				// CHECK: gpu.func @kernel(%[[varg0:[a-z_0-9]+]]: f32) workgroup(%[[varg1:[a-z_0-9]+]] : memref<32xf32, 3>) kernel {
				gpu.func @kernel(%arg0 : f32) attributes { gpu.kernel } {
				// CHECK: %[[vc0_i32:[a-z_0-9]+]] = constant 0 : i32
				ftynse: Is there a reason why you can't use `.*` as a regexp for names? Seeing the full regexp with ranges makes it hard to read the test. I think we also support capital letters in the names.
				csigg (author): Replaced with generate-test-checks.py result.
				// CHECK: %[[vc31_i32:[a-z_0-9]+]] = constant 31 : i32
				// CHECK: %[[vc0:[a-z_0-9]+]] = constant 0 : index
				// CHECK: %[[vc32_i32:[a-z_0-9]+]] = constant 32 : i32
				// CHECK: %[[vc1_i32:[a-z_0-9]+]] = constant 1 : i32
				// CHECK: %[[vc2_i32:[a-z_0-9]+]] = constant 2 : i32
				// CHECK: %[[vc4_i32:[a-z_0-9]+]] = constant 4 : i32
				// CHECK: %[[vc8_i32:[a-z_0-9]+]] = constant 8 : i32
				// CHECK: %[[vc16_i32:[a-z_0-9]+]] = constant 16 : i32
				// CHECK: %[[v0:[a-z_0-9]+]] = "gpu.block_dim"() {dimension = "x"} : () -> index
				// CHECK: %[[v1:[a-z_0-9]+]] = index_cast %[[v0]] : index to i32
				// CHECK: %[[v2:[a-z_0-9]+]] = "gpu.block_dim"() {dimension = "y"} : () -> index
				// CHECK: %[[v3:[a-z_0-9]+]] = index_cast %[[v2]] : index to i32
				// CHECK: %[[v4:[a-z_0-9]+]] = "gpu.block_dim"() {dimension = "z"} : () -> index
				// CHECK: %[[v5:[a-z_0-9]+]] = index_cast %[[v4]] : index to i32
				// CHECK: %[[v6:[a-z_0-9]+]] = "gpu.thread_id"() {dimension = "x"} : () -> index
				// CHECK: %[[v7:[a-z_0-9]+]] = index_cast %[[v6]] : index to i32
				// CHECK: %[[v8:[a-z_0-9]+]] = "gpu.thread_id"() {dimension = "y"} : () -> index
				// CHECK: %[[v9:[a-z_0-9]+]] = index_cast %[[v8]] : index to i32
				// CHECK: %[[v10:[a-z_0-9]+]] = "gpu.thread_id"() {dimension = "z"} : () -> index
				// CHECK: %[[v11:[a-z_0-9]+]] = index_cast %[[v10]] : index to i32
				// CHECK: %[[v12:[a-z_0-9]+]] = muli %[[v11]], %[[v3]] : i32
				// CHECK: %[[v13:[a-z_0-9]+]] = addi %[[v12]], %[[v9]] : i32
				// CHECK: %[[v14:[a-z_0-9]+]] = muli %[[v13]], %[[v1]] : i32
				// CHECK: %[[v15:[a-z_0-9]+]] = muli %[[v1]], %[[v3]] : i32
				// CHECK: %[[v16:[a-z_0-9]+]] = addi %[[v14]], %[[v7]] : i32
				// CHECK: %[[v17:[a-z_0-9]+]] = muli %[[v15]], %[[v5]] : i32
				// CHECK: %[[v18:[a-z_0-9]+]] = and %[[v16]], %[[vc31_i32]] : i32
				// CHECK: %[[v19:[a-z_0-9]+]] = cmpi "eq", %[[v18]], %[[vc0_i32]] : i32
				// CHECK: %[[v20:[a-z_0-9]+]] = subi %[[v16]], %[[v18]] : i32
				// CHECK: %[[v21:[a-z_0-9]+]] = subi %[[v17]], %[[v20]] : i32
				// CHECK: %[[v22:[a-z_0-9]+]] = cmpi "slt", %[[v21]], %[[vc32_i32]] : i32
				// CHECK: cond_br %[[v22]], ^bb[[#b1:]], ^bb[[#b17:]]
				// CHECK: ^bb[[#b1]]:
				// CHECK: %[[vresult:[a-z_0-9]+]], %[[vvalid:[a-z_0-9]+]] = gpu.shuffle %[[varg0]], %[[vc1_i32]], %[[v21]] xor : f32
				// CHECK: cond_br %[[vvalid]], ^bb[[#b2:]], ^bb[[#b3:]]
				// CHECK: ^bb[[#b2]]:
				// CHECK: %[[v23:[a-z_0-9]+]] = addf %[[varg0]], %[[vresult]] : f32
				// CHECK: br ^bb[[#b4:]](%[[v23]] : f32)
				// CHECK: ^bb[[#b3]]:
				// CHECK: br ^bb[[#b4]](%[[varg0]] : f32)
				// CHECK: ^bb[[#b4]](%[[v24:[a-z_0-9]+]]: f32):
				// CHECK: %[[vresult_0:[a-z_0-9]+]], %[[vvalid_1:[a-z_0-9]+]] = gpu.shuffle %[[v24]], %[[vc2_i32]], %[[v21]] xor : f32
				// CHECK: cond_br %[[vvalid_1]], ^bb[[#b5:]], ^bb[[#b6:]]
				// CHECK: ^bb[[#b5]]:
				// CHECK: %[[v25:[a-z_0-9]+]] = addf %[[v24]], %[[vresult_0]] : f32
				// CHECK: br ^bb[[#b7:]](%[[v25]] : f32)
				// CHECK: ^bb[[#b6]]:
				// CHECK: br ^bb[[#b7]](%[[v24]] : f32)
				// CHECK: ^bb[[#b7]](%[[v26:[a-z_0-9]+]]: f32):
				// CHECK: %[[vresult_2:[a-z_0-9]+]], %[[vvalid_3:[a-z_0-9]+]] = gpu.shuffle %[[v26]], %[[vc4_i32]], %[[v21]] xor : f32
				// CHECK: cond_br %[[vvalid_3]], ^bb[[#b8:]], ^bb[[#b9:]]
				// CHECK: ^bb[[#b8]]:
				// CHECK: %[[v27:[a-z_0-9]+]] = addf %[[v26]], %[[vresult_2]] : f32
				// CHECK: br ^bb[[#b10:]](%[[v27]] : f32)
				// CHECK: ^bb[[#b9]]:
				// CHECK: br ^bb[[#b10]](%[[v26]] : f32)
				// CHECK: ^bb[[#b10]](%[[v28:[a-z_0-9]+]]: f32):
				// CHECK: %[[vresult_4:[a-z_0-9]+]], %[[vvalid_5:[a-z_0-9]+]] = gpu.shuffle %[[v28]], %[[vc8_i32]], %[[v21]] xor : f32
				// CHECK: cond_br %[[vvalid_5]], ^bb[[#b11:]], ^bb[[#b12:]]
				// CHECK: ^bb[[#b11]]:
				// CHECK: %[[v29:[a-z_0-9]+]] = addf %[[v28]], %[[vresult_4]] : f32
				// CHECK: br ^bb[[#b13:]](%[[v29]] : f32)
				// CHECK: ^bb[[#b12]]:
				// CHECK: br ^bb[[#b13]](%[[v28]] : f32)
				// CHECK: ^bb[[#b13]](%[[v30:[a-z_0-9]+]]: f32):
				// CHECK: %[[vresult_6:[a-z_0-9]+]], %[[vvalid_7:[a-z_0-9]+]] = gpu.shuffle %[[v30]], %[[vc16_i32]], %[[v21]] xor : f32
				// CHECK: cond_br %[[vvalid_7]], ^bb[[#b14:]], ^bb[[#b15:]]
				// CHECK: ^bb[[#b14]]:
				// CHECK: %[[v31:[a-z_0-9]+]] = addf %[[v30]], %[[vresult_6]] : f32
				// CHECK: br ^bb[[#b16:]](%[[v31]] : f32)
				// CHECK: ^bb[[#b15]]:
				// CHECK: br ^bb[[#b16]](%[[v30]] : f32)
				// CHECK: ^bb[[#b16]](%[[v32:[a-z_0-9]+]]: f32):
				// CHECK: br ^bb[[#b18:]](%[[v32]] : f32)
				// CHECK: ^bb[[#b17]]:
				// CHECK: %[[vresult_8:[a-z_0-9]+]], %[[vvalid_9:[a-z_0-9]+]] = gpu.shuffle %[[varg0]], %[[vc1_i32]], %[[vc32_i32]] xor : f32
				// CHECK: %[[v33:[a-z_0-9]+]] = addf %[[varg0]], %[[vresult_8]] : f32
				// CHECK: %[[vresult_10:[a-z_0-9]+]], %[[vvalid_11:[a-z_0-9]+]] = gpu.shuffle %[[v33]], %[[vc2_i32]], %[[vc32_i32]] xor : f32
				// CHECK: %[[v34:[a-z_0-9]+]] = addf %[[v33]], %[[vresult_10]] : f32
				// CHECK: %[[vresult_12:[a-z_0-9]+]], %[[vvalid_13:[a-z_0-9]+]] = gpu.shuffle %[[v34]], %[[vc4_i32]], %[[vc32_i32]] xor : f32
				// CHECK: %[[v35:[a-z_0-9]+]] = addf %[[v34]], %[[vresult_12]] : f32
				// CHECK: %[[vresult_14:[a-z_0-9]+]], %[[vvalid_15:[a-z_0-9]+]] = gpu.shuffle %[[v35]], %[[vc8_i32]], %[[vc32_i32]] xor : f32
				// CHECK: %[[v36:[a-z_0-9]+]] = addf %[[v35]], %[[vresult_14]] : f32
				// CHECK: %[[vresult_16:[a-z_0-9]+]], %[[vvalid_17:[a-z_0-9]+]] = gpu.shuffle %[[v36]], %[[vc16_i32]], %[[vc32_i32]] xor : f32
				// CHECK: %[[v37:[a-z_0-9]+]] = addf %[[v36]], %[[vresult_16]] : f32
				// CHECK: br ^bb[[#b18]](%[[v37]] : f32)
				// CHECK: ^bb[[#b18]](%[[v38:[a-z_0-9]+]]: f32):
				// CHECK: cond_br %[[v19]], ^bb[[#b19:]], ^bb[[#b20:]]
				// CHECK: ^bb[[#b19]]:
				// CHECK: %[[v39:[a-z_0-9]+]] = divis %[[v16]], %[[vc32_i32]] : i32
				// CHECK: %[[v40:[a-z_0-9]+]] = index_cast %[[v39]] : i32 to index
				// CHECK: store %[[v38]], %[[varg1]][%[[v40]]] : memref<32xf32, 3>
				// CHECK: br ^bb[[#b21:]]
				// CHECK: ^bb[[#b20]]:
				// CHECK: br ^bb[[#b21]]
				// CHECK: ^bb[[#b21]]:
				// CHECK: gpu.barrier
				// CHECK: %[[v41:[a-z_0-9]+]] = addi %[[v17]], %[[vc31_i32]] : i32
				// CHECK: %[[v42:[a-z_0-9]+]] = divis %[[v41]], %[[vc32_i32]] : i32
				// CHECK: %[[v43:[a-z_0-9]+]] = cmpi "slt", %[[v16]], %[[v42]] : i32
				// CHECK: cond_br %[[v43]], ^bb[[#b22:]], ^bb[[#b41:]]
				// CHECK: ^bb[[#b22]]:
				// CHECK: %[[v44:[a-z_0-9]+]] = index_cast %[[v16]] : i32 to index
				// CHECK: %[[v45:[a-z_0-9]+]] = load %[[varg1]][%[[v44]]] : memref<32xf32, 3>
				// CHECK: %[[v46:[a-z_0-9]+]] = cmpi "slt", %[[v42]], %[[vc32_i32]] : i32
				// CHECK: cond_br %[[v46]], ^bb[[#b23:]], ^bb[[#b39:]]
				// CHECK: ^bb[[#b23]]:
				// CHECK: %[[vresult_18:[a-z_0-9]+]], %[[vvalid_19:[a-z_0-9]+]] = gpu.shuffle %[[v45]], %[[vc1_i32]], %[[v42]] xor : f32
				// CHECK: cond_br %[[vvalid_19]], ^bb[[#b24:]], ^bb[[#b25:]]
				// CHECK: ^bb[[#b24]]:
				// CHECK: %[[v47:[a-z_0-9]+]] = addf %[[v45]], %[[vresult_18]] : f32
				// CHECK: br ^bb[[#b26:]](%[[v47]] : f32)
				// CHECK: ^bb[[#b25]]:
				// CHECK: br ^bb[[#b26]](%[[v45]] : f32)
				// CHECK: ^bb[[#b26]](%[[v48:[a-z_0-9]+]]: f32):
				// CHECK: %[[vresult_20:[a-z_0-9]+]], %[[vvalid_21:[a-z_0-9]+]] = gpu.shuffle %[[v48]], %[[vc2_i32]], %[[v42]] xor : f32
				// CHECK: cond_br %[[vvalid_21]], ^bb[[#b27:]], ^bb[[#b28:]]
				// CHECK: ^bb[[#b27]]:
				// CHECK: %[[v49:[a-z_0-9]+]] = addf %[[v48]], %[[vresult_20]] : f32
				// CHECK: br ^bb[[#b29:]](%[[v49]] : f32)
				// CHECK: ^bb[[#b28]]:
				// CHECK: br ^bb[[#b29]](%[[v48]] : f32)
				// CHECK: ^bb[[#b29]](%[[v50:[a-z_0-9]+]]: f32):
				// CHECK: %[[vresult_22:[a-z_0-9]+]], %[[vvalid_23:[a-z_0-9]+]] = gpu.shuffle %[[v50]], %[[vc4_i32]], %[[v42]] xor : f32
				// CHECK: cond_br %[[vvalid_23]], ^bb[[#b30:]], ^bb[[#b31:]]
				// CHECK: ^bb[[#b30]]:
				// CHECK: %[[v51:[a-z_0-9]+]] = addf %[[v50]], %[[vresult_22]] : f32
				// CHECK: br ^bb[[#b32:]](%[[v51]] : f32)
				// CHECK: ^bb[[#b31]]:
				// CHECK: br ^bb[[#b32]](%[[v50]] : f32)
				// CHECK: ^bb[[#b32]](%[[v52:[a-z_0-9]+]]: f32):
				// CHECK: %[[vresult_24:[a-z_0-9]+]], %[[vvalid_25:[a-z_0-9]+]] = gpu.shuffle %[[v52]], %[[vc8_i32]], %[[v42]] xor : f32
				// CHECK: cond_br %[[vvalid_25]], ^bb[[#b33:]], ^bb[[#b34:]]
				// CHECK: ^bb[[#b33]]:
				// CHECK: %[[v53:[a-z_0-9]+]] = addf %[[v52]], %[[vresult_24]] : f32
				// CHECK: br ^bb[[#b35:]](%[[v53]] : f32)
				// CHECK: ^bb[[#b34]]:
				// CHECK: br ^bb[[#b35]](%[[v52]] : f32)
				// CHECK: ^bb[[#b35]](%[[v54:[a-z_0-9]+]]: f32):
				// CHECK: %[[vresult_26:[a-z_0-9]+]], %[[vvalid_27:[a-z_0-9]+]] = gpu.shuffle %[[v54]], %[[vc16_i32]], %[[v42]] xor : f32
				// CHECK: cond_br %[[vvalid_27]], ^bb[[#b36:]], ^bb[[#b37:]]
				// CHECK: ^bb[[#b36]]:
				// CHECK: %[[v55:[a-z_0-9]+]] = addf %[[v54]], %[[vresult_26]] : f32
				// CHECK: br ^bb[[#b38:]](%[[v55]] : f32)
				// CHECK: ^bb[[#b37]]:
				// CHECK: br ^bb[[#b38]](%[[v54]] : f32)
				// CHECK: ^bb[[#b38]](%[[v56:[a-z_0-9]+]]: f32):
				// CHECK: br ^bb[[#b40:]](%[[v56]] : f32)
				// CHECK: ^bb[[#b39]]:
				// CHECK: %[[vresult_28:[a-z_0-9]+]], %[[vvalid_29:[a-z_0-9]+]] = gpu.shuffle %[[v45]], %[[vc1_i32]], %[[vc32_i32]] xor : f32
				// CHECK: %[[v57:[a-z_0-9]+]] = addf %[[v45]], %[[vresult_28]] : f32
				// CHECK: %[[vresult_30:[a-z_0-9]+]], %[[vvalid_31:[a-z_0-9]+]] = gpu.shuffle %[[v57]], %[[vc2_i32]], %[[vc32_i32]] xor : f32
				// CHECK: %[[v58:[a-z_0-9]+]] = addf %[[v57]], %[[vresult_30]] : f32
				// CHECK: %[[vresult_32:[a-z_0-9]+]], %[[vvalid_33:[a-z_0-9]+]] = gpu.shuffle %[[v58]], %[[vc4_i32]], %[[vc32_i32]] xor : f32
				// CHECK: %[[v59:[a-z_0-9]+]] = addf %[[v58]], %[[vresult_32]] : f32
				// CHECK: %[[vresult_34:[a-z_0-9]+]], %[[vvalid_35:[a-z_0-9]+]] = gpu.shuffle %[[v59]], %[[vc8_i32]], %[[vc32_i32]] xor : f32
				// CHECK: %[[v60:[a-z_0-9]+]] = addf %[[v59]], %[[vresult_34]] : f32
				// CHECK: %[[vresult_36:[a-z_0-9]+]], %[[vvalid_37:[a-z_0-9]+]] = gpu.shuffle %[[v60]], %[[vc16_i32]], %[[vc32_i32]] xor : f32
				// CHECK: %[[v61:[a-z_0-9]+]] = addf %[[v60]], %[[vresult_36]] : f32
				// CHECK: br ^bb[[#b40]](%[[v61]] : f32)
				// CHECK: ^bb[[#b40]](%[[v62:[a-z_0-9]+]]: f32):
				// CHECK: store %[[v62]], %[[varg1]][%[[vc0]]] : memref<32xf32, 3>
				// CHECK: br ^bb[[#b42:]]
				// CHECK: ^bb[[#b41]]:
				// CHECK: br ^bb[[#b42]]
				// CHECK: ^bb[[#b42]]:
				// CHECK: gpu.barrier
				// CHECK: %[[v63:[a-z_0-9]+]] = load %[[varg1]][%[[vc0]]] : memref<32xf32, 3>
				%sum = "gpu.all_reduce"(%arg0) ({}) {op = "add"} : (f32) -> (f32)
				csigg (author): These CHECKs were generated from the output with:

				sed -r \
				-e 's|\t+//.*||' \
				-e 's|%([a-z_0-9]+) = |%[[v\1:[a-z_0-9]+]] = |' \
				-e 's|\(%([a-z_0-9]+): ([a-z_0-9]+)\):|(%[[v\1:[a-z_0-9]+]]: \2):|' \
				-e 's|%([a-z_0-9]+)|%[[v\1]]|g' \
				-e 's|bb([0-9]+)|bb[[#b\1]]|g' \
				-e 's|^ |    // CHECK:|'

				and manual edits from there.

				ftynse: We have https://github.com/llvm/llvm-project/blob/master/mlir/utils/generate-test-checks.py, have you tried it?

				csigg (author): Works very well, I wish I knew about this (well, Mehdi mentioned it but I couldn't find it).
				gpu.return
				}

				}
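For readers who want to follow what the CHECK lines above verify, here is a minimal host-side sketch in C++ of the per-thread index arithmetic and of one warp's xor-shuffle butterfly. It is an illustration, not the code in AllReduceLowering.cpp: the names `LaneInfo`, `computeLaneInfo` and `butterflyAdd` are hypothetical, and it assumes that gpu.shuffle reports `valid` exactly when the source lane `lane ^ offset` is smaller than the width operand.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int32_t kWarpSize = 32;

// Per-thread quantities mirroring the integer arithmetic in the CHECK lines
// (hypothetical helper, not part of the patch).
struct LaneInfo {
  int32_t laneId;      // tid & 31
  int32_t activeLanes; // numThreads - (tid - laneId); may exceed 32
  bool isFirstLane;    // laneId == 0
  bool isPartialWarp;  // activeLanes < 32
};

LaneInfo computeLaneInfo(int32_t tid, int32_t numThreads) {
  assert(tid >= 0 && tid < numThreads);
  int32_t laneId = tid & (kWarpSize - 1);
  int32_t warpStart = tid - laneId;
  int32_t activeLanes = numThreads - warpStart;
  return {laneId, activeLanes, laneId == 0, activeLanes < kWarpSize};
}

// Simulates the xor butterfly over one warp. values[i] is the operand of
// lane i; values.size() plays the role of the shuffle width.
// Assumption: a shuffle is valid iff the source lane is below the width;
// an invalid shuffle leaves the lane's accumulator unchanged.
std::vector<float> butterflyAdd(std::vector<float> values) {
  const int32_t width = static_cast<int32_t>(values.size());
  assert(width >= 1 && width <= kWarpSize);
  for (int32_t offset = 1; offset < kWarpSize; offset <<= 1) {
    // All lanes exchange simultaneously, so read from the previous step.
    std::vector<float> next = values;
    for (int32_t lane = 0; lane < width; ++lane) {
      int32_t source = lane ^ offset;
      bool valid = source < width;
      if (valid)
        next[lane] = values[lane] + values[source];
    }
    values = next;
  }
  return values;
}
```

In this model a full warp ends with the total in every lane, while in a partial warp only lane 0 is guaranteed to hold the sum over all active lanes. That is consistent with the test output above, where only the first lane of each warp stores its result to the workgroup buffer.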

mlir/test/lib/Transforms/CMakeLists.txt

  add_llvm_library(MLIRTestTransforms
+   TestAllReduceLowering.cpp
    TestCallGraph.cpp
    TestConstantFold.cpp
    TestLoopFusion.cpp
    TestInlining.cpp
    TestLinalgTransforms.cpp
    TestLiveness.cpp
    TestLoopMapping.cpp
    TestLoopParametricTiling.cpp
    (12 unchanged lines not shown)
  include_directories(${CMAKE_CURRENT_BINARY_DIR}/../DeclarativeTransforms)
  add_dependencies(MLIRTestTransforms MLIRStandardOpsIncGen)
  add_dependencies(MLIRTestTransforms MLIRTestLinalgTransformPatternsIncGen)
  add_dependencies(MLIRTestTransforms MLIRTestVectorTransformPatternsIncGen)
  target_link_libraries(MLIRTestTransforms
    MLIRAffineOps
    MLIRAnalysis
    MLIRLoopOps
+   MLIRGPU
    MLIRPass
    MLIRTestDialect
    MLIRVectorOps
  )

mlir/test/lib/Transforms/TestAllReduceLowering.cpp

This file was added.

				//===- TestAllReduceLowering.cpp - Test gpu.all_reduce lowering -----------===//
				//
				// Part of the MLIR Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This file contains test passes for lowering the gpu.all_reduce op.
				//
				//===----------------------------------------------------------------------===//

				#include "mlir/Dialect/GPU/Passes.h"
				[Lint: pre-merge checks] clang-tidy: error: 'mlir/Dialect/GPU/Passes.h' file not found [clang-diagnostic-error]
				#include "mlir/IR/PatternMatch.h"
				#include "mlir/Pass/Pass.h"

				using namespace mlir;

				namespace {
				struct TestAllReduceLoweringPass
				    : public ModulePass<TestAllReduceLoweringPass> {
				  void runOnModule() override {
				    OwningRewritePatternList patterns;
				[Lint: pre-merge checks] clang-tidy: warning: invalid case style for variable 'patterns' [readability-identifier-naming]
				    populateGpuRewritePatterns(&getContext(), patterns);
				    applyPatternsGreedily(getModule(), patterns);
				  }
				};
				} // namespace

				static PassRegistration<TestAllReduceLoweringPass>
				    pass("test-all-reduce-lowering",
				[Lint: pre-merge checks] clang-tidy: warning: invalid case style for variable 'pass' [readability-identifier-naming]
				         "Lowers gpu.all-reduce ops within the GPU dialect.");
