This is an archive of the discontinued LLVM Phabricator instance.

I have some tests ready as well, but I wanted to check which kind of test is preferred: a test run with mlir-cuda-runner that checks the result, or an mlir-opt lowering with checks on the lowered IR?

In D75766#1910251, @clementval wrote:

I have some tests ready as well, but I wanted to check which kind of test is preferred: a test run with mlir-cuda-runner that checks the result, or an mlir-opt lowering with checks on the lowered IR?

We can't assume that someone has a GPU to be able to run the compiler unit tests: so lit+FileCheck please.

Thanks Valentin, the change looks good, with one caveat: the all_reduce lowering in LowerGpuOpsToNVVMOps.cpp is on the chopping block.
Would you mind replicating your changes in mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp as well?
I hope I will get to the clean up next week, it's overdue but I've been busy with other stuff.

@csigg Sure, I was thinking about it. I'll do it and update the patch.
@mehdi_amini Thanks for the confirmation. I'll add some tests to this patch then.

clementval edited the summary of this revision. Mar 7 2020, 11:12 AM

Nice, thanks for adding this!

mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp
127	Instead of asserting here, could you extend the verifier to report errors instead? If this combination is illegal, the verifier should reject it.
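
The difference between asserting in the lowering and rejecting in the verifier can be sketched outside MLIR. The snippet below is only an illustration under stated assumptions: `ElementKind`, `ReduceOp`, `verifyReduceOp`, and `exampleVerifier` are hypothetical stand-ins, not MLIR classes, and an empty string plays the role of a successful verification result.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the op's element type and its `op` attribute.
enum class ElementKind { Integer, Float };

struct ReduceOp {
  std::string opName; // e.g. "add", "and", "max"
  ElementKind kind;
};

// Verifier-style check: returns a diagnostic for an illegal combination and
// an empty string when the op is well formed. Nothing aborts the process.
std::string verifyReduceOp(const ReduceOp &op) {
  bool isBitwise =
      op.opName == "and" || op.opName == "or" || op.opName == "xor";
  if (isBitwise && op.kind != ElementKind::Integer)
    return "`" + op.opName +
           "` accumulator is only compatible with Integer type";
  return std::string();
}

// Usage: a legal and an illegal combination.
void exampleVerifier() {
  assert(verifyReduceOp({"and", ElementKind::Integer}).empty());
  assert(!verifyReduceOp({"and", ElementKind::Float}).empty());
}
```

With a check of this shape, malformed input produces a diagnostic the user can act on, and the lowering no longer needs to guard against the illegal combination itself.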

In D75766#1910973, @mehdi_amini wrote:

In D75766#1910251, @clementval wrote:

I have some tests ready as well, but I wanted to check which kind of test is preferred: a test run with mlir-cuda-runner that checks the result, or an mlir-opt lowering with checks on the lowered IR?

We can't assume that someone has a GPU to be able to run the compiler unit tests: so lit+FileCheck please.

It is still nice to have a test with the mlir-cuda-runner to check that it computes what we expect it to compute if you have already written them. But as Mehdi said, testing the generated IR is the mandatory one.

jdoerfert resigned from this revision. Mar 9 2020, 8:35 AM

Add some mlir-cuda-runner tests
Add some mlir-opt + FileCheck tests
Address other comments

Herald added a subscriber: aartbik. Mar 9 2020, 1:02 PM

@csigg @herhut I updated the patch with your suggestions.

bondhugula requested changes to this revision. Mar 10 2020, 5:27 AM

bondhugula added a subscriber: bondhugula.

bondhugula added inline comments.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
156	Nit: "`" -> '`'
mlir/lib/ExecutionEngine/RunnerUtils.cpp
46	Nit: M->rank is int64_t.
46	Unrelated to this patch: an UnrankedMemRefType having a ->rank field is weird. Would have been better to name it UnknownRankMemRefType FWIW.
77	List initialization? `UnrankedMemRefType<int32_t> descriptor = {rank, ptr};`
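
The list-initialization suggestion can be illustrated with a small stand-alone sketch. `UnrankedDescriptor` below is a hypothetical struct that only assumes the two fields named in the review (`rank` and `descriptor`); the real `UnrankedMemRefType` may differ, and the helper names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical descriptor with the two fields the review refers to.
template <typename T>
struct UnrankedDescriptor {
  int64_t rank;
  void *descriptor;
};

// Field-by-field assignment after default construction.
template <typename T>
UnrankedDescriptor<T> makeByAssignment(int64_t rank, void *ptr) {
  UnrankedDescriptor<T> d;
  d.rank = rank;
  d.descriptor = ptr;
  return d;
}

// List (aggregate) initialization: both fields are set in one statement.
template <typename T>
UnrankedDescriptor<T> makeByListInit(int64_t rank, void *ptr) {
  UnrankedDescriptor<T> d = {rank, ptr};
  return d;
}

// Usage: both helpers produce the same descriptor.
void exampleListInit() {
  int32_t data[4] = {1, 2, 3, 4};
  UnrankedDescriptor<int32_t> a = makeByAssignment<int32_t>(1, data);
  UnrankedDescriptor<int32_t> b = makeByListInit<int32_t>(1, data);
  assert(a.rank == b.rank && a.descriptor == b.descriptor);
}
```

List initialization leaves no window in which a field is uninitialized, and it rejects narrowing conversions at compile time, which fits the related nit about keeping `rank` as `int64_t`.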

This revision now requires changes to proceed. Mar 10 2020, 5:27 AM

Address @bondhugula comments

clementval marked 3 inline comments as done. Mar 10 2020, 6:02 AM

Thanks for addressing the comments. Looks good to land.

@herhut Thanks for the review. Do you mind pushing it? I do not have access rights.

This revision was not accepted when it landed; it landed in state Needs Review. Mar 10 2020, 1:41 PM

Closed by commit rG2eff566b07da: [MLIR] Add `and`, `or`, `xor`, `min`, `max` too gpu.all_reduce and the nvvm… (authored by herhut).

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. Mar 10 2020, 1:41 PM

Herald added a subscriber: llvm-commits.

@herhut Thanks for pushing the patch. I don't see myself in the attribution. Is this normal? I don't really mind, but I wanted to double-check.

In D75766#1916663, @clementval wrote:

@herhut Thanks for pushing the patch. I don't see myself in the attribution. Is this normal? I don't really mind, but I wanted to double-check.

Thanks for flagging this. No, this is not normal. It is me not checking that arc patch does the right thing. You should have remained as the author. I can revert and reland with corrected attribution.

In D75766#1916763, @herhut wrote:

Thanks for flagging this. No, this is not normal. It is me not checking that arc patch does the right thing. You should have remained as the author. I can revert and reland with corrected attribution.

Thanks a lot! I really appreciate it!

Revision Contents

Path	Size
mlir/include/mlir/Dialect/GPU/GPUOps.td	16 lines
mlir/include/mlir/ExecutionEngine/RunnerUtils.h	2 lines
mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp	39 lines
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp	8 lines
mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp	29 lines
mlir/lib/ExecutionEngine/RunnerUtils.cpp	29 lines
mlir/test/Dialect/GPU/all-reduce-max.mlir	203 lines
mlir/test/Dialect/GPU/invalid.mlir	8 lines
mlir/test/mlir-cuda-runner/ (five files, names not shown)	60, 58, 58, 58, 58 lines

Diff 249347

mlir/include/mlir/Dialect/GPU/GPUOps.td

(476 lines not shown)	let description = [{

Example:		Example:

```gpu.yield %f0, %f1 : f32, f32		```gpu.yield %f0, %f1 : f32, f32
```		```
}];		}];
}		}

// These mirror the XLA ComparisonDirection enum.		// add, mul mirror the XLA ComparisonDirection enum.
def GPU_AllReduceOpAdd : StrEnumAttrCase<"add">;		def GPU_AllReduceOpAdd : StrEnumAttrCase<"add">;
		def GPU_AllReduceOpAnd : StrEnumAttrCase<"and">;
		def GPU_AllReduceOpMax : StrEnumAttrCase<"max">;
		def GPU_AllReduceOpMin : StrEnumAttrCase<"min">;
def GPU_AllReduceOpMul : StrEnumAttrCase<"mul">;		def GPU_AllReduceOpMul : StrEnumAttrCase<"mul">;
		def GPU_AllReduceOpOr : StrEnumAttrCase<"or">;
		def GPU_AllReduceOpXor : StrEnumAttrCase<"xor">;

def GPU_AllReduceOperationAttr : StrEnumAttr<"AllReduceOperationAttr",		def GPU_AllReduceOperationAttr : StrEnumAttr<"AllReduceOperationAttr",
"built-in reduction operations supported by gpu.allreduce.",		"built-in reduction operations supported by gpu.allreduce.",
[		[
GPU_AllReduceOpAdd,		GPU_AllReduceOpAdd,
		GPU_AllReduceOpAnd,
		GPU_AllReduceOpMax,
		GPU_AllReduceOpMin,
GPU_AllReduceOpMul,		GPU_AllReduceOpMul,
		GPU_AllReduceOpOr,
		GPU_AllReduceOpXor
]>;		]>;

def GPU_AllReduceOp : GPU_Op<"all_reduce",		def GPU_AllReduceOp : GPU_Op<"all_reduce",
[SameOperandsAndResultType, IsolatedFromAbove]>,		[SameOperandsAndResultType, IsolatedFromAbove]>,
Arguments<(ins AnyType:$value,		Arguments<(ins AnyType:$value,
OptionalAttr<GPU_AllReduceOperationAttr>:$op)>,		OptionalAttr<GPU_AllReduceOperationAttr>:$op)>,
Results<(outs AnyType)> {		Results<(outs AnyType)> {
let summary = "Reduce values among workgroup.";		let summary = "Reduce values among workgroup.";
let description = [{		let description = [{
The "all_reduce" op reduces the value of every work item across a local		The "all_reduce" op reduces the value of every work item across a local
workgroup. The result is equal for all work items of a workgroup.		workgroup. The result is equal for all work items of a workgroup.

For example, both		For example, both
```		```
%1 = "gpu.all_reduce"(%0) ({}) { op = "add" } : (f32) -> (f32)		%1 = "gpu.all_reduce"(%0) ({}) { op = "add" } : (f32) -> (f32)
%2 = "gpu.all_reduce"(%0) ({		%2 = "gpu.all_reduce"(%0) ({
^bb(%lhs : f32, %rhs : f32):		^bb(%lhs : f32, %rhs : f32):
%sum = addf %lhs, %rhs : f32		%sum = addf %lhs, %rhs : f32
"gpu.yield"(%sum) : (f32) -> ()		"gpu.yield"(%sum) : (f32) -> ()
}) : (f32) -> (f32)		}) : (f32) -> (f32)
```		```
compute the sum of each work item's %0 value. The first version specifies		compute the sum of each work item's %0 value. The first version specifies
the accumulation as operation, whereas the second version specifies the		the accumulation as operation, whereas the second version specifies the
accumulation as code region. The accumulation operation must either be		accumulation as code region. The accumulation operation must be one of:
`add` or `mul`.		`add`, `and`, `max`, `min`, `mul`, `or`, `xor`.

Either none or all work items of a workgroup need to execute this op		Either none or all work items of a workgroup need to execute this op
in convergence.		in convergence.
}];		}];
let regions = (region AnyRegion:$body);		let regions = (region AnyRegion:$body);
let verifier = [{ return ::verifyAllReduce(*this); }];		let verifier = [{ return ::verifyAllReduce(*this); }];
}		}

(110 lines not shown)

mlir/include/mlir/ExecutionEngine/RunnerUtils.h

	(205 lines not shown)
	////////////////////////////////////////////////////////////////////////////////			////////////////////////////////////////////////////////////////////////////////
	// Currently exposed C API.			// Currently exposed C API.
	////////////////////////////////////////////////////////////////////////////////			////////////////////////////////////////////////////////////////////////////////
	extern "C" MLIR_RUNNERUTILS_EXPORT void			extern "C" MLIR_RUNNERUTILS_EXPORT void
	_mlir_ciface_print_memref_i8(UnrankedMemRefType<int8_t> *M);			_mlir_ciface_print_memref_i8(UnrankedMemRefType<int8_t> *M);
	extern "C" MLIR_RUNNERUTILS_EXPORT void			extern "C" MLIR_RUNNERUTILS_EXPORT void
	_mlir_ciface_print_memref_f32(UnrankedMemRefType<float> *M);			_mlir_ciface_print_memref_f32(UnrankedMemRefType<float> *M);

				extern "C" MLIR_RUNNERUTILS_EXPORT void print_memref_i32(int64_t rank,
				void *ptr);
	extern "C" MLIR_RUNNERUTILS_EXPORT void print_memref_f32(int64_t rank,			extern "C" MLIR_RUNNERUTILS_EXPORT void print_memref_f32(int64_t rank,
	void *ptr);			void *ptr);

	extern "C" MLIR_RUNNERUTILS_EXPORT void			extern "C" MLIR_RUNNERUTILS_EXPORT void
	_mlir_ciface_print_memref_0d_f32(StridedMemRefType<float, 0> *M);			_mlir_ciface_print_memref_0d_f32(StridedMemRefType<float, 0> *M);
	extern "C" MLIR_RUNNERUTILS_EXPORT void			extern "C" MLIR_RUNNERUTILS_EXPORT void
	_mlir_ciface_print_memref_1d_f32(StridedMemRefType<float, 1> *M);			_mlir_ciface_print_memref_1d_f32(StridedMemRefType<float, 1> *M);
	extern "C" MLIR_RUNNERUTILS_EXPORT void			extern "C" MLIR_RUNNERUTILS_EXPORT void
	(11 lines not shown)

mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp

(117 lines not shown)	AccumulatorFactory getFactory(StringRef opName, llvm::Type *type) const {
if (opName == "add") {		if (opName == "add") {
return isFloatingPoint ? getFactory<LLVM::FAddOp>()		return isFloatingPoint ? getFactory<LLVM::FAddOp>()
: getFactory<LLVM::AddOp>();		: getFactory<LLVM::AddOp>();
}		}
if (opName == "mul") {		if (opName == "mul") {
return isFloatingPoint ? getFactory<LLVM::FMulOp>()		return isFloatingPoint ? getFactory<LLVM::FMulOp>()
: getFactory<LLVM::MulOp>();		: getFactory<LLVM::MulOp>();
}		}
		if (opName == "and") {
		return getFactory<LLVM::AndOp>();
		}
		if (opName == "or") {
		return getFactory<LLVM::OrOp>();
		}
		if (opName == "xor") {
		return getFactory<LLVM::XOrOp>();
		}
		if (opName == "max") {
		return isFloatingPoint ? getCmpFactory<LLVM::FCmpOp, LLVM::FCmpPredicate,
		LLVM::FCmpPredicate::ugt>()
		: getCmpFactory<LLVM::ICmpOp, LLVM::ICmpPredicate,
		LLVM::ICmpPredicate::ugt>();
		}
		if (opName == "min") {
		return isFloatingPoint ? getCmpFactory<LLVM::FCmpOp, LLVM::FCmpPredicate,
		LLVM::FCmpPredicate::ult>()
		: getCmpFactory<LLVM::ICmpOp, LLVM::ICmpPredicate,
		LLVM::ICmpPredicate::ult>();
		}

return AccumulatorFactory();		return AccumulatorFactory();
}		}

/// Returns an accumulator factory that creates an op of type T.		/// Returns an accumulator factory that creates an op of type T.
template <typename T> AccumulatorFactory getFactory() const {		template <typename T>
		AccumulatorFactory getFactory() const {
return [](Location loc, Value lhs, Value rhs,		return [](Location loc, Value lhs, Value rhs,
ConversionPatternRewriter &rewriter) {		ConversionPatternRewriter &rewriter) {
return rewriter.create<T>(loc, lhs.getType(), lhs, rhs);		return rewriter.create<T>(loc, lhs.getType(), lhs, rhs);
};		};
}		}

		/// Returns an accumulator for comparison such as min, max. T is the type
		/// of the compare op.
		template <typename T, typename PredicateEnum, PredicateEnum predicate>
		AccumulatorFactory getCmpFactory() const {
		return [](Location loc, Value lhs, Value rhs,
		ConversionPatternRewriter &rewriter) {
		Value cmp = rewriter.create<T>(loc, predicate, lhs, rhs);
		return rewriter.create<LLVM::SelectOp>(loc, cmp, lhs, rhs);
		};
		}

/// Creates an all_reduce across the block.		/// Creates an all_reduce across the block.
///		///
/// First reduce the elements within a warp. The first thread of each warp		/// First reduce the elements within a warp. The first thread of each warp
/// writes the intermediate result to shared memory. After synchronizing the		/// writes the intermediate result to shared memory. After synchronizing the
/// block, the first warp reduces the values from shared memory. The result		/// block, the first warp reduces the values from shared memory. The result
/// is broadcasted to all threads through shared memory.		/// is broadcasted to all threads through shared memory.
///		///
/// %warp_reduce = `createWarpReduce(%operand)`		/// %warp_reduce = `createWarpReduce(%operand)`
(554 lines not shown)	patterns
NVVM::BlockDimYOp, NVVM::BlockDimZOp>,		NVVM::BlockDimYOp, NVVM::BlockDimZOp>,
GPUIndexIntrinsicOpLowering<gpu::BlockIdOp, NVVM::BlockIdXOp,		GPUIndexIntrinsicOpLowering<gpu::BlockIdOp, NVVM::BlockIdXOp,
NVVM::BlockIdYOp, NVVM::BlockIdZOp>,		NVVM::BlockIdYOp, NVVM::BlockIdZOp>,
GPUIndexIntrinsicOpLowering<gpu::GridDimOp, NVVM::GridDimXOp,		GPUIndexIntrinsicOpLowering<gpu::GridDimOp, NVVM::GridDimXOp,
NVVM::GridDimYOp, NVVM::GridDimZOp>,		NVVM::GridDimYOp, NVVM::GridDimZOp>,
GPUAllReduceOpLowering, GPUShuffleOpLowering, GPUFuncOpLowering,		GPUAllReduceOpLowering, GPUShuffleOpLowering, GPUFuncOpLowering,
GPUReturnOpLowering>(converter);		GPUReturnOpLowering>(converter);
patterns.insert<OpToFuncCallLowering<AbsFOp>>(converter, "__nv_fabsf",		patterns.insert<OpToFuncCallLowering<AbsFOp>>(converter, "__nv_fabsf",
"__nv_fabs");		"__nv_fabs");
patterns.insert<OpToFuncCallLowering<CeilFOp>>(converter, "__nv_ceilf",		patterns.insert<OpToFuncCallLowering<CeilFOp>>(converter, "__nv_ceilf",
"__nv_ceil");		"__nv_ceil");
patterns.insert<OpToFuncCallLowering<CosOp>>(converter, "__nv_cosf",		patterns.insert<OpToFuncCallLowering<CosOp>>(converter, "__nv_cosf",
"__nv_cos");		"__nv_cos");
patterns.insert<OpToFuncCallLowering<ExpOp>>(converter, "__nv_expf",		patterns.insert<OpToFuncCallLowering<ExpOp>>(converter, "__nv_expf",
"__nv_exp");		"__nv_exp");
patterns.insert<OpToFuncCallLowering<LogOp>>(converter, "__nv_logf",		patterns.insert<OpToFuncCallLowering<LogOp>>(converter, "__nv_logf",
"__nv_log");		"__nv_log");
patterns.insert<OpToFuncCallLowering<Log10Op>>(converter, "__nv_log10f",		patterns.insert<OpToFuncCallLowering<Log10Op>>(converter, "__nv_log10f",
"__nv_log10");		"__nv_log10");
(13 lines not shown)
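
The `max` and `min` accumulators in this file pair a compare with a select, and the documented reduction first combines values within a warp. A host-side C++ simulation can make that concrete. This is only a sketch under stated assumptions: `cmpUgt`, `maxViaCmpSelect`, `warpAllReduceMax`, and `exampleWarpMax` are hypothetical names, the warp is modelled as a 32-element array, and the xor-shuffle offsets 1, 2, 4, 8, 16 are taken from the full-warp path in the FileCheck test of this revision.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;

// Mirrors LLVM's `fcmp ugt`: true if either operand is NaN (unordered) or if
// lhs > rhs.
bool cmpUgt(float lhs, float rhs) {
  return std::isunordered(lhs, rhs) || lhs > rhs;
}

// Compare followed by select: keep lhs when the predicate holds, else rhs.
float maxViaCmpSelect(float lhs, float rhs) {
  return cmpUgt(lhs, rhs) ? lhs : rhs;
}

// Simulates a full-warp xor-shuffle reduction. In each step every lane
// combines its value with the value held by lane (lane ^ offset).
std::array<float, kWarpSize>
warpAllReduceMax(std::array<float, kWarpSize> lanes) {
  for (std::size_t offset = 1; offset < kWarpSize; offset <<= 1) {
    std::array<float, kWarpSize> next{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane)
      next[lane] = maxViaCmpSelect(lanes[lane], lanes[lane ^ offset]);
    lanes = next;
  }
  return lanes;
}

// Usage: lanes hold 0..31, so every lane ends up with 31.
void exampleWarpMax() {
  std::array<float, kWarpSize> lanes{};
  for (std::size_t i = 0; i < kWarpSize; ++i)
    lanes[i] = static_cast<float>(i);
  for (float v : warpAllReduceMax(lanes))
    assert(v == 31.0f);
}
```

Two properties of the chosen predicates are worth keeping in mind when reading the lowering. With an unordered float predicate, a NaN on either side makes the comparison true, so the select keeps `lhs` regardless of which operand is the NaN. For integers, `ugt` and `ult` compare the operands as unsigned values.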

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp

(142 lines not shown)	for (Block &block : allReduce.body()) {
return allReduce.emitError("expected one gpu.yield operand");		return allReduce.emitError("expected one gpu.yield operand");
if (yield.getOperand(0).getType() != allReduce.getType())		if (yield.getOperand(0).getType() != allReduce.getType())
return allReduce.emitError("incorrect gpu.yield type");		return allReduce.emitError("incorrect gpu.yield type");
++yieldCount;		++yieldCount;
}		}
}		}
if (yieldCount == 0)		if (yieldCount == 0)
return allReduce.emitError("expected gpu.yield op in region");		return allReduce.emitError("expected gpu.yield op in region");
		} else {
		StringRef opName = *allReduce.op();
		if ((opName == "and" || opName == "or" || opName == "xor") &&
		!allReduce.getType().isa<IntegerType>()) {
		return allReduce.emitError()
		<< '`' << opName << '`'
		<< " accumulator is only compatible with Integer type";
		}
}		}
return success();		return success();
}		}

static LogicalResult verifyShuffleOp(gpu::ShuffleOp shuffleOp) {		static LogicalResult verifyShuffleOp(gpu::ShuffleOp shuffleOp) {
auto type = shuffleOp.value().getType();		auto type = shuffleOp.value().getType();
if (shuffleOp.result().getType() != type) {		if (shuffleOp.result().getType() != type) {
return shuffleOp.emitOpError()		return shuffleOp.emitOpError()
(630 lines not shown)

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp

(206 lines not shown)	private:

/// Returns an accumulator factory that creates an op specified by opName.		/// Returns an accumulator factory that creates an op specified by opName.
AccumulatorFactory getFactory(StringRef opName) {		AccumulatorFactory getFactory(StringRef opName) {
bool isFloatingPoint = valueType.isa<FloatType>();		bool isFloatingPoint = valueType.isa<FloatType>();
if (opName == "add")		if (opName == "add")
return isFloatingPoint ? getFactory<AddFOp>() : getFactory<AddIOp>();		return isFloatingPoint ? getFactory<AddFOp>() : getFactory<AddIOp>();
if (opName == "mul")		if (opName == "mul")
return isFloatingPoint ? getFactory<MulFOp>() : getFactory<MulIOp>();		return isFloatingPoint ? getFactory<MulFOp>() : getFactory<MulIOp>();
		if (opName == "and") {
		return getFactory<AndOp>();
		}
		if (opName == "or") {
		return getFactory<OrOp>();
		}
		if (opName == "xor") {
		return getFactory<XOrOp>();
		}
		if (opName == "max") {
		return isFloatingPoint
		? getCmpFactory<CmpFOp, CmpFPredicate, CmpFPredicate::UGT>()
		: getCmpFactory<CmpIOp, CmpIPredicate, CmpIPredicate::ugt>();
		}
		if (opName == "min") {
		return isFloatingPoint
		? getCmpFactory<CmpFOp, CmpFPredicate, CmpFPredicate::ULT>()
		: getCmpFactory<CmpIOp, CmpIPredicate, CmpIPredicate::ult>();
		}
return AccumulatorFactory();		return AccumulatorFactory();
}		}

/// Returns an accumulator factory that creates an op of type T.		/// Returns an accumulator factory that creates an op of type T.
template <typename T> AccumulatorFactory getFactory() {		template <typename T> AccumulatorFactory getFactory() {
return [&](Value lhs, Value rhs) {		return [&](Value lhs, Value rhs) {
return create<T>(lhs.getType(), lhs, rhs);		return create<T>(lhs.getType(), lhs, rhs);
};		};
}		}

		/// Returns an accumulator for comparison such as min, max. T is the type
		/// of the compare op.
		template <typename T, typename PredicateEnum, PredicateEnum predicate>
		AccumulatorFactory getCmpFactory() const {
		return [&](Value lhs, Value rhs) {
		Value cmp = rewriter.create<T>(loc, predicate, lhs, rhs);
		return rewriter.create<SelectOp>(loc, cmp, lhs, rhs);
		};
		}

/// Creates an if-block skeleton and calls the two factories to generate the		/// Creates an if-block skeleton and calls the two factories to generate the
/// ops in the `then` and `else` block..		/// ops in the `then` and `else` block..
///		///
/// llvm.cond_br %condition, ^then, ^continue		/// llvm.cond_br %condition, ^then, ^continue
/// ^then:		/// ^then:
/// %then_operands = `thenOpsFactory()`		/// %then_operands = `thenOpsFactory()`
/// llvm.br ^continue(%then_operands)		/// llvm.br ^continue(%then_operands)
/// ^else:		/// ^else:
(141 lines not shown)

mlir/lib/ExecutionEngine/RunnerUtils.cpp

	(21 lines not shown)

	#define MEMREF_CASE(TYPE, RANK) \			#define MEMREF_CASE(TYPE, RANK) \
	case RANK: \			case RANK: \
	impl::printMemRef(*(static_cast<StridedMemRefType<TYPE, RANK> *>(ptr))); \			impl::printMemRef(*(static_cast<StridedMemRefType<TYPE, RANK> *>(ptr))); \
	break			break

	extern "C" void _mlir_ciface_print_memref_i8(UnrankedMemRefType<int8_t> *M) {			extern "C" void _mlir_ciface_print_memref_i8(UnrankedMemRefType<int8_t> *M) {
	printUnrankedMemRefMetaData(std::cout, *M);			printUnrankedMemRefMetaData(std::cout, *M);
	int rank = M->rank;			int64_t rank = M->rank;
	void *ptr = M->descriptor;			void *ptr = M->descriptor;

	switch (rank) {			switch (rank) {
	MEMREF_CASE(int8_t, 0);			MEMREF_CASE(int8_t, 0);
	MEMREF_CASE(int8_t, 1);			MEMREF_CASE(int8_t, 1);
	MEMREF_CASE(int8_t, 2);			MEMREF_CASE(int8_t, 2);
	MEMREF_CASE(int8_t, 3);			MEMREF_CASE(int8_t, 3);
	MEMREF_CASE(int8_t, 4);			MEMREF_CASE(int8_t, 4);
	default:			default:
	assert(0 && "Unsupported rank to print");			assert(0 && "Unsupported rank to print");
	}			}
	}			}

				extern "C" void _mlir_ciface_print_memref_i32(UnrankedMemRefType<int32_t> *M) {
				printUnrankedMemRefMetaData(std::cout, *M);
				int64_t rank = M->rank;
				void *ptr = M->descriptor;

				switch (rank) {
				MEMREF_CASE(int32_t, 0);
				MEMREF_CASE(int32_t, 1);
				MEMREF_CASE(int32_t, 2);
				MEMREF_CASE(int32_t, 3);
				MEMREF_CASE(int32_t, 4);
				default:
				assert(0 && "Unsupported rank to print");
				}
				}

	extern "C" void _mlir_ciface_print_memref_f32(UnrankedMemRefType<float> *M) {			extern "C" void _mlir_ciface_print_memref_f32(UnrankedMemRefType<float> *M) {
	printUnrankedMemRefMetaData(std::cout, *M);			printUnrankedMemRefMetaData(std::cout, *M);
	int rank = M->rank;			int64_t rank = M->rank;
	void *ptr = M->descriptor;			void *ptr = M->descriptor;

	switch (rank) {			switch (rank) {
	MEMREF_CASE(float, 0);			MEMREF_CASE(float, 0);
	MEMREF_CASE(float, 1);			MEMREF_CASE(float, 1);
	MEMREF_CASE(float, 2);			MEMREF_CASE(float, 2);
	MEMREF_CASE(float, 3);			MEMREF_CASE(float, 3);
	MEMREF_CASE(float, 4);			MEMREF_CASE(float, 4);
	default:			default:
	assert(0 && "Unsupported rank to print");			assert(0 && "Unsupported rank to print");
	}			}
	}			}

				extern "C" void print_memref_i32(int64_t rank, void *ptr) {
				UnrankedMemRefType<int32_t> descriptor = {rank, ptr};
				_mlir_ciface_print_memref_i32(&descriptor);
				}

	extern "C" void print_memref_f32(int64_t rank, void *ptr) {			extern "C" void print_memref_f32(int64_t rank, void *ptr) {
	UnrankedMemRefType<float> descriptor;			UnrankedMemRefType<float> descriptor = {rank, ptr};
	descriptor.rank = rank;
	descriptor.descriptor = ptr;
	_mlir_ciface_print_memref_f32(&descriptor);			_mlir_ciface_print_memref_f32(&descriptor);
	}			}

	extern "C" void			extern "C" void
	_mlir_ciface_print_memref_0d_f32(StridedMemRefType<float, 0> *M) {			_mlir_ciface_print_memref_0d_f32(StridedMemRefType<float, 0> *M) {
	impl::printMemRef(*M);			impl::printMemRef(*M);
	}			}
	extern "C" void			extern "C" void
	(15 lines not shown)

mlir/test/Dialect/GPU/all-reduce-max.mlir

This file was added.

				// RUN: mlir-opt -test-all-reduce-lowering %s | FileCheck %s

				// NOTE: Assertions have been autogenerated by utils/generate-test-checks.py
				// CHECK: module @kernels attributes {gpu.kernel_module} {
				module @kernels attributes {gpu.kernel_module} {

				// CHECK-LABEL: gpu.func @kernel(
				// CHECK-SAME: [[VAL_0:%.*]]: f32) workgroup([[VAL_1:%.*]] : memref<32xf32, 3>) kernel {
				gpu.func @kernel(%arg0 : f32) attributes { gpu.kernel } {
				// CHECK: [[VAL_2:%.*]] = constant 31 : i32
				// CHECK: [[VAL_3:%.*]] = constant 0 : i32
				// CHECK: [[VAL_4:%.*]] = constant 0 : index
				// CHECK: [[VAL_5:%.*]] = constant 32 : i32
				// CHECK: [[VAL_6:%.*]] = constant 1 : i32
				// CHECK: [[VAL_7:%.*]] = constant 2 : i32
				// CHECK: [[VAL_8:%.*]] = constant 4 : i32
				// CHECK: [[VAL_9:%.*]] = constant 8 : i32
				// CHECK: [[VAL_10:%.*]] = constant 16 : i32
				// CHECK: [[VAL_11:%.*]] = "gpu.block_dim"() {dimension = "x"} : () -> index
				// CHECK: [[VAL_12:%.*]] = index_cast [[VAL_11]] : index to i32
				// CHECK: [[VAL_13:%.*]] = "gpu.block_dim"() {dimension = "y"} : () -> index
				// CHECK: [[VAL_14:%.*]] = index_cast [[VAL_13]] : index to i32
				// CHECK: [[VAL_15:%.*]] = "gpu.block_dim"() {dimension = "z"} : () -> index
				// CHECK: [[VAL_16:%.*]] = index_cast [[VAL_15]] : index to i32
				// CHECK: [[VAL_17:%.*]] = "gpu.thread_id"() {dimension = "x"} : () -> index
				// CHECK: [[VAL_18:%.*]] = index_cast [[VAL_17]] : index to i32
				// CHECK: [[VAL_19:%.*]] = "gpu.thread_id"() {dimension = "y"} : () -> index
				// CHECK: [[VAL_20:%.*]] = index_cast [[VAL_19]] : index to i32
				// CHECK: [[VAL_21:%.*]] = "gpu.thread_id"() {dimension = "z"} : () -> index
				// CHECK: [[VAL_22:%.*]] = index_cast [[VAL_21]] : index to i32
				// CHECK: [[VAL_23:%.*]] = muli [[VAL_22]], [[VAL_14]] : i32
				// CHECK: [[VAL_24:%.*]] = addi [[VAL_23]], [[VAL_20]] : i32
				// CHECK: [[VAL_25:%.*]] = muli [[VAL_24]], [[VAL_12]] : i32
				// CHECK: [[VAL_26:%.*]] = muli [[VAL_12]], [[VAL_14]] : i32
				// CHECK: [[VAL_27:%.*]] = addi [[VAL_25]], [[VAL_18]] : i32
				// CHECK: [[VAL_28:%.*]] = muli [[VAL_26]], [[VAL_16]] : i32
				// CHECK: [[VAL_29:%.*]] = and [[VAL_27]], [[VAL_2]] : i32
				// CHECK: [[VAL_30:%.*]] = cmpi "eq", [[VAL_29]], [[VAL_3]] : i32
				// CHECK: [[VAL_31:%.*]] = subi [[VAL_27]], [[VAL_29]] : i32
				// CHECK: [[VAL_32:%.*]] = subi [[VAL_28]], [[VAL_31]] : i32
				// CHECK: [[VAL_33:%.*]] = cmpi "slt", [[VAL_32]], [[VAL_5]] : i32
				// CHECK: cond_br [[VAL_33]], ^bb1, ^bb17
				// CHECK: ^bb1:
				// CHECK: [[VAL_34:%.*]], [[VAL_35:%.*]] = gpu.shuffle [[VAL_0]], [[VAL_6]], [[VAL_32]] xor : f32
				// CHECK: cond_br [[VAL_35]], ^bb2, ^bb3
				// CHECK: ^bb2:
				// CHECK: [[VAL_36:%.*]] = cmpf "ugt", [[VAL_0]], [[VAL_34]] : f32
				// CHECK: [[VAL_37:%.*]] = select [[VAL_36]], [[VAL_0]], [[VAL_34]] : f32
				// CHECK: br ^bb4([[VAL_37]] : f32)
				// CHECK: ^bb3:
				// CHECK: br ^bb4([[VAL_0]] : f32)
				// CHECK: ^bb4([[VAL_38:%.*]]: f32):
				// CHECK: [[VAL_39:%.*]], [[VAL_40:%.*]] = gpu.shuffle [[VAL_38]], [[VAL_7]], [[VAL_32]] xor : f32
				// CHECK: cond_br [[VAL_40]], ^bb5, ^bb6
				// CHECK: ^bb5:
				// CHECK: [[VAL_41:%.*]] = cmpf "ugt", [[VAL_38]], [[VAL_39]] : f32
				// CHECK: [[VAL_42:%.*]] = select [[VAL_41]], [[VAL_38]], [[VAL_39]] : f32
				// CHECK: br ^bb7([[VAL_42]] : f32)
				// CHECK: ^bb6:
				// CHECK: br ^bb7([[VAL_38]] : f32)
				// CHECK: ^bb7([[VAL_43:%.*]]: f32):
				// CHECK: [[VAL_44:%.*]], [[VAL_45:%.*]] = gpu.shuffle [[VAL_43]], [[VAL_8]], [[VAL_32]] xor : f32
				// CHECK: cond_br [[VAL_45]], ^bb8, ^bb9
				// CHECK: ^bb8:
				// CHECK: [[VAL_46:%.*]] = cmpf "ugt", [[VAL_43]], [[VAL_44]] : f32
				// CHECK: [[VAL_47:%.*]] = select [[VAL_46]], [[VAL_43]], [[VAL_44]] : f32
				// CHECK: br ^bb10([[VAL_47]] : f32)
				// CHECK: ^bb9:
				// CHECK: br ^bb10([[VAL_43]] : f32)
				// CHECK: ^bb10([[VAL_48:%.*]]: f32):
				// CHECK: [[VAL_49:%.*]], [[VAL_50:%.*]] = gpu.shuffle [[VAL_48]], [[VAL_9]], [[VAL_32]] xor : f32
				// CHECK: cond_br [[VAL_50]], ^bb11, ^bb12
				// CHECK: ^bb11:
				// CHECK: [[VAL_51:%.*]] = cmpf "ugt", [[VAL_48]], [[VAL_49]] : f32
				// CHECK: [[VAL_52:%.*]] = select [[VAL_51]], [[VAL_48]], [[VAL_49]] : f32
				// CHECK: br ^bb13([[VAL_52]] : f32)
				// CHECK: ^bb12:
				// CHECK: br ^bb13([[VAL_48]] : f32)
				// CHECK: ^bb13([[VAL_53:%.*]]: f32):
				// CHECK: [[VAL_54:%.*]], [[VAL_55:%.*]] = gpu.shuffle [[VAL_53]], [[VAL_10]], [[VAL_32]] xor : f32
				// CHECK: cond_br [[VAL_55]], ^bb14, ^bb15
				// CHECK: ^bb14:
				// CHECK: [[VAL_56:%.*]] = cmpf "ugt", [[VAL_53]], [[VAL_54]] : f32
				// CHECK: [[VAL_57:%.*]] = select [[VAL_56]], [[VAL_53]], [[VAL_54]] : f32
				// CHECK: br ^bb16([[VAL_57]] : f32)
				// CHECK: ^bb15:
				// CHECK: br ^bb16([[VAL_53]] : f32)
				// CHECK: ^bb16([[VAL_58:%.*]]: f32):
				// CHECK: br ^bb18([[VAL_58]] : f32)
				// CHECK: ^bb17:
				// CHECK: [[VAL_59:%.*]], [[VAL_60:%.*]] = gpu.shuffle [[VAL_0]], [[VAL_6]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_61:%.*]] = cmpf "ugt", [[VAL_0]], [[VAL_59]] : f32
				// CHECK: [[VAL_62:%.*]] = select [[VAL_61]], [[VAL_0]], [[VAL_59]] : f32
				// CHECK: [[VAL_63:%.*]], [[VAL_64:%.*]] = gpu.shuffle [[VAL_62]], [[VAL_7]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_65:%.*]] = cmpf "ugt", [[VAL_62]], [[VAL_63]] : f32
				// CHECK: [[VAL_66:%.*]] = select [[VAL_65]], [[VAL_62]], [[VAL_63]] : f32
				// CHECK: [[VAL_67:%.*]], [[VAL_68:%.*]] = gpu.shuffle [[VAL_66]], [[VAL_8]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_69:%.*]] = cmpf "ugt", [[VAL_66]], [[VAL_67]] : f32
				// CHECK: [[VAL_70:%.*]] = select [[VAL_69]], [[VAL_66]], [[VAL_67]] : f32
				// CHECK: [[VAL_71:%.*]], [[VAL_72:%.*]] = gpu.shuffle [[VAL_70]], [[VAL_9]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_73:%.*]] = cmpf "ugt", [[VAL_70]], [[VAL_71]] : f32
				// CHECK: [[VAL_74:%.*]] = select [[VAL_73]], [[VAL_70]], [[VAL_71]] : f32
				// CHECK: [[VAL_75:%.*]], [[VAL_76:%.*]] = gpu.shuffle [[VAL_74]], [[VAL_10]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_77:%.*]] = cmpf "ugt", [[VAL_74]], [[VAL_75]] : f32
				// CHECK: [[VAL_78:%.*]] = select [[VAL_77]], [[VAL_74]], [[VAL_75]] : f32
				// CHECK: br ^bb18([[VAL_78]] : f32)
				// CHECK: ^bb18([[VAL_79:%.*]]: f32):
				// CHECK: cond_br [[VAL_30]], ^bb19, ^bb20
				// CHECK: ^bb19:
				// CHECK: [[VAL_80:%.*]] = divi_signed [[VAL_27]], [[VAL_5]] : i32
				// CHECK: [[VAL_81:%.*]] = index_cast [[VAL_80]] : i32 to index
				// CHECK: store [[VAL_79]], [[VAL_1]]{{\[}}[[VAL_81]]] : memref<32xf32, 3>
				// CHECK: br ^bb21
				// CHECK: ^bb20:
				// CHECK: br ^bb21
				// CHECK: ^bb21:
				// CHECK: gpu.barrier
				// CHECK: [[VAL_82:%.*]] = addi [[VAL_28]], [[VAL_2]] : i32
				// CHECK: [[VAL_83:%.*]] = divi_signed [[VAL_82]], [[VAL_5]] : i32
				// CHECK: [[VAL_84:%.*]] = cmpi "slt", [[VAL_27]], [[VAL_83]] : i32
				// CHECK: cond_br [[VAL_84]], ^bb22, ^bb41
				// CHECK: ^bb22:
				// CHECK: [[VAL_85:%.*]] = index_cast [[VAL_27]] : i32 to index
				// CHECK: [[VAL_86:%.*]] = load [[VAL_1]]{{\[}}[[VAL_85]]] : memref<32xf32, 3>
				// CHECK: [[VAL_87:%.*]] = cmpi "slt", [[VAL_83]], [[VAL_5]] : i32
				// CHECK: cond_br [[VAL_87]], ^bb23, ^bb39
				// CHECK: ^bb23:
				// CHECK: [[VAL_88:%.*]], [[VAL_89:%.*]] = gpu.shuffle [[VAL_86]], [[VAL_6]], [[VAL_83]] xor : f32
				// CHECK: cond_br [[VAL_89]], ^bb24, ^bb25
				// CHECK: ^bb24:
				// CHECK: [[VAL_90:%.*]] = cmpf "ugt", [[VAL_86]], [[VAL_88]] : f32
				// CHECK: [[VAL_91:%.*]] = select [[VAL_90]], [[VAL_86]], [[VAL_88]] : f32
				// CHECK: br ^bb26([[VAL_91]] : f32)
				// CHECK: ^bb25:
				// CHECK: br ^bb26([[VAL_86]] : f32)
				// CHECK: ^bb26([[VAL_92:%.*]]: f32):
				// CHECK: [[VAL_93:%.*]], [[VAL_94:%.*]] = gpu.shuffle [[VAL_92]], [[VAL_7]], [[VAL_83]] xor : f32
				// CHECK: cond_br [[VAL_94]], ^bb27, ^bb28
				// CHECK: ^bb27:
				// CHECK: [[VAL_95:%.*]] = cmpf "ugt", [[VAL_92]], [[VAL_93]] : f32
				// CHECK: [[VAL_96:%.*]] = select [[VAL_95]], [[VAL_92]], [[VAL_93]] : f32
				// CHECK: br ^bb29([[VAL_96]] : f32)
				// CHECK: ^bb28:
				// CHECK: br ^bb29([[VAL_92]] : f32)
				// CHECK: ^bb29([[VAL_97:%.*]]: f32):
				// CHECK: [[VAL_98:%.*]], [[VAL_99:%.*]] = gpu.shuffle [[VAL_97]], [[VAL_8]], [[VAL_83]] xor : f32
				// CHECK: cond_br [[VAL_99]], ^bb30, ^bb31
				// CHECK: ^bb30:
				// CHECK: [[VAL_100:%.*]] = cmpf "ugt", [[VAL_97]], [[VAL_98]] : f32
				// CHECK: [[VAL_101:%.*]] = select [[VAL_100]], [[VAL_97]], [[VAL_98]] : f32
				// CHECK: br ^bb32([[VAL_101]] : f32)
				// CHECK: ^bb31:
				// CHECK: br ^bb32([[VAL_97]] : f32)
				// CHECK: ^bb32([[VAL_102:%.*]]: f32):
				// CHECK: [[VAL_103:%.*]], [[VAL_104:%.*]] = gpu.shuffle [[VAL_102]], [[VAL_9]], [[VAL_83]] xor : f32
				// CHECK: cond_br [[VAL_104]], ^bb33, ^bb34
				// CHECK: ^bb33:
				// CHECK: [[VAL_105:%.*]] = cmpf "ugt", [[VAL_102]], [[VAL_103]] : f32
				// CHECK: [[VAL_106:%.*]] = select [[VAL_105]], [[VAL_102]], [[VAL_103]] : f32
				// CHECK: br ^bb35([[VAL_106]] : f32)
				// CHECK: ^bb34:
				// CHECK: br ^bb35([[VAL_102]] : f32)
				// CHECK: ^bb35([[VAL_107:%.*]]: f32):
				// CHECK: [[VAL_108:%.*]], [[VAL_109:%.*]] = gpu.shuffle [[VAL_107]], [[VAL_10]], [[VAL_83]] xor : f32
				// CHECK: cond_br [[VAL_109]], ^bb36, ^bb37
				// CHECK: ^bb36:
				// CHECK: [[VAL_110:%.*]] = cmpf "ugt", [[VAL_107]], [[VAL_108]] : f32
				// CHECK: [[VAL_111:%.*]] = select [[VAL_110]], [[VAL_107]], [[VAL_108]] : f32
				// CHECK: br ^bb38([[VAL_111]] : f32)
				// CHECK: ^bb37:
				// CHECK: br ^bb38([[VAL_107]] : f32)
				// CHECK: ^bb38([[VAL_112:%.*]]: f32):
				// CHECK: br ^bb40([[VAL_112]] : f32)
				// CHECK: ^bb39:
				// CHECK: [[VAL_113:%.*]], [[VAL_114:%.*]] = gpu.shuffle [[VAL_86]], [[VAL_6]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_115:%.*]] = cmpf "ugt", [[VAL_86]], [[VAL_113]] : f32
				// CHECK: [[VAL_116:%.*]] = select [[VAL_115]], [[VAL_86]], [[VAL_113]] : f32
				// CHECK: [[VAL_117:%.*]], [[VAL_118:%.*]] = gpu.shuffle [[VAL_116]], [[VAL_7]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_119:%.*]] = cmpf "ugt", [[VAL_116]], [[VAL_117]] : f32
				// CHECK: [[VAL_120:%.*]] = select [[VAL_119]], [[VAL_116]], [[VAL_117]] : f32
				// CHECK: [[VAL_121:%.*]], [[VAL_122:%.*]] = gpu.shuffle [[VAL_120]], [[VAL_8]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_123:%.*]] = cmpf "ugt", [[VAL_120]], [[VAL_121]] : f32
				// CHECK: [[VAL_124:%.*]] = select [[VAL_123]], [[VAL_120]], [[VAL_121]] : f32
				// CHECK: [[VAL_125:%.*]], [[VAL_126:%.*]] = gpu.shuffle [[VAL_124]], [[VAL_9]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_127:%.*]] = cmpf "ugt", [[VAL_124]], [[VAL_125]] : f32
				// CHECK: [[VAL_128:%.*]] = select [[VAL_127]], [[VAL_124]], [[VAL_125]] : f32
				// CHECK: [[VAL_129:%.*]], [[VAL_130:%.*]] = gpu.shuffle [[VAL_128]], [[VAL_10]], [[VAL_5]] xor : f32
				// CHECK: [[VAL_131:%.*]] = cmpf "ugt", [[VAL_128]], [[VAL_129]] : f32
				// CHECK: [[VAL_132:%.*]] = select [[VAL_131]], [[VAL_128]], [[VAL_129]] : f32
				// CHECK: br ^bb40([[VAL_132]] : f32)
				// CHECK: ^bb40([[VAL_133:%.*]]: f32):
				// CHECK: store [[VAL_133]], [[VAL_1]]{{\[}}[[VAL_4]]] : memref<32xf32, 3>
				// CHECK: br ^bb42
				// CHECK: ^bb41:
				// CHECK: br ^bb42
				// CHECK: ^bb42:
				// CHECK: gpu.barrier
				// CHECK: [[VAL_134:%.*]] = load [[VAL_1]]{{\[}}[[VAL_4]]] : memref<32xf32, 3>
				%sum = "gpu.all_reduce"(%arg0) ({}) {op = "max"} : (f32) -> (f32)
				gpu.return
				}

				}
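
The CHECK lines above encode a butterfly (xor-shuffle) reduction: at each step a lane reads the value of the lane whose index differs in one bit, and combines the two with `cmpf "ugt"` plus `select`, but only when the shuffle reports a valid partner. A minimal Python sketch of that pattern follows. It assumes the shuffle offsets bound to `VAL_6` through `VAL_10` are 1, 2, 4, 8 and 16 (their definitions fall outside this excerpt), it ignores the NaN behaviour of `ugt`, and the helper names are made up for illustration.

```python
def shuffle_xor(values, lane, offset, width):
    """Model of a xor-mode shuffle: returns (value, valid).

    `valid` is False when the partner lane falls outside the first
    `width` lanes, mirroring the second result of gpu.shuffle.
    """
    partner = lane ^ offset
    if partner < width:
        return values[partner], True
    return values[lane], False


def butterfly_max(values, width, offsets=(1, 2, 4, 8, 16)):
    """Reduce the first `width` lanes with max and return lane 0's value."""
    current = list(values)
    for offset in offsets:
        nxt = list(current)
        for lane in range(width):
            other, valid = shuffle_xor(current, lane, offset, width)
            if valid:
                # Stands in for cmpf "ugt" followed by select.
                nxt[lane] = current[lane] if current[lane] > other else other
        current = nxt
    # Lane 0 always ends up with the full reduction; other lanes are only
    # guaranteed to match it when `width` is a power of two.
    return current[0]
```

With a partial warp, lanes whose partner is out of range keep their own value, so only lane 0 is guaranteed to hold the complete result. That is consistent with the lowering storing a single value per warp to shared memory before the `gpu.barrier`.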

mlir/test/Dialect/GPU/invalid.mlir

				func @reduce_invalid_op(%arg0 : f32) {
				// expected-error@+1 {{gpu.all_reduce' op attribute 'op' failed to satisfy constraint}}
				%res = "gpu.all_reduce"(%arg0) ({}) {op = "foo"} : (f32) -> (f32)
				return
				}

				// -----

				func @reduce_invalid_op_type(%arg0 : f32) {
				// expected-error@+1 {{`and` accumulator is only compatible with Integer type}}
				%res = "gpu.all_reduce"(%arg0) ({}) {op = "and"} : (f32) -> (f32)
				return
				}

				// -----

				func @reduce_incorrect_region_arguments(%arg0 : f32) {
				// expected-error@+1 {{expected two region arguments}}
				%res = "gpu.all_reduce"(%arg0) ({
				^bb(%lhs : f32):
				"gpu.yield"(%lhs) : (f32) -> ()
				}) : (f32) -> (f32)
				}
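
The `reduce_invalid_op_type` case exercises a verifier rule that rejects a bitwise accumulator on a non-integer operand. A rough Python model of such a rule is sketched below. The C++ verifier is not shown in this excerpt, so treating `or` and `xor` the same way as `and`, and reusing the same message template for them, are assumptions; the function name and the string-based type check are illustrative stand-ins, not MLIR API.

```python
INTEGER_ONLY_OPS = {"and", "or", "xor"}  # assumption: bitwise ops are integer-only


def verify_all_reduce(op_name, element_type):
    """Return an error string, or None when the combination is accepted.

    `element_type` is a plain string such as "i32" or "f32"; it is a
    stand-in for an MLIR type, not the real API.
    """
    is_integer = element_type.startswith("i") and element_type[1:].isdigit()
    if op_name in INTEGER_ONLY_OPS and not is_integer:
        return f"`{op_name}` accumulator is only compatible with Integer type"
    return None
```

Reporting the problem as a verifier error rather than an assertion in the lowering means invalid IR is rejected with a diagnostic that a lit test can match with `expected-error`.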

mlir/test/mlir-cuda-runner/all-reduce-and.mlir

This file was added.

				// RUN: mlir-cuda-runner %s --shared-libs=%cuda_wrapper_library_dir/libcuda-runtime-wrappers%shlibext,%linalg_test_lib_dir/libmlir_runner_utils%shlibext --entry-point-result=void | FileCheck %s

				func @main() {
				%data = alloc() : memref<2x6xi32>
				%sum_and = alloc() : memref<2xi32>
				%sum_or = alloc() : memref<2xi32>
				%sum_min = alloc() : memref<2xi32>
				%cst0 = constant 0 : i32
				%cst1 = constant 1 : i32
				%cst2 = constant 2 : i32
				%cst4 = constant 4 : i32
				%cst8 = constant 8 : i32
				%cst16 = constant 16 : i32

				%cst3 = constant 3 : i32
				%cst6 = constant 6 : i32
				%cst7 = constant 7 : i32
				%cst10 = constant 10 : i32
				%cst11 = constant 11 : i32

				%c0 = constant 0 : index
				%c1 = constant 1 : index
				%c2 = constant 2 : index
				%c3 = constant 3 : index
				%c4 = constant 4 : index
				%c5 = constant 5 : index
				%c6 = constant 6 : index

				store %cst0, %data[%c0, %c0] : memref<2x6xi32>
				store %cst1, %data[%c0, %c1] : memref<2x6xi32>
				store %cst2, %data[%c0, %c2] : memref<2x6xi32>
				store %cst4, %data[%c0, %c3] : memref<2x6xi32>
				store %cst8, %data[%c0, %c4] : memref<2x6xi32>
				store %cst16, %data[%c0, %c5] : memref<2x6xi32>

				store %cst2, %data[%c1, %c0] : memref<2x6xi32>
				store %cst3, %data[%c1, %c1] : memref<2x6xi32>
				store %cst6, %data[%c1, %c2] : memref<2x6xi32>
				store %cst7, %data[%c1, %c3] : memref<2x6xi32>
				store %cst10, %data[%c1, %c4] : memref<2x6xi32>
				store %cst11, %data[%c1, %c5] : memref<2x6xi32>

				// AND
				gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c2, %grid_y = %c1, %grid_z = %c1)
				threads(%tx, %ty, %tz) in (%block_x = %c6, %block_y = %c1, %block_z = %c1) {
				%val = load %data[%bx, %tx] : memref<2x6xi32>
				%reduced_and = "gpu.all_reduce"(%val) ({}) { op = "and" } : (i32) -> (i32)
				store %reduced_and, %sum_and[%bx] : memref<2xi32>
				gpu.terminator
				}

				%ptr_and = memref_cast %sum_and : memref<2xi32> to memref<*xi32>
				call @print_memref_i32(%ptr_and) : (memref<*xi32>) -> ()
				// CHECK: [0, 2]

				return
				}

				func @print_memref_i32(memref<*xi32>)
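
In this runner test each of the two blocks reduces one row of `%data` across its six threads, and every thread stores the block-wide result into `%sum_and[%bx]`. A small host-side Python model of that launch shape is sketched here. The function name is hypothetical, and the only GPU semantics modelled is that all threads of a block see the same reduced value.

```python
from functools import reduce
from operator import and_

DATA = [
    [0, 1, 2, 4, 8, 16],   # row stored with %c0 as the first index
    [2, 3, 6, 7, 10, 11],  # row stored with %c1 as the first index
]


def launch_all_reduce(data, combine):
    """One block per row, one thread per column; returns one value per block."""
    out = [None] * len(data)
    for bx, row in enumerate(data):          # blocks(%bx, ...) with grid_x = 2
        block_result = reduce(combine, row)  # gpu.all_reduce across the block
        for _tx in range(len(row)):          # threads(%tx, ...) with block_x = 6
            out[bx] = block_result           # each thread stores the same value
    return out


print(launch_all_reduce(DATA, and_))  # [0, 2]
```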

mlir/test/mlir-cuda-runner/all-reduce-max.mlir

This file was added.

				// RUN: mlir-cuda-runner %s --shared-libs=%cuda_wrapper_library_dir/libcuda-runtime-wrappers%shlibext,%linalg_test_lib_dir/libmlir_runner_utils%shlibext --entry-point-result=void | FileCheck %s

				func @main() {
				%data = alloc() : memref<2x6xi32>
				%sum = alloc() : memref<2xi32>
				%cst0 = constant 0 : i32
				%cst1 = constant 1 : i32
				%cst2 = constant 2 : i32
				%cst4 = constant 4 : i32
				%cst8 = constant 8 : i32
				%cst16 = constant 16 : i32

				%cst3 = constant 3 : i32
				%cst6 = constant 6 : i32
				%cst7 = constant 7 : i32
				%cst10 = constant 10 : i32
				%cst11 = constant 11 : i32

				%c0 = constant 0 : index
				%c1 = constant 1 : index
				%c2 = constant 2 : index
				%c3 = constant 3 : index
				%c4 = constant 4 : index
				%c5 = constant 5 : index
				%c6 = constant 6 : index

				store %cst0, %data[%c0, %c0] : memref<2x6xi32>
				store %cst1, %data[%c0, %c1] : memref<2x6xi32>
				store %cst2, %data[%c0, %c2] : memref<2x6xi32>
				store %cst4, %data[%c0, %c3] : memref<2x6xi32>
				store %cst8, %data[%c0, %c4] : memref<2x6xi32>
				store %cst16, %data[%c0, %c5] : memref<2x6xi32>

				store %cst2, %data[%c1, %c0] : memref<2x6xi32>
				store %cst3, %data[%c1, %c1] : memref<2x6xi32>
				store %cst6, %data[%c1, %c2] : memref<2x6xi32>
				store %cst7, %data[%c1, %c3] : memref<2x6xi32>
				store %cst10, %data[%c1, %c4] : memref<2x6xi32>
				store %cst11, %data[%c1, %c5] : memref<2x6xi32>

				// MAX
				gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c2, %grid_y = %c1, %grid_z = %c1)
				threads(%tx, %ty, %tz) in (%block_x = %c6, %block_y = %c1, %block_z = %c1) {
				%val = load %data[%bx, %tx] : memref<2x6xi32>
				%reduced = "gpu.all_reduce"(%val) ({}) { op = "max" } : (i32) -> (i32)
				store %reduced, %sum[%bx] : memref<2xi32>
				gpu.terminator
				}

				%ptr = memref_cast %sum : memref<2xi32> to memref<*xi32>
				call @print_memref_i32(%ptr) : (memref<*xi32>) -> ()
				// CHECK: [16, 11]

				return
				}

				func @print_memref_i32(memref<*xi32>)

mlir/test/mlir-cuda-runner/all-reduce-min.mlir

This file was added.

				// RUN: mlir-cuda-runner %s --shared-libs=%cuda_wrapper_library_dir/libcuda-runtime-wrappers%shlibext,%linalg_test_lib_dir/libmlir_runner_utils%shlibext --entry-point-result=void | FileCheck %s

				func @main() {
				%data = alloc() : memref<2x6xi32>
				%sum = alloc() : memref<2xi32>
				%cst0 = constant 0 : i32
				%cst1 = constant 1 : i32
				%cst2 = constant 2 : i32
				%cst4 = constant 4 : i32
				%cst8 = constant 8 : i32
				%cst16 = constant 16 : i32

				%cst3 = constant 3 : i32
				%cst6 = constant 6 : i32
				%cst7 = constant 7 : i32
				%cst10 = constant 10 : i32
				%cst11 = constant 11 : i32

				%c0 = constant 0 : index
				%c1 = constant 1 : index
				%c2 = constant 2 : index
				%c3 = constant 3 : index
				%c4 = constant 4 : index
				%c5 = constant 5 : index
				%c6 = constant 6 : index

				store %cst0, %data[%c0, %c0] : memref<2x6xi32>
				store %cst1, %data[%c0, %c1] : memref<2x6xi32>
				store %cst2, %data[%c0, %c2] : memref<2x6xi32>
				store %cst4, %data[%c0, %c3] : memref<2x6xi32>
				store %cst8, %data[%c0, %c4] : memref<2x6xi32>
				store %cst16, %data[%c0, %c5] : memref<2x6xi32>

				store %cst2, %data[%c1, %c0] : memref<2x6xi32>
				store %cst3, %data[%c1, %c1] : memref<2x6xi32>
				store %cst6, %data[%c1, %c2] : memref<2x6xi32>
				store %cst7, %data[%c1, %c3] : memref<2x6xi32>
				store %cst10, %data[%c1, %c4] : memref<2x6xi32>
				store %cst11, %data[%c1, %c5] : memref<2x6xi32>

				// MIN
				gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c2, %grid_y = %c1, %grid_z = %c1)
				threads(%tx, %ty, %tz) in (%block_x = %c6, %block_y = %c1, %block_z = %c1) {
				%val = load %data[%bx, %tx] : memref<2x6xi32>
				%reduced = "gpu.all_reduce"(%val) ({}) { op = "min" } : (i32) -> (i32)
				store %reduced, %sum[%bx] : memref<2xi32>
				gpu.terminator
				}

				%ptr = memref_cast %sum : memref<2xi32> to memref<*xi32>
				call @print_memref_i32(%ptr) : (memref<*xi32>) -> ()
				// CHECK: [0, 2]

				return
				}

				func @print_memref_i32(memref<*xi32>)

mlir/test/mlir-cuda-runner/all-reduce-or.mlir

This file was added.

				// RUN: mlir-cuda-runner %s --shared-libs=%cuda_wrapper_library_dir/libcuda-runtime-wrappers%shlibext,%linalg_test_lib_dir/libmlir_runner_utils%shlibext --entry-point-result=void | FileCheck %s

				func @main() {
				%data = alloc() : memref<2x6xi32>
				%sum = alloc() : memref<2xi32>
				%cst0 = constant 0 : i32
				%cst1 = constant 1 : i32
				%cst2 = constant 2 : i32
				%cst4 = constant 4 : i32
				%cst8 = constant 8 : i32
				%cst16 = constant 16 : i32

				%cst3 = constant 3 : i32
				%cst6 = constant 6 : i32
				%cst7 = constant 7 : i32
				%cst10 = constant 10 : i32
				%cst11 = constant 11 : i32

				%c0 = constant 0 : index
				%c1 = constant 1 : index
				%c2 = constant 2 : index
				%c3 = constant 3 : index
				%c4 = constant 4 : index
				%c5 = constant 5 : index
				%c6 = constant 6 : index

				store %cst0, %data[%c0, %c0] : memref<2x6xi32>
				store %cst1, %data[%c0, %c1] : memref<2x6xi32>
				store %cst2, %data[%c0, %c2] : memref<2x6xi32>
				store %cst4, %data[%c0, %c3] : memref<2x6xi32>
				store %cst8, %data[%c0, %c4] : memref<2x6xi32>
				store %cst16, %data[%c0, %c5] : memref<2x6xi32>

				store %cst2, %data[%c1, %c0] : memref<2x6xi32>
				store %cst3, %data[%c1, %c1] : memref<2x6xi32>
				store %cst6, %data[%c1, %c2] : memref<2x6xi32>
				store %cst7, %data[%c1, %c3] : memref<2x6xi32>
				store %cst10, %data[%c1, %c4] : memref<2x6xi32>
				store %cst11, %data[%c1, %c5] : memref<2x6xi32>

				// OR
				gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c2, %grid_y = %c1, %grid_z = %c1)
				threads(%tx, %ty, %tz) in (%block_x = %c6, %block_y = %c1, %block_z = %c1) {
				%val = load %data[%bx, %tx] : memref<2x6xi32>
				%reduced = "gpu.all_reduce"(%val) ({}) { op = "or" } : (i32) -> (i32)
				store %reduced, %sum[%bx] : memref<2xi32>
				gpu.terminator
				}

				%ptr = memref_cast %sum : memref<2xi32> to memref<*xi32>
				call @print_memref_i32(%ptr) : (memref<*xi32>) -> ()
				// CHECK: [31, 15]

				return
				}

				func @print_memref_i32(memref<*xi32>)

mlir/test/mlir-cuda-runner/all-reduce-xor.mlir

This file was added.

				// RUN: mlir-cuda-runner %s --shared-libs=%cuda_wrapper_library_dir/libcuda-runtime-wrappers%shlibext,%linalg_test_lib_dir/libmlir_runner_utils%shlibext --entry-point-result=void | FileCheck %s

				func @main() {
				%data = alloc() : memref<2x6xi32>
				%sum = alloc() : memref<2xi32>
				%cst0 = constant 0 : i32
				%cst1 = constant 1 : i32
				%cst2 = constant 2 : i32
				%cst4 = constant 4 : i32
				%cst8 = constant 8 : i32
				%cst16 = constant 16 : i32

				%cst3 = constant 3 : i32
				%cst6 = constant 6 : i32
				%cst7 = constant 7 : i32
				%cst10 = constant 10 : i32
				%cst11 = constant 11 : i32

				%c0 = constant 0 : index
				%c1 = constant 1 : index
				%c2 = constant 2 : index
				%c3 = constant 3 : index
				%c4 = constant 4 : index
				%c5 = constant 5 : index
				%c6 = constant 6 : index

				store %cst0, %data[%c0, %c0] : memref<2x6xi32>
				store %cst1, %data[%c0, %c1] : memref<2x6xi32>
				store %cst2, %data[%c0, %c2] : memref<2x6xi32>
				store %cst4, %data[%c0, %c3] : memref<2x6xi32>
				store %cst8, %data[%c0, %c4] : memref<2x6xi32>
				store %cst16, %data[%c0, %c5] : memref<2x6xi32>

				store %cst2, %data[%c1, %c0] : memref<2x6xi32>
				store %cst3, %data[%c1, %c1] : memref<2x6xi32>
				store %cst6, %data[%c1, %c2] : memref<2x6xi32>
				store %cst7, %data[%c1, %c3] : memref<2x6xi32>
				store %cst10, %data[%c1, %c4] : memref<2x6xi32>
				store %cst11, %data[%c1, %c5] : memref<2x6xi32>

				// XOR
				gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c2, %grid_y = %c1, %grid_z = %c1)
				threads(%tx, %ty, %tz) in (%block_x = %c6, %block_y = %c1, %block_z = %c1) {
				%val = load %data[%bx, %tx] : memref<2x6xi32>
				%reduced = "gpu.all_reduce"(%val) ({}) { op = "xor" } : (i32) -> (i32)
				store %reduced, %sum[%bx] : memref<2xi32>
				gpu.terminator
				}

				%ptr = memref_cast %sum : memref<2xi32> to memref<*xi32>
				call @print_memref_i32(%ptr) : (memref<*xi32>) -> ()
				// CHECK: [31, 1]

				return
				}

				func @print_memref_i32(memref<*xi32>)
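
The expected outputs of the five runner tests can be cross-checked on the host. The sketch below recomputes each reduction over the two rows stored in `%data` and compares the result with the value in the corresponding `CHECK` line. It only checks the arithmetic behind those lines and says nothing about the GPU lowering itself; the helper name is illustrative.

```python
from functools import reduce
import operator

ROWS = [[0, 1, 2, 4, 8, 16], [2, 3, 6, 7, 10, 11]]

REDUCERS = {
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
    "min": min,
    "max": max,
}

# Values copied from the CHECK lines of the runner tests.
EXPECTED = {
    "and": [0, 2],
    "or": [31, 15],
    "xor": [31, 1],
    "min": [0, 2],
    "max": [16, 11],
}


def host_results():
    """Reduce every row with every accumulator on the host."""
    return {name: [reduce(fn, row) for row in ROWS] for name, fn in REDUCERS.items()}


for name, values in host_results().items():
    status = "ok" if values == EXPECTED[name] else "MISMATCH"
    print(f"{name}: {values} {status}")
```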

[MLIR] Add `and`, `or`, `xor`, `min`, `max` to gpu.all_reduce and the nvvm lowering
Closed, Public

Diff 249347

mlir/include/mlir/Dialect/GPU/GPUOps.td

mlir/include/mlir/ExecutionEngine/RunnerUtils.h

mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp

mlir/lib/ExecutionEngine/RunnerUtils.cpp

mlir/test/Dialect/GPU/all-reduce-max.mlir

mlir/test/Dialect/GPU/invalid.mlir

mlir/test/mlir-cuda-runner/all-reduce-and.mlir

mlir/test/mlir-cuda-runner/all-reduce-max.mlir

mlir/test/mlir-cuda-runner/all-reduce-min.mlir

mlir/test/mlir-cuda-runner/all-reduce-or.mlir

mlir/test/mlir-cuda-runner/all-reduce-xor.mlir
