This is an archive of the discontinued LLVM Phabricator instance.


Differential D135323

[mlir][gpu] Add `subgroup_reduce` operation
Closed · Public

Authored by Hardcode84 on Oct 5 2022, 3:02 PM.

Details

Reviewers

bondhugula
ThomasRaoux
nicolasvasilache
herhut

Commits

rGb845addae89b: [mlir][gpu] Add `subgroup_reduce` operation

Summary

Introduce a `subgroup_reduce` operation, similar to `all_reduce`, but operating at subgroup scope instead of workgroup scope.
It is intended as a low-level building block for higher-level abstractions (e.g. workgroup-wide `all_reduce` ops).
For simplicity, only the version that takes a reduce operation enum is introduced.
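
To make the scope difference concrete, here is a minimal host-side C++ model of an `add` reduction. It is not MLIR or GPU code and is not part of this patch. The function names `allReduceAdd` and `subgroupReduceAdd`, and the layout of a workgroup as a flat array split into consecutive fixed-size subgroups, are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical model: one vector element per work item of a workgroup.

// Workgroup scope: every work item receives the sum over the whole workgroup.
std::vector<float> allReduceAdd(const std::vector<float> &values) {
  float total = std::accumulate(values.begin(), values.end(), 0.0f);
  return std::vector<float>(values.size(), total);
}

// Subgroup scope: work items are split into consecutive chunks of
// `subgroupSize` (an assumption of this model), and every work item
// receives the sum over its own chunk only.
std::vector<float> subgroupReduceAdd(const std::vector<float> &values,
                                     std::size_t subgroupSize) {
  if (subgroupSize == 0)
    throw std::invalid_argument("subgroupSize must be non-zero");
  std::vector<float> result(values.size());
  for (std::size_t begin = 0; begin < values.size(); begin += subgroupSize) {
    std::size_t end = std::min(begin + subgroupSize, values.size());
    float partial =
        std::accumulate(values.begin() + begin, values.begin() + end, 0.0f);
    // The result is the same for all work items of one subgroup.
    std::fill(result.begin() + begin, result.begin() + end, partial);
  }
  return result;
}
```

In this model a workgroup-wide reduction could be built by combining the per-subgroup partial results, which is the kind of higher-level abstraction the summary refers to.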

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Hardcode84 created this revision. Oct 5 2022, 3:02 PM

Herald added a reviewer: bondhugula. Oct 5 2022, 3:02 PM

Herald added a reviewer: ThomasRaoux.

Herald added a project: Restricted Project.

Herald added subscribers: zero9178, bzcheeseman, sdasgup3 and 20 others.

Hardcode84 requested review of this revision. Oct 5 2022, 3:02 PM

Herald added a reviewer: nicolasvasilache. Oct 5 2022, 3:02 PM

Herald added a reviewer: herhut.

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

fix typos

Harbormaster completed remote builds in B190613: Diff 465568. Oct 5 2022, 3:57 PM

Are you planning to add a lowering for it? This can be useful to at least "enforce the semantics". For instance, it should be easy to have a lowering of this to shuffle ops.

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
727–728	ThomasRaoux: Can you clarify the behavior when some lanes in the subgroup are inactive?

Are you planning to add a lowering for it? This can be useful to at least "enforce the semantics". For instance, it should be easy to have a lowering of this to shuffle ops.

We are planning to lower them directly to GroupNonUniformXYZ SPIR-V ops. Those ops allow both Workgroup and Subgroup scope, but the Intel Level Zero/Intel Graphics Compiler stack we are targeting only supports the Subgroup version at the moment.
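
A shuffle-based lowering like the one suggested above is commonly written as a butterfly: at each step every lane combines its value with the value held by lane `lane ^ offset`, and `offset` is halved until it reaches 1. The host-side C++ sketch below models that pattern for `add`. It is not the lowering from this patch (the patch does not include one), `butterflyReduceAdd` is a hypothetical name, and the model assumes a power-of-two subgroup size with all lanes active.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Models a subgroup as a vector with one element per lane and returns the
// per-lane values after a butterfly (XOR shuffle) sum reduction.
std::vector<float> butterflyReduceAdd(std::vector<float> lanes) {
  const std::size_t n = lanes.size();
  if (n == 0 || (n & (n - 1)) != 0)
    throw std::invalid_argument("subgroup size must be a power of two");

  for (std::size_t offset = n / 2; offset > 0; offset /= 2) {
    // Step 1: every lane reads the value of its XOR partner. This models
    // what `gpu.shuffle xor` would deliver to each lane.
    std::vector<float> shuffled(n);
    for (std::size_t lane = 0; lane < n; ++lane)
      shuffled[lane] = lanes[lane ^ offset];

    // Step 2: every lane accumulates the received value.
    for (std::size_t lane = 0; lane < n; ++lane)
      lanes[lane] += shuffled[lane];
  }
  // After log2(n) steps every lane holds the sum of all original values.
  return lanes;
}
```

Reading all partner values before accumulating mirrors the lockstep behavior of a real shuffle, where no lane observes a partially updated neighbor.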

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
727–728	Hardcode84: The SPIR-V ops we are targeting, such as `GroupNonUniformFAddOp`, allow non-uniform execution, but we don't need it at the moment. I will add a uniform requirement similar to `all_reduce`. We can always add something like a `non-uniform` flag to these ops later if needed.

add uniform requirement

Harbormaster completed remote builds in B190694: Diff 465682. Oct 6 2022, 2:30 AM

LGTM

This revision is now accepted and ready to land. Oct 10 2022, 2:16 PM

Closed by commit rGb845addae89b: [mlir][gpu] Add `subgroup_reduce` operation (authored by Hardcode84). Oct 11 2022, 2:49 AM

This revision was automatically updated to reflect the committed changes.

Hardcode84 added a commit: rGb845addae89b: [mlir][gpu] Add `subgroup_reduce` operation.

Revision Contents

Path	Size
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td	24 lines
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp	29 lines
mlir/test/Dialect/GPU/invalid.mlir	8 lines
mlir/test/Dialect/GPU/ops.mlir	3 lines

Diff 466749

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td

let description = [{
in convergence.		in convergence.
}];		}];
let regions = (region AnyRegion:$body);		let regions = (region AnyRegion:$body);
let assemblyFormat = [{ custom<AllReduceOperation>($op) $value $body attr-dict		let assemblyFormat = [{ custom<AllReduceOperation>($op) $value $body attr-dict
`:` functional-type(operands, results) }];		`:` functional-type(operands, results) }];
let hasRegionVerifier = 1;		let hasRegionVerifier = 1;
}		}

		def GPU_SubgroupReduceOp : GPU_Op<"subgroup_reduce",
		[SameOperandsAndResultType]>,
		Arguments<(ins AnyType:$value,
		GPU_AllReduceOperationAttr:$op)>,
		Results<(outs AnyType)> {
		let summary = "Reduce values among subgroup.";
		let description = [{
		The `subgroup_reduce` op reduces the value of every work item across a
		subgroup. The result is equal for all work items of a subgroup.

		Example:

		```mlir
		%1 = gpu.subgroup_reduce add %0 : (f32) -> (f32)
		```

		Either none or all work items of a subgroup need to execute this op
		in convergence.
		}];
		let assemblyFormat = [{ custom<AllReduceOperation>($op) $value attr-dict
		`:` functional-type(operands, results) }];
		let hasVerifier = 1;
		}

def GPU_ShuffleOpXor : I32EnumAttrCase<"XOR", 0, "xor">;		def GPU_ShuffleOpXor : I32EnumAttrCase<"XOR", 0, "xor">;
def GPU_ShuffleOpDown : I32EnumAttrCase<"DOWN", 1, "down">;		def GPU_ShuffleOpDown : I32EnumAttrCase<"DOWN", 1, "down">;
def GPU_ShuffleOpUp : I32EnumAttrCase<"UP", 2, "up">;		def GPU_ShuffleOpUp : I32EnumAttrCase<"UP", 2, "up">;
def GPU_ShuffleOpIdx : I32EnumAttrCase<"IDX", 3, "idx">;		def GPU_ShuffleOpIdx : I32EnumAttrCase<"IDX", 3, "idx">;

def GPU_ShuffleMode : I32EnumAttr<"ShuffleMode",		def GPU_ShuffleMode : I32EnumAttr<"ShuffleMode",
"Indexing modes supported by gpu.shuffle.",		"Indexing modes supported by gpu.shuffle.",
[		[

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp

static void printAsyncDependencies(OpAsmPrinter &printer, Operation *op,
llvm::interleaveComma(asyncDependencies, printer);		llvm::interleaveComma(asyncDependencies, printer);
printer << ']';		printer << ']';
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// AllReduceOp		// AllReduceOp
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

		static bool verifyReduceOpAndType(gpu::AllReduceOperation opName,
		Type resType) {
		if ((opName == gpu::AllReduceOperation::AND ||
		opName == gpu::AllReduceOperation::OR ||
		opName == gpu::AllReduceOperation::XOR) &&
		!resType.isa<IntegerType>())
		return false;

		return true;
		}

LogicalResult gpu::AllReduceOp::verifyRegions() {		LogicalResult gpu::AllReduceOp::verifyRegions() {
if (getBody().empty() != getOp().has_value())		if (getBody().empty() != getOp().has_value())
return emitError("expected either an op attribute or a non-empty body");		return emitError("expected either an op attribute or a non-empty body");
if (!getBody().empty()) {		if (!getBody().empty()) {
if (getBody().getNumArguments() != 2)		if (getBody().getNumArguments() != 2)
return emitError("expected two region arguments");		return emitError("expected two region arguments");
for (auto argument : getBody().getArguments()) {		for (auto argument : getBody().getArguments()) {
if (argument.getType() != getType())		if (argument.getType() != getType())
return emitError("incorrect region argument type");		return emitError("incorrect region argument type");
}		}
unsigned yieldCount = 0;		unsigned yieldCount = 0;
for (Block &block : getBody()) {		for (Block &block : getBody()) {
if (auto yield = dyn_cast<gpu::YieldOp>(block.getTerminator())) {		if (auto yield = dyn_cast<gpu::YieldOp>(block.getTerminator())) {
if (yield.getNumOperands() != 1)		if (yield.getNumOperands() != 1)
return emitError("expected one gpu.yield operand");		return emitError("expected one gpu.yield operand");
if (yield.getOperand(0).getType() != getType())		if (yield.getOperand(0).getType() != getType())
return emitError("incorrect gpu.yield type");		return emitError("incorrect gpu.yield type");
++yieldCount;		++yieldCount;
}		}
}		}
if (yieldCount == 0)		if (yieldCount == 0)
return emitError("expected gpu.yield op in region");		return emitError("expected gpu.yield op in region");
} else {		} else {
gpu::AllReduceOperation opName = *getOp();		gpu::AllReduceOperation opName = *getOp();
if ((opName == gpu::AllReduceOperation::AND ||		if (!verifyReduceOpAndType(opName, getType())) {
opName == gpu::AllReduceOperation::OR ||
opName == gpu::AllReduceOperation::XOR) &&
!getType().isa<IntegerType>()) {
return emitError()		return emitError()
<< '`' << gpu::stringifyAllReduceOperation(opName)		<< '`' << gpu::stringifyAllReduceOperation(opName)
<< "` accumulator is only compatible with Integer type";		<< "` accumulator is only compatible with Integer type";
}		}
}		}
return success();		return success();
}		}

Show All 12 Lines

static void printAllReduceOperation(AsmPrinter &printer, Operation *op,		static void printAllReduceOperation(AsmPrinter &printer, Operation *op,
AllReduceOperationAttr attr) {		AllReduceOperationAttr attr) {
if (attr)		if (attr)
attr.print(printer);		attr.print(printer);
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
		// SubgroupReduceOp
		//===----------------------------------------------------------------------===//

		LogicalResult gpu::SubgroupReduceOp::verify() {
		gpu::AllReduceOperation opName = getOp();
		if (!verifyReduceOpAndType(opName, getType())) {
		return emitError() << '`' << gpu::stringifyAllReduceOperation(opName)
		<< "` accumulator is only compatible with Integer type";
		}
		return success();
		}

		//===----------------------------------------------------------------------===//
// AsyncOpInterface		// AsyncOpInterface
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

void gpu::addAsyncDependency(Operation *op, Value token) {		void gpu::addAsyncDependency(Operation *op, Value token) {
op->insertOperands(0, {token});		op->insertOperands(0, {token});
if (!op->template hasTrait<OpTrait::AttrSizedOperandSegments>())		if (!op->template hasTrait<OpTrait::AttrSizedOperandSegments>())
return;		return;
auto attrName =		auto attrName =

mlir/test/Dialect/GPU/invalid.mlir

	func.func @reduce_invalid_op_type(%arg0 : f32) {			func.func @reduce_invalid_op_type(%arg0 : f32) {
	// expected-error@+1 {{`and` accumulator is only compatible with Integer type}}			// expected-error@+1 {{`and` accumulator is only compatible with Integer type}}
	%res = gpu.all_reduce and %arg0 {} : (f32) -> (f32)			%res = gpu.all_reduce and %arg0 {} : (f32) -> (f32)
	return			return
	}			}

	// -----			// -----

				func.func @subgroup_reduce_invalid_op_type(%arg0 : f32) {
				// expected-error@+1 {{`and` accumulator is only compatible with Integer type}}
				%res = gpu.subgroup_reduce and %arg0 : (f32) -> (f32)
				return
				}

				// -----

	func.func @reduce_incorrect_region_arguments(%arg0 : f32) {			func.func @reduce_incorrect_region_arguments(%arg0 : f32) {
	// expected-error@+1 {{expected two region arguments}}			// expected-error@+1 {{expected two region arguments}}
	%res = gpu.all_reduce %arg0 {			%res = gpu.all_reduce %arg0 {
	^bb(%lhs : f32):			^bb(%lhs : f32):
	"gpu.yield"(%lhs) : (f32) -> ()			"gpu.yield"(%lhs) : (f32) -> ()
	} : (f32) -> (f32)			} : (f32) -> (f32)
	return			return
	}			}

mlir/test/Dialect/GPU/ops.mlir

gpu.func @kernel_1(%arg0 : f32, %arg1 : memref<?xf32, 1>) kernel {

%sgId = gpu.subgroup_id : index		%sgId = gpu.subgroup_id : index
%numSg = gpu.num_subgroups : index		%numSg = gpu.num_subgroups : index
%SgSi = gpu.subgroup_size : index		%SgSi = gpu.subgroup_size : index

%one = arith.constant 1.0 : f32		%one = arith.constant 1.0 : f32
%sum = gpu.all_reduce add %one {} : (f32) -> (f32)		%sum = gpu.all_reduce add %one {} : (f32) -> (f32)

		// CHECK: %{{.*}} = gpu.subgroup_reduce add %{{.*}} : (f32) -> f32
		%sum_subgroup = gpu.subgroup_reduce add %one : (f32) -> f32

%width = arith.constant 7 : i32		%width = arith.constant 7 : i32
%offset = arith.constant 3 : i32		%offset = arith.constant 3 : i32
// CHECK: gpu.shuffle xor %{{.*}}, %{{.*}}, %{{.*}} : f32		// CHECK: gpu.shuffle xor %{{.*}}, %{{.*}}, %{{.*}} : f32
%shfl, %pred = gpu.shuffle xor %arg0, %offset, %width : f32		%shfl, %pred = gpu.shuffle xor %arg0, %offset, %width : f32
// CHECK: gpu.shuffle up %{{.*}}, %{{.*}}, %{{.*}} : f32		// CHECK: gpu.shuffle up %{{.*}}, %{{.*}}, %{{.*}} : f32
%shfl1, %pred1 = gpu.shuffle up %arg0, %offset, %width : f32		%shfl1, %pred1 = gpu.shuffle up %arg0, %offset, %width : f32
// CHECK: gpu.shuffle down %{{.*}}, %{{.*}}, %{{.*}} : f32		// CHECK: gpu.shuffle down %{{.*}}, %{{.*}}, %{{.*}} : f32
%shfl2, %pred2 = gpu.shuffle down %arg0, %offset, %width : f32		%shfl2, %pred2 = gpu.shuffle down %arg0, %offset, %width : f32
