This is an archive of the discontinued LLVM Phabricator instance.

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
692	I wonder if the flag should be inverted and be `uniform` instead. `non_uniform` should be the default, as it is the most conservative case. This way, unless specified otherwise, the op is created without the uniform assumption. What do you think?
mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
398	Should this be an error?

In D138758#3989123, @ThomasRaoux wrote:

Sorry for the slow review. Why are we adding it without any lowering? The goal is to lower to SPIRV that has those semantics?

I was planning to add the spirv lowering in a separate review.

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
692	The idea itself is OK with me, but:
* We are changing the default semantics, and that can break user code without any warning.
* Historically, at least in spirv, uniform ops appeared much earlier than non-uniform ones (i.e. `GroupFAdd` is 1.0 but `GroupNonUniformFAdd` is 1.3).
mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
398	There is no good way to return an error from the rewrite, I think. Even if we use `emitError`, it will only change the error message; it won't propagate the error.

ThomasRaoux added inline comments. Dec 12 2022, 11:54 AM

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
692	Using the non-uniform version should be conservative and correct at the GPU level. The part that does break is the lowering, and there should be an explicit lowering failure there.
Regarding "Historically, at least in spirv, uniform ops appeared much earlier than non-uniform (i.e. GroupFAdd is 1.0 but GroupNonUniformFAdd is 1.3)": I wonder if this is due to the order in which those got added. If they were added at the same time and/or we didn't have to keep compatibility, I would guess that's not how they would have been done.
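To make the two flavours concrete, here is a toy C++ model (not MLIR or SPIR-V code). `Workgroup`, `reduceAddNonUniform`, and `reduceAddUniform` are hypothetical names, and the assumption that a non-uniform reduction combines only the lanes that reach the op loosely follows the SPIR-V `GroupNonUniform*` instructions rather than anything the gpu dialect specifies:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of one workgroup: lane i holds values[i] and executes the
// reduction only if active[i] is true. All names here are hypothetical.
struct Workgroup {
  std::vector<float> values;
  std::vector<bool> active;
};

// Non-uniform flavour (assumed semantics): defined for any set of active
// lanes, because it only sums the lanes that actually reach the op.
float reduceAddNonUniform(const Workgroup &wg) {
  assert(wg.values.size() == wg.active.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < wg.values.size(); ++i)
    if (wg.active[i])
      sum += wg.values[i];
  return sum;
}

// Uniform flavour: only meaningful when either none or all lanes execute the
// op. The model returns false (and leaves `result` untouched) where a real
// target would give no guarantee.
bool reduceAddUniform(const Workgroup &wg, float &result) {
  assert(wg.values.size() == wg.active.size());
  std::size_t numActive = 0;
  for (bool isActive : wg.active)
    numActive += isActive ? 1 : 0;
  if (numActive != 0 && numActive != wg.active.size())
    return false; // Divergent execution breaks the uniform contract.
  result = reduceAddNonUniform(wg);
  return true;
}
```

With only some of the lanes active, `reduceAddNonUniform` still returns a well-defined sum while `reduceAddUniform` reports a contract violation, which is the sense in which non-uniform is the conservative default.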
mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp
398	Yes, in general we should just fail the pattern. Here it is a bit awkward because this pattern tries to convert the whole function. I think the best option would be to fail the pattern and return `notifyMatchFailure`.
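The shape of that fix, checking every reduction first and rewriting only if all of them are supported, can be sketched without MLIR. The following is a minimal standalone C++ model under stated assumptions: `ReduceOp` and `lowerAllReduces` are hypothetical stand-ins for `gpu.all_reduce` and the pattern's `matchAndRewrite`, and the `false` return plays the role of `notifyMatchFailure`:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for gpu.all_reduce: only the fields the sketch needs.
struct ReduceOp {
  std::string name;
  bool uniform;
  bool lowered;
};

// Phase 1 scans every op and bails out on the first unsupported one, before
// anything has been modified. Phase 2 rewrites the collected ops. Returns
// false (the analogue of notifyMatchFailure) when the body is left untouched.
bool lowerAllReduces(std::vector<ReduceOp> &funcBody) {
  std::vector<ReduceOp *> toLower;
  for (ReduceOp &op : funcBody) {
    if (!op.uniform)
      return false; // Plays the role of WalkResult::interrupt().
    toLower.push_back(&op);
  }
  for (ReduceOp *op : toLower) {
    assert(!op->lowered && "an op must not be lowered twice");
    op->lowered = true; // Stands in for the actual rewrite of the op.
  }
  return true;
}
```

Because the scan finishes before any op is touched, a function that mixes uniform and non-uniform reductions is rejected as a whole rather than being left half-converted.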

rebase, invert flag, fix pattern

Herald added subscribers: mattd, gchakrabarti, jholewinski. Dec 13 2022, 10:06 AM

Hardcode84 marked 4 inline comments as done. Dec 13 2022, 10:07 AM

ThomasRaoux accepted this revision. Dec 13 2022, 10:10 AM

This revision is now accepted and ready to land. Dec 13 2022, 10:10 AM

Hardcode84 retitled this revision from [mlir][gpu] Add `non_uniform` flag to gpu reduction ops to [mlir][gpu] Add `uniform` flag to gpu reduction ops. Dec 13 2022, 10:11 AM

Harbormaster completed remote builds in B202891: Diff 482538. Dec 13 2022, 11:22 AM

Closed by commit rG247d8d4f7ab1: [mlir][gpu] Add `uniform` flag to gpu reduction ops (authored by Hardcode84). Dec 14 2022, 4:16 AM

This revision was automatically updated to reflect the committed changes.

Hardcode84 added a commit: rG247d8d4f7ab1: [mlir][gpu] Add `uniform` flag to gpu reduction ops.

Hardcode84 mentioned this in D140014: [mlir][gpu] Fix cuda integration tests. Dec 14 2022, 4:51 AM

Fix for integration tests https://reviews.llvm.org/D140014

Hardcode84 mentioned this in rGbefd167050ce: [mlir][gpu] Fix cuda integration tests. Dec 14 2022, 5:01 AM

Revision Contents

Path (Size)

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td (20 lines)
mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp (23 lines)
mlir/test/Conversion/GPUToNVVM/gpu-to-nvvm.mlir (4 lines)
mlir/test/Dialect/GPU/all-reduce-max.mlir (2 lines)
mlir/test/Dialect/GPU/all-reduce.mlir (2 lines)
mlir/test/Dialect/GPU/multiple-all-reduce.mlir (4 lines)
mlir/test/Dialect/GPU/ops.mlir (10 lines)

Diff 482806

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td

[682 preceding lines not shown; enclosing context: def GPU_AllReduceOperation : I32EnumAttr<"AllReduceOperation",]
   let cppNamespace = "::mlir::gpu";
 }
 def GPU_AllReduceOperationAttr : EnumAttr<GPU_Dialect, GPU_AllReduceOperation,
     "all_reduce_op">;

 def GPU_AllReduceOp : GPU_Op<"all_reduce",
     [SameOperandsAndResultType, IsolatedFromAbove]>,
     Arguments<(ins AnyType:$value,
-               OptionalAttr<GPU_AllReduceOperationAttr>:$op)>,
+               OptionalAttr<GPU_AllReduceOperationAttr>:$op,
+               UnitAttr:$uniform)>,
     Results<(outs AnyType)> {
   let summary = "Reduce values among workgroup.";
   let description = [{
     The `all_reduce` op reduces the value of every work item across a local
     workgroup. The result is equal for all work items of a workgroup.

     For example, both

     ```mlir
     %1 = gpu.all_reduce add %0 {} : (f32) -> (f32)
     %2 = gpu.all_reduce %0 {
     ^bb(%lhs : f32, %rhs : f32):
       %sum = arith.addf %lhs, %rhs : f32
       "gpu.yield"(%sum) : (f32) -> ()
     } : (f32) -> (f32)
     ```

     compute the sum of each work item's %0 value. The first version specifies
     the accumulation as operation, whereas the second version specifies the
     accumulation as code region. The accumulation operation must be one of:
     `add`, `and`, `max`, `min`, `mul`, `or`, `xor`.

-    Either none or all work items of a workgroup need to execute this op
-    in convergence.
+    If `uniform` flag is set either none or all work items of a workgroup
+    need to execute this op in convergence.
   }];
   let regions = (region AnyRegion:$body);
-  let assemblyFormat = [{ custom<AllReduceOperation>($op) $value $body attr-dict
+  let assemblyFormat = [{ custom<AllReduceOperation>($op) $value
+                          (`uniform` $uniform^)? $body attr-dict
                           `:` functional-type(operands, results) }];
   let hasRegionVerifier = 1;
 }

 def GPU_SubgroupReduceOp : GPU_Op<"subgroup_reduce",
     [SameOperandsAndResultType]>,
     Arguments<(ins AnyType:$value,
-               GPU_AllReduceOperationAttr:$op)>,
+               GPU_AllReduceOperationAttr:$op,
+               UnitAttr:$uniform)>,
     Results<(outs AnyType)> {
   let summary = "Reduce values among subgroup.";
   let description = [{
     The `subgroup_reduce` op reduces the value of every work item across a
     subgroup. The result is equal for all work items of a subgroup.

     Example:

     ```mlir
     %1 = gpu.subgroup_reduce add %0 : (f32) -> (f32)
     ```

-    Either none or all work items of a subgroup need to execute this op
-    in convergence.
+    If `uniform` flag is set either none or all work items of a subgroup
+    need to execute this op in convergence.
   }];
-  let assemblyFormat = [{ custom<AllReduceOperation>($op) $value attr-dict
+  let assemblyFormat = [{ custom<AllReduceOperation>($op) $value
+                          (`uniform` $uniform^)? attr-dict
                           `:` functional-type(operands, results) }];
   let hasVerifier = 1;
 }

 def GPU_ShuffleOpXor : I32EnumAttrCase<"XOR", 0, "xor">;
 def GPU_ShuffleOpDown : I32EnumAttrCase<"DOWN", 1, "down">;
 def GPU_ShuffleOpUp : I32EnumAttrCase<"UP", 2, "up">;
 def GPU_ShuffleOpIdx : I32EnumAttrCase<"IDX", 3, "idx">;
[589 following lines not shown]

mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp

[388 preceding lines not shown]

 struct GpuAllReduceConversion : public RewritePattern {
   explicit GpuAllReduceConversion(MLIRContext *context)
       : RewritePattern(gpu::GPUFuncOp::getOperationName(), 1, context) {}

   LogicalResult matchAndRewrite(Operation *op,
                                 PatternRewriter &rewriter) const override {
     auto funcOp = cast<gpu::GPUFuncOp>(op);
-    auto callback = [&](gpu::AllReduceOp reduceOp) {
-      GpuAllReduceRewriter(funcOp, reduceOp, rewriter).rewrite();
-      // Performing a rewrite invalidates the walk iterator. Report interrupt
-      // so that we can start a new walk until all all_reduce ops are replaced.
-      return WalkResult::interrupt();
+    SmallVector<gpu::AllReduceOp> reduceOps;
+    auto callback = [&](gpu::AllReduceOp reduceOp) -> WalkResult {
+      if (!reduceOp.getUniform())
+        return WalkResult::interrupt();
+
+      reduceOps.emplace_back(reduceOp);
+      return WalkResult::advance();
     };
-    while (funcOp.walk(callback).wasInterrupted()) {
-    }
+    if (funcOp.walk(callback).wasInterrupted())
+      return rewriter.notifyMatchFailure(
+          op, "Non uniform reductions are not supported yet.");
+
+    for (gpu::AllReduceOp reduceOp : reduceOps)
+      GpuAllReduceRewriter(funcOp, reduceOp, rewriter).rewrite();
+
     return success();
   }
 };
 } // namespace

 void mlir::populateGpuAllReducePatterns(RewritePatternSet &patterns) {
   patterns.add<GpuAllReduceConversion>(patterns.getContext());
 }

mlir/test/Conversion/GPUToNVVM/gpu-to-nvvm.mlir

[83 preceding lines not shown]
 gpu.module @test_module {
   // CHECK-LABEL: func @gpu_all_reduce_op()
   gpu.func @gpu_all_reduce_op() {
     %arg0 = arith.constant 1.0 : f32
     // TODO: Check full IR expansion once lowering has settled.
     // CHECK: nvvm.shfl.sync bfly {{.*}}
     // CHECK: nvvm.barrier0
     // CHECK: llvm.fadd
-    %result = gpu.all_reduce add %arg0 {} : (f32) -> (f32)
+    %result = gpu.all_reduce add %arg0 uniform {} : (f32) -> (f32)

     gpu.return
   }
 }

 // -----

 gpu.module @test_module {
   // CHECK-LABEL: func @gpu_all_reduce_region()
   gpu.func @gpu_all_reduce_region() {
     %arg0 = arith.constant 1 : i32
     // TODO: Check full IR expansion once lowering has settled.
     // CHECK: nvvm.shfl.sync bfly {{.*}}
     // CHECK: nvvm.barrier0
-    %result = gpu.all_reduce %arg0 {
+    %result = gpu.all_reduce %arg0 uniform {
     ^bb(%lhs : i32, %rhs : i32):
       %xor = arith.xori %lhs, %rhs : i32
       "gpu.yield"(%xor) : (i32) -> ()
     } : (i32) -> (i32)
     gpu.return
   }
 }

[388 following lines not shown]

mlir/test/Dialect/GPU/all-reduce-max.mlir

[189 preceding lines not shown; enclosing context: gpu.func @kernel(%arg0 : f32) kernel {]
     // CHECK: cf.br ^bb40([[VAL_132]] : f32)
     // CHECK: ^bb40([[VAL_133:%.*]]: f32):
     // CHECK: store [[VAL_133]], [[VAL_1]]{{\[}}[[VAL_4]]] : memref<32xf32, 3>
     // CHECK: cf.br ^bb42
     // CHECK: ^bb41:
     // CHECK: cf.br ^bb42
     // CHECK: ^bb42:
     // CHECK: gpu.barrier
-    %sum = gpu.all_reduce max %arg0 {} : (f32) -> (f32)
+    %sum = gpu.all_reduce max %arg0 uniform {} : (f32) -> (f32)
     gpu.return
   }

 }

mlir/test/Dialect/GPU/all-reduce.mlir

[169 preceding lines not shown; enclosing context: gpu.func @kernel(%arg0 : f32) kernel {]
     // CHECK: cf.br ^bb40([[VAL_112]] : f32)
     // CHECK: ^bb40([[VAL_113:%.*]]: f32):
     // CHECK: store [[VAL_113]], [[VAL_1]]{{\[}}[[VAL_4]]] : memref<32xf32, 3>
     // CHECK: cf.br ^bb42
     // CHECK: ^bb41:
     // CHECK: cf.br ^bb42
     // CHECK: ^bb42:
     // CHECK: gpu.barrier
-    %sum = gpu.all_reduce add %arg0 {} : (f32) -> (f32)
+    %sum = gpu.all_reduce add %arg0 uniform {} : (f32) -> (f32)
     gpu.return
   }

 }

mlir/test/Dialect/GPU/multiple-all-reduce.mlir

 // RUN: mlir-opt --gpu-kernel-outlining --convert-gpu-to-nvvm %s | FileCheck %s

 func.func @main() {
   %data = memref.alloc() : memref<2x6xf32>
   %sum = memref.alloc() : memref<2xf32>
   %mul = memref.alloc() : memref<2xf32>
   %c1 = arith.constant 1 : index

   // ADD + MUL
   gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)
              threads(%tx, %ty, %tz) in (%block_x = %c1, %block_y = %c1, %block_z = %c1) {
     %val = memref.load %data[%bx, %tx] : memref<2x6xf32>
-    %reduced0 = gpu.all_reduce add %val {} : (f32) -> (f32)
+    %reduced0 = gpu.all_reduce add %val uniform {} : (f32) -> (f32)
     memref.store %reduced0, %sum[%bx] : memref<2xf32>
-    %reduced1 = gpu.all_reduce mul %val {} : (f32) -> (f32)
+    %reduced1 = gpu.all_reduce mul %val uniform {} : (f32) -> (f32)
     memref.store %reduced1, %mul[%bx] : memref<2xf32>
     gpu.terminator
   }

   // CHECK: gpu.module @main_kernel {
   // CHECK-NEXT: llvm.mlir.global internal @{{.*}}() {addr_space = 3 : i32} : !llvm.array<32 x f32>
   // CHECK-NEXT: llvm.mlir.global internal @{{.*}}() {addr_space = 3 : i32} : !llvm.array<32 x f32>

   return
 }

mlir/test/Dialect/GPU/ops.mlir

[77 preceding lines not shown; enclosing context: gpu.func @kernel_1(%arg0 : f32, %arg1 : memref<?xf32, 1>) kernel {]
     %gIdY = gpu.global_id y
     %gIdZ = gpu.global_id z

     %sgId = gpu.subgroup_id : index
     %numSg = gpu.num_subgroups : index
     %SgSi = gpu.subgroup_size : index

     %one = arith.constant 1.0 : f32

+    // CHECK: %{{.*}} = gpu.all_reduce add %{{.*}} {
+    // CHECK-NEXT: } : (f32) -> f32
     %sum = gpu.all_reduce add %one {} : (f32) -> (f32)

+    // CHECK: %{{.*}} = gpu.all_reduce add %{{.*}} uniform {
+    // CHECK-NEXT: } : (f32) -> f32
+    %sum1 = gpu.all_reduce add %one uniform {} : (f32) -> f32
+
     // CHECK: %{{.*}} = gpu.subgroup_reduce add %{{.*}} : (f32) -> f32
     %sum_subgroup = gpu.subgroup_reduce add %one : (f32) -> f32

+    // CHECK: %{{.*}} = gpu.subgroup_reduce add %{{.*}} uniform : (f32) -> f32
+    %sum_subgroup1 = gpu.subgroup_reduce add %one uniform : (f32) -> f32
+
     %width = arith.constant 7 : i32
     %offset = arith.constant 3 : i32
     // CHECK: gpu.shuffle xor %{{.*}}, %{{.*}}, %{{.*}} : f32
     %shfl, %pred = gpu.shuffle xor %arg0, %offset, %width : f32
     // CHECK: gpu.shuffle up %{{.*}}, %{{.*}}, %{{.*}} : f32
     %shfl1, %pred1 = gpu.shuffle up %arg0, %offset, %width : f32
     // CHECK: gpu.shuffle down %{{.*}}, %{{.*}}, %{{.*}} : f32
     %shfl2, %pred2 = gpu.shuffle down %arg0, %offset, %width : f32
[212 following lines not shown]

[mlir][gpu] Add `uniform` flag to gpu reduction ops (Closed, Public)
