This patch adds support for i64 and f64 values in gpu.shuffle.
The reason behind this change is that both CUDA and HIP support this kind of shuffling.
Diff Detail
- Repository: rG LLVM Github Monorepo

Event Timeline
mlir/lib/Dialect/GPU/Transforms/ShuffleRewriter.cpp
- Lines 49–88: This implementation is based on clang's implementation of the operation, as GPUs natively support only 32-bit shuffles.

mlir/test/Dialect/GPU/shuffle-rewrite.mlir
- Lines 7–22: This test verifies the op for f64.
- Line 22: This memref.store is needed; otherwise the operation is folded away by applyPatternsAndFoldGreedily.
- Lines 31–51: This test verifies the op for i64.
mlir/lib/Dialect/GPU/Transforms/ShuffleRewriter.cpp
- Lines 49–88: Can you paste a link here to that impl? Just so I can compare side-by-side...
mlir/lib/Dialect/GPU/Transforms/ShuffleRewriter.cpp
- Lines 49–88: In this case clang uses a struct to hide the truncation, shifting and extensions:

```cpp
double __attribute__((__used__)) __device__ shfl(double val, int delta, int width) {
  return __shfl_down_sync(0xFFFFFFFF, val, delta, width);
}
```

When compiled with `clang++ -O3 --offload-device-only shfl.cu -S -emit-llvm -o shfl.ll`, you get:

```llvm
; Function Attrs: convergent mustprogress nounwind
define dso_local noundef double @_Z4shfldii(double noundef %0, i32 noundef %1, i32 noundef %2) #0 {
  %4 = bitcast double %0 to i64
  %5 = trunc i64 %4 to i32
  %6 = lshr i64 %4, 32
  %7 = trunc i64 %6 to i32
  %8 = mul i32 %2, -256
  %9 = add i32 %8, 8223
  %10 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 %5, i32 %1, i32 %9)
  %11 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 %7, i32 %1, i32 %9)
  %12 = zext i32 %11 to i64
  %13 = shl nuw i64 %12, 32
  %14 = zext i32 %10 to i64
  %15 = or i64 %13, %14
  %16 = bitcast i64 %15 to double
  ret double %16
}
```

The extra %8, %9 in the above code get created because NVVM doesn't use width as the third parameter, but instead a special value.
Looks good to me. The only nit I would ask is to add a comment explaining that the set of shifts and truncates and such can be double-checked against clang (basically one sentence summarizing that you can generate the .ll from the macro that's in clang).
> The reason behind this change is that both CUDA & HIP support this kind of shuffling.
This is not supported in the PTX spec. What kind of lowering are you expecting for this op? If this is going to be broken down anyway, I don't think adding it here makes sense. This is the reason why it wasn't added for non-32-bit types.
Could you give more details on how you are planning to lower this to rocdl/nvvm or other dialects?
You're right, they're not natively supported. However, I added a pattern (see ShuffleRewriter.cpp) to transform those instructions into supported ones: it rewrites the op into two 32-bit shuffles, which is also what HIP and CUDA do internally.
Added more context specifying that the implementation provided by this patch mirrors that of clang.
This implementation is based on clang's implementation of the operation, as GPUs natively support only 32-bit shuffles.