This patch adds support for i64, f64 values in gpu.shuffle.
The reason behind this change is that both CUDA & HIP support this kind of shuffling.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
mlir/lib/Dialect/GPU/Transforms/ShuffleRewriter.cpp | ||
---|---|---|
48–87 | This implementation is based on clang's implementation of the operation, as GPU natively only support 32 bit shuffles. | |
mlir/test/Dialect/GPU/shuffle-rewrite.mlir | ||
6–21 | This test verifies the op for f64. | |
21 | This memref.store is need it otherwise the operation is folded away by applyPatternsAndFoldGreedily. | |
30–50 | This test verifies the op for i64. |
mlir/lib/Dialect/GPU/Transforms/ShuffleRewriter.cpp | ||
---|---|---|
48–87 | can you paste a link here to that impl? just so I can compare side-by-side... |
mlir/lib/Dialect/GPU/Transforms/ShuffleRewriter.cpp | ||
---|---|---|
48–87 | In this case clang uses a struct to hide the truncation, shifting and extensions: double __attribute__((__used__)) __device__ shfl(double val, int delta, int width) { return __shfl_down_sync(0xFFFFFFFF, val, delta, width); } When compiled with: clang++ -O3 --offload-device-only shfl.cu -S -emit-llvm -o shfl.ll, you get: ; Function Attrs: convergent mustprogress nounwind define dso_local noundef double @_Z4shfldii(double noundef %0, i32 noundef %1, i32 noundef %2) #0 { %4 = bitcast double %0 to i64 %5 = trunc i64 %4 to i32 %6 = lshr i64 %4, 32 %7 = trunc i64 %6 to i32 %8 = mul i32 %2, -256 %9 = add i32 %8, 8223 %10 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 %5, i32 %1, i32 %9) %11 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 %7, i32 %1, i32 %9) %12 = zext i32 %11 to i64 %13 = shl nuw i64 %12, 32 %14 = zext i32 %10 to i64 %15 = or i64 %13, %14 %16 = bitcast i64 %15 to double ret double %16 } The extra %8, %9 in the above code get created because NVVM doesn't use width as the third parameter, but instead a special value: |
Looks good to me - only nit I I would ask is add a comment explaining that the set of shifts and truncates and such can be double checked against clang (just basically one sentence summarizing that can you generate the .ll from the macro that's in clang).
The reason behind this change is that both CUDA & HIP support this kind of shuffling.
This is not supported in ptx spec. What kind of lowering are you expecting for this op? If this is going to be broken down anyway I don't think adding it here makes sense. This is the reason why it wasn't added for non 32bits type.
Could you give more details on how you are planning to lower this to rocdl/nvvm or other dialects?
You're right, they're not natively supported. However I added a pattern (see file ShuffleRewriter.cpp) to transform those instructions into supported instructions, basically it rewrites the op into 2 shuffles of 32 bits, this is also what HIP and CUDA do internally.
Added more context specifying the implementation provided by this patch mirrors that of clang.
This implementation is based on clang's implementation of the operation, as GPU natively only support 32 bit shuffles.