This is an archive of the discontinued LLVM Phabricator instance.

[mlir][gpu] Add i64 & f64 support to gpu.shuffle
ClosedPublic

Authored by fmorac on Apr 21 2023, 5:14 PM.

Details

Summary

This patch adds support for i64 and f64 values in gpu.shuffle.
The reason behind this change is that both CUDA & HIP support this kind of shuffling.

Diff Detail

Event Timeline

fmorac created this revision. Apr 21 2023, 5:14 PM
fmorac updated this revision to Diff 515966. Apr 21 2023, 5:19 PM
fmorac edited projects, added Restricted Project; removed Restricted Project.
This comment was removed by fmorac.
Herald added a project: Restricted Project. Apr 21 2023, 5:19 PM
fmorac added inline comments. Apr 22 2023, 2:09 PM
mlir/lib/Dialect/GPU/Transforms/ShuffleRewriter.cpp
49–88

This implementation is based on clang's implementation of the operation, since GPUs natively support only 32-bit shuffles.
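
Schematically, the pattern splits the 64-bit value into two 32-bit halves, shuffles each half, and reassembles the result. The snippet below is only an illustrative sketch of that decomposition (the value names are made up, and the exact ops the pattern emits may differ slightly):

// Before (xor mode shown as an example):
%res, %valid = gpu.shuffle xor %val, %offset, %width : f64

// After, roughly:
%bits    = arith.bitcast %val : f64 to i64
%lo      = arith.trunci %bits : i64 to i32
%c32     = arith.constant 32 : i64
%hi_bits = arith.shrui %bits, %c32 : i64
%hi      = arith.trunci %hi_bits : i64 to i32
%lo_s, %valid0 = gpu.shuffle xor %lo, %offset, %width : i32
%hi_s, %valid1 = gpu.shuffle xor %hi, %offset, %width : i32
%lo_ext  = arith.extui %lo_s : i32 to i64
%hi_ext  = arith.extui %hi_s : i32 to i64
%hi_shl  = arith.shli %hi_ext, %c32 : i64
%packed  = arith.ori %hi_shl, %lo_ext : i64
%res     = arith.bitcast %packed : i64 to f64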

mlir/test/Dialect/GPU/shuffle-rewrite.mlir
7–22

This test verifies the op for f64.

22

This memref.store is needed; otherwise the operation is folded away by applyPatternsAndFoldGreedily.
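
For example (an illustrative snippet, not the actual test), storing the shuffle result gives the op a side-effecting user, so the greedy driver cannot erase it as trivially dead:

%shfl, %valid = gpu.shuffle xor %val, %offset, %width : f64
memref.store %shfl, %mem[%c0] : memref<1xf64>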

31–51

This test verifies the op for i64.

fmorac published this revision for review. Apr 24 2023, 5:50 AM
fmorac added a comment. May 1 2023, 7:23 AM

ping for review

fmorac updated this revision to Diff 521377. May 11 2023, 11:12 AM

Rebasing to main.

fmorac updated this revision to Diff 521389. May 11 2023, 11:32 AM

Rebasing main

fmorac updated this revision to Diff 521395. May 11 2023, 11:45 AM

Rebasing main.

makslevental added inline comments. May 11 2023, 6:52 PM
mlir/lib/Dialect/GPU/Transforms/ShuffleRewriter.cpp
49–88

can you paste a link here to that impl? just so I can compare side-by-side...

fmorac marked an inline comment as done. May 12 2023, 4:14 AM
fmorac added inline comments.
mlir/lib/Dialect/GPU/Transforms/ShuffleRewriter.cpp
49–88

In this case clang uses a struct to hide the truncations, shifts, and extensions:
https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_intrinsics.h#L40-L54
But once optimized it decays to the form above. To test this:

double __attribute__((__used__)) __device__ shfl(double val, int delta,
                                                 int width) {
  return __shfl_down_sync(0xFFFFFFFF, val, delta, width);
}

When compiled with: clang++ -O3 --offload-device-only shfl.cu -S -emit-llvm -o shfl.ll, you get:

; Function Attrs: convergent mustprogress nounwind
define dso_local noundef double @_Z4shfldii(double noundef %0, i32 noundef %1, i32 noundef %2) #0 {
  %4 = bitcast double %0 to i64
  %5 = trunc i64 %4 to i32
  %6 = lshr i64 %4, 32
  %7 = trunc i64 %6 to i32
  %8 = mul i32 %2, -256
  %9 = add i32 %8, 8223
  %10 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 %5, i32 %1, i32 %9)
  %11 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 %7, i32 %1, i32 %9)
  %12 = zext i32 %11 to i64
  %13 = shl nuw i64 %12, 32
  %14 = zext i32 %10 to i64
  %15 = or i64 %13, %14
  %16 = bitcast i64 %15 to double
  ret double %16
}

The extra %8 and %9 in the above code are created because NVVM doesn't take width directly as the third parameter, but instead a special value:
https://docs.nvidia.com/cuda/nvvm-ir-spec/#data-movement
But I believe we use width for portability, as AMD only supports width.
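
(Decoding the constants myself, not something stated above: 8223 = 0x201f = (32 << 8) | 0x1f, so %9 evaluates to ((32 - width) << 8) | 0x1f. I believe that is the packed control operand the llvm.nvvm.shfl intrinsics expect, with the (32 - width) sub-warp segment in the upper byte and the lane clamp 0x1f in the lower bits, as described in the NVVM IR spec linked above.)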

makslevental accepted this revision. May 23 2023, 10:44 AM

Looks good to me - the only nit I would ask is to add a comment explaining that the set of shifts and truncates can be double-checked against clang (basically one sentence summarizing that you can generate the .ll from the macro that's in clang).

This revision is now accepted and ready to land. May 23 2023, 10:44 AM

The reason behind this change is that both CUDA & HIP support this kind of shuffling.

This is not supported in the PTX spec. What kind of lowering are you expecting for this op? If this is going to be broken down anyway, I don't think adding it here makes sense. This is the reason why it wasn't added for non-32-bit types.

Could you give more details on how you are planning to lower this to rocdl/nvvm or other dialects?

fmorac marked an inline comment as done. May 23 2023, 12:47 PM

Could you give more details on how you are planning to lower this to rocdl/nvvm or other dialects?

You're right, they're not natively supported. However, I added a pattern (see ShuffleRewriter.cpp) that transforms those ops into supported ones: it rewrites the op into two 32-bit shuffles, which is also what HIP and CUDA do internally.

fmorac updated this revision to Diff 525792. May 25 2023, 2:07 PM

Added more context specifying that the implementation provided by this patch mirrors clang's.

This revision was automatically updated to reflect the committed changes.