This commit adds initial support for mapping SCF parallel loops with
reductions onto GPUs using atomics and the GPU allreduce operation.
This commit is an initial implementation adding reduction support to the SCF-to-GPU mapping. It is very much a work in progress that I'm hoping to get feedback on, because a couple of parts of the implementation feel awkward or hacky. The approach handles simply nested parallel loops that have been tiled for blocks and then threads. Thread-level reductions are handled with all-reduce operations (with some extra work around the iteration-space guards), and block-level reductions are handled with atomic RMW operations on GPU memory.
As an example, take the following affine code, which computes a parallel sum and max reduction over an input array:
```mlir
func.func @reduce2(%input : memref<?xf32>) -> (f32, f32) {
  %zero = arith.constant 0. : f32
  %zero_0 = arith.constant 0 : index
  %n = memref.dim %input, %zero_0 : memref<?xf32>
  %reduceval, %maxval = affine.for %i = 0 to %n iter_args(%sum = %zero, %max = %zero) -> (f32, f32) {
    %0 = affine.load %input[%i] : memref<?xf32>
    %1 = arith.addf %0, %sum : f32
    %2 = arith.maxf %0, %max : f32
    affine.yield %1, %2 : f32, f32
  }
  return %reduceval, %maxval : f32, f32
}
```
It is lowered with

```
./bin/mlir-opt ../testing.mlir -pass-pipeline="builtin.module(func.func(affine-parallelize{parallel-reductions}, lower-affine, canonicalize, scf-parallel-loop-tiling{parallel-loop-tile-sizes=256}, scf-for-loop-canonicalization, gpu-map-parallel-loops))"
```

to:
```mlir
#map = affine_map<(d0, d1, d2) -> (256, d1 - d2)>
module {
  func.func @reduce2(%arg0: memref<?xf32>) -> (f32, f32) {
    %c256 = arith.constant 256 : index
    %cst = arith.constant 0xFF800000 : f32
    %c1 = arith.constant 1 : index
    %cst_0 = arith.constant 0.000000e+00 : f32
    %c0 = arith.constant 0 : index
    %dim = memref.dim %arg0, %c0 : memref<?xf32>
    %0:2 = scf.parallel (%arg1) = (%c0) to (%dim) step (%c256) init (%cst_0, %cst) -> (f32, f32) {
      %3 = affine.min #map(%c256, %dim, %arg1)
      %4:2 = scf.parallel (%arg2) = (%c0) to (%3) step (%c1) init (%cst_0, %cst) -> (f32, f32) {
        %5 = arith.addi %arg2, %arg1 : index
        %6 = memref.load %arg0[%5] : memref<?xf32>
        scf.reduce(%6) : f32 {
        ^bb0(%arg3: f32, %arg4: f32):
          %7 = arith.addf %arg3, %arg4 : f32
          scf.reduce.return %7 : f32
        }
        scf.reduce(%6) : f32 {
        ^bb0(%arg3: f32, %arg4: f32):
          %7 = arith.maxf %arg3, %arg4 : f32
          scf.reduce.return %7 : f32
        }
        scf.yield
      } {mapping = [#gpu.loop_dim_map<processor = thread_x, map = (d0) -> (d0), bound = (d0) -> (d0)>]}
      scf.reduce(%4#0) : f32 {
      ^bb0(%arg2: f32, %arg3: f32):
        %5 = arith.addf %arg2, %arg3 : f32
        scf.reduce.return %5 : f32
      }
      scf.reduce(%4#1) : f32 {
      ^bb0(%arg2: f32, %arg3: f32):
        %5 = arith.maxf %arg2, %arg3 : f32
        scf.reduce.return %5 : f32
      }
      scf.yield
    } {mapping = [#gpu.loop_dim_map<processor = block_x, map = (d0) -> (d0), bound = (d0) -> (d0)>]}
    %1 = arith.addf %0#0, %cst_0 : f32
    %2 = arith.maxf %0#1, %cst_0 : f32
    return %1, %2 : f32, f32
  }
}
```
Final application of the conversion pass with this patch yields:
```mlir
#map = affine_map<(d0)[s0, s1] -> ((d0 - s0) ceildiv s1)>
#map1 = affine_map<(d0)[s0, s1] -> (d0 * s0 + s1)>
#map2 = affine_map<(d0, d1, d2) -> (256, d1 - d2)>
module {
  func.func @reduce2(%arg0: memref<?xf32>) -> (f32, f32) {
    %c256 = arith.constant 256 : index
    %cst = arith.constant 0xFF800000 : f32
    %c1 = arith.constant 1 : index
    %cst_0 = arith.constant 0.000000e+00 : f32
    %c0 = arith.constant 0 : index
    %dim = memref.dim %arg0, %c0 : memref<?xf32>
    %memref = gpu.alloc () : memref<f32>
    %alloca = memref.alloca() : memref<f32>
    memref.store %cst_0, %alloca[] : memref<f32>
    gpu.memcpy %memref, %alloca : memref<f32>, memref<f32>
    %memref_1 = gpu.alloc () : memref<f32>
    %alloca_2 = memref.alloca() : memref<f32>
    memref.store %cst, %alloca_2[] : memref<f32>
    gpu.memcpy %memref_1, %alloca_2 : memref<f32>, memref<f32>
    %c1_3 = arith.constant 1 : index
    %0 = affine.apply #map(%dim)[%c0, %c256]
    %c256_4 = arith.constant 256 : index
    %1 = affine.apply #map(%c256_4)[%c0, %c1]
    gpu.launch blocks(%arg1, %arg2, %arg3) in (%arg7 = %0, %arg8 = %c1_3, %arg9 = %c1_3) threads(%arg4, %arg5, %arg6) in (%arg10 = %1, %arg11 = %c1_3, %arg12 = %c1_3) {
      %6 = affine.apply #map1(%arg1)[%c256, %c0]
      %7 = affine.min #map2(%c256, %dim, %6)
      %8 = affine.apply #map1(%arg4)[%c1, %c0]
      %9 = arith.cmpi slt, %8, %7 : index
      %10:2 = scf.if %9 -> (f32, f32) {
        %17 = arith.addi %8, %6 : index
        %18 = memref.load %arg0[%17] : memref<?xf32>
        scf.yield %18, %18 : f32, f32
      } else {
        scf.yield %cst_0, %cst : f32, f32
      }
      %11 = gpu.all_reduce max %10#1 uniform { } : (f32) -> f32
      %12 = gpu.all_reduce add %10#0 uniform { } : (f32) -> f32
      %13 = gpu.thread_id x
      %c0_7 = arith.constant 0 : index
      %14 = arith.cmpi eq, %13, %c0_7 : index
      scf.if %14 {
        %17 = memref.atomic_rmw addf %12, %memref[] : (f32, memref<f32>) -> f32
      }
      %15 = gpu.thread_id x
      %c0_8 = arith.constant 0 : index
      %16 = arith.cmpi eq, %15, %c0_8 : index
      scf.if %16 {
        %17 = memref.atomic_rmw maxf %11, %memref_1[] : (f32, memref<f32>) -> f32
      }
      gpu.terminator
    } {SCFToGPU_visited}
    %alloca_5 = memref.alloca() : memref<f32>
    gpu.memcpy %alloca_5, %memref : memref<f32>, memref<f32>
    %2 = memref.load %alloca_5[] : memref<f32>
    %alloca_6 = memref.alloca() : memref<f32>
    gpu.memcpy %alloca_6, %memref_1 : memref<f32>, memref<f32>
    %3 = memref.load %alloca_6[] : memref<f32>
    %4 = arith.addf %2, %cst_0 : f32
    %5 = arith.maxf %3, %cst_0 : f32
    return %4, %5 : f32, f32
  }
}
```
I would consider having an operation, or maybe attributes to gpu.launch, that represent the cross-block reduction. That operation could be lowered separately and hopefully make the implementation here less complex.
In general, I couldn't always follow where the new IR gets inserted. Specifically, it looks like the if providing the reduction or neutral value is always inserted at the beginning, but it may depend on values computed later within the loop, e.g., when the logic is more complex than loading from a memref indexed by the loop/block index. It feels like it could be inserted right after the scf.reduce so it is dominated by the same set of values as the reduce itself. Similarly, the all_reduce should be inserted there, after the if, which would remove the non-guaranteed hack that attempts to move it to the end.
| mlir/lib/Conversion/SCFToGPU/SCFToGPU.cpp | |
|---|---|
| 36 | This is not allowed in LLVM: https://llvm.org/docs/CodingStandards.html#include-iostream-is-forbidden |
| 539 | Why is this limited to threadX? What happens if the reduction loop is mapped to other thread dimensions? |
| 546 | Nit: please use /*thisStyleOfCommentForArgumentName=*/1 to avoid magic values. |
| 561 | It isn't very fresh in my memory even if I may have reviewed this, but it does feel quite convoluted. However, that's what the existing code is doing, so I would just keep doing the same for this patch and submit a refactoring as a separate patch. |
| 624–625 | Same place as matchReduction? |
| 682 | Nit: please expand auto unless the type is obvious from context, e.g., the RHS is a cast, or annoying to spell out, e.g., iterators and lambdas. Note that here the type isn't clear from the context and it may be either MLIRContext * or MLIRContext & (clang-tidy will complain about this). https://llvm.org/docs/CodingStandards.html#use-auto-type-deduction-to-make-code-more-readable |
| 699 | Nit: no need to prefix most LLVM ADTs with llvm:: in MLIR code, as they are re-exported into the MLIR namespace. |
| 700 | Nit: don't specify the number of stack elements in small containers unless you have a strong reason to pick a specific value. |
| 721–753 | Instead of doing all this, can we just allocate memory that's accessible from both host and device? Then we can just store into it on the host. Or does that come with a performance penalty? |
| 726 | Nit: you should be able to construct an ArrayRef<int64_t> rather than a vector. Here and below. |
| 734–735 | I suppose this is related to how the underlying memset function is implemented. |
| 747 | I would expect that the op has properly named arguments in ODS so we can see which of the values is the source and which is the destination. |
| 862–863 | Nit: you can just do cast<scf::IfOp>(parent), which asserts internally, and remove your own assert. |
| 944 | RAUW is forbidden in patterns. Use the rewriter API instead. |
| 966 | Low-level IR mutation APIs are forbidden in patterns. Use the rewriter API instead. |
Thanks for the initial pass; let me respond to some high-level things before going through and making changes to the code. I'll fix all the nits and the code-style issues around the rewriter; thanks for the pointers.
> I would consider having an operation, or maybe attributes to gpu.launch, that represent the cross-block reduction. That operation could be lowered separately and hopefully make the implementation here less complex.
I think an operation makes sense, but I don't view myself as knowledgeable enough about this community and consumers of the GPU dialect to go forward with proposing and adding all the things needed for a new operation. I'm not sure that should block the addition of a feature like this.
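Purely for illustration, such an op might look something like the sketch below. The name gpu.grid_reduce and its syntax are invented here to make the suggestion concrete; no such op exists in the GPU dialect today.

```mlir
// Hypothetical, for illustration only: a grid-wide analogue of gpu.all_reduce
// that would encapsulate the cross-block combination this patch currently
// open-codes with atomic RMWs on a gpu-allocated buffer.
%total = gpu.grid_reduce add %partial uniform { } : (f32) -> f32
```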
> Specifically, it looks like the if providing the reduction or neutral value is always inserted at the beginning, but it may depend on values computed later within the loop, e.g., when the logic is more complex than loading from a memref indexed by the loop/block index.
AFAICT, the if is inserted as a guard to make sure that only the thread ids inside the thread-mapped loop enter the loop body. It is not dependent on other things that may be inside the loop, because it must surround the entire loop body. On the other hand, the gpu.all_reduce _must_ go outside the if, because every thread must execute the collective block-wide reduction whether or not individual threads in a warp of the block enter the loop body. So that's why I've structured it this way.
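Concretely, this is the shape in the lowered output above (excerpted from the IR listing earlier in this thread): the guard yields either the loaded value or the neutral element, and the collective all-reduces sit after the guard so every thread in the block executes them.

```mlir
// Excerpt from the lowered output above: the guard selects the real value or
// the neutral element, and gpu.all_reduce executes unconditionally afterwards.
%10:2 = scf.if %9 -> (f32, f32) {
  %17 = arith.addi %8, %6 : index
  %18 = memref.load %arg0[%17] : memref<?xf32>
  scf.yield %18, %18 : f32, f32
} else {
  scf.yield %cst_0, %cst : f32, f32   // neutral elements for add and max
}
%11 = gpu.all_reduce max %10#1 uniform { } : (f32) -> f32
%12 = gpu.all_reduce add %10#0 uniform { } : (f32) -> f32
```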
> Instead of doing all this, can we just allocate memory that's accessible from both host and device? Then we can just store into it on the host. Or does that come with a performance penalty?
If I remember correctly, contended atomic RMWs to zero-copy memory perform worse than those to framebuffer memory. However, I'll run a simple experiment later and see -- it would be much simpler to just reduce to a host buffer.
> I would expect that the op has properly named arguments in ODS so we can see which of the values is the source and which is the destination.
I looked deeper into this and the driver function that this ends up calling is able to tell whether a pointer is host or device allocated.
Whoops, I forgot one high-level comment:
> Why is this limited to threadX? What happens if the reduction loop is mapped to other thread dimensions?
I'm not sure what to do if there are multiple nested loops mapped to different thread dimensions. Consider:
```
%0 = pfor i in ...        -- map thread x
       reduce:
         %1 = pfor j in ...  -- map thread y
                reduce: ...
```
If the thread-x loop uses the value %1 in a way that isn't just collapsing both the x and y thread dimensions into a single scalar, then we need some kind of sub-block reduction primitive that doesn't currently exist. The use case I have in mind is not going to map multiple thread dimensions when doing reductions, so I'm incentivized to support my simple use case.
> If I remember correctly, contended atomic RMWs to zero-copy memory perform worse than those to framebuffer memory. However, I'll run a simple experiment later and see -- it would be much simpler to just reduce to a host buffer.
A quick experiment showed that, for the case of reducing to a single element, there wasn't any noticeable difference between reductions into a buffer in zero-copy memory and a buffer in the GPU framebuffer. I'll clean up the implementation here to allocate the reduction buffers in zero-copy memory, which should remove some of the cruft. However, the GPUToLLVM pass doesn't currently support host-shared GPU allocations! I can add support for this for CUDA in this PR or a separate one, depending on what makes sense. I don't have access to AMD machines, so I wouldn't be able to add the corresponding definitions in the AMD GPU runtime.
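For reference, a host-shared reduction buffer in the lowered IR would presumably look something like the sketch below; this assumes the host_shared form of gpu.alloc, which the GPUToLLVM lowering would need to learn to handle, and reuses the value names from the output above.

```mlir
// Sketch: allocate the reduction cell in host-shared (zero-copy) memory so the
// host can initialize and read it directly, avoiding the staging alloca and
// the explicit gpu.memcpy in each direction.
%memref = gpu.alloc host_shared () : memref<f32>
memref.store %cst_0, %memref[] : memref<f32>
```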
> A quick experiment showed that, for the case of reducing to a single element, there wasn't any noticeable difference between reductions into a buffer in zero-copy memory and a buffer in the GPU framebuffer. I'll clean up the implementation here to allocate the reduction buffers in zero-copy memory, which should remove some of the cruft. However, the GPUToLLVM pass doesn't currently support host-shared GPU allocations! I can add support for this for CUDA in this PR or a separate one, depending on what makes sense. I don't have access to AMD machines, so I wouldn't be able to add the corresponding definitions in the AMD GPU runtime.
Please ignore this. I had some if-statements backwards in the lowering I used to test this, and the zero-copy buffer was not actually getting allocated. When zero-copy memory is used, the reduction is significantly slower (around two orders of magnitude). So we need to do it this way, and I'll update this code to thread a GPU async token through the allocations and copies.
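The token threading could look roughly like the following; this is only a sketch, with value names chosen to match the lowered output above.

```mlir
// Sketch: chain the device allocation and the host-to-device copy on gpu
// async tokens so the transfers are ordered without host-side blocking waits.
%t0 = gpu.wait async
%memref, %t1 = gpu.alloc async [%t0] () : memref<f32>
%t2 = gpu.memcpy async [%t1] %memref, %alloca : memref<f32>, memref<f32>
```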