This is an archive of the discontinued LLVM Phabricator instance.

[mlir][nvgpu] Add initial support for `mbarrier`
ClosedPublic

Authored by guraypp on Jun 29 2023, 8:14 AM.

Details

Summary

mbarrier is a barrier object created in shared memory that supports synchronization patterns beyond what __syncthreads offers; for more information, see:
https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-mbarrier

This work adds initial mbarrier-related Ops to the nvgpu dialect.

First, it introduces two types:
mbarrier.barrier, a barrier object that lives in shared memory
mbarrier.barrier.token, a token returned by arrive operations

It introduces the following Ops:
mbarrier.create creates an mbarrier.barrier
mbarrier.init initializes an mbarrier.barrier
mbarrier.arrive performs arrive-on an mbarrier.barrier and returns an mbarrier.barrier.token
mbarrier.arrive.nocomplete performs a non-blocking arrive-on an mbarrier.barrier and returns an mbarrier.barrier.token
mbarrier.test_wait waits on an mbarrier.barrier with an mbarrier.barrier.token
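
Taken together, a typical use of these Ops might look like the sketch below. This is a hypothetical illustration: the exact op spellings, type syntax, and result types are defined by the patch itself and may differ from what is shown here.

```mlir
func.func @use_mbarrier() {
  // Create the barrier object in shared memory, then initialize it
  // for 128 participating threads (syntax is illustrative).
  %barrier = nvgpu.mbarrier.create -> !nvgpu.mbarrier.barrier
  %count = arith.constant 128 : index
  nvgpu.mbarrier.init %barrier, %count : !nvgpu.mbarrier.barrier

  // Arrive on the barrier; the returned token identifies this phase.
  %token = nvgpu.mbarrier.arrive %barrier
      : !nvgpu.mbarrier.barrier -> !nvgpu.mbarrier.barrier.token

  // Wait until the phase associated with the token completes.
  %done = nvgpu.mbarrier.test_wait %barrier, %token
      : !nvgpu.mbarrier.barrier, !nvgpu.mbarrier.barrier.token -> i1
  func.return
}
```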

Diff Detail

Event Timeline

guraypp created this revision.Jun 29 2023, 8:14 AM
Herald added a project: Restricted Project.Jun 29 2023, 8:14 AM
guraypp requested review of this revision.Jun 29 2023, 8:14 AM
guraypp retitled this revision from [mlir][nvvm] Add initial support for mbarrier to [mlir][nvgpu] Add initial support for `mbarrier`.Jun 29 2023, 8:14 AM
guraypp added inline comments.Jul 4 2023, 2:11 AM
mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
717

The mbarrier object needs to live in static shared memory. Here, I create a memref::GlobalOp to match what clang generates. See an example:
https://godbolt.org/z/T4Gonnoez

clang emits the static shared-memory value as a global, like below:

@arrive(int*)::bar = internal unnamed_addr addrspace(3) global i64 undef, align 8
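
For reference, CUDA source along the following lines makes clang emit such a global. This is a hypothetical reconstruction of the godbolt example, inferred from the demangled symbol arrive(int*)::bar; the actual source behind the link may differ.

```cuda
#include <cuda/barrier>

__global__ void arrive(int *out) {
  // A function-local __shared__ barrier: clang lowers this to an
  // addrspace(3) global with internal linkage, as shown above.
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;
  if (threadIdx.x == 0)
    init(&bar, blockDim.x);  // lowers to mbarrier.init on the shared object
  __syncthreads();
  bar.arrive_and_wait();     // arrive-on + wait on the current phase
  out[threadIdx.x] = threadIdx.x;
}
```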
guraypp added inline comments.Jul 4 2023, 2:56 AM
mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
717

Initially, I attempted to use memref.alloca with address space 3, but it resulted in incorrect PTX. See the example below:

func.func @do_alloca() {
  %wg = memref.alloca() {alignment = 8} : memref<1xi64, 3>
  %c128 = arith.constant 128 : i32  
  %2 = builtin.unrealized_conversion_cast %wg : memref<1xi64, 3> to !llvm.struct<(ptr<3>, ptr<3>, i64, array<1 x i64>, array<1 x i64>)>
  %3 = llvm.extractvalue %2[1] : !llvm.struct<(ptr<3>, ptr<3>, i64, array<1 x i64>, array<1 x i64>)> 
  nvvm.mbarrier.init.shared %3, %c128 : !llvm.ptr<3>, i32
  func.return
}

This produced the following LLVM IR, which looked correct to me: the alloca has addrspace(3).

define void @do_alloca() {
  %1 = alloca i64, i64 1, align 8, addrspace(3)
  %2 = insertvalue { ptr addrspace(3), ptr addrspace(3), i64, [1 x i64], [1 x i64] } undef, ptr addrspace(3) %1, 0
  %3 = insertvalue { ptr addrspace(3), ptr addrspace(3), i64, [1 x i64], [1 x i64] } %2, ptr addrspace(3) %1, 1
  %4 = insertvalue { ptr addrspace(3), ptr addrspace(3), i64, [1 x i64], [1 x i64] } %3, i64 0, 2
  %5 = insertvalue { ptr addrspace(3), ptr addrspace(3), i64, [1 x i64], [1 x i64] } %4, i64 1, 3, 0
  %6 = insertvalue { ptr addrspace(3), ptr addrspace(3), i64, [1 x i64], [1 x i64] } %5, i64 1, 4, 0
  %7 = extractvalue { ptr addrspace(3), ptr addrspace(3), i64, [1 x i64], [1 x i64] } %6, 1
  call void @llvm.nvvm.mbarrier.init.shared(ptr addrspace(3) %7, i32 128)
  ret void
}

Despite the LLVM IR having the expected alloca with addrspace(3), the PTX showed that the value was allocated on the stack (.local) instead of in shared memory. See the PTX below:

.visible .func do_alloca()
{
        .local .align 8 .b8     __local_depot0[8];
        .reg .b32       %SP;
        .reg .b32       %SPL;
        .reg .b32       %r<3>;
        mov.u32         %SPL, __local_depot0;
        cvta.local.u32  %SP, %SPL;
        mov.u32         %r1, 128;
        add.u32         %r2, %SP, 0;
        mbarrier.init.shared.b64 [%r2], %r1;
        ret;
                                        // -- End function
}

To address this, I examined how clang implements static shared memory and discovered that it generates a global value with internal linkage. Thus, I modified my IR accordingly. As a result, I achieved the desired PTX output, where the barrier object is placed in shared memory as .shared .align 8 .b8 __mbarrier[8];.

@__mbarrier = internal addrspace(3) global [1 x i64] undef, align 8
define void @do_global() {
  call void @llvm.nvvm.mbarrier.init.shared(ptr addrspace(3) @__mbarrier, i32 128)
  ret void
}

PTX:

.visible .func do_global()              // @do_global
{
        .reg .b32       %r<3>;
        .shared .align 8 .b8 __mbarrier[8];
        mov.u32         %r1, __mbarrier;
        mov.u32         %r2, 128;
        mbarrier.init.shared.b64 [%r1], %r2;
        ret;
                                        // -- End function
}
guraypp added inline comments.Jul 4 2023, 3:14 AM
mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
717

Speaking of LLVM's internal linkage type, I could not find a way to set internal linkage on a memref::GlobalOp, so I put up a patch that attempts to add it:
https://reviews.llvm.org/D154074

mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
87

doc please: this is the type for a barrier in shared memory that is used to synchronize a variable number of threads. Conceptually it behaves similarly to C++ std::barrier<T>::arrive_and_wait
etc

420

Can we document what "parity" means here?

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
706

Can we add a detailed enough comment to justify the current choice of memref::GlobalOp, vs e.g. AllocaOp?

749

comments here and below for lowering please.

mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
561

nit: nl

qcolombet added inline comments.Jul 4 2023, 8:21 AM
mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
411

Can we use init on the same barrier several times?
E.g.,

barrier = mbarrier.create
mbarrier.init barrier
// RegionA: Do some stuff with barrier
mbarrier.init barrier
// RegionB: Do some other stuff with barrier

Ultimately what I am wondering is what kind of "effects" we should set on this operation, and/or how we should model the dependency on barrier. (like init is a store-like operation and other mbarrier operations are load-like operations on the given barrier?)

For instance, in my snippet, we wouldn't want to move some code from RegionB in RegionA.

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
346

Should we return an optional here?

Put differently, what should the user of this API do when they receive {}?

guraypp updated this revision to Diff 539054.Jul 11 2023, 6:17 AM
guraypp marked 3 inline comments as done.

address comments

guraypp edited the summary of this revision.Jul 11 2023, 6:17 AM
nicolasvasilache accepted this revision.Jul 11 2023, 8:32 AM
nicolasvasilache added inline comments.
mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
88

can we drop the first sentence? it seems redundant

This revision is now accepted and ready to land.Jul 11 2023, 8:32 AM
guraypp updated this revision to Diff 539125.Jul 11 2023, 8:34 AM
guraypp edited the summary of this revision.

address the comments

This revision was landed with ongoing or failed builds.Jul 11 2023, 8:35 AM
This revision was automatically updated to reflect the committed changes.