A common practice involves the creation of multiple MBarrier objects for utilization within loops, see an example below. This is particularly valuable in scenarios like software pipelining during matmul code generation, where we need to generate and employ five barriers dynamically within a loop.
This works improves nvgpu.mbarrier.barrier type into the nvgpu.mbarrier.group. All MBarrier-related operations now uses this type. Consequently, these operations are now capable of managing multiple barriers seamlessly.
%barriers = nvgpu.mbarrier.create -> !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 3> nvgpu.mbarrier.init %barriers[%c0], %num_threads : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 3> nvgpu.mbarrier.init %barriers[%c1], %num_threads : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 3> nvgpu.mbarrier.init %barriers[%c2], %num_threads : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 3> ... scf.for %i = %c0 to %n step %c1 { %mbarId = arith.remui %i, 3 : index %isDone = nvgpu.mbarrier.test.wait %barriers[%mbarId], %token : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 3>, !tokenType }