This is an archive of the discontinued LLVM Phabricator instance.

[mlir][nvgpu] add simple pipelining for shared memory copies
ClosedPublic

Authored by ftynse on Jul 13 2023, 11:01 AM.

Details

Summary

Add a simple transform operation to the NVGPU extension that performs
software pipelining of copies to shared memory. The functionality is
extremely minimalistic in this version and only supports copies from
global to shared memory inside an scf.for loop with either
vector.transfer or nvgpu.device_async_copy operations when
pipelining preconditions are already satisfied in the IR. This is the
minimally useful version that uses the more general loop pipeliner in an
NVGPU-specific way. Further extensions and orthogonalizations will be
necessary.

This required a change to the loop pipeliner itself to properly
propagate errors should the predicate generator fail.

This is loosely inspired from the vesion in IREE, but has less unsafe
assumptions and more principled way of communicating decisions.

Diff Detail

Event Timeline

ftynse created this revision.Jul 13 2023, 11:01 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 13 2023, 11:01 AM
ftynse requested review of this revision.Jul 13 2023, 11:01 AM
Herald added a project: Restricted Project. · View Herald Transcript

Great, thanks for starting the rationalizing / tech debt removal effort around this important transformation!

mlir/include/mlir/Dialect/NVGPU/TransformOps/NVGPUTransformOps.td
34

nit: for a load into shared memory ?

51

Description seems out of date, I see definite failures in the impl.

56

+1

mlir/include/mlir/Dialect/SCF/Transforms/Patterns.h
36

grammo: to a value that captures whether ?

mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
98

Even if trivial, I like the bulletpoints in getPipelineStages, can we rewrite as:

Specifically, this collects:
1. all loads from global memory, both sync and async
2. the barriers for async loads
In particular, barriers are omitted if they do not dominate at least one async load for which there is not yet a barrier.

The in particular part seems significant to document and not immediately clear from glancing over code.

117

As always, C++ is cute and terrifying at the same time :)

144

nit: order of

147

Can we merge this condition and the one above and document properly ?

From reading the code I am inferring:

// If not a waitOp or if numGroups is already set, bail.
auto waitOp = dyn_cast<nvgpu::DeviceAsyncWaitOp>(op);
if (!waitOp || waitOp.getNumGroups())
  return;
160

isn't there an auto-tablegen'd method to do that automatically rather than have to manipulate the attrname?

168

add a case that:

- operations in the backward slice of any stage0Ops are all at stage 0
173

Nit: add something along the lines of Hook for the loop pipeliner used by the scheduleFn hook.

Even better put all these under a special namespace nvgpu::pipelining_hooks to really make clear what is what.

174

nit: opsWithPipelineStage would be more self-documenting

191

nit: iterate directly on dependencies rather than on forOp.getBody()->getOperations()

196

seems off, why would you want to replace with a predicated version when no predication is necessary?

198

Same thing re hook here.

216

nit: lines

302

add "try to set the peel_epilogue attribute" ?

mlir/test/Dialect/NVGPU/transform-pipeline-shared.mlir
9

can we add some comments to the test here to explain why we expect this outcome?
It is really not clear to me why we need to "peel the epilogue" here.

88

Can we spell out the selects operands and how they connect here ?
Would be nice to see the difference between the 2; this is new untested code in MLIR.

91

nit: load

91

can we also check the scf.yield here to "close the loop" ?

110

interestingly, I would have expected something to go wrong here without peeling:
we have both predicated async_copy and loop trip count not divisible by 4

yet this case does not require peeling but the first one does.

A few comments in the test here would help my past, present and future self.

This revision is now accepted and ready to land.Jul 14 2023, 12:58 AM
mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
58

Ok now I am lost, I looked deeper into memory spaces and I see:

https://github.com/llvm/llvm-project/blob/5c5a1a2927118eb9cc51c7b98c959a30524cc491/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td#L79

says:

def GPU_AddressSpaceGlobal : I32EnumAttrCase<"Global", 1, "global">;
def GPU_AddressSpaceWorkgroup : I32EnumAttrCase<"Workgroup", 2, "workgroup">;
def GPU_AddressSpacePrivate : I32EnumAttrCase<"Private", 3, "private">;
def GPU_AddressSpaceEnum : GPU_I32Enum<
  "AddressSpace", "GPU address space", [
    GPU_AddressSpaceGlobal,
    GPU_AddressSpaceWorkgroup,
    GPU_AddressSpacePrivate
  ]>;

https://github.com/llvm/llvm-project/blob/5c5a1a2927118eb9cc51c7b98c959a30524cc491/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td#L60

says:

    /// Defines the MemRef memory space attribute numeric value that indicates
    /// a memref is located in global memory. This should correspond to the
    /// value used in NVVM.
    static constexpr unsigned kGlobaldMemoryAddressSpace = 1;

    /// Defines the MemRef memory space attribute numeric value that indicates
    /// a memref is located in shared memory. This should correspond to the
    /// value used in NVVM.
    static constexpr unsigned kSharedMemoryAddressSpace = 3;

whereas https://github.com/llvm/llvm-project/blob/5c5a1a2927118eb9cc51c7b98c959a30524cc491/mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td#L40

says:

/// The address space value that represents global memory.
static constexpr unsigned kGlobalMemoryAddressSpace = 1;
/// The address space value that represents shared memory.
static constexpr unsigned kSharedMemoryAddressSpace = 3;
/// The address space value that represents private memory.
static constexpr unsigned kPrivateMemoryAddressSpace = 5;

I don't know if there are others.

ftynse updated this revision to Diff 540962.Jul 17 2023, 5:10 AM
ftynse marked 23 inline comments as done.

Address review.

ftynse added inline comments.Jul 17 2023, 5:12 AM
mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
58

ROCDL is irrelevant here. The other two are the platform-agnostic and the NVVM-specific memory spaces. Given that conversion to NVVM happens later than this, I expect the platform-agnostic space to be used in the input.

173

Putting them in a namespace makes them potentially accessible from different translation units, which we'd like to avoid.

191

I think the order in which operations are listed in the opsWIthPipelineStage matters so I'd rather have a small overhead here and be sure the ops are in the same order as originally.

196

Because the pipeliner wants that... I suppose we can change it to require a custom three-state result "replaced with Op, keep the same, couldn't replace". Updated the doc to reflect this.

mlir/test/Dialect/NVGPU/transform-pipeline-shared.mlir
9

The predication is only implemented for nvpgu.device.async.copy, not for transfer reads/writes.

110

This test does not require peeling because the code being pipelined already handles the "partial" loop iteration correctly from the start. There's no reason why pipelining should break it.

NVGPUToNVVM bazel doesn't look related to this.

This revision was landed with ongoing or failed builds.Jul 17 2023, 7:29 AM
This revision was automatically updated to reflect the committed changes.