This is an archive of the discontinued LLVM Phabricator instance.

[Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX 6.5 and 7.0 WMMA and MMA instructions
Closed, Public

Authored by steffenlarsen on Jun 24 2021, 4:30 AM.

Details

Summary

Adds NVPTX builtins and intrinsics for the CUDA PTX wmma.load, wmma.store, wmma.mma, and mma instructions added in PTX 6.5 and 7.0.

PTX ISA description of the wmma and mma instructions: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-mma

Overview of wmma.mma and mma matrix shape/type combinations added with specific PTX versions: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-shape

Authored-by: Steffen Larsen <steffen.larsen@codeplay.com>
Co-Authored-by: Stuart Adams <stuart.adams@codeplay.com>
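
For context, a minimal sketch of how builtins in this family are used from CUDA device code, shown with the pre-existing m16n16k16 f16 builtins rather than the new ones; fragment sizes and argument layout here are assumptions based on Clang's existing __hmma_* builtin signatures, not part of this patch:

// Each warp cooperatively loads 16x16 fragments, multiplies, and stores.
// Fragment sizes (8 ints for A/B, 4 for C/D) and the trailing layout/satf
// arguments are assumed from the existing builtin prototypes.
__device__ void warp_mma_f16(int *d, const int *a, const int *b, const int *c) {
  int fa[8], fb[8], fc[4], fd[4];
  __hmma_m16n16k16_ld_a(fa, a, /*ldm=*/16, /*rowcol=*/0);
  __hmma_m16n16k16_ld_b(fb, b, /*ldm=*/16, /*rowcol=*/0);
  __hmma_m16n16k16_ld_c_f16(fc, c, /*ldm=*/16, /*rowcol=*/0);
  __hmma_m16n16k16_mma_f16f16(fd, fa, fb, fc, /*layout=*/0, /*satf=*/0);
  __hmma_m16n16k16_st_c_f16(d, fd, /*ldm=*/16, /*rowcol=*/0);
}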

Diff Detail

Event Timeline

steffenlarsen created this revision. Jun 24 2021, 4:30 AM
steffenlarsen requested review of this revision. Jun 24 2021, 4:30 AM
tra accepted this revision. Jun 24 2021, 11:46 AM

Nice. Thank you for adding support for these missing instructions!
LGTM, modulo a few cosmetic nits.

clang/include/clang/Basic/BuiltinsNVPTX.def
762

Is this a sufficient set of builtins to compile mma.h in CUDA-11.x?

clang/lib/CodeGen/CGBuiltin.cpp
16503–16513

Nit: satf variants are in the minority. We could move them to the end of the variant list and rely on default initialization to 0. E.g. something like this:

unsigned getMMAIntrinsic(int Layout, bool Satf) {
  unsigned Index = Layout + 4 * Satf;
  if (Index >= Variants.size())
    return 0;
  return Variants[Index];
}

#define MMA_VARIANTS(geom, type)                                  \
  Intrinsic::nvvm_wmma_##geom##_mma_row_row_##type,               \
  Intrinsic::nvvm_wmma_##geom##_mma_row_col_##type,               \
  Intrinsic::nvvm_wmma_##geom##_mma_col_row_##type,               \
  Intrinsic::nvvm_wmma_##geom##_mma_col_col_##type

#define MMA_SATF_VARIANTS(geom, type)                             \
  MMA_VARIANTS(geom, type),                                       \
  Intrinsic::nvvm_wmma_##geom##_mma_row_row_##type##_satfinite,   \
  Intrinsic::nvvm_wmma_##geom##_mma_row_col_##type##_satfinite,   \
  Intrinsic::nvvm_wmma_##geom##_mma_col_row_##type##_satfinite,   \
  Intrinsic::nvvm_wmma_##geom##_mma_col_col_##type##_satfinite
...
case NVPTX::BI__hmma_m16n16k16_mma_f16f16:
  return {8, 8, 4, 4, {{MMA_SATF_VARIANTS(m16n16k16, f16_f16)}}};
...
case NVPTX::BI__dmma_m8n8k4_mma_f64:
  return {1, 1, 2, 2, {{MMA_VARIANTS(m8n8k4, f64)}}};

Up to you.
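
A rough sketch of how this lookup could sit on the per-builtin variant-info struct; the struct and field names here are assumptions for illustration:

#include <array>

// Sketch: one table entry per builtin. Variants holds the four layout
// combinations first, then (optionally) their four satfinite twins;
// unused trailing slots default-initialize to 0.
struct NVPTXMmaInfo {
  unsigned NumEltsA, NumEltsB, NumEltsC, NumEltsD;
  std::array<unsigned, 8> Variants;

  unsigned getMMAIntrinsic(int Layout, bool Satf) {
    unsigned Index = Layout + 4 * Satf;
    if (Index >= Variants.size())
      return 0; // no such variant, e.g. satf requested but not supported
    return Variants[Index];
  }
};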

clang/test/CodeGen/builtins-nvptx-mma.py
111

Typo in the original code: should be "sub-integers" or "sub-integer types".

142–146

It's not obvious why frag d is __mma and not __mma_tf32.
Can we use frag.ptx_type to make that decision?

llvm/include/llvm/IR/IntrinsicsNVVM.td
55

Nit: I'd drop "some".

219–223

Nit: are identified

221

Nit: types as both A and B are considered.

4474

We often use an empty string to represent 'none'. Comparing against "-" where we check rnd makes it look like we're doing something special there.
I'd use an empty string for rnd, too.

This revision is now accepted and ready to land. Jun 24 2021, 11:46 AM

Adjusted for comments and fixed formatting issues.

steffenlarsen marked 4 inline comments as done. Jun 25 2021, 7:13 AM

@tra Thank you for the quick response! I agree with your comments and have made changes accordingly.

clang/include/clang/Basic/BuiltinsNVPTX.def
762

mma.h was my frame of reference for the builtins, and I have CUDA 11.3 (465.19.01) installed, so I would expect it to be.

clang/lib/CodeGen/CGBuiltin.cpp
16503–16513

I agree; I like this better. If other options are added alongside satf (e.g. rnd, which mma.h doesn't appear to use), this solution is also easier to extend in the future.

clang/test/CodeGen/builtins-nvptx-mma.py
142–146

We absolutely can. I don't know why that wasn't my first solution.

llvm/include/llvm/IR/IntrinsicsNVVM.td
4474

Empty string works for me. I think there are/were some places that used "-" as a default parameter meaning none, but I agree with your assessment.

tra accepted this revision. Jun 25 2021, 10:43 AM

LGTM. Would you like me to land the patch on your behalf?

clang/lib/CodeGen/CGBuiltin.cpp
16489

A comment describing the expected arrangement of the variants here would be helpful.
E.g. ordered by layout-A/layout-B/satf, where 'row' has priority over 'col' for the layouts. The index of the non-satf variants is expected to match the undocumented layout constants used by CUDA's mma.hpp.

It would be cleaner if we could use designated initializers, but we can't use them until LLVM switches to C++20.
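
The ordering comment that was added presumably reads along these lines (exact wording assumed):

// Variants are ordered first by the A/B layouts ('row' before 'col'):
//   0: row_row, 1: row_col, 2: col_row, 3: col_col,
// followed by the same four layouts with .satfinite (indices 4..7).
// The indices of the non-satf variants must match the undocumented
// layout constants used by CUDA's mma.hpp.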

Added comment about variant ordering.

steffenlarsen marked an inline comment as done. Jun 28 2021, 2:25 AM

LGTM. Would you like me to land the patch on your behalf?

If it isn't too much of a bother. Thank you. 😄

tra added inline comments. Jul 2 2021, 11:15 AM
clang/include/clang/Basic/BuiltinsNVPTX.def
727

Bummer. mma.h in CUDA-11.3 still does not compile for Ampere.

We appear to be missing the new __bmma_m8n8k128_mma_and_popc_b1 builtin for the .and variant of 1-bit mma, which was introduced in PTX 7.1 and is not included in this patch.

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-mma

Do you, by any chance, have an upcoming patch for PTX 7.1, too? :-)
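
For reference, the missing builtin would presumably mirror the existing xor variant's entry in BuiltinsNVPTX.def; a hypothetical sketch, with the prototype string and feature guard assumed:

// Hypothetical .def entry: .and variant of the 1-bit mma, gated on
// sm_80 / PTX 7.1 (both assumed).
TARGET_BUILTIN(__bmma_m8n8k128_mma_and_popc_b1, "vi*iC*iC*iC*Ii", "", AND(SM_80,PTX71))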

tra added inline comments. Jul 2 2021, 3:25 PM
clang/test/CodeGen/builtins-nvptx-mma.cu
781–786

This looks rather odd. We're calling a tf32 builtin, but expect to see an f32 load intrinsic. Is that expected?

clang/test/CodeGen/builtins-nvptx-mma.py
74

This does not seem to match the generated builtins-nvptx-mma.cu, which does have __mma_tf32_m16n16k8_ld_c.
If I regenerate the test I see a somewhat different set of tests, possibly related to the oddity I've pointed out in the generated test changes in this patch.

tra added inline comments. Jul 2 2021, 4:21 PM
clang/test/CodeGen/builtins-nvptx-mma.cu
781–786

Never mind. I think I understand what's going on now.
CUDA headers use __mma_tf32 builtins. A and B operate on opaque integer types. C and D operate on floats.
However, on the PTX front we have wmma.load.{a,b}...tf32 but wmma.load.c...f32.

I guess it does make sense to keep LLVM intrinsic names close to the instructions they produce.
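
Concretely, the naming asymmetry looks roughly like this (illustrative prototypes only; the exact signatures are assumptions):

// A/B fragments carry tf32 bit patterns in opaque ints and lower to
// wmma.load.{a,b}...tf32:
void __mma_tf32_m16n16k8_ld_a(int *dst, const int *src, unsigned ldm, int rowcol);
void __mma_tf32_m16n16k8_ld_b(int *dst, const int *src, unsigned ldm, int rowcol);
// The C fragment is plain f32 and lowers to wmma.load.c...f32:
void __mma_tf32_m16n16k8_ld_c(float *dst, const float *src, unsigned ldm, int rowcol);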

Sorry for the late response. Looks like you have handled the issues and more in your patch. Thank you for fixing my blunders. 😄

clang/include/clang/Basic/BuiltinsNVPTX.def
727

Haha, I didn't think of that one. Sadly we don't have any plans to work on extending support for PTX 7.1+ in the next couple of months, but it looks like your new patch (D105384) takes care of it anyway. 😆

clang/test/CodeGen/builtins-nvptx-mma.cu
781–786

Yeah, it was definitely confusing to write. I think the current state is the best solution, as it prioritizes consistency within each sub-project. I'm not a big fan of the inconsistency, but if we want to follow CUDA's example I suppose we're stuck with it.

clang/test/CodeGen/builtins-nvptx-mma.py
74

You are absolutely right. That's a mistake on my part. Looks like you've got it under control in your patch. Thanks!