Adds NVPTX builtins and intrinsics for the CUDA PTX wmma.load, wmma.store, wmma.mma, and mma instructions added in PTX 6.5 and 7.0.
PTX ISA descriptions:
- wmma.load: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-ld
- wmma.store: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-st
- wmma.mma: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-mma
- mma: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma
Overview of wmma.mma and mma matrix shape/type combinations added with specific PTX versions: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-shape
Authored-by: Steffen Larsen <steffen.larsen@codeplay.com>
Co-Authored-by: Stuart Adams <stuart.adams@codeplay.com>
Bummer. mma.h in CUDA-11.3 still does not compile for Ampere.
We appear to be missing the new __bmma_m8n8k128_mma_and_popc_b1 builtin for the .and variant of 1-bit mma, which was introduced in PTX 7.1 and is not included in this patch.
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-mma
Do you, by any chance, have an upcoming patch for PTX 7.1, too? :-)