WMMA = "Warp Level Matrix Multiply-Accumulate".
These are the new instructions introduced in PTX 6.0 and available on sm_70 GPUs.
WMMA.MMA wins special medal for having 8 return values and 26 arguments. :-)
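For context, the operation these intrinsics expose is the warp-wide 16x16x16 matrix multiply-accumulate of the CUDA wmma API (mma.h). The sketch below is illustrative only -- the kernel name, leading dimensions, and layouts are assumptions, not taken from this patch -- but it shows the load/mma/store calls that lower to the wmma intrinsics and, ultimately, to the wmma.load / wmma.mma / wmma.store PTX instructions.

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // Each warp computes one 16x16 tile: D = A * B + C, with f16 inputs and
    // f32 accumulators (the m16n16k16 shape targeted by this patch).
    __global__ void wmma_16x16x16(const half *a, const half *b,
                                  const float *c, float *d) {
      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

      // Each load/store/mma call below corresponds to one wmma intrinsic
      // (and to one variant of the wmma.load/wmma.store/wmma.mma instructions).
      wmma::load_matrix_sync(a_frag, a, /*ldm=*/16);
      wmma::load_matrix_sync(b_frag, b, /*ldm=*/16);
      wmma::load_matrix_sync(c_frag, c, /*ldm=*/16, wmma::mem_row_major);

      wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

      wmma::store_matrix_sync(d, c_frag, /*ldm=*/16, wmma::mem_row_major);
    }

Compile with -arch=sm_70 or newer; the accumulator fragment occupies eight f32 registers per thread, which is where the unusually large operand counts of WMMA.MMA come from.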
Differential D38645: [NVPTX] Implemented wmma intrinsics and instructions.
Authored by tra on Oct 6 2017, 1:51 PM. Status: Closed, Public.
Event Timeline

Comment (tra): Changed names of MMA intrinsics and instructions to use <typeD>.<typeC> order to match the nomenclature used in CUDA headers.

Comment (Yuan): Artem, thanks a lot for working on this! I notice that you are taking a different approach to defining the llvm wmma intrinsics than what we (NVIDIA) do: http://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#nvvm-intrin-warp-level-matrix
Specifically, yours embeds the layout/memory space/etc. in the intrinsic names, while ours treats them as constant arguments. We did this to reduce the number of intrinsic functions the optimizer and codegen have to deal with. We have plans for more wmma features in the next few CUDA releases. It would be better to unify the syntax and naming of the wmma intrinsics; it would also make cross-support much easier. Would you be able to revise the patch? Highly appreciated. Thanks, Yuan.

Comment: (Responding to Yuan, which also dumps my in-progress comments on the code; sorry for the noise.)
How does having more intrinsic functions cause a problem for the optimizer / codegen?
Comment: We took this approach to reduce the number of intrinsic functions that opt and codegen have to deal with -- for example, to have one ld_a_f16 instead of 12. It simplifies our code logic. Take the address-space optimization as an example: when we translate a generic load to a specific load, we can just change the pointer type; the rest is just copied over.
Comment: Yeah, specializing on an inferred address space is an important optimization. We were planning to handle this in tablegen, by pattern-matching wmma.load/store(addrspacecast(ptr, x)). This should let it Just Work, without any need for an analysis beyond what we already have.
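As a concrete illustration of the case discussed above (a minimal sketch at the CUDA level; the kernel name and the staging loop are assumptions, not part of the patch): when the tile lives in __shared__ memory, the pointer handed to the load intrinsic reaches it through an addrspacecast to the generic address space in IR, and recovering that address space -- whether by matching through the cast or by inferring the address space of the operand -- is what lets codegen pick the .shared variant of wmma.load instead of the generic one.

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    __global__ void wmma_load_from_shared(const half *a_global) {
      __shared__ half a_tile[16 * 16];

      // Stage the tile into shared memory (details elided for brevity).
      for (int i = threadIdx.x; i < 16 * 16; i += blockDim.x)
        a_tile[i] = a_global[i];
      __syncthreads();

      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
      // The pointer operand originates in the shared address space; in IR it
      // is passed to the load intrinsic via an addrspacecast to generic.
      wmma::load_matrix_sync(a_frag, a_tile, /*ldm=*/16);
      // ... use a_frag with wmma::mma_sync as usual ...
    }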
Comment: Reducing the number of intrinsics does not change the fact that the root cause of complexity here is that PTX encodes instruction parameters in instruction *names*. Even with a reduced number of intrinsics mapping to these instructions, someone, somewhere will have to match them to the appropriate instruction variant. A 1:1 mapping is relatively simple to implement with tablegen and is sufficient for its intended use of generating a specific instruction variant. Just in case -- the naming of the intrinsics is also different: the intrinsics in this patch are llvm.nvvm.*W*mma, while the intrinsics in the NVVM-IR spec use llvm.nvvm.*H*mma. For all practical purposes they should not conflict in any way with your downstream implementation.
If NVIDIA sends a patch implementing the NVVM-IR style intrinsics, I would be glad to help review it and get it into LLVM.
I don't think this patch prevents optimizations like these.
Comment: Thinking about this more, most or all of the time we should be able to pattern-match on the address space of the pointer operand to the load/store, with no messing around following chains of addrspacecast. Right now we'd have to add the intrinsics to rewriteIntrinsicOperands in InferAddressSpaces, but there's a TODO in there to improve that.

tra marked 4 inline comments as done.

Comment (tra): Added an explanation for the WMMA_VARIANT macro and the related enum.

Comment: Thanks for pointing out the hmma vs. wmma difference in the names. That should avoid the naming conflict and is less confusing. Thanks.

This revision is now accepted and ready to land (Oct 11 2017, 5:41 PM).

Closed by commit rL315601: [NVPTX] Implemented wmma intrinsics and instructions (authored by tra) on Oct 12 2017, 11:28 AM. This revision was automatically updated to reflect the committed changes.
Revision Contents
Diff 118815
llvm/trunk/include/llvm/IR/IntrinsicsNVVM.td
llvm/trunk/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
llvm/trunk/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
llvm/trunk/lib/Target/NVPTX/NVPTXISelLowering.cpp
llvm/trunk/lib/Target/NVPTX/NVPTXIntrinsics.td
llvm/trunk/test/CodeGen/NVPTX/wmma.py