This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Implemented wmma intrinsics and instructions.
ClosedPublic

Authored by tra on Oct 6 2017, 1:51 PM.

Details

Summary

WMMA = "Warp Level Matrix Multiply-Accumulate".
These are the new instructions introduced in PTX6.0 and available
on sm_70 GPUs.

WMMA.MMA wins a special medal for having 8 return values and 26 arguments. :-)
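As a rough sketch of the shape at the IR level (illustrative names only; the authoritative definitions are in IntrinsicsNVVM.td in this patch), an f16 fragment of the 16x16x16 tile occupies eight <2 x half> registers, and the load intrinsics return the whole fragment as one aggregate:

; Sketch only; the real names and signatures are in IntrinsicsNVVM.td.
%fragment = type { <2 x half>, <2 x half>, <2 x half>, <2 x half>,
                   <2 x half>, <2 x half>, <2 x half>, <2 x half> }

; A load returns all 8 registers of a fragment as one aggregate...
declare %fragment @llvm.nvvm.wmma.load.a.sync.row.m16n16k16.f16(i8*, i32)

; ...and the MMA intrinsic takes the A, B, and C fragments as individual
; register arguments (which accounts for most of the 26 parameters) and
; returns the D fragment, i.e. the 8 return values.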

Event Timeline

tra created this revision. Oct 6 2017, 1:51 PM
tra updated this revision to Diff 118298. Oct 9 2017, 5:18 PM

No need for Optional<> in getWmmaLdVariant().

tra updated this revision to Diff 118303. Oct 9 2017, 7:16 PM

Changed names of MMA intrinsics and instructions to use <typeD>.<typeC> order to match nomenclature used in CUDA headers.

Artem, thanks a lot for working on this!

I notice that you are taking a different approach to defining the llvm wmma intrinsics than what we (NVIDIA) do.

http://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#nvvm-intrin-warp-level-matrix

Specifically, yours embeds the layout/memory space in the intrinsic name, while ours treats them as constant arguments. We did this to reduce the number of intrinsic functions the optimizer and codegen have to deal with. We have plans for more wmma features in the next few CUDA releases, so it would be better to unify the syntax and naming of the wmma intrinsics. It would also make cross-support much easier.
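For illustration (the names here are approximate, not the exact spellings of either scheme), the difference looks roughly like this:

; This patch: one intrinsic per {layout, space, type} combination.
%frag = type { <2 x half>, <2 x half>, <2 x half>, <2 x half>,
               <2 x half>, <2 x half>, <2 x half>, <2 x half> }
declare %frag @llvm.nvvm.wmma.load.a.row.f16(i8*, i32)
declare %frag @llvm.nvvm.wmma.load.a.col.f16(i8*, i32)
declare %frag @llvm.nvvm.wmma.load.a.row.shared.f16(i8 addrspace(3)*, i32)
; ...and so on for every combination.

; NVVM-IR style: a single intrinsic; the layout is a constant i32
; argument, and the memory space is carried by the pointer operand.
declare %frag @llvm.nvvm.hmma.m16n16k16.ld.a(i8*, i32, i32)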

Would you be able to revise the patch? Highly appreciated.

Thanks.

Yuan

jlebar edited edge metadata. Oct 11 2017, 11:35 AM

(Responding to Yuan; this also dumps my in-progress comments on the code, sorry for the noise.)

Specifically, yours embeds the layout/memory space in the intrinsic name, while ours treats them as constant arguments. We did this to reduce the number of intrinsic functions the optimizer and codegen have to deal with.

How does having more intrinsic functions cause a problem for the optimizer / codegen?

llvm/test/CodeGen/NVPTX/wmma.py
16

I guess it's a nit, but perhaps

"{%s}" % ", ".join(...)

? I'd certainly find that more readable.

21

Nit, space around the asterisk, as with other operators.

150

Is there some reason we're using \t rather than indenting with spaces? That might make this easier to read...

154

s/values/params/?

185

Space after ,

188

test_params["values"] = ...?

189

Missing space after ,

tra updated this revision to Diff 118665. Oct 11 2017, 11:59 AM
tra marked 6 inline comments as done.

Addressed Justin's comments.

llvm/test/CodeGen/NVPTX/wmma.py
154

'args' would work even better. Replaced.

188

The line then gets too long, and there's no readable way to split it.

We took this approach to reduce the number of intrinsic functions that opt and codegen have to deal with; for example, to have one ld_a_f16 instead of 12. It simplifies our code logic. Take the address space optimization as an example: when we translate a generic load to a specific load, we can just change the pointer type. The rest is just copied over.

Take the address space optimization as an example: when we translate a generic load to a specific load, we can just change the pointer type. The rest is just copied over.

Yeah, specializing on an inferred address space is an important optimization. We were planning to handle this in tablegen, by pattern-matching wmma.load/store(addrspacecast(ptr, x)). This should let it Just Work, without any need for analysis beyond what we already have.
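Concretely (reusing the illustrative names from above), the tablegen pattern would fire on IR like this and select the .shared form of the instruction directly:

%frag = type { <2 x half>, <2 x half>, <2 x half>, <2 x half>,
               <2 x half>, <2 x half>, <2 x half>, <2 x half> }
declare %frag @llvm.nvvm.wmma.load.a.row.f16(i8*, i32)

; %p is known to be in shared memory; the cast to the generic space is
; visible to instruction selection, which can pick the .shared variant.
define %frag @load_a_from_shared(i8 addrspace(3)* %p) {
  %gp = addrspacecast i8 addrspace(3)* %p to i8*
  %v = call %frag @llvm.nvvm.wmma.load.a.row.f16(i8* %gp, i32 16)
  ret %frag %v
}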

llvm/include/llvm/IR/IntrinsicsNVVM.td
3882

I know it's tablegen and going to be ugly no matter what we do, but could we indent this as

!if(!eq(Abc#Type,"cf16"),
    [regty, regty, regty, regty],
    [regty, regty, regty, regty, regty, regty, regty, regty]),

? Right now it looks like these arrays are parameters to eq, which is confusing.

3888

Any reason Space must contain a . but Type must not contain one?

3932

Should this be indented 2 fewer spaces?

llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
499

This and the #define could use a brief comment, I think.

tra updated this revision to Diff 118685. Oct 11 2017, 1:41 PM

Fixed text alignment in IntrinsicsNVVM.td

What if the addrspacecast() is in a different basic block?

tra added a comment. Oct 11 2017, 1:58 PM

We took this approach to reduce the number of intrinsic functions that opt and codegen have to deal with; for example, to have one ld_a_f16 instead of 12.

Reducing the number of intrinsics does not change the root cause of the complexity here: PTX encodes instruction parameters in instruction *names*. Even with a reduced number of intrinsics that map onto these instructions, someone, somewhere, will have to match them to the appropriate instruction variant. A 1:1 mapping is relatively simple to implement with tablegen and is sufficient for its intended use of generating a specific instruction variant.
The patch does not block further optimizations, either. E.g. with a bit of tablegen magic it should be possible to pattern-match ld_a_f16(addrspacecast(SHARED)) and replace it with ld_a_f16_shared.

Just in case: the naming of the intrinsics is also different. Intrinsics in the patch are llvm.nvvm.*W*mma, while the intrinsics in the NVVM-IR spec use llvm.nvvm.*H*mma. For all practical purposes they should not conflict in any way with your downstream implementation.

It simplifies our code logic.

If NVIDIA were to send a patch with an implementation of NVVM-IR-style intrinsics, I would be glad to help review it and get it into LLVM.

Take the address space optimization as an example: when we translate a generic load to a specific load, we can just change the pointer type. The rest is just copied over.

I don't think this patch prevents optimizations like these.

llvm/include/llvm/IR/IntrinsicsNVVM.td
3882

Grr. I wish there was clang-format for tablegen. And python. And whatever I write. :-) Fixed.

3888

Space is optional: for generic access Space is "", otherwise it's either ".shared" or ".global", so the optional variant has to carry its own leading dot.
Non-optional parameters always have a separating dot added for them, so they are specified without one.

3932

Nope. It's one of the arguments of !listconcat() and is indented as such.

All right, I've finished my first pass. This wasn't as scary as it looked. :)

What if the addrspacecast() is in a different basic block?

Thinking about this more, most or all of the time we should be able to pattern-match on the address space of the pointer operand to the load/store, without any messing around following chains of addrspacecasts.

Right now we'd have to add the intrinsics to rewriteIntrinsicOperands in InferAddressSpaces, but there's a TODO in there to improve that.
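In other words (still with the illustrative names from above, and assuming the intrinsics are overloaded on their pointer operand), once InferAddressSpaces can rewrite the operand, the cast disappears and selection only has to look at the pointer's address space:

%frag = type { <2 x half>, <2 x half>, <2 x half>, <2 x half>,
               <2 x half>, <2 x half>, <2 x half>, <2 x half> }
declare %frag @llvm.nvvm.wmma.load.a.row.f16.p3i8(i8 addrspace(3)*, i32)

define %frag @load_a_from_shared(i8 addrspace(3)* %p) {
  ; No addrspacecast left to chase across basic blocks; the operand type
  ; itself says "shared", so selection can emit the .shared variant.
  %v = call %frag @llvm.nvvm.wmma.load.a.row.f16.p3i8(i8 addrspace(3)* %p, i32 16)
  ret %frag %v
}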

tra updated this revision to Diff 118691. Oct 11 2017, 2:21 PM
tra marked 4 inline comments as done.

Added an explanation for WMMA_VARIANT macro and related enum.

Thanks for pointing out the hmma vs wmma difference in the names. That should avoid the naming conflict and is less confusing.

jlebar accepted this revision. Oct 11 2017, 5:41 PM
This revision is now accepted and ready to land. Oct 11 2017, 5:41 PM
This revision was automatically updated to reflect the committed changes.