This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] generate correct MMA instruction mnemonics with PTX63+.
ClosedPublic

Authored by tra on Mar 14 2019, 3:17 PM.

Details

Summary

PTX 6.3 requires ".aligned" in the MMA instruction names.
To generate the correct name, we now pass the current
PTX version to each instruction as an extra constant operand,
and InstPrinter adjusts its output accordingly.
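
As an illustration of the rule described in the summary (not the actual InstPrinter code), here is a minimal Python sketch. The function name is hypothetical, and encoding the PTX version as major*10+minor (so 63 means PTX 6.3) is an assumption made for the example:

```python
def wmma_load_mnemonic(ptx_version: int, frag: str = "a", layout: str = "row") -> str:
    """Build a wmma.load mnemonic for an f16 m16n16k16 fragment in shared memory.

    ptx_version is assumed to be encoded as major*10+minor (63 == PTX 6.3).
    """
    # PTX 6.3 and newer require the ".aligned" qualifier after ".sync".
    aligned = ".aligned" if ptx_version >= 63 else ""
    return f"wmma.load.{frag}.sync{aligned}.{layout}.m16n16k16.shared.f16"


print(wmma_load_mnemonic(63))  # mnemonic with ".aligned"
print(wmma_load_mnemonic(61))  # mnemonic without ".aligned"
```

The point of passing the version as an operand is that the choice can be made at print time, per instruction, rather than by defining separate instruction variants for each PTX version.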

Event Timeline

tra created this revision.Mar 14 2019, 3:17 PM
jlebar edited reviewers, added: timshen; removed: jlebar.Mar 14 2019, 4:17 PM
jlebar added a subscriber: jlebar.
timshen added inline comments.Mar 18 2019, 4:50 PM
llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
53

Rebase onto D59389?

llvm/test/CodeGen/NVPTX/wmma.py
244

Who is supposed to run this script? Can we check in the result of this script and make it part of the regression tests?

Relatedly, for other backends we have a framework for it. See llvm/utils/update_llc_test_checks.py. The generated file looks like llvm/test/CodeGen/PowerPC/atomics-regression.ll.

One of the advantages of checking in the generated file is that any succeeding behavioral changes are reflected in the patch.

timshen added inline comments.Mar 18 2019, 4:53 PM
llvm/test/CodeGen/NVPTX/wmma.py
244

Who is supposed to run this script?

I guess I can answer this part - lit. Still, it'd be great to check in the generated .ll files with RUN lines in them.

tra updated this revision to Diff 191213.Mar 18 2019, 5:08 PM
tra marked an inline comment as done.

Rebased on updated D59389

tra marked an inline comment as done.Mar 18 2019, 5:15 PM
tra added inline comments.
llvm/test/CodeGen/NVPTX/wmma.py
244

The script is executed by lit, which then runs llc on the generated output and checks the resulting PTX.

I'm not convinced that committing generated .ll has much value -- it's in the ballpark of a megabyte of uninteresting boilerplate mostly consisting of enumerating 4- and 8-tuples of arguments and results. Upcoming changes will bring more supported types and will multiply the amount of generated ll without making it any more interesting for humans.

e.g just one function out of *a lot*:

declare {<2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>} @llvm.nvvm.wmma.m16n16k16.load.a.row.f16.p3i8(i8 addrspace(3)* %src );

; CHECK-LABEL: .func {{.*}}test_llvm_nvvm_wmma_m16n16k16_load_a_row_f16_p3i8(
define {<2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>} @test_llvm_nvvm_wmma_m16n16k16_load_a_row_f16_p3i8(i8 addrspace(3)* %src ) {
; CHECK: wmma.load.a.sync.aligned.row.m16n16k16.shared.f16
; CHECK: {{{%hh[0-9]+, *%hh[0-9]+, *%hh[0-9]+, *%hh[0-9]+, *%hh[0-9]+, *%hh[0-9]+, *%hh[0-9]+, *%hh[0-9]+}}}
; CHECK: [%rd{{[0-9]+}}]
  %v0 = call {<2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>} @llvm.nvvm.wmma.m16n16k16.load.a.row.f16.p3i8(i8 addrspace(3)* %src );
  ret {<2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>} %v0;
}

It's easy enough to generate or grab the one done by lit -- the name would be right there in the failing command.

timshen accepted this revision.Mar 18 2019, 5:44 PM
timshen added inline comments.
llvm/test/CodeGen/NVPTX/wmma.py
244

I'm not convinced that committing generated .ll has much value -- it's in the ballpark of a megabyte of uninteresting boilerplate mostly consisting of enumerating 4- and 8-tuples of arguments and results. Upcoming changes will bring more supported types and will multiply the amount of generated ll without making it any more interesting for humans.

test/CodeGen/X86 already does this:
...src/llvm-project/llvm % wc -c $(grep 'autogenerated by utils/update_llc' -r test/CodeGen/X86 -l) | tail -1
37287954 total

It has 37MB of autogenerated .ll files, presumably for its massive intrinsics.

e.g just one function out of *a lot*:

Compared to test/CodeGen/X86/avx512vl-vec-masked-cmp.ll it isn't that bad. ;)

However, I realized that wmma.py is somewhat different from utils/update_llc_test_checks.py

What utils/update_llc_test_checks.py does:

  • Run llc on the *arbitrary* input IR and get the asm output.
  • Use regex replacement to turn the asm into CHECK lines. The regexes are different for different targets.
  • Print out the .ll file with those CHECK lines.

What wmma.py does:

  • Enumerate all possible combinations of wmma IR inputs.
  • Generate the CHECK lines directly using the same wmma-specific knowledge that generates the IR.
  • Print out the .ll file with the CHECK lines.
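
A rough sketch of that style of generator, for illustration only. It is not the real wmma.py: the helper name is hypothetical, only f16 loads from shared memory are covered, and the intrinsic names for the `b` fragment and `col` layout are assumed to follow the same pattern as the `a`/`row` example quoted earlier in this thread.

```python
from itertools import product

# Eight <2 x half> values, as in the load.a example quoted in this thread.
RET = "{" + ", ".join(["<2 x half>"] * 8) + "}"
# FileCheck pattern for the eight result registers.
REGS = "{{{" + ", *".join(["%hh[0-9]+"] * 8) + "}}}"


def gen_load_test(frag: str, layout: str) -> str:
    """Emit IR plus CHECK lines for one wmma load intrinsic (hypothetical helper)."""
    name = f"llvm.nvvm.wmma.m16n16k16.load.{frag}.{layout}.f16.p3i8"
    test = "test_" + name.replace(".", "_")
    args = "i8 addrspace(3)* %src"
    return "\n".join([
        f"declare {RET} @{name}({args});",
        "",
        f"; CHECK-LABEL: .func {{{{.*}}}}{test}(",
        f"define {RET} @{test}({args}) {{",
        # The expected PTX comes from the same knowledge that built the IR.
        f"; CHECK: wmma.load.{frag}.sync.aligned.{layout}.m16n16k16.shared.f16",
        f"; CHECK: {REGS}",
        "; CHECK: [%rd{{[0-9]+}}]",
        f"  %v0 = call {RET} @{name}({args});",
        f"  ret {RET} %v0;",
        "}",
    ])


# Enumerate every fragment/layout combination and print one test per combination.
tests = [gen_load_test(f, l) for f, l in product("ab", ["row", "col"])]
print("\n\n".join(tests))
```

Note that llc is never consulted here: the CHECK lines are written from the generator's own model of what the PTX should look like.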

The key difference is that update_llc_test_checks.py won't be wmma-specific.

Another crucial difference is that wmma.py generates very generic check-lines like [%rd{{[0-9]+}}], while update_llc_test_checks.py usually prints out the exact literal it extracts from the asm result, e.g. %rd1.
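
To make the difference concrete, here is a small sketch that approximates FileCheck matching in Python. It is a deliberate simplification and an assumption of this example, not FileCheck itself: only `{{regex}}` blocks are handled, everything else is treated as literal text, and matching is a substring search.

```python
import re


def filecheck_to_regex(pattern: str) -> re.Pattern:
    """Approximate a FileCheck pattern: {{...}} is a regex, the rest is literal."""
    # re.split with one capture group alternates literal text and regex bodies.
    parts = re.split(r"\{\{(.*?)\}\}", pattern)
    pieces = [part if i % 2 else re.escape(part) for i, part in enumerate(parts)]
    return re.compile("".join(pieces))


generic = filecheck_to_regex("[%rd{{[0-9]+}}]")  # wmma.py style
literal = filecheck_to_regex("[%rd1]")           # update_llc_test_checks.py style

for operand in ("[%rd1]", "[%rd42]"):
    print(operand, bool(generic.search(operand)), bool(literal.search(operand)))
```

The generic pattern accepts any register number, so it survives register-allocation changes but says less about the exact output; the literal pattern pins the output down exactly.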

As a result, wmma.py's output isn't as readable as I thought it would be (fewer literals), so I'm fine without checking in the wmma.py-generated files.

However, I encourage some of the NVPTX contributors (!) to add NVPTX support to update_llc_test_checks.py. With that, we could have supported wmma.py almost for free, along with all other kinds of PTX regression tests.

This revision is now accepted and ready to land.Mar 18 2019, 5:44 PM
tra marked an inline comment as done.Mar 19 2019, 11:26 AM
tra added inline comments.
llvm/test/CodeGen/NVPTX/wmma.py
244

IIUC, update_llc_test_checks.py effectively freezes the output generated by llc *now* so it can be checked for regressions later.

The wmma.py use case is different, at least for me -- I use it as a way to *create* the reference output that llc can't generate yet and then use it to make sure my NVPTX back-end changes do the right thing.
That said, once the back-end functionality is implemented, it becomes just a 'compare to the reference' test and the task of generating CHECK lines can indeed be offloaded to update_llc_test_checks.py.

I'll think about splitting these two use cases. Perhaps I should keep the script to aid with development but, once that's done, generate the reference .ll with the implemented intrinsics and let update_llc_test_checks.py generate the checks for the generated PTX.

This revision was automatically updated to reflect the committed changes.