This is an archive of the discontinued LLVM Phabricator instance.

[MLIR] Add sm_90a integration test with `wgmma`
Needs Review · Public

Authored by guraypp on Aug 17 2023, 12:20 AM.

Details

Summary

This work first introduces a dedicated sm_90a integration test.

It also introduces a new test that uses sm_90a instructions such as TMA load and warpgroup matrix multiplication (aka wgmma).

Depends on D157382

Diff Detail

Event Timeline

guraypp created this revision.Aug 17 2023, 12:20 AM
Herald added a project: Restricted Project. Aug 17 2023, 12:20 AM
guraypp requested review of this revision.Aug 17 2023, 12:20 AM
manishucsd added inline comments.Aug 24 2023, 6:40 PM
mlir/test/Integration/GPU/CUDA/sm90a/gemm_m128n128k128_f32f16f16.mlir
185

Can you please explain or better draw out these desc updates?

I see four k-groups and two wgmma.mma_async.m64n128k16

247

Can you please explain how nvvm.wgmma.commit.group.sync.aligned and nvvm.wgmma.wait.group.sync.aligned 1 works here? Use the CTA Shape, wgmma math shape in your explanation as concrete example?

Also, are you able to make a functional test pass with random input distribution for operands?

guraypp added inline comments.Aug 25 2023, 5:16 AM
mlir/test/Integration/GPU/CUDA/sm90a/gemm_m128n128k128_f32f16f16.mlir
185

Right, there are 4 x wgmma.m64n128k16; together they complete 64x128x64.
The test name (gemm_m128n128k128_f32f16f16) is misleading; it should be gemm_m128n128k64_f32f16f16.
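The tile arithmetic behind this answer can be sketched as follows; the shapes are taken from the thread (one wgmma.mma_async.m64n128k16 instruction per k-group, four k-groups completing a 64x128x64 tile):

```python
# Sketch of the k-group decomposition discussed above.
# WGMMA_* is the shape of one wgmma.mma_async.m64n128k16 instruction;
# TILE_* is the per-warpgroup tile the four issues complete (64x128x64).
WGMMA_M, WGMMA_N, WGMMA_K = 64, 128, 16
TILE_M, TILE_N, TILE_K = 64, 128, 64

k_groups = TILE_K // WGMMA_K  # wgmma issues needed along K
print(k_groups)  # 4 x wgmma.mma_async.m64n128k16, matching the review comment
```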

247

Can you please explain how nvvm.wgmma.commit.group.sync.aligned and nvvm.wgmma.wait.group.sync.aligned 1 works here?

Regarding nvvm.wgmma.commit.group.sync.aligned: it commits the group of 8 wgmma operations, and nvvm.wgmma.wait.group.sync.aligned 1 waits for their completion.

Interestingly, when analyzing the PTX generated by CUTLASS, I see a trailing wgmma.wait_group.sync.aligned 0. The instruction sequence appears as follows:

wgmma.fence.sync.aligned;
8 x wgmma.mma_async;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 1;
...
wgmma.wait_group.sync.aligned 0;  <--- Do we need this?
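My reading of the commit/wait semantics in this sequence, as a small model (this is an illustration of the bookkeeping, not the hardware implementation): commit_group moves all outstanding async wgmma ops into a new group, and wait_group N blocks until at most N committed groups remain pending.

```python
# Toy model of wgmma commit/wait group bookkeeping (assumed semantics).
pending_groups = []  # committed groups of async wgmma ops, oldest first
outstanding = []     # wgmma.mma_async ops issued but not yet committed

def wgmma_mma_async(op):
    outstanding.append(op)

def commit_group():
    # Move all outstanding async ops into one newly committed group.
    global outstanding
    pending_groups.append(list(outstanding))
    outstanding = []

def wait_group(n):
    # Block until at most n committed groups are still pending
    # (modeled here by retiring the oldest groups).
    while len(pending_groups) > n:
        pending_groups.pop(0)

for i in range(8):
    wgmma_mma_async(f"wgmma_{i}")
commit_group()

wait_group(1)
after_wait_1 = len(pending_groups)  # 1: the committed group may still be in flight
wait_group(0)
after_wait_0 = len(pending_groups)  # 0: all 8 wgmma ops are guaranteed complete
```

Under this model, wait_group 1 alone does not guarantee the single committed group has finished, which would explain why CUTLASS emits the final wait_group 0 before the results are consumed.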

Also, are you able to make a functional test pass with random input distribution for operands?

Yes, but I need a more rigorous test. Right now there's a bug in the TMA path upstream: I create the TMA descriptor using cuTensorMapEncodeTiled, and I do something wrong there. Once I resolve that issue, I plan to create another test that uses random input data and verifies the output.

guraypp added inline comments.Aug 25 2023, 7:07 AM
mlir/test/Integration/GPU/CUDA/sm90a/gemm_m128n128k128_f32f16f16.mlir
247

Right now, there's a bug in the TMA in upstream.

I correct myself: it isn't a bug. TMA expects 128b alignment for 2D+ transfers. To cover the full tile, I need to execute 3 x `cp.async.bulk.tensor`:

  • 1 x [128][64] for matrix-A
  • 2 x [64][64] for matrix-B
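A quick arithmetic check of this split, assuming f16 operands (2 bytes/element) as suggested by the test name; the byte-alignment interpretation at the end is my assumption, not stated in the thread:

```python
# Box shapes of the three cp.async.bulk.tensor transfers listed above.
a_boxes = [(128, 64)]            # 1 x [128][64] for matrix A
b_boxes = [(64, 64), (64, 64)]   # 2 x [64][64] for matrix B

def elems(boxes):
    return sum(rows * cols for rows, cols in boxes)

assert elems(a_boxes) == 128 * 64   # A tile covered by one transfer
assert elems(b_boxes) == 128 * 64   # B tile covered by two transfers

# Assumption: with f16, an innermost box extent of 64 elements is
# 64 * 2 = 128 bytes, which would line up with the 128b alignment
# constraint mentioned above.
inner_bytes = 64 * 2
```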