This is an archive of the discontinued LLVM Phabricator instance.

[mlir][nvgpu] Add NVGPU dialect (architectural specific gpu dialect)
ClosedPublic

Authored by ThomasRaoux on Apr 6 2022, 4:08 PM.

Details

Summary

This introduces a new dialect for vendor-specific PTX operations. It
also adds the first operation, ldmatrix, as an example; more operations
will be added in follow-up patches.
The new dialect is meant to be a bridge between the GPU and Vector
dialects and the NVVM dialect.

This is based on the RFC proposed here:
https://discourse.llvm.org/t/rfc-add-nv-gpu-dialect-hw-specific-extension-of-gpu-dialect-for-nvidia-gpus/61466
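
A rough sketch of the intended usage of the first op (the result type, attribute names, and exact assembly syntax are illustrative and may differ from the op as defined in this patch):

  // Collectively load matrix tiles from shared memory (address space 3)
  // into per-thread vector fragments.
  %frag = nvgpu.ldmatrix %sm[%c0, %c0] {numTiles = 4 : i32, transpose = false}
      : memref<128x128xf16, 3> -> vector<4x2xf16>

The op stays at the memref/vector level while mapping closely to the PTX ldmatrix instruction, which is the point of bridging the GPU and Vector dialects to NVVM.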

Diff Detail

Event Timeline

ThomasRaoux created this revision.Apr 6 2022, 4:08 PM
ThomasRaoux requested review of this revision.Apr 6 2022, 4:08 PM
ThomasRaoux edited the summary of this revision. (Show Details)Apr 6 2022, 4:08 PM
ThomasRaoux added a reviewer: christopherbate.

Looks good to me. I am happy to stamp, but will wait for folks to weigh in.

bondhugula added a subscriber: bondhugula.

Looks good to me. I am happy to stamp, but will wait for folks to weigh in.

I'd like to see more discussion on this -- posted some questions here: https://discourse.llvm.org/t/rfc-add-nv-gpu-dialect-hw-specific-extension-of-gpu-dialect-for-nvidia-gpus/61466/10?u=bondhugula

Rename dialect to nvgpu

ThomasRaoux retitled this revision from [mlir][nvptx] Add NVPTX dialect (architectural specific gpu dialect) to [mlir][nvgpu] Add NVGPU dialect (architectural specific gpu dialect).Apr 7 2022, 12:18 AM

Looks good to me. I am happy to stamp, but will wait for folks to weigh in.

I'd like to see more discussion on this -- posted some questions here: https://discourse.llvm.org/t/rfc-add-nv-gpu-dialect-hw-specific-extension-of-gpu-dialect-for-nvidia-gpus/61466/10?u=bondhugula

I renamed the dialect as suggested. If there are any fundamental points you think should be discussed, please bring them up on Discourse, or feel free to comment on more detailed cases here on the review.

herhut accepted this revision.Apr 13 2022, 8:32 AM

Looks good to me. We really need to figure out a way to group dialects :)

Please also wait for @bondhugula, who had concerns.

This revision is now accepted and ready to land.Apr 13 2022, 8:32 AM

Looks good to me. We really need to figure out a way to group dialects :)

Please also wait for @bondhugula, who had concerns.

Thanks @herhut. @bondhugula, do you still have any concerns?

This revision was landed with ongoing or failed builds.Apr 14 2022, 10:03 AM
This revision was automatically updated to reflect the committed changes.

Looks good to me. We really need to figure out a way to group dialects :)

Please also wait for @bondhugula, who had concerns.

Thanks @herhut. @bondhugula, do you still have any concerns?

This looks fine to me. It was only a matter of time before a dialect like this was created. We still have to be cautious about deciding what goes into the GPU dialect vs NVGPU dialect and the lowering paths for the ops that are added here.

We still have to be cautious about deciding what goes into the GPU dialect vs NVGPU dialect and the lowering paths for the ops that are added here.

Yes, I definitely agree.

mlir/include/mlir/Dialect/NVGPU/NVGPU.td
66

Is numTiles the same as the .num attribute in the PTX ISA doc?

67

The PTX doc specifically mentions 16b elements. Did you want to tighten the type here, or allow more relaxed semantics with an implicit bitcast and have the verifier only check the final bit length?

Hmm, actually, what about the fact that the shape seems to be prescribed to exactly 8x8xf16?
Do you want the op to model exactly that, or relax it?

mlir/include/mlir/Dialect/NVGPU/NVGPU.td
66

Yes, but I thought numTiles was more descriptive, see my response to the other question below.

67

I originally authored this code, and it was already merged in D123647; here it is just code movement.

It is meant to be relaxed vs. the stated requirement of 8x8xf16. You can restate the 8x8xf16 tile specification as 8x(4x32b) or 8x16-byte tiles, and functionally it will work out. In fact, the NVVM ldmatrix intrinsic in the backend returns i32 values, which then need to be bitcast into 2xf16 or 4xi8, etc. We have tests covering all those cases in the NVVM dialect, but we do need to follow up here with a verifier for this operation.
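
As a hypothetical sketch of that relationship (the exact assembly syntax, pointer types, and attribute spellings are illustrative and not taken from this patch):

  // The NVVM-level intrinsic returns raw i32 registers:
  %raw = nvvm.ldmatrix %ptr {num = 4 : i32, layout = #nvvm.mma_layout<row>}
      : (!llvm.ptr<f16, 3>) -> !llvm.struct<(i32, i32, i32, i32)>
  // Each 32-bit register is then reinterpreted into two f16 (or four i8) elements:
  %r0 = llvm.extractvalue %raw[0] : !llvm.struct<(i32, i32, i32, i32)>
  %v0 = llvm.bitcast %r0 : i32 to vector<2xf16>

Since the bit widths line up, the nvgpu-level op can expose the result directly as an f16 vector fragment while the lowering handles the per-register reinterpretation.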