This is an archive of the discontinued LLVM Phabricator instance.

[mlir][NVGPU] Adding Support for cp_async_zfill via Inline Asm
ClosedPublic

Authored by manishucsd on Aug 19 2022, 5:28 PM.

Details

Summary

cp_async_zfill is currently not present in the nvvm backend, this patch adds cp_async_zfill support by adding inline asm when lowering from nvgpu to nvvm.

Diff Detail

Event Timeline

manishucsd created this revision.Aug 19 2022, 5:28 PM
manishucsd requested review of this revision.Aug 19 2022, 5:28 PM

Minor update

clang format

ThomasRaoux added inline comments.Aug 22 2022, 7:52 AM
mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
394

nit: remove empty line

423

I believe this naming is cutlass specific and I don't really see it anywhere in ptx or nvvm spec. I think it would be better to explain the difference between those two different version of cp.async and not name it cp_async_zfill

Great thanks for adding this!

Just a small question: What is the reason for using inline ASM? If you haven't already tried, adding the NVPTX backend support for intrinsics is pretty trivial. It's probably the same amount of code. If you search for past diffs that add NVPTX intrinsics, you can look at the files they modify and follow the same process.

Review comments.

manishucsd marked 2 inline comments as done.Aug 22 2022, 10:47 AM
ThomasRaoux accepted this revision.Aug 22 2022, 10:47 AM

Looks great!

This revision is now accepted and ready to land.Aug 22 2022, 10:47 AM

Great thanks for adding this!

Just a small question: What is the reason for using inline ASM? If you haven't already tried, adding the NVPTX backend support for intrinsics is pretty trivial. It's probably the same amount of code. If you search for past diffs that add NVPTX intrinsics, you can look at the files they modify and follow the same process.

I had suggested it that way to avoid having to add an intrinsic at this point as it would require adding a bunch of new intrinsics (there are 5 for cp.aync that would double it). I wonder if we want to duplicate the intrinsics or let the codegen match cp.async with cp_size == src_size. Anyway I was trying to avoid this discussion and effort at the moment. Maybe I overestimated the effort needed.

In any case I agree that we should add support to it in NVPTX long term.

Great thanks for adding this!

Just a small question: What is the reason for using inline ASM? If you haven't already tried, adding the NVPTX backend support for intrinsics is pretty trivial. It's probably the same amount of code. If you search for past diffs that add NVPTX intrinsics, you can look at the files they modify and follow the same process.

For supporting async copy with zero fills, inline asm seems to be the fastest route for now. Yeah, we would want to eventually support it via NVPTX backend.

Fixes discovered as I try and use this patch in codegen