This is an archive of the discontinued LLVM Phabricator instance.

[mlir][AMDGPU] Add lds_barrier op
ClosedPublic

Authored by krzysz00 on Jul 11 2022, 4:12 PM.

Details

Summary

The lds_barrier op allows workgroups to wait at a barrier for
operations to/from their local data store (LDS) to complete without
incurring the performance penalties of a full memory fence.
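
For illustration, a hypothetical usage sketch (not taken from this revision; the function and buffer names are invented): the op sits between accesses to workgroup-local (LDS, memory space 3) memory that must not be reordered across the barrier.

```mlir
// Hypothetical example: workitems exchange values through an LDS
// buffer. The lds_barrier ensures every workitem's store completes
// before any workitem reads its neighbor's slot.
func.func @exchange(%buf: memref<256xf32, 3>, %me: index,
                    %nbr: index, %v: f32) -> f32 {
  memref.store %v, %buf[%me] : memref<256xf32, 3>
  amdgpu.lds_barrier
  %r = memref.load %buf[%nbr] : memref<256xf32, 3>
  return %r : f32
}
```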

Diff Detail

Event Timeline

krzysz00 created this revision.Jul 11 2022, 4:12 PM
Herald added a project: Restricted Project.Jul 11 2022, 4:12 PM
krzysz00 requested review of this revision.Jul 11 2022, 4:12 PM

Is this really necessary if we have https://reviews.llvm.org/D120544? A barrier has no requirement to wait for LDS and VMEM; it only does so currently because of a bug. Using inline asm like this seems like it will eventually cause problems, although I'm not too familiar with MLIR.

This is a single op that expands to both a waitcnt on LDS and a barrier. I can go digging tomorrow for when we used to lower this to some sort of fence and a barrier, but I recall (@whchung may have more detail) that using this bit of inline assembly to work around what I think was the lack of an LDS-only fence gave a noticeable performance increase.

It's not really the lack of an LDS-only fence that was the problem. You could use the intrinsics for a barrier and a waitcnt for example. The problem with that idea is the backend would always wait for both LDS and VM at barriers when it saw them. So you would need to use inline asm to make the barrier invisible which could cause other problems in the backend with combining waitcnt for example. There is a lot of desire to reland https://reviews.llvm.org/D120544 in the near future. After that, the inline asm will not be needed for gfx90a+ and gfx10+. So then you should probably consider not using the inline asm for those architectures.

Re the above, you mentioned gfx90a+ ... what about gfx908?

And even if there'll be a better way to do it in the future, this type of barrier+fence construct, however we implement it, is useful for kernel developers now. We can always swap out the implementation later once the compiler fix lands.
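
For context, the construct under discussion amounts to roughly the following GCN assembly pair (an assumption based on this thread; the exact waitcnt encoding may differ across architectures):

```asm
s_waitcnt lgkmcnt(0)  ; wait for all outstanding LDS (lgkm) operations
s_barrier             ; workgroup-wide execution barrier
```

Emitting the pair as opaque inline asm is what hides the barrier from the backend's waitcnt insertion, which is the trade-off being debated above.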

nirvedhmeshram added inline comments.Jul 12 2022, 8:15 AM
mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
257

Should the op have side effects?

krzysz00 added inline comments.Jul 12 2022, 11:51 AM
mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
257

I don't think so.

nirvedhmeshram added inline comments.Jul 13 2022, 2:22 PM
mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
257

So can we change this to /*has_side_effects=*/false then?

@kerbowa I'd like to still land this because the lds_barrier is a useful abstraction over whatever combination of (fence + barrier)/inline assembly/... you end up needing to use to implement it.

mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
257

... right, that field. has_side_effects will get the compiler to not throw this out on the grounds that it doesn't write any registers. So yeah, in some sense, it does have side effects (because it needs to stay put).

I'm not objecting to the change, just pointing out that you may miss out on some optimizations since this is lowered to inline asm, and that you may want to lower it to intrinsics in the future.

mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
257

I think it should definitely be marked as having side effects. It may be scheduled around memops if it isn't. The backend cannot see what instructions are inside inline asm.

nirvedhmeshram accepted this revision.Jul 14 2022, 9:21 AM
This revision is now accepted and ready to land.Jul 14 2022, 9:21 AM
This revision was automatically updated to reflect the committed changes.