This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Optimize waitcnt insertion for flat memory operations
ClosedPublic

Authored by t-tye on Oct 17 2020, 12:25 AM.

Details

Summary

Change waitcnt insertion to check the memory operand tokens to see if
flat memory operations access VMEM in the same way it does to check if
accessing LDS. This avoids adding waitcnt for counters for address
spaces that are not accessed.

In addition, only generate the pessimistic waitcnt 0 if a flat memory
operation appears to access both VMEM and LDS.

This benefits flat memory operations that explicitly specify the
address space as GLOBAL or LOCAL.

Diff Detail

Event Timeline

t-tye created this revision.Oct 17 2020, 12:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 17 2020, 12:25 AM
t-tye requested review of this revision.Oct 17 2020, 12:25 AM
arsenm added inline comments.Oct 19 2020, 8:22 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1248

When is usesVM_CNT false and isFLAT true?

LGTM modulo that inline question.

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1248

I'd like to know as well. It may be better to make this an assert(TII->usesVM_CNT(Inst)) instead.

rampitec added inline comments.Oct 19 2020, 11:55 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1202

Could you add assert (TII->isFLAT(MI)) at the top of the function?

1248

Second to that, turn the check into assert. There are no such instructions at least so far.

llvm/test/CodeGen/AMDGPU/waitcnt.mir
67–68

That one was not supposed to change? The pointer is flat here.

rampitec added inline comments.Oct 19 2020, 1:36 PM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1207

It should check for local or flat here.

t-tye updated this revision to Diff 299234.Oct 19 2020, 6:35 PM
t-tye marked 4 inline comments as done.

Address review comments.

t-tye marked an inline comment as not done.Oct 19 2020, 6:48 PM
t-tye added inline comments.
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1207

We do not want to test for local and flat as this method is checking to VMEM not LDS. Instead, we want to check for all the address spaces that are legal for a flat operation and involve VMEM. Looking at the enumeration of address spaces there are quite a few that are valid for flat that involve VMEM. Forxample, global, flat, contant, private, the ones involving buffers, etc. Since flat only supports LDS, FLAT, and the address spaces that involve VMEM the clearest test here is to find address spaces that are not LDS. They are the ones that may be VMEM.

I considered asserting if any address space was found that was not legal for a flat operation. For example region (GDS) is not valid. But is there an existing predicate to answer that question?

1248

This was in the original code but I agree it makes little sense. So moved it as an assert into mayAccessVMEMThroughFlat().

llvm/test/CodeGen/AMDGPU/waitcnt.mir
67–68

Yes. Previously it was "s_waitcnt vmcnt(0) lgkmcnt(0)". Now it is "s_waitcnt vmcnt(0)" as the address space of global16 is 1 which is GLOBAL. Therefore there is no need to wait on LGKM.

t-tye updated this revision to Diff 299238.Oct 19 2020, 6:53 PM

Really upload the review feedback changes.

t-tye updated this revision to Diff 299252.Oct 19 2020, 9:13 PM

Fix clang format warnings.

rampitec requested changes to this revision.Oct 19 2020, 11:33 PM

The patch clearly ignores existence of flat pointers with the test failing.

llvm/test/CodeGen/AMDGPU/waitcnt.mir
67–68

It is not global, it is flat:

<4 x i32>* %flat16
...
    $vgpr3_vgpr4_vgpr5_vgpr6 = FLAT_LOAD_DWORDX4 $vgpr7_vgpr8, 0, 0, 0, 0, implicit $exec, implicit $flat_scr :: (load 16 from %ir.flat16)
This revision now requires changes to proceed.Oct 19 2020, 11:33 PM
rampitec added a comment.EditedOct 20 2020, 12:02 AM

JFYI how much it will help actual programs after it is fixed is unclear. It will likely change a lot of lit tests, but actual effect on real programs would depend on FE and language rules. And inlining of course, as usual.

t-tye added inline comments.Oct 20 2020, 12:56 AM
llvm/test/CodeGen/AMDGPU/waitcnt.mir
67–68

But isn't this test checking:

$vgpr0 = FLAT_LOAD_DWORD $vgpr1_vgpr2, 0, 0, 0, 0, implicit $exec, implicit $flat_scr :: (load 4 from %ir.global4)
$vgpr3_vgpr4_vgpr5_vgpr6 = FLAT_LOAD_DWORDX4 $vgpr7_vgpr8, 0, 0, 0, 0, implicit $exec, implicit $flat_scr :: (load 16 from %ir.global16)

These are referencing global4 and global16 which are:

i32 addrspace(1)* %global4,
<4 x i32> addrspace(1)* %global16

Which are both marked as the global (1) not flat (0) address space.

Am I missing something?

JFYI how much it will help actual programs after it is fixed is unclear. It will likely change a lot of lit tests, but actual effect on real programs would depend on FE and language rules. And inlining of course, as usual.

It did change 46 lit tests. I agree it is unclear how much it will help. But the GLOBAL and SCRATCH flat operations seem like they may avoid the pessimistic waitcnt 0.

JFYI how much it will help actual programs after it is fixed is unclear. It will likely change a lot of lit tests, but actual effect on real programs would depend on FE and language rules. And inlining of course, as usual.

It did change 46 lit tests. I agree it is unclear how much it will help. But the GLOBAL and SCRATCH flat operations seem like they may avoid the pessimistic waitcnt 0.

Right. Out of these 46 lit tests I was looking for for a very specific one, wanting to ask to write one if it does not exist. This one does exist and it is failing.

llvm/test/CodeGen/AMDGPU/waitcnt.mir
67–68

No, it is not. Note it first checks label bb.2. And after it:

$vgpr3_vgpr4_vgpr5_vgpr6 = FLAT_LOAD_DWORDX4 $vgpr7_vgpr8, 0, 0, 0, 0, implicit $exec, implicit $flat_scr :: (load 16 from %ir.flat16)

It is flat pointer. Not global.

Think about the testcase itself: it is a standalone function (not kernel) taking a generic pointer. You are checking for the question: "is is this DEFINITELY an LDS pointer?" The answer is no, so you say: "this is DEFINITELY NOT an LDS pointer".

67–68

Or VMEM for that matter.

t-tye added a comment.Oct 20 2020, 1:24 AM

JFYI how much it will help actual programs after it is fixed is unclear. It will likely change a lot of lit tests, but actual effect on real programs would depend on FE and language rules. And inlining of course, as usual.

It did change 46 lit tests. I agree it is unclear how much it will help. But the GLOBAL and SCRATCH flat operations seem like they may avoid the pessimistic waitcnt 0.

Right. Out of these 46 lit tests I was looking for a very specific one, wanting to ask to write one if it does not exist. This one does exist and it is failing.

Which test is failing? All the lit tests are passing on my machine. Or are you questioning the way the CHECK tests have been updated? The original test is marking the FLAT pointer as referencing the GLOBAL address space. I assume this is what the frontend did to match the CUDA language semantics that say kernel arguments can only reference global memory. So I believe the generated code is correct unless I am missing something.

JFYI how much it will help actual programs after it is fixed is unclear. It will likely change a lot of lit tests, but actual effect on real programs would depend on FE and language rules. And inlining of course, as usual.

It did change 46 lit tests. I agree it is unclear how much it will help. But the GLOBAL and SCRATCH flat operations seem like they may avoid the pessimistic waitcnt 0.

Right. Out of these 46 lit tests I was looking for a very specific one, wanting to ask to write one if it does not exist. This one does exist and it is failing.

Which test is failing? All the lit tests are passing on my machine. Or are you questioning the way the CHECK tests have been updated? The original test is marking the FLAT pointer as referencing the GLOBAL address space. I assume this is what the frontend did to match the CUDA language semantics that say kernel arguments can only reference global memory. So I believe the generated code is correct unless I am missing something.

waitcnt.mir. It was updated to pass the testing and this update basically flushes the test. It has nothing to do with CUDA, language is irrelevant here. Even more when we speak about functions.

rampitec added inline comments.Oct 20 2020, 10:40 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1207

Hm.. Right, it should return true for generic here. It seems the test (waitcnt.mir) itself does not test what is expected. So we need a test with a real flat pointer, load and full wait.

t-tye updated this revision to Diff 299472.Oct 20 2020, 2:22 PM

Address review comments.

rampitec added inline comments.Oct 20 2020, 2:24 PM
llvm/test/CodeGen/AMDGPU/waitcnt.mir
96

Can you keep just load from flat here? The other load obscures the result.

t-tye updated this revision to Diff 299474.Oct 20 2020, 2:25 PM

Cleanup commit message.

t-tye marked 2 inline comments as done.Oct 20 2020, 2:28 PM
t-tye added inline comments.
llvm/test/CodeGen/AMDGPU/waitcnt.mir
67–68

I believe the waitcnts are correct, and added the extra test you recommended.

96

Add the extra BB3 you suggested.

The waitcnts being generated seem correct from my inspection.

rampitec added inline comments.Oct 20 2020, 2:31 PM
llvm/test/CodeGen/AMDGPU/waitcnt.mir
96

They seem to be correct, but with two loads per block it is hard to understand which of the loads has actually caused the wait. If you want to keep it this way, add yet another bb.4, but with only a load from flat.

t-tye updated this revision to Diff 299494.Oct 20 2020, 3:34 PM
t-tye marked 2 inline comments as done.

Add test with a single flat load to check that the waitcnt is 0.

rampitec accepted this revision.Oct 20 2020, 3:35 PM

LGTM. Thank you!

This revision is now accepted and ready to land.Oct 20 2020, 3:35 PM
t-tye marked 4 inline comments as done.Oct 20 2020, 3:38 PM
t-tye added inline comments.
llvm/test/CodeGen/AMDGPU/waitcnt.mir
67–68

On checking the test the waitcnts do seem correct because the registers being waited on are produced by loads in earlier basic blocks. Those earlier loads are either global, or they are flat but there is intervening waitcnt that satisfies a vmemcnt(0). Add two additional basic blocks to test this better.

96

Add a bb.4 that has a single load from flat.

This revision was landed with ongoing or failed builds.Oct 20 2020, 3:56 PM
This revision was automatically updated to reflect the committed changes.
t-tye marked an inline comment as done.