This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SIInsertWait: Skip dummy tied source
ClosedPublic

Authored by ruiling on Dec 22 2022, 2:35 AM.

Details

Summary

For D16 memory load instructions, the hardware usually only write to half
of the 32bit register, but we define the destination register using
32bit register for the MachineIR instruction. Without the extra tied
source register, LLVM framework will think previous write to the other
half of the register being dead. This is because by using 32bit register
as the destination register, LLVM will think the instruction will always
overwrite the whole 32bit register. By adding the extra tied source,
LLVM will think we are reading the register, so previous write to the
register will not be dead. This dummy tied source is introducing
unnecessary read-after-write dependency. The change here is to bypass the
tied source that can be skipped, thus avoiding an unnecessary s_waitcnt.

Diff Detail

Event Timeline

ruiling created this revision.Dec 22 2022, 2:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 22 2022, 2:35 AM
ruiling requested review of this revision.Dec 22 2022, 2:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 22 2022, 2:35 AM
foad added inline comments.Dec 22 2022, 3:40 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1181

Should we do the same thing for _d16 DS, FLAT, SCRATCH and GLOBAL instructions?

arsenm added inline comments.Dec 22 2022, 4:08 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1184

Don't see the point of the "Buf" part of the name, it's just D16

ruiling added inline comments.Dec 22 2022, 4:15 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1181

For _d16 DS/FLAT, skipping the tied source does not change the code generation. As we always need a s_waitcnt for two successive _d16 ds/flat load because they may return out of order. So, I would rather not handle them here. I am actually setting the D16Buf bit for global/scratch load in parent change. Should I change D16Buf to something else? I will update the comment.

foad added a comment.Dec 22 2022, 4:22 AM

This needs a test case to show the effect it has on _d16 buffer instructions.

ruiling added inline comments.Dec 22 2022, 4:24 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1184

D16 was renamed into D16Buf in D47434. I think it should be ok to change back? The only one user in cpp source code is: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp#L8048 I don't know what it is doing here.

arsenm added inline comments.Dec 22 2022, 4:55 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1181

Since when can't they return out of order?

Could you demonstrate a change by combining a ds load with a global load? The two halves don't need to access the same address space

ruiling added inline comments.Dec 22 2022, 5:49 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1181

Since when can't they return out of order?

Sorry I don't know what you are asking. Could you help explain?

Could you demonstrate a change by combining a ds load with a global load? The two halves don't need to access the same address space

A ds load followed by a global_load to the same VGPR will always have a WAW dependency. The change here is to remove a false read-after-write dependency.

ruiling updated this revision to Diff 487173.Jan 8 2023, 5:30 AM

Update the change to use a new bit in TSFlags

ruiling retitled this revision from SIInsertWait: Skip tied source of d16 buffer instruction to AMDGPU/SIInsertWait: Skip dummy tied source.Jan 8 2023, 5:35 AM
ruiling edited the summary of this revision. (Show Details)
foad added a comment.Jan 9 2023, 2:19 AM

I like this but it would be nice to have tests showing the effect on DS, FLAT and SCRATCH instructions.

luke added a subscriber: luke.Jan 9 2023, 7:16 AM

I like this but it would be nice to have tests showing the effect on DS, FLAT and SCRATCH instructions.

I have pointed out some tests that will hit the code path added in this patch, but for DS and FLAT instructions, there will be no change in generated assembly as there is still a Write-After-Write dependency there.

llvm/test/CodeGen/AMDGPU/chain-hi-to-lo.ll
25

This shows the effect on scratch_load_d16.

204

This is the test for ds_read_xxx_d16, but the patch won't change the generated code because there is always a WAW dependency.

364

This will test for the flat_load_xxx_d16, but like ds_load, there is no change in the generated code.

foad accepted this revision.Jan 10 2023, 2:41 AM

Thanks!

This revision is now accepted and ready to land.Jan 10 2023, 2:41 AM
This revision was landed with ongoing or failed builds.Jan 10 2023, 6:00 PM
This revision was automatically updated to reflect the committed changes.