For D16 memory load instructions, the hardware usually only write to half
of the 32bit register, but we define the destination register using
32bit register for the MachineIR instruction. Without the extra tied
source register, LLVM framework will think previous write to the other
half of the register being dead. This is because by using 32bit register
as the destination register, LLVM will think the instruction will always
overwrite the whole 32bit register. By adding the extra tied source,
LLVM will think we are reading the register, so previous write to the
register will not be dead. This dummy tied source is introducing
unnecessary read-after-write dependency. The change here is to bypass the
tied source that can be skipped, thus avoiding an unnecessary s_waitcnt.
Details
- Reviewers
arsenm foad - Commits
- rG9119d9bfcef4: AMDGPU/SIInsertWait: Skip dummy tied source
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
1181 | Should we do the same thing for _d16 DS, FLAT, SCRATCH and GLOBAL instructions? |
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
1181 | For _d16 DS/FLAT, skipping the tied source does not change the code generation. As we always need a s_waitcnt for two successive _d16 ds/flat load because they may return out of order. So, I would rather not handle them here. I am actually setting the D16Buf bit for global/scratch load in parent change. Should I change D16Buf to something else? I will update the comment. |
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
1184 | D16 was renamed into D16Buf in D47434. I think it should be ok to change back? The only one user in cpp source code is: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp#L8048 I don't know what it is doing here. |
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
1181 | Since when can't they return out of order? Could you demonstrate a change by combining a ds load with a global load? The two halves don't need to access the same address space |
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
1181 |
Sorry I don't know what you are asking. Could you help explain?
A ds load followed by a global_load to the same VGPR will always have a WAW dependency. The change here is to remove a false read-after-write dependency. |
I like this but it would be nice to have tests showing the effect on DS, FLAT and SCRATCH instructions.
I have pointed out some tests that will hit the code path added in this patch, but for DS and FLAT instructions, there will be no change in generated assembly as there is still a Write-After-Write dependency there.
llvm/test/CodeGen/AMDGPU/chain-hi-to-lo.ll | ||
---|---|---|
25 | This shows the effect on scratch_load_d16. | |
204 | This is the test for ds_read_xxx_d16, but the patch won't change the generated code because there is always a WAW dependency. | |
364 | This will test for the flat_load_xxx_d16, but like ds_load, there is no change in the generated code. |
Should we do the same thing for _d16 DS, FLAT, SCRATCH and GLOBAL instructions?