VALU use of an SGPR (pair) as mask followed by SALU write to the
same SGPR can cause incorrect execution of subsequent SALU reads
of the SGPR.
Details
- Reviewers
foad rampitec - Commits
- rGa35013bec68d: [AMDGPU][GFX11] Mitigate VALU mask write hazard
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/AMDGPU/AMDGPU.td | ||
---|---|---|
707 ↗ | (On Diff #461106) | You do not really need a feature bit for this, just a subtarget check. |
llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp | ||
2722 | You seem to start from the instruction in the middle and have no check for the 3rd instruction. This makes the w/a overrestrictive as far as I understand. This maybe fine, but probably deserves a comment. | |
2728 | A mask can only be a 64 bit register in wave64. You could check the size and bail rather then listing 32 bit registers. | |
2735 | Isn't that less error prone to just check for an implicit vcc use? I assume it shall not be true unless it reads a mask. | |
2744 | As far as I understand this can only be a VCC too. | |
2807 | Call runOnInstruction() on this new instruction. |
llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp | ||
---|---|---|
2722 | Yes, this intentionally does not do a forward check to look for hazard expiry. | |
2728 | Mask can only be 64 bit, but SALU write to either 32 bit half is a hazard. | |
2735 | I am not sure I understand? | |
2744 | As above, my understanding is partial overlap is also a hazard. |
llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp | ||
---|---|---|
2722 | Thanks for adding the comment. In fact forward search would not work, we do not have such mechanism. There are cases of 3 instruction hazards where we always start from the last instruction, but that would be an overkill here. | |
2735 | I mean instead of listing instruction opcodes it can use a general check is there is an implicit $vcc operand. But OK, if you think this list is comprehensive. |
llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp | ||
---|---|---|
2735 | The _e32 versions have an implicit $vcc, but the _e64 versions do not. There is still a small problem that the list of instructions here is missing exotic forms like V_CNDMASK_B32_dpp. |
- Add DPP forms and associated tests
I was hanging on to this while I waited for some performance data.
After reviewing this closely, I think I need to switch the workaround mechanism.
Sure, although...
There is still a small problem that the list of instructions here is missing exotic forms like V_CNDMASK_B32_dpp.
... this still bothers me.
The ones I can find at the moment are:
V_ADDC_U32_sdwa
V_CNDMASK_B16_sdwa
V_CNDMASK_B32_sdwa
V_SUBBREV_U32_sdwa
V_SUBB_U32_sdwa
You seem to start from the instruction in the middle and have no check for the 3rd instruction. This makes the w/a overrestrictive as far as I understand. This maybe fine, but probably deserves a comment.