MUBUF and FLAT LDS DMA operations need a wait on vmcnt before LDS written
can be accessed. A load from LDS to VMEM does not need a wait.
Details
- Reviewers
arsenm foad - Commits
- rG51e02409f022: [AMDGPU] Produce waitcounts for LDS DMA
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
502–503 | isVALU is redundant with isMUBUF || isFLAT? |
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
502–503 | MUBUF and FLAT have VALU = 0 unless it is LDS DMA, which are both memory and VALU instructions. The only of that kind. |
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
655 | Swap these checks, Inst.mayStore() && .. |
I also feel like EXTRA_VGPR_LDS was made specifically for this scenario, when we have no VGPR result to track dependency. And it was unused and dead code. Unfortunately we practically do not have tests for this pass and this place is certainly not covered. If I remove isDS() call nothing fails, but probably it should not and it was dead. I cannot prove it though as I have no idea what was originally behind it. But that fits so well, so I guess it was DMA.
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
1100 | Was this a pre-existing bug, even if you don't use DMA instructions? E.g. for DS_STORE followed by FLAT_STORE were we missing a wait on expcnt to guarantee WAW order? |
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
1100 | I thought so too, but it didn't get it in this scenario before and it doesn't get it after because there is no pending VM event. I just believe if anything it shall be vmcnt, not expcnt, although I do not see that documented. For DMA the reason for vmcnt is that we shall wait for vmem to be read, although a wait on lgkmt is not needed (does not seem to be properly documented either). For a pure ds_write it is likely not needed because we don't touch VMEM. |
A side question: is that possible to facilitate MIR syntax to use symbolic syntax for waitcounts instead of a raw number?
Quite literally I am using -start-before instead of -run-pass to see what is really produced while testing.
You would have to add something new to support it. The closest existing thing is probably the comments inlineasm uses for the register class encoded by magic constants
isVALU is redundant with isMUBUF || isFLAT?