- Run the memory legalizer prior to the waitcnt pass; keep the policy that the waitcnt pass does not remove any waitcnts within the incoming IR.
- The waitcnt pass doesn't (yet) track waitcnts that exist prior to the waitcnt pass (it just skips over them); because the waitcnt pass is ignorant of them, it may insert a redundant waitcnt. To avoid this, check the prev instr. If it and the to-be-inserted waitcnt are the same, suppress the insertion. We keep the existing waitcnt under the assumption that whomever, e.g., the memory legalizer, inserted it knows what they were doing.
- Follow-on work: teach the waitcnt pass to record the pre-existing waitcnts for better waitcnt production.
Details
Diff Detail
Event Timeline
A concern is that you do not want to remove an original waitcnt when inserting a new one, as the pass may iterate and subsequently decide not to add a waitcnt there, but will have eliminated a waitcnt needed to implement the memory model. Is that an issue?
lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
1567 | Probably we need not to check for identity, but create a single strongest wait combined from two. For example, one waits for vmcnt(2), another for vmcnt(3) - keep vmcnt(2). Generalizing: produce one wait with: s_waitcnt vmcnt(min(vmcnts[])), expcnt(min(expcnts[])), lgkmcnt(min(lgkmcnts[])) |
OK, since we need to ensure that the existing waitcnts are not touched, I can mod the patch so that it doesn't remove the existing waitcnt and suppresses the to-be-inserted-by-the-waitcnt-pass waitcnt.
lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
1567 | Agreed in principle, however, I have a sense that the common case is the redundant waitcnt and, moreover, the common case is redundant because of interaction with the memory legalizer. Per Tony, the existing waitcnt instrs should be left alone; it might get messy to attempt to create the strongest waitcnt instr of an existing waitcnt and a waitcnt pass waitcnt. |
lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
1567 | We can keep it in mind for the future. I believe a more common case is wait to 0 inserted by the legalizer and wait to a higher count inserted here, so they not necessarily equal. |
Can the pass update its internal state while walking the control flow to factor in the consequences of the original waitcnts? That way a decision as to whether a waitcnt is required will take into account these original waitcnts. This means the benefit is obtained regardless of whether the waitcnts are adjacent or separated (even in different BBs).
It seems that a separate pass could be done after the final waitcnts have been decided to collapse adjacent waitncts into a single one if possible. Or perhaps it would be better to postpone inserting the waitcnts until after the dataflow iteration has found a fixed point, at which time any original/deduced waitcnts can be merger if adjacent.
Yes, all of that can be done; it's been on the TODO list for several months; see TODO at line 1531; my intent for this patch was to grab the low-hanging fruit re: redundant waitcnt instrs.
lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
1567 | Agreed. |
- Don't remove existing waitcnt instrs; if a redundant is to be inserted, keep the existing waitcnt and don't insert the duplicate.
- Fix mir test
lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
1533 | !TrackedWaitcntSet.count(&Inst) I guess. |
lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
1533 | Done; made equivalent changes in several other places. |
!TrackedWaitcntSet.count(&Inst) I guess.