This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Suppress redundant waitcnt instrs
ClosedPublic

Authored by msearles on Feb 2 2018, 10:16 AM.

Details

Summary
  1. Run the memory legalizer prior to the waitcnt pass; keep the policy that the waitcnt pass does not remove any waitcnts within the incoming IR.
  2. The waitcnt pass doesn't (yet) track waitcnts that exist prior to the waitcnt pass (it just skips over them); because the waitcnt pass is ignorant of them, it may insert a redundant waitcnt. To avoid this, check the prev instr. If it and the to-be-inserted waitcnt are the same, suppress the insertion. We keep the existing waitcnt under the assumption that whomever, e.g., the memory legalizer, inserted it knows what they were doing.
  3. Follow-on work: teach the waitcnt pass to record the pre-existing waitcnts for better waitcnt production.

Diff Detail

Repository
rL LLVM

Event Timeline

msearles created this revision.Feb 2 2018, 10:16 AM
msearles added a project: Restricted Project.
msearles added a subscriber: llvm-commits.
t-tye added a comment.Feb 2 2018, 10:27 AM

A concern is that you do not want to remove an original waitcnt when inserting a new one, as the pass may iterate and subsequently decide not to add a waitcnt there, but will have eliminated a waitcnt needed to implement the memory model. Is that an issue?

rampitec added inline comments.Feb 2 2018, 10:45 AM
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1567 ↗(On Diff #132622)

Probably we need not to check for identity, but create a single strongest wait combined from two.

For example, one waits for vmcnt(2), another for vmcnt(3) - keep vmcnt(2).
One waits for lgkmcnt, another does not - wait for lgkmcnt.

Generalizing: produce one wait with:

s_waitcnt vmcnt(min(vmcnts[])), expcnt(min(expcnts[])), lgkmcnt(min(lgkmcnts[]))

A concern is that you do not want to remove an original waitcnt when inserting a new one, as the pass may iterate and subsequently decide not to add a waitcnt there, but will have eliminated a waitcnt needed to implement the memory model. Is that an issue?

OK, since we need to ensure that the existing waitcnts are not touched, I can mod the patch so that it doesn't remove the existing waitcnt and suppresses the to-be-inserted-by-the-waitcnt-pass waitcnt.

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1567 ↗(On Diff #132622)

Agreed in principle, however, I have a sense that the common case is the redundant waitcnt and, moreover, the common case is redundant because of interaction with the memory legalizer. Per Tony, the existing waitcnt instrs should be left alone; it might get messy to attempt to create the strongest waitcnt instr of an existing waitcnt and a waitcnt pass waitcnt.

rampitec added inline comments.Feb 3 2018, 10:44 AM
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1567 ↗(On Diff #132622)

We can keep it in mind for the future. I believe a more common case is wait to 0 inserted by the legalizer and wait to a higher count inserted here, so they not necessarily equal.

t-tye added a comment.Feb 3 2018, 10:47 AM

Can the pass update its internal state while walking the control flow to factor in the consequences of the original waitcnts? That way a decision as to whether a waitcnt is required will take into account these original waitcnts. This means the benefit is obtained regardless of whether the waitcnts are adjacent or separated (even in different BBs).

It seems that a separate pass could be done after the final waitcnts have been decided to collapse adjacent waitncts into a single one if possible. Or perhaps it would be better to postpone inserting the waitcnts until after the dataflow iteration has found a fixed point, at which time any original/deduced waitcnts can be merger if adjacent.

Can the pass update its internal state while walking the control flow to factor in the consequences of the original waitcnts? That way a decision as to whether a waitcnt is required will take into account these original waitcnts. This means the benefit is obtained regardless of whether the waitcnts are adjacent or separated (even in different BBs).

It seems that a separate pass could be done after the final waitcnts have been decided to collapse adjacent waitncts into a single one if possible. Or perhaps it would be better to postpone inserting the waitcnts until after the dataflow iteration has found a fixed point, at which time any original/deduced waitcnts can be merger if adjacent.

Yes, all of that can be done; it's been on the TODO list for several months; see TODO at line 1531; my intent for this patch was to grab the low-hanging fruit re: redundant waitcnt instrs.

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1567 ↗(On Diff #132622)

Agreed.

msearles planned changes to this revision.Feb 4 2018, 12:03 PM
msearles updated this revision to Diff 133104.Feb 6 2018, 4:08 PM
msearles edited the summary of this revision. (Show Details)
  • Don't remove existing waitcnt instrs; if a redundant is to be inserted, keep the existing waitcnt and don't insert the duplicate.
  • Fix mir test
rampitec added inline comments.Feb 6 2018, 5:00 PM
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1533 ↗(On Diff #133104)

!TrackedWaitcntSet.count(&Inst) I guess.

msearles updated this revision to Diff 133121.Feb 6 2018, 5:56 PM

Adjust per reviewer comment

msearles marked an inline comment as done.Feb 6 2018, 5:56 PM
msearles added inline comments.
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1533 ↗(On Diff #133104)

Done; made equivalent changes in several other places.

rampitec accepted this revision.Feb 6 2018, 5:59 PM

LGTM

This revision is now accepted and ready to land.Feb 6 2018, 5:59 PM
This revision was automatically updated to reflect the committed changes.
msearles marked an inline comment as done.