This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Flush the vmcnt counter in loop preheader when necessary
ClosedPublic

Authored by bsaleil on Dec 14 2021, 12:03 PM.

Details

Summary

s_waitcnt vmcnt instructions are currently generated in a loop body before using a value loaded
outside of the loop. In some cases, it is better to flush the vmcnt counter in the loop preheader
before entering the loop body. This patch flushes the counter in the following two situations:

  1. (pre-GFX10 only) The loop contains no load, contains at least one store, and uses a vgpr loaded outside of the loop:
v0 = load(...)
loop {
  ...
  use(v0)
  store(...)
}
  2. The loop uses a vgpr loaded outside of the loop and contains at least one load whose loaded value is unused inside the loop. (On pre-GFX10 targets, the loop must also contain no store):
v0 = load(...)
loop {
  ...
  use(v0)
  v1 = load(...)
}
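
As an illustration only, the decision described above can be restated as the following sketch. The function and parameter names are made up, and the real pass derives these facts while scanning the loop's instructions, so this is not the actual SIInsertWaitcnts code:

// Hypothetical restatement of the two cases in the summary.
bool shouldFlushVmCntInPreheader(bool UsesVgprLoadedOutsideLoop,
                                 bool LoopHasVmemLoad, bool LoopHasVmemStore,
                                 bool LoopHasLoadWithUnusedResult,
                                 bool IsPreGFX10) {
  if (!UsesVgprLoadedOutsideLoop)
    return false;
  // Case 1 (pre-GFX10 only): the loop contains no load and at least one store.
  if (IsPreGFX10 && !LoopHasVmemLoad && LoopHasVmemStore)
    return true;
  // Case 2: the loop contains a load whose result is unused inside the loop;
  // on pre-GFX10 targets the loop must additionally contain no store.
  return LoopHasLoadWithUnusedResult && (!IsPreGFX10 || !LoopHasVmemStore);
}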

Diff Detail

Event Timeline

bsaleil created this revision.Dec 14 2021, 12:03 PM
bsaleil requested review of this revision.Dec 14 2021, 12:03 PM
foad added a comment.Dec 15 2021, 2:35 AM

In the commit message:

  • "before using a loaded value" -> "before using a value that was loaded outside the loop"?
  • typo "GFX0"
  • "If" should be "if" in middle of sentence
  • maybe briefly mention why this particular problem doesn't happen on GFX10?

Please run "git clang-format @^".

I'm still reading the patch.

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
827

Might be cleaner to change this function to take a BasicBlock::iterator instead of a MachineInstr*, so you could naturally pass in MBB->end() without treating it as a special case?

foad added a comment.Dec 15 2021, 4:25 AM

Here's the big question: is it REALLY necessary for shouldFlush to examine every instruction in the loop? (This is quite expensive, and could I think cause quadratic behaviour if you have lots of nested loops.) Or could we make the same decision just by examining the WaitcntBrackets at the end of the preheader compared to the WaitcntBrackets coming from the loop backedge(s)? I think this would be a much cleaner approach if it is possible, even if we have to track a bit more state in WaitcntBrackets to make it work.

Note that the loop over basic blocks in runOnMachineFunction currently looks like:

for each block:
  get saved state for this block
  process block
  merge new state into saved state for each successor

But if necessary we could change it to:

for each block:
  merge saved state from each predecessor
  process block
  save state for this block
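
As a minimal, self-contained sketch of these two iteration shapes (stand-in types only; the real pass iterates over MachineBasicBlocks with WaitcntBrackets and repeats until a fixed point is reached):

#include <algorithm>
#include <map>
#include <vector>

struct State { unsigned PendingVmem = 0; };           // stand-in for WaitcntBrackets
State merge(State A, State B) {
  State R;
  R.PendingVmem = std::max(A.PendingVmem, B.PendingVmem);
  return R;
}
State process(int /*Block*/, State In) { return In; } // placeholder transfer function

// Current shape: after processing a block, push its out-state into each successor's saved state.
void pushStyle(const std::vector<std::vector<int>> &Succs, std::map<int, State> &Saved) {
  for (int B = 0, E = (int)Succs.size(); B != E; ++B) {
    State Out = process(B, Saved[B]);
    for (int S : Succs[B])
      Saved[S] = merge(Saved[S], Out);
  }
}

// Rotated shape: before processing a block, pull and merge the saved states of its predecessors.
void pullStyle(const std::vector<std::vector<int>> &Preds, std::map<int, State> &Saved) {
  for (int B = 0, E = (int)Preds.size(); B != E; ++B) {
    State In;
    for (int P : Preds[B])
      In = merge(In, Saved[P]);
    Saved[B] = process(B, In);
  }
}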

I quite agree with Jay's point about looking at waitcnt brackets. In addition, this isn't a vmcnt/vscnt-specific issue. Consider a loop like this:

y = load(...);
x = load(...);
loop {
  // (A)
  use(x);
  ... lots of code ...
  // (B)
  use(y);
  y = load(...)
}

The current algorithm inserts an s_waitcnt vmcnt(0) in (A). It would be better to insert an s_waitcnt vmcnt(0) in the preheader and at (B) instead.

It's not obvious to me what the best approach is here.

A brute force approach is to rotate the overall processing loop in the way Jay suggests. At loop headers, instead of merging all predecessors immediately we separately merge incoming edges into the loop and latch edges, and then process the basic block for both brackets in parallel and compare. Make a decision at the end of the block or when it becomes clear that merging all predecessors with a "flush" in the preheader would force insertion of an overly aggressive s_waitcnt. Except that this approach could be fooled by control flow inside of the loop, so it's certainly not perfect.

An alternative idea is to record, for each GPR, whether it is live and if so, how many instructions (clocks?) until its first use. Then compare that against the scores in the brackets to figure out whether merging the preheader bracket would cause earlier s_waitcnt instructions. This determination could account for control flow internal to the loop, but of course it then becomes yet another fixed-point iteration, so...

foad added a comment.EditedDec 20 2021, 8:41 AM

Can't we always set the state at the start of the loop to the merged state from loop backedges *only*, and then make one final pass over the function to insert extra waits in preheaders to reconcile the state outside the loop with the state that was assumed for the start of the loop? (Sorry I know that is a bit simplistic and hand-wavy.)

That makes sense, thanks for the comments. So maybe I can start by trying to implement the idea of rotating the processing loop and making a decision from the predecessors, and see if that solution is enough to fix the most common cases we observed. Then, if control flow inside loops is really an issue with that solution, I can try the other solution based on GPR liveness, but I also don't like the idea of having another fixed-point loop that could significantly degrade compilation time.

That makes sense, thanks for the comments. So maybe I can start by trying to implement the idea of rotating the processing loop

But in general loops should have been rotated already? We shouldn't have to re-optimize loops at this point

@arsenm the idea is to rotate the processing loop (runOnMachineFunction) of the SIInsertWaitcnts pass, so we can make decisions from predecessors, not the actual machine IR loop containing the waitcnt:

Note that the loop over basic blocks in runOnMachineFunction currently looks like:

for each block:
  get saved state for this block
  process block
  merge new state into saved state for each successor

But if necessary we could change it to:

for each block:
  merge saved state from each predecessor
  process block
  save state for this block
bsaleil updated this revision to Diff 409063.EditedFeb 15 2022, 2:36 PM

Refactoring of the pass. We compute the brackets for both the flushed and non-flushed versions of each outer loop, until we finish visiting the loop or until we decide it is not worth flushing in the preheader. Instead of generating the waitcnts, we add them to two separate lists (for the flushed and non-flushed versions). After all the blocks are visited, we generate one of the two lists depending on the decision we made.

For now, the only case in which we may decide to use the flushed version is the original case of this patch, which is when a loop contains no load and at least one store on GFX9:

v0 = load(...)
loop {
  ...
  use(v0)
  store(...)
}

With the refactoring done, I'm planning to optimize one GFX10 case that we observed in at least one game, which is when a loop only loads values that are not used in the loop:

v0 = load(...)
loop {
  ...
  use(v0)
  v1 = load(...)
}

This will be added in a subsequent patch.

bsaleil edited the summary of this revision. (Show Details)Feb 15 2022, 2:42 PM
bsaleil edited the summary of this revision. (Show Details)
foad added a comment.Feb 17 2022, 6:40 AM

Refactoring of the pass. We compute the brackets for both the flushed and non-flushed versions of each outer loop

I am nervous about the compile time impact of this, and about adding the extra code to implement this. (The pass is pretty complicated already.) Is the refactoring really required for the GFX9 improvement, or only for the planned GFX10 improvement?

I agree that the code is a lot more complicated with this patch than it was with the previous patch. I think this improvement cannot be implemented simply by looking at the WaitcntBrackets without requiring all this refactoring.
So this means we have two choices:

  1. Original patch: Before inserting the waitcnts, visit the loop instructions in a single pass (no fixed point), either visiting all of them or stopping as soon as we find an instruction that invalidates the optimization. Depending on what we found, flush in preheaders or not before generating the waitcnts. This is a lot simpler.
  2. New patch with refactoring: No need to visit the instructions before inserting the waitcnt, but we need to compute two brackets for each block, and keep two waitcnt lists until we decide which one we want to generate. This is a lot more complicated.

The original motivation for working on 2. was the concern about the compile-time impact of 1., but because we need to compute two brackets for each block, I actually don't think that 2. is more efficient. Both the GFX9 and GFX10 improvements can be implemented with either 1. or 2., so for the sake of simplicity, I think I should revert to the original patch.

bsaleil updated this revision to Diff 412764.Mar 3 2022, 10:50 AM
bsaleil retitled this revision from [AMDGPU] Hoist waitcnt out of loops when they unnecessarily wait for stores to [AMDGPU] Flush the vmcnt counter in loop preheader when necessary.
bsaleil edited the summary of this revision. (Show Details)

@foad I updated the patch. It is a lot simpler than the previous one, and it fixes both the GFX9 and GFX10 cases. But it may still have a significant impact on compile time; I cannot think of another way to do this without visiting all the instructions in the loops :(

Herald added a project: Restricted Project. · View Herald TranscriptMar 3 2022, 10:50 AM
foad added inline comments.Mar 4 2022, 8:42 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1667

I think the definition of a preheader is that it has a single successor, which is the header. So you should not have to search through all loops in the function here, just look up the loop for MBB's successor and then walk up its parents.
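
A rough sketch of that lookup, assuming the pass already holds a MachineLoopInfo; the helper name here is made up for illustration and is not part of the patch:

#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineLoopInfo.h"
using namespace llvm;

// Return the loop for which MBB is the preheader, or null if there is none.
static MachineLoop *getLoopForPreheader(const MachineBasicBlock &MBB,
                                        const MachineLoopInfo &MLI) {
  if (MBB.succ_size() != 1)
    return nullptr;
  MachineBasicBlock *Succ = *MBB.succ_begin();
  // Start from the innermost loop containing the successor and walk up its parents.
  for (MachineLoop *L = MLI.getLoopFor(Succ); L; L = L->getParentLoop())
    if (L->getLoopPreheader() == &MBB)
      return L;
  return nullptr;
}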

bsaleil updated this revision to Diff 414459.Mar 10 2022, 12:24 PM

Rebased and addressed @foad's comment.

bsaleil marked an inline comment as done.Mar 10 2022, 12:25 PM
piotr added a subscriber: piotr.Mar 22 2022, 12:52 AM

Do you have any data on the compile time impact?

@piotr I ran compile time testing and the patch has no significant impact. Worst case is 1.1% and is the only one above 1%. Average is below 0.1%.

@piotr I ran compile time testing and the patch has no significant impact. Worst case is 1.1% and is the only one above 1%. Average is below 0.1%.

Nice, thanks for testing.

foad added inline comments.Apr 1 2022, 2:37 AM
llvm/test/CodeGen/AMDGPU/waitcnt-vmcnt-loop.mir
5

I think you can remove this whole IR section.

57

It's pretty unusual to use the -LABEL suffix for an actual label. Normally it is only used for the start of each function. Did you have a special reason for doing it this way?

foad added inline comments.Apr 1 2022, 2:58 AM
llvm/test/CodeGen/AMDGPU/waitcnt-vmcnt-loop.mir
74

You could remove "renamable" everywhere, just to reduce clutter.

92

I'd appreciate a very brief comment with each function, saying what it's for and whether the waitcnt should be hoisted or not. E.g. for this one "same as before but bb.0 has no terminator" (it took me a while to spot the difference).

276

I am confused by this test case. Why does GFX9 need a waitcnt inside the loop?

396

Does "_interval" here refer to the RegInterval stuff in SIInsertWaitcnts? Maybe use "_reginterval" to make this clearer.

foad added inline comments.Apr 1 2022, 5:02 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
960–961

Document the new FlushVmCnt argument here.

1653

It is a shame that you have to implement this in two places, for blocks with and without terminators. I'm not sure if there is a better way. Maybe generateWaitcntInstBefore could be changed to take an iterator (which is allowed to be end()) instead of MI, so you would not need the new function generateWaitcntBlockEnd. But that would be quite invasive.

1673

Could maybe simplify this by adding MachineBasicBlock::getUniqueSuccessor, like we have for IR BasicBlock. (I think a preheader has *exactly* one successor, correct?)

1681–1682

I don't quite understand the logic here. Why do you only check whether MBB is the preheader of the outermost loop? Could it not be the preheader of one of the inner loops?

foad added a comment.Apr 1 2022, 5:25 AM

Overall I think this looks really good. None of my inline comments are blockers.

One general observation: the waitcnt that you insert in a preheader is always vmcnt(0), but there are cases where vmcnt(1) or higher could be useful, e.g.:

v0 = vmem load ...
v1 = vmem load ...
loop:
  use of v0
  v2 = vmem load ...
  (no use of v1 or v2 inside loop)

Could you at least add a test case like that, even if you don't want to implement the vmcnt(1) in the preheader yet?

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1692

From the description of cases (1) and (2), neither of them seems to include this case for GFX9:

v0 = vmem load ...
loop:
  use of v0
  vmem store ...
  v1 = vmem load ...
  no use of v1 in loop
1695

Does this case (2) also apply to other kinds of loads, e.g. for SMEM loads would it be better to pull the lgkmcnt(0) out of the loop?

1701

Could just assert that there is a preheader? (But do you need to?)

1710

We don't usually mark loop variables as const - just remove this?

1753

Upper case F in Flush.

bsaleil updated this revision to Diff 432383.May 26 2022, 2:21 PM

Add support for nested loops, and fix the case where we have no terminator but pre-existing s_waitcnt instructions. I still need to address @foad's comments.

bsaleil updated this revision to Diff 432985.May 30 2022, 1:52 PM

Addressed review comments and fixed debug build.

bsaleil marked 14 inline comments as done.May 30 2022, 1:56 PM
bsaleil added inline comments.May 30 2022, 2:07 PM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1653

Yes, unfortunately I also think changing that would be too invasive. generateWaitcntInstBefore relies a lot on the fact that MI is a valid instruction.

1681–1682

I think it was code from the other implementation. I removed it; now inner loops are also optimized.

1692

I removed the restriction that, for case 2 on GFX9, we should not have any store. I'm not sure why I added that in the first place. So this case is now supported.

llvm/test/CodeGen/AMDGPU/waitcnt-vmcnt-loop.mir
57

This allows me to match a specific s_waitcnt in a preheader or loop body, to avoid inadvertently matching the wrong waitcnt, e.g. matching a loop-body waitcnt that was actually expected in the preheader.

nhaehnle added inline comments.Jun 1 2022, 6:24 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1653

generateWaitcntInstBefore has two distinct halves: the first half determines the counts to be waited for based on MI, and the second half (I would say starting at the comment // Early-out if no wait is indicated.) is agnostic to *how* the counts were obtained.

It seems to me that it could be fairly natural to split up the function and use the bottom half of it on both paths.

bsaleil added inline comments.Jun 1 2022, 9:37 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1653

I can't find the comment // Early-out if no wait is indicated in that function, but I assume the second part you mention starts from if (OldWaitcntInstr) { ....
To extract that second half into a separate function, I'd need to pass an iterator for the call to applyPreexistingWaitcnt and the two calls to BuildMI. The problem is that I cannot really pass an iterator here because it may be invalidated by all these calls. I guess I could still use a lambda or something to retrieve the iterator each time we need it, but I think this may be a bit overkill.

nhaehnle added inline comments.Jun 1 2022, 1:34 PM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1653

I'm afraid I don't quite follow. MachineInstr iterators aren't invalidated by BuildMI?

foad added inline comments.Jun 8 2022, 3:05 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
831–832

Not related to your patch, but this could probably be something like:

for (auto &I : make_early_inc_range(make_range(OldWaitcntInstr, It)))
1653

I think it would be OK to leave this cleanup for a follow-up NFC patch.

bsaleil updated this revision to Diff 438835.Jun 21 2022, 2:23 PM

Addressed review comments: Use ranges in for loop and split generateWaitcntInstBefore to reuse its second half for generateWaitcntBlockEnd.
Also changed applyPreexistingWaitcnt to take a MachineBasicBlock::instr_iterator instead of a MachineBasicBlock::iterator to match the caller.

bsaleil marked 4 inline comments as done.Jun 21 2022, 2:24 PM
foad added inline comments.Jun 22 2022, 2:35 AM
llvm/lib/CodeGen/MachineBasicBlock.cpp
919

MachineBasicBlock successors are stored in a vector, so you can write this more simply, e.g.: return Successors.size() == 1 ? Successors[0] : nullptr;

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
448

Why can't you just pass in an iterator? As Nicolai said, they should not get invalidated.

1255

Maybe use MachineBasicBlock::findDebugLoc or similar?

1563

Don't need these parentheses.

bsaleil added inline comments.Jun 22 2022, 6:42 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
448

Couldn't the iterator be invalidated by the call to applyPreexistingWaitcnt, since we call eraseFromParent() in it?

foad added inline comments.Jun 22 2022, 6:50 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
448

If it erases the instruction that It points to then there's a serious problem, and using a callback will not solve it.

If it erases some other instruction then that's OK, the iterator should still be valid.

Instructions are in a doubly linked list and the iterator is more-or-less just a pointer to an instruction.
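
As a tiny standalone illustration of the same property, using std::list as a stand-in for the machine-instruction list (purely for intuition, not LLVM code):

#include <cassert>
#include <iterator>
#include <list>

int main() {
  std::list<int> Insts = {1, 2, 3};
  auto It = std::next(Insts.begin()); // iterator pointing at the element 2
  Insts.erase(Insts.begin());         // erase a *different* element
  assert(*It == 2);                   // It is still valid, like an ilist iterator would be
  return 0;
}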

bsaleil updated this revision to Diff 439023.Jun 22 2022, 7:49 AM

Address review comments

bsaleil marked 5 inline comments as done.Jun 22 2022, 7:49 AM
foad accepted this revision.Jun 22 2022, 8:48 AM

LGTM, thanks!

This revision is now accepted and ready to land.Jun 22 2022, 8:48 AM
This revision was landed with ongoing or failed builds.Jun 23 2022, 7:53 AM
This revision was automatically updated to reflect the committed changes.