This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Make hazard recognizer aware of maximum clause sizes
Needs ReviewPublic

Authored by arsenm on Nov 28 2017, 1:40 PM.

Details

Summary

Supposedly clauses can be formed up until vmcnt/lgkmcnt are
saturated.

The limit for vmcnt on gfx9 is pretty big, so this requires
looking ahead 64 instructions. I don't know how much of an
impact this has on compile time. Reducing it should be
conservatively correct at the cost of extra nops inserted
into unreasonably large clauses.

Another issue is this doesn't account for how some
instructions increase the counter by more than 1,
but this should be conservatively correct.

Diff Detail

Event Timeline

arsenm created this revision.Nov 28 2017, 1:40 PM
This revision is now accepted and ready to land.Nov 28 2017, 3:27 PM
t-tye added inline comments.Nov 28 2017, 8:35 PM
lib/Target/AMDGPU/GCNHazardRecognizer.cpp
366–368 ↗(On Diff #124634)

We must also not put stores in the same clause if they may write the same address. Putting stores in their own clause will trivially satisfy this requirement.

How does this code ensure multiple consecutive stores are not put in the same clause? Seems it will only break the clause if a read is before a store. How is a store before a read broken into a separate clause?

This code seems to be per single basic block? What ensures that a clause cannot straddle across basic blocks (due to fall through)? Since no checking is being done across basic blocks the conservative action is to break clauses that end a basic block.

374 ↗(On Diff #124634)

Would it be better to explicitly break the clause, as a context save restore may bring a wave back in mid clause execution and so the counters may saturate at a different instruction. By breaking the clause explicitly this will not be an issue.

arsenm added inline comments.Nov 30 2017, 4:29 PM
lib/Target/AMDGPU/GCNHazardRecognizer.cpp
366–368 ↗(On Diff #124634)

This was already handled, the clause is broken if it's a store

374 ↗(On Diff #124634)

I don't understand what the issue is here. The counters stop working correctly?

t-tye added inline comments.Nov 30 2017, 7:34 PM
lib/Target/AMDGPU/GCNHazardRecognizer.cpp
374 ↗(On Diff #124634)

When a context save happens the wave could be in the middle of a clause. When the context restore happens it behaves as if the clause started from the point of the context save (since all previous memory operations are waited to complete before the context save, and so the outstanding waitcnt counters start at 0 on the context restore). So the counters may saturate further down the clause than this code predicted.

XNACK replay can cause a similar situation since during replay the clause is being effectively started further in.

So my suggestion is that if the code is only scanning over a certain list of instructions to ensure the clause rules, it should explicitly break the clause at the point it scans up to. Otherwise a context save/restore or XNACK replay could effectively execute a clause that spans over instructions that have not been checked for hazards.

In practice it seems unlikely that clauses will reach the hardware counter limit, so this will likely not result in reduced performance. However, it ensures correctness in the unlikely case that they do.

So the counters are still working correctly, they just may resume from 0 mid clause if the hardware does a context save/restore or XNACK replay which implicitly waits for the outstanding memory operations to complete.

t-tye added inline comments.Nov 30 2017, 7:35 PM
lib/Target/AMDGPU/GCNHazardRecognizer.cpp
366–368 ↗(On Diff #124634)

Does that address all the items in the previous comment?

Also curious which line of code achieves that as it does not appear to be this line of code.

arsenm updated this revision to Diff 255163.Apr 5 2020, 7:35 AM
arsenm added a reviewer: kerbowa.

Rebase

arsenm requested review of this revision.Apr 5 2020, 7:35 AM
kerbowa added inline comments.Apr 11 2020, 11:54 PM
llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
60

gfx10 has 6 bits for LGKMCNT.