This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Construct memory clauses before RA
ClosedPublic

Authored by rampitec on May 29 2018, 6:57 PM.

Details

Summary

Memory clauses are formed into bundles in presence of xnack.
Their source operands are marked as early-clobber.
This allows to allocate distinct source and destination registers within a clause and prevent breaking the clause with s_nop in the hazard recognizer.
Clauses are undone before post-RA scheduler to allow some rescheduling, which will not break the clause since artificial edges are created in the dag to keep memory operations together. Yet this allows a better ILP in some cases.

Diff Detail

Event Timeline

rampitec created this revision.May 29 2018, 6:57 PM
rampitec updated this revision to Diff 149023.May 29 2018, 7:31 PM

Stripped patch off the parent revision D47509 changes.

mgrang added inline comments.May 29 2018, 8:53 PM
lib/Target/AMDGPU/SIFormMemoryClauses.cpp
172
rampitec updated this revision to Diff 149040.May 29 2018, 10:00 PM
rampitec marked an inline comment as done.

Replaced std::sort with llvm::sort.

Needs some tests with frame indexes, and with the d16 lo/hi operations with tied operands

lib/Target/AMDGPU/SIFormMemoryClauses.cpp
33–37

I thought this was a subtarget specific limit.

I apparently never committed D40578 which made the hazard recognizer aware of the subtarget specific sizes

42

Should this be using the BitVector of RegUnits instead?

313

SReg_32_XM0_XEXEC?

364

Why && (and for the rest)?

lib/Target/AMDGPU/SIInstructions.td
548–550

Why do you use this instead of using the generic TargetOpcode::BUNDLE?
I think trying to use a custom bundle instruction is going to cause issues. For example instruction size computation for this won't work properly, and any other code that expects to see BUNDLE

Needs some tests with frame indexes, and with the d16 lo/hi operations with tied operands

There is a test with frame indexes in mir (function name "stack"). Pass skips all instructions with FI operands, so the only test needed is to check there are no clauses formed.

lib/Target/AMDGPU/SIFormMemoryClauses.cpp
33–37

4 bits is a smallest counter for all ASICs (excluding exp, but we are not using exp here). On GFX9 there are two additional bits for vmcnt, but it does not make too much sense to create too big clauses, so to fit into 4 bit seems reasonable at least as a starting point.

Take in mind we cannot be precise here because if there are outstanding operations counter may be higher than zero when we come to a clause. In this case HW will break it anyway when it is saturated. Thus the value is not calculated from available counter bits or hardcoded, but rather a tunable option.

42

RegUnits are for Phys. Anyway, this carries lane bitmask *and* register state, so I need two values.

313

When SIRegisterInfo::getReservedRegs() works it reserves SGPR_32 for amdgpu-num-sgpr attribute. This limit accounts primarily for that attribute use (and debug reserved registers too). I also do not believe I want to use vcc or anything exotic here, I'd rather split the clause if we are running that low of registers.

364

I could use const auto &, but forwarding reference is less verbose (and usually recommended in fact).

lib/Target/AMDGPU/SIInstructions.td
548–550

Primarily to know what to remove post-RA. I do not think it creates any problems though, for instance getFunctionCodeSize() is called when memory clauses are already unbundled.

rampitec updated this revision to Diff 149208.May 30 2018, 2:58 PM

Added tests for d16 and d16_hi loads. Since operands are tied we skip creating a clause for them.

rampitec updated this revision to Diff 149225.May 30 2018, 5:41 PM
rampitec marked an inline comment as done.

Replaced custom instruction with TargetOpcode::BUNDLE.

lib/Target/AMDGPU/SIInstructions.td
548–550

Actually I can do it with BUNDLE too. Resulting bundle returns mayLoad() and can be identified.

vpykhtin added inline comments.May 30 2018, 11:49 PM
lib/Target/AMDGPU/SIFormMemoryClauses.cpp
243

moveMaxPressure actually clears accumulated MaxPressure - is this intended?

rampitec added inline comments.May 30 2018, 11:58 PM
lib/Target/AMDGPU/SIFormMemoryClauses.cpp
243

That is the only interface we made here ;) It does not really matter, as the next step will recreate MaxPressure. Take in mind that pressure cannot go down within the clause, only after. Therefor CurPressure is always a MaxPressure too while inside the clause.

vpykhtin accepted this revision.May 31 2018, 2:04 AM
vpykhtin added inline comments.
lib/Target/AMDGPU/SIFormMemoryClauses.cpp
243

ok, then. I will add something like getMaxPressure next time.

This revision is now accepted and ready to land.May 31 2018, 2:04 AM
This revision was automatically updated to reflect the committed changes.