Memory clauses are formed into bundles in presence of xnack.
Their source operands are marked as early-clobber.
This allows to allocate distinct source and destination registers within a clause and prevent breaking the clause with s_nop in the hazard recognizer.
Clauses are undone before post-RA scheduler to allow some rescheduling, which will not break the clause since artificial edges are created in the dag to keep memory operations together. Yet this allows a better ILP in some cases.
Details
Diff Detail
Event Timeline
lib/Target/AMDGPU/SIFormMemoryClauses.cpp | ||
---|---|---|
172 | Please use llvm::sort instead of std::sort. See https://llvm.org/docs/CodingStandards.html#beware-of-non-deterministic-sorting-order-of-equal-elements. |
Needs some tests with frame indexes, and with the d16 lo/hi operations with tied operands
lib/Target/AMDGPU/SIFormMemoryClauses.cpp | ||
---|---|---|
33–37 | I thought this was a subtarget specific limit. I apparently never committed D40578 which made the hazard recognizer aware of the subtarget specific sizes | |
42 | Should this be using the BitVector of RegUnits instead? | |
313 | SReg_32_XM0_XEXEC? | |
364 | Why && (and for the rest)? | |
lib/Target/AMDGPU/SIInstructions.td | ||
548–550 | Why do you use this instead of using the generic TargetOpcode::BUNDLE? |
There is a test with frame indexes in mir (function name "stack"). Pass skips all instructions with FI operands, so the only test needed is to check there are no clauses formed.
lib/Target/AMDGPU/SIFormMemoryClauses.cpp | ||
---|---|---|
33–37 | 4 bits is a smallest counter for all ASICs (excluding exp, but we are not using exp here). On GFX9 there are two additional bits for vmcnt, but it does not make too much sense to create too big clauses, so to fit into 4 bit seems reasonable at least as a starting point. Take in mind we cannot be precise here because if there are outstanding operations counter may be higher than zero when we come to a clause. In this case HW will break it anyway when it is saturated. Thus the value is not calculated from available counter bits or hardcoded, but rather a tunable option. | |
42 | RegUnits are for Phys. Anyway, this carries lane bitmask *and* register state, so I need two values. | |
313 | When SIRegisterInfo::getReservedRegs() works it reserves SGPR_32 for amdgpu-num-sgpr attribute. This limit accounts primarily for that attribute use (and debug reserved registers too). I also do not believe I want to use vcc or anything exotic here, I'd rather split the clause if we are running that low of registers. | |
364 | I could use const auto &, but forwarding reference is less verbose (and usually recommended in fact). | |
lib/Target/AMDGPU/SIInstructions.td | ||
548–550 | Primarily to know what to remove post-RA. I do not think it creates any problems though, for instance getFunctionCodeSize() is called when memory clauses are already unbundled. |
Added tests for d16 and d16_hi loads. Since operands are tied we skip creating a clause for them.
Replaced custom instruction with TargetOpcode::BUNDLE.
lib/Target/AMDGPU/SIInstructions.td | ||
---|---|---|
548–550 | Actually I can do it with BUNDLE too. Resulting bundle returns mayLoad() and can be identified. |
lib/Target/AMDGPU/SIFormMemoryClauses.cpp | ||
---|---|---|
243 | moveMaxPressure actually clears accumulated MaxPressure - is this intended? |
lib/Target/AMDGPU/SIFormMemoryClauses.cpp | ||
---|---|---|
243 | That is the only interface we made here ;) It does not really matter, as the next step will recreate MaxPressure. Take in mind that pressure cannot go down within the clause, only after. Therefor CurPressure is always a MaxPressure too while inside the clause. |
lib/Target/AMDGPU/SIFormMemoryClauses.cpp | ||
---|---|---|
243 | ok, then. I will add something like getMaxPressure next time. |
I thought this was a subtarget specific limit.
I apparently never committed D40578 which made the hazard recognizer aware of the subtarget specific sizes