Frame lowering inserts scalar addition to compute the offset to the
stack objects. This instructions inserted in arbitrary place and may clobber
SCC between its definition and S_CSELECT_B32 instruction. This change
workarounds this particular code pattern. It queries the scavenger for SGPR and
if available saves SCC to it and restore its value after frame lowering code
insertion.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
It queries the scavenger for SGPR and
if available saves SCC to it and restore its value after frame lowering code
insertion.
What if there is no free SGPR? I don't think it's acceptable to generate broken code in that case.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2228 | Need braces and indentation for the body of this "if". |
I expected this question and deliberately leave this for discussion. I think that the proper behavior is to report error and bail out.
No alternatives look good to me.
Attempt to use V_ADD_* is unreliable as it assumes to scavenge register again which is also not guaranteed.
Moving the S_ADD_I32 insertion point may be unsolvable due to the data dependence problem
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
843–846 | This isn't harmful, but I also doubt helps you. This is only used by LocalStackSlotAllocation and runs before dead flags should automatically be computed | |
2229 | So this just remains broken if the scavenge failed? | |
2230 | Shouldn't special case s_cselect* | |
2366 | These dead setting changes should be done separately |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2230 | I see it looks like a dirty hack. And it is really. Otherwise, we will have a huge amount of SCC save/restore instructions along the code. Most of them will be unnecessary. In fact, only s_cselect and carry out instructions really use SCC but most of the SALU read it. |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2229 | For now, I would opt for reporting fatal here. |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2230 | Oops. Sorry I was wrong. |
SCC save/restore for all SCC users. Code which sets dead flag in SCC operands removed as it should be in separate change.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2225–2226 | This still makes me nervous. For graphics we use LLVM as a JIT compiler, and we definitely have cases where SGPRs are spilled (so presumably there are no free SGPRs). What are we supposed to do if this scavenge fails? |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2225–2226 | What we were supposed to do if failed the scavenge at line 2198 and line 2243? In general, no strategy exists that ensures we never run out of registers. We can try to use vector addition with an SGPR and constant but it is possible for a few targets only. Otherwise, we'd need to scavenge VGPR but we already failed to scavenge SGPR which means that we were unable to spill one and probably has no VGPRs available. The trick with using FrameRegister as a temporary room is already used by frame lowering itself, also SCC copy should be alive from the point before the S_ADD insertion and last to the point right after it. Moving the insertion point before the SCC definition is restricted by the result register placement as follows: (1) def R (2) def SCC (3) use R <==== here we're about to insert R = S_ADD_I32 FR, Off (4) use SCC We cannot move the insertion point over another use of R. My point is that:
Please correct me if I am wrong. |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2212 | If we've failed to scavenge SGPR for SCC save/restore we never get here. But if we've got here we'd either fail to scavenge SGPR for the frame offset computation. |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2210–2218 | What is this loop for? Isn't the scavenger already supposed to know whether SCC is live? |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2273 | None of this code should depend on wave size. You don't want getBoolRC here - that's for VCC which might be a register pair in wave64. You can always save SCC in a single 32-bit SGPR. |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2210–2218 |
The scavenger's knowledge depends on the accurate dead/kill flags insertion. This loop decreases the number of false-positive "live" SCC and accordingly unnecessary SCC saves/restores. As Matt rightly pointed out, the dead/kill correct placement is out of this change scope. Should I add a corresponding TODO here? |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2273 | Since back SGPR to SCC copy depends on EXEC/EXEC_LO and I am using the same register for saving and restoring I need to use 64bit unless isWave32. I am not sure if I can use S_AND_B32 reg, exec_lo in wave64 mode? |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2210–2218 | Let me illustrate what I mean. For the code snipped below (1) $sgpr3 = S_ADD_I32 $sgpr33, 32, implicit-def $scc (2) $vgpr51 = V_ADD_U32_e64 killed $sgpr3, killed $sgpr43, 0, implicit $exec (3) $sgpr3 = S_ADD_I32 $sgpr33, 32, implicit-def $scc (4) $vgpr52 = V_ADD_U32_e64 killed $sgpr3, killed $sgpr44, 0, implicit $exec scavenger considers SCC "live" between (1) and (3) because its internal iterator position points to (2). Calling the "advance" method inside the target hook breaks the outer PEI logic. Literally, PEI does not expect the scavenger state to be changed inside the TargetRegisterInfo::LowerFrameIndex method. In this particular case, I can add the dead flag to the particular place: auto Add = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), TmpSReg) .addReg(FrameReg) .addImm(Offset); Add->getOperand(3).setIsDead(); Unfortunately, we have plenty of places like that. And for cases that are not as trivial, we are going to have plenty of odd code. | |
2210–2218 |
Good point, thanks. | |
2273 | Given that this this is pure scalar context yes |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2210–2218 |
No, this does not work really. Given that we have a BB with just one SCC def, following the idea we would insert Some SCC def... $sgpr38 = S_CSELECT_B32 -1, 0, implicit $scc $sgpr8 = S_ADD_I32 $sgpr32, 2960, implicit-def $scc S_CMP_LG_U32 $sgpr38, 0, implicit-def $scc on each frame lowering S_ADD_I32 since each time we have S_CMP_LG_U32 defining SCC and no other defs to the end of the BB. |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2210–2218 |
In fact, the loop aims to restrict a big amount of false positive SCC "live" slots by simulating the register scavenger "forward" job. But if SCC live-in in some BB it is still in scavenger "usedregs". |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2210–2218 | The loop still sounds like a hack to me that will have O(n^2) cost in the worst case. But I really don't know enough about PEI or the RegisterScavenger to review this properly. |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2210–2218 |
I don't like this approach either and would like to look for a neat solution. This is proposed as a workaround to stop Vulkan RT from failing. Could you also clarify, why you consider it O(n^2)? The worst case is when we have to walk to the end of the BB. This is just on level loop and each instruction is visited once. |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2210–2218 | In a large BB with n instructions that need frame index elimination, you will walk to end of the BB (worst case) for each one of them. So that is O(n^2) overall cost to run this pass. |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2210–2218 | Ok, you meant the whole pass, not the loop itself. I got it now. |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2210–2218 | PEI main loop process instructions one by one inserting the frame index lowering code before instruction if it has the frame index operand. renamable $sgpr50 = S_CSELECT_B32 2924, killed renamable $sgpr50, implicit $scc ** <-- RS is here and RS->isRegUsed(AMDGPU::SCC) is true ** renamable $sgpr40 = S_ADD_I32 killed renamable $sgpr40, %stack.0.__llpc_global_proxy_, implicit-def dead $scc We do not need to save SCC here, but we cannot know that SCC will be dead at the next instruction without scanning forward.
Yes, in this case, we scan all the BB in the loop and the whole pass has O(n^2). |
In fact, to avoid this unwanted loop I can use another hack - just forward scavenger one position to look up the SCC liveness at the insertion slot and then immediately backward scavenger to restore its position and avoid breaking the PEI logic.
This looks weird but avoids the loop.
Also, this approach still allows some amount of false positives and, hence, some unnecessary SCC saves/restores. Given that we may already have high SGPR pressure, each false positive spends yet one more SGPR and increases the probability to run out of registers.
Why this approach cannot be improved?
We cannot make a reasonable decision regarding SCC save/restore without scanning down to the nearest use or definition. Register scavenger retrieves liveness information going forward by each instruction at a time. So, at each point, it only knows that the register was defined somewhere before and has not been yet killed by another definition. It may become dead at the next instruction or keep living an arbitrarily long distance.
To make a reasonable decision regarding SCC we need to know if SCC is "live" at the insertion point and if it is used at some point before the next definition. Otherwise, we have to conservatively save/restore SCC if it is "live" at the point where we're about to insert scalar instruction that clobbers it. In all cases where SCC is re-defined before the use, such save/restore is unnecessary but spends SGPR.
To illustrate the above:
def SCC some others instructions use of the FI <-- We are going to insert "S_ADD_I32 FrameReg, Offset" right before it SCC is "live" here ************ here is an arbitrarily long sequence of instructions that neither use nor define SCC ************ Here is an instruction "I" that defines or uses SCC
if "I" defines SCC (kills the previous definition) - no SCC save/restore is necessary at the insertion point.
if "I" uses SCC - it is necessary to save it before S_ADD_I32 insertion and restore it after.
We have to scan down until either "I" is found or the basic block ends.
Am I correct to assume that frame offset is always DWORD aligned? If so we have couple spare bits:
S_ADDC_U32 $reg, offset S_BITCMP1_B32 $reg, 0 S_BITSET0_B32 $reg, 0
For the loop which may potentially consume a lot of time I'd suggest to set a small scan threshold and assume scc is used otherwise. This relies on the possibility of restoring scc without scavenging, like in my previous comment.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2213 | I think you need to skip debug instructions without counting so that debug build does not differ from release. | |
llvm/lib/Target/AMDGPU/SIRegisterInfo.h | ||
96 ↗ | (On Diff #469731) | 100 seems to large number to me. Compiling with spills is already sufficiently and painfully slow. I would not go more than 10 here. In a worst case you are now burning just 2 instructions and probably 8 cycles. Besides I do not think it deserves a special function, just a constant threshold right inside the code. Just because it is only used once. |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2207 | Save is not a right word anymore. Preserve is. |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2376 | This is obviously a mistake. "-Offset" cannot work here. | |
llvm/lib/Target/AMDGPU/SIRegisterInfo.h | ||
96 ↗ | (On Diff #469731) |
I would not agree. Really there may be tens of the odd SCC preserving code sequences in 100 lines length BB if it has frequent stack access. So, this is not just 2 instructions but 2*N. I can show you the example of assembly if necessary. |
Temporary workaround to avoid SCC liveness scanning loop and using manually set
"dead" flags to decrease the amount of unnecessary SCC preserving code.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2230 | Maybe it makes sense to add an assert, that the least significant bit of Offset is indeed 0? (here and for s_subb below) |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2253 | The bitcmp1/bitset0 trick only works with addc. It doesn't work with subb, because if scc is set then the result will be 1 *less* than you wanted, so bitset0 will not correct it. |
Since the https://reviews.llvm.org/D137574 has been landed, this review is updated to use backward PEI.
added test for using and restoring frame register for offset calculation.
S_SUBB_U32 changed to S_ADDC_U32 with "-Offset"
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2230 | As far as I know, the flat scratch offset is always dword-aligned. If it is not we've got the UB anyway. So, the assert does not help. | |
2253 | Oops. Good point. |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
1633 | Do you still need all these setIsDead changes? In any case can you remove them from *this* patch, so it just contains the essential changes? Maybe put them all in a separate patch, if there is still any need for them. |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
1633 | No. I don't need them anymore. Just my inaccuracy. Removed. |
There are still setIsDead changes and whitespace changes. Please try to strip them all out so we get a minimal patch to review.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2230 | Sure, an assert doesn’t help correctness. But if at some point in the future, we encounter a case where the offset is not aligned – maybe due to a bug elsewhere – having an assert makes it easy to find why some memory got corrupted. |
Ok. I thought you wanted me to remove all the setIsDead calls since they are not needed anymore.
removed changes not relevant directly to the current revision
added assert to check if the flat scratch offset is aligned
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2230 | The assert which is currently added only ensures that the offset is even. !(Offset & flat_scratch_min_align) I am not sure if it makes sense but maybe there should be another error message? |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2225–2226 | I think we just need to start reserving an SGPR ahead of time in case we run into these scenarios, and un-reserve it after allocation if unused |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2229 | Needs a comment explaining this trick |
Weird indentation here. Can you run git clang-format on this change?