This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Avoid SCC clobbering before S_CSELECT_B32
ClosedPublic

Authored by alex-t on Oct 18 2022, 6:40 AM.

Details

Summary

Frame lowering inserts a scalar addition to compute the offset to the
stack objects. This addition is inserted at an arbitrary place and may clobber
SCC between its definition and the S_CSELECT_B32 instruction. This change
works around this particular code pattern. It queries the scavenger for an SGPR and,
if one is available, saves SCC to it and restores its value after the frame lowering code
insertion.

Diff Detail

Event Timeline

alex-t created this revision.Oct 18 2022, 6:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 18 2022, 6:40 AM
alex-t requested review of this revision.Oct 18 2022, 6:40 AM
Herald added a subscriber: wdng. · View Herald Transcript
foad added a comment.Oct 18 2022, 7:34 AM

It queries the scavenger for an SGPR and, if one is available, saves SCC to it
and restores its value after the frame lowering code insertion.

What if there is no free SGPR? I don't think it's acceptable to generate broken code in that case.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2230

Need braces and indentation for the body of this "if".

foad added inline comments.Oct 18 2022, 7:36 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
843

Weird indentation here. Can you run git clang-format on this change?

2231

Don't need the != AMDGPU::NoRegister because Register converts to bool.

Need braces around the multi-line BuildMI call.

It queries the scavenger for an SGPR and, if one is available, saves SCC to it
and restores its value after the frame lowering code insertion.

What if there is no free SGPR? I don't think it's acceptable to generate broken code in that case.

I expected this question and deliberately left this open for discussion. I think the proper behavior is to report an error and bail out.
No alternative looks good to me.
Attempting to use V_ADD_* is unreliable, as it requires scavenging a register again, which is also not guaranteed.
Moving the S_ADD_I32 insertion point may be unsolvable due to the data dependence problem.

arsenm added inline comments.Oct 18 2022, 10:11 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
841–847

This isn't harmful, but I also doubt it helps you. This is only used by LocalStackSlotAllocation and runs before dead flags would automatically be computed.

2231

So this just remains broken if the scavenge failed?

2232

Shouldn't special case s_cselect*

2364

These dead setting changes should be done separately

alex-t updated this revision to Diff 468614.Oct 18 2022, 10:42 AM

Implementing error reporting if no SGPR scavenged. Formatting.

alex-t marked 5 inline comments as done.Oct 18 2022, 10:59 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2232

I see that it looks like a dirty hack, and it really is. Otherwise, we would have a huge number of SCC save/restore instructions throughout the code, most of them unnecessary. In fact, only s_cselect and carry-out instructions really use SCC, but most of the SALU instructions read it.
BTW, I am curious why we did not hit any errors related to SCC clobbering before the s_cselect patch?

alex-t marked an inline comment as done.Oct 18 2022, 11:02 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2231

For now, I would opt for reporting a fatal error here.
We could try to use v_add on those targets that allow v_add with an SGPR and a constant, if the frame index user accepts a VGPR. This would give us a chance in some cases but not in general.

alex-t added inline comments.Oct 18 2022, 11:37 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2232

Oops, sorry, I was wrong.
There is really no need to special-case s_cselect here. I checked the TD files, and we only have SCC uses where they are really necessary.

alex-t updated this revision to Diff 468734.Oct 18 2022, 4:13 PM

SCC save/restore for all SCC users. The code that sets the dead flag on SCC operands has been removed, as it should go in a separate change.

alex-t marked 2 inline comments as done.Oct 18 2022, 4:15 PM
foad added inline comments.Oct 19 2022, 2:12 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2227–2228

This still makes me nervous. For graphics we use LLVM as a JIT compiler, and we definitely have cases where SGPRs are spilled (so presumably there are no free SGPRs). What are we supposed to do if this scavenge fails?

alex-t added inline comments.Oct 19 2022, 5:25 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2227–2228

What were we supposed to do if the scavenge failed at line 2198 or line 2243?
In fact, I have just followed the error handling strategy implemented for the same condition earlier in this file. I assumed that if it has worked so far, it should be okay going forward.

In general, no strategy exists that ensures we never run out of registers.

We could try vector addition with an SGPR and a constant, but that is possible on a few targets only. Otherwise, we would need to scavenge a VGPR, but we have already failed to scavenge an SGPR, which means we were unable to spill one and probably have no VGPRs available either.

The trick of using the FrameRegister as temporary storage is already used by frame lowering itself; also, the SCC copy must be live from the point before the S_ADD insertion until right after it.

Moving the insertion point before the SCC definition is restricted by the result register placement as follows:

(1) def R
(2) def SCC
(3) use R
<==== here we're about to insert R = S_ADD_I32 FR, Off
(4) use SCC

We cannot move the insertion point over another use of R.
In the general case, this is unsolvable.

My point is that:

  1. A "failed to scavenge" recovery strategy is a complex problem that should be addressed separately.
  2. There is no general solution, as one can always invent an input on which the compiler runs out of registers.

Please correct me if I am wrong.

alex-t marked an inline comment as done.Oct 19 2022, 6:21 AM
alex-t added inline comments.Oct 19 2022, 6:26 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2214

If we failed to scavenge an SGPR for the SCC save/restore, we never get here. But if we have got here, we would have failed to scavenge an SGPR for the frame offset computation as well.

alex-t updated this revision to Diff 468942.Oct 19 2022, 9:12 AM

A bit more compact code.

foad added inline comments.Oct 19 2022, 9:32 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2212–2220

What is this loop for? Isn't the scavenger already supposed to know whether SCC is live?

foad added inline comments.Oct 19 2022, 9:38 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2276

None of this code should depend on the wave size. You don't want getBoolRC here; that's for VCC, which might be a register pair in wave64. You can always save SCC in a single 32-bit SGPR.

alex-t added inline comments.Oct 19 2022, 9:43 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2212–2220

What is this loop for? Isn't the scavenger already supposed to know whether SCC is live?

The scavenger's knowledge depends on accurate dead/kill flag insertion. This loop decreases the number of false-positive "live" SCC results and, accordingly, the amount of unnecessary SCC save/restore code.

As Matt rightly pointed out, correct dead/kill flag placement is out of scope for this change.
It is also going to be a complex task.

Should I add a corresponding TODO here?

alex-t marked 2 inline comments as done.Oct 19 2022, 9:52 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2276

Since the copy from the SGPR back to SCC depends on EXEC/EXEC_LO, and I am using the same register for saving and restoring, I need to use a 64-bit register unless isWave32. I am not sure if I can use S_AND_B32 reg, exec_lo in wave64 mode?

foad added inline comments.Oct 19 2022, 11:43 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2212–2220

Haven't dead/kill flags been automatically computed before this code runs?

If you do need this loop, shouldn't it conservatively assume that SCC is live at the end of the basic block?

2276

You can just use "s_cmp_lg_u32 sgpr, 0" to copy back to SCC.

alex-t marked an inline comment as done.Oct 19 2022, 1:10 PM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2212–2220

Let me illustrate what I mean. For the code snippet below,

(1) $sgpr3 = S_ADD_I32 $sgpr33, 32, implicit-def $scc
(2) $vgpr51 = V_ADD_U32_e64 killed $sgpr3, killed $sgpr43, 0, implicit $exec
(3) $sgpr3 = S_ADD_I32 $sgpr33, 32, implicit-def $scc
(4) $vgpr52 = V_ADD_U32_e64 killed $sgpr3, killed $sgpr44, 0, implicit $exec

the scavenger considers SCC "live" between (1) and (3) because its internal iterator position points to (2). Calling the "advance" method inside the target hook breaks the outer PEI logic: PEI does not expect the scavenger state to be changed inside the TargetRegisterInfo::LowerFrameIndex method.

In this particular case, I can add the dead flag at the exact place:

auto Add = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), TmpSReg)
            .addReg(FrameReg)
            .addImm(Offset);
Add->getOperand(3).setIsDead();

Unfortunately, we have plenty of places like that. And for cases that are not as trivial, we are going to have plenty of odd code.

2212–2220

If you do need this loop, shouldn't it conservatively assume that SCC is live at the end of the basic block?

Good point, thanks.

2276

Given that this is a pure scalar context, yes.

alex-t added inline comments.Oct 19 2022, 2:13 PM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2212–2220

If you do need this loop, shouldn't it conservatively assume that SCC is live at the end of the basic block?

Good point, thanks.

No, this does not really work. Given a BB with just one SCC def, following that idea we would insert

Some SCC def...
$sgpr38 = S_CSELECT_B32 -1, 0, implicit $scc
$sgpr8 = S_ADD_I32 $sgpr32, 2960, implicit-def $scc
S_CMP_LG_U32 $sgpr38, 0, implicit-def $scc

for each frame-lowering S_ADD_I32, since each time we have an S_CMP_LG_U32 defining SCC and no other defs before the end of the BB.
So we would scavenge each time until we ran out of registers.

alex-t added inline comments.Oct 19 2022, 2:21 PM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2212–2220

If you do need this loop, shouldn't it conservatively assume that SCC is live at the end of the basic block?

In fact, the loop aims to filter out a large number of false-positive SCC "live" results by simulating the register scavenger's "forward" job. But if SCC is live-in to some BB, it is still in the scavenger's "used regs".

alex-t updated this revision to Diff 469045.Oct 19 2022, 2:24 PM

SCC copy restore has been made wave size independent.

alex-t marked an inline comment as done.Oct 19 2022, 2:25 PM
foad added inline comments.Oct 20 2022, 2:35 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2212–2220

The loop still sounds like a hack to me that will have O(n^2) cost in the worst case. But I really don't know enough about PEI or the RegisterScavenger to review this properly.

alex-t marked an inline comment as done.Oct 20 2022, 3:16 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2212–2220

The loop still sounds like a hack to me that will have O(n^2) cost in the worst case. But I really don't know enough about PEI or the RegisterScavenger to review this properly.

I don't like this approach either and would like to look for a neat solution. This is proposed as a workaround to stop Vulkan RT from failing.

Could you also clarify why you consider it O(n^2)? The worst case is when we have to walk to the end of the BB. This is just a one-level loop, and each instruction is visited once.

foad added inline comments.Oct 20 2022, 6:20 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2212–2220

In a large BB with n instructions that need frame index elimination, you will walk to the end of the BB (worst case) for each one of them. So that is O(n^2) overall cost to run this pass.

alex-t marked an inline comment as done.Oct 20 2022, 9:49 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2212–2220

Ok, you meant the whole pass, not the loop itself. I got it now.

alex-t marked an inline comment as done.Oct 20 2022, 10:17 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2212–2220

The PEI main loop processes instructions one by one, inserting the frame index lowering code before an instruction if it has a frame index operand.
The register scavenger steps forward AFTER the current instruction has been processed. That means we have the register liveness information at the point right before the instruction being processed. If the preceding instruction uses SCC, the scavenger considers it "live" even though the current instruction kills it. Without the loop, we would have a lot of unnecessary SCC save/restore code and would run out of registers soon.
Consider the following example:

renamable $sgpr50 = S_CSELECT_B32 2924, killed renamable $sgpr50, implicit $scc ** <-- RS is here and RS->isRegUsed(AMDGPU::SCC) is true **
renamable $sgpr40 = S_ADD_I32 killed renamable $sgpr40, %stack.0.__llpc_global_proxy_, implicit-def dead $scc

We do not need to save SCC here, but we cannot know that SCC will be dead at the next instruction without scanning forward.
Typically we break from the loop at the first def or use of SCC.
In the worst case, if

  • we have an instruction that uses a frame index at the beginning of a large BB
  • and this instruction also uses SCC
  • and no def/use of SCC appears before the end of the BB

then yes, in this case we scan the whole BB in the loop and the whole pass is O(n^2).
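The decision made by that forward scan can be sketched as a small Python model. This is only an illustration of the logic described above, not LLVM code: each instruction is reduced to a pair of def/use register-name sets, and `needs_scc_save` is a hypothetical name.

```python
def needs_scc_save(instrs, start):
    """Decide whether SCC must be preserved before inserting an
    SCC-clobbering S_ADD_I32 at position `start`.

    Each instruction is modeled as a (defs, uses) pair of
    register-name sets; a toy stand-in for real MIR."""
    for defs, uses in instrs[start:]:
        if "scc" in uses:
            return True   # the live SCC value is consumed later: must save/restore
        if "scc" in defs:
            return False  # SCC is redefined first: the old value is dead
    # No def or use until the end of the block: conservatively
    # assume SCC is live-out.
    return True

# The dead-$scc example from the discussion: the very next instruction
# redefines SCC, so no save is needed.
block = [
    ({"sgpr3", "scc"}, {"sgpr33"}),    # S_ADD_I32 ..., implicit-def dead $scc
    ({"vgpr51"}, {"sgpr3", "sgpr43"})  # V_ADD_U32_e64 ...
]
print(needs_scc_save(block, 0))  # False
```

The worst case discussed above corresponds to the loop falling through to the final `return True` for every frame index in the block.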

alex-t added a comment.EditedOct 20 2022, 11:41 AM

In fact, to avoid this unwanted loop I can use another hack: step the scavenger forward one position to look up the SCC liveness at the insertion slot, and then immediately step it backward to restore its position and avoid breaking the PEI logic.
This looks weird but avoids the loop.

Also, this approach still allows some number of false positives and, hence, some unnecessary SCC saves/restores. Given that we may already have high SGPR pressure, each false positive spends yet another SGPR and increases the probability of running out of registers.

Why can't this approach be improved?
We cannot make a reasonable decision regarding an SCC save/restore without scanning down to the nearest use or definition. The register scavenger retrieves liveness information moving forward one instruction at a time. So, at each point, it only knows that the register was defined somewhere before and has not yet been killed by another definition. It may become dead at the next instruction or keep living for an arbitrarily long distance.
To make a reasonable decision regarding SCC, we need to know whether SCC is "live" at the insertion point and whether it is used at some point before the next definition. Otherwise, we have to conservatively save/restore SCC whenever it is "live" at the point where we are about to insert a scalar instruction that clobbers it. In all cases where SCC is redefined before a use, such a save/restore is unnecessary but spends an SGPR.

To illustrate the above:

def SCC
some other instructions
use of the FI   <-- We are going to insert "S_ADD_I32 FrameReg, Offset" right before it;
                    SCC is "live" here
        ************
an arbitrarily long sequence of instructions
that neither use nor define SCC
        ************
an instruction "I" that defines or uses SCC

If "I" defines SCC (killing the previous definition), no SCC save/restore is necessary at the insertion point.
If "I" uses SCC, it is necessary to save it before the S_ADD_I32 insertion and restore it after.

We have to scan down until either "I" is found or the basic block ends.

Am I correct to assume that the frame offset is always DWORD aligned? If so, we have a couple of spare bits:

S_ADDC_U32 $reg, offset
S_BITCMP1_B32 $reg, 0
S_BITSET0_B32 $reg, 0

For the loop, which may potentially consume a lot of time, I'd suggest setting a small scan threshold and assuming SCC is used otherwise. This relies on being able to restore SCC without scavenging, as in my previous comment.
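The three-instruction sequence above can be checked with a small Python model of 32-bit wrap-around SALU arithmetic. SCC is modeled as a 0/1 carry bit, and the register/offset values are illustrative:

```python
MASK32 = 0xFFFFFFFF

def add_offset_preserving_scc(reg, offset, scc):
    """Model of the S_ADDC_U32 / S_BITCMP1_B32 / S_BITSET0_B32 sequence.

    Both `reg` and `offset` are DWORD aligned, so bit 0 of the sum is
    free to smuggle the carry-in (the saved SCC) through."""
    assert reg % 4 == 0 and offset % 4 == 0
    reg = (reg + offset + scc) & MASK32  # S_ADDC_U32: carry-in lands in bit 0
    scc = reg & 1                        # S_BITCMP1_B32 reg, 0: SCC = bit 0
    reg &= ~1                            # S_BITSET0_B32 reg, 0: clear the smuggled bit
    return reg, scc

for scc_in in (0, 1):
    reg_out, scc_out = add_offset_preserving_scc(0x1000, 0x20, scc_in)
    assert reg_out == 0x1020 and scc_out == scc_in
print("SCC survives the addition")
```

The key observation is that the carry-in is the only thing that can set bit 0 of the result, so reading and then clearing that bit both recovers SCC and leaves the exact sum.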

alex-t updated this revision to Diff 469731.Oct 21 2022, 12:59 PM

Applied approach suggested by @rampitec

Am I correct to assume that frame offset is always DWORD aligned? If so we have couple spare bits:

S_ADDC_U32 $reg, offset
S_BITCMP1_B32 $reg, 0
S_BITSET0_B32 $reg, 0

That's brilliant! Thanks a lot.

Am I correct to assume that frame offset is always DWORD aligned? If so we have couple spare bits:

S_ADDC_U32 $reg, offset
S_BITCMP1_B32 $reg, 0
S_BITSET0_B32 $reg, 0

That's brilliant! Thanks a lot.

Enjoy :)

rampitec added inline comments.Oct 22 2022, 12:05 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2215

I think you need to skip debug instructions without counting them, so that a debug build does not differ from a release build.

llvm/lib/Target/AMDGPU/SIRegisterInfo.h
96 ↗(On Diff #469731)

100 seems too large a number to me. Compiling with spills is already sufficiently and painfully slow. I would not go higher than 10 here. In the worst case you are now burning just 2 instructions and probably 8 cycles.

Besides, I do not think it deserves a special function; just put a constant threshold right inside the code, since it is only used once.

rampitec added inline comments.Oct 22 2022, 12:19 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2209

"Save" is not the right word anymore; "preserve" is.

foad added inline comments.Oct 24 2022, 3:57 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2255

-Offset here surely? Why aren't there any codegen tests for this?

2379–2380

Remove these changes from the patch.

alex-t updated this revision to Diff 470182.Oct 24 2022, 9:16 AM

Corrected wrong Frame Register restore code.
Several minor changes as requested.

alex-t marked 2 inline comments as done.Oct 24 2022, 9:20 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2379–2380

This is obviously a mistake. "-Offset" cannot work here.
Given that we always have Reg and Offset aligned, hence even, we should just use S_SUBB_U32 here.

llvm/lib/Target/AMDGPU/SIRegisterInfo.h
96 ↗(On Diff #469731)

100 seems too large a number to me. Compiling with spills is already sufficiently and painfully slow. I would not go higher than 10 here. In the worst case you are now burning just 2 instructions and probably 8 cycles.

Besides, I do not think it deserves a special function; just put a constant threshold right inside the code, since it is only used once.

I would not agree. There may really be tens of these odd SCC-preserving code sequences in a 100-line BB if it has frequent stack access. So this is not just 2 instructions but 2*N. I can show you an example of the assembly if necessary.

alex-t marked an inline comment as done.Oct 24 2022, 12:52 PM
alex-t marked 2 inline comments as done.Oct 24 2022, 2:26 PM
alex-t updated this revision to Diff 473701.Nov 7 2022, 8:53 AM

Temporary workaround to avoid the SCC liveness scanning loop, using manually set
"dead" flags to decrease the amount of unnecessary SCC-preserving code.

sebastian-ne added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2232

Maybe it makes sense to add an assert that the least significant bit of Offset is indeed 0? (Here and for s_subb below.)

foad added inline comments.Nov 17 2022, 6:58 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2255

The bitcmp1/bitset0 trick only works with addc. It doesn't work with subb, because if scc is set then the result will be 1 *less* than you wanted, so bitset0 will not correct it.
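This failure mode is easy to reproduce in a small Python model of 32-bit wrap-around arithmetic, with SCC modeled as the borrow-in bit of S_SUBB_U32 (values are illustrative):

```python
MASK32 = 0xFFFFFFFF

def subb_then_bitset0(reg, offset, scc):
    """Naive attempt to reuse the bit-0 trick for the subtraction."""
    reg = (reg - offset - scc) & MASK32  # S_SUBB_U32: borrow-in SUBTRACTS the saved SCC
    scc = reg & 1                        # S_BITCMP1_B32 reg, 0
    reg &= ~1                            # S_BITSET0_B32 reg, 0
    return reg, scc

# With SCC clear everything is fine...
assert subb_then_bitset0(0x1020, 0x20, 0) == (0x1000, 0)
# ...but with SCC set the subtraction gives 0x1020 - 0x20 - 1 = 0xFFF,
# and clearing bit 0 yields 0xFFE: 2 less than the wanted 0x1000.
reg, scc = subb_then_bitset0(0x1020, 0x20, 1)
assert scc == 1 and reg == 0xFFE  # SCC is recovered, but the register is wrong
```

Borrowing drops the result *below* the even target value, so clearing bit 0 moves it further away instead of correcting it; the result would need +1, which bitset0 cannot do.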

alex-t updated this revision to Diff 476953.Nov 21 2022, 11:06 AM

Since https://reviews.llvm.org/D137574 has landed, this review has been updated to use backward PEI.

alex-t updated this revision to Diff 477007.Nov 21 2022, 2:36 PM
alex-t marked 2 inline comments as done.

Added a test for using and restoring the frame register for the offset calculation.
S_SUBB_U32 changed to S_ADDC_U32 with "-Offset".

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2232

As far as I know, the flat scratch offset is always dword-aligned. If it is not, we have UB anyway, so the assert does not help.

2255

Oops, good point.
BTW, S_ADDC_U32 with "-Offset" may cause an overflow itself, so we would clobber SCC again.
Although the offset is a uint64_t, I guess we always have a flat scratch offset value small enough to never get into this trouble, just because we subtract the same value we added in the previous step.
And yes, we have had no problem with this so far because we had no tests for that case. I am going to add one.
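The S_ADDC_U32-with-"-Offset" restore can be checked with a small Python model of 32-bit wrap-around arithmetic (SCC modeled as the 0/1 carry bit; values illustrative). The two's-complement wrap is exactly what makes "-Offset" behave like the subtraction, and the negation of an even value is still even, so bit 0 stays free for the saved SCC:

```python
MASK32 = 0xFFFFFFFF

def addc_neg_offset(reg, offset, scc):
    """Restore `reg` to its pre-addition value while preserving SCC."""
    neg = (-offset) & MASK32          # "-Offset" as a 32-bit immediate (still even)
    reg = (reg + neg + scc) & MASK32  # S_ADDC_U32: carry-in lands in bit 0 again
    scc = reg & 1                     # S_BITCMP1_B32 reg, 0
    reg &= ~1                         # S_BITSET0_B32 reg, 0
    return reg, scc

for scc_in in (0, 1):
    reg_out, scc_out = addc_neg_offset(0x1020, 0x20, scc_in)
    assert reg_out == 0x1000 and scc_out == scc_in
```

In this model the intermediate carry-out of the addition is irrelevant, because the following bitcmp1 overwrites SCC from bit 0 anyway.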

foad added inline comments.Nov 21 2022, 10:53 PM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1632

Do you still need all these setIsDead changes? In any case can you remove them from *this* patch, so it just contains the essential changes? Maybe put them all in a separate patch, if there is still any need for them.

alex-t updated this revision to Diff 477139.Nov 22 2022, 4:41 AM

removed unrelated changes

alex-t marked an inline comment as done.Nov 22 2022, 4:42 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1632

No. I don't need them anymore. Just my inaccuracy. Removed.

alex-t marked an inline comment as done.Nov 22 2022, 5:11 AM
foad added a comment.Nov 22 2022, 5:57 AM

There are still setIsDead changes and whitespace changes. Please try to strip them all out so we get a minimal patch to review.

sebastian-ne added inline comments.Nov 22 2022, 6:56 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2232

Sure, an assert doesn't help correctness. But if at some point in the future we encounter a case where the offset is not aligned (maybe due to a bug elsewhere), having an assert makes it easy to find why some memory got corrupted.
Without an assert, it would surely take hours to find out what is going on.

There are still setIsDead changes and whitespace changes. Please try to strip them all out so we get a minimal patch to review.

Ok. I thought you wanted me to remove all the setIsDead calls since they are not needed anymore.

alex-t updated this revision to Diff 477236.Nov 22 2022, 9:44 AM

Removed changes not directly relevant to the current revision.
Added an assert to check that the flat scratch offset is aligned.

foad accepted this revision.Nov 22 2022, 10:00 AM

LGTM, thanks!

There are still setIsDead changes and whitespace changes. Please try to strip them all out so we get a minimal patch to review.

Ok. I thought you wanted me to remove all the setIsDead calls since they are not needed anymore.

That would probably be fine, but it should be a separate patch.

This revision is now accepted and ready to land.Nov 22 2022, 10:00 AM
alex-t marked an inline comment as done.Nov 22 2022, 10:06 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2232

The assert that is currently added only ensures that the offset is even.
This is enough to make sure the arithmetic here is correct. Although, to ensure correct alignment, it should be

!(Offset & flat_scratch_min_align)

I am not sure if that makes sense, but maybe there should be another error message?

This revision was landed with ongoing or failed builds.Nov 22 2022, 10:08 AM
This revision was automatically updated to reflect the committed changes.
arsenm added inline comments.Nov 22 2022, 10:10 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2227–2228

I think we just need to start reserving an SGPR ahead of time in case we run into these scenarios, and un-reserve it after allocation if unused

arsenm added inline comments.Nov 22 2022, 10:12 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2231

Needs a comment explaining this trick

alex-t marked 3 inline comments as done.Nov 22 2022, 10:47 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2227–2228

I guess it is for a separate change already.

2231

Okay, it is too late now, but could I just push yet another commit that only adds the comment?

alex-t marked 3 inline comments as done.Nov 22 2022, 12:46 PM