This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Reserve SGPR pair when long branches are present
ClosedPublic

Authored by crobeck on May 3 2023, 11:15 AM.

Details

Reviewers
bcahoon
msearles
vpykhtin
arsenm
b-sumner
Pierre-vh
Group Reviewers
Restricted Project
Summary

After applying recent register allocation patches the compiler generates a register scavenging error. Branch Relaxation runs after RA, but RA doesn't allocate enough registers to plan for the case where an indirect branch ends up being needed during branch relaxation. This causes a “did not find scavenging index” assert to be hit from assignRegToScavengingIndex from within RegScavenger.

In this patch we estimate before RA whether an indirect branch is likely to be needed, and reserve 2 SGPRs if the branch distance is found to be above a threshold. This is difficult as you often don't have an accurate idea of the code size and branch distance before register allocation and when you would need to reserve the registers. We therefore make the distance calculation a reduced complexity approximation and add a tuning factor on the threshold through the -amdgpu-long-branch-factor command line argument.

Diff Detail

Event Timeline

crobeck created this revision.May 3 2023, 11:15 AM
crobeck created this object with visibility "Restricted Project (Project)".
crobeck created this object with edit policy "Restricted Project (Project)".
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2023, 11:15 AM
crobeck requested review of this revision.May 3 2023, 11:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2023, 11:15 AM
crobeck updated this revision to Diff 519182.May 3 2023, 11:28 AM
crobeck updated this revision to Diff 519208.May 3 2023, 12:33 PM

I'm also wondering if reserving s[6:7] could have unintended side effects? Aren't those values used to pass parameters to the kernel?
https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-sgpr-register-set-up-order-table

Maybe we need to use different, higher registers or something that depends on the CC/arguments of the function?

llvm/lib/Target/AMDGPU/GCNPreRABranchDistance.cpp
3

formatting

11

missing comment (or just remove this)

30

I'm not sure I understand, do you mean that the (estimated) branch offset gets multiplied by this value, and if the result doesn't fit in the branch instruction, we consider it a long branch?

69–70
94–97
171–175

If you're just interested in unconditional branches you can remove an indentation level

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2606–2609

If I understand correctly, LongBranchReservedReg is always reserved so could we just avoid running the scavenger when it's set and just use it directly? Otherwise it's just a "wasted" reserved register

+ nit: no {} for a one-line if body
+ very small nit: you could also just use Register() as the value for LongBranchReservedReg I think, it avoids the need to check against NoRegister. IMO it's more intuitive as you can just do if(Register LongBranchReservedReg = MFI->getLongBranchReservedReg()) to check for its presence

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
384

Add a small doxygen comment on top

893

I'm not a fan of the default argument, I would just let the caller set it.
Generally if I see a parameterless setter such as setLongBranchReservedReg() I would expect that LongBranchReservedReg is a bool that we're setting to true - for other types of value I'd expect to have to pass a parameter

note that it's just my opinion, if a reviewer disagrees or if there is a precedent for this style then it's fine.

crobeck updated this revision to Diff 519465.May 4 2023, 6:22 AM
crobeck added a reviewer: Pierre-vh.

address some review comments

crobeck marked 7 inline comments as done.May 4 2023, 6:33 AM
crobeck added inline comments.
llvm/lib/Target/AMDGPU/GCNPreRABranchDistance.cpp
30

Right, it's essentially just an empirical tuning parameter on when we reserve the registers since estimating the code/branch size at this point, pre RA, is difficult.

171–175

I believe we just care about unconditional branches. But input on the logic here is of interest as well.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2606–2609

LongBranchReservedReg is only set if the PreRABranchDistance pass sets it because it finds a long branch. Otherwise it is just set to null.

Formatting of LongBranchReservedReg != AMDGPU::NoRegister seemed consistent with the surrounding code but no strong feelings there. Open to changing.

crobeck marked 2 inline comments as done.May 4 2023, 6:40 AM
crobeck marked 2 inline comments as not done.May 4 2023, 7:09 AM
crobeck marked an inline comment as done.May 5 2023, 10:34 AM
crobeck updated this revision to Diff 520388.May 8 2023, 8:25 AM
crobeck added a reviewer: Restricted Project.

address review comment

crobeck marked an inline comment as done.May 8 2023, 8:26 AM

Absolutely do not globally reserve s[6:7] for such things. Low SGPRs are very often given special meaning in function signatures.

Furthermore, is a global reservation even necessary? You could identify candidate long branches and give them a fake clobbered operand, so that the register allocator can just pre-choose a register pair, but not block it everywhere.

crobeck added a comment.EditedMay 9 2023, 5:21 AM

Absolutely do not globally reserve s[6:7] for such things. Low SGPRs are very often given special meaning in function signatures.

Furthermore, is a global reservation even necessary? You could identify candidate long branches and give them a fake clobbered operand, so that the register allocator can just pre-choose a register pair, but not block it everywhere.

Sure. My thinking was to pick as low a SGPR pair as I could that wasn't already explicitly reserved. I thought s[6:7] were not used in the kernel args but I could be wrong. If there is another, higher, pair that makes more sense I'm OK with that.

I'll have to think about the fake clobbered operand idea. Would we be more likely to over allocate SGPRs when might not need them in that case?

vpykhtin added inline comments.May 12 2023, 7:40 AM
llvm/lib/Target/AMDGPU/GCNPreRABranchDistance.cpp
174

If you always need only the offset of the last instruction can you calculate it using the offset of the next MBB instead of summing every BB instruction's size?

I'm not sure why but this review doesn't contain context (unchanged lines of code), may be it worth to try arc diff to generate review requests.

What if we always reserve a register for long jumps?

arsenm added inline comments.May 12 2023, 10:12 AM
llvm/lib/Target/AMDGPU/GCNPreRABranchDistance.cpp
37

This is all copied from BranchRelaxation?

I think you're trying to be far too precise for tracking this. This is going to be far too fuzzy given you have no idea what spills are going to be inserted. I think expected ~10 bytes per instruction without bothering to compute actual instruction sizes is enough and see if that's in the neighborhood of the maximum branch distance

179

This made a change and needs to return true. You also didn't actually modify the set of reserved registers

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
387

This is missing MIR serialization.

Also don't need the initializer.

897

This shouldn't just hardcode to s[6:7]. Should pick the highest available and we can compact later (although I guess it's a bit awkward to move this one since relaxation runs after PEI where we handle all the other shifted reserved registers).

What if we always reserve a register for long jumps?

That would really be more correct

crobeck updated this revision to Diff 523895.May 19 2023, 12:21 PM
crobeck changed the visibility from "Restricted Project (Project)" to "Public (No Login Required)".

rebase and address some review comments

crobeck marked 3 inline comments as done.May 19 2023, 12:24 PM
crobeck edited the summary of this revision. (Show Details)May 19 2023, 1:10 PM
crobeck updated this revision to Diff 524741.May 23 2023, 8:45 AM
crobeck edited the summary of this revision. (Show Details)

addressing review comments.

crobeck added a comment.EditedMay 23 2023, 8:49 AM

What if we always reserve a register for long jumps?

I think we still want the ability to not reserve them in cases where they might not actually be needed and would waste resources. The amdgpu-long-branch-factor initialized default has been reset higher so that we lean toward always reserving them now - if this value is 0 the long branch registers are never reserved. As this value grows the great chance the branch distance will fall within the threshold and the registers will be marked to be reserved.

crobeck added inline comments.May 23 2023, 9:05 AM
llvm/lib/Target/AMDGPU/GCNPreRABranchDistance.cpp
37

The distance calculation is mostly from BR. I'm not sure the actual method of determining what maximum branch distance use as the threshold should matter too much since it's just an empirical tuning factor. We lean toward always reserving the register by setting the default threshold somewhat high and can tune with the CL argument for individual cases where we don't want to reserve them. However you're right this distance calculation is probably overkill and could simpler.

What if we always reserve a register for long jumps?

I think we still want the ability to not reserve them in cases where they might not actually be needed and would waste resources. The amdgpu-long-branch-factor initialized default has been reset higher so that we lean toward always reserving them now - if this value is 0 the long branch registers are never reserved. As this value grows the great chance the branch distance will fall within the threshold and the registers will be marked to be reserved.

Maybe we can reserve a register if the overall number of SGPRs is close to some threshold?

jrbyrnes added inline comments.
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1454

Will this ever be called directly after freezeReservedRegs for VGPRForAGPRCopy (i.e. on gfx908)? If so, in those cases, can we use reserveReg instead of the second freezeReservedRegs?

crobeck updated this revision to Diff 527120.May 31 2023, 10:51 AM

rebase and address review comments

crobeck marked an inline comment as done.May 31 2023, 10:52 AM
crobeck added inline comments.
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1454

That's a good point. Updated.

crobeck marked an inline comment as done.May 31 2023, 11:47 AM
crobeck updated this revision to Diff 527263.May 31 2023, 7:36 PM

formatting

Pierre-vh added inline comments.May 31 2023, 11:53 PM
llvm/lib/Target/AMDGPU/GCNPreRABranchDistance.cpp
3

formatting

179–181

All of those are unsigned I think so I would use it here too to be consistent
There is also a use of uint64_t somewhere else

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1453–1455

How does this work? Don't you need to check if UnusedLowSGPR returns something? Will it always return the pair we reserved earlier if nothing else is available?

Would it be worth adding a test with amdgpu-num-sgpr to see what happens when 100% of SGPRs are used?

llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg.ll
69

missing newline

crobeck updated this revision to Diff 527552.Jun 1 2023, 12:00 PM

fix some tests

crobeck marked an inline comment as done.Jun 1 2023, 12:00 PM
crobeck updated this revision to Diff 527587.Jun 1 2023, 1:32 PM

update to bypass register scavenger if we've previous reserved ones

crobeck marked an inline comment as done.Jun 1 2023, 1:32 PM
crobeck added inline comments.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2606–2609

Updated.

crobeck updated this revision to Diff 527592.Jun 1 2023, 1:37 PM

formatting

crobeck marked an inline comment as done.Jun 1 2023, 1:37 PM
crobeck updated this revision to Diff 527595.Jun 1 2023, 1:45 PM

update types in GCNPreRABranchDistance pass

crobeck marked an inline comment as done.Jun 1 2023, 1:45 PM
crobeck updated this revision to Diff 527688.Jun 1 2023, 6:07 PM
This comment was removed by crobeck.
crobeck updated this revision to Diff 527695.EditedJun 1 2023, 6:19 PM

fix a type mismatch

crobeck updated this revision to Diff 527892.Jun 2 2023, 10:07 AM
crobeck added inline comments.
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1453–1455

Yes, you're right. I've made the checks more explicit.

crobeck added inline comments.Jun 2 2023, 10:13 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1453–1455

Would it be worth adding a test with amdgpu-num-sgpr to see what happens when 100% of SGPRs are used?

That's a good idea. At the very least good to understand what happens when were maxed out on SGPRs.

crobeck updated this revision to Diff 527935.Jun 2 2023, 12:33 PM
crobeck updated this revision to Diff 530066.Jun 9 2023, 1:12 PM
crobeck edited the summary of this revision. (Show Details)

add test for when maxed out on SGPRs

crobeck added inline comments.Jun 9 2023, 1:24 PM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1453–1455

This ended up being kind of a mess trying to do it with the amdgpu-num-sgpr attribute so just used inline asm to use up all the registers

crobeck marked an inline comment as done.Jun 9 2023, 1:26 PM
crobeck added a comment.EditedJun 16 2023, 9:43 AM

Any further comments on this? @arsenm Do you think we need to update the distance calculation or are OK to leave as is?

arsenm added inline comments.Jun 16 2023, 3:55 PM
llvm/lib/Target/AMDGPU/GCNPreRABranchDistance.cpp
37

I think you should go for fuzzier. Drop the block map and all that from BranchRelaxation. Just do something dead simple like number of instructions * 8 * fuzz ~= branch distance

crobeck updated this revision to Diff 532736.Jun 19 2023, 12:23 PM

Reduce complexity of branch distance calculation substantially - we now just approximate branch size by assuming 8 bytes per instruction. Also updated pass name to be more representative of what the pass does.

crobeck updated this revision to Diff 532741.Jun 19 2023, 12:51 PM

clean up unused header files and an unused function

crobeck marked 2 inline comments as done.Jun 19 2023, 1:32 PM
crobeck edited the summary of this revision. (Show Details)Jun 20 2023, 8:27 AM
crobeck edited the summary of this revision. (Show Details)
crobeck edited the summary of this revision. (Show Details)Jun 20 2023, 8:29 AM
arsenm added inline comments.Jun 21 2023, 7:00 AM
llvm/lib/Target/AMDGPU/GCNPreRALongBranchReg.cpp
45 ↗(On Diff #532741)

uint64_ts?

77–78 ↗(On Diff #532741)

Clear at exit instead?

85 ↗(On Diff #532741)

Should skip debug and meta instructions, otherwise this violates the cardinal rule that debug info shouldn't change codegen

120 ↗(On Diff #532741)

const

134–136 ↗(On Diff #532741)

This should be moved into a TRI reserve registers helper like the others.

Also, would be a bit safer to invert this by reserving by default and having this pass *remove* the reserved register

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1429

The reserved registers should already be frozen before PEI. I think the current freezeReservedRegs call here was just never updated to use the new incremental reserveReg

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2555

We should also fix this fixme at some point

llvm/test/CodeGen/MIR/AMDGPU/machine-function-info-long-branch-reg.ll
67

Add a test to show no change for debug instructions

arsenm added inline comments.Jun 21 2023, 7:01 AM
llvm/lib/Target/AMDGPU/GCNPreRALongBranchReg.cpp
30 ↗(On Diff #532741)

If you're going to use FP for this, might as well use double

arsenm changed the edit policy from "Restricted Project (Project)" to "All Users".Jun 21 2023, 7:31 AM

Add a test for MIR serialization.

llvm/lib/Target/AMDGPU/GCNPreRALongBranchReg.cpp
106 ↗(On Diff #532741)

Capitalize.

117 ↗(On Diff #532741)

Capitalize.

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1445

Unify the freezeReservedRegs at the end of the function?

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2553

const

crobeck updated this revision to Diff 533380.Jun 21 2023, 1:41 PM
crobeck marked 8 inline comments as done.

address review comments, add additional debug instruction test

crobeck marked an inline comment as done.Jun 21 2023, 1:45 PM
crobeck added inline comments.
llvm/lib/Target/AMDGPU/GCNPreRALongBranchReg.cpp
134–136 ↗(On Diff #532741)

We're not actually reserving the register here. We're just setting the flag that we need to reserve the register in TRI. But, I can move this if it makes more sense there.

If we instead invert it we then need to always run through everything and prove there isn't a long branch as opposed to exiting once if we find a long branch.

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1429

Just set both calls to reserveReg then?

1445

Pending updating reserveReg call above.

crobeck updated this revision to Diff 533586.Jun 22 2023, 6:53 AM

Update reserveReg calls in SIFrameLowering

crobeck marked 2 inline comments as done.Jun 22 2023, 6:55 AM
crobeck added inline comments.
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1429

I think I've convinced myself these can both just be reserveReg calls.

crobeck updated this revision to Diff 533663.Jun 22 2023, 9:43 AM

Swap out reserve register routine for TRI helper function

crobeck added inline comments.Jun 22 2023, 9:52 AM
llvm/lib/Target/AMDGPU/GCNPreRALongBranchReg.cpp
134–136 ↗(On Diff #532741)

This should be cleaner now with a call into TRI instead of grabbing my own registers.

arsenm added inline comments.Jun 22 2023, 10:13 AM
llvm/lib/Target/AMDGPU/GCNPreRALongBranchReg.cpp
86 ↗(On Diff #533663)

Just isMetaInstruction is fine, it's a superset of isDebugIstr

102 ↗(On Diff #533663)

Don't need a specific instruction offset, just the block distances are good enough

crobeck updated this revision to Diff 533713.Jun 22 2023, 11:43 AM

remove specific instruction offsets from distance calc

crobeck marked 2 inline comments as done.Jun 22 2023, 11:44 AM
crobeck updated this revision to Diff 533731.Jun 22 2023, 12:30 PM

variable, pointer/reference clean up

crobeck updated this revision to Diff 533956.Jun 23 2023, 7:30 AM

invert to reserve register by default and then have pass remove the register if unneeded

crobeck marked an inline comment as done.Jun 23 2023, 7:38 AM
arsenm added inline comments.Jun 23 2023, 3:04 PM
llvm/lib/Target/AMDGPU/GCNPreRALongBranchReg.cpp
116–117 ↗(On Diff #533956)

I don't see how this can fail unless you're using amdgpu-num-sgprs (and this is one of the reasons it should be removed)

119 ↗(On Diff #533956)

I would assume this is already noregister if this failed

I think you misunderstood what I was saying about reserve first and release later. I meant in the compile pipeline, not in the pass itself. You can find the reserved register in finalizeLowering, and then have this pass unset and un-reserve. I don't know if we have an exposed helper to clear reserved registers. You can do that as a follow up,

120 ↗(On Diff #533956)

If you actually did modify the MFI register this should be a return true

143 ↗(On Diff #533956)

This should be return true if you modified MFI

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2624

Should also skip the enterBasicBlockEnd above

crobeck updated this revision to Diff 534108.Jun 23 2023, 3:58 PM

remove unnecessary set functions

crobeck marked 3 inline comments as done.Jun 23 2023, 4:00 PM
crobeck added inline comments.
llvm/lib/Target/AMDGPU/GCNPreRALongBranchReg.cpp
116–117 ↗(On Diff #533956)

It failed in some test cases where all the SGPRs were already used from inline asm

119 ↗(On Diff #533956)

Yes, you're right I can remove that set function.

Got it. OK, that makes sense.

crobeck updated this revision to Diff 534113.EditedJun 23 2023, 4:18 PM
crobeck marked an inline comment as done.

skipping some of register scavenging if we already have a reserved register

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2624

There is some init of MBB in enterBasicBlockEnd that breaks things if I skip it entirely. But I think I can get away will skipping some of it.

arsenm accepted this revision.Jun 27 2023, 3:14 PM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/GCNPreRALongBranchReg.cpp
116–117 ↗(On Diff #533956)

oh right

This revision is now accepted and ready to land.Jun 27 2023, 3:14 PM