This is an archive of the discontinued LLVM Phabricator instance.

[RegAllocGreedy] New hook regClassPriorityTrumpsGlobalness
ClosedPublic

Authored by foad on May 6 2022, 9:10 AM.

Details

Summary

Add a new TargetRegisterInfo hook to allow targets to tweak the
priority of live ranges, so that AllocationPriority of the register
class will be treated as more important than whether the range is local
to a basic block or global. This is determined per-MachineFunction.

Diff Detail

Event Timeline

foad created this revision.May 6 2022, 9:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 6 2022, 9:10 AM
foad requested review of this revision.May 6 2022, 9:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 6 2022, 9:10 AM
foad added a comment.May 6 2022, 9:15 AM

I'm posting this to get feedback on the general idea of allowing targets to tweak live range priority. I'm not sure if there is a precedent for that.

The background is that the AMDGPU target (especially for graphics workloads) tends to use a lot of wide register classes representing a tuple of 2, 3, 4 or 8 gprs. In some cases we get much better packing of the register allocation (i.e. it uses fewer registers overall) if we allocate the wide tuples first, regardless of how long their live ranges are. And for GPUs using fewer registers overall is important because it allows the hardware to launch more kernels simultaneously.

qcolombet accepted this revision.May 6 2022, 10:26 AM

Hi Jay,

I haven't looked too much at the details, but the general idea sounds fine to me.
Could you introduce a comment line option that would override this setting when set?

Just for testing/debugging purposes.

Cheers,
-Quentin

This revision is now accepted and ready to land.May 6 2022, 10:26 AM

Also could use a test where it makes a difference

bjope added a subscriber: bjope.May 7 2022, 10:47 PM
foad updated this revision to Diff 428083.May 9 2022, 7:44 AM

Add a command line option. Add a test case.

bjope added a comment.May 9 2022, 7:51 AM

Drive-by-nit: Just happened to notice that the commit msg says "RagAlloc...".

lkail added a subscriber: lkail.May 9 2022, 7:52 AM
foad retitled this revision from [RagAllocGreedy] New hook regClassPriorityTrumpsGlobalness to [RegAllocGreedy] New hook regClassPriorityTrumpsGlobalness.May 9 2022, 7:55 AM
bjope added inline comments.May 9 2022, 8:23 AM
llvm/lib/CodeGen/RegAllocGreedy.cpp
291

Just for info:

Downstream we are doing

if (<our downstream target>) {
   // Let AllocationPriority affect all ranges.
   const TargetRegisterClass &RC = *MRI->getRegClass(Reg);
   Size = Size * (RC.AllocationPriority + 1);
 }

here (I think Quentin have suggested something like that in the past).

Anyway, I tried replacing that old hack by this patch, but got some mixed results. Same thing if using both patches together.

Last time I tried to do something in this area I had a hard time finding some heuristic that gave generally better result without some occasional larger regression. Kind of annoying, since your patch also seem to indicate that there is a potential gain here also for our target in several benchmarks (but regressions by almost 20% in a couple of our benchmarks can't be ignored completely). That might of course be due to other shortcomings in our backend (such as the AllocationPriority setup etc). I guess I need to investigate that a bit closer before we consider to use this new option for our target.

foad added inline comments.May 9 2022, 8:28 AM
llvm/lib/CodeGen/RegAllocGreedy.cpp
291

Thanks for trying it. Does your downstream target also have a concept of occupancy, so that using fewer physical registers means that things run significantly faster on the hardware? I'm aware that the register allocator was developed for CPUs and CPUs generally do not have that property.

bjope added inline comments.May 9 2022, 9:03 AM
llvm/lib/CodeGen/RegAllocGreedy.cpp
291

No I don't think we have that (if I understand the question correctly).

But we have a limited set of registers (typically 16-32 regs). And similar to what you mentioned about AMDGPU we do have some wide register classes that consist of tuples and quads. Certain quads might be costly to spill/reload, and we do not want that to happen inside a loop for example. So generally I think we want globalness to trump the allocation prio, but sometimes it is bad to allocate a long range quad early since then we have to spill it (mostly guessing here).

uabelho added a subscriber: uabelho.May 9 2022, 9:57 PM
foad added inline comments.May 12 2022, 8:07 AM
llvm/lib/CodeGen/RegAllocGreedy.cpp
291

I have tried your Size = Size * (RC.AllocationPriority + 1); heuristic but it doesn't help in the cases I am interested in, because I really need a wide local range to have higher priority than a narrow global range.

Just an idea: we could split the actual calculation of the Prio metric out into a separate function like this: https://reviews.llvm.org/differential/diff/428948/
... and then make it a TargetRegisterInfo hook so that targets could tweak the priority however they like?

In the meantime I would like to proceed with the current patch.

bjope added inline comments.May 12 2022, 8:20 AM
llvm/include/llvm/CodeGen/TargetRegisterInfo.h
57

This comment should clarify that if using GreedyRegClassPriorityTrumpsGlobalness then the range is [0,31].

When looking into how AllocationPriority is used (by our downstream target vs in-tree targets) I noticed that at least PowerPC is using AllocationPriority>32 to set bit 29 in Prio. So they use that as a way to get a higher prio compared to "global and split ranges" based on the AllocationPriority. So, is setting AllocationPriority > 32 a hackier way to trump globalness already without this patch?

foad added inline comments.May 13 2022, 1:56 AM
llvm/include/llvm/CodeGen/TargetRegisterInfo.h
57

Yes, setting AllocationPriority > 32 is definitely a hackier way of doing this! I don't like it, because then globalness is ignored even for two live ranges with the same AllocationPriority.

I don't want to document that the range is smaller only if you're using GreedyRegClassPriorityTrumpsGlobalness. Because in both cases, you can use priorities >= 32 if you really want to, and it will clobber some other bit in the Prio value.

Do you know why PowerPC uses priorities >= 32? Was it done deliberately to clobber the global bit?

bjope added inline comments.May 13 2022, 2:17 AM
llvm/include/llvm/CodeGen/TargetRegisterInfo.h
57

Right, I also found it a bit ugly that those things overlap.

I don't know much about PowerPC. It looks like it is deliberate as code comments for example say this:

  // Give the VSRp registers a non-zero AllocationPriority. The value is less
  // than 32 as these registers should not always be allocated before global
  // ranges and the value should be less than the AllocationPriority - 32 for
  // the UACC registers. Even global VSRp registers should be allocated after
  // the UACC registers have been chosen.
  let AllocationPriority = 2;

...

  // Give the VSRp registers a non-zero AllocationPriority. The value is less
  // than 32 as these registers should not always be allocated before global
  // ranges and the value should be less than the AllocationPriority - 32 for
  // the UACC registers. Even global VSRp registers should be allocated after
  // the UACC registers have been chosen.
  let AllocationPriority = 2;

And here goes another by-the-way: utils/TableGen/CodeGenRegisters.cpp is verifying that the AllocationPriority values used is in the range [0, 63] so just modifying the comment here would make this comment unsynced with the tablegen implementation.

Maybe one could say something about values above 31 being special since they would overlap with some other Prio-bits (however, which bits that overlap depend on GreedyRegClassPriorityTrumpsGlobalness).

foad added a subscriber: stefanp.May 13 2022, 4:00 AM
foad added inline comments.
llvm/include/llvm/CodeGen/TargetRegisterInfo.h
57

Agreed, the PowerPC usage looks deliberate. It comes from D105854. @stefanp do you think PowerPC might be interested in using regClassPriorityTrumpsGlobalness instead of using AllocationPriority values >= 32?

foad updated this revision to Diff 429194.May 13 2022, 4:05 AM

Add a cautionary note about AllocationPriority.

foad updated this revision to Diff 429195.May 13 2022, 4:07 AM

Update the other copy of the same comment.

This revision was landed with ongoing or failed builds.May 17 2022, 4:42 AM
This revision was automatically updated to reflect the committed changes.
arsenm added inline comments.Jun 23 2022, 3:50 PM
llvm/lib/CodeGen/RegAllocGreedy.cpp
311–316

I wonder if instead of adding yet another control if the heuristic here just needs to be redone.

I think there are several issues with this heuristic. First, getNumAllocatableRegs should probably return a count for disjoint registers. This number is way too big with overlapping tuples in the same register class.

Second. the use of the interval size doesn't really work if any pass modified the live intervals. I've struggled to reduce many testcases where the scheduler triggering renumbering of SlotIndexes resulted in different regalloc behavior vs. if the SlotIndexes aren't preserved (i.e. you're just using -run-pass for the one allocator pass).

2710

Why a function level decision, and not a register class?

arsenm added inline comments.Jun 23 2022, 3:52 PM
llvm/include/llvm/CodeGen/TargetRegisterInfo.h
57

What if we went the opposite direction, and made something less terrible for the priority setting? I think a per-class priority makes more sense than the function option

foad added inline comments.Jun 24 2022, 4:05 AM
llvm/lib/CodeGen/RegAllocGreedy.cpp
311–316

getNumAllocatableRegs - sounds reasonable.

SlotIndexes - I've wondered before about forcibly renumbering them before regalloc runs to avoid this kind of problem.

2710

I'm not sure it makes sense to directly compare two different priorities if they might have been calculated with different settings of RegClassPriorityTrumpsGlobalness, since it is completely changing the calculation.

I was specifically interested in tuples like vreg_64 vs vreg_128, which are different classes but they overlap so you need to be able to compare their priorities.

arsenm added inline comments.Jul 21 2022, 8:22 AM
llvm/include/llvm/CodeGen/TargetRegisterInfo.h
1079–1085

What's the argument for making this configurable? No lit tests fail if I default this to true?

arsenm added inline comments.Jul 21 2022, 8:34 AM
llvm/include/llvm/CodeGen/TargetRegisterInfo.h
1079–1085

Basically I think the regclass priority was just a broken feature before, and this is a flag to enable/disable a bug fix

foad added inline comments.Jul 22 2022, 3:39 AM
llvm/include/llvm/CodeGen/TargetRegisterInfo.h
1079–1085

No lit tests fail if I default this to true?

Really? I get:

Failed Tests (43):
  LLVM :: CodeGen/AMDGPU/GlobalISel/insertelement.ll
  LLVM :: CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.fmas.ll
  LLVM :: CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
  LLVM :: CodeGen/AMDGPU/GlobalISel/localizer.ll
  LLVM :: CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll
  LLVM :: CodeGen/AMDGPU/GlobalISel/srem.i64.ll
  LLVM :: CodeGen/AMDGPU/agpr-copy-no-free-registers.ll
  LLVM :: CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
  LLVM :: CodeGen/AMDGPU/atomic_optimizations_buffer.ll
  LLVM :: CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
  LLVM :: CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
  LLVM :: CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll
  LLVM :: CodeGen/AMDGPU/atomic_optimizations_raw_buffer.ll
  LLVM :: CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll
  LLVM :: CodeGen/AMDGPU/collapse-endcf.ll
  LLVM :: CodeGen/AMDGPU/ctpop16.ll
  LLVM :: CodeGen/AMDGPU/dag-divergence-atomic.ll
  LLVM :: CodeGen/AMDGPU/divergent-branch-uniform-condition.ll
  LLVM :: CodeGen/AMDGPU/extend-phi-subrange-not-in-parent.mir
  LLVM :: CodeGen/AMDGPU/extract-subvector-16bit.ll
  LLVM :: CodeGen/AMDGPU/fix-frame-ptr-reg-copy-livein.ll
  LLVM :: CodeGen/AMDGPU/global-atomics-fp.ll
  LLVM :: CodeGen/AMDGPU/i1-copy-from-loop.ll
  LLVM :: CodeGen/AMDGPU/idiv-licm.ll
  LLVM :: CodeGen/AMDGPU/insert_vector_dynelt.ll
  LLVM :: CodeGen/AMDGPU/llvm.amdgcn.wqm.demote.ll
  LLVM :: CodeGen/AMDGPU/llvm.round.f64.ll
  LLVM :: CodeGen/AMDGPU/load-constant-i16.ll
  LLVM :: CodeGen/AMDGPU/loop_break.ll
  LLVM :: CodeGen/AMDGPU/mul24-pass-ordering.ll
  LLVM :: CodeGen/AMDGPU/no-dup-inst-prefetch.ll
  LLVM :: CodeGen/AMDGPU/sdiv64.ll
  LLVM :: CodeGen/AMDGPU/sgpr-control-flow.ll
  LLVM :: CodeGen/AMDGPU/si-annotate-cf-kill.ll
  LLVM :: CodeGen/AMDGPU/skip-if-dead.ll
  LLVM :: CodeGen/AMDGPU/spill-vgpr.ll
  LLVM :: CodeGen/AMDGPU/srem64.ll
  LLVM :: CodeGen/AMDGPU/udiv64.ll
  LLVM :: CodeGen/AMDGPU/urem64.ll
  LLVM :: CodeGen/AMDGPU/wqm.ll
  LLVM :: CodeGen/PowerPC/more-dq-form-prepare.ll
  LLVM :: CodeGen/PowerPC/ppc64-acc-regalloc.ll
  LLVM :: CodeGen/PowerPC/subreg-killed.mir

Basically I think the regclass priority was just a broken feature before, and this is a flag to enable/disable a bug fix

I don't know why you call it a bug. It was just a different heuristic. I'm pretty sure I could find cases that get better allocation with either setting of the flag if I went looking for them.

foad added inline comments.Jul 22 2022, 4:11 AM
llvm/include/llvm/CodeGen/TargetRegisterInfo.h
1079–1085

I'm pretty sure I could find cases that get better allocation with either setting of the flag if I went looking for them.

798fa7e9d6973c7ecb736eea41755ef86220cda1