Add a new TargetRegisterInfo hook to allow targets to tweak the
priority of live ranges, so that AllocationPriority of the register
class will be treated as more important than whether the range is local
to a basic block or global. This is determined per-MachineFunction.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I'm posting this to get feedback on the general idea of allowing targets to tweak live range priority. I'm not sure if there is a precedent for that.
The background is that the AMDGPU target (especially for graphics workloads) tends to use a lot of wide register classes representing a tuple of 2, 3, 4 or 8 gprs. In some cases we get much better packing of the register allocation (i.e. it uses fewer registers overall) if we allocate the wide tuples first, regardless of how long their live ranges are. And for GPUs using fewer registers overall is important because it allows the hardware to launch more kernels simultaneously.
Hi Jay,
I haven't looked too much at the details, but the general idea sounds fine to me.
Could you introduce a comment line option that would override this setting when set?
Just for testing/debugging purposes.
Cheers,
-Quentin
llvm/lib/CodeGen/RegAllocGreedy.cpp | ||
---|---|---|
291 | Just for info: Downstream we are doing if (<our downstream target>) { // Let AllocationPriority affect all ranges. const TargetRegisterClass &RC = *MRI->getRegClass(Reg); Size = Size * (RC.AllocationPriority + 1); } here (I think Quentin have suggested something like that in the past). Anyway, I tried replacing that old hack by this patch, but got some mixed results. Same thing if using both patches together. Last time I tried to do something in this area I had a hard time finding some heuristic that gave generally better result without some occasional larger regression. Kind of annoying, since your patch also seem to indicate that there is a potential gain here also for our target in several benchmarks (but regressions by almost 20% in a couple of our benchmarks can't be ignored completely). That might of course be due to other shortcomings in our backend (such as the AllocationPriority setup etc). I guess I need to investigate that a bit closer before we consider to use this new option for our target. |
llvm/lib/CodeGen/RegAllocGreedy.cpp | ||
---|---|---|
291 | Thanks for trying it. Does your downstream target also have a concept of occupancy, so that using fewer physical registers means that things run significantly faster on the hardware? I'm aware that the register allocator was developed for CPUs and CPUs generally do not have that property. |
llvm/lib/CodeGen/RegAllocGreedy.cpp | ||
---|---|---|
291 | No I don't think we have that (if I understand the question correctly). But we have a limited set of registers (typically 16-32 regs). And similar to what you mentioned about AMDGPU we do have some wide register classes that consist of tuples and quads. Certain quads might be costly to spill/reload, and we do not want that to happen inside a loop for example. So generally I think we want globalness to trump the allocation prio, but sometimes it is bad to allocate a long range quad early since then we have to spill it (mostly guessing here). |
llvm/lib/CodeGen/RegAllocGreedy.cpp | ||
---|---|---|
291 | I have tried your Size = Size * (RC.AllocationPriority + 1); heuristic but it doesn't help in the cases I am interested in, because I really need a wide local range to have higher priority than a narrow global range. Just an idea: we could split the actual calculation of the Prio metric out into a separate function like this: https://reviews.llvm.org/differential/diff/428948/ In the meantime I would like to proceed with the current patch. |
llvm/include/llvm/CodeGen/TargetRegisterInfo.h | ||
---|---|---|
57–59 | This comment should clarify that if using GreedyRegClassPriorityTrumpsGlobalness then the range is [0,31]. When looking into how AllocationPriority is used (by our downstream target vs in-tree targets) I noticed that at least PowerPC is using AllocationPriority>32 to set bit 29 in Prio. So they use that as a way to get a higher prio compared to "global and split ranges" based on the AllocationPriority. So, is setting AllocationPriority > 32 a hackier way to trump globalness already without this patch? |
llvm/include/llvm/CodeGen/TargetRegisterInfo.h | ||
---|---|---|
57–59 | Yes, setting AllocationPriority > 32 is definitely a hackier way of doing this! I don't like it, because then globalness is ignored even for two live ranges with the same AllocationPriority. I don't want to document that the range is smaller only if you're using GreedyRegClassPriorityTrumpsGlobalness. Because in both cases, you can use priorities >= 32 if you really want to, and it will clobber some other bit in the Prio value. Do you know why PowerPC uses priorities >= 32? Was it done deliberately to clobber the global bit? |
llvm/include/llvm/CodeGen/TargetRegisterInfo.h | ||
---|---|---|
57–59 | Right, I also found it a bit ugly that those things overlap. I don't know much about PowerPC. It looks like it is deliberate as code comments for example say this: // Give the VSRp registers a non-zero AllocationPriority. The value is less // than 32 as these registers should not always be allocated before global // ranges and the value should be less than the AllocationPriority - 32 for // the UACC registers. Even global VSRp registers should be allocated after // the UACC registers have been chosen. let AllocationPriority = 2; ... // Give the VSRp registers a non-zero AllocationPriority. The value is less // than 32 as these registers should not always be allocated before global // ranges and the value should be less than the AllocationPriority - 32 for // the UACC registers. Even global VSRp registers should be allocated after // the UACC registers have been chosen. let AllocationPriority = 2; And here goes another by-the-way: utils/TableGen/CodeGenRegisters.cpp is verifying that the AllocationPriority values used is in the range [0, 63] so just modifying the comment here would make this comment unsynced with the tablegen implementation. Maybe one could say something about values above 31 being special since they would overlap with some other Prio-bits (however, which bits that overlap depend on GreedyRegClassPriorityTrumpsGlobalness). |
llvm/lib/CodeGen/RegAllocGreedy.cpp | ||
---|---|---|
311–316 | I wonder if instead of adding yet another control if the heuristic here just needs to be redone. I think there are several issues with this heuristic. First, getNumAllocatableRegs should probably return a count for disjoint registers. This number is way too big with overlapping tuples in the same register class. Second. the use of the interval size doesn't really work if any pass modified the live intervals. I've struggled to reduce many testcases where the scheduler triggering renumbering of SlotIndexes resulted in different regalloc behavior vs. if the SlotIndexes aren't preserved (i.e. you're just using -run-pass for the one allocator pass). | |
2710 | Why a function level decision, and not a register class? |
llvm/include/llvm/CodeGen/TargetRegisterInfo.h | ||
---|---|---|
57–59 | What if we went the opposite direction, and made something less terrible for the priority setting? I think a per-class priority makes more sense than the function option |
llvm/lib/CodeGen/RegAllocGreedy.cpp | ||
---|---|---|
311–316 | getNumAllocatableRegs - sounds reasonable. SlotIndexes - I've wondered before about forcibly renumbering them before regalloc runs to avoid this kind of problem. | |
2710 | I'm not sure it makes sense to directly compare two different priorities if they might have been calculated with different settings of RegClassPriorityTrumpsGlobalness, since it is completely changing the calculation. I was specifically interested in tuples like vreg_64 vs vreg_128, which are different classes but they overlap so you need to be able to compare their priorities. |
llvm/include/llvm/CodeGen/TargetRegisterInfo.h | ||
---|---|---|
1081–1087 | What's the argument for making this configurable? No lit tests fail if I default this to true? |
llvm/include/llvm/CodeGen/TargetRegisterInfo.h | ||
---|---|---|
1081–1087 | Basically I think the regclass priority was just a broken feature before, and this is a flag to enable/disable a bug fix |
llvm/include/llvm/CodeGen/TargetRegisterInfo.h | ||
---|---|---|
1081–1087 |
Really? I get: Failed Tests (43): LLVM :: CodeGen/AMDGPU/GlobalISel/insertelement.ll LLVM :: CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.fmas.ll LLVM :: CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll LLVM :: CodeGen/AMDGPU/GlobalISel/localizer.ll LLVM :: CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll LLVM :: CodeGen/AMDGPU/GlobalISel/srem.i64.ll LLVM :: CodeGen/AMDGPU/agpr-copy-no-free-registers.ll LLVM :: CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll LLVM :: CodeGen/AMDGPU/atomic_optimizations_buffer.ll LLVM :: CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll LLVM :: CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll LLVM :: CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll LLVM :: CodeGen/AMDGPU/atomic_optimizations_raw_buffer.ll LLVM :: CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll LLVM :: CodeGen/AMDGPU/collapse-endcf.ll LLVM :: CodeGen/AMDGPU/ctpop16.ll LLVM :: CodeGen/AMDGPU/dag-divergence-atomic.ll LLVM :: CodeGen/AMDGPU/divergent-branch-uniform-condition.ll LLVM :: CodeGen/AMDGPU/extend-phi-subrange-not-in-parent.mir LLVM :: CodeGen/AMDGPU/extract-subvector-16bit.ll LLVM :: CodeGen/AMDGPU/fix-frame-ptr-reg-copy-livein.ll LLVM :: CodeGen/AMDGPU/global-atomics-fp.ll LLVM :: CodeGen/AMDGPU/i1-copy-from-loop.ll LLVM :: CodeGen/AMDGPU/idiv-licm.ll LLVM :: CodeGen/AMDGPU/insert_vector_dynelt.ll LLVM :: CodeGen/AMDGPU/llvm.amdgcn.wqm.demote.ll LLVM :: CodeGen/AMDGPU/llvm.round.f64.ll LLVM :: CodeGen/AMDGPU/load-constant-i16.ll LLVM :: CodeGen/AMDGPU/loop_break.ll LLVM :: CodeGen/AMDGPU/mul24-pass-ordering.ll LLVM :: CodeGen/AMDGPU/no-dup-inst-prefetch.ll LLVM :: CodeGen/AMDGPU/sdiv64.ll LLVM :: CodeGen/AMDGPU/sgpr-control-flow.ll LLVM :: CodeGen/AMDGPU/si-annotate-cf-kill.ll LLVM :: CodeGen/AMDGPU/skip-if-dead.ll LLVM :: CodeGen/AMDGPU/spill-vgpr.ll LLVM :: CodeGen/AMDGPU/srem64.ll LLVM :: CodeGen/AMDGPU/udiv64.ll LLVM :: CodeGen/AMDGPU/urem64.ll LLVM :: CodeGen/AMDGPU/wqm.ll LLVM :: CodeGen/PowerPC/more-dq-form-prepare.ll LLVM :: CodeGen/PowerPC/ppc64-acc-regalloc.ll LLVM :: CodeGen/PowerPC/subreg-killed.mir
I don't know why you call it a bug. It was just a different heuristic. I'm pretty sure I could find cases that get better allocation with either setting of the flag if I went looking for them. |
llvm/include/llvm/CodeGen/TargetRegisterInfo.h | ||
---|---|---|
1081–1087 |
|
This comment should clarify that if using GreedyRegClassPriorityTrumpsGlobalness then the range is [0,31].
When looking into how AllocationPriority is used (by our downstream target vs in-tree targets) I noticed that at least PowerPC is using AllocationPriority>32 to set bit 29 in Prio. So they use that as a way to get a higher prio compared to "global and split ranges" based on the AllocationPriority. So, is setting AllocationPriority > 32 a hackier way to trump globalness already without this patch?