This is an archive of the discontinued LLVM Phabricator instance.


Differential D125102

[RegAllocGreedy] New hook regClassPriorityTrumpsGlobalness
ClosedPublic

Authored by foad on May 6 2022, 9:10 AM.


Details

Reviewers

arsenm
qcolombet
qiucf

Commits

rG77480556c41f: [RegAllocGreedy] New hook regClassPriorityTrumpsGlobalness

Summary

Add a new TargetRegisterInfo hook to allow targets to tweak the
priority of live ranges, so that AllocationPriority of the register
class will be treated as more important than whether the range is local
to a basic block or global. This is determined per-MachineFunction.
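
As a rough illustration of how a target might opt in, here is a self-contained sketch. `MachineFunction` and `TargetRegisterInfoSketch` are empty stand-ins so that the snippet compiles without LLVM headers, and `WideTupleTargetRegisterInfo` is a hypothetical target. Only the hook name and its signature are taken from the patch.

```cpp
#include <cassert>

// Stand-in for llvm::MachineFunction (assumption: the real class is not
// needed to show the override pattern).
struct MachineFunction {};

// Stand-in for the base class. The hook mirrors the declaration added by the
// patch and defaults to the existing behaviour.
struct TargetRegisterInfoSketch {
  virtual ~TargetRegisterInfoSketch() = default;
  virtual bool regClassPriorityTrumpsGlobalness(const MachineFunction &) const {
    return false;
  }
};

// Hypothetical target that wants AllocationPriority to outrank globalness.
struct WideTupleTargetRegisterInfo : TargetRegisterInfoSketch {
  bool regClassPriorityTrumpsGlobalness(const MachineFunction &) const override {
    return true;
  }
};

// The decision is made per function, so an allocator would query it once.
inline void exampleUsage() {
  MachineFunction MF;
  TargetRegisterInfoSketch Default;
  WideTupleTargetRegisterInfo Target;
  assert(!Default.regClassPriorityTrumpsGlobalness(MF));
  assert(Target.regClassPriorityTrumpsGlobalness(MF));
}
```

A real target could inspect the function passed to the hook before answering; the sketch ignores it for brevity.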

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

Time	Test
60,030 ms	x64 debian > ThreadSanitizer-x86_64.ThreadSanitizer-x86_64::restore_stack.cpp
60,070 ms	x64 debian > libFuzzer.libFuzzer::large.test

Event Timeline

foad created this revision. May 6 2022, 9:10 AM

Herald added a project: Restricted Project. May 6 2022, 9:10 AM

Herald added subscribers: hiraditya, MatzeB.

foad requested review of this revision. May 6 2022, 9:10 AM

Herald added subscribers: llvm-commits, wdng.

I'm posting this to get feedback on the general idea of allowing targets to tweak live range priority. I'm not sure if there is a precedent for that.

The background is that the AMDGPU target (especially for graphics workloads) tends to use a lot of wide register classes representing a tuple of 2, 3, 4 or 8 gprs. In some cases we get much better packing of the register allocation (i.e. it uses fewer registers overall) if we allocate the wide tuples first, regardless of how long their live ranges are. And for GPUs using fewer registers overall is important because it allows the hardware to launch more kernels simultaneously.
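
To make the packing argument concrete, here is a toy model. It is not LLVM's allocator: it does a plain first-fit assignment of contiguous register tuples with no eviction, splitting or spilling, and every name in it (`ToyRange`, `toyFirstFit`) is invented for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A toy live range: Width consecutive registers, live over [Start, End).
struct ToyRange {
  unsigned Width;
  unsigned Start;
  unsigned End;
};

// Two half-open intervals interfere if they share at least one point.
inline bool overlaps(const ToyRange &A, const ToyRange &B) {
  return A.Start < B.End && B.Start < A.End;
}

// Assigns ranges in the given order, each to the lowest-numbered run of free
// registers, and returns how many registers were touched in total.
inline unsigned toyFirstFit(const std::vector<ToyRange> &Order) {
  std::vector<std::vector<ToyRange>> Occupants; // Occupants[R]: ranges in reg R
  unsigned Used = 0;
  for (const ToyRange &R : Order) {
    assert(R.Width > 0 && R.Start < R.End && "malformed toy range");
    for (unsigned Base = 0;; ++Base) {
      if (Occupants.size() < Base + R.Width)
        Occupants.resize(Base + R.Width);
      bool Free = true;
      for (unsigned I = Base; Free && I < Base + R.Width; ++I)
        for (const ToyRange &Other : Occupants[I])
          if (overlaps(R, Other)) {
            Free = false;
            break;
          }
      if (!Free)
        continue; // try the next base register
      for (unsigned I = Base; I < Base + R.Width; ++I)
        Occupants[I].push_back(R);
      Used = std::max(Used, Base + R.Width);
      break;
    }
  }
  return Used;
}
```

With two narrow ranges `B = {1, 0, 6}` and `A = {1, 4, 10}` and one wide range `W = {2, 7, 10}`, the order B, A, W touches four registers in this model, while W, B, A touches three. The numbers are made up; they only show that placing the tuple first can avoid a hole that is too small for it.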

Harbormaster completed remote builds in B163165: Diff 427659.May 6 2022, 10:12 AM

Hi Jay,

I haven't looked too much at the details, but the general idea sounds fine to me.
Could you introduce a command line option that would override this setting when set?

Just for testing/debugging purposes.

Cheers,
-Quentin

This revision is now accepted and ready to land.May 6 2022, 10:26 AM

Also could use a test where it makes a difference

bjope added a subscriber: bjope.May 7 2022, 10:47 PM

Add a command line option. Add a test case.
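
One plausible way to combine the option with the hook is sketched below. The excerpt of the diff shown on this page does not include that logic, so this is an assumption: the `std::optional<bool>` stands in for "was the flag given on the command line, and with what value", and `resolveTrumpsGlobalness` is a hypothetical helper name.

```cpp
#include <cassert>
#include <optional>

// An empty optional models a flag that was not passed at all; in that case
// the target hook decides. An explicit flag value always wins.
inline bool resolveTrumpsGlobalness(std::optional<bool> FlagFromCommandLine,
                                    bool TargetHookResult) {
  return FlagFromCommandLine.value_or(TargetHookResult);
}

inline void exampleUsage() {
  assert(resolveTrumpsGlobalness(std::nullopt, true)); // flag absent: target decides
  assert(!resolveTrumpsGlobalness(false, true));       // flag forces the old ordering
  assert(resolveTrumpsGlobalness(true, false));        // flag forces the new ordering
}
```

Resolving the value once per function matches the "determined once per machine function" comment on the new member in RegAllocGreedy.h.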

Herald added subscribers: kerbowa, jvesely. May 9 2022, 7:44 AM

Drive-by-nit: Just happened to notice that the commit msg says "RagAlloc...".

lkail added a subscriber: lkail.May 9 2022, 7:52 AM

foad retitled this revision from [RagAllocGreedy] New hook regClassPriorityTrumpsGlobalness to [RegAllocGreedy] New hook regClassPriorityTrumpsGlobalness.May 9 2022, 7:55 AM

bjope added inline comments.May 9 2022, 8:23 AM

llvm/lib/CodeGen/RegAllocGreedy.cpp
291	Just for info: Downstream we are doing

  if (<our downstream target>) {
    // Let AllocationPriority affect all ranges.
    const TargetRegisterClass &RC = *MRI->getRegClass(Reg);
    Size = Size * (RC.AllocationPriority + 1);
  }

here (I think Quentin has suggested something like that in the past). Anyway, I tried replacing that old hack with this patch, but got mixed results. Same thing when using both patches together. Last time I tried to do something in this area I had a hard time finding a heuristic that gave generally better results without an occasional larger regression. Kind of annoying, since your patch also seems to indicate that there is a potential gain here for our target in several benchmarks (but regressions of almost 20% in a couple of our benchmarks can't be ignored completely). That might of course be due to other shortcomings in our backend (such as the AllocationPriority setup etc). I guess I need to investigate that a bit more closely before we consider using this new option for our target.

foad added inline comments.May 9 2022, 8:28 AM

llvm/lib/CodeGen/RegAllocGreedy.cpp
291	Thanks for trying it. Does your downstream target also have a concept of occupancy, so that using fewer physical registers means that things run significantly faster on the hardware? I'm aware that the register allocator was developed for CPUs and CPUs generally do not have that property.

Harbormaster completed remote builds in B163484: Diff 428083.May 9 2022, 8:53 AM

bjope added inline comments.May 9 2022, 9:03 AM

llvm/lib/CodeGen/RegAllocGreedy.cpp
291	No, I don't think we have that (if I understand the question correctly). But we have a limited set of registers (typically 16-32 regs). And similar to what you mentioned about AMDGPU, we do have some wide register classes that consist of tuples and quads. Certain quads might be costly to spill/reload, and we do not want that to happen inside a loop, for example. So generally I think we want globalness to trump the allocation prio, but sometimes it is bad to allocate a long-range quad early since then we have to spill it (mostly guessing here).

uabelho added a subscriber: uabelho.May 9 2022, 9:57 PM

foad added inline comments.May 12 2022, 8:07 AM

llvm/lib/CodeGen/RegAllocGreedy.cpp
291	I have tried your `Size = Size * (RC.AllocationPriority + 1);` heuristic but it doesn't help in the cases I am interested in, because I really need a wide local range to have higher priority than a narrow global range. Just an idea: we could split the actual calculation of the Prio metric out into a separate function like this: https://reviews.llvm.org/differential/diff/428948/ ... and then make it a TargetRegisterInfo hook so that targets could tweak the priority however they like? In the meantime I would like to proceed with the current patch.
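
The difference between the two approaches can be sketched with a simplified priority word. The default layout below (class priority from bit 24 upward, global flag at bit 29) follows the bit positions mentioned later in this review for PowerPC. The positions used for the hook-enabled layout (global flag at bit 24, class priority from bit 25) are illustrative guesses, `Size` stands in for whatever value fills the low bits, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the size-like value fits below bit 24.
constexpr uint32_t SizeLimit = 1u << 24;

// Default ordering: the global flag sits above the class priority.
inline uint32_t prioGlobalnessFirst(uint32_t Size, uint32_t AllocPrio,
                                    bool IsGlobal) {
  assert(Size < SizeLimit && AllocPrio < 32);
  return Size | (AllocPrio << 24) | (uint32_t(IsGlobal) << 29);
}

// Hook-enabled ordering (assumed bit positions): class priority on top.
inline uint32_t prioClassFirst(uint32_t Size, uint32_t AllocPrio,
                               bool IsGlobal) {
  assert(Size < SizeLimit && AllocPrio < 32);
  return Size | (uint32_t(IsGlobal) << 24) | (AllocPrio << 25);
}

// The downstream heuristic: scale the size term, keep the default layout.
inline uint32_t prioScaledSize(uint32_t Size, uint32_t AllocPrio,
                               bool IsGlobal) {
  return prioGlobalnessFirst(Size * (AllocPrio + 1), AllocPrio, IsGlobal);
}
```

Take a wide local range (size 100, class priority 3) and a narrow global range (size 50, class priority 0). Under `prioGlobalnessFirst` and `prioScaledSize` the narrow global range still gets the larger value, because scaling only changes the low bits. Under `prioClassFirst` the wide local range wins, which is the ordering asked for above.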

Herald added a subscriber: kosarev. May 12 2022, 8:07 AM

bjope added inline comments.May 12 2022, 8:20 AM

llvm/include/llvm/CodeGen/TargetRegisterInfo.h
57	This comment should clarify that if using GreedyRegClassPriorityTrumpsGlobalness then the range is [0,31]. When looking into how AllocationPriority is used (by our downstream target vs in-tree targets) I noticed that at least PowerPC is using AllocationPriority>32 to set bit 29 in Prio. So they use that as a way to get a higher prio compared to "global and split ranges" based on the AllocationPriority. So, is setting AllocationPriority > 32 a hackier way to trump globalness already without this patch?

foad added inline comments.May 13 2022, 1:56 AM

llvm/include/llvm/CodeGen/TargetRegisterInfo.h
57	Yes, setting AllocationPriority > 32 is definitely a hackier way of doing this! I don't like it, because then globalness is ignored even for two live ranges with the same AllocationPriority. I don't want to document that the range is smaller only if you're using GreedyRegClassPriorityTrumpsGlobalness. Because in both cases, you can use priorities >= 32 if you really want to, and it will clobber some other bit in the Prio value. Do you know why PowerPC uses priorities >= 32? Was it done deliberately to clobber the global bit?
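
The overlap being discussed can be checked with a few lines of arithmetic. The sketch assumes the class priority is shifted left by 24 and the global flag is bit 29, which is what the remark about PowerPC setting bit 29 implies; the helper names are made up.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned AllocPrioShift = 24;  // assumed shift for AllocationPriority
constexpr uint32_t GlobalBit = 1u << 29; // assumed position of the global flag

// Bits contributed by the register class priority.
inline uint32_t classBits(uint32_t AllocPrio) {
  assert(AllocPrio < 64 && "TableGen limits AllocationPriority to [0,63]");
  return AllocPrio << AllocPrioShift;
}

// True if the class priority alone already sets the global flag's bit.
inline bool clobbersGlobalBit(uint32_t AllocPrio) {
  return (classBits(AllocPrio) & GlobalBit) != 0;
}

// True if OR-ing in the global flag still changes the priority word, i.e.
// local and global ranges of this class remain distinguishable.
inline bool globalnessVisible(uint32_t AllocPrio) {
  return (classBits(AllocPrio) | GlobalBit) != classBits(AllocPrio);
}
```

Under these assumptions `classBits(32)` equals `GlobalBit`, so the overlap starts at 32 rather than above it, and for any priority in [32,63] the global flag no longer distinguishes two ranges of the same class.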

bjope added inline comments.May 13 2022, 2:17 AM

llvm/include/llvm/CodeGen/TargetRegisterInfo.h

Right, I also found it a bit ugly that those things overlap.

I don't know much about PowerPC. It looks like it is deliberate as code comments for example say this:

  // Give the VSRp registers a non-zero AllocationPriority. The value is less
  // than 32 as these registers should not always be allocated before global
  // ranges and the value should be less than the AllocationPriority - 32 for
  // the UACC registers. Even global VSRp registers should be allocated after
  // the UACC registers have been chosen.
  let AllocationPriority = 2;

...

  // Give the VSRp registers a non-zero AllocationPriority. The value is less
  // than 32 as these registers should not always be allocated before global
  // ranges and the value should be less than the AllocationPriority - 32 for
  // the UACC registers. Even global VSRp registers should be allocated after
  // the UACC registers have been chosen.
  let AllocationPriority = 2;

And here goes another by-the-way: utils/TableGen/CodeGenRegisters.cpp verifies that the AllocationPriority values used are in the range [0, 63], so just modifying the comment here would leave it out of sync with the tablegen implementation.

Maybe one could say something about values above 31 being special since they would overlap with some other Prio-bits (however, which bits that overlap depend on GreedyRegClassPriorityTrumpsGlobalness).

foad added a subscriber: stefanp.May 13 2022, 4:00 AM

foad added inline comments.

llvm/include/llvm/CodeGen/TargetRegisterInfo.h
57	Agreed, the PowerPC usage looks deliberate. It comes from D105854. @stefanp do you think PowerPC might be interested in using regClassPriorityTrumpsGlobalness instead of using AllocationPriority values >= 32?

Add a cautionary note about AllocationPriority.

Update the other copy of the same comment.

Harbormaster completed remote builds in B164283: Diff 429195.May 13 2022, 5:06 AM

This revision was landed with ongoing or failed builds.May 17 2022, 4:42 AM

Closed by commit rG77480556c41f: [RegAllocGreedy] New hook regClassPriorityTrumpsGlobalness (authored by foad).

This revision was automatically updated to reflect the committed changes.

foad added a commit: rG77480556c41f: [RegAllocGreedy] New hook regClassPriorityTrumpsGlobalness.

arsenm added inline comments.Jun 23 2022, 3:50 PM

llvm/lib/CodeGen/RegAllocGreedy.cpp
311–316	I wonder whether, instead of adding yet another control, the heuristic here just needs to be redone. I think there are several issues with this heuristic. First, getNumAllocatableRegs should probably return a count for disjoint registers. This number is way too big with overlapping tuples in the same register class. Second, the use of the interval size doesn't really work if any pass modified the live intervals. I've struggled to reduce many testcases where the scheduler triggering renumbering of SlotIndexes resulted in different regalloc behavior vs. if the SlotIndexes aren't preserved (i.e. you're just using -run-pass for the one allocator pass).
2710	Why a function level decision, and not a register class?
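
The first point can be illustrated with the ForceGlobal comparison from the diff. In the sketch below the register file size, tuple width, range length and the instruction distance of 16 are illustrative numbers, and the two counting helpers are hypothetical; they only contrast counting every overlapping tuple with counting disjoint ones.

```cpp
#include <cassert>

// Same shape as the ForceGlobal test in RAGreedy::enqueue, minus the
// ReverseLocal term.
inline bool forceGlobal(unsigned Size, unsigned InstrDist,
                        unsigned NumAllocatableRegs) {
  assert(InstrDist > 0);
  return (Size / InstrDist) > (2 * NumAllocatableRegs);
}

// Tuples of Width consecutive registers that may start at any register.
inline unsigned countOverlappingTuples(unsigned NumRegs, unsigned Width) {
  assert(Width > 0 && Width <= NumRegs);
  return NumRegs - Width + 1;
}

// Tuples that can be live at the same time without sharing a register.
inline unsigned countDisjointTuples(unsigned NumRegs, unsigned Width) {
  assert(Width > 0 && Width <= NumRegs);
  return NumRegs / Width;
}
```

For 256 registers and 4-wide tuples the overlapping count is 253 and the disjoint count is 64. A range covering 200 instructions (size 3200 at distance 16) is not forced global against the first threshold (200 > 506 is false) but would be against the second (200 > 128).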

arsenm added inline comments.Jun 23 2022, 3:52 PM

llvm/include/llvm/CodeGen/TargetRegisterInfo.h
57	What if we went the opposite direction, and made something less terrible for the priority setting? I think a per-class priority makes more sense than the function option

foad added inline comments.Jun 24 2022, 4:05 AM

llvm/lib/CodeGen/RegAllocGreedy.cpp
311–316	getNumAllocatableRegs - sounds reasonable. SlotIndexes - I've wondered before about forcibly renumbering them before regalloc runs to avoid this kind of problem.
2710	I'm not sure it makes sense to directly compare two different priorities if they might have been calculated with different settings of RegClassPriorityTrumpsGlobalness, since it is completely changing the calculation. I was specifically interested in tuples like vreg_64 vs vreg_128, which are different classes but they overlap so you need to be able to compare their priorities.

arsenm added inline comments.Jul 21 2022, 8:22 AM

llvm/include/llvm/CodeGen/TargetRegisterInfo.h
1079–1085	What's the argument for making this configurable? No lit tests fail if I default this to true?

arsenm added inline comments.Jul 21 2022, 8:34 AM

llvm/include/llvm/CodeGen/TargetRegisterInfo.h
1079–1085	Basically I think the regclass priority was just a broken feature before, and this is a flag to enable/disable a bug fix

foad added inline comments.Jul 22 2022, 3:39 AM

llvm/include/llvm/CodeGen/TargetRegisterInfo.h

1079–1085	> No lit tests fail if I default this to true?

Really? I get:

Failed Tests (43):
  LLVM :: CodeGen/AMDGPU/GlobalISel/insertelement.ll
  LLVM :: CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.fmas.ll
  LLVM :: CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
  LLVM :: CodeGen/AMDGPU/GlobalISel/localizer.ll
  LLVM :: CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll
  LLVM :: CodeGen/AMDGPU/GlobalISel/srem.i64.ll
  LLVM :: CodeGen/AMDGPU/agpr-copy-no-free-registers.ll
  LLVM :: CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
  LLVM :: CodeGen/AMDGPU/atomic_optimizations_buffer.ll
  LLVM :: CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
  LLVM :: CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
  LLVM :: CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll
  LLVM :: CodeGen/AMDGPU/atomic_optimizations_raw_buffer.ll
  LLVM :: CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll
  LLVM :: CodeGen/AMDGPU/collapse-endcf.ll
  LLVM :: CodeGen/AMDGPU/ctpop16.ll
  LLVM :: CodeGen/AMDGPU/dag-divergence-atomic.ll
  LLVM :: CodeGen/AMDGPU/divergent-branch-uniform-condition.ll
  LLVM :: CodeGen/AMDGPU/extend-phi-subrange-not-in-parent.mir
  LLVM :: CodeGen/AMDGPU/extract-subvector-16bit.ll
  LLVM :: CodeGen/AMDGPU/fix-frame-ptr-reg-copy-livein.ll
  LLVM :: CodeGen/AMDGPU/global-atomics-fp.ll
  LLVM :: CodeGen/AMDGPU/i1-copy-from-loop.ll
  LLVM :: CodeGen/AMDGPU/idiv-licm.ll
  LLVM :: CodeGen/AMDGPU/insert_vector_dynelt.ll
  LLVM :: CodeGen/AMDGPU/llvm.amdgcn.wqm.demote.ll
  LLVM :: CodeGen/AMDGPU/llvm.round.f64.ll
  LLVM :: CodeGen/AMDGPU/load-constant-i16.ll
  LLVM :: CodeGen/AMDGPU/loop_break.ll
  LLVM :: CodeGen/AMDGPU/mul24-pass-ordering.ll
  LLVM :: CodeGen/AMDGPU/no-dup-inst-prefetch.ll
  LLVM :: CodeGen/AMDGPU/sdiv64.ll
  LLVM :: CodeGen/AMDGPU/sgpr-control-flow.ll
  LLVM :: CodeGen/AMDGPU/si-annotate-cf-kill.ll
  LLVM :: CodeGen/AMDGPU/skip-if-dead.ll
  LLVM :: CodeGen/AMDGPU/spill-vgpr.ll
  LLVM :: CodeGen/AMDGPU/srem64.ll
  LLVM :: CodeGen/AMDGPU/udiv64.ll
  LLVM :: CodeGen/AMDGPU/urem64.ll
  LLVM :: CodeGen/AMDGPU/wqm.ll
  LLVM :: CodeGen/PowerPC/more-dq-form-prepare.ll
  LLVM :: CodeGen/PowerPC/ppc64-acc-regalloc.ll
  LLVM :: CodeGen/PowerPC/subreg-killed.mir

> Basically I think the regclass priority was just a broken feature before, and this is a flag to enable/disable a bug fix

I don't know why you call it a bug. It was just a different heuristic. I'm pretty sure I could find cases that get better allocation with either setting of the flag if I went looking for them.

foad added inline comments.Jul 22 2022, 4:11 AM

llvm/include/llvm/CodeGen/TargetRegisterInfo.h
1079–1085	> I'm pretty sure I could find cases that get better allocation with either setting of the flag if I went looking for them.

798fa7e9d6973c7ecb736eea41755ef86220cda1

Revision Contents

Path	Size
llvm/include/llvm/CodeGen/TargetRegisterInfo.h	8 lines
llvm/lib/CodeGen/RegAllocGreedy.h	4 lines
llvm/lib/CodeGen/RegAllocGreedy.cpp	20 lines
llvm/test/CodeGen/AMDGPU/greedy-liverange-priority.mir	48 lines

Diff 428083

llvm/include/llvm/CodeGen/TargetRegisterInfo.h

... public:
  using sc_iterator = const TargetRegisterClass* const *;

  // Instance variables filled by tablegen, do not use!
  const MCRegisterClass *MC;
  const uint32_t *SubClassMask;
  const uint16_t *SuperRegIndices;
  const LaneBitmask LaneMask;
  /// Classes with a higher priority value are assigned first by register
  /// allocators using a greedy heuristic. The value is in the range [0,63].
  const uint8_t AllocationPriority;
  /// Configurable target specific flags.
  const uint8_t TSFlags;
  /// Whether the class supports two (or more) disjunct subregister indices.
  const bool HasDisjunctSubRegs;
  /// Whether a combination of subregisters can cover every register in the
  /// class. See also the CoveredBySubRegs description in Target.td.
  const bool CoveredBySubRegs;
... public:
  /// This method is used to decide whether \p VirtReg should use the deferred
  /// spilling stage instead of being spilled right away.
  virtual bool
  shouldUseDeferredSpillingForVirtReg(const MachineFunction &MF,
                                      const LiveInterval &VirtReg) const {
    return false;
  }

+  /// When prioritizing live ranges in register allocation, if this hook returns
+  /// true then the AllocationPriority of the register class will be treated as
+  /// more important than whether the range is local to a basic block or global.
+  virtual bool
+  regClassPriorityTrumpsGlobalness(const MachineFunction &MF) const {
+    return false;
+  }

  //===--------------------------------------------------------------------===//
  /// Debug information queries.

  /// getFrameRegister - This method should return the register used as a base
  /// for values allocated in the current stack frame.
  virtual Register getFrameRegister(const MachineFunction &MF) const = 0;

  /// Mark a register and all its aliases as reserved in the given set.
...

llvm/lib/CodeGen/RegAllocGreedy.h

... #endif

  /// Set of broken hints that may be reconciled later because of eviction.
  SmallSetVector<const LiveInterval *, 8> SetOfBrokenHints;

  /// The register cost values. This list will be recreated for each Machine
  /// Function
  ArrayRef<uint8_t> RegCosts;

+  /// Flags for the live range priority calculation, determined once per
+  /// machine function.
+  bool RegClassPriorityTrumpsGlobalness;

public:
  RAGreedy(const RegClassFilterFunc F = allocateAllRegClasses);

  /// Return the pass name.
  StringRef getPassName() const override { return "Greedy Register Allocator"; }

  /// RAGreedy analysis usage.
  void getAnalysisUsage(AnalysisUsage &AU) const override;
...

llvm/lib/CodeGen/RegAllocGreedy.cpp

... CSRFirstTimeCost("regalloc-csr-first-time-cost",
    cl::init(0), cl::Hidden);

static cl::opt<unsigned long> GrowRegionComplexityBudget(
    "grow-region-complexity-budget",
    cl::desc("growRegion() does not scale with the number of BB edges, so "
             "limit its budget and bail out once we reach the limit."),
    cl::init(10000), cl::Hidden);

+static cl::opt<bool> GreedyRegClassPriorityTrumpsGlobalness(
+    "greedy-regclass-priority-trumps-globalness",
+    cl::desc("Change the greedy register allocator's live range priority "
+             "calculation to make the AllocationPriority of the register class "
+             "more important then whether the range is global"),
+    cl::Hidden);

static RegisterRegAlloc greedyRegAlloc("greedy", "greedy register allocator",
                                       createGreedyRegisterAllocator);

char RAGreedy::ID = 0;
char &llvm::RAGreedyID = RAGreedy::ID;

INITIALIZE_PASS_BEGIN(RAGreedy, "greedy",
                      "Greedy Register Allocator", false, false)
...

void RAGreedy::enqueue(PQueue &CurQueue, const LiveInterval *LI) {
  // Prioritize live ranges by size, assigning larger ranges first.
  // The queue holds (size, reg) pairs.
  const unsigned Size = LI->getSize();
  const Register Reg = LI->reg();
  assert(Reg.isVirtual() && "Can only enqueue virtual registers");
  unsigned Prio;

  auto Stage = ExtraInfo->getOrInitStage(Reg);
  if (Stage == RS_New) {
    Stage = RS_Assign;
    ExtraInfo->setStage(Reg, Stage);
  }
  if (Stage == RS_Split) {
    // Unsplit ranges that couldn't be allocated immediately are deferred until
    // everything else has been allocated.
    Prio = Size;
  } else if (Stage == RS_Memory) {
    // Memory operand should be considered last.
    // Change the priority such that Memory operand are assigned in
    // the reverse order that they came in.
    // TODO: Make this a member variable and probably do something about hints.
    static unsigned MemOp = 0;
    Prio = MemOp++;
  } else {
    // Giant live ranges fall back to the global assignment heuristic, which
    // prevents excessive spilling in pathological cases.
    bool ReverseLocal = TRI->reverseLocalAssignment();
    const TargetRegisterClass &RC = *MRI->getRegClass(Reg);
    bool ForceGlobal = !ReverseLocal &&
      (Size / SlotIndex::InstrDist) > (2 * RCI.getNumAllocatableRegs(&RC));
+    unsigned GlobalBit = 0;

arsenm: I wonder if, instead of adding yet another control, the heuristic here just needs to be redone. I think there are several issues with this heuristic. First, getNumAllocatableRegs should probably return a count of disjoint registers. This number is way too big with overlapping tuples in the same register class. Second, the use of the interval size doesn't really work if any pass modified the live intervals. I've struggled to reduce many testcases where the scheduler triggering renumbering of SlotIndexes resulted in different regalloc behavior vs. if the SlotIndexes aren't preserved (i.e. you're just using -run-pass for the one allocator pass).

foad: getNumAllocatableRegs - sounds reasonable. SlotIndexes - I've wondered before about forcibly renumbering them before regalloc runs to avoid this kind of problem.
if (Stage == RS_Assign && !ForceGlobal && !LI->empty() &&		if (Stage == RS_Assign && !ForceGlobal && !LI->empty() &&
LIS->intervalIsInOneMBB(*LI)) {		LIS->intervalIsInOneMBB(*LI)) {
// Allocate original local ranges in linear instruction order. Since they		// Allocate original local ranges in linear instruction order. Since they
// are singly defined, this produces optimal coloring in the absence of		// are singly defined, this produces optimal coloring in the absence of
// global interference and other constraints.		// global interference and other constraints.
if (!ReverseLocal)		if (!ReverseLocal)
Prio = LI->beginIndex().getInstrDistance(Indexes->getLastIndex());		Prio = LI->beginIndex().getInstrDistance(Indexes->getLastIndex());
else {		else {
// Allocating bottom up may allow many short LRGs to be assigned first		// Allocating bottom up may allow many short LRGs to be assigned first
// to one of the cheap registers. This could be much faster for very		// to one of the cheap registers. This could be much faster for very
// large blocks on targets with many physical registers.		// large blocks on targets with many physical registers.
Prio = Indexes->getZeroIndex().getInstrDistance(LI->endIndex());		Prio = Indexes->getZeroIndex().getInstrDistance(LI->endIndex());
}		}
} else {		} else {
// Allocate global and split ranges in long->short order. Long ranges that		// Allocate global and split ranges in long->short order. Long ranges that
// don't fit should be spilled (or split) ASAP so they don't create		// don't fit should be spilled (or split) ASAP so they don't create
// interference. Mark a bit to prioritize global above local ranges.		// interference. Mark a bit to prioritize global above local ranges.
Prio = (1u << 29) + Size;		Prio = Size;
		GlobalBit = 1;
}		}
Prio |= RC.AllocationPriority << 24;		if (RegClassPriorityTrumpsGlobalness)
		Prio |= RC.AllocationPriority << 25 | GlobalBit << 24;
		else
		Prio |= GlobalBit << 29 | RC.AllocationPriority << 24;

// Mark a higher bit to prioritize global and local above RS_Split.		// Mark a higher bit to prioritize global and local above RS_Split.
Prio |= (1u << 31);		Prio |= (1u << 31);

// Boost ranges that have a physical register hint.		// Boost ranges that have a physical register hint.
if (VRM->hasKnownPreference(Reg))		if (VRM->hasKnownPreference(Reg))
Prio |= (1u << 30);		Prio |= (1u << 30);
}		}
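The effect of the two bit layouts in this hunk can be seen in a small standalone sketch. This is not the patch itself: `RangeInfo` and `computePrio` are hypothetical names, the size term is assumed to fit in 24 bits, the class priority is assumed to be below 32 so that the fields do not overlap, and the hint bit (bit 30) is left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the inputs the hunk reads from a live range.
struct RangeInfo {
  uint32_t Size;          // size or instruction-distance term
  uint32_t AllocPriority; // plays the role of RC.AllocationPriority
  bool IsGlobal;          // true if the "global and split ranges" branch runs
};

// Mirrors the two layouts from the diff for a range in the default branch.
inline uint32_t computePrio(const RangeInfo &R, bool PriorityTrumpsGlobalness) {
  // Assumptions of this sketch, so that the bit fields stay disjoint.
  assert(R.Size < (1u << 24) && R.AllocPriority < 32u);
  uint32_t GlobalBit = R.IsGlobal ? 1u : 0u;
  uint32_t Prio = R.Size;
  if (PriorityTrumpsGlobalness)
    Prio |= (R.AllocPriority << 25) | (GlobalBit << 24); // class, then global
  else
    Prio |= (GlobalBit << 29) | (R.AllocPriority << 24); // global, then class
  Prio |= (1u << 31); // keep these ranges above RS_Split, as in the diff
  return Prio;
}
```

With a wide local range `{10, 3, false}` and a narrow global range `{100, 0, true}`, the old layout ranks the global range higher, while the new layout ranks the wide local range higher. Two ranges of the same class still order global above local in both layouts.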
bool RAGreedy::runOnMachineFunction(MachineFunction &mf) {
Bundles = &getAnalysis<EdgeBundles>();		Bundles = &getAnalysis<EdgeBundles>();
SpillPlacer = &getAnalysis<SpillPlacement>();		SpillPlacer = &getAnalysis<SpillPlacement>();
DebugVars = &getAnalysis<LiveDebugVariables>();		DebugVars = &getAnalysis<LiveDebugVariables>();
AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();		AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();

initializeCSRCost();		initializeCSRCost();

RegCosts = TRI->getRegisterCosts(*MF);		RegCosts = TRI->getRegisterCosts(*MF);
		RegClassPriorityTrumpsGlobalness =
		GreedyRegClassPriorityTrumpsGlobalness.getNumOccurrences()
		? GreedyRegClassPriorityTrumpsGlobalness
		: TRI->regClassPriorityTrumpsGlobalness(*MF);
arsenm: Why a function level decision, and not a register class?

foad: I'm not sure it makes sense to directly compare two different priorities if they might have been calculated with different settings of RegClassPriorityTrumpsGlobalness, since it is completely changing the calculation. I was specifically interested in tuples like vreg_64 vs vreg_128, which are different classes but they overlap so you need to be able to compare their priorities.

ExtraInfo.emplace();		ExtraInfo.emplace();
EvictAdvisor =		EvictAdvisor =
getAnalysis<RegAllocEvictionAdvisorAnalysis>().getAdvisor(MF, this);		getAnalysis<RegAllocEvictionAdvisorAnalysis>().getAdvisor(MF, this);

VRAI = std::make_unique<VirtRegAuxInfo>(MF, LIS, VRM, Loops, *MBFI);		VRAI = std::make_unique<VirtRegAuxInfo>(MF, LIS, VRM, Loops, *MBFI);
SpillerInstance.reset(createInlineSpiller(this, MF, VRM, VRAI));		SpillerInstance.reset(createInlineSpiller(this, MF, VRM, VRAI));

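The per-function selection in runOnMachineFunction above follows a common pattern: an explicit command-line flag wins, otherwise the target hook decides. A minimal sketch of that pattern, with a hypothetical `FlagSketch` struct standing in for the real `cl::opt` (only the occurrence count and the value are modelled):

```cpp
// Hypothetical stand-in for a boolean command-line option: how many times it
// appeared on the command line, and the value it was given.
struct FlagSketch {
  unsigned NumOccurrences;
  bool Value;
};

// Resolve the per-function setting: the flag overrides the target default
// only if it was actually passed.
inline bool resolveTrumpsGlobalness(const FlagSketch &Flag, bool TargetDefault) {
  return Flag.NumOccurrences ? Flag.Value : TargetDefault;
}
```

Because the result is computed once per function, every priority compared within that function uses the same layout, which is the property foad's reply above relies on.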

llvm/test/CodeGen/AMDGPU/greedy-liverange-priority.mir

This file was added.

				# RUN: llc -march=amdgcn -mcpu=gfx1030 -greedy-regclass-priority-trumps-globalness=0 -start-before greedy -o - %s | FileCheck %s -check-prefix=OLD
				# RUN: llc -march=amdgcn -mcpu=gfx1030 -greedy-regclass-priority-trumps-globalness=1 -start-before greedy -o - %s | FileCheck %s -check-prefix=NEW

				# At the time of writing -greedy-regclass-priority-trumps-globalness makes a
				# significant improvement in the total number of vgprs needed to compile this
				# test, from 11 down to 7.

				# OLD: NumVgprs: 11{{$}}
				# NEW: NumVgprs: 7{{$}}

				---
				name: _amdgpu_cs_main
				tracksRegLiveness: true
				body: |
				  bb.0:
				    successors: %bb.1, %bb.2
				    liveins: $vgpr0, $vgpr6

				    %6:vgpr_32 = COPY $vgpr6
				    undef %30.sub0:vreg_128 = COPY $vgpr0
				    undef %27.sub0:vreg_128 = V_MED3_F32_e64 0, 0, 0, 0, 0, 0, 0, 0, implicit $mode, implicit $exec
				    undef %16.sub0:sgpr_256 = S_MOV_B32 0
				    undef %26.sub1:vreg_64 = V_LSHRREV_B32_e32 1, %6, implicit $exec
				    %27.sub1:vreg_128 = COPY %27.sub0
				    %27.sub2:vreg_128 = COPY %27.sub0
				    %27.sub3:vreg_128 = COPY %27.sub0
				    %26.sub0:vreg_64 = V_MOV_B32_e32 1, implicit $exec
				    %16.sub1:sgpr_256 = COPY %16.sub0
				    %16.sub2:sgpr_256 = COPY %16.sub0
				    %16.sub3:sgpr_256 = COPY %16.sub0
				    %16.sub4:sgpr_256 = COPY %16.sub0
				    %16.sub5:sgpr_256 = COPY %16.sub0
				    %16.sub6:sgpr_256 = COPY %16.sub0
				    %16.sub7:sgpr_256 = COPY %16.sub0
				    IMAGE_STORE_V4_V2_gfx10 %27, %26, %16, 0, 1, -1, 0, 0, 0, 0, 0, 0, implicit $exec :: (dereferenceable store (s32) into custom "ImageResource")
				    S_CBRANCH_SCC1 %bb.2, implicit undef $scc
				    S_BRANCH %bb.1

				  bb.1:
				    %30.sub1:vreg_128 = V_MOV_B32_e32 0, implicit $exec
				    %30.sub2:vreg_128 = COPY %30.sub1
				    %30.sub3:vreg_128 = COPY %30.sub1
				    %26.sub1:vreg_64 = COPY %30.sub1
				    IMAGE_STORE_V4_V2_gfx10 %30, %26, %16, 0, 1, -1, 0, 0, 0, 0, 0, 0, implicit $exec :: (dereferenceable store (s32) into custom "ImageResource")

				  bb.2:
				    S_ENDPGM 0
				...

Revision Contents

Diff 428083

llvm/include/llvm/CodeGen/TargetRegisterInfo.h

llvm/lib/CodeGen/RegAllocGreedy.h

llvm/lib/CodeGen/RegAllocGreedy.cpp

llvm/test/CodeGen/AMDGPU/greedy-liverange-priority.mir
