This is an archive of the discontinued LLVM Phabricator instance.

Differential D33814

CodeGen: Fix ARM cmpxchg64 register fragmentation in fast-regalloc
AbandonedPublic

Authored by strager on Jun 1 2017, 5:38 PM.

Details

Reviewers

t.p.northover
ab
MatzeB
qcolombet

Summary

ARM's CMP_SWAP_64 pseudo-instruction (introduced in r266679,
used with -O0) uses three GPRPair-class registers and two
GPR-class registers. The fast register allocator (also used
with -O0) allocates GPR-class registers before allocating
GPRPair-class registers.

GPRPair includes r0+r1, r2+r3, r4+r5, r6+r7, r8+r9,
r10+r11, and r12+sp. With Clang's -ffixed-r9 option, with
the frame pointer enabled, and with sp reserved, the
register allocator can only use r0+r1, r2+r3, r4+r5, and
r6+r7 from GPRPair.
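
To make the arithmetic above concrete, here is a minimal sketch (not LLVM code) that filters the seven GPRPair registers by a reserved set. The helper name usablePairs and the string register names are hypothetical. The sketch also assumes the frame pointer is r11, which the summary implies by excluding r10+r11.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// A GPRPair register modelled as its two GPR halves.
using RegPair = std::pair<std::string, std::string>;

// Returns the GPRPair registers whose halves are both outside `Reserved`.
// The pair list is copied from the summary; it is not queried from LLVM.
std::vector<RegPair> usablePairs(const std::set<std::string> &Reserved) {
  static const std::vector<RegPair> AllPairs = {
      {"r0", "r1"}, {"r2", "r3"},   {"r4", "r5"}, {"r6", "r7"},
      {"r8", "r9"}, {"r10", "r11"}, {"r12", "sp"}};
  std::vector<RegPair> Usable;
  for (const RegPair &P : AllPairs)
    if (!Reserved.count(P.first) && !Reserved.count(P.second))
      Usable.push_back(P);
  return Usable;
}
```

Calling usablePairs({"r9", "r11", "sp"}) leaves r0+r1, r2+r3, r4+r5, and r6+r7, the four pairs named in the summary.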

If the fast register allocator allocates CMP_SWAP_64's GPR
operands first, it may decide to allocate r1 and r3. Later,
when the allocator allocates CMP_SWAP_64's GPRPair operands,
it realizes only two GPRPair-class registers are available
(r4+r5 and r6+r7), and it can't spill the already-allocated
registers to make room. LLVM then fails with the message:

error: ran out of registers during register allocation

In short, the fast register allocator fragments the
registers and LLVM can't compile 64-bit compare-exchanges.
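
The fragmentation can be counted directly. The sketch below is a toy model, not the allocator: it covers only r0 through r7 (the four usable pairs), and freePairCount is a hypothetical helper name.

```cpp
#include <set>

// Counts the GPRPair registers (rN, rN+1 with N even, N in 0..6) whose two
// halves are both still free. `TakenGPRs` holds the register numbers already
// handed out to GPR-class operands.
int freePairCount(const std::set<int> &TakenGPRs) {
  int Count = 0;
  for (int Even = 0; Even <= 6; Even += 2)
    if (!TakenGPRs.count(Even) && !TakenGPRs.count(Even + 1))
      ++Count;
  return Count;
}
```

With nothing taken there are four free pairs. Once r1 and r3 are taken, freePairCount({1, 3}) drops to two, one fewer than the three GPRPair operands CMP_SWAP_64 needs.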

As a workaround, reduce the risk of fragmentation by
allocating registers for bigger register classes before
smaller ones. For CMP_SWAP_64 on ARM, this means
registers are allocated for GPRPair operands before GPR
operands.
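
As a rough illustration of the workaround, the toy allocator below hands out r0 through r7 greedily and optionally sorts operands largest-first beforehand. Everything here is hypothetical: the Operand struct, the allocateAll helper, and especially the GPR policy of preferring odd registers, which is chosen only to reproduce the r1/r3 example above. It is not how RegAllocFast picks registers.

```cpp
#include <algorithm>
#include <array>
#include <vector>

// Hypothetical operand model: Size is 1 for a GPR operand, 2 for a GPRPair.
struct Operand {
  int Size;
};

// Greedy toy allocator over r0..r7. Returns true if every operand got a
// register, false if it "ran out of registers".
bool allocateAll(std::vector<Operand> Ops, bool LargestFirst) {
  if (LargestFirst)
    // Strict comparison plus stable_sort keeps equal-sized operands in
    // their original order.
    std::stable_sort(Ops.begin(), Ops.end(),
                     [](const Operand &A, const Operand &B) {
                       return A.Size > B.Size;
                     });
  std::array<bool, 8> Taken{}; // r0..r7, all free initially
  for (const Operand &Op : Ops) {
    bool Done = false;
    if (Op.Size == 2) {
      // A pair needs an even register and its odd neighbour.
      for (int R = 0; R < 8 && !Done; R += 2)
        if (!Taken[R] && !Taken[R + 1]) {
          Taken[R] = Taken[R + 1] = true;
          Done = true;
        }
    } else {
      // Adversarial GPR choice: odd registers first, then even ones.
      for (int R = 1; R < 8 && !Done; R += 2)
        if (!Taken[R]) {
          Taken[R] = true;
          Done = true;
        }
      for (int R = 0; R < 8 && !Done; R += 2)
        if (!Taken[R]) {
          Taken[R] = true;
          Done = true;
        }
    }
    if (!Done)
      return false;
  }
  return true;
}
```

For two GPR operands followed by three GPRPair operands, allocateAll(Ops, false) fails on the third pair, while allocateAll(Ops, true) succeeds because the pairs claim r0 through r5 before the GPR operands take r7 and r6.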

For consistency, change all architectures, not just ARM.

This fixes PR30228.

Test Plan:
See the included test case (test/CodeGen/ARM/cmpxchg-O0.ll
test_cmpxchg_64_register_pressure). It is based on the
following C program compiled with Clang with -O0:

void f(unsigned long long *addr, unsigned long long desired, unsigned long long new) {
  while (!__sync_bool_compare_and_swap(addr, desired, new)) {
  }
}

Diff Detail

Build Status

Buildable 6935
Build 6935: arc lint + arc unit

Event Timeline

strager created this revision. Jun 1 2017, 5:38 PM

Herald added subscribers: kristof.beyls, javed.absar, qcolombet and 2 others. Jun 1 2017, 5:38 PM

Let me know if there's a better solution than sorting by size! (Aside from fixing the bug which necessitates CMP_SWAP_64 in the first place...)

lib/CodeGen/RegAllocFast.cpp
997–1000	(I know I should delete this before merging. Locally I need to work with LLVM 4.0.0.)

smeenai added a subscriber: smeenai. Jun 1 2017, 5:41 PM

Ping! This is ready for design and code review.

+Matthias for register allocation

+ Quentin for register allocation

I'd prefer not to add additional logic into the main path of the fast register allocator. How about just reordering the operands of CMP_SWAP_64 (or for all of the CMP_SWAP operations for consistency) instead?

Hi,

The approach sounds sensible to me; however, I'd like us to use the AllocationPriority instead of the size of the register class.
Moreover, please refactor the code (I am thinking of a helper function) so that the definitions are processed in the same way.
Finally, keep the order of the operands as the tie-breaker.

Cheers,
-Quentin
PS: If AllocationPriority does not fix the problem, then you'll have to fix the description of the related register class, because size alone is too magic.
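
One way to read the three requests above (priority instead of size, a shared helper, operand order as tie-breaker) is sketched below. This is an assumption about what was meant, not code from the patch or from LLVM: UseOperand and allocationOrder are hypothetical names, and Priority merely stands in for a register class's AllocationPriority.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for a virtual-register operand.
struct UseOperand {
  unsigned Index;    // position in the instruction's operand list
  unsigned Priority; // stand-in for the register class's AllocationPriority
};

// Helper that could serve uses and defs alike: returns operand indices in
// allocation order, highest priority first. std::stable_sort with a strict
// comparison keeps operands of equal priority in their original order.
std::vector<unsigned> allocationOrder(std::vector<UseOperand> Ops) {
  std::stable_sort(Ops.begin(), Ops.end(),
                   [](const UseOperand &A, const UseOperand &B) {
                     return A.Priority > B.Priority;
                   });
  std::vector<unsigned> Order;
  for (const UseOperand &Op : Ops)
    Order.push_back(Op.Index);
  return Order;
}
```

Note that the comparison is strict (>). The comparator in the diff under review uses >=, which does not meet the strict weak ordering that std::sort requires of its comparator.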

This revision now requires changes to proceed. Aug 7 2017, 10:36 AM

qcolombet added inline comments. Aug 7 2017, 10:37 AM

lib/CodeGen/RegAllocFast.cpp
984	Reject Reg == 0
991	Why can't you just get the reg class here?

In D33814#834086, @qcolombet wrote:

Hi,

The approach sounds sensible to me; however, I'd like us to use the AllocationPriority instead of the size of the register class.
Moreover, please refactor the code (I am thinking of a helper function) so that the definitions are processed in the same way.
Finally, keep the order of the operands as the tie-breaker.

Cheers,
-Quentin
PS: If AllocationPriority does not fix the problem, then you'll have to fix the description of the related register class, because size alone is too magic.

The ARM target does not use AllocationPriority yet. Though it would be great to start using it for the GPRPair and tuple classes (in a separate commit though).

I don't plan to work on this anymore.

Revision Contents

Path	Size
lib/CodeGen/RegAllocFast.cpp	57 lines
test/CodeGen/ARM/cmpxchg-O0.ll	48 lines

Diff 101146

lib/CodeGen/RegAllocFast.cpp

[907 lines not shown; context: if (MI->isCopy()) {]
       CopySrcSub = MI->getOperand(1).getSubReg();
     }

     // Track registers used by instruction.
     UsedInInstr.clear();

     // First scan.
     // Mark physreg uses and early clobbers as used.
-    // Find the end of the virtreg operands
-    unsigned VirtOpEnd = 0;
     bool hasTiedOps = false;
     bool hasEarlyClobbers = false;
     bool hasPartialRedefs = false;
     bool hasPhysDefs = false;
     for (unsigned i = 0, e = MI->getNumOperands(); i != e; ++i) {
       MachineOperand &MO = MI->getOperand(i);
       // Make sure MRI knows about registers clobbered by regmasks.
       if (MO.isRegMask()) {
         MRI->addPhysRegsUsedFromRegMask(MO.getRegMask());
         continue;
       }
       if (!MO.isReg()) continue;
       unsigned Reg = MO.getReg();
       if (!Reg) continue;
       if (TargetRegisterInfo::isVirtualRegister(Reg)) {
-        VirtOpEnd = i+1;
         if (MO.isUse()) {
           hasTiedOps = hasTiedOps ||
                        MCID.getOperandConstraint(i, MCOI::TIED_TO) != -1;
         } else {
           if (MO.isEarlyClobber())
             hasEarlyClobbers = true;
           if (MO.getSubReg() && MI->readsVirtualRegister(Reg))
             hasPartialRedefs = true;
[27 lines not shown; context: if (MI->isInlineAsm() || hasEarlyClobbers || hasPartialRedefs ||]
       CopyDst = 0;
       // Pretend we have early clobbers so the use operands get marked below.
       // This is not necessary for the common case of a single tied use.
       hasEarlyClobbers = true;
     }

     // Second scan.
     // Allocate virtreg uses.
-    for (unsigned i = 0; i != VirtOpEnd; ++i) {
-      MachineOperand &MO = MI->getOperand(i);
-      if (!MO.isReg()) continue;
+    // HACK(strager): Allocate larger registers first.
+    SmallVector<unsigned, 8> VirtOps;
+    for (unsigned i = 0, e = MI->getNumOperands(); i != e; ++i) {
+      const MachineOperand &MO = MI->getOperand(i);
+      if (!MO.isReg())
+        continue;
+      unsigned Reg = MO.getReg();
+      if (!TargetRegisterInfo::isVirtualRegister(Reg))
+        continue;
+      if (!MO.isUse())
+        continue;
+      VirtOps.emplace_back(i);
+    }
+    std::sort(VirtOps.begin(), VirtOps.end(),
+              [this, &MI](unsigned i, unsigned j) {
+      auto OpSize = [this](const MachineOperand &MO) -> unsigned {
+        unsigned Reg = MO.getReg();
+        if (auto RC = MRI->getRegClassOrRegBank(Reg)) {
+          if (RC.is<const TargetRegisterClass *>()) {
+#if LLVM_VERSION_MAJOR >= 5
+            // LLVM 5.0.0.
+            return TRI->getRegSizeInBits(
+                *RC.get<const TargetRegisterClass *>());
+#else
+            // LLVM 4.0.0.
+            return RC.get<const TargetRegisterClass *>()->getSize();
+#endif
+          } else {
+            return 0;
+          }
+        } else {
+          return 0;
+        }
+      };
+      return OpSize(MI->getOperand(i)) >= OpSize(MI->getOperand(j));
+    });
+    for (unsigned i : VirtOps) {
+      const MachineOperand &MO = MI->getOperand(i);
       unsigned Reg = MO.getReg();
-      if (!TargetRegisterInfo::isVirtualRegister(Reg)) continue;
-      if (MO.isUse()) {
         LiveRegMap::iterator LRI = reloadVirtReg(*MI, i, Reg, CopyDst);
         unsigned PhysReg = LRI->PhysReg;
         CopySrc = (CopySrc == Reg || CopySrc == PhysReg) ? PhysReg : 0;
         if (setPhysReg(MI, i, PhysReg))
           killVirtReg(LRI);
       }
-    }

     // Track registers defined by instruction - early clobbers and tied uses at
     // this point.
     UsedInInstr.clear();
     if (hasEarlyClobbers) {
       for (unsigned i = 0, e = MI->getNumOperands(); i != e; ++i) {
         MachineOperand &MO = MI->getOperand(i);
         if (!MO.isReg()) continue;
[115 lines not shown]

test/CodeGen/ARM/cmpxchg-O0.ll

[74 lines not shown]
 ; CHECK: cmp{{(\.w)?}} [[STATUS]], #0
 ; CHECK: bne [[RETRY]]
 ; CHECK: [[DONE]]:
 ; CHECK: dmb ish
   %res = cmpxchg i64* %addr, i64 %desired, i64 %new seq_cst monotonic
   ret { i64, i1 } %res
 }

+; If r9 and fp are reserved, cmpxchg can only use r0/r1, r2/r3, r4/r5, or r6/r7
+; for the 64-bit inputs to ldrexd and strexd. Ensure fast-regalloc can find
+; enough registers without spilling.
+define i64 @test_cmpxchg_64_register_pressure(i64* %addr, i64 %desired, i64 %new) nounwind "no-frame-pointer-elim"="true" "target-features"="+reserve-r9" {
+; CHECK-LABEL: test_cmpxchg_64_register_pressure:
+  %addr.addr = alloca i64*, align 4
+  %desired.addr = alloca i64, align 8
+  %new.addr = alloca i64, align 8
+  store i64* %addr, i64** %addr.addr, align 4
+  store i64 %desired, i64* %desired.addr, align 8
+  store i64 %new, i64* %new.addr, align 8
+  br label %while.cond
+
+while.cond:
+  %addr.tmp = load i64, i64* %addr.addr, align 4
+  %desired.tmp = load i64, i64* %desired.addr, align 8
+  %new.tmp = load i64, i64* %new.addr, align 8
+
+; CHECK-DAG: mov [[NEWLO:r[0-9]+]], r3
+; CHECK-NEXT:mov [[NEWHI:r[0-9]+]], {{r[0-9]+}}
+; CHECK-DAG: ldrd [[DESIREDLO:r[0-9]+]], [[DESIREDHI:r[0-9]+]], [sp, #{{[0-9]+}}] @ 8-byte Reload
+; CHECK-DAG: dmb ish
+; CHECK: [[INNERRETRY:.LBB[0-9]+_[0-9]+]]:
+; CHECK-NOT: {{ldr[^e]|str}}
+; CHECK: ldrexd [[OLDLO:r[0-9]+]], [[OLDHI:r[0-9]+]], {{\[}}[[ADDR:r[0-9]+]]{{\]}}
+; CHECK-NOT: {{ldr|str}}
+; CHECK: cmp [[OLDLO]], [[DESIREDLO]]
+; CHECK-NOT: {{ldr|str}}
+; CHECK: cmpeq [[OLDHI]], [[DESIREDHI]]
+; CHECK-NOT: {{ldr|str}}
+; CHECK: bne [[INNERDONE:.LBB[0-9]+_[0-9]+]]
+; CHECK-NOT: {{ldr|str[^e]}}
+; CHECK: strexd [[STATUS:r[0-9]+]], [[NEWLO]], [[NEWHI]], {{\[}}[[ADDR]]{{\]}}
+; CHECK-NOT: {{ldr|str}}
+; CHECK: cmp{{(\.w)?}} [[STATUS]], #0
+; CHECK-NOT: {{ldr|str}}
+; CHECK: bne [[INNERRETRY]]
+; CHECK: [[INNERDONE]]:
+; CHECK: dmb ish
+  %tmp = cmpxchg i64* %addr.tmp, i64 %desired.tmp, i64 %new.tmp seq_cst seq_cst
+
+  %status = extractvalue { i64, i1 } %tmp, 1
+  br i1 %status, label %done, label %while.cond
+
+done:
+  ret i64 0
+}
+
 define { i64, i1 } @test_nontrivial_args(i64* %addr, i64 %desired, i64 %new) {
 ; CHECK-LABEL: test_nontrivial_args:
 ; CHECK: dmb ish
 ; CHECK-NOT: uxt
 ; CHECK: [[RETRY:.LBB[0-9]+_[0-9]+]]:
 ; CHECK: ldrexd [[OLDLO:r[0-9]+]], [[OLDHI:r[0-9]+]], [r0]
 ; CHECK: cmp [[OLDLO]], {{r[0-9]+}}
 ; CHECK: cmpeq [[OLDHI]], {{r[0-9]+}}
[23 lines not shown]