ARM's CMP_SWAP_64 pseudo-instruction (introduced in r266679,
used with -O0) uses three GPRPair-class registers and two
GPR-class registers. The fast register allocator (also used
with -O0) allocates GPR-class registers before allocating
GPRPair-class registers.
GPRPair includes r0+r1, r2+r3, r4+r5, r6+r7, r8+r9,
r10+r11, and r12+sp. With Clang's -ffixed-r9 option, with
the frame pointer enabled, and with sp reserved, the
register allocator can only use r0+r1, r2+r3, r4+r5, and
r6+r7 from GPRPair.
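The pair arithmetic above can be checked with a small toy model. This is a hypothetical sketch, not LLVM code: it assumes r9 is reserved by -ffixed-r9, that the frame pointer is r11 (as implied by which pairs drop out), and that sp is r13. The function count_usable_pairs is an illustrative name.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical reserved set: r9 (-ffixed-r9), r11 (assumed frame
   pointer) and r13 (sp). */
static bool is_reserved(int reg) {
    return reg == 9 || reg == 11 || reg == 13;
}

/* Walk the seven GPRPair candidates r0+r1 .. r12+r13 and count the
   pairs whose halves are both unreserved, printing each one. */
int count_usable_pairs(void) {
    int usable = 0;
    for (int lo = 0; lo <= 12; lo += 2) {
        if (!is_reserved(lo) && !is_reserved(lo + 1)) {
            printf("r%d+r%d\n", lo, lo + 1);
            usable++;
        }
    }
    return usable;
}
```

Calling count_usable_pairs() from a main function prints r0+r1 through r6+r7 and returns 4.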
If the fast register allocator allocates CMP_SWAP_64's GPR
operands first, it may decide to allocate r1 and r3. Later,
when the allocator allocates CMP_SWAP_64's GPRPair operands,
it realizes only two GPRPair-class registers are available
(r4+r5 and r6+r7), and it can't spill the already-allocated
registers to make room. LLVM then fails with the message:
error: ran out of registers during register allocation
In short, the fast register allocator fragments the
registers and LLVM can't compile 64-bit compare-exchanges.
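A minimal sketch of the failing order, under the same assumptions (r9, r11 and r13 reserved). It is a toy model rather than the fast register allocator itself: the GPR operands are simply placed in r1 and r3, as in the scenario described earlier, and the hypothetical helper alloc_pairs then hands out free pairs first-fit without spilling.

```c
#include <stdbool.h>

/* Hypothetical reserved set: r9, r11 (assumed frame pointer), r13 (sp). */
static bool reserved(int r) { return r == 9 || r == 11 || r == 13; }

/* Give out up to `wanted` free rN+rN+1 pairs, first-fit, never
   evicting anything already in `used`. Returns how many were found. */
static int alloc_pairs(bool used[16], int wanted) {
    int got = 0;
    for (int lo = 0; lo <= 12 && got < wanted; lo += 2) {
        if (!reserved(lo) && !reserved(lo + 1) && !used[lo] && !used[lo + 1]) {
            used[lo] = used[lo + 1] = true;
            got++;
        }
    }
    return got;
}

/* GPR operands land in r1 and r3 first; then the three GPRPair
   operands of CMP_SWAP_64 are requested. */
int pairs_after_gpr_first(void) {
    bool used[16] = {false};
    used[1] = true;
    used[3] = true;
    return alloc_pairs(used, 3);
}
```

pairs_after_gpr_first() yields only 2 pairs (r4+r5 and r6+r7), one short of the three that CMP_SWAP_64 needs, which corresponds to the "ran out of registers" failure.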
As a workaround, reduce the risk of fragmentation by
allocating registers for bigger register classes before
smaller ones. For CMP_SWAP_64 on ARM, this means
registers are allocated for GPRPair operands before GPR
operands.
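The workaround can be sketched in the same toy model by sorting operands so that wider register classes are assigned first. The struct operand type, its width field (2 for GPRPair, 1 for GPR) and alloc_big_first are hypothetical names for illustration; the real change lives in LLVM's fast register allocator and is not reproduced here.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical operand: width 2 = GPRPair, width 1 = GPR. */
struct operand { int width; int reg; };

/* Hypothetical reserved set: r9, r11 (assumed frame pointer), r13 (sp). */
static bool reserved(int r) { return r == 9 || r == 11 || r == 13; }

/* qsort comparator: larger register classes sort first. */
static int by_width_desc(const void *a, const void *b) {
    const struct operand *x = a, *y = b;
    return y->width - x->width;
}

/* First-fit over r0..r13; a width-2 operand must start on an even register. */
static bool assign(struct operand *op, bool used[16]) {
    for (int r = 0; r + op->width - 1 <= 13; r += op->width) {
        bool ok = true;
        for (int k = 0; k < op->width; k++)
            if (reserved(r + k) || used[r + k]) ok = false;
        if (ok) {
            for (int k = 0; k < op->width; k++) used[r + k] = true;
            op->reg = r;
            return true;
        }
    }
    return false;
}

/* Sort operands by decreasing class size, then assign each in turn.
   Returns true if every operand received a register. */
bool alloc_big_first(struct operand ops[], size_t n) {
    bool used[16] = {false};
    qsort(ops, n, sizeof ops[0], by_width_desc);
    for (size_t i = 0; i < n; i++)
        if (!assign(&ops[i], used)) return false;
    return true;
}
```

With three width-2 and two width-1 operands, the pairs take r0+r1, r2+r3 and r4+r5, and the GPRs take r6 and r7, so allocation succeeds. In this simplified first-fit model even a GPR-first order would pick r0 and r1 and still fit; the failure described above needs the GPRs to be scattered, e.g. in r1 and r3.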
For consistency, change all architectures, not just ARM.
This fixes PR30228.
Test Plan:
See the included test case (test/CodeGen/ARM/cmpxchg-O0.ll
test_cmpxchg_64_register_pressure). It is based on the
following C program compiled with Clang with -O0:
    void f(unsigned long long *addr, unsigned long long desired,
           unsigned long long new) {
      while (!__sync_bool_compare_and_swap(addr, desired, new)) {
      }
    }
Reject Reg == 0