This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contents

-
llvm/
-
lib/CodeGen/
-
RegAllocGreedy.cpp
-
test/CodeGen/
-
AMDGPU/
-
GlobalISel/
-
extractelement-stack-lower.ll
-
mul.ll
-
sdiv.i64.ll
-
sdivrem.ll
-
srem.i64.ll
-
udiv.i64.ll
-
urem.i64.ll
-
amdgpu-codegenprepare-idiv.ll
-
copy-illegal-type.ll
-
cvt_f32_ubyte.ll
-
frem.ll
-
greedy-global-heuristic.mir
-
half.ll
-
llvm.round.f64.ll
-
load-constant-i16.ll
-
load-global-i16.ll
-
sdiv64.ll
-
shl.ll
-
soft-clause-exceeds-register-budget.ll
-
splitkit-copy-live-lanes.mir
-
srl.ll
-
ARM/
-
fptosi-sat-scalar.ll
-
srem-seteq-illegal-types.ll
-
umulo-128-legalisation-lowering.ll
-
Hexagon/
-
reg-scavengebug-2.ll
-
Mips/cconv/
-
vector.ll
-
PowerPC/
-
ppc-fpclass.ll
-
srem-vector-lkk.ll
-
urem-vector-lkk.ll
-
RISCV/
-
rvv/
-
fixed-vectors-bitreverse.ll
-
fixed-vectors-bswap.ll
-
fixed-vectors-cttz.ll
-
stack-store-check.ll
-
Thumb2/
-
mve-simple-arith.ll
-
mve-vld4.ll
-
srem-seteq-illegal-types.ll
-
X86/
-
2007-10-12-SpillerUnfold1.ll
-
2008-04-16-ReMatBug.ll
-
64-bit-shift-by-32-minus-y.ll
-
abs.ll
-
avx512-calling-conv.ll
-
avx512-regcall-NoMask.ll
-
avx512-select.ll
-
avx512bw-intrinsics-upgrade.ll
-
avx512bwvl-intrinsics-upgrade.ll
-
bitreverse.ll
-
bool-vector.ll
-
bswap.ll
-
build-vector-128.ll
-
combine-sbb.ll
-
div-rem-pair-recomposition-signed.ll
-
div-rem-pair-recomposition-unsigned.ll
-
fp128-cast.ll
-
fptosi-sat-scalar.ll
-
funnel-shift-rot.ll
-
funnel-shift.ll
-
gather-addresses.ll
-
hoist-and-by-const-from-lshr-in-eqcmp-zero.ll
-
hoist-and-by-const-from-shl-in-eqcmp-zero.ll
-
horizontal-reduce-smax.ll
-
horizontal-reduce-smin.ll
-
horizontal-reduce-umax.ll
-
horizontal-reduce-umin.ll
-
i128-mul.ll
-
i128-sdiv.ll
-
i256-add.ll
-
i64-to-float.ll
-
illegal-bitfield-loadstore.ll
-
known-signbits-vector.ll
-
legalize-shl-vec.ll
-
load-combine.ll
-
masked_gather_scatter.ll
-
memcmp-more-load-pairs.ll
-
merge-consecutive-stores-nt.ll
-
mmx-arith.ll
-
mul-constant-i64.ll
-
mul-constant-result.ll
-
mul-i1024.ll
-
mul-i256.ll
-
mul-i512.ll
-
mul128.ll
-
neg-abs.ll
-
nontemporal.ll
-
nosse-vector.ll
-
overflow.ll
-
peephole-na-phys-copy-folding.ll
-
popcnt.ll
-
pr31088.ll
-
pr32284.ll
-
pr32329.ll
-
pr32610.ll
-
pr34080-2.ll
-
pr46527.ll
-
sadd_sat.ll
-
sadd_sat_plus.ll
-
scheduler-backtracking.ll
-
sdiv_fix.ll
-
sdiv_fix_sat.ll
-
select.ll
-
setcc-wide-types.ll
-
shrink_vmul.ll
-
smax.ll
-
smin.ll
-
smul_fix.ll
-
smul_fix_sat.ll
-
smulo-128-legalisation-lowering.ll
-
sse-intrinsics-fast-isel.ll
-
sse2-intrinsics-fast-isel.ll
-
sshl_sat.ll
-
sshl_sat_vec.ll
-
ssub_sat.ll
-
ssub_sat_plus.ll
-
stack-align-memcpy.ll
-
statepoint-vreg-unlimited-tied-opnds.ll
-
subvector-broadcast.ll
-
uadd_sat.ll
-
udiv_fix_sat.ll
-
umax.ll
-
umin.ll
-
umul-with-overflow.ll
-
umul_fix.ll
-
umul_fix_sat.ll
-
umulo-64-legalisation-lowering.ll
-
unfold-masked-merge-vector-variablemask.ll
-
ushl_sat.ll
-
ushl_sat_vec.ll
-
usub_sat.ll
-
vec-strict-cmp-128.ll
-
vec-strict-cmp-sub128.ll
-
vec-strict-fptoint-256.ll
-
vec-strict-inttofp-512.ll
-
vec_shift4.ll
-
vec_smulo.ll
-
vec_umulo.ll
-
vector-fshl-128.ll
-
vector-fshl-rot-128.ll
-
vector-fshr-128.ll
-
vector-fshr-rot-128.ll
-
vector-gep.ll
-
vector-idiv-v2i32.ll
-
vector-lzcnt-128.ll
-
vector-rotate-128.ll
-
vector-sext.ll
-
vector-shift-lshr-256.ll
-
vector-shift-shl-256.ll
-
vector-trunc-ssat.ll
-
vector-tzcnt-128.ll
-
vshift-6.ll
-
widen_cast-4.ll
-
x86-fpclass.ll
-
xmulo.ll

Differential D108578

RegAllocGreedy: Account for reserved registers in num regs heuristic
Closed, Public

Authored by arsenm on Aug 23 2021, 1:10 PM.

Details

Reviewers

qcolombet
kparzysz
foad
rampitec

Summary

The greedy allocator compares the estimated live range length against
the number of registers in the class to decide which heuristic to
use. This check was taking the raw number of registers in the class,
even though not all of them may be available. AMDGPU relies heavily on
dynamically reserving registers based on user attributes to satisfy
occupancy constraints, so the raw number is highly misleading.
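
To make the difference concrete, here is a minimal, self-contained sketch of such a length-versus-register-count check. It is a toy model, not the code from RegAllocGreedy.cpp: `ToyRegClass`, `makeToyRegClass`, `numRegs`, `numAllocatableRegs`, `isLongLiveRange` and the factor of 2 are all hypothetical names and values chosen for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy register class: Reserved[i] is true when register i
// may not be handed out by the allocator.
struct ToyRegClass {
  std::vector<bool> Reserved;
};

// Build a class with Total registers, the last NumReserved of them reserved.
inline ToyRegClass makeToyRegClass(unsigned Total, unsigned NumReserved) {
  assert(NumReserved <= Total && "cannot reserve more registers than exist");
  ToyRegClass RC;
  RC.Reserved.assign(Total, false);
  for (unsigned I = Total - NumReserved; I < Total; ++I)
    RC.Reserved[I] = true;
  return RC;
}

// Raw class size: counts every register, reserved or not.
inline unsigned numRegs(const ToyRegClass &RC) {
  return static_cast<unsigned>(RC.Reserved.size());
}

// Count only the registers that are not reserved.
inline unsigned numAllocatableRegs(const ToyRegClass &RC) {
  unsigned N = 0;
  for (bool IsReserved : RC.Reserved)
    if (!IsReserved)
      ++N;
  return N;
}

// Assumed form of the check: a live range counts as "long" when its
// estimated length in instructions exceeds Factor * RegCount.
inline bool isLongLiveRange(unsigned LengthInInstrs, unsigned RegCount,
                            unsigned Factor = 2) {
  return LengthInInstrs > Factor * RegCount;
}
```

With a 256-register class in which 232 registers are reserved, a live range of 100 instructions is not "long" when measured against `numRegs` (100 is below 512) but is "long" when measured against `numAllocatableRegs` (100 exceeds 48). That is the kind of flip in heuristic choice the summary describes.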

There are still a few problems here. In the original testcase that
made me notice this, the live range size is incorrect after the
scheduler rearranges instructions, since the instructions don't have
the original InstrDist offsets. Additionally, I think it would be more
appropriate to use the number of disjointly allocatable registers in
the class. For the AMDGPU register tuples, there are a large number of
registers in each tuple class, but only a small fraction can actually
be allocated at the same time since they all overlap with each
other. It seems we do not have a query that corresponds to the number
of independently allocatable registers. Relatedly, I'm still debugging
some allocation failures where overlapping tuples seem to not be
handled correctly.

The test changes are mostly noise. There are a handful of x86 tests
that look like regressions with an additional spill, and a handful
that now avoid a spill. The worst-looking regression is likely
test/CodeGen/Thumb2/mve-vld4.ll, which introduces a few additional
spills. test/CodeGen/AMDGPU/soft-clause-exceeds-register-budget.ll
shows a massive improvement by completely eliminating a large number
of spills inside a loop.

Event Timeline

arsenm created this revision. Aug 23 2021, 1:10 PM

Herald added subscribers: frasercrmck, kerbowa, luismarques and 31 others. Aug 23 2021, 1:10 PM

arsenm requested review of this revision. Aug 23 2021, 1:10 PM

Herald added a project: Restricted Project. Aug 23 2021, 1:10 PM

Herald added subscribers: MaskRay, wdng.

Harbormaster completed remote builds in B120855: Diff 368192. Aug 23 2021, 1:11 PM

The change itself looks reasonable to me.

The worst-looking regression is likely test/CodeGen/Thumb2/mve-vld4.ll

We likely get the register pressure very wrong, and it looks like less code in total. I have no objections to this patch from an Arm point of view.

Arm MVE has 32 scalar FP S regs; each aligned group of 4 makes up one of the 8 vector Q regs (so q0=s0-s1-s2-s3, q1=s4-s5-s6-s7, etc.).
4 consecutive, potentially unaligned Q registers make up a QQQQPR reg (q0-q1-q2-q3, q1-q2-q3-q4, q2-q3-q4-q5, q3-q4-q5-q6 and q4-q5-q6-q7).
So we have "5" QQQQPR registers, only 2 of which are actually allocatable at once.
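
The counting above can be checked with a small sketch. It models each tuple register as a contiguous run of units and uses the classic earliest-end greedy selection to find the largest set of non-overlapping tuples. `ToyTuple`, `makeTuples` and `maxDisjointTuples` are hypothetical names, not an existing LLVM query, and the greedy is only exact under the assumption that every tuple is a contiguous range over a linear register file.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical tuple register covering units [First, First + Width).
struct ToyTuple {
  unsigned First;
  unsigned Width;
};

// Enumerate all contiguous tuples of Width units over NumUnits units,
// with tuple start positions Stride units apart.
inline std::vector<ToyTuple> makeTuples(unsigned NumUnits, unsigned Width,
                                        unsigned Stride = 1) {
  assert(Width > 0 && Stride > 0 && "width and stride must be positive");
  std::vector<ToyTuple> Tuples;
  for (unsigned Start = 0; Start + Width <= NumUnits; Start += Stride)
    Tuples.push_back({Start, Width});
  return Tuples;
}

// Largest number of pairwise non-overlapping tuples. Sorting by end point
// and always taking the first tuple that starts at or after the last chosen
// end is the standard interval-scheduling greedy.
inline unsigned maxDisjointTuples(std::vector<ToyTuple> Tuples) {
  std::sort(Tuples.begin(), Tuples.end(),
            [](const ToyTuple &A, const ToyTuple &B) {
              return A.First + A.Width < B.First + B.Width;
            });
  unsigned Count = 0;
  unsigned NextFree = 0; // first unit not covered by an already chosen tuple
  for (const ToyTuple &T : Tuples) {
    if (T.First >= NextFree) {
      ++Count;
      NextFree = T.First + T.Width;
    }
  }
  return Count;
}
```

For the MVE case, `makeTuples(8, 4)` yields the 5 QQQQPR candidates over q0..q7, and `maxDisjointTuples` returns 2 for them (q0-q3 and q4-q7), matching the numbers in the comment.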

Additionally, I think it would be more appropriate to use the number of disjointly allocatable registers in the class. For the AMDGPU register tuples, there are a large number of registers in each tuple class, but only a small fraction can actually be allocated at the same time since they all overlap with each other. It seems we do not have a query that corresponds to the number of independently allocatable registers.

That sounds very useful.

ping

lkail added a subscriber: lkail. Sep 13 2021, 7:08 AM

qcolombet accepted this revision. Sep 14 2021, 1:28 PM

This revision is now accepted and ready to land. Sep 14 2021, 1:28 PM

4a36e96c3fc2a9128097bfc4f907ccebc5dc66af

paklui added a subscriber: paklui. Sep 15 2021, 11:56 AM