This is an archive of the discontinued LLVM Phabricator instance.

X86: Stop assigning register costs for longer encodings
ClosedPublic

Authored by MatzeB on Sep 14 2022, 3:32 PM.

Details

Summary

This stops reporting CostPerUse 1 for R8-R15 and XMM8-XMM31. This was
previously done because instruction encoding require a REX prefix when
using them resulting in longer instruction encodings. I found that this
regresses the quality of the register allocation as the costs impose an
ordering on eviction candidates. I also feel that there is a bit of an
impedance mismatch as the actual costs occure when encoding instructions
using those registers, but the order of VReg assignments is not
primarily ordered by number of Defs+Uses.

I did extensive measurements with the llvm-test-suite wiht SPEC2006 +
SPEC2017 included, internal services showed similar patterns. Generally
there are a log of improvements but also a lot of regression. But on
average the allocation quality seems to improve at a small code size
regression.

Results for measuring static and dynamic instruction counts:

-O3 + ThinLTO + Instr-PGO

Dynamic Counts (scaled by execution frequency) / Optimization Remarks:

Spills+FoldedSpills   -5.6%
Reloads+FoldedReloads -4.2%
Copies                -0.1%

Static / LLVM Statistics:

regalloc.NumSpills    mean -1.6%, geomean -2.8%
regalloc.NumReloads   mean -1.7%, geomean -3.1%
size..text            mean +0.4%, geomean +0.4%

-O3

Static / LLVM Statistics:

mean -2.2%, geomean -3.1%) regalloc.NumSpills
mean -2.6%, geomean -3.9%) regalloc.NumReloads
mean +0.6%, geomean +0.6%) size..text

-Os

Static / LLVM Statistics:

regalloc.NumSpills   mean -3.0%
regalloc.NumReloads  mean -3.3%
size..text           mean +0.3%, geomean +0.3%

Detailed numbers in https://reviews.llvm.org/P8290

Diff Detail

Event Timeline

MatzeB created this revision.Sep 14 2022, 3:32 PM
MatzeB requested review of this revision.Sep 14 2022, 3:32 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 14 2022, 3:32 PM

I guess this somewhat of an RFC, if we would accept better regalloc at slightly bigger programs: I found that dropping CostPerUse produced a small but measurable improvement on X86 regalloc.

I would also love to hear if others can explain why the increased register costs for the larger encodings result in slightly smaller programs, even though I don't us prefering registers with more or less uses when ordering eviction and assignment choices (and as far as the AllocationOrder goes the register with longer encodings are still on the back anyway).

MatzeB edited the summary of this revision. (Show Details)Sep 14 2022, 3:42 PM

Looks like the code change to X86RegisterInfo.td is not in this patch?

asb added a subscriber: asb.Sep 15 2022, 9:47 AM

D86836 introduced support for setting multiple CostPerUse values (which we use on RISC-V to set a non-zero value for some registers only when the 'compressed' extension is enabled). Did you evaluate doing something similar to only set CostPerUse at Os or perhaps Oz?

Looks like the code change to X86RegisterInfo.td is not in this patch?

huh. It's included now.

D86836 introduced support for setting multiple CostPerUse values (which we use on RISC-V to set a non-zero value for some registers only when the 'compressed' extension is enabled). Did you evaluate doing something similar to only set CostPerUse at Os or perhaps Oz?

Yes I saw that, didn't push on this angle (yet) as we should really use llvm::shouldOptimizeForSize then which requires a ProfileSummaryInfo reference that isn't around in that callback. So would need to refactor those interfaces in some way...

MatzeB updated this revision to Diff 460453.Sep 15 2022, 10:46 AM

Early x86-64 CPUs (often with weak frontend/decoders) might benefit from avoiding REX encodings (I know when I was doing exegesis testing for the atom/silvermont sched models avoiding REX registers gave much more consistent backend numbers), although I'm not certain how much work adding CPU Feature specific costs would be (or how much benefit they'd give for most people).

llvm/lib/Target/X86/X86RegisterInfo.td
128

Keep the comments?

221

This comment is still useful

LGTM

I guess this somewhat of an RFC, if we would accept better regalloc at slightly bigger programs: I found that dropping CostPerUse produced a small but measurable improvement on X86 regalloc.

FWIW, tried it on a server app (was curious if the "slightly larger" bit meant more iCache pressure), doesn't seem to be an issue.

I would also love to hear if others can explain why the increased register costs for the larger encodings result in slightly smaller programs, even though I don't us prefering registers with more or less uses when ordering eviction and assignment choices (and as far as the AllocationOrder goes the register with longer encodings are still on the back anyway).

mtrofin accepted this revision.Sep 16 2022, 12:11 PM
This revision is now accepted and ready to land.Sep 16 2022, 12:11 PM

Looks sensible to me.
I would just second @RKSimon's question about how this works with older x86 HW but if x86 folks are fine with the change, that's fine by me.

MatzeB marked 2 inline comments as done.Sep 30 2022, 1:42 PM

Early x86-64 CPUs (often with weak frontend/decoders) might benefit from avoiding REX encodings (I know when I was doing exegesis testing for the atom/silvermont sched models avoiding REX registers gave much more consistent backend numbers), although I'm not certain how much work adding CPU Feature specific costs would be (or how much benefit they'd give for most people).

My gut feeling would be that we are better off without the added complexity, as the size differences are somewhat happening by accident anyway. And in fact looking into my data the size differences seem to mostly stem from small functions and there is not really a trend for bigger functions.

llvm/lib/Target/X86/X86RegisterInfo.td
128

To me it seemed that this comment was meant to justify the CostPerUse = [1] which I am removing, but no problem to keep them anyway.

MatzeB updated this revision to Diff 464388.Sep 30 2022, 1:49 PM
MatzeB marked an inline comment as done.

rebase, update some tests missed the first time.

MatzeB edited the summary of this revision. (Show Details)Sep 30 2022, 1:50 PM
MatzeB edited the summary of this revision. (Show Details)
MatzeB edited the summary of this revision. (Show Details)
This revision was landed with ongoing or failed builds.Sep 30 2022, 4:03 PM
This revision was automatically updated to reflect the committed changes.
mtrofin added a comment.EditedSep 30 2022, 5:28 PM

was just sending the fix or the mlgo bit, thanks for looking into it!

I'm not sure I have 100% fixed it, we'll see :)

Also thinking about things right now, I realized that my change probably lowers the amount of times ragreedy asks for eviction decisions. I hope that doesn't reduce the quality of ML regalloc, maybe you can keep on eye on that on your end.

(I'm happy to revert if it decreases quality, though we have to find a better way than somewhat "arbitrary" CostPerUse / high register numbers triggering things long-term...)

I'm not sure I have 100% fixed it, we'll see :)

Also thinking about things right now, I realized that my change probably lowers the amount of times ragreedy asks for eviction decisions. I hope that doesn't reduce the quality of ML regalloc, maybe you can keep on eye on that on your end.

(I'm happy to revert if it decreases quality, though we have to find a better way than somewhat "arbitrary" CostPerUse / high register numbers triggering things long-term...)

I tried it with your change already, didn't see anything for the workload I looked at - anyway, we can always retrain it if needed.

Seems we need your changed to dev-mode-logging.ll after all, will push them now.

skan added a subscriber: skan.Feb 1 2023, 4:03 AM