This is an archive of the discontinued LLVM Phabricator instance.

[Regalloc][WIP] Increase CSR cost in RegAllocGreedy to favour splitting/spill over CSR first use
Needs ReviewPublic

Authored by lkail on Aug 22 2019, 1:13 AM.

Details

Summary

It's trying to solve the issue which was proposed at D32201 and D27366 before.

The register allocator currently favors allocating a CSR over splitting/spilling a region. This in effect results in introducing more opportunity to save/restore CSR registers at prologue/epilogue. It hurts performance especially when there is no need of prologue/epilogue for the hot path. If we can save/restore non-CSR across call, then we don't use CSR during the function and we don't need to save/restore them at prologue/epilogue.

It also influences the shrink-wrapping directly if we must use CSR during the function. For now copies of parameter registers into CSR's in the entry block when the parameter is live across any calls in the function. And of course, this disables shrink-wrapping because the save point then must be the entry block. After the patch, live ranges allocated to CSR registers will be shortened and will only be used in cold places, so that we may get better chances to do shrink-wrapping.

I refine the adjusting CSR cost to enhance the RegAllocGreedy to favour splitting/spill the virtual register which is being allocated instead of allocate CSR directly. If use CSR, we need add load/store pair in prologue/epilogue. So if the cost caused by spillings which is introduced by trying to avoid using CSR is larger than load/store pair in prologue/epilogue, we would choose to use CSR. I make the raw cost to be 1 because the cost is load/store pair in prologue/epilogue if use CSR. so the cost is 1 time spill like the cost calculated in SpillPlacer. Then we need scale the cost relative to entry frequency.

It can directly help to expand the opportunity exposed to shrink-wrapping at later phase because it causes the use of CSR is delayed to the BB with the call instead of entry BB.It's trying to avoid the magic cost number for different target. The CSRCost is a relative times corresponding to the spill operation in prologue/epilogue.

Diff Detail

Event Timeline

wuzish created this revision.Aug 22 2019, 1:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 22 2019, 1:13 AM

This is a WIP draft and I have not fixed all check-all cases now. I want to get some quick feedback whether it's on the right direction and whether its performance is good at all different target. If you have time, could you please review it or give some comments, and help me to verify the performance at other targets?

xbolva00 added a subscriber: xbolva00.EditedAug 22 2019, 1:24 AM

Can you add tests for atleast x86/arm and precommit them + rebase?

dexonsmith resigned from this revision.Aug 25 2019, 8:17 AM
wuzish edited the summary of this revision. (Show Details)
wuzish added a reviewer: xbolva00.

Enable it for AARCH64 and x86, and address testcases.

Could anybody run some bmk to see the performance result of AARCH64 and X86?
I have run on PowerPC, it has good influence in some bmks of spec2017 and no obvious degression.

Can the patch description please be updated to contain a dumbed down explanation of the issue this is solving?

+ @dmgreen for arm benchmarks

wuzish edited the summary of this revision. (Show Details)Sep 2 2019, 1:05 AM
wuzish added a reviewer: dmgreen.
wuzish added a reviewer: Restricted Project.Sep 2 2019, 10:16 PM

gentle pin...

.AMDGPU also override the getCSRFirstUseCost() but your patch didn't catch that. And would you please post some improve number for powerpc of this patch ?

llvm/test/CodeGen/PowerPC/tail-dup-break-cfg.ll
2

I suggest that, you should commit a NFC patch to update the CHCK first.

.AMDGPU also override the getCSRFirstUseCost() but your patch didn't catch that. And would you please post some improve number for powerpc of this patch ?

Not done; would be good to have some perf numbers here, for ppc and x86

.AMDGPU also override the getCSRFirstUseCost() but your patch didn't catch that. And would you please post some improve number for powerpc of this patch ?

Not done; would be good to have some perf numbers here, for ppc and x86

We have test this patch(getCSRFirstUseCost () return 1) on PowerPC.
For spec base, there are 6 cases has improved more that 1%, the largest improvement case is 3.63%, no case degraded more than 1%.
For spec peak, there are 5 cases has improved more that 1%, the largest improvement case is 5.9%, only one case degraded more than 1%(1.76%).
Overall, the base & peak reset has been improved after this patch on PPC.

ZhangKang commandeered this revision.Oct 11 2019, 2:39 AM
ZhangKang edited reviewers, added: wuzish; removed: ZhangKang.

Do you have numbers also for x86?

Do you have numbers also for x86?

No, I don't have. I'm sorry that I have no x86 test machine.

Do you have numbers also for x86?

@xbolva00 , I think this patch should have a good performance on X86, but I don't have a X86 performance machine to confirm it, Could you help me to test the spec performance for this patch on X86 or tell me who I can ask for help to do the test?

I can.. but next week.

Maybe @Carrot @evandro @fhahn could check it too.

lebedev.ri added inline comments.Oct 18 2019, 2:32 AM
llvm/test/CodeGen/PowerPC/tail-dup-layout.ll
116–136 ↗(On Diff #217086)

Please precommit all NFC test regenerations first, and rebase the patch.

I can.. but next week.

Maybe @Carrot @evandro @fhahn could check it too.

@xbolva00 , I am very glad to hear that you can do this test. Thank you very much!!!

ZhangKang marked 2 inline comments as done.Oct 23 2019, 6:23 PM
ZhangKang added inline comments.
llvm/test/CodeGen/PowerPC/tail-dup-break-cfg.ll
2

This is a WIP draft and have not fixed all check-all cases now. I want to get some quick feedback whether it's on the right direction and whether its performance is good at all different target. If this patch is OK for others, I will commit a NFC patch to update the CHECK.

Tested on x86-64, -O3, spec 2006. Observed small improvements here and there.. Overall spec int score +0.2.

bzip possibly regressed a bit.. 1%. But it could be noise..

Without patch
347 346 346

With patch
351 351 352

phlav added a subscriber: phlav.Oct 25 2019, 10:03 AM
ZhangKang marked an inline comment as done.

Update the test cases using the tool update_llc_test_checks.py.

ZhangKang updated this revision to Diff 227018.Oct 29 2019, 7:44 PM

Use the script to update the test cases.

lkail commandeered this revision.Sep 9 2020, 11:32 PM
lkail added a reviewer: ZhangKang.