This is an archive of the discontinued LLVM Phabricator instance.

Allow rematerialization of virtual reg uses
ClosedPublic

Authored by rampitec on Jul 20 2021, 3:57 PM.

Details

Summary

Currently the isReallyTriviallyReMaterializableGeneric() implementation
prevents rematerialization of any instruction with a virtual register use,
on the grounds that this is not a trivial rematerialization and that we do
not want to extend live ranges.

It appears that the LRE (LiveRangeEdit) logic does not attempt to extend
the live range of a source register for rematerialization, so that is not
an issue; this is checked in LiveRangeEdit::allUsesAvailableAt().

The only non-trivial aspect is accounting for tied defs, which normally
represent a read-modify-write operation and are therefore not
rematerializable.
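
For illustration, the relaxed check described above could be sketched roughly as follows; this is a simplified sketch rather than the exact upstream code, and the helper name is made up for this example:

#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineOperand.h"

using namespace llvm;

// Simplified operand scan: virtual register uses are accepted, tied operands
// (a read-modify-write of the destination) are rejected, and physical
// registers are conservatively rejected here instead of getting the original,
// more involved handling.
static bool rematOperandsLookTrivial(const MachineInstr &MI) {
  for (const MachineOperand &MO : MI.operands()) {
    if (!MO.isReg())
      continue;
    // A tied use reads the register that is also redefined, so the value
    // cannot be recomputed from the operands alone.
    if (MO.isUse() && MO.isTied())
      return false;
    if (MO.getReg().isPhysical())
      return false;
    // Virtual register uses are fine; LiveRangeEdit::allUsesAvailableAt()
    // later verifies they are live at the rematerialization point.
  }
  return true;
}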

A test for the tied-def situation already exists in
/CodeGen/AMDGPU/remat-vop.mir:
test_no_remat_v_cvt_f32_i32_sdwa_dst_unused_preserve.

The change has affected ARM/Thumb, Mips, RISC-V, and x86. For the targets
where I more or less understand the asm it seems to reduce spilling
(as expected) or be neutral. However, it needs review by specialists for
all the affected targets.

Diff Detail

Event Timeline

rampitec created this revision.Jul 20 2021, 3:57 PM
rampitec requested review of this revision.Jul 20 2021, 3:57 PM
Herald added a project: Restricted Project.Jul 20 2021, 3:57 PM
Herald added subscribers: MaskRay, wdng.

Note: this needs D106396 to prevent rematerialization in the coalescer when not all registers are available.

asb added a comment.Jul 22 2021, 3:52 AM

From an initial look at the RISC-V test changes, none of them seem concerning to me.

RKSimon added inline comments.Jul 22 2021, 10:19 AM
llvm/test/CodeGen/X86/licm-regpressure.ll
13

Add checks for the lea?

jrtc27 added inline comments.Jul 22 2021, 10:26 AM
llvm/test/CodeGen/X86/licm-regpressure.ll
13

Or fix the test so it still demonstrates the bug, otherwise this file no longer serves much of a purpose...

rampitec added inline comments.Jul 22 2021, 10:36 AM
llvm/test/CodeGen/X86/licm-regpressure.ll
13

The question is: should LICM really check register pressure if we can rely on rematerialization instead? I started this with a very similar situation in AMDGPU: LICM was hoisting instructions out of the loop and we ended up spilling inside the loop. Proper rematerialization fixes this without burdening LICM. I.e., this change will probably fix the bug itself.

jrtc27 added inline comments.Jul 22 2021, 10:37 AM
llvm/test/CodeGen/X86/licm-regpressure.ll
13

If this fixes the bug completely then rewrite the comments to reflect that and tag the revision as closing the bug. If it doesn't fix the bug in all cases then this test needs updating to be a case that's still broken.

efriedma edited reviewers, added: dmgreen; removed: greened.Jul 22 2021, 11:05 AM
efriedma added inline comments.
llvm/test/CodeGen/Thumb2/LowOverheadLoops/memcall.ll
8 (On Diff #360302)

This looks like a regression. Not sure what's happening here; we're saving both r7 and r8, but they aren't used. Maybe something related to the hardware loop instructions?

rampitec updated this revision to Diff 360904.Jul 22 2021, 11:33 AM

Updated test X86/licm-regpressure.ll

rampitec marked 3 inline comments as done.Jul 22 2021, 11:35 AM
rampitec added inline comments.
llvm/test/CodeGen/X86/licm-regpressure.ll
13

Looking closely, I think we cannot reproduce the bug by inspecting the asm, because rematerialization mitigates it. However, the issue reported in PR23143 still exists and can be explored by inspecting the MIR. Updated the test accordingly.

rampitec marked an inline comment as done.Jul 22 2021, 12:07 PM
rampitec added inline comments.
llvm/test/CodeGen/Thumb2/LowOverheadLoops/memcall.ll
8 (On Diff #360302)

Here is what happens: these two instructions are now isTriviallyReMaterializable():

%16:rgpr = t2ADDri %8:rgpr, 15, 14, $noreg, $noreg
%17:rgpr = t2LSRri %16:rgpr, 4, 14, $noreg, $noreg

MachineLICM hoists these out of the loop because of that. RA uses r8 and r8 ends up in the frame setup:

bb.0.entry:
  successors: %bb.1(0x50000000), %bb.2(0x30000000); %bb.1(62.50%), %bb.2(37.50%)
  liveins: $r0, $r1, $r2, $r3, $r4, $r5, $r6, $r7, $r8, $lr
  $sp = frame-setup t2STMDB_UPD $sp(tied-def 0), 14, $noreg, killed $r4, killed $r5, killed $r6, killed $r7, killed $r8, killed $lr
  frame-setup CFI_INSTRUCTION def_cfa_offset 24
  frame-setup CFI_INSTRUCTION offset $lr, -4
  frame-setup CFI_INSTRUCTION offset $r8, -8
  frame-setup CFI_INSTRUCTION offset $r7, -12
  frame-setup CFI_INSTRUCTION offset $r6, -16
  frame-setup CFI_INSTRUCTION offset $r5, -20
  frame-setup CFI_INSTRUCTION offset $r4, -24
  t2CMPri renamable $r2, 1, 14, $noreg, implicit-def $cpsr
  t2Bcc %bb.2, 11, killed $cpsr
  t2B %bb.1, 14, $noreg

bb.1.for.body.preheader:
; predecessors: %bb.0
  successors: %bb.3(0x80000000); %bb.3(100.00%)
  liveins: $r0, $r1, $r2, $r3
  renamable $r12 = nsw t2LSLri renamable $r3, 2, 14, $noreg, $noreg
  renamable $r4 = t2MOVi 0, 14, $noreg, $noreg
  renamable $r7 = t2ADDri renamable $r3, 15, 14, $noreg, $noreg
  renamable $r8 = t2LSRri killed renamable $r7, 4, 14, $noreg, $noreg
  t2B %bb.3, 14, $noreg

But it is eliminated by the ARM Low Overhead Loops pass:

# *** IR Dump After ARM Low Overhead Loops pass (arm-low-overhead-loops) ***:

bb.1.for.body.preheader:
; predecessors: %bb.0
  successors: %bb.2(0x80000000); %bb.2(100.00%)
  liveins: $r0, $r1, $r2, $r3
  renamable $r12 = nsw t2LSLri renamable $r3, 2, 14, $noreg, $noreg
  renamable $r4, dead $cpsr = tMOVi8 0, 14, $noreg
  tB %bb.2, 14, $noreg

The frame setup, however, has already been emitted and is not updated.

Maybe pass an extra argument into isTriviallyReMaterializable()? For MachineLICM and CalcSpillWeights it could return false if any virtual registers are used, and only return true for the register allocator/coalescer itself.

efriedma added inline comments.Jul 22 2021, 12:28 PM
llvm/test/CodeGen/Thumb2/LowOverheadLoops/memcall.ll
8 (On Diff #360302)

That's unfortunate.

In general, the hoisting is probably fine. The problem here is that if the instructions are used as input to the low-overhead loop pseudo-instructions, we don't want to hoist them: they're likely to be eliminated by the LowOverheadLoops pass, so it isn't profitable. (The low-overhead loop instructions get formed very late because they have odd restrictions on what branches are allowed.)

rampitec added inline comments.Jul 22 2021, 1:07 PM
llvm/test/CodeGen/Thumb2/LowOverheadLoops/memcall.ll
8 (On Diff #360302)

I am not really sure it is fine to hoist in this case, because rematerialization of an instruction with vreg uses is not guaranteed. All used registers must be available at the point of rematerialization, and if they are not, it simply will not happen. MachineLICM takes rematerialization for granted, but in this case it is rather opportunistic.

I am looking at a change not to hoist such instructions right now.

rampitec updated this revision to Diff 360960.Jul 22 2021, 1:54 PM
rampitec marked 2 inline comments as done.
rampitec edited the summary of this revision. (Show Details)

Added an argument to isTriviallyReMaterializable() to prevent hoisting of instructions in the expectation that they can later be rematerialized. Only opportunistic rematerialization in RA is allowed.

@arsenm note, we have been reporting instructions with such uses as trivially rematerializable for a long time in the case of VALU moves, and now for even more VALU instructions; we only do not do so for SALU. I think we will need the same AllowVRegs check in our isReallyTriviallyReMaterializable to prevent LICM from hoisting these under high register pressure.

Although that is practically unexploitable, because MachineLICM always thinks there is no high pressure for whatever SALU I feed it from our target, and because MachineLoop::isLoopInvariant() always returns false for any VALU due to the exec use. So it always hoists SALU, no matter what, and never hoists VALU. I.e., in practice this change should not pessimize our codegen, because it will not affect LICM for us.

Tail predication being awkward aside - do you have performance results for this patch? Something to show it is beneficial to do over a set of benchmarks.

I had earlier benchmarked this patch for an embedded RISC-V system (Ibex simple system). Not a comprehensive set of results at all, but in case it's useful:

  • For CoreMark it had negligible size impact. At Oz and Os it had a small performance benefit (+1.17% and +0.178%, respectively). At O2 and O3 the performance difference was negligible (-0.016% and +0.038%, respectively).
  • For Embench it had no performance impact but improved size a bit, overall. The best case was O2, where the size was overall reduced by 440 bytes, or 0.276%. Most programs of that benchmark suite had minor differences, some positive and some negative, but for picojpeg it shaved up to 384 bytes.

(Raw results here, in a confusing format: https://gist.github.com/luismarques/01948685c6a1d16ab1c1a0229252b0f1. Please ignore the absolute performance values).

Thanks for measuring this!

For the AMDGPU target there is no visible change just yet; it will need at least one more patch to make a difference there. The reason is that we have effectively been doing this for years already because of the custom isReallyTriviallyReMaterializable implementation, which skips the check for virtual registers. That needs to be updated in a way similar to this patch, so that we avoid MachineLICM hoisting done on the grounds that an instruction is rematerializable without any check that the rematerialization will really happen. Beyond that, I have no means to measure other targets.

That said, AMDGPU is a target where any spilling is extremely expensive; not just expensive as on CPU targets, but far more so. What I see as a result of this patch is a decrease in spilling across the board, which must be a good thing even for a less sensitive target. Code size might increase, though, which is generally the case with any rematerialization.

OK. I was hoping you would be able to say this is very important for your performance, and have some data to back that up.

The results I have here look less great... but looking again, in general it looks OK. Code size was either flat or slightly smaller, which is good. Performance went up and down depending on the test. The results are not great in places, but looking at some of the decreases it may just be bad luck: spilling in hot loops where it didn't in the past, and behaving a bit chaotically from differences in register allocation. There are some improvements too, making it generally OK overall.

llvm/test/CodeGen/Thumb2/ldr-str-imm12.ll
103–104

Change this to check for a strd? Or just update the check lines.

OK. I was hoping you would be able to say this is very important for your performance, and have some data to back that up.

I'd love to say that, but in fact we have a Gordian knot of problems, and AMDGPU is the backend least affected by this change:

  • We are already doing it for VOP instructions, which are the most important kind for AMDGPU, since our implementation of isReallyTriviallyReMaterializable, unlike isReallyTriviallyReMaterializableGeneric, does not check for virtual registers; so no change there.
  • The other kind of rematerializable instructions is SOP, but most of those cannot be rematerialized at all because of physreg defs or subreg defs, so the impact on SOP will not be huge.
  • There would be a difference if MachineLICM hoisted VOP instructions and it then became impossible to rematerialize them, but that simply does not happen, because all VOPs have an implicit $exec use which prevents hoisting anyway.
  • The latter is very simple to fix with the new isIgnorableUse target callback, but if I do that without this patch it will blindly hoist instructions which will not be rematerialized later.
  • In fact, MachineLICM's estimate of register pressure impact needs to be improved, way improved. So even after all of that, such hoisting may still have a negative impact on performance if enabled.

All of that seems to need a lot of cleanup.

rampitec updated this revision to Diff 364166.Aug 4 2021, 9:57 AM
rampitec marked an inline comment as done.

Updated thumb2/ldr-str-imm12.ll test checks.

Thanks. This sounds OK for ARM to me.

llvm/test/CodeGen/Thumb2/ldr-str-imm12.ll
100

On second thoughts, can you just run the update_llc_test_checks on this file?

The strd is just two adjacent str instructions glommed together, so this is checking for more stores than before. But there appear to be extra stores in the original, including an stm. From the comment at the top of the file it is not very clear what this is trying to test now. Just updating the test checks so that we see the whole function sounds like the best idea.

rampitec added inline comments.Aug 5 2021, 2:27 AM
llvm/test/CodeGen/Thumb2/ldr-str-imm12.ll
100

Maybe switch to generated checks, precommit that, and update the test in the review? That way you will see better what has changed. That is more or less what I've been doing while looking at it.

JFYI, I am not well versed in the Thumb ISA, so I am guessing at what has changed.

rampitec updated this revision to Diff 364576.Aug 5 2021, 12:21 PM
rampitec marked an inline comment as done.

Rebased on top of D107590 and regenerated ldr-str-imm12.ll checks.

dmgreen added inline comments.Aug 5 2021, 2:56 PM
llvm/test/CodeGen/Thumb2/ldr-str-imm12.ll
100

Yeah, I had taken a look to see how it's changing. It looks fine to me.

Thanks for updating the test checks. It looks like a more useful test now.

arsenm added inline comments.Aug 6 2021, 4:17 PM
llvm/include/llvm/CodeGen/TargetInstrInfo.h
120–125

I don't understand why AllowVRegs needs to be a parameter. Why wouldn't it just always be true?

120–125

I.e., this is rematerializable; the pressure question is a different heuristic.

rampitec added inline comments.Aug 6 2021, 4:26 PM
llvm/include/llvm/CodeGen/TargetInstrInfo.h
120–125

The problem is that LICM will hoist an instruction if it is rematerializable, even without checking the pressure (not that its pressure calculation is great anyway), simply because if we run out of registers it will be easy to rematerialize the instruction. This is not the case with vreg uses: even when possible, RA will not do it if it would need to extend a live range. Rematerialization in RA is opportunistic, not guaranteed. Besides, it is not always even possible to rematerialize such instructions, even with all checks passed.

arsenm added inline comments.Aug 6 2021, 4:36 PM
llvm/include/llvm/CodeGen/TargetInstrInfo.h
120–125

But why can't LICM check whether there are vregs on the instruction itself? Why thread this through the legality check?

rampitec added inline comments.Aug 6 2021, 4:53 PM
llvm/include/llvm/CodeGen/TargetInstrInfo.h
120–125

A good idea, thanks! I can do it as a separate patch preceding this one and instead of D107677.

rampitec updated this revision to Diff 365232.Aug 9 2021, 11:26 AM
rampitec edited the summary of this revision. (Show Details)

Avoid the predicate argument for MachineLICM; MachineLICM itself is updated instead in D107677. Rebased on top of D107677.

rampitec updated this revision to Diff 366072.Aug 12 2021, 12:24 PM

Updated the comments on isTriviallyReMaterializable and isReallyTriviallyReMaterializable. They have not been exactly true for a long time already, at least for AMDGPU and ARM.

Have you gone through the uses of isTriviallyReMaterializable to check they line up to the new semantics? It looks like it's used in CalcSpillWeight, LiveRangeEdits (you've fixed recently?), MachineLICM (is getting an updated costmodel), Register Coalescer, shouldRegionSplitForVirtReg, WebAssemblyRegStackify. Are they all expected to keep working with the change here?

I feel like the new behavior isn't really "Trivial" any more, and it may be worth keeping the old method for trivial cases, which is just the base case plus a virtual reg use check. Essentially MachineLICMBase::isTriviallyReMaterializable from D107677. I'm not sure what to call the new method though, and not sure it's worth it if all the uses above are OK as-is.
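
For reference, the "base case plus a virtual reg use check" could look roughly like the sketch below. This is only an approximation of the idea, not the exact code from D107677, and the function name is made up here:

#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/TargetInstrInfo.h"

using namespace llvm;

// LICM-style guard: treat an instruction as freely hoistable on
// rematerialization grounds only if the target reports it as trivially
// rematerializable and it has no virtual register uses, since a use that is
// unavailable at the remat point would block rematerialization after hoisting.
static bool isGuaranteedTriviallyReMaterializable(const MachineInstr &MI,
                                                  const TargetInstrInfo &TII) {
  if (!TII.isTriviallyReMaterializable(MI))
    return false;
  for (const MachineOperand &MO : MI.operands())
    if (MO.isReg() && MO.isUse() && MO.getReg().isVirtual())
      return false;
  return true;
}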

Have you gone through the uses of isTriviallyReMaterializable to check they line up to the new semantics?

I believe I did:

It looks like it's used in CalcSpillWeight,

Checked; the spill weight calculation matches what LiveRangeEdit/RA in general do. Prohibiting vreg uses there did a worse job.

LiveRangeEdits (you've fixed recently?)

LRE has handled this for a long time already; it has a check in allUsesAvailableAt(). AMDGPU allows vreg uses, as does ARM for VCTP. This is already the case.

MachineLICM (is getting an updated costmodel),

Yes, D107677.

Register Coalescer,

One corner case was fixed in D106396; otherwise it should work.

shouldRegionSplitForVirtReg,

Part of the RA logic; it works.

WebAssemblyRegStackify.

It allows rematerialization of CONST_* opcodes only. These have immediate operands and no register uses at all. It should work, and there are no test changes.

Are they all expected to keep working with the change here?

Overall, yes. The only prerequisite at this point is the MachineLICM change in D107677.

I feel like the new behavior isn't really "Trivial" any more, and it may be worth keeping the old method for trivial cases, which is just the base case plus a virtual reg use check. Essentially MachineLICMBase::isTriviallyReMaterializable from D107677. I'm not sure what to call the new method though, and not sure it's worth it if all the uses above are OK as-is.

The uses above are OK as far as I can tell. I can see how it would be possible to wrap one method in another, but as usual I have a problem with names. We already have isRematerializable, isTriviallyRematerializable, and isReallyTriviallyRematerializable; creating something like isReallyReallyTriviallyRematerializable looks like going down a rabbit hole ;) But I am open to opinions.

dmgreen accepted this revision.Aug 16 2021, 11:40 AM

Nice one

I was thinking of something like "isSimpleRematerializable" vs "isTriviallyRematerializable" (but I didn't really like "simple"). If we don't have a user of it (other than MachineLICM, which has its own function), I'd say it is fine as-is.

Thanks for checking. Provided there are no other comments, this SGTM.

This revision is now accepted and ready to land.Aug 16 2021, 11:40 AM
rampitec marked 4 inline comments as done.Aug 16 2021, 12:03 PM
This revision was landed with ongoing or failed builds.Aug 16 2021, 12:53 PM
This revision was automatically updated to reflect the committed changes.
phosek added a subscriber: phosek.Aug 17 2021, 7:18 PM

We started seeing a Clang failure when building our codebase with ASan, and bisecting identified this change as the culprit; see PR51516 for details and a reproducer. Would it be possible to revert this change until that issue is resolved?

Petr, I am AFK for today. Please revert it, or I will do it tomorrow morning. Thanks for narrowing it down.

Thanks Petr. Reproduced the error.

rampitec reopened this revision.Aug 20 2021, 10:50 AM

Proposed fix for https://bugs.llvm.org/show_bug.cgi?id=51516: D108475.
This needs that fix before it can be relanded.

This revision is now accepted and ready to land.Aug 20 2021, 10:51 AM
lkail added a subscriber: lkail.Aug 24 2021, 10:22 AM
rampitec updated this revision to Diff 368402.Aug 24 2021, 11:02 AM

Rebased to reland.

This revision was landed with ongoing or failed builds.Aug 24 2021, 11:09 AM
This revision was automatically updated to reflect the committed changes.

This might be related to https://bugs.llvm.org/show_bug.cgi?id=51655. Do you mind checking it out?

What happened to it? lldb is green now. Also, do you mind attaching the .ll to the bug? For some reason I cannot build compiler-rt with sanitizers right now and am still trying to sort that out, but none of that is needed to check RA if there is an .ll.

This is an expensive-checks build that verifies machine instructions after each pass (I think), so you probably won't see it on the lldb job. I'll try to grab you an .ll.

Thanks! This object seems to build fine to me with expensive checks, so I am not able to reproduce it myself.

I mean, it seems intermittent. It happened with build 20218, but then 20220 didn't have it, build 20221 was green, and then it reappeared in https://green.lab.llvm.org/green/job/clang-stage1-cmake-RA-expensive/.

wxiao3 added a subscriber: wxiao3.Sep 14 2021, 11:20 PM

I just want to report that I tried this patch on large internal workloads (for X86) and saw some good performance. The generated code cleaned up nicely: I saw 350K new hits, almost all of them lea instructions, and observed a 3% to 5% reduction in executed loads/stores. Can the patch be approved/landed?

Thanks for confirming, glad to hear this! This was landed about a month ago.

wxiao3 added a comment.EditedSep 22 2021, 12:01 AM

hi,

There is some internal benchmark performance regression after the patch landed.
The story is as follows:
There is a virtual register, %5, which lives across many basic blocks (some of which are inside loops).
Before this patch landed, its weight was 2.299834e-03, as below:

%5 [592r,3248B:0)[3312B,3392B:0)[3456B,3616B:0)[3712B,4368B:0)[4464B,5008B:0)[5072B,5328B:0)[6208B,6480B:0)[6560B,6640B:0)[6720B,6976B:0)[7104B,8224B:0)[8336B,9104B:0)[9184B,9488B:0)[10512B,13904B:0)[14480B,16320B:0)  0@592r weight:2.299834e-03

Its definition is as below:

592B      %5:gr64 = LEA64r %523:gr64, 1, %0:gr64_nosp, 0, $noreg

After this patch landed, its weight was halved to 1.149917e-03, as below:

%5 [592r,3248B:0)[3312B,3392B:0)[3456B,3616B:0)[3712B,4368B:0)[4464B,5008B:0)[5072B,5328B:0)[6208B,6480B:0)[6560B,6640B:0)[6720B,6976B:0)[7104B,8224B:0)[8336B,9104B:0)[9184B,9488B:0)[10512B,13904B:0)[14480B,16320B:0)  0@592r weight:1.149917e-03

Finally, %5 is evicted by another VR, which did not happen before this patch landed.

For this case, treating %5 as ReMaterializable and reducing its weight is very bad.

Any ideas on how to fix the performance regression?

Wei

It is now considered rematerializable, which has halved the spill weight, because it could be rematerialized instead of spilled. I guess in the end it was not rematerialized. I'd start by checking what prevented that; most likely %523 or %0 is not available at the point of rematerialization.

I was experimenting with avoiding the division by 2 at the end of VirtRegAuxInfo::weightCalcHelper() when there are vreg uses, but actually got worse results. Since this is a heuristic for calculating weights, it might make sense to add a little extra weight for every vreg use, I guess.

Maybe something like this, if you have a good test for it: https://reviews.llvm.org/differential/diff/374324/

Obviously 0.01 there is a pure guesstimate.
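
The shape of the adjustment under discussion might look like the sketch below. This is illustrative only: it does not reproduce VirtRegAuxInfo::weightCalcHelper(), and the helper name and the 0.01 factor are just the guesstimate from the linked diff:

#include "llvm/CodeGen/MachineInstr.h"

using namespace llvm;

// Keep the usual halving of the spill weight for rematerializable intervals,
// but give a little weight back for each virtual register use of the defining
// instruction, since such uses make the rematerialization less certain to
// actually happen.
static float adjustRematSpillWeight(float TotalWeight,
                                    const MachineInstr &DefMI) {
  float Weight = TotalWeight * 0.5F;
  for (const MachineOperand &MO : DefMI.operands())
    if (MO.isReg() && MO.isUse() && MO.getReg().isVirtual())
      Weight += TotalWeight * 0.01F;
  return Weight;
}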

LuoYuanke added inline comments.Sep 22 2021, 8:25 PM
llvm/lib/CodeGen/TargetInstrInfo.cpp
990

The comment looks reasonable to me: the use register's live range is extended if we rematerialize the def register, and removing this code is not friendly to RA. Do you have any data showing that removing it has no side effects, only benefits?

rampitec added inline comments.Sep 23 2021, 12:04 PM
llvm/lib/CodeGen/TargetInstrInfo.cpp
990

The comment is far from reality. RA does not extend use live ranges; instead, it checks that all uses are available at the point of rematerialization and does not rematerialize if they are not. There are quite a number of cases where we have less spilling now.
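
To make that concrete, here is a simplified paraphrase of the availability check; the real LiveRangeEdit::allUsesAvailableAt() handles more cases (e.g. undef operands and reserved registers) and uses more precise slot queries:

#include "llvm/CodeGen/LiveIntervals.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/SlotIndexes.h"

using namespace llvm;

// For every virtual register the original instruction reads, the value live
// at the rematerialization point must be the same value that was live at the
// original definition; otherwise rematerialization is refused, and no live
// range is extended.
static bool usesAvailableAt(const MachineInstr &OrigMI, SlotIndex OrigIdx,
                            SlotIndex UseIdx, LiveIntervals &LIS) {
  for (const MachineOperand &MO : OrigMI.operands()) {
    if (!MO.isReg() || !MO.isUse() || !MO.getReg().isVirtual())
      continue;
    const LiveInterval &LI = LIS.getInterval(MO.getReg());
    const VNInfo *OrigVNI = LI.getVNInfoAt(OrigIdx);
    if (!OrigVNI || OrigVNI != LI.getVNInfoAt(UseIdx))
      return false;
  }
  return true;
}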

Obviously 0.01 there is a pure guesstimate.

To fix one of my local performance regressions, I need to set it to 0.1 at least.
But tuning the value seems meaningless, because:
Without your patch, isReallyTriviallyReMaterializableGeneric return true means that the VR is definitely ReMaterializable.
With your patch, isReallyTriviallyReMaterializableGeneric return true means that the VR is probably ReMaterializable.
I don't think the change is consistent with this function's original design goal.
Moreover, your patch conflicts with our target implementation, X86InstrInfo::isReallyTriviallyReMaterializable (llvm/lib/Target/X86/X86InstrInfo.cpp). E.g., for most LEA instructions our implementation will return false, while your patch will always return true if the LEA uses virtual registers. Your patch renders our target implementation useless.

it seems to reduce spilling (as expected) or be neutral

I don't observe any data proving it's good for performance in our internal performance tracking system running on various hardware. Instead, your patch brings a -1.5% drop for CPU2017rerf/557.xz on the server side and a -2% drop for coremark-pro/core on the desktop side. In scenarios with high register pressure, such as loops with a lot of LiveIntervals, optimistically assuming a critical VR (LiveInterval) is rematerializable when in reality it is not makes RA tend to spill that critical VR, which is definitely bad for performance.

Could you please revert this change, or at least make it not impact the X86 CPU backend? I don't observe the conflicting change bringing any benefit, but it results in at least two regressions for the X86 CPU backend.

Since this is the second performance regression report, I will revert it.

I still consider this a good experiment, as it has uncovered a couple of RA bugs in the process.

Yes, it's a great experiment.

Thanks!
Wei

vporpo added a subscriber: vporpo.Oct 11 2021, 7:53 PM

Just to add that this patch caused a huge compilation time increase in some workloads, with the majority of time spent in eliminateDeadDefs() called by rematerializeAll(). The compilation time regression was gone after this patch was reverted.

fhahn added a subscriber: fhahn.Oct 12 2021, 3:09 AM

Would it be possible to share a reproducer for the issue? Without one, it will be hard to track down the compile-time problem.

foad added a subscriber: foad.Oct 20 2021, 2:23 AM
Carrot added a subscriber: Carrot.May 16 2022, 11:24 AM

@rampitec, any follow up on this patch?

Not planned. It was reverted, and an AMDGPU-specific patch was landed instead. I do not have the capacity to fix all regressions, including performance regressions, on all targets.

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-bswap.ll