Page MenuHomePhabricator

Allow rematerialization of virtual reg uses
ClosedPublic

Authored by rampitec on Jul 20 2021, 3:57 PM.

Details

Summary

Currently isReallyTriviallyReMaterializableGeneric() implementation
prevents rematerialization on any virtual register use on the grounds
that is not a trivial rematerialization and that we do not want to
extend liveranges.

It appears that LRE logic does not attempt to extend a liverange of
a source register for rematerialization so that is not an issue.
That is checked in the LiveRangeEdit::allUsesAvailableAt().

The only non-trivial aspect of it is accounting for tied-defs which
normally represent a read-modify-write operation and not rematerializable.

The test for a tied-def situation already exists in the
/CodeGen/AMDGPU/remat-vop.mir,
test_no_remat_v_cvt_f32_i32_sdwa_dst_unused_preserve.

The change has affected ARM/Thumb, Mips, RISCV, and x86. For the targets
where I more or less understand the asm it seems to reduce spilling
(as expected) or be neutral. However, it needs a review by all targets'
specialists.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
dmgreen added inline comments.Aug 5 2021, 2:56 PM
llvm/test/CodeGen/Thumb2/ldr-str-imm12.ll
100

Yeah, I had taken a look to see how it's changing. It looks fine to me.

Thanks for updating the test checks. It looks like a more useful test now.

arsenm added inline comments.Aug 6 2021, 4:17 PM
llvm/include/llvm/CodeGen/TargetInstrInfo.h
120–125

I don't understand why AllowVRegs needs to be a parameter. Why wouldn't it just always true?

120–125

i.e. this is rematerializable, the pressure question is a different heuristic

rampitec added inline comments.Aug 6 2021, 4:26 PM
llvm/include/llvm/CodeGen/TargetInstrInfo.h
120–125

The problem is licm will hoist instructions if rematerializable even without checking the pressure (not saying its pressure calculation is not great) simply because if we run out of registers it will be easy to rematerialize the instruction. This is not the case with vreg uses, even if possible RA will not do it if it needs extending LR. The rematerialization in RA is opportunistic and not granted. Besides it is not even always possible to rematerialize such instructions even with all checks passed.

arsenm added inline comments.Aug 6 2021, 4:36 PM
llvm/include/llvm/CodeGen/TargetInstrInfo.h
120–125

But why can't licm check if there are vregs on the instruction itself? Why thread this through the legality check?

rampitec added inline comments.Aug 6 2021, 4:53 PM
llvm/include/llvm/CodeGen/TargetInstrInfo.h
120–125

A good idea, thanks! I can do it as a separate patch preceding this one and instead of D107677.

rampitec updated this revision to Diff 365232.Aug 9 2021, 11:26 AM
rampitec edited the summary of this revision. (Show Details)

Avoid predication for MachineLICM. MachineLICM itself is updated instead in D107677. Rebased on top of D107677.

rampitec updated this revision to Diff 366072.Aug 12 2021, 12:24 PM

Updated comments on isTriviallyReMaterializable and isReallyTriviallyReMaterializable. These are not exactly true for a long time already at least for AMDGPU and ARM.

Have you gone through the uses of isTriviallyReMaterializable to check they line up to the new semantics? It looks like it's used in CalcSpillWeight, LiveRangeEdits (you've fixed recently?), MachineLICM (is getting an updated costmodel), Register Coalescer, shouldRegionSplitForVirtReg, WebAssemblyRegStackifty. Are they all expected to keep working with the change here?

I feel like the new behavior isn't really "Trivial" any more, and it may be worth keeping the old method for trivial cases, which is just the base case plus a virtual reg use check. Essentially MachineLICMBase::isTriviallyReMaterializable from D107677. I'm not sure what to call the new method though, and not sure its worth it if all the uses above are OK as-is.

Have you gone through the uses of isTriviallyReMaterializable to check they line up to the new semantics?

I believe I did:

It looks like it's used in CalcSpillWeight,

Checked, spill weight calculation matches what LiveRangeEdit/RA in general do. Prohibiting vreg uses there did a worse job.

LiveRangeEdits (you've fixed recently?)

LRE handles it for a long time already, it has a check in the allUsesAvailableAt(). AMDGPU allows vreg uses, also ARM for VCTP. This is already the case.

MachineLICM (is getting an updated costmodel),

Yes, D107677.

Register Coalescer,

One corner case was fixed in D106396, otherwise shall work.

shouldRegionSplitForVirtReg,

Part or RA logic, works.

WebAssemblyRegStackifty.

Allows rematerialization of CONST_* opcodes only. These have immediate operands and no register uses at all. Should work and does not have any test changes.

Are they all expected to keep working with the change here?

Overall yes. The only per-requestity is MachineLICM change in the D107677 at this point.

I feel like the new behavior isn't really "Trivial" any more, and it may be worth keeping the old method for trivial cases, which is just the base case plus a virtual reg use check. Essentially MachineLICMBase::isTriviallyReMaterializable from D107677. I'm not sure what to call the new method though, and not sure its worth it if all the uses above are OK as-is.

The uses above are OK as far as I can tell. I can see how that is possible to wrap one method into another, but as usual have problem with names. We already have isRematerializable, isTriviallyRematerializable, and isReallyTriviallyRematerializable. Creating something like isReallyReallyTriviallyRematerializable looks like going down a rabbit hole ;) But I am open to opinions.

dmgreen accepted this revision.Aug 16 2021, 11:40 AM

Are they all expected to keep working with the change here?

Overall yes. The only per-requestity is MachineLICM change in the D107677 at this point.

Nice one

I feel like the new behavior isn't really "Trivial" any more, and it may be worth keeping the old method for trivial cases, which is just the base case plus a virtual reg use check. Essentially MachineLICMBase::isTriviallyReMaterializable from D107677. I'm not sure what to call the new method though, and not sure its worth it if all the uses above are OK as-is.

The uses above are OK as far as I can tell. I can see how that is possible to wrap one method into another, but as usual have problem with names. We already have isRematerializable, isTriviallyRematerializable, and isReallyTriviallyRematerializable. Creating something like isReallyReallyTriviallyRematerializable looks like going down a rabbit hole ;) But I am open to opinions.

I was thinking of something like "isSimpleRematerializable" vs "isTriviallyRematerializable" (but I didn't really like "simple"). If we don't have a user of it (other than Machine LICM which has it's own function), I'd say it is fine as-is.

Thanks for checking. Providing there are not other comment, this SGTM.

This revision is now accepted and ready to land.Aug 16 2021, 11:40 AM
rampitec marked 4 inline comments as done.Aug 16 2021, 12:03 PM
This revision was landed with ongoing or failed builds.Aug 16 2021, 12:53 PM
This revision was automatically updated to reflect the committed changes.
phosek added a subscriber: phosek.Aug 17 2021, 7:18 PM

We started seeing Clang failure when building our codebase with ASan and bisecting identified this change as a culprit, see PR51516 for details and reproducer. Would it be possible to revert this change until that issue is resolved?

We started seeing Clang failure when building our codebase with ASan and bisecting identified this change as a culprit, see PR51516 for details and reproducer. Would it be possible to revert this change until that issue is resolved?

Petr, I am AFK for today. Please revert it or I will do it tomorrow morning. Thanks for narrowing down.

We started seeing Clang failure when building our codebase with ASan and bisecting identified this change as a culprit, see PR51516 for details and reproducer. Would it be possible to revert this change until that issue is resolved?

Thanks Petr. Reproduced the error.

rampitec reopened this revision.Aug 20 2021, 10:50 AM

Proposed fix for https://bugs.llvm.org/show_bug.cgi?id=51516: D108475.
This needs that fix before it can be relanded.

This revision is now accepted and ready to land.Aug 20 2021, 10:51 AM
lkail added a subscriber: lkail.Aug 24 2021, 10:22 AM
rampitec updated this revision to Diff 368402.Aug 24 2021, 11:02 AM

Rebased to reland.

This revision was landed with ongoing or failed builds.Aug 24 2021, 11:09 AM
This revision was automatically updated to reflect the committed changes.

This might be related to https://bugs.llvm.org/show_bug.cgi?id=51655. Do you mind checking it out?

This might be related to https://bugs.llvm.org/show_bug.cgi?id=51655. Do you mind checking it out?

What happened to it? Lldb is green now. Also do you mind to attach ll to the bug? For some reason I cannot build compiler-rt with sanitizers now, still trying to sort it out. This is all not needed to check RA though, if there is an ll.

This might be related to https://bugs.llvm.org/show_bug.cgi?id=51655. Do you mind checking it out?

What happened to it? Lldb is green now. Also do you mind to attach ll to the bug? For some reason I cannot build compiler-rt with sanitizers now, still trying to sort it out. This is all not needed to check RA though, if there is an ll.

This is an expensive check build that does verify machineinstrs after each pass (I think), so you probably won't see it on the lldb job. I'll try to grab you an ll.

This might be related to https://bugs.llvm.org/show_bug.cgi?id=51655. Do you mind checking it out?

What happened to it? Lldb is green now. Also do you mind to attach ll to the bug? For some reason I cannot build compiler-rt with sanitizers now, still trying to sort it out. This is all not needed to check RA though, if there is an ll.

This is an expensive check build that does verify machineinstrs after each pass (I think), so you probably won't see it on the lldb job. I'll try to grab you an ll.

Thanks! This object seems to build fine to me with expensive checks, so I am not able to reproduce it myself.

What happened to it? Lldb is green now. Also do you mind to attach ll to the bug? For some reason I cannot build compiler-rt with sanitizers now, still trying to sort it out. This is all not needed to check RA though, if there is an ll.

This is an expensive check build that does verify machineinstrs after each pass (I think), so you probably won't see it on the lldb job. I'll try to grab you an ll.

I mean, it seems intermittent. It happened with build 20218, but then 20220 didn't have it, build 20221 was green, and then it has reappeared in the https://green.lab.llvm.org/green/job/clang-stage1-cmake-RA-expensive/

wxiao3 added a subscriber: wxiao3.Sep 14 2021, 11:20 PM

I just want to report that I tried this PR on large internal workloads (for X86) and saw some good performance. The generated code cleaned up nicely, I saw 350K new hits almost all are lea instructions. Observed 3% to 5% reduction in executed load/stores. Can the PR get be approved/landed?

I just want to report that I tried this PR on large internal workloads (for X86) and saw some good performance. The generated code cleaned up nicely, I saw 350K new hits almost all are lea instructions. Observed 3% to 5% reduction in executed load/stores. Can the PR get be approved/landed?

Thanks fort confirming, glad to hear this! This was landed about a month ago.

wxiao3 added a comment.EditedSep 22 2021, 12:01 AM

hi,

There is some internal benchmark performance regression after the patch landed.
The story is as follows:
There is a virtual register: %5 which lives across many basic blocks (some are inside loops).
Before this patch is landed, its weight is: 2.299834e-03 as below:

%5 [592r,3248B:0)[3312B,3392B:0)[3456B,3616B:0)[3712B,4368B:0)[4464B,5008B:0)[5072B,5328B:0)[6208B,6480B:0)[6560B,6640B:0)[6720B,6976B:0)[7104B,8224B:0)[8336B,9104B:0)[9184B,9488B:0)[10512B,13904B:0)[14480B,16320B:0)  0@592r weight:2.299834e-03

Its definition is as below:

592B      %5:gr64 = LEA64r %523:gr64, 1, %0:gr64_nosp, 0, $noreg

After this patch is landed, its weight is reduced by half to: 1.149917e-03 as below:

%5 [592r,3248B:0)[3312B,3392B:0)[3456B,3616B:0)[3712B,4368B:0)[4464B,5008B:0)[5072B,5328B:0)[6208B,6480B:0)[6560B,6640B:0)[6720B,6976B:0)[7104B,8224B:0)[8336B,9104B:0)[9184B,9488B:0)[10512B,13904B:0)[14480B,16320B:0)  0@592r weight:1.149917e-03

Finally, %5 is evicted by other VR, which doesn't happen before this patch land.

For this case, treating %5 as ReMaterializable and reducing its weight is very bad.

Any idea to fix the performance regression?

Wei

hi,

There is some internal benchmark performance regression after the patch landed.
The story is as follows:
There is a virtual register: %5 which lives across many basic blocks (some are inside loops).
Before this patch is landed, its weight is: 2.299834e-03 as below:

%5 [592r,3248B:0)[3312B,3392B:0)[3456B,3616B:0)[3712B,4368B:0)[4464B,5008B:0)[5072B,5328B:0)[6208B,6480B:0)[6560B,6640B:0)[6720B,6976B:0)[7104B,8224B:0)[8336B,9104B:0)[9184B,9488B:0)[10512B,13904B:0)[14480B,16320B:0)  0@592r weight:2.299834e-03

Its definition is as below:

592B      %5:gr64 = LEA64r %523:gr64, 1, %0:gr64_nosp, 0, $noreg

After this patch is landed, its weight is reduced by half to: 1.149917e-03 as below:

%5 [592r,3248B:0)[3312B,3392B:0)[3456B,3616B:0)[3712B,4368B:0)[4464B,5008B:0)[5072B,5328B:0)[6208B,6480B:0)[6560B,6640B:0)[6720B,6976B:0)[7104B,8224B:0)[8336B,9104B:0)[9184B,9488B:0)[10512B,13904B:0)[14480B,16320B:0)  0@592r weight:1.149917e-03

Finally, %5 is evicted by other VR, which doesn't happen before this patch land.

For this case, treating %5 as ReMaterializable and reducing its weight is very bad.

Any idea to fix the performance regression?

Wei

It is now considered rematerializable which has halven the spill weight because it could be rematerialized instead. I guess at the end it was not rematerialized. I'd start with checking what has prevented it, but likely that is because %523 or %0 are not available at the point of rematerialization.

I was experimenting with avoiding the division by 2 at the end of the VirtRegAuxInfo::weightCalcHelper if there are vreg uses, but actually got worse results. Since this is a heuristic calculating weights it might make sense to give it some little extra weight for every vreg use I guess.

I was experimenting with avoiding the division by 2 at the end of the VirtRegAuxInfo::weightCalcHelper if there are vreg uses, but actually got worse results. Since this is a heuristic calculating weights it might make sense to give it some little extra weight for every vreg use I guess.

Something like this maybe if you have a good test for it: https://reviews.llvm.org/differential/diff/374324/

Obviously 0.01 there is a pure guesstimate.

LuoYuanke added inline comments.Sep 22 2021, 8:25 PM
llvm/lib/CodeGen/TargetInstrInfo.cpp
990

The comments looks reasonable to me. The use register's live range is extended if we rematerialize def register. Removing this code is not friendly to RA. Do you have any data to ensure there is no side effect but benefit by removing this code?

rampitec added inline comments.Sep 23 2021, 12:04 PM
llvm/lib/CodeGen/TargetInstrInfo.cpp
990

The comment is far from reality. RA does not extend use live ranges. Instead it checks that all uses are available at the point of rematerialization and does not rematerialize if not. There is quite a number of cases where we have less spilling now.

Obviously 0.01 there is a pure guesstimate.

To fix one of my local performance regressions, I need to set it to 0.1 at least.
But tuning the value seems to be meaningless. Because:
Without your patch, isReallyTriviallyReMaterializableGeneric return true means that the VR is definitely ReMaterializable.
With your patch, isReallyTriviallyReMaterializableGeneric return true means that the VR is probably ReMaterializable.
I don't think the change is consistent with this function's orignal design goal.
Moreover, your patch conflicts with our target implementation: X86InstrInfo::isReallyTriviallyReMaterializable (llvm/lib/Target/X86/X86InstrInfo.cpp). E.g., for most LEA instructions, our implementation will return false. While your patch will always return true if LEA is using virtual registers. Your patch makes our target implementation useless anymore.

it seems to reduce spilling (as expected) or be neutral

I don't observe any data to prove it's good for performance in our internal performance track system running on various hardware. Instead, your patch brings -1.5% drop for CPU2017rerf/557.xz on server side and -2% drop for coremark-pro/core on desktop side. In scenario with big register pressure such as loops with a lot of LiveIntervals, optimistically assuming critical VR (LiveInterval) to be ReMaterializable (but it's not in reality) will make RA tend to spill the critical VR, which is definitely bad for performance.

Could you please revert this change or at least make it not impact our X86 CPU BE? I don't observe the conflicted change bring any benefit but It results at least 2 regressions for X86 CPU BE.

Obviously 0.01 there is a pure guesstimate.

To fix one of my local performance regressions, I need to set it to 0.1 at least.
But tuning the value seems to be meaningless. Because:
Without your patch, isReallyTriviallyReMaterializableGeneric return true means that the VR is definitely ReMaterializable.
With your patch, isReallyTriviallyReMaterializableGeneric return true means that the VR is probably ReMaterializable.
I don't think the change is consistent with this function's orignal design goal.
Moreover, your patch conflicts with our target implementation: X86InstrInfo::isReallyTriviallyReMaterializable (llvm/lib/Target/X86/X86InstrInfo.cpp). E.g., for most LEA instructions, our implementation will return false. While your patch will always return true if LEA is using virtual registers. Your patch makes our target implementation useless anymore.

it seems to reduce spilling (as expected) or be neutral

I don't observe any data to prove it's good for performance in our internal performance track system running on various hardware. Instead, your patch brings -1.5% drop for CPU2017rerf/557.xz on server side and -2% drop for coremark-pro/core on desktop side. In scenario with big register pressure such as loops with a lot of LiveIntervals, optimistically assuming critical VR (LiveInterval) to be ReMaterializable (but it's not in reality) will make RA tend to spill the critical VR, which is definitely bad for performance.

Could you please revert this change or at least make it not impact our X86 CPU BE? I don't observe the conflicted change bring any benefit but It results at least 2 regressions for X86 CPU BE.

Since this is second perf regression report I will revert it.

I still consider this a good experiment as it has uncovered couple RA bugs in the process.

I still consider this a good experiment as it has uncovered couple RA bugs in the process.

Yes, It's a great experiment.

Thanks!
Wei

vporpo added a subscriber: vporpo.Oct 11 2021, 7:53 PM

Just to add that this patch caused a huge compilation time increase in some workloads, with the majority of time spent in eliminateDeadDefs() called by rematerializeAll(). The compilation time regression was gone after this patch was reverted.

fhahn added a subscriber: fhahn.Oct 12 2021, 3:09 AM

Just to add that this patch caused a huge compilation time increase in some workloads, with the majority of time spent in eliminateDeadDefs() called by rematerializeAll(). The compilation time regression was gone after this patch was reverted.

Would it be possible to share a reproducer for the issue? Without a reprocducer, it will be hard to track down the compile-time issue.

foad added a subscriber: foad.Oct 20 2021, 2:23 AM
Carrot added a subscriber: Carrot.May 16 2022, 11:24 AM

@rampitec, any follow up on this patch?

@rampitec, any follow up on this patch?

Not planned. It is reverted and AMDGPU specific patch was landed instead. I do not have a capacity to fix all regressions including performance on all targets.