[CodeGen] Add a new pass for PostRA sink
ClosedPublic

Authored by junbuml on Dec 20 2017, 1:57 PM.

Details

Summary

This pass sinks COPY instructions into a successor block, if the COPY is not
used in the current block and the COPY is live-in to a single successor
(i.e., doesn't require the COPY to be duplicated). This avoids executing the
the copy on paths where their results aren't needed. This also exposes
additional opportunites for dead copy elimination and shrink wrapping.

These copies were either not handled by or are inserted after the MachineSink
pass. As an example of the former case, the MachineSink pass cannot sink
COPY instructions with allocatable source registers; for AArch64 these type
of copy instructions are frequently used to move function parameters (PhyReg)
into virtual registers in the entry block..

For the machine IR below, this pass will sink %w19 in the entry into its
successor (%bb.1) because %w19 is only live-in in %bb.1.

%bb.0:
   %wzr = SUBSWri %w1, 1
   %w19 = COPY %w0
   Bcc 11, %bb.2
 %bb.1:
   Live Ins: %w19
   BL @fun
   %w0 = ADDWrr %w0, %w19
   RET %w0
 %bb.2:
   %w0 = COPY %wzr
   RET %w0

As we sink %w19 (CSR in AArch64) into %bb.1, the shrink-wrapping pass will be
able to see %bb.0 as a candidate.

With this change I observed 12% more shrink-wrapping candidate and 13% more dead copies deleted in spec2000/2006/2017 on AArch64.

Diff Detail

Repository
rL LLVM
There are a very large number of changes, so older changes are hidden. Show Older Changes
junbuml marked 5 inline comments as done.Jan 3 2018, 9:38 AM
junbuml edited the summary of this revision. (Show Details)

Moved this pass into MachineSink.cpp and named it PostRAMachineSink pass as Chad commented.

junbuml added inline comments.Jan 3 2018, 9:39 AM
lib/CodeGen/MachineCopySink.cpp
130 ↗(On Diff #127782)

I agree that merging these two loops is easier to read, but I intentionally kept these two loops separate because I wanted to take the second loop only when we found a single sinkable successor.

junbuml updated this revision to Diff 128546.Jan 3 2018, 12:09 PM

Fixed failures in :
test/DebugInfo/X86/dbg-value-transfer-order.ll
test/CodeGen/Hexagon/vect/vect-v4i16.ll

junbuml updated this revision to Diff 128624.Jan 4 2018, 9:38 AM

Fixed typo.

I've tested this out on Power with spec2006 and several open-source applications. With spec I saw a pretty similar increase in shrink wrapping opportunities (~11% with -O3 pgo+thinlto, ~7% with just -O3). I've noticed in some instances we do a lot of sinking without enabling new shrink-wrap opportunities though. For example with xalan stats showed we sunk about 4000 copies with this pass, but only enabled 5 new shrink-wrap opportunities and I see a consistent ~2% degradation with ref data. My understanding was enabling more shrink-wrap candidates was the original motivation. Have you considered breaking this up into an analysis that collects what copies are sinkable, and then only sink if doing so is likely to make the block viable for shrink-wrapping?

I've tested this out on Power with spec2006 and several open-source applications. With spec I saw a pretty similar increase in shrink wrapping opportunities (~11% with -O3 pgo+thinlto, ~7% with just -O3). I've noticed in some instances we do a lot of sinking without enabling new shrink-wrap opportunities though. For example with xalan stats showed we sunk about 4000 copies with this pass, but only enabled 5 new shrink-wrap opportunities and I see a consistent ~2% degradation with ref data. My understanding was enabling more shrink-wrap candidates was the original motivation. Have you considered breaking this up into an analysis that collects what copies are sinkable, and then only sink if doing so is likely to make the block viable for shrink-wrapping?

Very initially, this was motivated for more shrink wrapping, but we don't have to limit this just for more shrink wrapping because it can reduce #of dynamic instruction by sinking copies into paths where their results are really needed. This also helps for more dead copy eliminations. Can you please share little bit more detail about Xalan. AFAIK, Xalan score often vary across the top of truck in range of 3~4%.

I've tested this out on Power with spec2006 and several open-source applications. With spec I saw a pretty similar increase in shrink wrapping opportunities (~11% with -O3 pgo+thinlto, ~7% with just -O3). I've noticed in some instances we do a lot of sinking without enabling new shrink-wrap opportunities though. For example with xalan stats showed we sunk about 4000 copies with this pass, but only enabled 5 new shrink-wrap opportunities and I see a consistent ~2% degradation with ref data. My understanding was enabling more shrink-wrap candidates was the original motivation. Have you considered breaking this up into an analysis that collects what copies are sinkable, and then only sink if doing so is likely to make the block viable for shrink-wrapping?

Very initially, this was motivated for more shrink wrapping, but we don't have to limit this just for more shrink wrapping because it can reduce #of dynamic instruction by sinking copies into paths where their results are really needed. This also helps for more dead copy eliminations. Can you please share little bit more detail about Xalan. AFAIK, Xalan score often vary across the top of truck in range of 3~4%.

Sure, I used -O3 -fexperimental-new-pass-manager as the compiler options, the only difference between the compilers was this patch and I ran the binaries 7 times. With each executable run-to-run variation was small.

In my environment, Xalan is sensitive in code alignment. I open see some small change which is not even applied in hot functions could impact on score in range of 3~4%. I'm not sure if this is also your case. Can you please share how this change impact on hot functions (AFAIK, Xalan has two hot functions). It will be great if you can share any suspicious sinking or transformations caused by sinking copies.

In my environment, Xalan is sensitive in code alignment. I open see some small change which is not even applied in hot functions could impact on score in range of 3~4%. I'm not sure if this is also your case. Can you please share how this change impact on hot functions (AFAIK, Xalan has two hot functions). It will be great if you can share any suspicious sinking or transformations caused by sinking copies.

I spent some more time profiling this. I think your right on the code alignment causing the degradation. I've seen this before with Xalan but never with this big of a difference. I can trace the slow down to 2 functions that didn't change with this patch.

Thank you very much for confirming this. Kindly ping.

Please let me know any other comment about this patch.

Thanks for working on this! I've seen a few 1-2% improvements on arm64 with this enabled and I think it solves one of the issue that has been frequently raised with shrink-wrapping.

Few questions / concerns:

  • Is there anything preventing us to sink this even deeper than "one of the successors"? I think we should go further with this instead of special casing this particular issue.
  • If we end up doing that, I think this pass should sink more than just COPYs. Is going further with this and having a generic Post-RA Sink pass what you're planning to do?
  • I wonder if we could improve MachineSink to be scheduled both pre and post RA.
  • If that's not suitable, should we build some kind of infrastructure where we can merge both pre and post RA Sink passes and re-use the algorithms while only changing the constraints?

I'm just throwing ideas out here, since this feels a little bit special cased to the shrink-wrapping issue, while it could (and from what I understand, it already does) catch some even more interesting opportunities.

lib/CodeGen/MachineSink.cpp
1002 ↗(On Diff #128624)

You can use a reference for things that are never null.

970 ↗(On Diff #128532)

You can use an ArrayRef<MachineBasicBlock *> here.

test/DebugInfo/X86/dbg-value-transfer-order.ll
1 ↗(On Diff #128624)

Just curious: what happened here? Is some debug info missing after this pass?

Is there anything preventing us to sink this even deeper than "one of the successors"? I think we should go further with this instead of special casing this particular issue.

As long as there is no the register dependency, we can continue sinking a COPY deeply. I believe sinkcopy5() in post-ra-machine-sink.mir shows this case.

If we end up doing that, I think this pass should sink more than just COPYs. Is going further with this and having a generic Post-RA Sink pass what you're planning to do?
I wonder if we could improve MachineSink to be scheduled both pre and post RA.
If that's not suitable, should we build some kind of infrastructure where we can merge both pre and post RA Sink passes and re-use the algorithms while only changing the constraints?
I'm just throwing ideas out here, since this feels a little bit special cased to the shrink-wrapping issue, while it could (and from what I understand, it already does) catch some even more interesting opportunities.

I believe we can make this as a generic Post-RA Sink pass, but I didn't see any other motivation cases other than sinking COPYs for now. Considering the current scope of this post-ra sink, I thought separating the pre/post-ra make code much simpler. If there are good enough motivation cases which require the post-ra sink pass to do the pretty much the same jobs done in pre-ra, I will be happy to extend it in a way that you mention here.

Thanks Francis for reviewing and testing this. I will fix your other comments soon.

Is there anything preventing us to sink this even deeper than "one of the successors"? I think we should go further with this instead of special casing this particular issue.

As long as there is no the register dependency, we can continue sinking a COPY deeply. I believe sinkcopy5() in post-ra-machine-sink.mir shows this case.

Great! Sorry, I didn't see that all the potential successors were added to the list.

junbuml updated this revision to Diff 131640.Jan 26 2018, 12:35 PM
junbuml marked 2 inline comments as done.

Addressed Francis' comments.

On top of the COPY source forwarding (D41835), this will increase even more shrink-wrapping . With D41835, I observed about 60% more shrink-wrapping in spec2000/2006/2017 benchmarks.

test/DebugInfo/X86/dbg-value-transfer-order.ll
1 ↗(On Diff #128624)

There was minor block structure change by sinking two COPYs into an empty block. Instead of disabling the pass, I changed block names accordingly.

Kindly ping ?

Kindly ping ?

For me this is generally ok, but I would wait for the other reviewers to see if there are any concerns.

LGTM, but formal approval should probably come from someone outside of our group (ping! :).

lib/CodeGen/MachineSink.cpp
914 ↗(On Diff #131640)

the the -> the

junbuml updated this revision to Diff 134631.Feb 16 2018, 8:15 AM

Updated comments and some test cases.

junbuml added a comment.EditedFeb 16 2018, 8:20 AM

Added reviewers (Taewook Oh and John Brawn) due to the changes in :
. test/CodeGen/X86/branchfolding-debugloc.ll
. test/CodeGen/Thumb2/ifcvt-no-branch-predictor.ll

Kindly ping.

The change to ifcvt-no-branch-predictor.ll looks OK to me, though this patch doesn't apply cleanly to latest trunk (and also has test failures, including in ifcvt-no-branch-predictor.ll). It looks like the cause of that is that D41835 has been reverted in r325421.

The change to ifcvt-no-branch-predictor.ll looks OK to me, though this patch doesn't apply cleanly to latest trunk (and also has test failures, including in ifcvt-no-branch-predictor.ll). It looks like the cause of that is that D41835 has been reverted in r325421.

Thanks John for the review. As I expect D41835 is recommitted soon, I will defer rebasing it for now.

junbuml updated this revision to Diff 136152.Feb 27 2018, 2:10 PM
junbuml added a reviewer: RKSimon.

Rebased.
Added Simon Pilgrim as reviewer due to the change in test/CodeGen/X86/i128-mul.ll

Rebased.
Added Simon Pilgrim as reviewer due to the change in test/CodeGen/X86/i128-mul.ll

That change is fine, thanks.

Thanks Simon ! I will be happy to hear any comment, suggestion, or objection.

junbuml retitled this revision from [CodeGen] Add a new pass to sink Copy instructions after RA to [CodeGen] Add a new pass for PostRA sink.Mar 7 2018, 1:52 PM

Can anybody take a look at this? I will be happy to get any feedback.

sebpop added a subscriber: sebpop.Mar 12 2018, 2:40 PM
sebpop added inline comments.
lib/Target/AArch64/AArch64InstrInfo.cpp
4620 ↗(On Diff #136152)

This check for the zero registers seems to be the only difference with the generic TII implementation.
I am thinking that this may be the case for other targets than aarch64.
You could avoid duplicating all the code in this function by checking the result of a function like
TII->canModifyRegister(Reg) or
TII->isReadOnly(Reg)

test/CodeGen/Thumb2/ifcvt-no-branch-predictor.ll
121 ↗(On Diff #136152)

Why do you need to change the test here?

sebpop added a subscriber: evandro.Mar 13 2018, 8:55 AM
sebpop added inline comments.
lib/CodeGen/MachineSink.cpp
995 ↗(On Diff #136152)

So at this point we are at a loop depth 4:

for (BB : MF)
  for (MI: BB)
    for (SI : BB.successors)
      for (LI : SI->liveins)

I think you could do the last two loops above the loop (MI:BB) by asking first whether there is an opportunity to perform the sinking from BB into one of the BB.successors. The liveins are stored as a bitmap, and you could efficiently compute the difference of liveins in the successors. The registers in the diff are candidates for sinking.

Also please post compile time numbers with this change.

junbuml updated this revision to Diff 138223.Mar 13 2018, 10:25 AM

Thanks Sebastian for the review.
I added inlined reply.

lib/CodeGen/MachineSink.cpp
995 ↗(On Diff #136152)

I think you could do the last two loops above the loop (MI:BB) by asking first whether there is an opportunity to perform the sinking from BB into one of the BB.successors.

We perform this loop only when we can find a single successor to which we can sink in the right above loop line 978. Are you asking to move this check done in this loop outside of getSingleLiveInSuccBB() before finding the single sinkable successor ?

The liveins are stored as a bitmap, and you could efficiently compute the difference of liveins in the successors. The registers in the diff are candidates for sinking.

I'm not clear about this. As far as I checked, LiveIns in a MachineBasicBlock is a vector. I use a bitmap, but that is to track def/use of registers.

Also please post compile time numbers with this change.

I will share compile time info soon.

lib/Target/AArch64/AArch64InstrInfo.cpp
4620 ↗(On Diff #136152)

I cannot see neither TII->canModifyRegister(Reg) nor TII->isReadOnly(Reg), but instead I use TRI->isConstantPhysReg(Reg) in the generic TII.

test/CodeGen/Thumb2/ifcvt-no-branch-predictor.ll
122 ↗(On Diff #136152)

I simply wanted to make r0 (%n) used in both successors so that we can keep the MOV in the entry block. I believe this is the easiest/right way to keep the original intention of this test with this new pass.

sebpop added inline comments.Mar 13 2018, 11:37 AM
lib/CodeGen/MachineSink.cpp
995 ↗(On Diff #136152)

My reasoning is that we could compute the same information independently of an MI.
By only looking at a BB and the liveins in its successors, we can compute whether there is an opportunity to sink.

lib/CodeGen/TargetInstrInfo.cpp
901 ↗(On Diff #138223)

Looks good, thanks!

junbuml added inline comments.Mar 13 2018, 4:25 PM
lib/CodeGen/MachineSink.cpp
995 ↗(On Diff #136152)

I think you meant that we first try to find sinkable Regs in advance independently of an MI and iterate MIs only when there is candidates, right?

Assuming that this was what you meant, I doubt if doing so in advance is beneficial for the compile time. To be independent on MI, we need to do this aliased reg check for all live-ins in successors of every BB.

In current implementation, we do this check after finishing register dependency of MI (line 1040) and when we know that the DestRef of copy is potentially sinkable to a single successor (line 978). I think in practically doing this only when we need to do must be cheaper than doing this in advance for all live-ins of successors independently of an MI.

The worst case I can think of is that when a BB contain many redundant sinkable Copies writing to the same DestReg. For that, we might be able to cache the result for already visited Regs for a BB. Then, in worst case, we will perform this loop at most the number of registers of the target for a BB. Please let me know your thought.

sebpop accepted this revision.Mar 13 2018, 4:35 PM

I think the current implementation is good: please commit.
Thanks for the explanations.

This revision is now accepted and ready to land.Mar 13 2018, 4:35 PM

Thanks Sebastian for the review. I will commit it tomorrow.

junbuml updated this revision to Diff 138402.Mar 14 2018, 10:35 AM
junbuml added a reviewer: kparzysz.

Added Krzysztof as reviewer due to the change in hexagon test (test/CodeGen/Hexagon/noreturn-noepilog.ll).

I tried this patch on aarch64 A72 firefly linux on a set of benchmarks.
Overall the performance improved by 11% cumulatively (sum of all speedups and slowdowns.)
7 benchmarks improved by more than 1% and 4 degraded by >1%, one degraded by 4% and another by 3%.
I will investigate the 4% and 3% regressions.

Thanks for testing this. Please let me know any detail if the regression is real. I will also rerun performance tests on my side as well.

There was no obvious speedup and slowdown in my performance test for spec2000/2006/2017 on aarch64 falkor.

Sebastian,
Please let me know if you found something on your regressions.

Krzysztof,
Can you please confirm if the change in test/CodeGen/Hexagon/noreturn-noepilog.ll is okay?

Thanks,
Jun

The Hexagon change is fine, although a bit surprising. The two instructions are the same, but the :raw form is generally emitted for newer CPUs. I'm not sure what changed to get the older form printed instead.

I got more data for the benchmarks that slowed down, and I see that the variation is within the noise level.
Thanks for checking the performance on your side.

The Hexagon change is fine, although a bit surprising. The two instructions are the same, but the :raw form is generally emitted for newer CPUs. I'm not sure what changed to get the older form printed instead.

Thanks for review this. The only change from this patch is that a MOV in the entry is sunk into %b1. Honestly, I don't have much idea about the :raw form and how the MOV impact on allocframe in Hexagon. Do you think this expose some unexpected behavior in Hexagon?

Krzysztof, here is the assembly before this patch:

f1:                                     // @f1
// %bb.0:                               // %b0
        {
                        p0 = cmp.gtu(r0,#3); if (p0.new) jump:nt .LBB0_2
                        r2 = r0
        }
// %bb.1:                               // %b2
        {
                        r0 = +mpyi(r1,#7)
                        r1 = #0
                        jumpr r31
        }
.LBB0_2:                                // %b1
        {
                        call f0
                        r1:0 = combine(r2,##g0)
                        allocframe(r29,#0):raw
        }

after the patch:

f1:                                     // @f1
// %bb.0:                               // %b0
        {
                        p0 = cmp.gtu(r0,#3); if (p0.new) jump:nt .LBB0_2
        }
// %bb.1:                               // %b2
        {
                        r0 = +mpyi(r1,#7)
                        r1 = #0
                        jumpr r31
        }
.LBB0_2:                                // %b1
        {
                        r2 = r0
                        allocframe(#0)
        }
        {
                        call f0
                        r1:0 = combine(r2,##g0)
        }

The result takes one extra packet, which is a perf regression on Hexagon.
I think this is due to the fact that the sink of copies is post-ra, and
there doesn't seem to be a propagation pass to remove the extra transfer r2=r0.

The result takes one extra packet, which is a perf regression on Hexagon.
I think this is due to the fact that the sink of copies is post-ra, and
there doesn't seem to be a propagation pass to remove the extra transfer r2=r0.

Due to my lack of knowledge on Hexagon, I don't quite understand why this is regression. If this is Hexagon specific issue, I can turn it off for Hexagon.

Each "{ instructions }" represents a packet of instructions.
Each packet executes in 1 cycle.
Overall, before the patch there were 3 packets, after the patch 4 packets.
On one of the paths we go from 2 cycles to 3 cycles.

Can this be disabled via TargetPassConfig::disablePass? If so, please XFAIL the Hexagon test with a comment stating that it started failing after post-ra machine sinking,

Can this be disabled via TargetPassConfig::disablePass? If so, please XFAIL the Hexagon test with a comment stating that it started failing after post-ra machine sinking,

Yes, it can be disabled via TargetPassConfig::disablePass. I will go ahead and XFAIL the hexagon test with a comment.

junbuml updated this revision to Diff 138774.Mar 16 2018, 2:44 PM

XFAILed test/CodeGen/Hexagon/noreturn-noepilog.ll

Thanks for this @junbuml! I did some runs on arm64 and I see a regression on 176.gcc, in average of 1.5%.

I'm comparing

-Os -mllvm -disable-postra-machine-sink
vs
-Os

I did 6 runs for both configs and compared all of them:

176.gcc   1.4%
176.gcc   1.3%
176.gcc   1.3%
176.gcc   0.9%
176.gcc   0.5%
176.gcc   0.6%
176.gcc   1.2%
176.gcc   1.1%
176.gcc   1.1%
176.gcc   0.7%
176.gcc   0.3%
176.gcc   0.4%
176.gcc   1.4%
176.gcc   1.2%
176.gcc   1.2%
176.gcc   0.8%
176.gcc   0.4%
176.gcc   0.6%
176.gcc   2.4%
176.gcc   2.3%
176.gcc   2.3%
176.gcc   1.9%
176.gcc   1.5%
176.gcc   1.6%
176.gcc   2.5%
176.gcc   2.4%
176.gcc   2.4%
176.gcc   1.9%
176.gcc   1.6%
176.gcc   1.7%
176.gcc   2.6%
176.gcc   2.5%
176.gcc   2.5%
176.gcc   2.1%
176.gcc   1.7%
176.gcc   1.8%

I will investigate to see if I can find anything if I'm the only one seeing this.

lib/CodeGen/MachineSink.cpp
970 ↗(On Diff #138774)

Should be ArrayRef<MachineBasicBlock *> SinkableBBs

Thanks Francis for your test. In my previous test (O3) on aarch64, I didn't observe any noticeable change in spec 2000/2006/2017 score. I just kicked off perf tests for Os, and I will share the results.

In my test with Os, I can see -0.6% in spec2000/gcc, but I couldn't see any change in the hot function (propagate_block), which takes just 11% in profile though. As far as I check in my environment, 0.6% seems reasonable performance swing. I'm not sure if your regression on 176.gcc could be explained as swing or not. Please let me know you see any negative impact from this.

junbuml updated this revision to Diff 139123.Mar 20 2018, 7:55 AM
junbuml marked an inline comment as done.

minor formatting change.

Francis,
Just curious if your regression on 176.gcc is really caused by this change or it was possible performance swing. Did you by change have any improvement on your test?

In my test with Os, I can see -0.6% in spec2000/gcc, but I couldn't see any change in the hot function (propagate_block), which takes just 11% in profile though. As far as I check in my environment, 0.6% seems reasonable performance swing. I'm not sure if your regression on 176.gcc could be explained as swing or not. Please let me know you see any negative impact from this.

I took a look at the diffs. Other than very few block placement diffs and some loads using different base registers, everything looks great, so I think this should be fine (I hope I'm not missing anything obvious here). Thank you for the work!

Thanks Francis for looking at this.

junbuml updated this revision to Diff 139327.Mar 21 2018, 10:37 AM

Found a new failure in a recently added hexagon test.

Krzysztof, can you please confirm if my change in your test (test/CodeGen/Hexagon/swp-phi-ref.ll) is okay?

junbuml updated this revision to Diff 139443.Mar 22 2018, 7:54 AM

Assuming this pass will be disabled on Hexagon, XFAILed swp-phi-ref.ll just like noreturn-noepilog.ll.

This revision was automatically updated to reflect the committed changes.

This pass destroys DominatorInfo and we have to recompute it right after the pass from scratch. Is it possible to preserve it? Also, have you measured compile time impact of the patch?

Thanks,
Michael

This pass destroys DominatorInfo and we have to recompute it right after the pass from scratch. Is it possible to preserve it? Also, have you measured compile time impact of the patch?

I measured compile time impact of this patch for spec2000/2006/2017. Overall, I wasn't able to see any reproduciable regression; all up and down are in noise range. There is no change in CFG in this pass, preserve DT should be good and I will submit a follow-up patch for it.

I measured compile time impact of this patch for spec2000/2006/2017. Overall, I wasn't able to see any reproduciable regression; all up and down are in noise range. There is no change in CFG in this pass, preserve DT should be good and I will submit a follow-up patch for it.

Sounds good, thank you!

Michael

MatzeB added inline comments.Apr 16 2018, 5:36 PM
llvm/trunk/include/llvm/CodeGen/TargetInstrInfo.h
961

Why did you put this function into TargetInstrInfo? There is nothing target specific about it. In fact I would heavily object if a target would actually override this!