This is an archive of the discontinued LLVM Phabricator instance.

Add LiveRangeShrink pass to shrink live range within BB.
ClosedPublic

Authored by danielcdh on Apr 26 2017, 2:46 PM.

Details

Summary

The LiveRangeShrink pass moves an instruction to right after the definition within the same BB if the instruction and its operands all have more than one use. This pass is inexpensive and guarantees optimal live ranges within a BB.
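As a rough illustration of the idea, here is a toy model (not the patch's actual code): treat each instruction as a defined name plus the names it uses, and hoist it to just after the last definition it depends on. The use-count condition from the summary is omitted for brevity, and the moved-over instructions are assumed to have no dependencies of their own.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of an instruction: a defined name plus the names it uses.
struct Inst {
  std::string Def;
  std::vector<std::string> Uses;
};

// Hoist each instruction to just after the latest definition among its
// operands, scanning top-down. This mirrors the intuition behind the
// proposed pass, not its real implementation.
std::vector<Inst> shrinkLiveRanges(std::vector<Inst> Block) {
  for (size_t I = 0; I < Block.size(); ++I) {
    // Find the position of the last operand definition before Block[I].
    size_t InsertAfter = 0;
    bool Found = false;
    for (size_t J = 0; J < I; ++J)
      for (const std::string &U : Block[I].Uses)
        if (Block[J].Def == U) {
          InsertAfter = J;
          Found = true;
        }
    if (!Found || InsertAfter + 1 >= I)
      continue; // Already as early as it can be.
    // Move Block[I] to the slot right after its last operand definition.
    Inst Moved = Block[I];
    Block.erase(Block.begin() + I);
    Block.insert(Block.begin() + InsertAfter + 1, Moved);
  }
  return Block;
}
```

Running this on the reassociate-style order `d c b a t1 t2 t3` (where the adds `t1`, `t2`, `t3` all sit at the bottom) interleaves the adds with the calls that define their operands, which is exactly the shorter-live-range shape discussed below.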

Event Timeline

There are a very large number of changes, so older changes are hidden.

Update the patch to have a separate pass to handle live range shrinking within BB.

danielcdh retitled this revision from Improve code placement algorithm in Reassociate pass. to Add LiveRangeShrink pass to shrink live range within BB..Apr 28 2017, 10:17 AM
danielcdh edited the summary of this revision. (Show Details)
davidxl edited edge metadata.Apr 28 2017, 10:23 AM

A couple of comments:

  1. Moving it out of the reassociation pass is probably the right direction to go.
  2. Having this pass allows future enhancements.
  3. This actually allows you to do more aggressive expression operand re-ordering (to enable more shrinking) -- but to avoid offsetting reassociation's effects (enabling CSE, constant folding, loop-invariant code motion, etc.), the shrink pass should be pushed very late in the pipeline.
danielcdh updated this revision to Diff 97125.Apr 28 2017, 10:51 AM

Move the pass to the end of the optimization pipeline.

spatel added a subscriber: spatel.Apr 29 2017, 1:26 PM

d = foo();
c = foo();
t = d + c;
b = foo();
t = t + b;
a = foo();
t = t + a;

The MachineCombiner pass will try to reverse this transform to increase throughput. See comments above TargetInstrInfo::getMachineCombinerPatterns(). If it's not doing it currently, it's probably only because some intervening instructions are causing the pattern to not match.

Why would a transform that is concerned with register pressure and live ranges not be better accomplished at the machine instruction level?

In all the new test cases, the problem is indirectly caused by the presence of function calls in the reassociated expression.
Your patch is essentially trying to solve a register pressure problem at LLVM IR level by pre-scheduling instructions. However, as Sanjay pointed out, this design can be a bit fragile since later passes may undo your canonicalization.

My understanding is that the pre-RA-scheduler pass should be responsible for finding a good schedule that minimizes register pressure in preparation for regalloc.
Do you know why pre-RA-scheduler is unable to find a good schedule for your particular case?

If I pass ‘test2()’ to llc (with flag -pre-RA-sched=list-ilp -mtriple=x86_64-unknown-unknown) then I get the expected schedule of call/add instructions.

test2:
        pushq   %r14
        pushq   %rbx
        pushq   %rax
        callq   foo
        movl    %eax, %r14d
        callq   foo
        movl    %eax, %ebx
        addq    %r14, %rbx
        callq   foo
        movl    %eax, %eax
        addq    %rbx, %rax
        addq    $8, %rsp
        popq    %rbx
        popq    %r14
        retq

Yes, adding -pre-RA-sched=list-ilp can produce the expected result without this patch for my test case. I'm not familiar with -pre-RA-sched; any quick insight into why this option is not on by default?

Thanks,
Dehao

wmi edited edge metadata.May 1 2017, 4:43 PM
wmi added a subscriber: atrick.

+ Andy, for the history on pre-RA-sched and misched.

lib/Transforms/Scalar/LiveRangeShrink.cpp
54–56

Maybe use BasicBlock::getFirstInsertionPt instead. Set Insert to be BasicBlock::getFirstInsertionPt if Insert is before BasicBlock::getFirstInsertionPt.

58–60

Can we remove the update loop of InstOrderMap? It leads to O(N^2) complexity.

One way is just to let InstOrderMap[I] = InstOrderMap[Insert] and keep the rest the same.

66

Can we use a function pass? That would be easier for future extension.

MatzeB added a comment.EditedMay 1 2017, 4:51 PM
In D32563#742848, @wmi wrote:

+ Andy, for the history on pre-RA-sched and misched.

Selection-DAG scheduling (-pre-ra-sched=XXX) should only be concerned with bringing the SelectionDAG nodes into a linear order; it is not supposed to perform register pressure minimization, latency modeling, etc.
The main scheduling is left to the MachineScheduler nowadays (which runs between copy coalescing and register allocation).

Is the proposed pass necessary because the machine scheduler doesn't schedule instructions across calls? Then https://reviews.llvm.org/D15667 may be related...

wmi added inline comments.May 1 2017, 4:58 PM
lib/Transforms/Scalar/LiveRangeShrink.cpp
58–60

Please ignore this for now. Sorry. It is an incomplete comment sent out accidentally.

atrick added a comment.May 2 2017, 1:46 PM

Why are the adds "sunk down" in the first place? Is this reassociation at work?

Machine global code motion, using MachineTraceMetrics and RegisterPressure would be awesome. Nobody really followed through with implementing it so unfortunately there's no switch to turn on to try it.

wmi added a comment.May 2 2017, 2:31 PM

Why are the adds "sunk down" in the first place? Is this reassociation at work?

Yes, reassociation for an expression tree inserts the intermediate results directly before the final root. That is how the adds get sunk down.

Machine global code motion, using MachineTraceMetrics and RegisterPressure would be awesome. Nobody really followed through with implementing it so unfortunately there's no switch to turn on to try it.

Is the machine global code motion you mention here misched or a separate pass? We found that existing machine scheduling can do some live range shrinking in some cases, but we don't know whether relying on the existing register-pressure-aware scheduling and improving it is a good way to solve the live-range shrinking problem in general. Could you shed some light on how the existing misched chooses between maximizing ILP and minimizing register pressure?

Thanks,
Wei.

atrick added a comment.May 2 2017, 3:50 PM

Ideally there should be a separate pass that runs on the SSA machine code (before register coalescing) to minimize register pressure and hide latency for chains of loads or FP ops. It should work across calls and loop boundaries. You could even do a kind of poor-man's cyclic scheduling this way.

MISched works at the lower level of instruction groups and CPU pipeline hazards. It would be nice if MISched worked at the level of extended basic blocks (it would be easy to implement and has been done out of tree). I don't think it makes as much sense for it to work across call sites though. That is not hard to implement, but it seems it would generate large DAGs and be a bad compile-time tradeoff.

MISched is not a scheduling algorithm, it's a scheduling framework. The generic scheduler is a pile of heuristics that exercise most of the functionality and seems to be working ok for several popular targets. The strategy that it takes is:

  • Make a single scheduling pass handling all heuristics at once. Don't reorder the instructions at all unless the heuristics identify a register pressure or latency problem.
  • Try to determine, before scheduling a block, whether register pressure or latency is likely to become a problem. This avoids the scheduler backing itself into a corner (we don't want the scheduler to backtrack).

You'll notice that this is very conservative with respect to managing compile time and preserving the decisions made by earlier passes.

You could follow that basic strategy and simply adjust the priority of heuristics for your target. You can add better up-front analysis to detect code patterns for each block before prioritizing heuristics. Or you could implement a completely different strategy. For example, schedule for register pressure first, then reschedule for ILP.

I probably won't be able to help you much more than that. Matthias has taken over maintenance of MISched. I think it would help if you give more background on your situation (sorry if I haven't paid attention or have forgotten). Is this PowerPC? In-order/out-of-order?

wmi added a comment.May 2 2017, 5:12 PM


Thank you so much for introducing the design philosophy of MISched. It really helps.

About the background on our situation: the problem we are trying to solve is that the reassociation pass unnecessarily introduces a lot of live-range interference because of the way it inserts the intermediate results. The test case exposing the problem is quite special in that the instructions we care about are all in the same BB. Most of the defs have only one use and most of the uses have only one def. So Dehao proposed a limited live-range shrinking pass at the LLVM IR phase to solve it. We may extend the live-range shrinking in the future because we have seen other cases before that would benefit from more general live-range shrinking.

Sanjay and Andrea made us notice that scheduling may already achieve the same result as the shrinking pass we proposed, at least for the motivational test case, so we looked into MISched. Dehao has found that even if we enable the existing generic scheduling to cross calls, the more complex DAG leads to a better but still suboptimal solution, and the suboptimality is not easy to fix without a global picture of the whole basic block. This makes us inclined to implement it as a separate pass.

The problem was found on x86, which is the target we care the most right now, but I think the problem is platform independent.

atrick added a comment.May 2 2017, 5:49 PM

I think the algorithm needs to be explained. It seems reasonable to argue that single def/use live ranges should be kept short. But is there any reason at all to run this before other IR passes? It seems natural to run this after machine LICM and Sinking.

atrick added a comment.May 2 2017, 6:26 PM

The time to shrink live ranges is after all the global machine code
motion is done, if the goal is to handle as many cases as possible.

Doing code motion in machine code is tricky because you have to worry
about physical registers, flags and such. But those are exactly the
things you need to worry about to avoid generating terrible code. For
example, you don't want to hoist something over a call that defines a
physreg or is used by a physreg copy!

That said, if reassociate is making silly decisions about where to
place operations, I wouldn't see a problem fixing that directly.


We initially tried to fix it in reassociate, but decided to fix it separately because we do not want reassociate to become too complicated. This should also be able to fix live-range issues introduced by the user. I'll try to move this pass to the machine level.

danielcdh updated this revision to Diff 97891.May 4 2017, 4:21 PM

Change to do live-range shrinking at the machine level.

I've updated the patch to implement this at the machine level. The downside is that it needs to update a lot of unit tests, especially the autogenerated llc tests.

At this point, the patch is well tested with internal Google workloads and seems to be valid. But I may have missed other subtle cases that are not covered by my testing. I'd appreciate it if someone could help point them out.

Thanks,
Dehao

gberry added a subscriber: gberry.May 5 2017, 8:27 AM

I had a quick look at the code, and I noticed that you never check for the presence of DBG_VALUE instructions in a basic block.
I think that your pass would not work as expected with debug info.

More generally:

  • When you check the number of uses of a register, you should skip debug values.
  • When you move an instruction, you should also move its associated debug values.

A lot of tests have changed as a result of this patch. However, I think there is a value in adding your original test cases involving function calls.
Could you please add a new MIR test with code from (at least one of) your original small test cases?

lib/CodeGen/LiveRangeShrink.cpp
112 ↗(On Diff #97891)

You should probably skip debug values.

115 ↗(On Diff #97891)

s/instrctuion/instruction

126 ↗(On Diff #97891)

s/coalesed/coalesced

142 ↗(On Diff #97891)

We want to skip uses that are debug instructions.

147 ↗(On Diff #97891)

Same. We don't want to count debug uses.

danielcdh updated this revision to Diff 97979.May 5 2017, 10:05 AM
danielcdh marked 5 inline comments as done.

update

I had a quick look at the code, and I noticed that you never check for the presence of DBG_VALUE instructions in a basic block.
I think that your pass would not work as expected with debug info.

More generally:

  • When you check the number of uses of a register, you should skip debug values.
  • When you move an instruction, you should also move its associated debug values.

A lot of tests have changed as a result of this patch. However, I think there is a value in adding your original test cases involving function calls.
Could you please add a new MIR test with code from (at least one of) your original small test cases?

Done

When you move an instruction, you should also move its associated debug values.

See for example how this is done by MachineSink in method MachineSinking::SinkInstruction() (see file MachineSink.cpp).
For every instruction INSN that is a candidate for sinking, MachineSink collects all the DBG_VALUE instructions which are associated to INSN.
When INSN is moved to a different location, all the associated debug values (if any) are moved too.
I think you should do something similar.
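A minimal sketch of that approach (toy types and names for illustration — the real code works on MachineInstr lists, not vectors): gather the instruction together with the DBG_VALUEs that trail it, then move the whole group as one unit so the debug values stay glued to their instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy instruction stream entry: a DBG_VALUE is modeled as an entry
// whose IsDebugValue flag is set; it describes the instruction
// immediately before it.
struct MI {
  std::string Text;
  bool IsDebugValue;
};

// Move the instruction at index 'From' -- plus any DBG_VALUEs that
// immediately follow it -- to just before index 'To'. This mirrors the
// idea MachineSinking::SinkInstruction uses: collect the associated
// debug values and relocate them together with their instruction.
void moveWithDebugValues(std::vector<MI> &Block, size_t From, size_t To) {
  // Collect the instruction and its trailing debug values.
  size_t End = From + 1;
  while (End < Block.size() && Block[End].IsDebugValue)
    ++End;
  std::vector<MI> Group(Block.begin() + From, Block.begin() + End);
  Block.erase(Block.begin() + From, Block.begin() + End);
  if (To > From)
    To -= Group.size(); // Account for the erased range.
  Block.insert(Block.begin() + To, Group.begin(), Group.end());
}
```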

About your new test case. I think a MIR test would be even better. That said, I don't have a strong opinion (I am okay with your new test).
In that case, could you please regenerate the CHECK lines using `update_llc_test_checks.py`? The test should be small, so I don't expect too many CHECK lines inserted by the script.

gberry added inline comments.May 5 2017, 11:01 AM
lib/CodeGen/LiveRangeShrink.cpp
112 ↗(On Diff #97979)

Why are you skipping compares?

When you move an instruction, you should also move its associated debug values.

See for example how this is done by MachineSink in method MachineSinking::SinkInstruction() (see file MachineSink.cpp).
For every instruction INSN that is a candidate for sinking, MachineSink collects all the DBG_VALUE instructions which are associated to INSN.
When INSN is moved to a different location, all the associated debug values (if any) are moved too.
I think you should do something similar.

Sorry, did not address this in the last patch. Done now. Note that I do not want to touch unnecessary files in this change. So I simply copied the code from MachineSink.cpp, and will refactor it to a utility function in MachineInstr.h/.cpp in a follow-up patch.

About your new test case. I think a MIR test would be even better. That said, I don't have a strong opinion (I am okay with your new test).
In that case, could you please regenerate the CHECK lines using `update_llc_test_checks.py`? The test should be small, so I don't expect too many CHECK lines inserted by the script.

Done, but I'm not sure if that's the right thing to do. update_llc_test_checks makes the test too fragile: any subtle change in codegen would break it. And it hides what the author really wants to test, so follow-up patch writers may simply run the script and regenerate the test (without understanding what it's testing, or they need to spend much time figuring out what the original purpose was). BTW, what's a MIR test? Could you point me to an example? Thanks.

lib/CodeGen/LiveRangeShrink.cpp
112 ↗(On Diff #97979)

In many architectures, cmp should stay together with the jmp instruction for performance reasons. We do not want to break this in this pass.

Sorry, did not address this in the last patch. Done now. Note that I do not want to touch unnecessary files in this change. So I simply copied the code from MachineSink.cpp, and will refactor it to a utility function in MachineInstr.h/.cpp in a follow-up patch.

No problem. Thanks!

About your new test case. I think a MIR test would be even better. That said, I don't have a strong opinion (I am okay with your new test).
In that case, could you please regenerate the CHECK lines using `update_llc_test_checks.py`? The test should be small, so I don't expect too many CHECK lines inserted by the script.

Done, but I'm not sure if that's the right thing to do. update_llc_test_checks makes the test too fragile: any subtle change in codegen would break it. And it hides what the author really wants to test, so follow-up patch writers may simply run the script and regenerate the test (without understanding what it's testing, or they need to spend much time figuring out what the original purpose was). BTW, what's a MIR test? Could you point me to an example? Thanks.

Right, I think your test is okay.

About MIR tests, you can have a look at this page: http://llvm.org/docs/MIRLangRef.html
In particular, section "MIR Testing Guide" (and the next two sub-sections) are quite useful imho.

For example, you could do something like:

llc my-new-test.ll -stop-after=machine-sink -o test.mir

Then, use test.mir with a RUN line that only tests your new pass.
For example:

llc -o - %s -mtriple=x86_64-- -run-pass=lrshrink | FileCheck %s

The advantage is that you test your new pass in isolation.
But, as I wrote, I think your test is okay.

gberry added inline comments.May 5 2017, 12:02 PM
lib/CodeGen/LiveRangeShrink.cpp
153 ↗(On Diff #97991)

Typo: 'physicall' -> 'physical'

112 ↗(On Diff #97979)

For targets that don't have allocatable flag registers, your later check for physical regs should prevent moving these compares (and any other non-compare flag-setting instructions). For targets that keep the results of compares in general-purpose registers, this seems like a missed optimization. Also, if it is important that the compare be just before the branch, the scheduler should be taking care of that.

If you don't agree that this check should be removed can you put a comment explaining why this check for compares is being done?

danielcdh marked an inline comment as done.May 5 2017, 4:17 PM


Thanks for the explanation. I've reverted the test to its original form. After checking the MIR test output, I think it's better to keep the test simple in its original form and make sure that the later scheduler does not undo the change made by this patch.

lib/CodeGen/LiveRangeShrink.cpp
112 ↗(On Diff #97979)

That makes sense. I've removed the "compare" check.

atrick added a comment.May 5 2017, 9:24 PM

Can you collect spill code stats and benchmark scores from the test suite?
Can you get someone to run benchmarks on arm64 as well?

lib/CodeGen/LiveRangeShrink.cpp
162 ↗(On Diff #98035)

Why do you care how many instructions are using the register being defined here?

sanjoy added a subscriber: skatkov.May 7 2017, 12:56 PM

The reordering code here reminds me of the reordering code in ImplicitNullChecks. Is there an opportunity to share code here?

+CC @skatkov

lib/CodeGen/LiveRangeShrink.cpp
55 ↗(On Diff #98035)

s/Othwise/Otherwise/

85 ↗(On Diff #98035)

Range for?

sanjoy added a subscriber: sanjoy.May 7 2017, 1:02 PM
danielcdh marked an inline comment as done.May 8 2017, 6:09 PM

Can you collect spill code stats and benchmark scores from the test suite?
Can you get someone to run benchmarks on arm64 as well?

The patch did not change test-suite benchmark scores.

The patch reduced the total static spill number for the entire test suite from 110283 to 110232

Unfortunately we do not have an arm64 machine to run the performance test. But based on its impact on x86, the llvm test-suite does not seem to suffer from register pressure, so it should have minimal perf impact.

lib/CodeGen/LiveRangeShrink.cpp
162 ↗(On Diff #98035)

Right, I should not. Removed the check.

danielcdh updated this revision to Diff 98240.May 8 2017, 6:09 PM
danielcdh marked an inline comment as done.

rebase and update

danielcdh marked 2 inline comments as done.May 8 2017, 6:10 PM
andreadb added inline comments.May 9 2017, 5:55 AM
lib/CodeGen/InlineSpiller.cpp
43–50 ↗(On Diff #98240)

I think that you have uploaded a wrong patch.
This code shouldn't be there :-).

1031–1032 ↗(On Diff #98240)

Same.

danielcdh marked 2 inline comments as done.May 9 2017, 8:15 AM
danielcdh added inline comments.
lib/CodeGen/InlineSpiller.cpp
43–50 ↗(On Diff #98240)

Oops, forgot to revert it... Thanks for catching this :)

MatzeB requested changes to this revision.May 9 2017, 10:58 AM
MatzeB added inline comments.
lib/CodeGen/LiveRangeShrink.cpp
10–12 ↗(On Diff #98292)

Use /// \file, see http://llvm.org/docs/CodingStandards.html#file-headers

It would also be a good idea to add a sentence describing the rules/heuristics that decide which instructions get moved.

34 ↗(On Diff #98292)

This seems like a bold statement for a pass that moves instructions around. I would only expect to see AU.setPreservesCFG() here.

52 ↗(On Diff #98292)

We tend to use doxygen /// to document functions.

57–58 ↗(On Diff #98292)

Use references for things that cannot be nullptr. That also saves you assert(New != nullptr).

79 ↗(On Diff #98292)

You probably want MachineBasicBlock::iterator here and in several other places as you do not plan to break bundles apart.

93–94 ↗(On Diff #98292)

Then why not do so in this patch.

101 ↗(On Diff #98292)

I'm surprised this compiles. In each case you should use MI.getIterator() for clarity. Could also use DI = std::next(...)

102–103 ↗(On Diff #98292)

how about for(MachineInstr &DMI : make_range(DI, DE)) { ... }

111 ↗(On Diff #98292)

// end anonymous namespace

114–115 ↗(On Diff #98292)

This seems unnecessary; the for loop below will abort soon enough if the function is empty. Completely empty functions are also extremely rare, so this shortcut probably costs more time than it saves.

133–134 ↗(On Diff #98292)

Use MachineBasicBlock::iterator

135 ↗(On Diff #98292)

Use a reference for things that cannot be nullptr.

142 ↗(On Diff #98292)

No need for auto here just write MachineOperand

156–157 ↗(On Diff #98292)

You could move this check into the !isSafeToMove() path.

166 ↗(On Diff #98292)

unsigned for counters.

168 ↗(On Diff #98292)

Avoid auto.

173–174 ↗(On Diff #98292)

Maybe assign MO.getReg() to a variable it is used 5 times here.

MatzeB added inline comments.May 9 2017, 10:58 AM
lib/CodeGen/LiveRangeShrink.cpp
50 ↗(On Diff #98292)

You never seem to be using negative numbers, consider using unsigned instead of int.

64 ↗(On Diff #98292)

M[Old]?

65 ↗(On Diff #98292)

M[New] or better avoid the double lookup with M.count(New) above and use an M.find(New) for both.

84 ↗(On Diff #98292)

Why not pass a MachineBasicBlock &MBB as parameter to the function instead of the iterator and doing the Start->getParent()->instr_end() dance? You could have just used for (MachineInstr &I : MBB) {} in that case.

124 ↗(On Diff #98292)

I would have expected this outside the loop so it can be reused (considering that you also went through the trouble of clear()ing it inside BuildInstOrderMap.

125 ↗(On Diff #98292)

Collecting the SawStore value is unnecessary. The only user appears to be MachineInstr::isSafeToMove() below which writes(!) the variable.

150 ↗(On Diff #98292)

Barrier.

211–215 ↗(On Diff #98292)

There's a variant of splice that can move a sequence of instructions which should work nicely here.

It is also not clear to me if moving the debug instructions is legal here. They are not considered when checking how early an instruction can move, so you are possibly moving the debug instruction above the point where its operand's value is defined.
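For reference, MachineBasicBlock::splice has essentially the same semantics as std::list::splice: it moves a [first, last) range of instructions to just before a destination position without copying them. A toy std::list illustration of moving a contiguous sequence in one splice:

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>

// Model a basic block as a list of opcode names and move a contiguous
// range of "instructions" with a single splice, the way the review
// comment suggests doing for the instruction-plus-debug-values group.
std::list<std::string> demoSplice() {
  std::list<std::string> BB = {"call", "add1", "add2", "mul", "ret"};
  // Move the two adds (a contiguous [first, last) range) to just
  // before "ret".
  auto First = std::next(BB.begin());    // points at "add1"
  auto Last = std::next(BB.begin(), 3);  // one past "add2" (at "mul")
  auto Dest = std::prev(BB.end());       // points at "ret"
  BB.splice(Dest, BB, First, Last);      // BB: call mul add1 add2 ret
  return BB;
}
```

splice relinks list nodes rather than copying elements, so iterators to the moved elements stay valid — the same property that makes it cheap on machine instruction lists.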

This revision now requires changes to proceed.May 9 2017, 10:58 AM
danielcdh marked 21 inline comments as done.May 9 2017, 2:39 PM

Thanks for the review.

lib/CodeGen/LiveRangeShrink.cpp
64 ↗(On Diff #98292)

M is const map, thus cannot use M[Old].

84 ↗(On Diff #98292)

Because Start may not be MBB.begin(); it could also be an instruction that has side effects.

The reason I used Start->getParent()->instr_end() instead of passing in MBB and using MBB.instr_end() is that I wanted to pass one less parameter to make the function clearer.

93–94 ↗(On Diff #98292)

Code updated to inline the logic into the caller to make it clearer.

125 ↗(On Diff #98292)

SawStore is both an input and an output of MachineInstr::isSafeToMove().

211–215 ↗(On Diff #98292)

Done

For the debug values, it should be legal, as the same logic is used by MachineLICM.

danielcdh updated this revision to Diff 98351.May 9 2017, 2:40 PM
danielcdh edited edge metadata.

update

skatkov added inline comments.May 9 2017, 10:27 PM
lib/CodeGen/LiveRangeShrink.cpp
135 ↗(On Diff #98351)

Why don't you refresh SawStore here?

andreadb added inline comments.May 10 2017, 12:21 PM
lib/CodeGen/LiveRangeShrink.cpp
94–95 ↗(On Diff #98351)

You should early exit if this function must be skipped.

if (skipFunction(*MF.getFunction()))
  return false;

We want to be able to 'opt-bisect' this pass.
Also, this pass shouldn't be run on optnone functions.

178 ↗(On Diff #98351)

I think that you should also skip debug value instructions.

The loop at line 192 works under the assumption that an MI's DBG_VALUEs always follow directly after it. This is also suggested by your comment at line 189.

If you don't skip debug values, then you potentially break the above-mentioned assumption.

danielcdh added inline comments.May 10 2017, 1:56 PM
lib/CodeGen/LiveRangeShrink.cpp
94–95 ↗(On Diff #98351)

skipFunction added. Thanks for the comment.

This will not be run in optnone as in:

if (getOptLevel() != CodeGenOpt::None)
  addPass(&LiveRangeShrinkID);
135 ↗(On Diff #98351)

Not sure I understand the comment. SawStore is updated in MI.isSafeToMove. Do you mean that I can remove the previous "SawStore = BuildInstOrderMap" and set SawStore's initial value to "false"?

178 ↗(On Diff #98351)

Done. Great catch!

skatkov added inline comments.May 10 2017, 8:11 PM
lib/CodeGen/LiveRangeShrink.cpp
135 ↗(On Diff #98351)

My understanding is that if you encounter an MI that has side effects, it is a barrier and you will not move later instructions through it. SawStore is also a kind of barrier: you cannot move an instruction that intersects with a store instruction. After you see a new barrier, you can reset SawStore. Say you saw a store before the new barrier and there are no stores after the barrier; then you are free to move instructions that do not intersect with stores. However, once SawStore has been set to true, I do not see the place where it is reset to false. Am I missing anything?

However, the overall logic around SawStore is not clear to me. SawStore is set in BuildInstOrderMap, so it detects whether there is a store in the tail. That would prevent us from reordering an instruction (we always move instructions up) even if the store is located later. Is that expected?

Currently, we set 'SawStore' if there is at least one 'mayStore' instruction in the scheduling region.

However, we could provide much more accurate information about where 'mayStore' instructions are in the code.
For example, 'BuildInstOrderMap' could populate an (ordered) map '<order_number, MI*>', where MI is a 'mayStore' instruction and 'order_number' is the instruction's number in the sequence.

Later on, you could query that map to check whether it is safe to move a load from loc_B to loc_A. If the map contains a store with an order_number in the range [loc_A, loc_B], then you know that it is not safe to move it.

This is just an idea for a future improvement.
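A minimal, self-contained sketch of this future-improvement idea, with plain integers standing in for instruction order numbers (the names and the `const void *` placeholder for `MachineInstr *` are illustrative, not LLVM API):

```cpp
#include <map>

// Order numbers of all mayStore instructions in the region, kept sorted.
using StoreOrderMap = std::map<unsigned, const void *>; // order -> MI*

// A load can be hoisted from position LocB up to position LocA only if
// no mayStore instruction lies in the range [LocA, LocB].
bool isSafeToHoistLoad(const StoreOrderMap &Stores, unsigned LocA,
                       unsigned LocB) {
  // First store at or after the insertion point.
  auto It = Stores.lower_bound(LocA);
  // Unsafe if that store sits inside [LocA, LocB].
  return It == Stores.end() || It->first > LocB;
}
```

An ordered map makes each query a single `lower_bound`, instead of one coarse SawStore flag for the whole region.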

lib/CodeGen/LiveRangeShrink.cpp
81–85 ↗(On Diff #98529)

We can early exit from the loop in BuildInstOrderMap as soon as we encounter an instruction with unmodeled side effects.

An instruction with unmodeled side effects is essentially a scheduling barrier.
We don't need to keep numbering instructions after a scheduling barrier because the algorithm would force a recomputation of OrderMap (see line 138).

192–200 ↗(On Diff #98529)

It would be nice to have a STATISTIC that counts the number of instructions reordered by this pass.

135 ↗(On Diff #98351)

You are right. When BuildInstOrderMap() is called, SawStore should be updated too. Otherwise we risk preventing the reordering of loads in the new scheduling region.

danielcdh updated this revision to Diff 98635.May 11 2017, 8:39 AM
danielcdh marked an inline comment as done.

update

Thanks for the reviews!

lib/CodeGen/LiveRangeShrink.cpp
81–85 ↗(On Diff #98529)

That would actually make compile time slower by adding an additional check on the frequent path (BuildInstOrderMap), as instructions with unmodeled side effects are extremely rare in real programs.

135 ↗(On Diff #98351)

I've updated the logic to update SawStore on the fly and reset it to false when an instruction with unmodeled side effects is met.

MatzeB accepted this revision.May 11 2017, 10:52 AM

Accepting this now as it looks reasonably safe and fast to me (nitpicks below).

It is a conservative/simple heuristic but given the fact that we lack other means of global code motion (in the sense of inter-basic block) in CodeGen today this is fine.

Please wait for some of the other reviewers which showed interest here to accept before committing.

lib/CodeGen/LiveRangeShrink.cpp
112 ↗(On Diff #98635)

Maybe move UseMap outside the loop and clear() it each iteration instead so there is a chance that memory allocations can be reused.
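The suggested pattern in generic, self-contained form (this is not the pass's code; the function and its inputs are made up to demonstrate the allocation-reuse idea):

```cpp
#include <unordered_map>
#include <vector>

// Hoisting the map out of the loop lets clear() drop the elements while
// implementations typically keep the bucket array alive, instead of
// reallocating a fresh table for every basic block.
unsigned countRepeatedUses(const std::vector<std::vector<int>> &Blocks) {
  std::unordered_map<int, unsigned> UseMap; // reused across iterations
  unsigned Repeats = 0;
  for (const auto &Block : Blocks) {
    UseMap.clear(); // cheap: elements go, capacity usually stays
    for (int Reg : Block)
      if (++UseMap[Reg] == 2) // count each repeated value exactly once
        ++Repeats;
  }
  return Repeats;
}
```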

10–12 ↗(On Diff #98292)

Needs three slashes to be recognized as doxygen. (I know this is wrong in a lot of existing code, but we should get it right in new code)

211–215 ↗(On Diff #98292)

I'm not sure why this would work. Maybe you should double check by compiling the llvm-testsuite (or some other set of programs) with a debug compiler and -g...

This revision is now accepted and ready to land.May 11 2017, 10:52 AM
andreadb accepted this revision.May 11 2017, 11:09 AM


LGTM too.

Thanks!

danielcdh updated this revision to Diff 98677.May 11 2017, 2:25 PM
danielcdh marked 2 inline comments as done.

update

danielcdh added inline comments.May 11 2017, 2:25 PM
lib/CodeGen/LiveRangeShrink.cpp
211–215 ↗(On Diff #98292)

glad that I ran this with -g and actually caught a bug: when I move instructions together with debug info, I should increment the Next pointer; otherwise the iterator will revisit instructions that have already been processed. Updated the implementation and added test coverage.
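The iterator bug described here can be modeled with a plain std::list (a toy stand-in for the instruction list; values > 100 play the role of "instructions to hoist" and negative values play the role of trailing DBG_VALUEs — none of this is the pass's actual code):

```cpp
#include <iterator>
#include <list>

// Walk the list and hoist each marked element to the front, carrying its
// trailing annotations along. The key detail from the review: advance Next
// past the *entire* group being moved before splicing, otherwise the
// traversal would land on elements it has already processed.
void hoistMarked(std::list<int> &L) {
  for (auto It = L.begin(); It != L.end();) {
    auto Next = std::next(It);
    if (*It > 100 && It != L.begin()) {
      // Extend the moved range over trailing "debug values", advancing
      // Next with it so the walk resumes after the whole group.
      while (Next != L.end() && *Next < 0)
        ++Next;
      L.splice(L.begin(), L, It, Next); // hoist [It, Next) to the front
    }
    It = Next;
  }
}
```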

Thanks all for the reviews. Any other comments/concerns? If not, I'll commit the patch tomorrow.

andreadb accepted this revision.May 12 2017, 3:54 AM

Thanks for fixing the issue with debug values.

The patch still looks good to me.
A couple of comments can be slightly improved (see below).

lib/CodeGen/LiveRangeShrink.cpp
166 ↗(On Diff #98677)

Please remember to add a period at the end of each sentence.

182 ↗(On Diff #98677)

s/shrinked/shrunk
Also, please add a period at the end of this sentence.

185 ↗(On Diff #98677)

You should update this comment since we also skip debug values now.

danielcdh updated this revision to Diff 98774.May 12 2017, 8:35 AM
danielcdh marked 3 inline comments as done.

update

I'll submit the patch now to allow rest of today to catch potential buildbot breakage. Thanks again for all the reviews.

Dehao

danielcdh closed this revision.May 12 2017, 12:42 PM

Sorry about the regression. As Quentin pointed out, moving instructions close to a phi should not be a problem, as the algorithm checks that the moved instruction uses more than one single-use def. Looking at the regression case, the transformation looks reasonable if we only consider shrinking live ranges. Could you provide an IR file that I can use to reproduce the issue and see why register allocation gets worse?

In the meantime, I can prepare a patch to make this pass x86-only, or revert the patch if necessary.

patch to move the pass to x86 is proposed: https://reviews.llvm.org/D33294

Will have a separate patch to fix the REG_SEQUENCE handling once I get a test that I can include in unit testing. I suppose the fix is: ignore instructions whose def register size is no less than the sum of the use register sizes?

Thanks,
Dehao


REG_SEQUENCE is typically used to construct tuple register classes. So you get a set of values in consecutively numbered registers (you may watch my llvm dev meeting talk to see some introduction).

In any case, I think the safest solution here would be to NOT trigger on REG_SEQUENCE but on register classes. For the REG_SEQUENCEs in question, a number of small 1-register values are shrunk while a larger multi-register value is extended, so the initial assumption that everything is fine when you shrink more ranges than you extend does not hold. A conservative answer here would be to simply bail out if the register classes of the vregs you are shrinking/extending don't match up.
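The conservative bail-out can be sketched as follows. This is a toy model under stated assumptions: each virtual register carries a plain class ID (real code would compare `MRI.getRegClass(Reg)` results), and the struct and function names are invented for illustration.

```cpp
#include <vector>

struct VReg {
  unsigned ID;
  unsigned RegClass; // stand-in for the TargetRegisterClass
};

// Bail out unless every range being shrunk has the same register class as
// the range being extended. With mixed classes (e.g. several 1-register
// values vs. one multi-register tuple), "shrink more ranges than you
// extend" is no longer a valid proxy for register pressure.
bool classesMatchForHoist(const VReg &Def, const std::vector<VReg> &Uses) {
  for (const VReg &U : Uses)
    if (U.RegClass != Def.RegClass)
      return false;
  return true;
}
```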

Hi, escha,

Could you send me an IR file or snippet that could trigger this performance issue? I can include it in the patch as unittest.

Thanks,
Dehao

danielcdh reopened this revision.May 18 2017, 6:26 PM

The patch was reverted because it breaks tests at runtime.

This revision is now accepted and ready to land.May 18 2017, 6:26 PM
danielcdh updated this revision to Diff 99641.May 19 2017, 4:00 PM

Updated the implementation and tests:

  1. Do not hoist an instruction if its def MO does not share the same register class with the use MO.
  2. Include both the instruction order and the MI in the use map. Using the instruction order alone is not good enough, because when an instruction gets hoisted, its order becomes the same as the next instruction's. In this case, we need to ensure the insertion point is after the last barrier, i.e. after the last use of the instruction MO (implicitly defined by MI).

PTAL, Thanks!

danielcdh updated this revision to Diff 99642.May 19 2017, 4:01 PM

clang-format...

MatzeB accepted this revision.May 26 2017, 1:39 PM

Register class logic looks good to me with the nitpicks below addressed.

lib/CodeGen/LiveRangeShrink.cpp
164–165 ↗(On Diff #99642)

This should rather be !TargetRegisterInfo::isVirtualRegister(Reg), which is not the same as isPhysicalRegister(), because a register operand can also be NoReg (=0), which shouldn't stop movement.

177–178 ↗(On Diff #99642)
  • As we got surprised by it in the first try, it would be good to document the necessity for the regClass() check. Something along the lines of:
/// The heuristic does not handle different register classes yet (registers of different sizes, looser/tighter constraints).
  • getRegClass() only works for virtual registers; as far as I understand, this needs a check, as we could get here with a constant physreg.
danielcdh updated this revision to Diff 100804.May 30 2017, 5:23 PM
danielcdh marked 2 inline comments as done.

update

lib/CodeGen/LiveRangeShrink.cpp
177–178 ↗(On Diff #99642)

Updated the logic to make sure that DefMO and MO will always be virtual register.

MatzeB accepted this revision.May 30 2017, 5:31 PM

LGTM, with nitpicks below.

lib/CodeGen/LiveRangeShrink.cpp
165–166 ↗(On Diff #100804)

You can probably use this so instructions are still moved even if they use %noreg.

if (!Reg || MRI.isConstantPhysReg(Reg))
  continue;
167–168 ↗(On Diff #100804)
  • Remove commented code.
danielcdh updated this revision to Diff 100809.May 30 2017, 5:48 PM
danielcdh marked an inline comment as done.

update

danielcdh marked an inline comment as done.May 30 2017, 5:48 PM
danielcdh updated this revision to Diff 100917.May 31 2017, 1:44 PM

update test

danielcdh closed this revision.May 31 2017, 4:25 PM

For the record: This landed in r304371