LiveRangeShrink pass moves instruction right after the definition with the same BB if the instruction and its operands all have more than one use. This pass is inexpensive and guarantees optimal live-range within BB.
Details
- Reviewers
davidxl wmi hfinkel MatzeB andreadb - Commits
- rG6b737ddce760: Add LiveRangeShrink pass to shrink live range within BB.
rG65dd23e273c5: Add LiveRangeShrink pass to shrink live range within BB.
rL304371: Add LiveRangeShrink pass to shrink live range within BB.
rL302938: Add LiveRangeShrink pass to shrink live range within BB.
Diff Detail
- Build Status
Buildable 6298 Build 6298: arc lint + arc unit
Event Timeline
A couple of comments:
- Moving it out of the reassociation pass is probably the right direction to go.
- having this pass allows future enhancement
- This actually allows you to do more aggressive expression operand re-ordering (to enable more shrinking) -- but to void offsetting reassocation's effect (to enable cse, constant folding, loop invariant code motion etc), the shrink pass should be pushed very late in the pipeline.
d = foo();
c = foo();
t = d + c;
b = foo();
t = t + b;
a = foo();
t = t + a;
The MachineCombiner pass will try to reverse this transform to increase throughput. See comments above TargetInstrInfo::getMachineCombinerPatterns(). If it's not doing it currently, it's probably only because some intervening instructions are causing the pattern to not match.
Why would a transform that is concerned with register pressure and live ranges not be better accomplished at the machine instruction level?
In all the new test cases, the problem is indirectly caused by the presence of function calls in the reassociated expression.
Your patch is essentially trying to solve a register pressure problem at LLVM IR level by pre-scheduling instructions. However, as Sanjay pointed out, this design can be a bit fragile since later passes may undo your canonicalization.
My understanding is that the pre-RA-scheduler pass should be responsible for finding a good schedule that minimizes register pressure in preparation for regalloc.
Do you know why pre-RA-scheduler is unable to find a good schedule for your particular case?
If I pass ‘test2()’ to llc (with flag -pre-RA-sched=list-ilp -mtriple=x86_64-unknown-unknown) then I get the expected schedule of call/add instructions.
test2: pushq %r14 pushq %rbx pushq %rax callq foo movl %eax, %r14d callq foo movl %eax, %ebx addq %r14, %rbx callq foo movl %eax, %eax addq %rbx, %rax addq $8, %rsp popq %rbx popq %r14 retq
Yes, adding -pre-RA-sched=list-ilp can produce the expected result without this patch for my testcase. I'm not familiar with -pre-RA-sched, any quick insight why this option is not on by default?
Thanks,
Dehao
+ Andy, for the history on pre-RA-sched and misched.
lib/Transforms/Scalar/LiveRangeShrink.cpp | ||
---|---|---|
54–56 ↗ | (On Diff #97125) | Maybe use BasicBlock::getFirstInsertionPt instead. Set Insert to be BasicBlock::getFirstInsertionPt if Insert is before BasicBlock::getFirstInsertionPt. |
58–60 ↗ | (On Diff #97125) | Can we remove the update loop of InstOrderMap? It leads to O(N^2) complexity. One way is just to let InstOrderMap[I] = InstOrderMap[Insert] and keep the rest the same. |
66 ↗ | (On Diff #97125) | Can we use a function pass? which will be easier for future extension. |
Selection-DAG scheduling (-pre-ra-sched=XXX) should only be concerned about bringing the SelectionDAG nodes into a linear order, they are not supposed to perform register pressure minimiziation, latency modeling, etc.
The main scheduling is left to the MachineScheduler nowadays (which runs between copy coalescing and register allocation).
Is the proposed pass necessary because the machine scheduler doesn't schedule instructions accross calls? Then https://reviews.llvm.org/D15667 may be related...
lib/Transforms/Scalar/LiveRangeShrink.cpp | ||
---|---|---|
58–60 ↗ | (On Diff #97125) | Please ignore this for now. Sorry. It is an incomplete comment sent out accidently. |
Why are the adds "sunk down" in the first place? Is this reassociation at work?
Machine global code motion, using MachineTraceMetrics and RegisterPressure would be awesome. Nobody really followed through with implementing it so unfortunately there's no switch to turn on to try it.
Yes, reassociation for an expression tree inserts the intermediate result directly before the final root. That is how the adds sunk down.
Machine global code motion, using MachineTraceMetrics and RegisterPressure would be awesome. Nobody really followed through with implementing it so unfortunately there's no switch to turn on to try it.
The machine global code motion you mention here is misched or a separate pass? We found existing machine scheduling can do some live range shrinking in some cases, but we don't know whether depending on existing register pressure aware scheduing and improving it is a good way to solve the live range shrinking problem generally. Could you shed some light on how existing misched choose between maximizing ILP and minimizing register pressure?
Thanks,
Wei.
Ideally there should be a separate pass that runs on the SSA machine code (before register coalescing) to minimize register pressure and hide latency for chains of loads or FP ops. It should work across calls and loop boundaries. You could even do a kind of poor-man's cyclic scheduling this way.
MISched works at the lower level of instructions groups and CPU pipeline hazards. It would be nice if MISched worked at the level of extended basic blocks (it would be easy to implement and has been done out of tree). I don't think it makes as much sense for it to work across call sites though. That is not hard to implement but seems it will generate large DAGS and will be bad compile-time tradeoff.
MISched is not a scheduling algorithm, it's a scheduling framework. The generic scheduler is a pile of heuristics that exercise most of the functionality and seems to be working ok for several popular targets. The strategy that it takes is:
- Make a single scheduling pass handling all heuristics at once. Don't reorder the instructions at all unless the heuristics identify a register pressure or latency problem.
- Try to determine, before scheduling a block, whether register pressure or latency is likely to become a problem. This avoids the scheduler backing itself into a corner (we don't want the scheduler to backtrack).
You'll notice that this is very conservative with respect to managing compile time and preserving the decisions made by earlier passes.
You could follow that basic strategy and simply adjust the priority of heuristics for your target. You can add better up-front analysis to detect code patterns for each block before prioritizing heuristics. Or you could implement a completely different strategy. For example, schedule for register pressure first, then reschedule for ILP.
I probably won't be able to help you must more than that. Matthias has taken over maintenance of MISched. I think it would help if you give more background on your situation (sorry if I haven't paid attention or have forgotten). Is this PowerPC? Inorder/out-of-order?
Thank you so much for introducing the design philosophy of MISched. It really helps.
About the background on our situation, the problem we are trying to solve is that reassociation pass introduces a lot of live range interference unnecessarily because of the way how it inserts the intermediate results. For the testcase exposing the problem, it is quite special in that the instructions we care about are all in the same BB. Most of the defs have only one use and most of the uses have only one def. So Dehao proposed a limited live range shrinking pass in llvm IR phase to solve it. We may extend the live range shrinking in the future because we have seen other cases before that would be benefited from more general live range shrinking.
Sanjay and Andrea made us notice that scheduling may already achieve the same result as the shrinking pass we proposed, at least for the motivational testcase, so we looked into MISched. Dehao has found that even we enabled existing generic scheduling to cross call, the more complex DAG can lead to a better but still unoptimal solution, and the unoptimal issue is not easy to solve without a global picture of the whole basicblock. This makes us incline to implement it in a separate pass.
The problem was found on x86, which is the target we care the most right now, but I think the problem is platform independent.
I think the algorithm needs to be explained. It seems reasonable to argue that single def/use live ranges should be kept short. But is there any reason at all to run this before other IR passes? It seems natural to run this after machine LICM and Sinking.
The time to shrink live ranges is after all the global machine code
motion is done if the goal is handle as many cases as possible.
Doing code motion in machine code is tricky because you have to worry
about physical registers, flags and such. But those are exactly the
things you need to worry about to avoid generating terrible code. For
example, you don't want to hoist something over a call that defines a
physreg or is used by a physreg copy!
That said, if reassociate is making silly decisions about where to
place operations, I wouldn't see a problem fixing that directly.
We initially tried to fix it in reassociate, but decided to have it fixed separate because we do not want reassociate to become too complicated. And also this should also be able to fix live-range issues introduced by user. I'll try to move this pass to Machine level.
I've updated the patch to implement this in Machine Level. The down side is that it needs to update a lot of unittests especially those autogenerated llc tests.
At this point, the patch is well tested with internal google workload, and seems to be valid. But I may ignore other subtle cases that's not covered by my testing. I'd appreciate if someone can help point out.
Thanks,
Dehao
I had a quick look at the code, and I noticed that you never check for the presence of DBG_VALUE instructions in a basic block.
I think that your pass would not work as expected with debug info.
More in general:
- When you check the number of uses of a register, you should skip debug values.
- When you move an instruction, you should also move its associated debug values.
A lot of tests have changed as a result of this patch. However, I think there is a value in adding your original test cases involving function calls.
Could you please add a new MIR test with code from (at least one of) your original small test cases?
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
113 | You should probably skip debug values. | |
116 | s/instrctuion/instruction | |
127 | s/coalesed/coalesced | |
143 | We want to skip uses that are debug instructions. | |
148 | Same. We don't want to count debug uses. |
When you move an instruction, you should also move its associated debug values.
See for example how this is done by MachineSink in method MachineSinking::SinkInstruction() (see file MachineSink.cpp).
For every instruction INSN that is a candidate for sinking, MachineSink collects all the DBG_VALUE instructions which are associated to INSN.
When INSN is moved to a different location, all the associated debug values (if any) are moved too.
I think you should do something similar.
About your new test case. I think a MIR test would be even better. That said, I don't have a strong opinion (I am okay with your new test).
In case, could you please regenerate the CHECK lines using `update_llc_test_checks.py? The test should be small, so I don't expect too many CHECK lines inserted by the script.
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
113 | Why are you skipping compares? |
Sorry, did not address this in the last patch. Done now. Note that I do not want to touch unnecessary files in this change. So I simply copied the code from MachineSink.cpp, and will refactor it to a utility function in MachineInstr.h/.cpp in a follow-up patch.
About your new test case. I think a MIR test would be even better. That said, I don't have a strong opinion (I am okay with your new test).
In case, could you please regenerate the CHECK lines using `update_llc_test_checks.py? The test should be small, so I don't expect too many CHECK lines inserted by the script.
Done, but I'm not sure if that's the right thing to do. update_llc_test_checks makes the test too fragile: any subtle change of the codegen would break it. And it hides what the author really want to test and follow-up patch writers may simply run the script and regenerate the test (without understanding what it's testing, or they need to spend much time on what the original purpose is). BTW, what's MIR test? could you point me to one example? Thanks.
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
113 | In many architectures, cmp should stay together with jmp instruction for performance purpose. We do not want to break this in this pass. |
No problem. Thanks!
About your new test case. I think a MIR test would be even better. That said, I don't have a strong opinion (I am okay with your new test).
In case, could you please regenerate the CHECK lines using `update_llc_test_checks.py? The test should be small, so I don't expect too many CHECK lines inserted by the script.Done, but I'm not sure if that's the right thing to do. update_llc_test_checks makes the test too fragile: any subtle change of the codegen would break it. And it hides what the author really want to test and follow-up patch writers may simply run the script and regenerate the test (without understanding what it's testing, or they need to spend much time on what the original purpose is). BTW, what's MIR test? could you point me to one example? Thanks.
Right, I think it your test is okay.
About MIR tests, you can have a look at this page: http://llvm.org/docs/MIRLangRef.html
In particular, section "MIR Testing Guide" (and the next two sub-sections) are quite useful imho.
For example, you could do something like:
llc my-new-test.ll -stop-after=machine-sink -o test.mir
Then, test.mir with a RUN line which only tests your new pass
For example:
llc -o - %s -mtriple=x86_64-- -run-pass=lrshrink | FileCheck %s
The advantage is that you test your new pass in isolation.
But, as I wrote, I think your test is okay.
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
113 | For targets that don't have allocatable flag regsiters, your later check for physical regs should prevent moving these compares (and any other non-compare flag setting instructions). For targets that keep the results of compares in general purpose registers, this seems like a missed optimization. Also, if it is important that the compare be just before the branch, the scheduler should be taking care of that. If you don't agree that this check should be removed can you put a comment explaining why this check for compares is being done? | |
154 | Typo: 'physicall' -> 'physical' |
Thanks for the explanation. I've reverted the test to its original form. And after checking the mir test output, I think it's better to keep it simple in its original phone, and make sure that later scheduler does not undo the change in this patch.
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
113 | That makes sense. I've removed the "compare" check. |
Can you collect spill code stats and benchmark scores from the test suite?
Can you get someone to run benchmarks on arm64 as well?
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
163 | Why do you care how many instructions are using the register being defined here? |
The patch did not change test-suite benchmark scores.
The patch reduced the total static spill number for the entire test suite from 110283 to 110232
Unfortunately we do not have arm64 machine to run the performance test. But based on its impact on x86, llvm testsuite does not seem to suffer from register pressure, thus it should have minimal perf impact.
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
163 | Right, I should not. Removed the check. |
lib/CodeGen/InlineSpiller.cpp | ||
---|---|---|
43–50 ↗ | (On Diff #98240) | I think that you have uploaded a wrong patch. |
1031–1032 ↗ | (On Diff #98240) | Same. |
lib/CodeGen/InlineSpiller.cpp | ||
---|---|---|
43–50 ↗ | (On Diff #98240) | Oops, forgot to revert it... Thanks for catching this :) |
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
11–13 | Use /// \file, see http://llvm.org/docs/CodingStandards.html#file-headers It would also be a good idea to add a sentence describing the rules/heuristic that decide what instructions get moved. | |
35 | This seems like a bold statement for a pass that moves instructions around. I would only expect to see AU.setPreservesCFG() here. | |
53 | We tend to use doxygen /// to document functions. | |
58–59 | Use references for things that cannot be nullptr. That also saves you assert(New != nullptr). | |
80 | You probably want MachineBasicBlock::iterator here and in several other places as you do not plan to break bundles apart. | |
94–95 | Then why not do so in this patch. | |
102 | I'm surprised this compiles. In each case you should use MI.getIterator() for clarity. Could also use DI = std::next(...) | |
103–104 | how about for(MachineInstr &DMI : make_range(DI, DE)) { ... } | |
112 | // end anonymous namespace | |
115–116 | This seems unnecessary the for loop below will abort shortly enough if the function is empty. Completely empty functions are also extremely rare so that this shortcut probably takes more time than it saves. | |
134–135 | Use MachineBasicBlock::iterator | |
136 | Use a reference for things that cannot be nullptr. | |
143 | No need for auto here just write MachineOperand | |
157–158 | You could move this check into the !isSafeToMove() path. | |
167 | unsigned for counters. | |
169 | Avoid auto. | |
174–175 | Maybe assign MO.getReg() to a variable it is used 5 times here. |
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
51 | You never seem to be using negative numbers, consider using unsigned instead of int. | |
65 | M[Old]? | |
66 | M[New] or better avoid the double lookup with M.count(New) above and use an M.find(New) for both. | |
85 | Why not pass a MachineBasicBlock &MBB as parameter to the function instead of the iterator and doing the Start->getParent()->instr_end() dance? You could have just used for (MachineInstr &I : MBB) {} in that case. | |
125 | I would have expected this outside the loop so it can be reused (considering that you also went through the trouble of clear()ing it inside BuildInstOrderMap. | |
126 | Collecting the SawStore value is unnecessary. The only user appears to be MachineInstr::isSafeToMove() below which writes(!) the variable. | |
151 | Barrier. | |
212–216 | There's a variant of splice that can move a sequence of instructions which should work nicely here. It is also not clear to me if moving the debug instructions is legal here. They are not considered when checking how early an instruction can move so you are possibly moving the debug instruction above a point where the value its operand is defined. |
Thanks for the review.
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
65 | M is const map, thus cannot use M[Old]. | |
85 | Because Start may not be MBB.begin(), but could also be instruction that has side effect. The reason I used Start->getParent()->instr_end() instead of passing in MBB and use MBB.instr_end() is because I want to pass one less parameter to make the function clearer. | |
94–95 | Code updated to inline the logic into caller to make the logic clearer. | |
126 | SawStore is both input and output of MachineInstr::isSafeToMove(). | |
212–216 | Done For the Debug Value, it should be legal as the same logic is used by MachineLICM. |
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
135 | Why don't you refresh SawStore here? |
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
94–95 | You should early exit if this function must be skipped. if (skipFunction(*MF.getFunction()) return false; We want to be able to 'opt-bisect' this pass. | |
178 | I think that you should also skip debug vaue instructions. The loop at line 192 works under the assumption that a MI's DBG_VALUE always follows directly after it. This is also suggested by your comment at line 189. If you don't skip debug values, then you potentially break the above-mentioned assumption. |
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
94–95 | skipFunction added. Thanks for the comment. This will not be run in optnone as in: if (getOptLevel() != CodeGenOpt::None) addPass(&LiveRangeShrinkID); | |
135 | Not sure if I understand the comment. SawStore is updated in MI.isSafeToMove. Do you mean that I can remove the previous "SawStore = BuildInstOrderMap", and set SawStore's initial value to "false"? | |
178 | Done. Great catch! |
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
135 | My understanding is that if you meet that MI has side effect and it is a barrier you will not move later instruction through this one. SawStore is also a kind of barrier and you cannot move instruction which intersects with the store instruction. After you set the new barrier you can reset SawStore. Say if you saw store before new barrier and there is no stores after the barrier you are free to move instruction which do not intersect with stores. However if SawStore has been set to true once I do not see the place where it can be to reset to false. Do I miss anything? However the overall logic around SawStore is not clear to me. SawStore is set in BuildInstOrderMap. So it detects where there is a store in the tail. It would prevent us re-ordering instruction (we move instruction up always) even if store is located later. Is it expected? |
Currently, we set 'SawStore' if there is at least one 'mayStore' instruction in the scheduling region.
However, we could provide a much more accurate information about where 'mayStore' instructions are in the code.
For example, 'BuildInstOrderMap' could populate a (ordered) map '<order_number, MI*>', where MI is a 'mayStore' instruction, and 'order_map' is the instruction number in the sequence.
Later on, you could query that map to check if it is safe to move a load from loc_B to loc_A. If the map contains a store with an order_number in range [loc_A, loc_B], then you know that it is not safe to move it.
This is just an idea for a future improvement.
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
82–86 | We can early exit from the loop in BuildInstOrderMap as soon as we encounter an instruction with unmodeled side effects. An instruction with unmodeled side effects is essentially a scheduling barrier. | |
135 | You are right. When BuildInstrOrderMap() is called, SawStore should be updated too. Otherwise we risk to prevent the reordering of loads in the new scheduling region. | |
193–201 | It would be nice to have a STATISTIC that counts the number of instructions reordered by this pass. |
Thanks for the reviews!
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
82–86 | That would actually make the compile time slower by adding additional check on the frequent path (BuildInstOrderMap), as instructions with unmodeled side effects is extremely rare in real programs. | |
135 | I've updated the logic to update SawStore on the fly and reset to false when an instruction with unmodeled side effect is met. |
Accepting this now as it looks reasonably safe and fast to me (nitpicks below).
It is a conservative/simple heuristic but given the fact that we lack other means of global code motion (in the sense of inter-basic block) in CodeGen today this is fine.
Please wait for some of the other reviewers which showed interest here to accept before committing.
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
11–13 | Needs three slashes to be recognized as doxygen. (I know this is wrong in a lot of existing code, but we should get it right in new code) | |
113 | Maybe move UseMap outside the loop and clear() it each iteration instead so there is a chance that memory allocations can be reused. | |
212–216 | I'm not sure why this would work. Maybe you should double check by compiling the llvm-testsuite (or some other set of programs) with a debug compiler and -g... |
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
212–216 | glad that I run this with -g and actually caught a bug: when I move instructions together with debug info, I should increment the Next pointer, otherwise the iterator will iterate to instructions that have already been processed. Updated the implementation and added test coverage. |
Thanks all for the reviews. Any other comments/concerns? If not, I'll commit the patch tomorrow.
Thanks for fixing the issue with debug values.
The patch still looks good to me.
A couple of comments can be slightly improved (see below).
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
167 | Please remember to add a period at the end of each sentence. | |
183 | s/shrinked/shrunk | |
186 | You should update this comment since we also skip debug values now. |
I'll submit the patch now to allow rest of today to catch potential buildbot breakage. Thanks again for all the reviews.
Dehao
Sorry about the regression. As Quentin pointed out, moving instructions close to phi should not be a problem as the algorithm checks the moved instruction uses >1 single-use defs. Looking at the example of the regression case, the transformation looks reasonable if we only consider shrinking live-ranges. Could you provide an IR that I can try reproduce the issue and see why register allocation is worse?
During the mean time, I can prepare a patch to move this to x86-only, or revert the patch if necessary.
patch to move the pass to x86 is proposed: https://reviews.llvm.org/D33294
Will have a separate patch to fix the REG_SEQUENCE handling once I get a test that I can include in unittesting. I suppose the fix is: ignore instruction whose def register size is no less than the sum of the use register sizes?
Thanks,
Dehao
REG_SEQUENCE is typically used to construct tuple register classes. So you get a set of values in consecutively numbered registers (you may watch my llvm dev meeting talk to see some introduction).
In any case I think the safest solution here would be to NOT trigger on REG_SEQUENCE but on register classes; For the REG_SEQUENCES in question a number of small 1-register values are shrunk while extending a larger multi-register value, so the initial assumption that everything is well when you shrink more ranges than you extend is wrong. So a conservative answer here would be to simply bail out if the register classes of the vregs you are shrinking/extended don't match up.
Hi, escha,
Could you send me an IR file or snippet that could trigger this performance issue? I can include it in the patch as unittest.
Thanks,
Dehao
Updated the implementation and tests:
- Do not hoist instructions if its def MO does not share the same register class with the use MO.
- Include both instruction order and MI in the use map. Using instruction order alone is not good enough because when instruction get hoisted, its order will be the same as the next instruction. In this case, we need to ensure the insertion point is after the last barrier, i.e. last instruction MO (implicitly defined by MI) has been used.
PTAL, Thanks!
Register class logic looks good to me with the nitpicks below addressed.
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
165–166 | This should better be !TargetRegisterInfo::isVirtualRegister(Reg) which is not the same as isPhysicalRegister() because a register operand can also be NoReg (=0) which shouldn't stop movement. | |
178–179 |
/// The heuristic does not handle different register classes yet (registers of different sizes, looser/tighter constraints).
|
update
lib/CodeGen/LiveRangeShrink.cpp | ||
---|---|---|
178–179 | Updated the logic to make sure that DefMO and MO will always be virtual register. |
Use /// \file, see http://llvm.org/docs/CodingStandards.html#file-headers
It would also be a good idea to add a sentence describing the rules/heuristic that decide what instructions get moved.