I am investigating the effects of enabling bidirectional pre-RA scheduling on SystemZ. My thinking is simply that this should generally be better, but at the same time I see that e.g. X86 apparently does not use it, so I have to ask: what is the general recommendation?
My second question is what you think of my experiment as described below. Does it make sense, or is the effect too marginal? Could it become a common-code improvement, or should it remain SystemZ specific?
I started by simply enabling bidirectional scheduling on benchmarks with '-mllvm -misched-topdown=false -mllvm -misched-bottomup=false'. I expected somewhat improved general performance from e.g. reduced spilling and better resource balancing. However, I noticed that spilling actually increased, so I looked further into this.
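For context, my understanding is that these options are handled in GenericScheduler::initPolicy(): when both are passed explicitly as false, neither direction is forced on the region policy, so the generic scheduler picks candidates from both boundaries. Roughly (a simplified sketch from memory, not the exact upstream code):

  // (Inside GenericScheduler::initPolicy(); MF is the current MachineFunction.)
  // Generic targets default to bottom-up pre-RA scheduling.
  RegionPolicy.OnlyBottomUp = true;

  // The subtarget may override the default policy.
  MF.getSubtarget().overrideSchedPolicy(RegionPolicy, NumRegionInstrs);

  // -misched-bottomup / -misched-topdown override everything when passed
  // explicitly. Passing both as 'false' clears both bits, so neither
  // boundary is forced, i.e. bidirectional scheduling.
  if (ForceBottomUp.getNumOccurrences() > 0) {
    RegionPolicy.OnlyBottomUp = ForceBottomUp;
    if (RegionPolicy.OnlyBottomUp)
      RegionPolicy.OnlyTopDown = false;
  }
  if (ForceTopDown.getNumOccurrences() > 0) {
    RegionPolicy.OnlyTopDown = ForceTopDown;
    if (RegionPolicy.OnlyTopDown)
      RegionPolicy.OnlyBottomUp = false;
  }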
I reduced one test case (misched-bidir-spill.ll in patch):
- Single block (no back-edge)
- VR32Bit pressure is high in input order: 33 (> 32)
- Many (VR) defs with a single use quite close to each other in the input order
- No spilling with bottom-up sched, but 6 live-ranges spilled with bidirectional (1 spilled without misched).
- Resource balancing and latency heuristics are only active with bidirectional. The latency heuristic seems to be the cause of the spilling: it is too aggressive.
- Reg pressure heuristics are checked before latency, but do not manage to reduce pressure once it has become high in the midst of scheduling. They seem to miss the big picture and therefore let candidates fall through to the latency heuristic. It is not enough to just compare the pressure diffs of two SUs: the diffs might be identical even though the SUs have very different NodeNums.
- Simple idea: disable the latency heuristic when reg pressure is high, for the specific SUs that affect that pressure set (see the sketch after this list). This assumes that the input order is reg-pressure friendly, and that we would not want to pull things further together when we know we are very close to spilling.
- The resource balancing heuristic could have a similar bad side effect given its position in tryCandidate(), but at least in this case those instructions use the same busy (vector) FU, so it does not trigger.
- A margin of 1 against the pressure set limit seems to work best (needed for the test case; without it, 1 live range is spilled).
- Another alternative approach might be to add a fourth reg-pressure heuristic along the lines of "Prefer Cand if it is involved in a high pressure set and is def + *one-use* connected." Does that seem better to anyone?
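To make the first idea concrete, here is a rough sketch of where it would hook into GenericScheduler::tryCandidate(), just before the latency heuristic. This only illustrates the intent, not the actual patch: the helper name and the choice of RPDelta.CurrentMax are made up for illustration, the existing guards around tryLatency() are omitted, and in the patch all of this is gated behind -lesslatency:

  // Hypothetical helper (illustration only): true if the candidate's pressure
  // diff touches a pressure set that is already within Margin units of its
  // limit in the current tracker state.
  static bool touchesHighPressureSet(
      const GenericSchedulerBase::SchedCandidate &C,
      const RegPressureTracker &RPTracker, const TargetRegisterInfo *TRI,
      const MachineFunction &MF, unsigned Margin = 1) {
    const PressureChange &PC = C.RPDelta.CurrentMax;
    if (!PC.isValid() || PC.getUnitInc() <= 0)
      return false;
    unsigned PSet = PC.getPSetOrMax();
    unsigned Limit = TRI->getRegPressureSetLimit(MF, PSet);
    unsigned CurPressure = RPTracker.getRegSetPressureAtPos()[PSet];
    return CurPressure + Margin >= Limit;
  }

  // ... in GenericScheduler::tryCandidate(), at the point where tryLatency()
  // is called (the existing ReduceLatency / acyclic-limit guards not shown):
  bool HighPressure =
      touchesHighPressureSet(TryCand,
                             TryCand.AtTop ? DAG->getTopRPTracker()
                                           : DAG->getBotRPTracker(),
                             TRI, DAG->MF) ||
      touchesHighPressureSet(Cand,
                             Cand.AtTop ? DAG->getTopRPTracker()
                                        : DAG->getBotRPTracker(),
                             TRI, DAG->MF);
  if (!HighPressure && tryLatency(TryCand, Cand, *Zone))
    return;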
This patch tries the first, simple idea of disabling the latency heuristic (when -lesslatency is passed). I did some static benchmarking on SystemZ (z13) and got these numbers:
Impact:

                                    master BU    master BI  BI + patched
Number of spilled live ranges           49595        50305         49649
Innermost loops only (**):
  Spilled ranges - read                  8038         8062          8051
  Spilled ranges - write                  126          129           128
  Spilled ranges - read+write            2856         2909          2863
Resource queues (pre-emit)             166076       165720        165758
Heights of DAGs (pre-emit) (*)        3196190      3179874       3175386
Number of spilled live ranges: goes up when switching from bottom-up to bidirectional, but this is mostly remedied by the patch.
- Looking at just the innermost loops, I counted intervals that are only reloaded, only spilled, or both. The innermost loops seem to be affected in the same way, but perhaps more marginally.
Resource queues (pre-emit): This is the sum of the ResourcesCosts for the scheduled SUs during pre-emit scheduling. Since the pre-RA resource balancing is activated with bidirectional, these costs are reduced.
Heights of DAGs (pre-emit): To get a feel for the impact of the pre-RA latency heuristic, I summed the pre-emit DAG heights. The more physical registers that get used, the more freedom the pre-emit scheduler has and the lower the DAGs become. This also improved.
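As a rough sketch of the kind of counter this is (the names here are made up for illustration), per-DAG height can be taken as the maximum SUnit height once the pre-emit DAG has been built, and accumulated into a statistic:

  #include "llvm/ADT/Statistic.h"
  #include "llvm/CodeGen/MachineScheduler.h"
  #include <algorithm>

  #define DEBUG_TYPE "machine-scheduler"
  using namespace llvm;

  STATISTIC(PreEmitDAGHeightSum, "Sum of pre-emit DAG heights");

  // Hypothetical counter: record the height of one pre-emit scheduling DAG as
  // the maximum (latency-weighted) height over its SUnits.
  static void recordDAGHeight(const ScheduleDAGMI &DAG) {
    unsigned MaxHeight = 0;
    for (const SUnit &SU : DAG.SUnits)
      MaxHeight = std::max(MaxHeight, SU.getHeight());
    PreEmitDAGHeightSum += MaxHeight;
  }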
(*) The number of DAGs during pre-emit was nearly identical (varied by 0.0125%).
(**) These numbers were obtained with experimental counters in RAGreedy::selectOrSplitImpl(), just before spilling a register:
  bool Reads = false, Writes = false;
  for (MachineRegisterInfo::reg_instr_iterator
         RI = MRI->reg_instr_begin(VirtReg.reg), E = MRI->reg_instr_end();
       RI != E; ) {
    MachineInstr &MI = *RI++;
    if (MachineLoop *L = Loops->getLoopFor(MI.getParent())) {
      if (L->empty()) { // Innermost loop only.
        Reads |= MI.readsWritesVirtualRegister(VirtReg.reg).first;
        Writes |= MI.readsWritesVirtualRegister(VirtReg.reg).second;
      }
    }
  }

  if (Reads && Writes)
    NumSpilledRanges_InnerLoop_readswrites++;
  else if (Reads)
    NumSpilledRanges_InnerLoop_reads++;
  else if (Writes)
    NumSpilledRanges_InnerLoop_writes++;

  spiller().spill(LRE);
I also found other cases where spilling had increased even with this patch applied. Among the few that I checked, several were related to a nondeterministic effect of merely picking from the Top boundary. See https://bugs.llvm.org/show_bug.cgi?id=36794.
I also found a case where the loop body became more parallelized while a register got spilled *around* the loop, which I thought was fine (that is when I added the inner-loop statistics above).
There may be more reasons, but I think this patch addresses the main one.
I have not yet checked this further on other targets, tests, etc., since I am for now merely experimenting and waiting for your suggestions.