This is an archive of the discontinued LLVM Phabricator instance.


Differential D44635

[MISched] Avoid increased spilling with bidirectional pre-RA scheduling
Needs Review, Public

Authored by jonpa on Mar 19 2018, 9:42 AM.


Details

Reviewers

uweigand
atrick
fhahn

Summary

I am investigating the effects of enabling bidirectional pre-RA scheduling on SystemZ. My expectation is simply that this should generally be better, but at the same time I see that e.g. X86 does not seem to use it, so I must ask: what is the general recommendation?

My second question is what you think of the experiment described below. Does it make sense, or is the effect too marginal? Could it become a common-code improvement, or would it be SystemZ-specific?

I tried simply enabling bidirectional scheduling on benchmarks with '-mllvm -misched-topdown=false -mllvm -misched-bottomup=false'. I expected somewhat better results overall, e.g. in spilling and resource balancing. However, I noticed that spilling actually increased, so I looked further into this.

I reduced one test case (misched-bidir-spill.ll in patch):

- Single block (no back-edge).
- VR32Bit pressure is high in input order: 33 (> 32).
- A lot of (VR) defs with a single use, quite close to each other in the input.
- No spilling with bottom-up scheduling, but 6 live ranges spilled with bidirectional (1 spilled without misched).
- The resource balancing and latency heuristics are only active for bidirectional. The latency heuristic seems to be the cause of the spilling: it is too aggressive.
- The reg pressure heuristics are checked before latency, but they do not manage to reduce pressure once it has become high in the midst of scheduling. They seem to miss the big picture and therefore fall through to the latency heuristic. It is not enough to just compare the pressure diffs of two SUs: they might be the same even though the SUs have very different NodeNums.
- Simple idea: disable the latency heuristic when reg pressure is high, for the specific SUs that affect that pressure set. This assumes that the input order is regpressure friendly, and that we wouldn't want to pull things further together if we know we are very close to spilling.
- The resource balancing heuristic could have a similar bad side effect given its position in tryCandidate(), but at least in this case those instructions use the same busy (vector) FU, so it does not trigger.
- A margin of 1 against the pressure set limit seems to work best (it is needed for the test case; without it, 1 live range spilled).
- An alternative approach might be to add a fourth reg-pressure heuristic, something like "Prefer Cand if it is involved in a high pressure set and is def + *one-use* connected." Does that seem better to anyone?
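The "simple idea" in the list above can be sketched outside of LLVM as a small standalone function. This is only an illustration under stated assumptions: `PressureChangeSketch` and `keepLatencyHeuristic` are hypothetical names, plain vectors stand in for the scheduler's pressure tracking, and the default margin of 1 mirrors the margin mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one entry of an SU's pressure diff: the pressure
// set it touches and the change in units. Not an LLVM type.
struct PressureChangeSketch {
  unsigned PSet;
  int UnitInc;
};

// Returns true if the latency heuristic should stay enabled for a candidate.
// CurrPressure[i] and Limit[i] describe pressure set i. A set counts as
// "high" once CurrPressure[i] + Margin reaches Limit[i].
bool keepLatencyHeuristic(const std::vector<unsigned> &CurrPressure,
                          const std::vector<unsigned> &Limit,
                          const std::vector<PressureChangeSketch> &SUDiff,
                          unsigned Margin = 1) {
  assert(CurrPressure.size() == Limit.size() && "one limit per pressure set");
  for (std::size_t PSet = 0; PSet < CurrPressure.size(); ++PSet) {
    // This set is comfortably below its limit, so it imposes no restriction.
    if (CurrPressure[PSet] + Margin < Limit[PSet])
      continue;
    // The set is at or near its limit: disable latency if the SU touches it.
    for (const PressureChangeSketch &PC : SUDiff)
      if (PC.PSet == PSet)
        return false;
  }
  return true;
}
```

With limits of 32 for two sets and current pressures of 10 and 31, a candidate that only touches set 0 keeps the latency heuristic, while a candidate that touches set 1 loses it, regardless of the sign of its pressure change.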

This patch tries the first simple idea of disabling the latency heuristic (if -lesslatency is passed). I did some static benchmarking on SystemZ (z13) and got these numbers:

Impact:                           master BU    master BI   BI + patched
Number of spilled live ranges     49595        50305       49649
Innermost loops only (**):
 Spilled ranges - read            8038         8062        8051
 Spilled ranges - write           126          129         128
 Spilled ranges - read+write      2856         2909        2863
Resource queues (pre-emit)        166076       165720      165758
Heights of DAGs (pre-emit) (*)    3196190      3179874     3175386

Number of spilled live ranges: These go up when switching from BottomUp to BiDirectional, but the increase is mostly remedied by the patch.
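To put the spill counts from the table in proportion, the relative change against the bottom-up baseline can be computed with a small helper. `percentChange` is a hypothetical name used only for this illustration; the inputs are the "Number of spilled live ranges" figures quoted above.

```cpp
#include <cassert>

// Relative change of Value against Baseline, in percent.
double percentChange(double Baseline, double Value) {
  assert(Baseline != 0.0 && "baseline must be non-zero");
  return (Value - Baseline) / Baseline * 100.0;
}
```

Calling `percentChange(49595, 50305)` gives roughly 1.4 (master BI versus master BU), and `percentChange(49595, 49649)` gives roughly 0.1 (BI with the patch versus master BU).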

Looking at just innermost loops, I checked for intervals that are either just reloaded or spilled, or both. It seems that the innermost loops are affected the same way, but perhaps more marginally.

Resource queues (pre-emit): This is the sum of the ResourcesCosts for the scheduled SUs in pre-emit. Since the pre-RA resource balancing is activated with bidir, these costs are reduced.
Heights of DAGs (pre-emit): To get a feel for the impact of the pre-RA latency heuristic, I summed the pre-emit DAG heights. The more physical registers are used, the more freedom the pre-emit scheduler has and the lower the DAGs become. This also improved.
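As a rough illustration of the DAG height metric, the sketch below computes the longest latency-weighted path through a toy dependency graph. It assumes that "height" means the critical path length in cycles; `ToyNode` and `dagHeight` are hypothetical names and do not correspond to the scheduler's own data structures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy node: a latency in cycles plus the indices of its
// predecessors. Nodes must be stored in topological order, so every
// predecessor index is lower than the node's own index.
struct ToyNode {
  unsigned Latency;
  std::vector<std::size_t> Preds;
};

// Length of the longest latency-weighted path through the DAG.
unsigned dagHeight(const std::vector<ToyNode> &Nodes) {
  std::vector<unsigned> Finish(Nodes.size(), 0);
  unsigned Height = 0;
  for (std::size_t I = 0; I < Nodes.size(); ++I) {
    // A node can start once its latest predecessor has finished.
    unsigned Start = 0;
    for (std::size_t P : Nodes[I].Preds) {
      assert(P < I && "nodes must be in topological order");
      Start = std::max(Start, Finish[P]);
    }
    Finish[I] = Start + Nodes[I].Latency;
    Height = std::max(Height, Finish[I]);
  }
  return Height;
}
```

Three nodes of latency 4 that depend on each other in a chain give a height of 12, while the same three nodes without any dependencies give a height of 4. Fewer false dependencies from register reuse therefore show up as a lower sum.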

(*) Number of DAGs during pre-emit were nearly identical (varied by 0.0125%).
(**) These numbers were obtained with experimental counters in RAGreedy::selectOrSplitImpl(), just before spilling a register:

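// Classify the live range about to be spilled by how it is accessed
// inside innermost loops.
bool Reads = false, Writes = false;
for (MachineRegisterInfo::reg_instr_iterator
       RI = MRI->reg_instr_begin(VirtReg.reg), E = MRI->reg_instr_end(); RI != E; ) {
  MachineInstr &MI = *RI++;
  if (MachineLoop *L = Loops->getLoopFor(MI.getParent())) {
    // L->empty() is true when L has no sub-loops, i.e. it is an innermost loop.
    if (L->empty()) {
      // readsWritesVirtualRegister() returns a (reads, writes) pair.
      std::pair<bool, bool> RW = MI.readsWritesVirtualRegister(VirtReg.reg);
      Reads |= RW.first;
      Writes |= RW.second;
    }
  }
}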
if (Reads && Writes)
  NumSpilledRanges_InnerLoop_readswrites++;
else if (Reads)
  NumSpilledRanges_InnerLoop_reads++;
else if (Writes)
  NumSpilledRanges_InnerLoop_writes++;

spiller().spill(LRE);

I also found other cases where spilling had increased even with this patch applied. In the few that I checked, several were related to a nondeterministic effect of merely picking from Top. See https://bugs.llvm.org/show_bug.cgi?id=36794.

I also found a case where the loop body got more parallelized, while a reg got spilled *around* the loop, which I thought was fine (That's when I tried the inner-loop statistics above).

There may be more reasons, but I think this patch addresses the main one.

I have not yet checked this further on other targets, tests, etc., since I am merely experimenting for now and waiting for your suggestions.


Event Timeline

jonpa created this revision. Mar 19 2018, 9:42 AM

Herald added subscribers: javed.absar, qcolombet, MatzeB. Mar 19 2018, 9:42 AM

Here are some experimental dumps of the pre-RA scheduling results for those who are interested. This is a little dump of the live-ranges for a particular MBB that is my own...

This is how the MBB looks after Bottom-up scheduling:

This is bidirectional, which results in the spills:

This is bidirectional with patch:

To me it is (somewhat ;-) clear that the latency heuristic result is too aggressive without patch... Note that with the patch, there is actually a lot more ILP without any spilling.

jonpa added reviewers: uweigand, atrick, fhahn. Mar 19 2018, 9:52 AM

Hi Jonas!

I think I stumbled upon a similar issue a while ago. When comparing instructions from the bottom and top, where one increases reg pressure while the other does not, I tweaked the heuristics to favor the instruction that does not increase the pressure. The patch is in D38164, but I did not really have time to follow up on it. I just tested it with your example code, and it does not produce spilled live ranges.
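The tie-break described in this comment can be sketched roughly as follows. This is not the code from D38164; `BoundaryCand`, `Pick`, and `preferLowerPressure` are hypothetical names, and a single integer stands in for the full pressure delta of each candidate.

```cpp
// Hypothetical outcome of comparing the best top and bottom candidates.
enum class Pick { Top, Bottom, Undecided };

// Hypothetical candidate summary: the pressure increase (in units) that
// scheduling this candidate would cause; values <= 0 mean no increase.
struct BoundaryCand {
  int PressureInc;
};

// Favor the candidate that does not increase register pressure when the
// other one does. Otherwise leave the decision to the remaining heuristics.
Pick preferLowerPressure(const BoundaryCand &Top, const BoundaryCand &Bot) {
  bool TopIncreases = Top.PressureInc > 0;
  bool BotIncreases = Bot.PressureInc > 0;
  if (TopIncreases && !BotIncreases)
    return Pick::Bottom;
  if (BotIncreases && !TopIncreases)
    return Pick::Top;
  return Pick::Undecided;
}
```

When both candidates increase pressure, or neither does, the sketch returns `Pick::Undecided` so that later heuristics can still decide.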

In D44635#1041995, @fhahn wrote:

Hi Jonas!

I think I stumbled upon a similar issue a while ago. When comparing instructions from the bottom and top, where one increases reg pressure while the other does not, I tweaked the heuristics to favor the instruction that does not increase the pressure. The patch is in D38164, but I did not really have time to follow up on it. I just tested it with your example code, and it does not produce spilled live ranges.

Hi Florian,

If the regpressure heuristics can be improved so that my patch is not needed, I can see that this would be even better. With that said, I still think that it seems wrong to increase ILP when register pressure is high, which is what my patch prevents...

I regathered the statistics today and included your patch also:

                                            --------- INNER LOOPS ONLY -----------   ------------ pre-emit ------------
BUILD                spilled live ranges    reloads    spills    reload-and-spills   Resource queues        DAG Heights
BOTUP                49615                  8042       126       2858                166226                 3195990
BIDir                50342                  8066       129       2921                165866                 3179598
D44635/BI            49668                  8055       128       2859                165956                 3175226
D38164/BI            49963                  8051       128       2886                165619                 3180176
D44635+D38164/BI     49607                  8037       129       2845                165870                 3183225

- D44635 seems more effective at preventing spilled live ranges, although also using D38164 seems to be a slight further improvement.
- D38164 seems to lead to better resource balancing, while D44635 is a slight regression. I wonder a bit why this is, since tryLatency() comes after the resource check in tryCandidate()... a random effect?
- D44635 seems to lead to more ILP (the pre-emit DAGs have less height), but both patches are still an improvement compared to bottom-up.

These differences are quite marginal, but then again, why switch to bidirectional if the numbers are not somewhat improved? Given these numbers, using one (or even both) of these patches seems to be needed to achieve this.

I'd love to hear some opinions and comments on the suitability of bidirectional scheduling on a highly OOO target like SystemZ. Am I right to think it *should* be somewhat preferred? What problems have X86 and other targets faced that prevent them from using it?

Revision Contents

Path                                          Size
lib/CodeGen/MachineScheduler.cpp              28 lines
test/CodeGen/SystemZ/misched-bidir-spill.ll   233 lines

Diff 138945

lib/CodeGen/MachineScheduler.cpp

[2,837 lines not shown] static int biasPhysRegCopy(const SUnit *SU, bool isTop) {
   // immediately to free the dependent. We can hoist the copy later.
   bool AtBoundary = isTop ? !SU->NumSuccsLeft : !SU->NumPredsLeft;
   if (TargetRegisterInfo::isPhysicalRegister(
           MI->getOperand(UnscheduledOper).getReg()))
     return AtBoundary ? -1 : 1;
   return 0;
 }

+// EXPERIMENTAL
+static cl::opt<bool> LESSLATENCY("lesslatency", cl::init(false));
+
 void GenericScheduler::initCandidate(SchedCandidate &Cand, SUnit *SU,
                                      bool AtTop,
                                      const RegPressureTracker &RPTracker,
                                      RegPressureTracker &TempTracker) {
   Cand.SU = SU;
   Cand.AtTop = AtTop;
   if (DAG->isTrackingPressure()) {
     if (AtTop) {
[14 lines not shown] if (AtTop) {
         RPTracker.getUpwardPressureDelta(
           Cand.SU->getInstr(),
           DAG->getPressureDiff(Cand.SU),
           Cand.RPDelta,
           DAG->getRegionCriticalPSets(),
           DAG->getRegPressure().MaxSetPressure);
       }
     }
+
+    if (LESSLATENCY && Cand.Policy.ReduceLatency) {
+      // Don't schedule SU for latency if it is related to a high pressure.
+      // This assumes input order is register pressure friendly, and that
+      // the regpressure heuristics may fail to see the bigger picture.
+      const std::vector<unsigned> &CurrPressure = RPTracker.getRegSetPressureAtPos();
+      for (unsigned PSet = 0; PSet < CurrPressure.size(); ++PSet) {
+        unsigned Limit = Context->RegClassInfo->getRegPressureSetLimit(PSet);
+        if (CurrPressure[PSet] + 1 < Limit)
+          continue;
+
+        // Look at the cached pressure diff of SU.
+        PressureDiff &PDiff = DAG->getPressureDiff(SU);
+        for (const PressureChange &PC : PDiff) {
+          if (!PC.isValid())
+            break;
+          if (PC.getPSet() == PSet) {
+            Cand.Policy.ReduceLatency = false;
+            goto Done;
+          }
+        }
       }
+      Done:;
+    }
+  }

   DEBUG(if (Cand.RPDelta.Excess.isValid())
           dbgs() << " Try SU(" << Cand.SU->NodeNum << ") "
                  << TRI->getRegPressureSetName(Cand.RPDelta.Excess.getPSet())
                  << ":" << Cand.RPDelta.Excess.getUnitInc() << "\n");
 }

 /// Apply a set of heursitics to a new candidate. Heuristics are currently
 /// hierarchical. This may be more efficient than a graduated cost model because
[766 lines not shown]

test/CodeGen/SystemZ/misched-bidir-spill.ll

This file was added.


				; NOTE: -lesslatency is EXPERIMENTAL (temporary)
				; RUN: llc -mtriple=s390x-linux-gnu -mcpu=z13 -misched-topdown=false -misched-bottomup=false -stats -lesslatency < %s 2>&1 | FileCheck %s
				; REQUIRES: asserts

				; CHECK-NOT: spilled live ranges

				%0 = type { %1, %35 }
				%1 = type { %2, %30 }
				%2 = type { %3, %16 }
				%3 = type { %4, %4, %9, %13 }
				%4 = type { %5 }
				%5 = type { %6 }
				%6 = type { %7, %7, i64* }
				%7 = type { %8, [4 x i8] }
				%8 = type <{ i64*, i32 }>
				%9 = type { %10 }
				%10 = type { %11 }
				%11 = type { %12, %12, %12* }
				%12 = type { i32, i32 }
				%13 = type { %14 }
				%14 = type { %15 }
				%15 = type { i32, i32, i32* }
				%16 = type { %17, %21, %4, %4, %24, %27 }
				%17 = type { %18 }
				%18 = type { %19 }
				%19 = type { %20, %20, %20* }
				%20 = type { [2 x i32] }
				%21 = type { %22 }
				%22 = type { %23 }
				%23 = type { i32, i32, i32* }
				%24 = type { %25 }
				%25 = type { %26 }
				%26 = type { i8, i8, i8* }
				%27 = type { %28 }
				%28 = type { %29 }
				%29 = type { i8, i8, i8** }
				%30 = type { %31, %21, %4, %4, %24, %27 }
				%31 = type { %32 }
				%32 = type { %33 }
				%33 = type { %34, %34, %34* }
				%34 = type { [4 x i32] }
				%35 = type { %36, %21, %4, %4, %24, %27, %4 }
				%36 = type { %37 }
				%37 = type { %38 }
				%38 = type { %39, %39, %39* }
				%39 = type { [6 x i32] }
				%40 = type { %41 }
				%41 = type { [3 x double] }

				define void @fun() {
				%1 = load %0, %0* undef, align 8
				%2 = getelementptr inbounds %0, %0* %1, i64 undef
				%3 = load %0, %0* %2, align 8
				%4 = getelementptr inbounds %0, %0* %3, i64 0, i32 0, i32 1, i32 0, i32 0, i32 0, i32 0
				%5 = load %34, %34* %4, align 8
				%6 = getelementptr inbounds %0, %0* %3, i64 0, i32 0, i32 0, i32 1, i32 0, i32 0, i32 0, i32 0
				%7 = load %20, %20* %6, align 8
				%8 = load i32, i32* undef, align 4
				%9 = sext i32 %8 to i64
				%10 = load %40, %40* undef, align 8
				%11 = getelementptr inbounds %20, %20* %7, i64 undef, i32 0, i64 0
				%12 = load i32, i32* %11, align 4
				%13 = sext i32 %12 to i64
				%14 = getelementptr inbounds %40, %40* %10, i64 %13, i32 0, i32 0, i64 0
				%15 = load double, double* %14, align 8
				%16 = getelementptr inbounds %34, %34* %5, i64 undef, i32 0, i64 0
				%17 = load i32, i32* %16, align 4
				%18 = sext i32 %17 to i64
				%19 = getelementptr inbounds %20, %20* %7, i64 %18, i32 0, i64 0
				%20 = load i32, i32* %19, align 4
				%21 = sext i32 %20 to i64
				%22 = load double, double* undef, align 8
				%23 = getelementptr inbounds %20, %20* %7, i64 undef, i32 0, i64 0
				%24 = load double, double* undef, align 8
				%25 = load i32, i32* %23, align 4
				%26 = sext i32 %25 to i64
				%27 = getelementptr inbounds %40, %40* %10, i64 %26, i32 0, i32 0, i64 0
				%28 = load double, double* %27, align 8
				%29 = load double, double* null, align 8
				%30 = getelementptr inbounds %20, %20* %7, i64 undef, i32 0, i64 0
				%31 = getelementptr inbounds %40, %40* %10, i64 undef, i32 0, i32 0, i64 1
				%32 = load double, double* %31, align 8
				%33 = load i32, i32* %30, align 4
				%34 = sext i32 %33 to i64
				%35 = getelementptr inbounds %40, %40* %10, i64 %34, i32 0, i32 0, i64 1
				%36 = load double, double* %35, align 8
				%37 = getelementptr inbounds %40, %40* %10, i64 %21, i32 0, i32 0, i64 1
				%38 = load double, double* %37, align 8
				%39 = load double, double* undef, align 8
				%40 = load double, double* null, align 8
				%41 = getelementptr inbounds %40, %40* %10, i64 undef, i32 0, i32 0, i64 2
				%42 = load double, double* %41, align 8
				%43 = getelementptr inbounds %40, %40* %10, i64 %21, i32 0, i32 0, i64 2
				%44 = load double, double* %43, align 8
				%45 = getelementptr inbounds %40, %40* %10, i64 undef, i32 0, i32 0, i64 2
				%46 = load double, double* %45, align 8
				%47 = fmul double undef, undef
				%48 = fmul double %47, %39
				%49 = fmul double %15, undef
				%50 = fmul double %38, %49
				%51 = fmul double undef, %32
				%52 = fmul double undef, undef
				%53 = fmul double %38, %52
				%54 = fmul double %36, undef
				%55 = fdiv double 1.000000e+00, undef
				%56 = fmul double %29, %39
				%57 = fmul double %22, %56
				%58 = fmul double 0.000000e+00, %40
				%59 = fmul double 0.000000e+00, %39
				%60 = fadd double %59, undef
				%61 = fmul double %55, %60
				%62 = fmul double %61, 0x3FC5555555555555
				%63 = fmul double undef, %44
				%64 = fmul double %22, %63
				%65 = fmul double %24, undef
				%66 = fmul double %38, %65
				%67 = fmul double %40, 0.000000e+00
				%68 = fmul double undef, %42
				%69 = fmul double undef, %68
				%70 = fmul double undef, undef
				%71 = fmul double 0.000000e+00, %67
				%72 = fmul double %40, %42
				%73 = fmul double 0.000000e+00, %42
				%74 = fmul double undef, 0.000000e+00
				%75 = fmul double undef, undef
				%76 = fmul double %29, undef
				%77 = fmul double %39, 0.000000e+00
				%78 = fmul double %57, %46
				%79 = fadd double %78, undef
				%80 = fmul double 0.000000e+00, %44
				%81 = fsub double %79, %80
				%82 = fmul double %54, %44
				%83 = fsub double %81, %82
				%84 = fmul double %22, %77
				%85 = fmul double %29, %84
				%86 = fsub double %83, %85
				%87 = fmul double %38, undef
				%88 = fmul double %87, %40
				%89 = fsub double %86, %88
				%90 = fmul double undef, %72
				%91 = fmul double %29, %90
				%92 = fadd double %91, %89
				%93 = fmul double %36, %73
				%94 = fmul double %93, %44
				%95 = fadd double %94, %92
				%96 = fmul double undef, undef
				%97 = fmul double %38, %96
				%98 = fadd double %97, %95
				%99 = fmul double %22, undef
				%100 = fmul double %99, %46
				%101 = fadd double %100, %98
				%102 = fmul double undef, %42
				%103 = fmul double undef, %102
				%104 = fmul double %103, %44
				%105 = fadd double %104, %101
				%106 = fmul double 0.000000e+00, %44
				%107 = fsub double %105, %106
				%108 = fmul double undef, %46
				%109 = fsub double %107, %108
				%110 = fmul double %22, %42
				%111 = fmul double %36, %110
				%112 = fmul double %111, %46
				%113 = fadd double %112, %109
				%114 = fmul double %40, %70
				%115 = fsub double %113, %114
				%116 = fmul double %32, %69
				%117 = fadd double %116, %115
				%118 = fmul double %36, 0.000000e+00
				%119 = fsub double %117, %118
				%120 = fsub double %119, undef
				%121 = fmul double %38, %74
				%122 = fadd double %121, %120
				%123 = fmul double 0.000000e+00, undef
				%124 = fmul double %38, %123
				%125 = fadd double %124, %122
				%126 = fmul double %29, undef
				%127 = fmul double %22, %126
				%128 = fmul double %127, %46
				%129 = fadd double %128, %125
				%130 = fmul double %39, %66
				%131 = fsub double %129, %130
				%132 = fmul double %40, %64
				%133 = fadd double %132, %131
				%134 = fmul double %28, %44
				%135 = fmul double %38, %134
				%136 = fmul double %42, %135
				%137 = fadd double %136, %133
				%138 = fmul double %22, %44
				%139 = fmul double undef, %138
				%140 = fmul double %42, %139
				%141 = fsub double %137, %140
				%142 = fmul double undef, %44
				%143 = fmul double %38, %142
				%144 = fmul double %143, %46
				%145 = fsub double %141, %144
				%146 = fadd double undef, %145
				%147 = fmul double %29, undef
				%148 = fsub double %146, %147
				%149 = fsub double %148, undef
				%150 = fmul double %15, undef
				%151 = fmul double %38, %150
				%152 = fadd double %151, %149
				%153 = fsub double %152, undef
				%154 = fmul double %39, %75
				%155 = fsub double %153, %154
				%156 = fmul double %38, %71
				%157 = fadd double %156, %155
				%158 = fmul double %22, %67
				%159 = fmul double %29, %158
				%160 = fsub double %157, %159
				%161 = fmul double %15, 0.000000e+00
				%162 = fmul double %161, %44
				%163 = fsub double %160, %162
				%164 = fmul double %32, undef
				%165 = fmul double %164, %44
				%166 = fadd double %165, %163
				%167 = fmul double %15, %67
				%168 = fmul double %38, %167
				%169 = fadd double %168, %166
				%170 = fmul double %32, %158
				%171 = fsub double %169, %170
				%172 = fadd double undef, %171
				%173 = fsub double %172, undef
				%174 = fmul double %15, %58
				%175 = fmul double %174, %42
				%176 = fadd double %175, %173
				%177 = fmul double %55, %176
				%178 = fmul double %177, 0x3FC5555555555555
				store double %62, double* undef, align 8
				store double %178, double* undef, align 8
				ret void
				}