This is an archive of the discontinued LLVM Phabricator instance.

[MachineScheduler] Favor instructions that do not increase pressure.
Needs RevisionPublic

Authored by fhahn on Sep 22 2017, 12:22 AM.

Details

Reviewers
MatzeB
qcolombet
Summary

I think all other things being equal, it could be beneficial to prioritize
candidates that do not increase register pressure, as this may uncover
new candidates that reduce register pressure.

I have been experimenting with using the MachineScheduler on ARM with
Cortex-A57, and this change helps to reduce spilling in some cases.
For the LLVM test-suite and SPEC2k, it improves the geomean of the
execution times compared to the old scheduler from -0.82% (MachineScheduler
for ARM on Cortex-A57) to -1.03% (same as before, but with this patch) and
eliminates a few big outliers. For SPEC2k6, it improves the geomean of
execution times from -0.24% to -0.29%.

For AArch64 I did some benchmark runs comparing master against this
patch.
Cortex-A57:

  • -0.62% geomean exec time for test-suite and SPEC2k
  • -0.29% for SPEC2k6
  • +0.19% SPEC2017 score

Cortex-A72:

  • -0.29% geomean exec time for test-suite and SPEC2k

The impact of this change on the unit tests is quite small. The
modified test case in this patch now uses one register less, with the
order of instructions slightly changed. The earliest fcvt is still
scheduled after 2 instructions, so latency-wise the schedule should not
have regressed.

The following AMDGPU tests also fail:

  • CodeGen/AMDGPU/attr-amdgpu-num-sgpr.ll (with -mtriple=amdgcn--amdhsa -mcpu=fiji -amdgpu-spill-sgpr-to-smem=1) Expected: NumSGPRsForWavesPerEU: 9 Actual: NumSGPRsForWavesPerEU: 14
  • CodeGen/AMDGPU/schedule-regpressure-limit2.ll (with -march=amdgcn -mcpu=fiji -misched=gcn-max-occupancy-experimental) Expected: NumSgprs: {{[1-5][0-9]$}} Actual: NumSGPRsForWavesPerEU: 78

I do not know anything about AMDGPU, so I cannot really say whether that
is good or bad.

@jonpa this might be interesting to you too, as I gathered that you are
working on switching SystemZ to the MachineScheduler.

Diff Detail

Event Timeline

fhahn created this revision.Sep 22 2017, 12:22 AM
fhahn added a subscriber: arsenm.Sep 22 2017, 12:24 AM

cc Matt, maybe you know whether the change in the AMDGPU tests is good or bad?

MatzeB edited edge metadata.Sep 22 2017, 11:57 AM
  • I vaguely remember trying something like this and having some crypto benchmarks produce bad schedules; I'll see if I can remember/find it
  • tryPressure seems like the wrong place to me, as it is used in 3 different contexts: (compared with target limits, compared with increase region limits, and the current max). From your description it sounds like we only want this behavior once.
  • Do you have a specific example where this helps?
  • I vaguely remember trying something like this and having some crypto benchmarks produce bad schedules; I'll see if I can remember/find it

Thanks for the pointer, I have a few crypto benchmarks I can run too.

  • tryPressure seems like the wrong place to me, as it is used in 3 different contexts: (compared with target limits, compared with increase region limits, and the current max). From your description it sounds like we only want this behavior once.

Thanks, I'll look into that.

  • Do you have a specific example where this helps?

The runtime of float-mm from the LLVM test-suite increased by around 15% because of additional spilling on ARM with the MachineScheduler enabled (on Cortex-A57). I can also have a look at some improvements on AArch64.

I cannot find that particular test, in the test-suite what's the exact name/path (or is it an internal one)?

lib/CodeGen/MachineScheduler.cpp
2777–2788

Why isn't this using tryGreater(), and isn't this the same as just changing getUnitInc() < 0 to getUnitInc() <= 0 in the earlier check?

MatzeB added a subscriber: atrick.Sep 25 2017, 8:28 PM

I think the crypto code I was looking at back then was John the Ripper (openbenchmarking.org / pts/john-the-ripper).

Just tried with/without your changes, and I see the descrypt part improving and the LM part regressing. So whether this is an improvement or a regression may depend on the situation, and we should find and look at examples.

I assume Andy designed the current behavior to give more freedom to the latency-minimizing code? So an out-of-order Cortex-A57 may not be the best test case for it; then again, if it improves the spilling situation and leads to a win on out-of-order cores, we could switch the behavior based on out-of-order vs. in-order.

fhahn added a comment.Sep 26 2017, 2:11 PM
  • tryPressure seems like the wrong place to me, as it is used in 3 different contexts: (compared with target limits, compared with increase region limits, and the current max). From your description it sounds like we only want this behavior once.

I tried only favoring instructions that do not increase pressure in each context separately. When using the current max, the results are very similar to the initial version of the patch and the AMDGPU failures go away. Using it in the other contexts yields smaller improvements on the set of benchmarks mentioned above.

I think the cryptocode I was looking at back then was john the ripper (openbenchmarking.org / pts/john-the-ripper).

Just tried with/without your changes and I see the descrypt part improving and the LM part regressing. So whether this is an improvement or regression may depend on the situation and we should find/look at examples.

I'll try to collect a list of examples where it makes things better/worse.

I assume Andy designed the current behavior to give more freedom to the latency minimizing code? So an out of order Cortex-A57 may not be the best test case for it; then again if it improves the spilling situation and leads to a win on out of order cores we could switch the behavior based on out-of-order/in-order.

I'll also run the benchmarks on an in-order core. I was under the impression that the scheduler prioritizes register needs over latency, as the latency of candidates is one of the last things checked in tryCandidate.

lib/CodeGen/MachineScheduler.cpp
2777–2788

Initially I used tryGreater above with getUnitInc() <= 0, but that version does not prioritize TryCand over Cand if TryCand decreases pressure and Cand does not change pressure, I think.

jonpa added a comment.Sep 28 2017, 4:46 AM

@jonpa this might be interesting to you too, as I gathered that you are working on switching SystemZ to the MachineScheduler.

Thanks for putting me on cc! I have a particular test case I am working with now which is kind of related. I tried your patch, but it didn't help in this particular case. See my proposed patch: https://reviews.llvm.org/D38351

MatzeB requested changes to this revision.Oct 11 2017, 4:47 PM
MatzeB added inline comments.
lib/CodeGen/MachineScheduler.cpp
2777–2788

I just looked at this patch again:
There actually already is tryLess(TryP.getUnitInc(), CandP.getUnitInc(), TryCand, Cand, Reason); below, and the important part here is that it is guarded by Cand.AtTop != TryCand.AtTop. Last time I meddled with comparing candidates across the top/bottom boundary, I found that we tend to produce terrible schedules for long basic blocks (well, Tom complained because of AMDGPU regressions, and as it turns out GPUs are more likely to have big scheduling regions). The reason is that it is easy to make decisions on one boundary that run contrary to the decisions being made at the other boundary. With that in mind, I'd rather reject this patch; to convince me otherwise, I would like to see some concrete situations or test cases.

This revision now requires changes to proceed.Oct 11 2017, 4:47 PM