This is an archive of the discontinued LLVM Phabricator instance.

Differential D38279

[MachineScheduler] Enable latency heuristic based on scheduled lat.
Needs Review · Public

Authored by fhahn on Sep 26 2017, 8:14 AM.

Details

Reviewers

MatzeB

Summary

The motivation of this patch is to improve scheduling for the test case
test/CodeGen/AArch64/misched-sdiv.ll with the MachineScheduler. A
similar test is part of test/CodeGen/ARM.
I think ideally we would schedule SDIV as early as possible, as any instruction
scheduled before SDIV will increase the critical path. By how much
depends on the number of in-order pipeline stages.

The following happens during the test case. After scheduling the sub instruction,
both sdiv and add are added to the available bottom queue. When picking
the best candidate from the bottom queue, CurrZone.getCurrCycle()
returns 0, which plus the RemLatency is lower than the critical path,
so the latency heuristic is not used. I think
using the current cycle when scheduling top-down makes sense, as
that's the point where dispatching it later will impact the computed critical
path length. But when scheduling bottom-up, wouldn't it make sense to
use the latency already scheduled (at least when the candidate is on
the critical path), as this more accurately represents the cost of
scheduling the instruction?
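
To make the comparison concrete, here is a minimal sketch of the check before and after the patch. It is a toy model, not LLVM code: `ToyZone`, `reduceLatencyBefore` and `reduceLatencyAfter` are hypothetical names, and the numbers in the usage note are made up rather than taken from the test case.

```cpp
#include <algorithm>

// Hypothetical stand-in for the zone state used by the check.
// This is NOT LLVM's SchedBoundary; it only keeps the two counters
// that matter for the comparison discussed above.
struct ToyZone {
  bool IsTop;               // true when scheduling top-down
  unsigned CurrCycle;       // cycles issued so far in this zone
  unsigned ExpectedLatency; // longest latency already scheduled in this zone

  // Mirrors the behaviour described in the review: max of both counters.
  unsigned getScheduledLatency() const {
    return std::max(ExpectedLatency, CurrCycle);
  }
};

// Check as it was before the patch: always uses the current cycle.
bool reduceLatencyBefore(const ToyZone &Z, unsigned RemLatency,
                         unsigned CriticalPath, bool IsPostRA) {
  return IsPostRA || (RemLatency + Z.CurrCycle > CriticalPath);
}

// Check as proposed: bottom-up zones use the latency already scheduled.
bool reduceLatencyAfter(const ToyZone &Z, unsigned RemLatency,
                        unsigned CriticalPath, bool IsPostRA) {
  unsigned Issued = Z.IsTop ? Z.CurrCycle : Z.getScheduledLatency();
  return IsPostRA || (RemLatency + Issued > CriticalPath);
}
```

With an assumed bottom zone of `ToyZone{false, 0, 3}`, a remaining latency of 18 and a critical path of 20, the old check stays off (18 + 0 is not greater than 20) while the proposed check turns the latency heuristic on (18 + 3 is greater than 20). For a top zone both checks agree.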

There probably is a better way to handle this and I would appreciate
any input! PostRA scheduling does not take care of that case, as the
registers allocated prevent moving the SDIV instruction up, and PostRA
scheduling is also disabled on cores like Cortex-A72.

I did some initial benchmark runs on AArch64 with this patch:

AArch64 Cortex-A72 LLVM test-suite & spec2k: -0.22% on execution time
AArch64 Cortex-A57 SPEC2017: +0.74% on score

Event Timeline

fhahn created this revision. Sep 26 2017, 8:14 AM

Herald added subscribers: javed.absar, kristof.beyls, aemerson. Sep 26 2017, 8:14 AM

This makes sense to me as well, but I wonder why Issued gets a different value depending on top/bot ?

javed.absar added inline comments. Nov 28 2017, 2:22 AM

lib/CodeGen/MachineScheduler.cpp
Line 2445: Bit strange. getScheduledLatency returns the max of CurrCycle and ExpectedLatency, which I don't quite find set anywhere (other than 0).

In D38279#937309, @jonpa wrote:

This makes sense to me as well, but I wonder why Issued gets a different value depending on top/bot ?

My understanding was that the current cycle is quite accurate when going top-down, but not when going bottom-up. For example, after scheduling a node with latency 3 from the bottom, we would only bump CurrCycle to 1. But if this node is on the critical path, scheduling the predecessor in the next cycle increases the critical path by 2 cycles (if there are enough remaining instructions to be scheduled), because it was scheduled too early.

When going top-down, only after CurrCycle + RemLatency > CriticalPath scheduling a node later would increase the critical path.

Another way to think about it: When going top-down, scheduling a node on the critical path “too early” does not have any impact on that critical path (it might cause another path to become critical though). But scheduling a node on the critical path “too early” when going bottom-up is worse, as each instruction scheduled after the node increases the critical path (assuming we can issue 1 instruction per cycle).

lib/CodeGen/MachineScheduler.cpp
Line 2445: It's updated in `bumpNode`, through the reference created at line 2230 I think.
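
A toy model of the bottom-up case above may help. It is only a sketch under stated assumptions, not the real `bumpNode`: `ToyBottomZone` is a hypothetical type, it assumes an issue width of 1, and `NodeLatency` stands for the latency from the scheduled node to the end of the block.

```cpp
#include <algorithm>

// Hypothetical, heavily simplified bottom-up zone. The real scheduler
// tracks far more state; this keeps only the two counters discussed here.
struct ToyBottomZone {
  unsigned CurrCycle = 0;
  unsigned ExpectedLatency = 0;

  // Assumption: one instruction issues per cycle. The latency counter is
  // updated through a reference, as described in the inline comment.
  void bumpNode(unsigned NodeLatency) {
    unsigned &BotLatency = ExpectedLatency;
    if (NodeLatency > BotLatency)
      BotLatency = NodeLatency; // remember the longest latency seen so far
    ++CurrCycle;                // only one issue cycle has passed
  }

  unsigned getScheduledLatency() const {
    return std::max(ExpectedLatency, CurrCycle);
  }
};
```

After a single `bumpNode(3)` on a fresh zone, `CurrCycle` is 1 while `getScheduledLatency()` is 3, which is the 2-cycle gap described in the comment above.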

In D38279#937959, @fhahn wrote:

In D38279#937309, @jonpa wrote:

This makes sense to me as well, but I wonder why Issued gets a different value depending on top/bot ?

Did my response make sense as well? It's probably not extremely well worded :/

In D38279#945029, @fhahn wrote:

Did my response make sense as well? It's probably not extremely well worded :/

Well.. :-) I must admit I am somewhat confused still... Is this difference the same regardless of in-order / out-of-order and issue width?

I tried this patch on SystemZ, and it seems it may bring just a little improvement if anything (SystemZ is doing bottom-up-only pre-ra).

fhahn mentioned this in D45229: [MI-sched] schedule following instruction latencies. Apr 4 2018, 11:52 AM

evandro added a subscriber: evandro. Apr 4 2018, 2:38 PM

Revision Contents

Path	Size
lib/CodeGen/MachineScheduler.cpp	4 lines
test/CodeGen/AArch64/machine-combiner.ll	4 lines
test/CodeGen/AArch64/misched-sdiv.ll	28 lines

Diff 116667

lib/CodeGen/MachineScheduler.cpp

 void GenericSchedulerBase::setPolicy(CandPolicy &Policy, bool IsPostRA,
 ...
   if (SchedModel->hasInstrSchedModel()) {
     unsigned LFactor = SchedModel->getLatencyFactor();
     OtherResLimited = (int)(OtherCount - (RemLatency * LFactor)) > (int)LFactor;
   }
   // Schedule aggressively for latency in PostRA mode. We don't check for
   // acyclic latency during PostRA, and highly out-of-order processors will
   // skip PostRA scheduling.
   if (!OtherResLimited) {
-    if (IsPostRA || (RemLatency + CurrZone.getCurrCycle() > Rem.CriticalPath)) {
+    unsigned Issued = CurrZone.isTop() ? CurrZone.getCurrCycle() :
+                                         CurrZone.getScheduledLatency();
+    if (IsPostRA || (RemLatency + Issued > Rem.CriticalPath)) {
       Policy.ReduceLatency |= true;
       DEBUG(dbgs() << " " << CurrZone.Available.getName()
                    << " RemainingLatency " << RemLatency << " + "
                    << CurrZone.getCurrCycle() << "c > CritPath "
                    << Rem.CriticalPath << "\n");
     }
   }
   // If the same resource is limiting inside and outside the zone, do nothing.
 ...

test/CodeGen/AArch64/machine-combiner.ll

 ...
 ; Verify that we reassociate some of these ops. The optimal balanced tree of adds is not
 ; produced because that would cost more compile time.

 define float @reassociate_adds5(float %x0, float %x1, float %x2, float %x3, float %x4, float %x5, float %x6, float %x7) {
 ; CHECK-LABEL: reassociate_adds5:
 ; CHECK: fadd s0, s0, s1
 ; CHECK-NEXT: fadd s1, s2, s3
+; CHECK-NEXT: fadd s2, s4, s5
 ; CHECK-NEXT: fadd s0, s0, s1
-; CHECK-NEXT: fadd s1, s4, s5
-; CHECK-NEXT: fadd s1, s1, s6
+; CHECK-NEXT: fadd s1, s2, s6
 ; CHECK-NEXT: fadd s0, s0, s1
 ; CHECK-NEXT: fadd s0, s0, s7
 ; CHECK-NEXT: ret
   %t0 = fadd float %x0, %x1
   %t1 = fadd float %t0, %x2
   %t2 = fadd float %t1, %x3
   %t3 = fadd float %t2, %x4
   %t4 = fadd float %t3, %x5
 ...

test/CodeGen/AArch64/misched-sdiv.ll

This file was added.

; RUN: llc < %s -mtriple=aarch64-unknown-linux -mcpu=cortex-a57 -verify-misched -debug-only=machine-scheduler -o - 2>&1 > /dev/null | FileCheck %s --check-prefix=CHECK --check-prefix=A57_SCHED
; RUN: llc < %s -mtriple=aarch64-unknown-linux -mcpu=generic -verify-misched -debug-only=machine-scheduler -o - 2>&1 > /dev/null | FileCheck %s --check-prefix=CHECK --check-prefix=GENERIC

; Check the latency for instructions for both generic and cortex-a57.
; SDIV should be scheduled at the block's begin (20 cyc of independent M unit).
;
; CHECK: ******** MI Scheduling ********
; CHECK: foo:BB#0 entry

; CHECK: Final schedule for BB#0 *
; GENERIC: LDRWui
; GENERIC: SDIV
; A57_SCHED: SDIV
; A57_SCHED: LDRWui
; CHECK: ******** INTERVALS ********


; Function Attrs: norecurse nounwind readnone
define hidden i32 @foo(i32 %a, i32 %b, i32 %c, i32* %d) local_unnamed_addr #0 {
entry:
  %xor = xor i32 %c, %b
  %ld = load i32, i32* %d
  %add = add nsw i32 %xor, %ld
  %div = sdiv i32 %a, %b
  ;%div1 = sdiv i32 %add, %b
  %sub = sub i32 %div, %add
  ret i32 %sub
}