Based on the discussion, it is preferable to hoist loads as early as possible after register allocation.
https://discourse.llvm.org/t/insn-schedule-is-it-reasonable-to-issue-the-load-instruction-preferentially/63674
Do you have any numbers on the benefit of that? I am not sure if using a target feature here is a good idea. If it is better to schedule aggressively for latency then this should probably be driven by the scheduling model.
hi @fhahn, I see benefits on the tsv110 target (Kunpeng 920), but I'm not sure whether this feature would have a negative performance impact on other architectures, so I added the target feature. Does that seem reasonable?
Sorry for the delay. My immediate thought was: why not just increase the latency on the load? But I guess this is different. Most CPUs can only issue a certain number of loads per cycle, so "load;load;load;load;load;add;add;add;add" will be worse than emitting "load;add;load;add;load;add;load;add". It will depend on the latencies of the loads, though, and on any dependencies; most of the time we encode an optimistic latency into the schedule latencies for loads.
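(To make "optimistic latency" concrete, here is a minimal TableGen sketch in the style of the AArch64 scheduling models; the def name and the 4-cycle figure are illustrative, assuming an L1 hit, and are not quoted from the tree:)

def A57Write_4cyc_1L : SchedWriteRes<[A57UnitL]> {
  let Latency = 4;  // optimistic L1-hit load-to-use latency; a miss
                    // is far longer, but encoding a pessimistic
                    // value would over-serialize the scheduling DAG
}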
Most CPUs nowadays will execute out of order around that anyway, just filling up internal queues. It depends on how much they fetch per cycle and what acts as the bottleneck. So I wanted to run some benchmarks to see if this modified anything, but they didn't show much change.
Umm. Is this for a downstream CPU, or is it using tsv110? Does the scheduling model have the correct information for the loads in question? I worry that some of the code we are adding looks like dead code that someone could rightly delete as unused. Should it be added to the tuning features for the tsv110 CPU?
Thank you for your suggestion; it is true that unconditionally issuing the load instructions preferentially may be too aggressive.
Can you give me some guidance on the code that encodes the optimistic latency into the schedule latencies for loads? I can debug it to see whether a new solution can be used for improvement.
The scheduling info will come from https://github.com/llvm/llvm-project/blob/439668871ac992159f00309d3bd837db287bdea6/llvm/lib/Target/AArch64/AArch64SchedTSV110.td#L501. It tends to assume an L1 cache latency, as setting it much higher tends to push the scheduler into making worse decisions.
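(As a hypothetical experiment along those lines — not the actual contents of that file, and the TSV110UnitLdSt resource name is written from memory — one could raise the assumed load-to-use latency so the scheduler hoists loads further from their users:)

def : WriteRes<WriteLD, [TSV110UnitLdSt]> {
  let Latency = 8;  // illustrative: pretend loads take 8 cycles
                    // instead of the optimistic L1-hit value
}

The trade-off is the one described above: a pessimistic latency pulls every load early, which can increase pressure elsewhere and push the scheduler into worse decisions overall.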
It may be better to be a little aggressive post-RA. I don't know the CPU very well.
I found that it is A57UnitL that is resource limited in the A57 backend.
So, after I temporarily adjusted the A57UnitL count to 2, the load instruction was issued first.
In fact, the current architecture is not VLIW, and only one instruction can be issued per cycle, so will A57UnitL really be a resource bottleneck in the pipeline?
- def A57UnitL : ProcResource<2>; // Type L micro-ops
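(For comparison, a sketch of how the unit count interacts with per-instruction micro-ops — names and numbers are illustrative, not quoted from the A57 model. A load-pair such as the LDPQi in the trace below consumes two L micro-ops, so with ProcResource<N> only N L micro-ops can start per cycle before A57UnitL becomes the critical resource:)

def A57Write_5cyc_2L : SchedWriteRes<[A57UnitL]> {
  let Latency     = 5;  // illustrative load-pair latency
  let NumMicroOps = 2;  // two L micro-ops per load-pair
}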
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P:
Queue TopQ.A: 0 1 5
  TopQ.A RemainingLatency 0 + 0c > CritPath 17
  Cand SU(0) ORDER
Pick Top TOP-PATH
Scheduling SU(0) renamable $q1, renamable $q2 = LDPQi renamable $x9, -1 :: (load (s128) from %ir.scevgep10, align 8), (load (s128) from %ir.lsr.iv79, align 8)
  Ready @0c
  A57UnitL +2x6u
  *** Critical resource A57UnitL: 2c
  TopQ.A BotLatency SU(0) 17c
  *** Max MOps 3 at cycle 0
Cycle: 1 TopQ.A
TopQ.A @1c
  Retired: 3
  Executed: 2c
  Critical: 2c, 2 A57UnitL
  ExpectedLatency: 0c
  - Resource limited.
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P:
Queue TopQ.A: 5 1 7
  TopQ.A RemainingLatency 0 + 1c > CritPath 17
  TopQ.A ResourceLimited: A57UnitL
  Cand SU(5) ORDER
Pick Top RES-REDUCE
Scheduling SU(5) renamable $x2 = SUBSXri renamable $x2, 4, 0, implicit-def $nzcv
  Ready @1c
  A57UnitI +1x3u
TopQ.A @1c
  Retired: 4
  Executed: 2c
  Critical: 2c, 2 A57UnitL
  ExpectedLatency: 0c
  - Resource limited.
** ScheduleDAGMI::schedule picking next node
Abandoning this, as I think improving the target scheduling model may be a better way, as the comments suggested.