This is an archive of the discontinued LLVM Phabricator instance.

Differential D19138

[X86] Enable the post-RA-scheduler for 32-bit cpus
ClosedPublic

Authored by mbodart on Apr 14 2016, 4:11 PM.


Details

Reviewers

spatel
nadav
aaboud

Commits

rGe60465ddf7a5: [X86] Enable the post-RA-scheduler for clang's default 32-bit cpu.
rL267809: [X86] Enable the post-RA-scheduler for clang's default 32-bit cpu.

Summary

This change set enables the post RA scheduler for pentium4 and SSE3 class cpus.

The intent is that a vanilla "clang -m32 -O2" compilation, with no specific
-march setting, will get post scheduled. This has demonstrated significant
performance improvements when the code is run on Silvermont, with
essentially neutral performance on AVX2 systems.

Some Silvermont highlights include improvements of over 20%
in 456.hmmer and several EEMBC benchmarks, and of 4 to 15%
in over a dozen other industry benchmarks.
There were only a few drops, in the 2 to 4% range.

On an AVX2 system, performance was generally flat, with
a balance of smallish gains and losses (in the 2 to 4% range).

The key scheduling improvement is that loads get separated
from their users.
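
To illustrate why separating loads from their users helps, here is a toy model in Python. It is not LLVM's post-RA scheduler: it assumes a single-issue, in-order machine, takes the load latency of 4 from the generic model's LoadLatency, and assumes a latency of 1 for everything else. The instruction names are hypothetical and loosely follow the two dependent load chains in the new test.

```python
# Toy latency model: a sketch, not LLVM's implementation.
LATENCY = {"load": 4, "add": 1, "store": 1}  # non-load latencies are assumed

# (name, kind, operands) in source order: two chains of dependent loads.
PROGRAM = [
    ("a", "load", []),        # a  = load idxa
    ("pa", "load", ["a"]),    # pa = load ptrs[a]
    ("va", "load", ["pa"]),   # va = load *pa
    ("b", "load", []),        # b  = load idxb
    ("pb", "load", ["b"]),    # pb = load ptrs[b]
    ("vb", "load", ["pb"]),   # vb = load *pb
    ("sum", "add", ["va", "vb"]),
    ("st", "store", ["sum"]),
]
INFO = {name: (kind, deps) for name, kind, deps in PROGRAM}
INDEX = {name: i for i, (name, _, _) in enumerate(PROGRAM)}


def ready_time(name, issue):
    """Earliest cycle at which all operands of `name` are available."""
    deps = INFO[name][1]
    return max((issue[d] + LATENCY[INFO[d][0]] for d in deps), default=0)


def simulate(order):
    """Total cycles to issue `order` on a single-issue, in-order machine."""
    issue, cycle = {}, 0
    for name in order:
        issue[name] = max(cycle, ready_time(name, issue))
        cycle = issue[name] + 1
    return cycle


def list_schedule():
    """Greedy list scheduling: pick the ready instruction available soonest."""
    issue, order, cycle = {}, [], 0
    remaining = [name for name, _, _ in PROGRAM]
    while remaining:
        candidates = [n for n in remaining
                      if all(d in issue for d in INFO[n][1])]
        best = min(candidates, key=lambda n: (ready_time(n, issue), INDEX[n]))
        issue[best] = max(cycle, ready_time(best, issue))
        cycle = issue[best] + 1
        order.append(best)
        remaining.remove(best)
    return order


source_order = [name for name, _, _ in PROGRAM]
scheduled = list_schedule()
print(scheduled, simulate(source_order), simulate(scheduled))
```

In this model the greedy schedule interleaves the two load chains, so each load issues while the other chain's load is still in flight, and the stall cycles between a load and its user shrink.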

As coded, the change should not affect "clang -m64" compilations.

With -m32, clang currently defaults to -mcpu=pentium4.
So I changed the scheduling model for pentium4 from the
GenericModel, to a new GenericPostRAModel (same properties,
but also enables the PostRAScheduler).
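
As a rough illustration of "same properties, but also enables the PostRAScheduler", the relationship between the two models can be sketched in Python. The class and field names below are hypothetical stand-ins for the TableGen definitions; the values are the ones the generic model sets in X86Schedule.td.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class SchedModel:
    """Hypothetical stand-in for the generic X86 SchedMachineModel fields."""
    issue_width: int = 4
    micro_op_buffer_size: int = 32
    load_latency: int = 4
    high_latency: int = 10
    post_ra_scheduler: bool = False
    complete_model: bool = False


# GenericModel keeps the defaults; GenericPostRAModel overrides one field.
GENERIC_MODEL = SchedModel()
GENERIC_POST_RA_MODEL = replace(GENERIC_MODEL, post_ra_scheduler=True)


def differing_fields(a, b):
    """Names of the fields whose values differ between two models."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])


print(differing_fields(GENERIC_MODEL, GENERIC_POST_RA_MODEL))
```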

As clang's default CPU could theoretically change, I also
made the same change for "nearby" cpus pentium-m, pentium4m,
prescott and nocona. Arguably yonah should be in this set
as well, for consistency. But I'd like to get feedback on
whether this is an OK approach overall, as changing yonah will
affect many lit tests. I suspect most of these older cpus
are no longer used in practice, but don't really know.

Diff Detail

Repository: rL LLVM

Event Timeline

mbodart updated this revision to Diff 53798.Apr 14 2016, 4:11 PM

mbodart retitled this revision to [X86] Enable the post-RA-scheduler for 32-bit cpus.

mbodart updated this object.

mbodart added reviewers: nadav, aaboud.

mbodart added a subscriber: llvm-commits.

mbodart updated this object.Apr 20 2016, 8:38 AM

mbodart added a reviewer: spatel.

aaboud added inline comments.Apr 21 2016, 9:06 AM

lib/Target/X86/X86.td
290 ↗	(On Diff #53798)	Can you create a class for GenericPostRAModel, similar to class Proc, something like this: class ProcPostRA<string Name, list<SubtargetFeature> Features> : ProcessorModel<Name, GenericPostRAModel, Features>;
test/CodeGen/X86/pr16360.ll
2 ↗	(On Diff #53798)	Why didn't you simply add the "-post-RA-scheduler=false" flag, as in other tests?

mbodart added inline comments.Apr 21 2016, 12:12 PM

lib/Target/X86/X86.td
290 ↗	(On Diff #53798)	I think that might add more confusion than clarity. There are currently two base classes used to define the cpus, Proc and ProcessorModel. ProcessorModel is used in cases where we want to override the scheduling model. Post-RA scheduling is just one property of the scheduling model. In the cases where we do create a new class derived from ProcessorModel, it is done to encapsulate the processor "features" in one place. So my preference is to keep the current approach, though it's a minor enough detail that I can change it if others feel strongly.
test/CodeGen/X86/pr16360.ll
2 ↗	(On Diff #53798)	As a general rule I would think we want to avoid adding special options to tests, as that can sometimes mask issues. But there are some exceptions. I disabled the post-RA-scheduler in misched-ilp.ll because that test is looking for a specific behavior from an earlier scheduling phase. As for machine-cp.ll, I could go either way (updating the test or disabling the post scheduler). The test is built with the x86_64 triple and nocona cpu. If built by clang with an x86_64 triple, the post scheduler would not be enabled as the cpu would be set to "x86_64", not nocona. So I arbitrarily chose to match this behavior. But I could simply update the test as well. Is there any precedent or best known method (BKM) here?

spatel added inline comments.Apr 25 2016, 1:14 PM

test/CodeGen/X86/pr16360.ll
2 ↗	(On Diff #53798)	I don't think this is actually written anywhere as an official guideline, but yes, I think we should prefer to only use special options when they are supposed to affect the outcome of the test. That gives us better test coverage for the common case. There's a huge pile of x86 tests that excessively specify a CPU model when they really should specify an attribute (for example as seen in this patch, -mcpu=pentium4 rather than -mattr=sse2). We should try to fix those as we encounter them. So my vote would be to update all of the affected test files with the flags that they are really testing (these would be test-file-only commits ahead of this one). It will make this patch smaller and the intended effect of this patch will be made clear rather than confused by a bunch of unintentional changes because test files were poorly specified.

mbodart mentioned this in D19568: [X86] Replace -mcpu with -mattr in several tests.Apr 26 2016, 4:28 PM

mbodart added inline comments.Apr 26 2016, 4:36 PM

test/CodeGen/X86/pr16360.ll
2 ↗	(On Diff #53798)	Thanks for the suggestions Sanjay. I created D19568 for the test changes, and will remove them from this review. But note that the new test still needs use of -mcpu. The scheduler behavior is tied to the cpu, not -mattr, AFAICT. If you know of a better mechanism for verifying the post-scheduling behavior for a vanilla "clang -m32 -O2" compilation, I'd be happy to use it.

Rebased, and removed several lit tests from this change set as I updated them separately.

LGTM.

A couple of options to make sure the behavior is as expected:

1) To confirm the codegen difference with -post-RA-scheduler, you could add a RUN line (and corresponding CHECKs) that disables -post-RA-scheduler for one of the tested CPUs. Or add a RUN line for a newer CPU that does not have -post-RA-scheduler turned on by default.

2) To confirm that the default i386 CPU model has -post-RA-scheduler enabled, you could add a RUN line that doesn't explicitly specify the CPU. Then if someone comes along and changes the default i386 target to something not in the current list, it should cause this test to fail.

This revision is now accepted and ready to land.Apr 27 2016, 3:27 PM

Thanks for the testing suggestions Sanjay, but I think I'll stick with the current test for now.

Regarding 1), that would only test that some other scheduling paradigm did not separate the loads, which may or may not be a desired behavior, and may change from time to time. So I prefer to limit the test to getting the expected behavior in the known settings where it is desired.

As for 2), the default cpu varies by tool. With clang -m32 it is pentium4 and with clang -m64 it is x86_64,
but with llc it is generic. So the test would indeed fail if I added an llc RUN case with no explicit cpu.
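
The point about tool defaults can be sketched as a small lookup in Python. This only models membership in the list of CPUs this patch switches to GenericPostRAModel; the helper name is hypothetical, and the default CPUs are the ones stated in the comment above.

```python
# CPUs that this patch moves to GenericPostRAModel.
POST_RA_CPUS = {"pentium-m", "pentium4", "pentium4m", "prescott", "nocona"}

# Default CPU per tool, as stated in the review comment.
DEFAULT_CPU = {
    "clang -m32": "pentium4",
    "clang -m64": "x86_64",
    "llc": "generic",
}


def post_ra_enabled_by_patch(tool):
    """True if the tool's default CPU is one of the CPUs changed here."""
    return DEFAULT_CPU[tool] in POST_RA_CPUS


for tool, cpu in DEFAULT_CPU.items():
    print(f"{tool}: default cpu {cpu}, "
          f"post-RA via this patch: {post_ra_enabled_by_patch(tool)}")
```

Under these assumptions, an llc RUN line with no -mcpu would use a CPU outside the list, which is why such a RUN line would not see the post-scheduled output.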

Closed by commit rL267809: [X86] Enable the post-RA-scheduler for clang's default 32-bit cpu. (authored by mbodart).Apr 27 2016, 3:58 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                            Size
llvm/trunk/lib/Target/X86/X86.td                36 lines
llvm/trunk/lib/Target/X86/X86Schedule.td        12 lines
llvm/trunk/test/CodeGen/X86/post-ra-sched.ll    40 lines

Diff 55346

llvm/trunk/lib/Target/X86/X86.td

 def : Proc<"i686", [FeatureX87, FeatureSlowUAMem16]>;
 def : Proc<"pentiumpro", [FeatureX87, FeatureSlowUAMem16, FeatureCMOV]>;
 def : Proc<"pentium2", [FeatureX87, FeatureSlowUAMem16, FeatureMMX,
     FeatureCMOV, FeatureFXSR]>;
 def : Proc<"pentium3", [FeatureX87, FeatureSlowUAMem16, FeatureMMX,
     FeatureSSE1, FeatureFXSR]>;
 def : Proc<"pentium3m", [FeatureX87, FeatureSlowUAMem16, FeatureMMX,
     FeatureSSE1, FeatureFXSR, FeatureSlowBTMem]>;
-def : Proc<"pentium-m", [FeatureX87, FeatureSlowUAMem16, FeatureMMX,
+// Enable the PostRAScheduler for SSE2 and SSE3 class cpus.
+// The intent is to enable it for pentium4 which is the current default
+// processor in a vanilla 32-bit clang compilation when no specific
+// architecture is specified. This generally gives a nice performance
+// increase on silvermont, with largely neutral behavior on other
+// contemporary large core processors.
+// pentium-m, pentium4m, prescott and nocona are included as a preventative
+// measure to avoid performance surprises, in case clang's default cpu
+// changes slightly.
+
+def : ProcessorModel<"pentium-m", GenericPostRAModel,
+    [FeatureX87, FeatureSlowUAMem16, FeatureMMX,
     FeatureSSE2, FeatureFXSR, FeatureSlowBTMem]>;
-def : Proc<"pentium4", [FeatureX87, FeatureSlowUAMem16, FeatureMMX,
+def : ProcessorModel<"pentium4", GenericPostRAModel,
+    [FeatureX87, FeatureSlowUAMem16, FeatureMMX,
     FeatureSSE2, FeatureFXSR]>;
-def : Proc<"pentium4m", [FeatureX87, FeatureSlowUAMem16, FeatureMMX,
+def : ProcessorModel<"pentium4m", GenericPostRAModel,
+    [FeatureX87, FeatureSlowUAMem16, FeatureMMX,
     FeatureSSE2, FeatureFXSR, FeatureSlowBTMem]>;

 // Intel Quark.
 def : Proc<"lakemont", []>;

 // Intel Core Duo.
 def : ProcessorModel<"yonah", SandyBridgeModel,
     [FeatureX87, FeatureSlowUAMem16, FeatureMMX, FeatureSSE3,
     FeatureFXSR, FeatureSlowBTMem]>;

 // NetBurst.
-def : Proc<"prescott",
+def : ProcessorModel<"prescott", GenericPostRAModel,
     [FeatureX87, FeatureSlowUAMem16, FeatureMMX, FeatureSSE3,
     FeatureFXSR, FeatureSlowBTMem]>;
-def : Proc<"nocona", [
+def : ProcessorModel<"nocona", GenericPostRAModel, [
     FeatureX87,
     FeatureSlowUAMem16,
     FeatureMMX,
     FeatureSSE3,
     FeatureFXSR,
     FeatureCMPXCHG16B,
     FeatureSlowBTMem
 ]>;

llvm/trunk/lib/Target/X86/X86Schedule.td

 // number of in-flight instructions.
 //
 // HighLatency=10 is optimistic. X86InstrInfo::isHighLatencyDef
 // indicates high latency opcodes. Alternatively, InstrItinData
 // entries may be included here to define specific operand
 // latencies. Since these latencies are not used for pipeline hazards,
 // they do not need to be exact.
 //
-// The GenericModel contains no instruction itineraries.
-def GenericModel : SchedMachineModel {
+// The GenericX86Model contains no instruction itineraries
+// and disables PostRAScheduler.
+class GenericX86Model : SchedMachineModel {
   let IssueWidth = 4;
   let MicroOpBufferSize = 32;
   let LoadLatency = 4;
   let HighLatency = 10;
   let PostRAScheduler = 0;
   let CompleteModel = 0;
 }

+def GenericModel : GenericX86Model;
+
+// Define a model with the PostRAScheduler enabled.
+def GenericPostRAModel : GenericX86Model {
+  let PostRAScheduler = 1;
+}
+
 include "X86ScheduleAtom.td"
 include "X86SchedSandyBridge.td"
 include "X86SchedHaswell.td"
 include "X86ScheduleSLM.td"
 include "X86ScheduleBtVer2.td"

llvm/trunk/test/CodeGen/X86/post-ra-sched.ll

; RUN: llc < %s -mtriple=i386 -mcpu=pentium4 | FileCheck %s
; RUN: llc < %s -mtriple=i386 -mcpu=pentium4m | FileCheck %s
; RUN: llc < %s -mtriple=i386 -mcpu=pentium-m | FileCheck %s
; RUN: llc < %s -mtriple=i386 -mcpu=prescott | FileCheck %s
; RUN: llc < %s -mtriple=i386 -mcpu=nocona | FileCheck %s
;
; Verify that scheduling puts some distance between a load feeding into
; the address of another load, and that second load. This currently
; happens during the post-RA-scheduler, which should be enabled by
; default with the above specified cpus.

@ptrs = external global [0 x i32*], align 4
@idxa = common global i32 0, align 4
@idxb = common global i32 0, align 4
@res = common global i32 0, align 4

define void @addindirect() {
; CHECK-LABEL: addindirect:
; CHECK: # BB#0: # %entry
; CHECK-NEXT: movl idxb, %ecx
; CHECK-NEXT: movl idxa, %eax
; CHECK-NEXT: movl ptrs(,%ecx,4), %ecx
; CHECK-NEXT: movl ptrs(,%eax,4), %eax
; CHECK-NEXT: movl (%ecx), %ecx
; CHECK-NEXT: addl (%eax), %ecx
; CHECK-NEXT: movl %ecx, res
; CHECK-NEXT: retl
entry:
  %0 = load i32, i32* @idxa, align 4
  %arrayidx = getelementptr inbounds [0 x i32*], [0 x i32*]* @ptrs, i32 0, i32 %0
  %1 = load i32*, i32** %arrayidx, align 4
  %2 = load i32, i32* %1, align 4
  %3 = load i32, i32* @idxb, align 4
  %arrayidx1 = getelementptr inbounds [0 x i32*], [0 x i32*]* @ptrs, i32 0, i32 %3
  %4 = load i32*, i32** %arrayidx1, align 4
  %5 = load i32, i32* %4, align 4
  %add = add i32 %5, %2
  store i32 %add, i32* @res, align 4
  ret void
}