This is an archive of the discontinued LLVM Phabricator instance.

Also, I'm slightly worried about this being dropped in just before the 14.x branch - would it be better to ensure we miss 14.x so this has time to cook in trunk for a while?

Harbormaster completed remote builds in B146444: Diff 404251. Jan 29 2022, 3:49 AM

In D118534#3281634, @RKSimon wrote:

Also, I'm slightly worried about this being dropped in just before the 14.x branch - would it be better to ensure we miss 14.x so this has time to cook in trunk for a while?

Sure, I can wait for 14.x freeze. Thanks!

llvm/lib/Target/X86/X86.td
1508	The GCC doc says `generic` will be updated with compiler iteration. I think it's reasonable since the common processors will be upgraded with time.

The default tuning came up in the context of bsf vs. tzcnt here:
D117912
So I wasn't expecting a bump in default tuning yet, but I think it's great.

But I agree that this change may cause a lot of surprises, so users need to be given notice (add a line to the clang release notes?) and wait for the 14.0 branch to be created.

Should we have a test file that intentionally shows codegen changes as this setting gets updated?

In D118534#3281771, @spatel wrote:

The default tuning came up in the context of bsf vs. tzcnt here:
D117912
So I wasn't expecting a bump in default tuning yet, but I think it's great.

But I agree that this change may cause a lot of surprises, so users need to be given notice (add a line to the clang release notes?) and wait for the 14.0 branch to be created.

Should we have a test file that intentionally shows codegen changes as this setting gets updated?

Please note that there is a *very* important difference between -march= and -mtune=:

-march= means: no, please, i really only intend to run this code on this specific CPU, so feel free to use all of the ISA available, and schedule for this CPU too.
while -mtune= means: i want to be able to run the code on earlier processors, but do schedule for this CPU though.

So this change really shouldn't affect available instruction sets.
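To make the distinction concrete, here is a minimal sketch (not part of the patch) that reports which ISA extensions the compiler was allowed to use. It relies on the `__BMI__` and `__AVX2__` macros that GCC and Clang predefine when those features are enabled; the function and enum names are made up for illustration.

```c
#include <stdio.h>

/* Hypothetical helper names, for illustration only. */
enum { ISA_BMI = 1, ISA_AVX2 = 2 };

/* Returns a bitmask of ISA extensions enabled at compile time.
   GCC and Clang predefine these macros only when the feature is
   enabled, e.g. via -march=haswell, -mbmi or -mavx2. */
int enabled_isa_mask(void) {
    int mask = 0;
#ifdef __BMI__
    mask |= ISA_BMI;
#endif
#ifdef __AVX2__
    mask |= ISA_AVX2;
#endif
    return mask;
}

/* Prints the result in readable form. */
void print_enabled_isa(void) {
    int mask = enabled_isa_mask();
    printf("BMI: %s, AVX2: %s\n",
           (mask & ISA_BMI) ? "yes" : "no",
           (mask & ISA_AVX2) ? "yes" : "no");
}
```

Built with `-mtune=haswell` alone, both macros should stay undefined on a baseline x86-64 target, since only scheduling and tuning change. Built with `-march=haswell`, both should be defined.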

In D118534#3281811, @lebedev.ri wrote:

In D118534#3281771, @spatel wrote:

The default tuning came up in the context of bsf vs. tzcnt here:
D117912
So I wasn't expecting a bump in default tuning yet, but I think it's great.

But I agree that this change may cause a lot of surprises, so users need to be given notice (add a line to the clang release notes?) and wait for the 14.0 branch to be created.

Should we have a test file that intentionally shows codegen changes as this setting gets updated?

Please note that there is a *very* important difference between -march= and -mtune=:

-march= means: no, please, i really only intend to run this code on this specific CPU, so feel free to use all of the ISA available, and schedule for this CPU too.

while -mtune= means: i want to be able to run the code on earlier processors, but do schedule for this CPU though.

So this change really shouldn't affect available instruction sets.

Ah, thanks for explaining. I confused the settings. So it's still a small bit of progress, just not as much as I thought.

craig.topper added inline comments. Jan 29 2022, 9:15 AM

llvm/lib/Target/X86/X86.td
1508	This turns on some fast shuffle flags that aren’t used by any AMD CPU except znver3. Do those flags make sense for generic?

MaskRay added a subscriber: MaskRay. Jan 29 2022, 10:45 AM

In D118534#3281820, @spatel wrote:

In D118534#3281811, @lebedev.ri wrote:

In D118534#3281771, @spatel wrote:

The default tuning came up in the context of bsf vs. tzcnt here:
D117912
So I wasn't expecting a bump in default tuning yet, but I think it's great.

I must have missed the comments there, but seems we made the consensus independently.

But I agree that this change may cause a lot of surprises, so users need to be given notice (add a line to the clang release notes?) and wait for the 14.0 branch to be created.

Should we have a test file that intentionally shows codegen changes as this setting gets updated?

Please note that there is a *very* important difference between -march= and -mtune=:

-march= means: no, please, i really only intend to run this code on this specific CPU, so feel free to use all of the ISA available, and schedule for this CPU too.

while -mtune= means: i want to be able to run the code on earlier processors, but do schedule for this CPU though.

So this change really shouldn't affect available instruction sets.

Yes. Thanks @lebedev.ri

Ah, thanks for explanation. I confused the settings. So it's still a small bit of progress, just not as much as I thought.

As I read the comments in D117912, the changes in default tuning should solve the problem. Do I miss something else?
I saw some attempts on using x86-64-v2, e.g., RHEL 9. But I think it's still aggressive to use it as default target in compiler, not to mention x86-64-v3.

pengfei added inline comments. Jan 29 2022, 5:29 PM

llvm/lib/Target/X86/X86.td
1508	I saw someone checked on znver1 and concluded that the haswell tuning works reasonably well for both cores and zens. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81616 I have no idea how to check it with LLVM, so I'll remove the shuffle flags to be conservative.

Remove shuffle flags.

pengfei retitled this revision from [X86] Update generic process model to x86-64-v3 to match with GCC to [X86] Introduce more common morden turnings into `generic`. Jan 29 2022, 5:54 PM

Harbormaster completed remote builds in B146492: Diff 404317. Jan 29 2022, 7:22 PM

lebedev.ri retitled this revision from [X86] Introduce more common morden turnings into `generic` to [X86] Introduce more common modern turnings into `generic`. Jan 29 2022, 11:18 PM

In D118534#3282197, @pengfei wrote:

As I read the comments in D117912, the changes in default tuning should solve the problem. Do I miss something else?
I saw some attempts on using x86-64-v2, e.g., RHEL 9. But I think it's still aggressive to use it as default target in compiler, not to mention x86-64-v3.

I might have misunderstood the requirements, but I thought we won't get the ideal performance on the benchmark unless we can generate tzcnt without explicitly specifying the default CPU arch.

In D118534#3282807, @spatel wrote:

In D118534#3282197, @pengfei wrote:

As I read the comments in D117912, the changes in default tuning should solve the problem. Do I miss something else?
I saw some attempts on using x86-64-v2, e.g., RHEL 9. But I think it's still aggressive to use it as default target in compiler, not to mention x86-64-v3.

I might have misunderstood the requirements, but I thought we won't get the ideal performance on the benchmark unless we can generate tzcnt without explicitly specifying the default CPU arch.

That would require bumping what ISA's are implied by generic, yes.

RKSimon retitled this revision from [X86] Introduce more common modern turnings into `generic` to [X86] Introduce more common modern tunings into `generic`. Jan 30 2022, 7:20 AM

In D118534#3282808, @lebedev.ri wrote:

In D118534#3282807, @spatel wrote:

In D118534#3282197, @pengfei wrote:

As I read the comments in D117912, the changes in default tuning should solve the problem. Do I miss something else?
I saw some attempts on using x86-64-v2, e.g., RHEL 9. But I think it's still aggressive to use it as default target in compiler, not to mention x86-64-v3.

I might have misunderstood the requirements, but I thought we won't get the ideal performance on the benchmark unless we can generate tzcnt without explicitly specifying the default CPU arch.

That would require bumping what ISA's are implied by generic, yes.

I see. Yes, we can do nothing with it currently.
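As a rough illustration of why tuning alone cannot produce tzcnt, consider the small function below. It is a sketch using the GCC/Clang `__builtin_ctz` builtin; the function name and the zero-input convention are assumptions made for the example, not anything taken from D117912.

```c
#include <limits.h>

/* Counts trailing zero bits of x. __builtin_ctz is undefined for 0,
   so that case is handled explicitly and returns the bit width. */
unsigned count_trailing_zeros(unsigned x) {
    if (x == 0)
        return (unsigned)(sizeof x * CHAR_BIT);
    return (unsigned)__builtin_ctz(x);
}
```

Comparing the assembly from `clang -O2 -S` with and without `-mbmi` (or a `-march=` that implies BMI) is one way to check which instruction gets selected. The tzcnt form is only available to the compiler once BMI is part of the enabled ISA, which is the point made above about `generic`.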

RKSimon added inline comments. Jan 31 2022, 2:16 AM

llvm/lib/Target/X86/X86.td
1518	We now diverge from the x86-64 tuning flags - maybe make it clear in the comments for the generic model?
llvm/test/CodeGen/X86/rdtsc-upgrade.ll
3	do we even need to specify -mcpu here?
llvm/test/CodeGen/X86/rdtsc.ll
3	do we really need -mcpu?
llvm/test/CodeGen/X86/twoaddr-lea.ll
2	not sure about -mcpu here - we do have some variation in use of lea depending on tuning - maybe we need more -mcpu coverage?
llvm/test/tools/llvm-mca/X86/cv_fpo_directive_no_segfault.s
2	@andreadb what test coverage do we need here?

andreadb added inline comments. Jan 31 2022, 3:53 AM

llvm/lib/Target/X86/X86.td
1220	In my experience, SHLD is rarely fast on AMD processors.
1222–1223	These two tuning flags are very Intel specific. I am not convinced that these should be added for "generic".
llvm/test/tools/llvm-mca/X86/cv_fpo_directive_no_segfault.s
2	I believe the test was just checking that llvm-mca didn't crash when parsing unknown asm directives. So the output here is not really important. It is fine to change the mcpu to something else other than generic.

Remove more Intel specific tunings.

pengfei marked 2 inline comments as done. Jan 31 2022, 5:49 AM

pengfei added inline comments. Jan 31 2022, 6:01 AM

llvm/test/CodeGen/X86/twoaddr-lea.ll
2	Here the generated code is affected by tuning `TuningSlowIncDec`. I think we should prefer using `-mcpu=x86-64` rather than `generic` in lit tests since we might update the tunings of `generic` in future. We have more than 100 RUN lines with it, so we may need a separate patch to replace them. But the diff here is OK to reflect the change in tunings.

Harbormaster completed remote builds in B146611: Diff 404495. Jan 31 2022, 7:29 AM

Ping

spatel added inline comments. Feb 3 2022, 11:32 AM

llvm/lib/Target/X86/X86.td
1219	Fast scalar SQRT controls whether we produce a single-precision sqrtss instruction or a reciprocal estimate sequence of about 9 instructions (when allowed with fast-math). Based on Agner's timing docs and the flag description, this should be set for Zen1 (sqrtss has latency 9-10), but it's not as obviously good for Zen2/3 because those have sqrtss latency of 14. The flag is set for Intel CPUs since SandyBridge, so that's sqrtss latency between 10-14. I think this is ok to set, but if the assumption is that we're tuning for any mainstream CPU of the last N years, then shouldn't we add this flag to the later AMD models too for less surprising output? There's a possible side benefit that we will produce more accurate results too.
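For readers unfamiliar with the alternative to a direct sqrtss, the following is a software sketch of the "estimate plus refinement" shape. It is not the sequence LLVM emits: the bit-level starting guess merely stands in for a hardware estimate instruction such as rsqrtss, the function names are invented for the example, and inputs are assumed to be positive, finite, normal floats (or zero).

```c
#include <stdint.h>
#include <string.h>

/* Crude software stand-in for a hardware reciprocal-sqrt estimate.
   Uses the well-known bit-level approximation; assumption: x is a
   positive, finite, normal float. */
static float rsqrt_estimate(float x) {
    uint32_t bits;
    float y;
    memcpy(&bits, &x, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* One Newton-Raphson step for y ~ 1/sqrt(x): y' = y * (1.5 - 0.5*x*y*y). */
static float refine_rsqrt(float x, float y) {
    return y * (1.5f - 0.5f * x * y * y);
}

/* sqrt(x) computed as x * (1/sqrt(x)) after two refinement steps. */
float sqrt_via_estimate(float x) {
    float y;
    if (x == 0.0f)
        return 0.0f; /* avoid multiplying 0 by a meaningless estimate */
    y = rsqrt_estimate(x);
    y = refine_rsqrt(x, y);
    y = refine_rsqrt(x, y);
    return x * y;
}
```

The result is only approximate, which is the accuracy side benefit mentioned above: with the fast-sqrt tuning set, the compiler can emit a single sqrtss instead of trading precision for a multi-instruction sequence like this.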

Add TuningFastScalarFSQRT to Zen1/2/3.

pengfei added inline comments. Feb 3 2022, 7:47 PM

llvm/lib/Target/X86/X86.td
1219	uops.info shows all Zen1/2/3 have the same max latency 14.

Harbormaster completed remote builds in B147553: Diff 405859. Feb 3 2022, 8:32 PM

Please split Znver changes into a separate review.
At least for znver3, i'm not really confident that fsqrt is fast,
https://www.agner.org/optimize/instruction_tables.pdf says ~25cy,
while NR takes ~19cy: https://godbolt.org/z/rK9ra4hse

Revert the change of TuningFastScalarFSQRT for Zen.

In D118534#3295994, @lebedev.ri wrote:

Please split Znver changes into a separate review.
At least for znver3, i'm not really confident that fsqrt is fast,
https://www.agner.org/optimize/instruction_tables.pdf says ~25cy,
while NR takes ~19cy: https://godbolt.org/z/rK9ra4hse

Although Agner's table says it's 8~21 and 22 for znver1 and znver2 respectively, the mca shows they are worse than znver3. Is it a bug in schedule model? I'd like to leave Znver tuning as is given I'm not familiar with them.

In D118534#3296058, @pengfei wrote:

In D118534#3295994, @lebedev.ri wrote:

Please split Znver changes into a separate review.
At least for znver3, i'm not really confident that fsqrt is fast,
https://www.agner.org/optimize/instruction_tables.pdf says ~25cy,
while NR takes ~19cy: https://godbolt.org/z/rK9ra4hse

Although Agner's table says it's 8~21 and 22 for znver1 and znver2 respectively, the mca shows they are worse than znver3. Is it a bug in schedule model? I'd like to leave Znver tuning as is given I'm not familiar with them.

The znver1/znver2 schedule models, well, leave a lot to be desired.

In D118534#3295994, @lebedev.ri wrote:

Please split Znver changes into a separate review.
At least for znver3, i'm not really confident that fsqrt is fast,
https://www.agner.org/optimize/instruction_tables.pdf says ~25cy,
while NR takes ~19cy: https://godbolt.org/z/rK9ra4hse

'fsqrt' is the x87 instruction, I think the tuning flag (despite its name which is IR based not x87 based) is concerned with the SSE instruction (v)sqrtss - https://godbolt.org/z/qTzesKWvj

But let's make any znver changes independent of this patch

I'm happy for TuningFastScalarFSQRT to be enabled by default - more recent CPUs are benefiting less and less from NR approximations (NR is usually still worth it for float4/float8 though).

Harbormaster completed remote builds in B147573: Diff 405884. Feb 4 2022, 2:34 AM

In D118534#3296065, @RKSimon wrote:

In D118534#3295994, @lebedev.ri wrote:

Please split Znver changes into a separate review.
At least for znver3, i'm not really confident that fsqrt is fast,
https://www.agner.org/optimize/instruction_tables.pdf says ~25cy,
while NR takes ~19cy: https://godbolt.org/z/rK9ra4hse

'fsqrt' is the x87 instruction, I think the tuning flag (despite its name which is IR based not x87 based) is concerned with the SSE instruction (v)sqrtss - https://godbolt.org/z/qTzesKWvj

Correct. Also AFAIK, that tuning flag only comes into play when expanding a plain sqrt(X) operation, not a 1/sqrt(X) operation. But I agree that we can make that change independently for the zen models.
So this patch LGTM.

This revision is now accepted and ready to land. Feb 4 2022, 6:21 AM

spatel mentioned this in D119001: [x86] enable fast sqrtss tuning for AMD Zen cores. Feb 4 2022, 7:52 AM

In D118534#3296584, @spatel wrote:

Correct. Also AFAIK, that tuning flag only comes into play when expanding a plain sqrt(X) operation, not a 1/sqrt(X) operation. But I agree that we can make that change independently for the zen models.

Proposed change for Zen here: D119001

spatel mentioned this in rGfff3e1dbaa9e: [x86] enable fast sqrtss/sqrtps tuning for AMD Zen cores. Feb 4 2022, 10:59 AM

This revision was landed with ongoing or failed builds. Feb 4 2022, 6:32 PM

Closed by commit rG0b7669f33331: [X86] Introduce more common modern tunings into `generic` (authored by pengfei).

This revision was automatically updated to reflect the committed changes.

pengfei added a commit: rG0b7669f33331: [X86] Introduce more common modern tunings into `generic`.

Thank you all for the help!

I noticed that my builds still prefer "add" over "inc" despite this change. It turns out those builds pass -march=x86-64, which apparently affects the tuning; that was a surprise to me. I filed https://github.com/llvm/llvm-project/issues/54472 for this. Does that seem right to the experts on this change?
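A small reproducer along these lines can help compare the two configurations. This is only a sketch: the function name is made up, and the exact instruction chosen depends on the compiler version and flags.

```c
/* In-place increment of a counter in memory. Depending on whether the
   selected tuning marks inc/dec as slow (TuningSlowIncDec), the backend
   may pick either an `inc` or an `add $1` form for this store. */
void bump(int *counter) {
    *counter += 1;
}
```

Comparing the output of `clang -O2 -S` with and without `-march=x86-64` should show whether the CPU named by `-march=` also brings its own tuning flags along, which is the behavior described in the issue.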

Herald added a project: Restricted Project. Mar 21 2022, 4:06 AM

Revision Contents

Path                                                          Size
llvm/lib/Target/X86/X86.td                                    6 lines
llvm/test/CodeGen/X86/rdtsc-upgrade.ll                        4 lines
llvm/test/CodeGen/X86/rdtsc.ll                                4 lines
llvm/test/CodeGen/X86/segmented-stacks-dynamic.ll             12 lines
llvm/test/CodeGen/X86/twoaddr-lea.ll                          2 lines
llvm/test/MC/X86/x86-directive-nops-errors.s                  2 lines
llvm/test/tools/llvm-mca/X86/cv_fpo_directive_no_segfault.s   2 lines

Diff 404317

llvm/lib/Target/X86/X86.td

[... 1,209 lines not shown ...]
 // if i386/i486 is specifically requested.
 // NOTE: 64Bit is here as "generic" is the default llc CPU. The X86Subtarget
 // constructor checks that any CPU used in 64-bit mode has Feature64Bit enabled.
 // It has no effect on code generation.
 def : ProcModel<"generic", SandyBridgeModel,
                 [FeatureX87, FeatureCMPXCHG8B, Feature64Bit],
                 [TuningSlow3OpsLEA,
                  TuningSlowDivide64,
-                 TuningSlowIncDec,
                  TuningMacroFusion,
+                 TuningFastScalarFSQRT,
    spatel (Not Done): Fast scalar SQRT controls whether we produce a single-precision sqrtss instruction or a reciprocal estimate sequence of about 9 instructions (when allowed with fast-math). Based on Agner's timing docs and the flag description, this should be set for Zen1 (sqrtss has latency 9-10), but it's not as obviously good for Zen2/3 because those have sqrtss latency of 14. The flag is set for Intel CPUs since SandyBridge, so that's sqrtss latency between 10-14. I think this is ok to set, but if the assumption is that we're tuning for any mainstream CPU of the last N years, then shouldn't we add this flag to the later AMD models too for less surprising output? There's a possible side benefit that we will produce more accurate results too.
    pengfei (Author, Done): uops shows all Zen1/2/3 have the same max latency 14. [https://uops.info/table.html?search=sqrtss%20&cb_lat=on&cb_tp=on&cb_NHM=on&cb_SNB=on&cb_BNL=…]
+                 TuningFastSHLDRotate,
    andreadb (Done): In my experience, SHLD is rarely fast on AMD processors.
+                 TuningFast15ByteNOP,
+                 TuningPOPCNTFalseDeps,
+                 TuningLZCNTFalseDeps,
    andreadb (Done): These two tuning flags are very Intel specific. I am not convinced that these should be added for "generic".
                  TuningInsertVZEROUPPER]>;

 def : Proc<"i386", [FeatureX87],
            [TuningSlowUAMem16, TuningInsertVZEROUPPER]>;
 def : Proc<"i486", [FeatureX87],
            [TuningSlowUAMem16, TuningInsertVZEROUPPER]>;
 def : Proc<"i586", [FeatureX87, FeatureCMPXCHG8B],
            [TuningSlowUAMem16, TuningInsertVZEROUPPER]>;
[... 268 lines not shown; context: def : Proc<"c3-2", [FeatureX87, FeatureCMPXCHG8B, FeatureMMX, ...]
            [TuningSlowUAMem16, TuningInsertVZEROUPPER]>;

 // We also provide a generic 64-bit specific x86 processor model which tries to
 // be good for modern chips without enabling instruction set encodings past the
 // basic SSE2 and 64-bit ones. It disables slow things from any mainstream and
 // modern 64-bit x86 chip, and enables features that are generally beneficial.
 //
 // We currently use the Sandy Bridge model as the default scheduling model as
 // we use it across Nehalem, Westmere, Sandy Bridge, and Ivy Bridge which
    RKSimon (Not Done): The old tuning flags didn't match all of SandyBridge's - do we really want to match all of Haswell's?
    pengfei (Author, Done): The GCC doc (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html) says `generic` will be updated with compiler iteration. I think it's reasonable since the common processors will be upgraded with time.
    craig.topper (Not Done): This turns on some fast shuffle flags that aren’t used by any AMD CPU except znver3. Do those flags make sense for generic?
    pengfei (Author, Done): I saw someone checked on znver1 and concluded that the haswell tuning works reasonably well for both cores and zens. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81616 I have no idea how to check it with LLVM, so I'll remove the shuffle flags to be conservative.
 // covers a huge swath of x86 processors. If there are specific scheduling
 // knobs which need to be tuned differently for AMD chips, we might consider
 // forming a common base for them.
 def : ProcModel<"x86-64", SandyBridgeModel, ProcessorFeatures.X86_64V1Features,
 [
   TuningSlow3OpsLEA,
   TuningSlowDivide64,
   TuningSlowIncDec,
   TuningMacroFusion,
   TuningInsertVZEROUPPER
    RKSimon (Done): We now diverge from the x86-64 tuning flags - maybe make it clear in the comments for the generic model?
 ]>;

 // x86-64 micro-architecture levels.
 def : ProcModel<"x86-64-v2", SandyBridgeModel, ProcessorFeatures.X86_64V2Features,
                 ProcessorFeatures.SNBTuning>;
 // Close to Haswell.
 def : ProcModel<"x86-64-v3", HaswellModel, ProcessorFeatures.X86_64V3Features,
                 ProcessorFeatures.HSWTuning>;
[... 69 lines not shown ...]

llvm/test/CodeGen/X86/rdtsc-upgrade.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc < %s -mtriple=i686-unknown-unknown -mcpu=generic | FileCheck %s --check-prefix=X86
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=generic | FileCheck %s --check-prefix=X64
+; RUN: llc < %s -mtriple=i686-unknown-unknown -mcpu=x86-64 | FileCheck %s --check-prefix=X86
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 | FileCheck %s --check-prefix=X64
    RKSimon (Not Done): do we even need to specify -mcpu here?

 ; Verify upgrading of the old form of the rdtscp intrinsic.

 define i64 @test_builtin_rdtscp(i8* %A) {
 ; X86-LABEL: test_builtin_rdtscp:
 ; X86: # %bb.0:
 ; X86-NEXT: pushl %esi
 ; X86-NEXT: .cfi_def_cfa_offset 8
[... 20 lines not shown ...]

llvm/test/CodeGen/X86/rdtsc.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc < %s -mtriple=i686-unknown-unknown -mcpu=generic | FileCheck %s --check-prefix=X86
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=generic | FileCheck %s --check-prefix=X64
+; RUN: llc < %s -mtriple=i686-unknown-unknown -mcpu=x86-64 | FileCheck %s --check-prefix=X86
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 | FileCheck %s --check-prefix=X64
    RKSimon (Not Done): do we really need -mcpu?

 ; Verify that we correctly lower ISD::READCYCLECOUNTER.


 define i64 @test_builtin_readcyclecounter() {
 ; X86-LABEL: test_builtin_readcyclecounter:
 ; X86: # %bb.0:
 ; X86-NEXT: rdtsc
[... 63 lines not shown ...]

llvm/test/CodeGen/X86/segmented-stacks-dynamic.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc < %s -mcpu=generic -mtriple=i686-linux -verify-machineinstrs | FileCheck %s -check-prefix=X86
-; RUN: llc < %s -mcpu=generic -mtriple=x86_64-linux -verify-machineinstrs | FileCheck %s -check-prefix=X64
-; RUN: llc < %s -mcpu=generic -mtriple=x86_64-linux-gnux32 -verify-machineinstrs | FileCheck %s -check-prefix=X32ABI
-; RUN: llc < %s -mcpu=generic -mtriple=i686-linux -filetype=obj
-; RUN: llc < %s -mcpu=generic -mtriple=x86_64-linux -filetype=obj
-; RUN: llc < %s -mcpu=generic -mtriple=x86_64-linux-gnux32 -filetype=obj
+; RUN: llc < %s -mcpu=x86-64 -mtriple=i686-linux -verify-machineinstrs | FileCheck %s -check-prefix=X86
+; RUN: llc < %s -mcpu=x86-64 -mtriple=x86_64-linux -verify-machineinstrs | FileCheck %s -check-prefix=X64
+; RUN: llc < %s -mcpu=x86-64 -mtriple=x86_64-linux-gnux32 -verify-machineinstrs | FileCheck %s -check-prefix=X32ABI
+; RUN: llc < %s -mcpu=x86-64 -mtriple=i686-linux -filetype=obj
+; RUN: llc < %s -mcpu=x86-64 -mtriple=x86_64-linux -filetype=obj
+; RUN: llc < %s -mcpu=x86-64 -mtriple=x86_64-linux-gnux32 -filetype=obj

 ; Just to prevent the alloca from being optimized away
 declare void @dummy_use(i32*, i32)

 define i32 @test_basic(i32 %l) #0 {
 ; X86-LABEL: test_basic:
 ; X86: # %bb.0:
 ; X86-NEXT: cmpl %gs:48, %esp
[... 199 lines not shown ...]

llvm/test/CodeGen/X86/twoaddr-lea.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc < %s -mcpu=generic -mtriple=x86_64-apple-darwin | FileCheck %s
+; RUN: llc < %s -mcpu=x86-64 -mtriple=x86_64-apple-darwin | FileCheck %s
    RKSimon (Not Done): not sure about -mcpu here - we do have some variation in use of lea depending on tuning - maybe we need more -mcpu coverage?
    pengfei (Author, Done): Here the generated code is affected by tuning `TuningSlowIncDec`. I think we should prefer using `-mcpu=x86-64` rather than `generic` in lit tests since we might update the tunings of `generic` in future. We have more than 100 RUN lines with it, so we may need a separate patch to replace them. But the diff here is OK to reflect the change in tunings.

 ;; X's live range extends beyond the shift, so the register allocator
 ;; cannot coalesce it with Y. Because of this, a copy needs to be
 ;; emitted before the shift to save the register value before it is
 ;; clobbered. However, this copy is not needed if the register
 ;; allocator turns the shift into an LEA. This also occurs for ADD.

 ; Check that the shift gets turned into an LEA.
[... 153 lines not shown ...]

llvm/test/MC/X86/x86-directive-nops-errors.s

 # RUN: not llvm-mc -triple i386 %s -filetype=obj -o /dev/null 2>&1 | FileCheck --check-prefix=X86 %s
-# RUN: not llvm-mc -triple=x86_64 %s -filetype=obj -o /dev/null 2>&1 | FileCheck --check-prefix=X64 %s
+# RUN: not llvm-mc -triple=x86_64 -mcpu=x86-64 %s -filetype=obj -o /dev/null 2>&1 | FileCheck --check-prefix=X64 %s

 .nops 4, 3
 # X86: :[[@LINE-1]]:1: error: illegal NOP size 3.
 .nops 4, 4
 # X86: :[[@LINE-1]]:1: error: illegal NOP size 4.
 .nops 4, 5
 # X86: :[[@LINE-1]]:1: error: illegal NOP size 5.
 .nops 16, 15
 # X86: :[[@LINE-1]]:1: error: illegal NOP size 15.
 # X64: :[[@LINE-2]]:1: error: illegal NOP size 15.

llvm/test/tools/llvm-mca/X86/cv_fpo_directive_no_segfault.s

 # NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py
-# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=generic -resource-pressure=false -instruction-info=false < %s | FileCheck %s
+# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -resource-pressure=false -instruction-info=false < %s | FileCheck %s
    RKSimon (Not Done): @andreadb what test coverage do we need here?
    andreadb (Not Done): I believe the test was just checking that llvm-mca didn't crash when parsing unknown asm directives. So the output here is not really important. It is fine to change the mcpu to something else other than generic.

 .cv_fpo_pushreg ebx
 add %eax, %eax
 add %ebx, %ebx
 add %ecx, %ecx
 add %edx, %edx

 # CHECK: Iterations: 100
 # CHECK-NEXT: Instructions: 400
 # CHECK-NEXT: Total Cycles: 137
 # CHECK-NEXT: Total uOps: 400

 # CHECK: Dispatch Width: 4
 # CHECK-NEXT: uOps Per Cycle: 2.92
 # CHECK-NEXT: IPC: 2.92
 # CHECK-NEXT: Block RThroughput: 1.3
