This is an archive of the discontinued LLVM Phabricator instance.

[X86] Introduce more common modern tunings into `generic`
ClosedPublic

Authored by pengfei on Jan 29 2022, 3:13 AM.

Diff Detail

Event Timeline

pengfei created this revision.Jan 29 2022, 3:13 AM
pengfei requested review of this revision.Jan 29 2022, 3:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 29 2022, 3:13 AM
RKSimon added inline comments.Jan 29 2022, 3:29 AM
llvm/lib/Target/X86/X86.td
1508

The old tuning flags didn't match all of SandyBridge's - do we really want to match all of Haswell's?

Also, I'm slightly worried about this being dropped in just before the 14.x branch - would it be better to ensure we miss 14.x so this has time to cook in trunk for a while?

Also, I'm slightly worried about this being dropped in just before the 14.x branch - would it be better to ensure we miss 14.x so this has time to cook in trunk for a while?

Sure, I can wait for 14.x freeze. Thanks!

llvm/lib/Target/X86/X86.td
1508

The GCC doc says generic will be updated with compiler iteration. I think it's reasonable since the common processors will be upgraded with time.

The default tuning came up in the context of bsf vs. tzcnt here:
D117912
So I wasn't expecting a bump in default tuning yet, but I think it's great.

But I agree that this change may cause a lot of surprises, so users need to be given notice (add a line to the clang release notes?) and wait for the 14.0 branch to be created.

Should we have a test file that intentionally shows codegen changes as this setting gets updated?

The default tuning came up in the context of bsf vs. tzcnt here:
D117912
So I wasn't expecting a bump in default tuning yet, but I think it's great.

But I agree that this change may cause a lot of surprises, so users need to be given notice (add a line to the clang release notes?) and wait for the 14.0 branch to be created.

Should we have a test file that intentionally shows codegen changes as this setting gets updated?

Please note that there is a *very* important difference between -march= and -mtune=:

  • -march= means: no, please, i really only intend to run this code on this specific CPU, so feel free to use all of the ISA available, and schedule for this cCPU too.
  • while -mtune= means: i want to be able to run the code on earlier processors, but do schedule for this CPU though.

So this change really shouldn't affect available instruction sets.

The default tuning came up in the context of bsf vs. tzcnt here:
D117912
So I wasn't expecting a bump in default tuning yet, but I think it's great.

But I agree that this change may cause a lot of surprises, so users need to be given notice (add a line to the clang release notes?) and wait for the 14.0 branch to be created.

Should we have a test file that intentionally shows codegen changes as this setting gets updated?

Please note that there is a *very* important difference between -march= and -mtune=:

  • -march= means: no, please, i really only intend to run this code on this specific CPU, so feel free to use all of the ISA available, and schedule for this cCPU too.
  • while -mtune= means: i want to be able to run the code on earlier processors, but do schedule for this CPU though.

So this change really shouldn't affect available instruction sets.

Ah, thanks for explaining. I confused the settings. So it's still a small bit of progress, just not as much as I thought.

craig.topper added inline comments.Jan 29 2022, 9:15 AM
llvm/lib/Target/X86/X86.td
1508

This turns in some fast shuffle flags that aren’t used by any AMD CPU except znver3. Do those flags make sense for generic?

The default tuning came up in the context of bsf vs. tzcnt here:
D117912
So I wasn't expecting a bump in default tuning yet, but I think it's great.

I must have missed the comments there, but seems we made the consensus independently.

But I agree that this change may cause a lot of surprises, so users need to be given notice (add a line to the clang release notes?) and wait for the 14.0 branch to be created.

Should we have a test file that intentionally shows codegen changes as this setting gets updated?

Please note that there is a *very* important difference between -march= and -mtune=:

  • -march= means: no, please, i really only intend to run this code on this specific CPU, so feel free to use all of the ISA available, and schedule for this cCPU too.
  • while -mtune= means: i want to be able to run the code on earlier processors, but do schedule for this CPU though.

So this change really shouldn't affect available instruction sets.

Yes. Thanks @lebedev.ri

Ah, thanks for explanation. I confused the settings. So it's still a small bit of progress, just not as much as I thought.

As I read the comments in D117912, the changes in default tuning should solve the problem. Do I miss something else?
I saw some attempts on using x86-64-v2, e.g., RHEL 9. But I think it's still aggressive to use it as default target in compiler, not to mention x86-64-v3.

pengfei added inline comments.Jan 29 2022, 5:29 PM
llvm/lib/Target/X86/X86.td
1508

I saw someone checked on znver1 and gave a conclusion the haswell tuning works reasonably well for both cores and zens. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81616
I have no idea how to check it with LLVM, so I'll remove the shuffle flags for conservative.

pengfei updated this revision to Diff 404317.Jan 29 2022, 5:52 PM

Remove shuffle flags.

pengfei retitled this revision from [X86] Update generic process model to x86-64-v3 to match with GCC to [X86] Introduce more common morden turnings into `generic`.Jan 29 2022, 5:54 PM
lebedev.ri retitled this revision from [X86] Introduce more common morden turnings into `generic` to [X86] Introduce more common modern turnings into `generic`.Jan 29 2022, 11:18 PM

As I read the comments in D117912, the changes in default tuning should solve the problem. Do I miss something else?
I saw some attempts on using x86-64-v2, e.g., RHEL 9. But I think it's still aggressive to use it as default target in compiler, not to mention x86-64-v3.

I might have misunderstood the requirements, but I thought we won't get the ideal performance on the benchmark unless we can generate tzcnt without explicitly specifying the default CPU arch.

As I read the comments in D117912, the changes in default tuning should solve the problem. Do I miss something else?
I saw some attempts on using x86-64-v2, e.g., RHEL 9. But I think it's still aggressive to use it as default target in compiler, not to mention x86-64-v3.

I might have misunderstood the requirements, but I thought we won't get the ideal performance on the benchmark unless we can generate tzcnt without explicitly specifying the default CPU arch.

That would require bumping what ISA's are implied by generic, yes.

RKSimon retitled this revision from [X86] Introduce more common modern turnings into `generic` to [X86] Introduce more common modern tunings into `generic`.Jan 30 2022, 7:20 AM

As I read the comments in D117912, the changes in default tuning should solve the problem. Do I miss something else?
I saw some attempts on using x86-64-v2, e.g., RHEL 9. But I think it's still aggressive to use it as default target in compiler, not to mention x86-64-v3.

I might have misunderstood the requirements, but I thought we won't get the ideal performance on the benchmark unless we can generate tzcnt without explicitly specifying the default CPU arch.

That would require bumping what ISA's are implied by generic, yes.

I see. Yes, we can do nothing with it currently.

RKSimon added inline comments.Jan 31 2022, 2:16 AM
llvm/lib/Target/X86/X86.td
1518

We now diverge from the x86-64 tuning flags - maybe make it clear in the comments for the generic model?

llvm/test/CodeGen/X86/rdtsc-upgrade.ll
3

do we even need to specificy -mcpu here?

llvm/test/CodeGen/X86/rdtsc.ll
3

do we really need -mcpu?

llvm/test/CodeGen/X86/twoaddr-lea.ll
2

not sure about -mcpu here - we do have some variation in use of lea depending on tuning - maybe we need more -mcpu coverage?

llvm/test/tools/llvm-mca/X86/cv_fpo_directive_no_segfault.s
2

@andreadb what test coverage do we need here?

andreadb added inline comments.Jan 31 2022, 3:53 AM
llvm/lib/Target/X86/X86.td
1220

In my experience, SHLD is rarely fast on AMD processors.

1222–1223

These two tuning flags are very Intel specific.
I am not convinced that these should be added for "generic".

llvm/test/tools/llvm-mca/X86/cv_fpo_directive_no_segfault.s
2

I believe the test was just checking that llvm-mca didn't crash when parsing unknown asm directives.

So the output here is not really important. It is fine to change the mcpu to something else other than generic.

pengfei updated this revision to Diff 404495.Jan 31 2022, 5:49 AM
pengfei marked an inline comment as done.

Remove more Intel specific tunings.

pengfei marked 2 inline comments as done.Jan 31 2022, 5:49 AM
pengfei added inline comments.Jan 31 2022, 6:01 AM
llvm/test/CodeGen/X86/twoaddr-lea.ll
2

Here the generated code is affected by tuning TuningSlowIncDec.
I think we should prefer to using -mcpu=x86-64 rather than generic in lit tests since we might update the tunings of generic in future.
We have more than 100 lines of RUNs with it, we may need a seperate patch to replace them. But the diff here is OK to reflect the change on tunings.

spatel added inline comments.Feb 3 2022, 11:32 AM
llvm/lib/Target/X86/X86.td
1219

Fast scalar SQRT controls whether we produce a single-precision sqrtss instruction or a reciprocal estimate sequence of about 9 instructions (when allowed with fast-math).

Based on Agner's timing docs and the flag description, this should be set for Zen1 (sqrtss has latency 9-10), but it's not as obviously good for Zen2/3 because those have sqrtss latency of 14. The flag is set for Intel CPUs since SandyBridge, so that's sqrtss latency between 10-14.

I think this is ok to set, but if the assumption is that we're tuning for any mainstream CPU of the last N years, then shouldn't we add this flag to the later AMD models too for less surprising output? There's a possible side benefit that we will produce more accurate results too.

pengfei updated this revision to Diff 405859.Feb 3 2022, 7:43 PM

Add TuningFastScalarFSQRT to Zen1/2/3.

pengfei added inline comments.Feb 3 2022, 7:47 PM
llvm/lib/Target/X86/X86.td
1219

uops shows all Zen1/2/3 have the same max latency 14.

lebedev.ri added a comment.EditedFeb 4 2022, 12:46 AM

Please split Znver changes into a separate review.
At least for znver3, i'm not really confident that fsqrt is fast,
https://www.agner.org/optimize/instruction_tables.pdf says ~25cy,
while NR takes ~19cy: https://godbolt.org/z/rK9ra4hse

pengfei updated this revision to Diff 405884.Feb 4 2022, 12:54 AM

Revert the change of TuningFastScalarFSQRT for Zen.

Please split Znver changes into a separate review.
At least for znver3, i'm not really confident that fsqrt is fast,
https://www.agner.org/optimize/instruction_tables.pdf says ~25cy,
while NR takes ~19cy: https://godbolt.org/z/rK9ra4hse

Although Agner's table says it's 8~21 and 22 for znver1 and znver2 respectively, the mca shows they are worse than znver3. Is it a bug in schedule model? I'd like to leave Znver tuning as is given I'm not familiar with them.

lebedev.ri added a comment.EditedFeb 4 2022, 1:24 AM

Please split Znver changes into a separate review.
At least for znver3, i'm not really confident that fsqrt is fast,
https://www.agner.org/optimize/instruction_tables.pdf says ~25cy,
while NR takes ~19cy: https://godbolt.org/z/rK9ra4hse

Although Agner's table says it's 8~21 and 22 for znver1 and znver2 respectively, the mca shows they are worse than znver3. Is it a bug in schedule model? I'd like to leave Znver tuning as is given I'm not familiar with them.

znver1/znver2 schedule models are, well, leave a lot to be desired.

Please split Znver changes into a separate review.
At least for znver3, i'm not really confident that fsqrt is fast,
https://www.agner.org/optimize/instruction_tables.pdf says ~25cy,
while NR takes ~19cy: https://godbolt.org/z/rK9ra4hse

'fsqrt' is the x87 instruction, I think the tuning flag (despite its name which is IR based not x87 based) is concerned with the SSE instruction (v)sqrtss - https://godbolt.org/z/qTzesKWvj

But lets make any znver changes independent of this patch

I'm happy for TuningFastScalarFSQRT to be enabled by default - more recent CPUs are benefiting less and less from NR approximations (NR is usually still worth it for float4/float8 though).

spatel accepted this revision.Feb 4 2022, 6:21 AM

Please split Znver changes into a separate review.
At least for znver3, i'm not really confident that fsqrt is fast,
https://www.agner.org/optimize/instruction_tables.pdf says ~25cy,
while NR takes ~19cy: https://godbolt.org/z/rK9ra4hse

'fsqrt' is the x87 instruction, I think the tuning flag (despite its name which is IR based not x87 based) is concerned with the SSE instruction (v)sqrtss - https://godbolt.org/z/qTzesKWvj

Correct. Also AFAIK, that tuning flag only comes into play when expanding a plain sqrt(X) operation, not a 1/sqrt(X) operation. But I agree that we can make that change independently for the zen models.
So this patch LGTM.

This revision is now accepted and ready to land.Feb 4 2022, 6:21 AM
spatel added a comment.Feb 4 2022, 7:55 AM

Correct. Also AFAIK, that tuning flag only comes into play when expanding a plain sqrt(X) operation, not a 1/sqrt(X) operation. But I agree that we can make that change independently for the zen models.

Proposed change for Zen here: D119001

This revision was landed with ongoing or failed builds.Feb 4 2022, 6:32 PM
This revision was automatically updated to reflect the committed changes.

Thank you all for the help!

hans added a subscriber: hans.Mar 21 2022, 4:06 AM

I noticed that my builds still prefer "add" over "inc" despite this change. Turns out those builds pass -march=x86-64, which apparently affects the tuning which was a surprise to me. I filed https://github.com/llvm/llvm-project/issues/54472 Does that seem right to the experts on this change?

Herald added a project: Restricted Project. · View Herald TranscriptMar 21 2022, 4:06 AM