GCC has updated its generic -mtune to haswell. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81616
Update it to match with GCC.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
llvm/lib/Target/X86/X86.td | ||
---|---|---|
1491 | The old tuning flags didn't match all of SandyBridge's - do we really want to match all of Haswell's? |
Also, I'm slightly worried about this being dropped in just before the 14.x branch - would it be better to ensure we miss 14.x so this has time to cook in trunk for a while?
The default tuning came up in the context of bsf vs. tzcnt here:
D117912
So I wasn't expecting a bump in default tuning yet, but I think it's great.
But I agree that this change may cause a lot of surprises, so users need to be given notice (add a line to the clang release notes?) and wait for the 14.0 branch to be created.
Should we have a test file that intentionally shows codegen changes as this setting gets updated?
Please note that there is a *very* important difference between -march= and -mtune=:
- -march= means: no, please, i really only intend to run this code on this specific CPU, so feel free to use all of the ISA available, and schedule for this cCPU too.
- while -mtune= means: i want to be able to run the code on earlier processors, but do schedule for this CPU though.
So this change really shouldn't affect available instruction sets.
Ah, thanks for explaining. I confused the settings. So it's still a small bit of progress, just not as much as I thought.
llvm/lib/Target/X86/X86.td | ||
---|---|---|
1491 | This turns in some fast shuffle flags that aren’t used by any AMD CPU except znver3. Do those flags make sense for generic? |
I must have missed the comments there, but seems we made the consensus independently.
But I agree that this change may cause a lot of surprises, so users need to be given notice (add a line to the clang release notes?) and wait for the 14.0 branch to be created.
Should we have a test file that intentionally shows codegen changes as this setting gets updated?
Please note that there is a *very* important difference between -march= and -mtune=:
- -march= means: no, please, i really only intend to run this code on this specific CPU, so feel free to use all of the ISA available, and schedule for this cCPU too.
- while -mtune= means: i want to be able to run the code on earlier processors, but do schedule for this CPU though.
So this change really shouldn't affect available instruction sets.
Yes. Thanks @lebedev.ri
Ah, thanks for explanation. I confused the settings. So it's still a small bit of progress, just not as much as I thought.
As I read the comments in D117912, the changes in default tuning should solve the problem. Do I miss something else?
I saw some attempts on using x86-64-v2, e.g., RHEL 9. But I think it's still aggressive to use it as default target in compiler, not to mention x86-64-v3.
llvm/lib/Target/X86/X86.td | ||
---|---|---|
1491 | I saw someone checked on znver1 and gave a conclusion the haswell tuning works reasonably well for both cores and zens. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81616 |
I might have misunderstood the requirements, but I thought we won't get the ideal performance on the benchmark unless we can generate tzcnt without explicitly specifying the default CPU arch.
llvm/lib/Target/X86/X86.td | ||
---|---|---|
1509 | We now diverge from the x86-64 tuning flags - maybe make it clear in the comments for the generic model? | |
llvm/test/CodeGen/X86/rdtsc-upgrade.ll | ||
3 | do we even need to specificy -mcpu here? | |
llvm/test/CodeGen/X86/rdtsc.ll | ||
3 | do we really need -mcpu? | |
llvm/test/CodeGen/X86/twoaddr-lea.ll | ||
2 | not sure about -mcpu here - we do have some variation in use of lea depending on tuning - maybe we need more -mcpu coverage? | |
llvm/test/tools/llvm-mca/X86/cv_fpo_directive_no_segfault.s | ||
2 | @andreadb what test coverage do we need here? |
llvm/lib/Target/X86/X86.td | ||
---|---|---|
1210 | In my experience, SHLD is rarely fast on AMD processors. | |
1212–1213 | These two tuning flags are very Intel specific. | |
llvm/test/tools/llvm-mca/X86/cv_fpo_directive_no_segfault.s | ||
2 | I believe the test was just checking that llvm-mca didn't crash when parsing unknown asm directives. So the output here is not really important. It is fine to change the mcpu to something else other than generic. |
llvm/test/CodeGen/X86/twoaddr-lea.ll | ||
---|---|---|
2 | Here the generated code is affected by tuning TuningSlowIncDec. |
llvm/lib/Target/X86/X86.td | ||
---|---|---|
1209 | Fast scalar SQRT controls whether we produce a single-precision sqrtss instruction or a reciprocal estimate sequence of about 9 instructions (when allowed with fast-math). Based on Agner's timing docs and the flag description, this should be set for Zen1 (sqrtss has latency 9-10), but it's not as obviously good for Zen2/3 because those have sqrtss latency of 14. The flag is set for Intel CPUs since SandyBridge, so that's sqrtss latency between 10-14. I think this is ok to set, but if the assumption is that we're tuning for any mainstream CPU of the last N years, then shouldn't we add this flag to the later AMD models too for less surprising output? There's a possible side benefit that we will produce more accurate results too. |
Please split Znver changes into a separate review.
At least for znver3, i'm not really confident that fsqrt is fast,
https://www.agner.org/optimize/instruction_tables.pdf says ~25cy,
while NR takes ~19cy: https://godbolt.org/z/rK9ra4hse
Although Agner's table says it's 8~21 and 22 for znver1 and znver2 respectively, the mca shows they are worse than znver3. Is it a bug in schedule model? I'd like to leave Znver tuning as is given I'm not familiar with them.
'fsqrt' is the x87 instruction, I think the tuning flag (despite its name which is IR based not x87 based) is concerned with the SSE instruction (v)sqrtss - https://godbolt.org/z/qTzesKWvj
But lets make any znver changes independent of this patch
I'm happy for TuningFastScalarFSQRT to be enabled by default - more recent CPUs are benefiting less and less from NR approximations (NR is usually still worth it for float4/float8 though).
Correct. Also AFAIK, that tuning flag only comes into play when expanding a plain sqrt(X) operation, not a 1/sqrt(X) operation. But I agree that we can make that change independently for the zen models.
So this patch LGTM.
I noticed that my builds still prefer "add" over "inc" despite this change. Turns out those builds pass -march=x86-64, which apparently affects the tuning which was a surprise to me. I filed https://github.com/llvm/llvm-project/issues/54472 Does that seem right to the experts on this change?
Fast scalar SQRT controls whether we produce a single-precision sqrtss instruction or a reciprocal estimate sequence of about 9 instructions (when allowed with fast-math).
Based on Agner's timing docs and the flag description, this should be set for Zen1 (sqrtss has latency 9-10), but it's not as obviously good for Zen2/3 because those have sqrtss latency of 14. The flag is set for Intel CPUs since SandyBridge, so that's sqrtss latency between 10-14.
I think this is ok to set, but if the assumption is that we're tuning for any mainstream CPU of the last N years, then shouldn't we add this flag to the later AMD models too for less surprising output? There's a possible side benefit that we will produce more accurate results too.