This is an archive of the discontinued LLVM Phabricator instance.

Have cpu-specific variants set 'tune-cpu' as an optimization hint
ClosedPublic

Authored by erichkeane on Mar 10 2022, 1:49 PM.

Details

Summary

Due to various implementation constraints, despite the programmer
choosing a 'processor' cpu_dispatch/cpu_specific needs to use the
'feature' list of a processor to identify it. This results in the
identified processor in source-code not being propogated to the
optimizer, and thus, not able to be tuned for.

This patch changes to use the actual cpu as written for tune-cpu so that
opt can make decisions based on the cpu-as-spelled, which should better
match the behavior expected by the programmer.

Note that the 'valid' list of processors for x86 is in
llvm/include/llvm/Support/X86TargetParser.def. At the moment, this list
contains only Intel processors, but other vendors may wish to add their
own entries as 'alias'es (or with different feature lists!).

If this is not done, there is two potential performance issues with the
patch, but I believe them to be worth it in light of the improvements to
behavior and performance.

1- In the event that the user spelled "ProcessorB", but we only have the
features available to test for "ProcessorA" (where A is B minus features),
AND there is an optimization opportunity for "B" that negatively affects
"A", the optimizer will likely choose to do so.

2- In the event that the user spelled VendorI's processor, and the feature
list allows it to run on VendorA's processor of similar features, AND there
is an optimization opportunity for VendorIs that negatively affects "A"s,
the optimizer will likely choose to do so. This can be fixed by adding an
alias to X86TargetParser.def.

Diff Detail

Event Timeline

erichkeane created this revision.Mar 10 2022, 1:49 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 10 2022, 1:49 PM
Herald added a subscriber: pengfei. · View Herald Transcript
erichkeane requested review of this revision.Mar 10 2022, 1:49 PM

@aaron.ballman : if you can add other reviewers or subscribers (particularly those from "VendorA") it would be greatly appreciated!

This example illustrates the problem this patch intends to fix: https://godbolt.org/z/j445sxPMc

For Intel microarchitectures before Skylake, the LLVM cost model says that vector fsqrt is slow, so if fast-math is enabled, we'll use an approximation rather than the vsqrtps instruction when vectorizing a call to sqrtf(). If the code is compiled with -march=skylake or -mtune=skylake, we'll choose the vsqrtps instruction, but with any earlier base target, we'll choose the approximation even if there is a cpu_specific(skylake) implementation in the source code.

For example

__attribute__((cpu_specific(skylake))) void foo(void) {
  for (int i = 0; i < 8; ++i)
    x[i] = sqrtf(y[i]);
}

compiles to

foo.b:
        vmovaps ymm0, ymmword ptr [rip + y]
        vrsqrtps        ymm1, ymm0
        vmulps  ymm2, ymm0, ymm1
        vbroadcastss    ymm3, dword ptr [rip + .LCPI2_0] # ymm3 = [-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0]
        vfmadd231ps     ymm3, ymm2, ymm1        # ymm3 = (ymm2 * ymm1) + ymm3
        vbroadcastss    ymm1, dword ptr [rip + .LCPI2_1] # ymm1 = [-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1]
        vmulps  ymm1, ymm2, ymm1
        vmulps  ymm1, ymm1, ymm3
        vbroadcastss    ymm2, dword ptr [rip + .LCPI2_2] # ymm2 = [NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]
        vandps  ymm0, ymm0, ymm2
        vbroadcastss    ymm2, dword ptr [rip + .LCPI2_3] # ymm2 = [1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38]
        vcmpleps        ymm0, ymm2, ymm0
        vandps  ymm0, ymm0, ymm1
        vmovaps ymmword ptr [rip + x], ymm0
        vzeroupper
        ret

but it should compile to

foo.b:
        vsqrtps ymm0, ymmword ptr [rip + y]
        vmovaps ymmword ptr [rip + x], ymm0
        vzeroupper
        ret
clang/lib/CodeGen/CodeGenModule.cpp
2067

Unfortunately, I don't think it's this easy. The list of names used for cpu_specific doesn't come from the same place as the list of names used by "tune-cpu". For one thing, the cpu_specific names can't contain the '-' character, so we have names like "skylake_avx512" in cpu_specific that would need to be translated to "skylake-avx512" for "tune-cpu". I believe the list of valid names for "tune-cpu" comes from here: https://github.com/llvm/llvm-project/blob/26cd258420c774254cc48330b1f4d23d353baf05/llvm/lib/Support/X86TargetParser.cpp#L294

Also, some of the aliases supported by cpu_specific don't have any corresponding "tune-cpu" name. You happen to have picked one of these for the test. I believe "core_4th_gen_avx" should map to "haswell".

clang/test/CodeGen/attr-cpuspecific-avx-abi.c
28

As noted above, this isn't a valid setting for "tune-cpu". I think it would just be ignored.

erichkeane added inline comments.Mar 10 2022, 5:25 PM
clang/lib/CodeGen/CodeGenModule.cpp
2067

Hmm... this is unfortunate. I wonder if we add some 'translation' type field to the X86TargetParser.def entries? Any idea who the right one to populate said list would be?

Typos in wiht different feature lists and In the even that.

clang/lib/CodeGen/CodeGenModule.cpp
2067

I believe the list of valid names for "tune-cpu" comes from ...

I think it's here https://github.com/llvm/llvm-project/blob/26cd258420c774254cc48330b1f4d23d353baf05/llvm/lib/Target/X86/X86.td#L1408

So back to Andy's problems, where we consume the cpu_specific names in compiler previously, e.g., mapping to different targets? Or it is done by external libraries like compiler-rt?

I think I have the same requirments that mapping - and _ for "tune-cpu" in https://github.com/llvm/llvm-project/issues/50125 where the preprocessor defines use _ as well.

aaron.ballman added a subscriber: arsenm.

Adding @arsenm because of this bit:

Note that the 'valid' list of processors for x86 is in llvm/include/llvm/Support/X86TargetParser.def. At the moment, this list contains only Intel processors, but other vendors may wish to add their own entries as 'alias'es (or wiht different feature lists!).

Thanks all! I'll do some work on populating a list of 'converted names', but I'll definitely need @pengfei and @andrew.w.kaylor help checking the list/filling in what I miss.

erichkeane edited the summary of this revision. (Show Details)Mar 11 2022, 6:02 AM

add a 'translation' feature to the x86 target so that we can get the 'tune cpu' name from the list. Note that there are 9 with blanks that I was unable to figure out the corresponding name (I have an email out to @andrew.w.kaylor and @pengfei to tell me what it should be). In the meantime, these will result in NO tune-cpu.

Also note that I intentionally added this conversion from the 'alias' as well. This gives us the power to use an alias to change the 'tune' if we care to. Typically I'd consider this unimportant, but it means that previously mentioned VendorA (@arsenm) could simply add their processors as aliases and get the tune feature more easily.

Herald added a project: Restricted Project. · View Herald TranscriptMar 11 2022, 6:57 AM
erichkeane added inline comments.Mar 11 2022, 6:58 AM
llvm/include/llvm/Support/X86TargetParser.def
230

Note the blanks on 230-232, 234-237, 245, and 248. Otherwise, a double-check would be really appreciated from everyone familiar with the x86 naming.

Corrected the last few processor names thanks to @andrew.w.kaylor and @pengfei

andrew.w.kaylor accepted this revision.Mar 11 2022, 11:02 AM

This looks good to me. Thanks for the patch!

This revision is now accepted and ready to land.Mar 11 2022, 11:02 AM
craig.topper added inline comments.
llvm/include/llvm/Support/X86TargetParser.def
236

core_aes_pclmulqdq is westmere

Update the core_aes_pclmulqdq to be westmere

erichkeane marked 4 inline comments as done.Mar 11 2022, 11:17 AM
erichkeane added inline comments.
llvm/include/llvm/Support/X86TargetParser.def
236

Thanks!

aaron.ballman accepted this revision.Mar 11 2022, 11:21 AM

LGTM, though I'm not qualified to review the CPU specific bits in the .def file.

pengfei accepted this revision.Mar 11 2022, 9:22 PM

LGTM.

clang/lib/Basic/Targets/X86.cpp
1133

clang-format.

llvm/include/llvm/Support/X86TargetParser.def
236

Missed the left "?

This revision was landed with ongoing or failed builds.Mar 14 2022, 6:14 AM
This revision was automatically updated to reflect the committed changes.
erichkeane marked an inline comment as done.
Herald added a project: Restricted Project. · View Herald TranscriptMar 14 2022, 6:14 AM
FreddyYe added inline comments.
clang/lib/CodeGen/CodeGenModule.cpp
2067

Unfortunately, I don't think it's this easy. The list of names used for cpu_specific doesn't come from the same place as the list of names used by "tune-cpu". For one thing, the cpu_specific names can't contain the '-' character, so we have names like "skylake_avx512" in cpu_specific that would need to be translated to "skylake-avx512" for "tune-cpu". I believe the list of valid names for "tune-cpu" comes from here: https://github.com/llvm/llvm-project/blob/26cd258420c774254cc48330b1f4d23d353baf05/llvm/lib/Support/X86TargetParser.cpp#L294

Also, some of the aliases supported by cpu_specific don't have any corresponding "tune-cpu" name. You happen to have picked one of these for the test. I believe "core_4th_gen_avx" should map to "haswell".

Happens to find this patch. I recently also change here back to the initial version of this patch at https://reviews.llvm.org/D151696. To resolve the problem @andrew.w.kaylor mentioned here, I added these "unsupported" names in X86.td like Phoebe mentioned below. If you are interested, feel free to comment there.