Page MenuHomePhabricator

[X86] Enable -mprefer-vector-width=256 by default for Skylake-avx512 and later Intel CPUs.
ClosedPublic

Authored by craig.topper on Thu, Sep 5, 11:57 PM.

Details

Summary

AVX512 instructions can cause a frequency drop on these CPUs. This
can negate the performance gains from using wider vectors. Enabling
prefer-vector-width=256 will prevent generation of zmm registers
unless explicit 512 bit operations are used in the original source
code.

I believe gcc and icc both do something similar to this by default.

Diff Detail

Repository
rL LLVM

Event Timeline

craig.topper created this revision.Thu, Sep 5, 11:57 PM
Herald added a project: Restricted Project. · View Herald TranscriptThu, Sep 5, 11:57 PM
spatel added inline comments.Fri, Sep 6, 6:03 AM
llvm/test/CodeGen/X86/min-legal-vector-width.ll
4 ↗(On Diff #219036)

I assume the negative attributes are here to mask non-512 codegen diffs and keep this file focused on the vector width issue, but would it be valuable to actually check that those diffs are as intended (ie, add different prefixes if it's not too distracting)?

RKSimon added inline comments.Fri, Sep 6, 6:17 AM
llvm/test/Transforms/LoopVectorize/X86/pr42674.ll
2 ↗(On Diff #219036)

Would we be better off using -mattr=avx512f,avx512dq,avx512bw instead -mcpu=skylake-avx512?

llvm/test/Transforms/SLPVectorizer/X86/sqrt.ll
6 ↗(On Diff #219036)

Would we be better off changing all these from -mcpu to -mattr instead?

Maybe you should mention this also in Release news, that the default setting was changed and also inform users how to keep old behaviour in case new setting would cause regressions for them.

Add release notes. Pre-commit some of the test changes.

I've modified the min-legal-vector-width to enable fast-variable-shuffle on the original lines and now test with and without avx512vbmi. The avx512vnni change wasn't very interesting since its just an isel pattern peephole and we do test that peephole elsewhere.

Adding Ori.

atdt added a comment.Fri, Sep 6, 4:40 PM

Thank you; this is good to see.

I believe gcc and icc both do something similar to this by default.

That's right.
GCC: https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf#page=647 (section 17-78)
ICC has -qopt-zmm-usage=low|high, which defaults to 'low' for Skylake targets (section 17-73)

Let's update older targets, too. The argument for preferring 128-bit on Haswell and Broadwell is even stronger: on these microarchs, cores executing 256-bit AVX can lower the frequency of other cores on the system.

Thank you; this is good to see.

I believe gcc and icc both do something similar to this by default.

That's right.
GCC: https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf#page=647 (section 17-78)
ICC has -qopt-zmm-usage=low|high, which defaults to 'low' for Skylake targets (section 17-73)

Let's update older targets, too. The argument for preferring 128-bit on Haswell and Broadwell is even stronger: on these microarchs, cores executing 256-bit AVX can lower the frequency of other cores on the system.

Haswell/broadwell will require a separate patch. We don't have a feature flag that we can use and we probably need to scrub more test command lines.

Do we need to include some benchmark numbers?

I am wondering if this can solve bad perf with avx512 reported here:
https://www.phoronix.com/scan.php?page=article&item=gcc-clang-2019&num=6

Do we need to include some benchmark numbers?

What tests do you think we should run?

atdt added a comment.Wed, Sep 11, 8:38 AM

Do we need to include some benchmark numbers?

What tests do you think we should run?

I don't think it's practical to benchmark this. The effects are well-documented and well-understood. Additionally, the incidence and severity of high-power AVX frequency reductions are a function of system states and workload profile, which makes them difficult to reproduce in small, self-contained examples.

It's sufficient justification for this change to note that it brings Clang into line with other compilers. See here:

A significant problem is compiler inserted AVX-512 instructions. Even if you are not using any explicit AVX-512 instructions or intrinsics, compilers may decide to use them as a result of loop vectorization, within library functions and other optimization. Even something as simple as copying a structure may cause AVX-512 instructions to appear in your program. Current compiler behavior here varies greatly, and we can expect it to change in the future. In fact, it has already changed: Intel made more aggressive use of AVX-512 instructions in earlier versions of the icc compiler, but has since removed most use unless the user asks for it with a special command line option. Based on some not very comprehensive tests of LLVM’s clang (the default compiler on macOS), GNU gcc, Intel’s compiler (icc) and MSVC (part of Microsoft Visual Studio), only clang makes aggressive use of 512-bit instructions for simple constructs today: it used such instructions while copying structures, inlining memcpy, and vectorizing loops.

Do we need to include some benchmark numbers?

What tests do you think we should run?

I don't think it's practical to benchmark this. The effects are well-documented and well-understood. Additionally, the incidence and severity of high-power AVX frequency reductions are a function of system states and workload profile, which makes them difficult to reproduce in small, self-contained examples.

It's sufficient justification for this change to note that it brings Clang into line with other compilers. See here:

A significant problem is compiler inserted AVX-512 instructions. Even if you are not using any explicit AVX-512 instructions or intrinsics, compilers may decide to use them as a result of loop vectorization, within library functions and other optimization. Even something as simple as copying a structure may cause AVX-512 instructions to appear in your program. Current compiler behavior here varies greatly, and we can expect it to change in the future. In fact, it has already changed: Intel made more aggressive use of AVX-512 instructions in earlier versions of the icc compiler, but has since removed most use unless the user asks for it with a special command line option. Based on some not very comprehensive tests of LLVM’s clang (the default compiler on macOS), GNU gcc, Intel’s compiler (icc) and MSVC (part of Microsoft Visual Studio), only clang makes aggressive use of 512-bit instructions for simple constructs today: it used such instructions while copying structures, inlining memcpy, and vectorizing loops.

Sadly, this is a well-known problem at this point. I think that we should move forward with this change, which brings us back in line with other compilers in similar configurations. We should definitely make it clean in the release notes, and in the Clang release notes, how to restore the more-aggressive AVX-512 use.

Update clang release notes as well.

In which case I'm happy to go with this.

So who wants to click Accept?

spatel accepted this revision.Wed, Sep 11, 12:57 PM

LGTM

llvm/docs/ReleaseNotes.rst
100 ↗(On Diff #219735)

register -> registers

This revision is now accepted and ready to land.Wed, Sep 11, 12:57 PM
xbolva00 added inline comments.Wed, Sep 11, 12:59 PM
clang/docs/ReleaseNotes.rst
64 ↗(On Diff #219735)

-mprefer-vector-width=512 ?

This revision was automatically updated to reflect the committed changes.
dcaballe added inline comments.
llvm/docs/ReleaseNotes.rst
102 ↗(On Diff #219735)

Typo? "passing -mattr=-prefer-256-bit to llc" -> "passing -mattr=-prefer-512-bit to llc"?

dcaballe added inline comments.Sat, Sep 14, 3:09 PM
llvm/docs/ReleaseNotes.rst
102 ↗(On Diff #219735)

Sorry, disregard my previous comment. I thought this was a dash, not a minus. If we do -mattr=-prefer-256-bit, is prefer-512-bit automatically set or is it not necessary?