This is an archive of the discontinued LLVM Phabricator instance.


Differential D67259

[X86] Enable -mprefer-vector-width=256 by default for Skylake-avx512 and later Intel CPUs.
Closed, Public

Authored by craig.topper on Sep 5 2019, 11:57 PM.

Details

Reviewers

RKSimon
spatel
chandlerc
echristo
atdt

Commits

rG635d383fad2b: [X86] Enable -mprefer-vector-width=256 by default for Skylake-avx512 and later…
rC371694: [X86] Enable -mprefer-vector-width=256 by default for Skylake-avx512 and later…
rL371694: [X86] Enable -mprefer-vector-width=256 by default for Skylake-avx512 and later…

Summary

AVX512 instructions can cause a frequency drop on these CPUs. This
can negate the performance gains from using wider vectors. Enabling
prefer-vector-width=256 will prevent generation of zmm registers
unless explicit 512 bit operations are used in the original source
code.
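
As a concrete illustration of the summary, here is a minimal sketch in C of the kind of loop the auto-vectorizer typically handles. The function name `saxpy` and the build commands in the comment are illustrative assumptions, not part of the patch; only `-march=skylake-avx512` and `-mprefer-vector-width` come from this review. The register widths noted in the comment are what the change is intended to produce, so check the generated assembly rather than relying on them.

```c
#include <stddef.h>

/* Hypothetical example: y[i] += a * x[i].
 *
 * Illustrative build commands (inspect the assembly with -S):
 *   clang -O2 -march=skylake-avx512 -S saxpy.c
 *       intended result after this change: ymm registers at most
 *   clang -O2 -march=skylake-avx512 -mprefer-vector-width=512 -S saxpy.c
 *       opts back in to zmm registers
 */
void saxpy(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

The source contains no explicit 512-bit operations, so this is exactly the case where the new default is meant to keep the compiler away from zmm registers.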

I believe gcc and icc both do something similar to this by default.

Diff Detail

Repository: rL LLVM

Event Timeline

craig.topper created this revision. Sep 5 2019, 11:57 PM

Herald added a project: Restricted Project. Sep 5 2019, 11:57 PM

Herald added subscribers: ychen, hiraditya.

Harbormaster completed remote builds in B37829: Diff 219036. Sep 6 2019, 12:00 AM

spatel added inline comments. Sep 6 2019, 6:03 AM

llvm/test/CodeGen/X86/min-legal-vector-width.ll
4 ↗	(On Diff #219036)	I assume the negative attributes are here to mask non-512 codegen diffs and keep this file focused on the vector width issue, but would it be valuable to actually check that those diffs are as intended (i.e., add different prefixes if it's not too distracting)?

RKSimon added inline comments. Sep 6 2019, 6:17 AM

llvm/test/Transforms/LoopVectorize/X86/pr42674.ll
2 ↗	(On Diff #219036)	Would we be better off using -mattr=avx512f,avx512dq,avx512bw instead of -mcpu=skylake-avx512?
llvm/test/Transforms/SLPVectorizer/X86/sqrt.ll
6 ↗	(On Diff #219036)	Would we be better off changing all these from -mcpu to -mattr instead?

Maybe you should also mention this in the release notes: say that the default setting has changed, and tell users how to keep the old behaviour in case the new setting causes regressions for them.

rscottmanley added a subscriber: rscottmanley. Sep 6 2019, 6:39 AM

Diffusion mentioned this in rL371261: [X86] Add a AVX512VBMI command line to min-legal-vector-width.ll. Always enable…. Sep 6 2019, 2:50 PM

craig.topper mentioned this in rG03936cb0f942: [X86] Add a AVX512VBMI command line to min-legal-vector-width.ll. Always enable…. Sep 6 2019, 2:51 PM

Add release notes. Pre-commit some of the test changes.

I've modified min-legal-vector-width.ll to enable fast-variable-shuffle on the original RUN lines, and it now tests with and without avx512vbmi. The avx512vnni change wasn't very interesting since it's just an isel pattern peephole, and we do test that peephole elsewhere.

Adding Ori.

Thank you; this is good to see.

I believe gcc and icc both do something similar to this by default.

That's right.
GCC: https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf#page=647 (section 17-78)
ICC has -qopt-zmm-usage=low|high, which defaults to 'low' for Skylake targets (section 17-73)

Let's update older targets, too. The argument for preferring 128-bit on Haswell and Broadwell is even stronger: on these microarchs, cores executing 256-bit AVX can lower the frequency of other cores on the system.

Haswell/broadwell will require a separate patch. We don't have a feature flag that we can use and we probably need to scrub more test command lines.

Do we need to include some benchmark numbers?

I am wondering if this can solve bad perf with avx512 reported here:
https://www.phoronix.com/scan.php?page=article&item=gcc-clang-2019&num=6

In D67259#1662106, @RKSimon wrote:

Do we need to include some benchmark numbers?

What tests do you think we should run?

I don't think it's practical to benchmark this. The effects are well-documented and well-understood. Additionally, the incidence and severity of high-power AVX frequency reductions are a function of system states and workload profile, which makes them difficult to reproduce in small, self-contained examples.

It's sufficient justification for this change to note that it brings Clang into line with other compilers. See here:

A significant problem is compiler inserted AVX-512 instructions. Even if you are not using any explicit AVX-512 instructions or intrinsics, compilers may decide to use them as a result of loop vectorization, within library functions and other optimization. Even something as simple as copying a structure may cause AVX-512 instructions to appear in your program. Current compiler behavior here varies greatly, and we can expect it to change in the future. In fact, it has already changed: Intel made more aggressive use of AVX-512 instructions in earlier versions of the icc compiler, but has since removed most use unless the user asks for it with a special command line option. Based on some not very comprehensive tests of LLVM’s clang (the default compiler on macOS), GNU gcc, Intel’s compiler (icc) and MSVC (part of Microsoft Visual Studio), only clang makes aggressive use of 512-bit instructions for simple constructs today: it used such instructions while copying structures, inlining memcpy, and vectorizing loops.
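
The structure-copy case mentioned in the quote can be sketched in C as follows. The `packet` type and both function names are hypothetical, chosen only so that the structure is exactly 64 bytes (the width of one zmm register); how a given compiler lowers these copies depends on the target, optimization level, and preferred vector width, so treat the comments as possibilities rather than guarantees.

```c
#include <string.h>

/* Hypothetical 64-byte structure: the same size as one zmm register. */
struct packet {
    unsigned char bytes[64];
};

/* A plain structure assignment. Depending on the compiler and flags, this
 * may be lowered to one 512-bit move, two 256-bit moves, or something
 * else entirely. */
void copy_packet(struct packet *dst, const struct packet *src) {
    *dst = *src;
}

/* The same copy written with memcpy, which compilers commonly inline. */
void copy_packet_memcpy(struct packet *dst, const struct packet *src) {
    memcpy(dst, src, sizeof *dst);
}
```

Neither function mentions a vector type or intrinsic, which is the point of the quoted passage: any 512-bit instruction that appears here was chosen by the compiler.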

Sadly, this is a well-known problem at this point. I think that we should move forward with this change, which brings us back in line with other compilers in similar configurations. We should definitely make it clear in the release notes, and in the Clang release notes, how to restore the more-aggressive AVX-512 use.

Update clang release notes as well.

In which case I'm happy to go with this.

So who wants to click Accept?

LGTM

llvm/docs/ReleaseNotes.rst
100 ↗	(On Diff #219735)	register -> registers

This revision is now accepted and ready to land. Sep 11 2019, 12:57 PM

xbolva00 added inline comments. Sep 11 2019, 12:59 PM

clang/docs/ReleaseNotes.rst
64 ↗	(On Diff #219735)	-mprefer-vector-width=512 ?

Closed by commit rL371694: [X86] Enable -mprefer-vector-width=256 by default for Skylake-avx512 and later… (authored by ctopper). Sep 11 2019, 4:52 PM

This revision was automatically updated to reflect the committed changes.

dcaballe added a subscriber: dcaballe. Sep 14 2019, 2:54 PM

dcaballe added inline comments.

llvm/docs/ReleaseNotes.rst
102 ↗	(On Diff #219735)	Typo? "passing -mattr=-prefer-256-bit to llc" -> "passing -mattr=-prefer-512-bit to llc"?

dcaballe added inline comments. Sep 14 2019, 3:09 PM

llvm/docs/ReleaseNotes.rst
102 ↗	(On Diff #219735)	Sorry, disregard my previous comment. I thought this was a dash, not a minus. If we do `-mattr=-prefer-256-bit`, is `prefer-512-bit` automatically set or is it not necessary?

Revision Contents

Path	Size
cfe/trunk/docs/ReleaseNotes.rst	8 lines
llvm/trunk/docs/ReleaseNotes.rst	4 lines
llvm/trunk/lib/Target/X86/X86.td	2 lines
llvm/trunk/test/CodeGen/X86/min-legal-vector-width.ll	8 lines

Diff 219832

cfe/trunk/docs/ReleaseNotes.rst

	Improvements to Clang's diagnostics			Improvements to Clang's diagnostics
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	- ...			- ...

	Non-comprehensive list of changes in this release			Non-comprehensive list of changes in this release
	-------------------------------------------------			-------------------------------------------------

	- ...			- For X86 target, -march=skylake-avx512, -march=icelake-client,
				-march=icelake-server, -march=cascadelake, -march=cooperlake will default to
				not using 512-bit zmm registers in vectorized code unless 512-bit intrinsics
				are used in the source code. 512-bit operations are known to cause the CPUs
				to run at a lower frequency which can impact performance. This behavior can be
				changed by passing -mprefer-vector-width=512 on the command line.

	New Compiler Flags			New Compiler Flags
	------------------			------------------

	- ...			- ...

	Deprecated Compiler Flags			Deprecated Compiler Flags
	-------------------------			-------------------------
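
The release note above carves out an exception for 512-bit intrinsics used in the source. A minimal sketch of that case in C follows. The function names are hypothetical, and the runtime dispatch through `__builtin_cpu_supports` with a per-function `target` attribute (both GCC/Clang extensions) is an assumption added here only so the example also runs on machines without AVX-512; it is not something the patch requires.

```c
#include <stdint.h>

#if defined(__x86_64__) && (defined(__clang__) || defined(__GNUC__))
#include <immintrin.h>

/* Explicit 512-bit operation: the source asks for zmm directly, so per the
 * release note the preferred-vector-width default should not narrow it.
 * The target attribute enables AVX-512F for this one function only. */
__attribute__((target("avx512f")))
static void add16_avx512(const int32_t *a, const int32_t *b, int32_t *out) {
    __m512i va = _mm512_loadu_si512((const void *)a);
    __m512i vb = _mm512_loadu_si512((const void *)b);
    _mm512_storeu_si512((void *)out, _mm512_add_epi32(va, vb));
}
#endif

/* Portable fallback, also used when the CPU lacks AVX-512F. */
static void add16_scalar(const int32_t *a, const int32_t *b, int32_t *out) {
    for (int i = 0; i < 16; ++i)
        out[i] = a[i] + b[i];
}

/* Adds 16 int32 lanes, taking the 512-bit path only when it is available. */
void add16(const int32_t *a, const int32_t *b, int32_t *out) {
#if defined(__x86_64__) && (defined(__clang__) || defined(__GNUC__))
    if (__builtin_cpu_supports("avx512f")) {
        add16_avx512(a, b, out);
        return;
    }
#endif
    add16_scalar(a, b, out);
}
```

Compiling this with `-march=skylake-avx512` and no `-mprefer-vector-width` option is a quick way to check the exception: the intrinsic path is expected to keep its zmm instructions, while the scalar fallback loop is subject to the new default.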

llvm/trunk/docs/ReleaseNotes.rst

	* Less than 128 bit vector types, v2i32, v4i16, v2i16, v8i8, v4i8, and v2i8, are			* Less than 128 bit vector types, v2i32, v4i16, v2i16, v8i8, v4i8, and v2i8, are
	now stored in the lower bits of an xmm register and the upper bits are			now stored in the lower bits of an xmm register and the upper bits are
	undefined. Previously the elements were spread apart with undefined bits in			undefined. Previously the elements were spread apart with undefined bits in
	between them.			between them.
	* v32i8 and v64i8 vectors with AVX512F enabled, but AVX512BW disabled will now			* v32i8 and v64i8 vectors with AVX512F enabled, but AVX512BW disabled will now
	be passed in ZMM registers for calls and returns. Previously they were passed			be passed in ZMM registers for calls and returns. Previously they were passed
	in two YMM registers. Old behavior can be enabled by passing			in two YMM registers. Old behavior can be enabled by passing
	-x86-enable-old-knl-abi			-x86-enable-old-knl-abi
				* -mprefer-vector-width=256 is now the default behavior skylake-avx512 and later
				Intel CPUs. This tries to limit the use of 512-bit registers which can cause a
				decrease in CPU frequency on these CPUs. This can be re-enabled by passing
				-mprefer-vector-width=512 to clang or passing -mattr=-prefer-256-bit to llc.

	Changes to the AMDGPU Target			Changes to the AMDGPU Target
	-----------------------------			-----------------------------

	Changes to the AVR Target			Changes to the AVR Target
	-----------------------------			-----------------------------

	During this release ...			During this release ...

llvm/trunk/lib/Target/X86/X86.td

list<SubtargetFeature> SKLSpecificFeatures = [FeatureHasFastGather,		list<SubtargetFeature> SKLSpecificFeatures = [FeatureHasFastGather,
FeatureSGX];		FeatureSGX];
list<SubtargetFeature> SKLInheritableFeatures =		list<SubtargetFeature> SKLInheritableFeatures =
!listconcat(BDWInheritableFeatures, SKLAdditionalFeatures);		!listconcat(BDWInheritableFeatures, SKLAdditionalFeatures);
list<SubtargetFeature> SKLFeatures =		list<SubtargetFeature> SKLFeatures =
!listconcat(SKLInheritableFeatures, SKLSpecificFeatures);		!listconcat(SKLInheritableFeatures, SKLSpecificFeatures);

// Skylake-AVX512		// Skylake-AVX512
list<SubtargetFeature> SKXAdditionalFeatures = [FeatureAVX512,		list<SubtargetFeature> SKXAdditionalFeatures = [FeatureAVX512,
		FeaturePrefer256Bit,
FeatureCDI,		FeatureCDI,
FeatureDQI,		FeatureDQI,
FeatureBWI,		FeatureBWI,
FeatureVLX,		FeatureVLX,
FeaturePKU,		FeaturePKU,
FeatureCLWB];		FeatureCLWB];
list<SubtargetFeature> SKXSpecificFeatures = [FeatureHasFastGather,		list<SubtargetFeature> SKXSpecificFeatures = [FeatureHasFastGather,
FeaturePOPCNTFalseDeps];		FeaturePOPCNTFalseDeps];
[17 lines omitted]
list<SubtargetFeature> CPXSpecificFeatures = [FeatureHasFastGather,		list<SubtargetFeature> CPXSpecificFeatures = [FeatureHasFastGather,
FeaturePOPCNTFalseDeps];		FeaturePOPCNTFalseDeps];
list<SubtargetFeature> CPXInheritableFeatures =		list<SubtargetFeature> CPXInheritableFeatures =
!listconcat(CLXInheritableFeatures, CPXAdditionalFeatures);		!listconcat(CLXInheritableFeatures, CPXAdditionalFeatures);
list<SubtargetFeature> CPXFeatures =		list<SubtargetFeature> CPXFeatures =
!listconcat(CPXInheritableFeatures, CPXSpecificFeatures);		!listconcat(CPXInheritableFeatures, CPXSpecificFeatures);

// Cannonlake		// Cannonlake
list<SubtargetFeature> CNLAdditionalFeatures = [FeatureAVX512,		list<SubtargetFeature> CNLAdditionalFeatures = [FeatureAVX512,
		FeaturePrefer256Bit,
FeatureCDI,		FeatureCDI,
FeatureDQI,		FeatureDQI,
FeatureBWI,		FeatureBWI,
FeatureVLX,		FeatureVLX,
FeaturePKU,		FeaturePKU,
FeatureVBMI,		FeatureVBMI,
FeatureIFMA,		FeatureIFMA,
FeatureSHA,		FeatureSHA,

llvm/trunk/test/CodeGen/X86/min-legal-vector-width.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=fast-variable-shuffle,avx512vl,avx512bw,avx512dq,prefer-256-bit \| FileCheck %s --check-prefixes=CHECK,CHECK-AVX512			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=fast-variable-shuffle,avx512vl,avx512bw,avx512dq,prefer-256-bit \| FileCheck %s --check-prefixes=CHECK,CHECK-AVX512
	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=fast-variable-shuffle,avx512vl,avx512bw,avx512dq,prefer-256-bit,avx512vbmi \| FileCheck %s --check-prefixes=CHECK,CHECK-VBMI			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=fast-variable-shuffle,avx512vl,avx512bw,avx512dq,prefer-256-bit,avx512vbmi \| FileCheck %s --check-prefixes=CHECK,CHECK-VBMI
				; Make sure CPUs default to prefer-256-bit. avx512vnni isn't interesting as it just adds an isel peephole for vpmaddwd+vpaddd
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=skylake-avx512 \| FileCheck %s --check-prefixes=CHECK,CHECK-AVX512
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-avx512vnni -mcpu=cascadelake \| FileCheck %s --check-prefixes=CHECK,CHECK-AVX512
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-avx512vnni -mcpu=cooperlake \| FileCheck %s --check-prefixes=CHECK,CHECK-AVX512
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=cannonlake \| FileCheck %s --check-prefixes=CHECK,CHECK-VBMI
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-avx512vnni -mcpu=icelake-client \| FileCheck %s --check-prefixes=CHECK,CHECK-VBMI
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-avx512vnni -mcpu=icelake-server \| FileCheck %s --check-prefixes=CHECK,CHECK-VBMI
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-avx512vnni -mcpu=tigerlake \| FileCheck %s --check-prefixes=CHECK,CHECK-VBMI

	; This file primarily contains tests for specific places in X86ISelLowering.cpp that needed be made aware of the legalizer not allowing 512-bit vectors due to prefer-256-bit even though AVX512 is enabled.			; This file primarily contains tests for specific places in X86ISelLowering.cpp that needed be made aware of the legalizer not allowing 512-bit vectors due to prefer-256-bit even though AVX512 is enabled.

	define void @add256(<16 x i32>* %a, <16 x i32>* %b, <16 x i32>* %c) "min-legal-vector-width"="256" {			define void @add256(<16 x i32>* %a, <16 x i32>* %b, <16 x i32>* %c) "min-legal-vector-width"="256" {
	; CHECK-LABEL: add256:			; CHECK-LABEL: add256:
	; CHECK: # %bb.0:			; CHECK: # %bb.0:
	; CHECK-NEXT: vmovdqa (%rdi), %ymm0			; CHECK-NEXT: vmovdqa (%rdi), %ymm0
	; CHECK-NEXT: vmovdqa 32(%rdi), %ymm1			; CHECK-NEXT: vmovdqa 32(%rdi), %ymm1