Download Raw Diff

Details

Reviewers

craig.topper
pengfei
RKSimon
LuoYuanke

Commits

rGd7043e8c41bb: [X86] Add support for "light" AVX

Summary

AVX/AVX512 instructions may cause frequency drop on e.g. Skylake.
The magnitude of frequency/performance drop depends on instruction
(multiplication vs load/store) and vector width. Currently users,
that want to avoid this drop can specify -mprefer-vector-width=128.
However this also prevents generations of 256-bit wide instructions,
that have no associated frequency drop (mainly load/stores).

Add a tuning flag that allows generations of 256-bit AVX load/stores,
even when -mprefer-vector-width=128 is set, to speed-up memcpy&co.
Verified that running memcpy loop on all cores has no frequency impact and
zero CORE_POWER:LVL[12]_TURBO_LICENSE perf counters.

Makes coping memory faster:
BM_memcpy_aligned/256 80.7GB/s ± 3% 96.3GB/s ± 9% +19.33% (p=0.000 n=9+9

Diff Detail

Event Timeline

TokarIP created this revision.Sep 30 2022, 12:00 PM

Herald added subscribers: StephenFan, pengfei, hiraditya. · View Herald TranscriptSep 30 2022, 12:00 PM

Herald added a project: Restricted Project. · View Herald TranscriptSep 30 2022, 12:00 PM

TokarIP requested review of this revision.Sep 30 2022, 12:00 PM

Herald added a project: Restricted Project. · View Herald TranscriptSep 30 2022, 12:00 PM

Herald added a subscriber: llvm-commits. · View Herald Transcript

Harbormaster completed remote builds in B189745: Diff 464345.Sep 30 2022, 1:11 PM

pengfei added a subscriber: echristo.Sep 30 2022, 8:46 PM

pengfei added inline comments.

llvm/lib/Target/X86/X86ISelLowering.cpp
2665	Here the check for 256 was intended from rG47272217 authored by @echristo. It looks to me it is the only difference between `prefer-128-bit` and `prefer-256-bit`. So I don't understand why using `-mattr=prefer-128-bit -x86-light-avx=true` rather than `prefer-256-bit`.

I don’t think -mprefer-vector-width=128 has effect on most instructions. The 256 version was heavily integrated into the type legalization to split operations. I don’t think that was ever done for 128.

craig.topper added reviewers: pengfei, RKSimon, LuoYuanke.Sep 30 2022, 11:26 PM

Please do you have any more statistics on what range of machine(s) and test cases you've tried this on (compared to -mattr=prefer-128-bit/256-bit)?

If we think we need this it would definitely be better as a Tuning feature bit (prefer-256-bit-memcpy)?

In D134982#3828850, @craig.topper wrote:

I don’t think -mprefer-vector-width=128 has effect on most instructions. The 256 version was heavily integrated into the type legalization to split operations. I don’t think that was ever done for 128.

-mprefer-vector-width affects vectorizer decision to choose VL. Here is motivating example: https://godbolt.org/z/j8hrP5jhb

In D134982#3829073, @RKSimon wrote:

Please do you have any more statistics on what range of machine(s) and test cases you've tried this on (compared to -mattr=prefer-128-bit/256-bit)?

Tested (128 + this vs plain 128) on AMD rome:
BM_Memcpy/0/0 [llvm_libc::memcpy,memcpy Google A ] 19.2GB/s ± 3% 21.6GB/s ± 8% +12.44% (p=0.000 n=19+20)
BM_Memcpy/1/0 [llvm_libc::memcpy,memcpy Google B ] 9.48GB/s ±11% 9.70GB/s ±10% ~ (p=0.228 n=18+20)
BM_Memcpy/2/0 [llvm_libc::memcpy,memcpy Google D ] 33.0GB/s ± 2% 45.3GB/s ± 3% +37.08% (p=0.000 n=20+20)
BM_Memcpy/3/0 [llvm_libc::memcpy,memcpy Google L ] 5.90GB/s ±17% 5.96GB/s ±19% ~ (p=0.835 n=19+20)
BM_Memcpy/4/0 [llvm_libc::memcpy,memcpy Google M ] 6.55GB/s ±14% 6.87GB/s ±11% ~ (p=0.056 n=20+20)
BM_Memcpy/5/0 [llvm_libc::memcpy,memcpy Google Q ] 3.74GB/s ±18% 3.55GB/s ±17% ~ (p=0.081 n=20+20)
BM_Memcpy/6/0 [llvm_libc::memcpy,memcpy Google S ] 8.74GB/s ± 8% 9.16GB/s ± 7% +4.70% (p=0.002 n=18+20)
BM_Memcpy/7/0 [llvm_libc::memcpy,memcpy Google U ] 9.79GB/s ±12% 10.38GB/s ±14% +6.01% (p=0.010 n=20+20)
BM_Memcpy/8/0 [llvm_libc::memcpy,memcpy Google W ] 6.91GB/s ± 9% 7.24GB/s ± 8% +4.75% (p=0.001 n=19+20)
BM_Memcpy/9/0 [llvm_libc::memcpy,uniform 384 to 4096 ] 43.2GB/s ± 1% 65.2GB/s ± 1% +50.69% (p=0.000 n=20+19)

Intel Skylake (server)
BM_Memcpy/0/0 [llvm_libc::memcpy,memcpy Google A ] 18.1GB/s ± 9% 20.9GB/s ± 8% +15.58% (p=0.000 n=18+19)
BM_Memcpy/1/0 [llvm_libc::memcpy,memcpy Google B ] 8.43GB/s ±14% 8.74GB/s ±18% ~ (p=0.175 n=19+20)
BM_Memcpy/2/0 [llvm_libc::memcpy,memcpy Google D ] 34.5GB/s ± 3% 49.2GB/s ± 5% +42.88% (p=0.000 n=17+18)
BM_Memcpy/3/0 [llvm_libc::memcpy,memcpy Google L ] 5.51GB/s ±29% 5.72GB/s ±19% ~ (p=0.461 n=20+19)
BM_Memcpy/4/0 [llvm_libc::memcpy,memcpy Google M ] 5.57GB/s ±18% 5.72GB/s ±20% ~ (p=0.529 n=20+20)
BM_Memcpy/5/0 [llvm_libc::memcpy,memcpy Google Q ] 2.97GB/s ±12% 3.15GB/s ±11% +6.08% (p=0.007 n=20+19)
BM_Memcpy/6/0 [llvm_libc::memcpy,memcpy Google S ] 7.88GB/s ±15% 8.41GB/s ± 6% +6.68% (p=0.000 n=18+17)
BM_Memcpy/7/0 [llvm_libc::memcpy,memcpy Google U ] 8.65GB/s ±19% 9.65GB/s ±17% +11.62% (p=0.001 n=20+20)
BM_Memcpy/8/0 [llvm_libc::memcpy,memcpy Google W ] 6.17GB/s ±15% 6.41GB/s ±10% +3.75% (p=0.038 n=17+18)
BM_Memcpy/9/0 [llvm_libc::memcpy,uniform 384 to 4096 ] 44.5GB/s ± 2% 70.0GB/s ± 9% +57.38% (p=0.000 n=16+17)

And Intel Haswell
BM_Memcpy/0/0 [llvm_libc::memcpy,memcpy Google A ] 19.6GB/s ± 7% 22.5GB/s ± 8% +15.08% (p=0.000 n=20+20)
BM_Memcpy/1/0 [llvm_libc::memcpy,memcpy Google B ] 9.15GB/s ± 5% 9.16GB/s ±13% ~ (p=0.798 n=17+20)
BM_Memcpy/2/0 [llvm_libc::memcpy,memcpy Google D ] 37.4GB/s ± 6% 53.5GB/s ± 6% +42.95% (p=0.000 n=20+20)
BM_Memcpy/3/0 [llvm_libc::memcpy,memcpy Google L ] 6.74GB/s ±17% 6.88GB/s ±17% ~ (p=0.461 n=20+19)
BM_Memcpy/4/0 [llvm_libc::memcpy,memcpy Google M ] 6.56GB/s ± 5% 6.85GB/s ±16% ~ (p=0.105 n=18+20)
BM_Memcpy/5/0 [llvm_libc::memcpy,memcpy Google Q ] 3.82GB/s ±18% 3.68GB/s ±24% ~ (p=0.253 n=20+20)
BM_Memcpy/6/0 [llvm_libc::memcpy,memcpy Google S ] 8.75GB/s ± 9% 9.00GB/s ±14% ~ (p=0.211 n=20+20)
BM_Memcpy/7/0 [llvm_libc::memcpy,memcpy Google U ] 10.2GB/s ±16% 10.6GB/s ±16% ~ (p=0.157 n=20+20)
BM_Memcpy/8/0 [llvm_libc::memcpy,memcpy Google W ] 7.30GB/s ± 8% 7.42GB/s ±11% ~ (p=0.301 n=20+20)
BM_Memcpy/9/0 [llvm_libc::memcpy,uniform 384 to 4096 ] 47.9GB/s ± 3% 77.3GB/s ± 6% +61.61% (p=0.000 n=19+20)

Internal loadtests shows 0.1-0.2% win vs -mprefer-vector-width=128. -mprefer-vector-width=256 causes several % performance regressions vs both this and plain 128.

llvm/lib/Target/X86/X86ISelLowering.cpp
2665	I'm not sure I understand the question. Building everything with prefer-256-bit means getting e.g 256-bit FMA and corresponding frequency penalty. I want to get 256-bit loads/stores because they are free performance win, but not "heavy" instructions.

In D134982#3848311, @TokarIP wrote:

In D134982#3828850, @craig.topper wrote:

I don’t think -mprefer-vector-width=128 has effect on most instructions. The 256 version was heavily integrated into the type legalization to split operations. I don’t think that was ever done for 128.

-mprefer-vector-width affects vectorizer decision to choose VL. Here is motivating example: https://godbolt.org/z/j8hrP5jhb

Thansk. I had forgotten it was used in X86TTIImpl::getRegisterBitWidth().

pengfei added inline comments.Oct 10 2022, 11:48 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
2665	Oh, I thought it is the only place we compared PreferVectorWidth with 256. (There're two other places in X86TargetTransformInfo.cpp) I mentioned @echristo's patch, because I guess make load/store the same size as PreferVectorWidth is beneficial in the most cases. And do we need to consider the intensity of "heavy" instructions? E.g., if we load 256-bit vector, shuffle to 2 128-bit to do a single FMA, and then shuffle back to 256-bit to do the store. Is it really better than 2 128-bit load/store that might be folded into the FMA instruction?

TokarIP added inline comments.Nov 23 2022, 2:35 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
2665	AFAIK this won't happen. We either vectorize everything with 256-bit, so no FMA, just data movement, or with 128-bit and have no 256-bit at all. Reading the code and playing with examples didn't produce any mixed cases.

LGTM.

This revision is now accepted and ready to land.Nov 24 2022, 4:05 AM

Still not a tuning flag

This revision now requires changes to proceed.Nov 24 2022, 4:12 AM

In D134982#3948950, @RKSimon wrote:

Still not a tuning flag

Just to make sure I understand you correctly: you want tuning flag (set for skylake/zen/haswell) and no command line flags?
I'm worried that this will cause some confusion, e.g .user requests -mprefer-vector-width=128, but sees ymm (256-bit) generated and files a bug, etc.
It is also easier to evaluate changes that are enabled by individual flags.

Matt added a subscriber: Matt.Dec 7 2022, 6:27 PM

In D134982#3948950, @RKSimon wrote:

Still not a tuning flag

Added version with a tuning flag

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

In D134982#3999680, @lebedev.ri wrote:

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

I'll probably expand this to other "light" AVX instructions (like vpcmpeq for memcmp intrinsic) in the future.
Also we don't want the full width, 512-bit load/stores still cause some frequency drop on skylake.

In D134982#3999687, @TokarIP wrote:

In D134982#3999680, @lebedev.ri wrote:

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

I'll probably expand this to other "light" AVX instructions (like vpcmpeq for memcmp intrinsic) in the future.
Also we don't want the full width, 512-bit load/stores still cause some frequency drop on skylake.

Please don't consider this a blocking feedback, but that sounds
a lot more problematic than just dealing with loads/stores.
But i'll defer to @craig.topper / @RKSimon.

Harbormaster completed remote builds in B203485: Diff 483356.Dec 15 2022, 7:39 PM

In D134982#3999687, @TokarIP wrote:

In D134982#3999680, @lebedev.ri wrote:

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

I'll probably expand this to other "light" AVX instructions (like vpcmpeq for memcmp intrinsic) in the future.
Also we don't want the full width, 512-bit load/stores still cause some frequency drop on skylake.

Do we have a definitive list of what intel considers "light" 256-bit instructions?

llvm/test/CodeGen/X86/memcpy-light-avx.ll
2	Test with generic settings as well (e.g. -mattr=avx2,+prefer-128-bit,+allow-light-avx)

Sorry I missed your previous message - yes this is the kind of tuning attribute I had in mind instead of a generic flag, although I think it needs to be tightened to be more specific (256-bit ops).

llvm/lib/Target/X86/X86.td
619 ↗	(On Diff #483356)	Maybe rename this "allow-light-256-bit"?
llvm/lib/Target/X86/X86Subtarget.h
259 ↗	(On Diff #483356)	Depending on how prevalent this will be used, we might adjust it so people don't need to remember to check Subtarget.getPreferVectorWidth() >= 256 as well: bool useLightAVX256Instructions() const { return Subtarget.getPreferVectorWidth() >= 256 \|\| AllowLightAVX; }

In D134982#4000705, @RKSimon wrote:

In D134982#3999687, @TokarIP wrote:

In D134982#3999680, @lebedev.ri wrote:

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

I'll probably expand this to other "light" AVX instructions (like vpcmpeq for memcmp intrinsic) in the future.
Also we don't want the full width, 512-bit load/stores still cause some frequency drop on skylake.

Do we have a definitive list of what intel considers "light" 256-bit instructions?

I had read it from here https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/

Light instructions include integer operations other than multiplication, logical operations, data shuffling (such as vpermw and vpermd) and so forth. Heavy instructions are common in deep learning, numerical analysis, high performance computing, and some cryptography (i.e., multiplication-based hashing). Light instructions tend to dominate in text processing, fast compression routines, vectorized implementations of library routines such as memcpy in C or System.arrayCopy in Java, and so forth.

In D134982#4000705, @RKSimon wrote:

In D134982#3999687, @TokarIP wrote:

In D134982#3999680, @lebedev.ri wrote:

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

I'll probably expand this to other "light" AVX instructions (like vpcmpeq for memcmp intrinsic) in the future.
Also we don't want the full width, 512-bit load/stores still cause some frequency drop on skylake.

Do we have a definitive list of what intel considers "light" 256-bit instructions?

Official? - don't think so. I remember seeing some unofficial lists, but can't find them right now. Considering new cpus released since 2016, retesting would make sense anyway.

llvm/lib/Target/X86/X86.td
619 ↗	(On Diff #483356)	Done for the feature name, should I also rename other mentions of light avx?

Harbormaster completed remote builds in B204429: Diff 484641.Dec 21 2022, 12:29 PM

RKSimon added inline comments.Dec 22 2022, 2:54 AM

llvm/lib/Target/X86/X86.td
619 ↗	(On Diff #483356)	Yes please maintain a consistent naming if you can - AVX is redundant imo and will confuse things if we ever try to do this for 512-bit vectors as well.

TokarIP updated this revision to Diff 484962.Dec 22 2022, 2:19 PM

TokarIP edited the summary of this revision. (Show Details)

TokarIP added inline comments.

llvm/lib/Target/X86/X86.td
619 ↗	(On Diff #483356)	Done.

Harbormaster completed remote builds in B204674: Diff 484962.Dec 22 2022, 3:56 PM

@pengfei Please can you confirm that the Intel models are suitable for the TuningAllowLight256Bit flag?

llvm/lib/Target/X86/X86.td
1290 ↗	(On Diff #484962)	I'm not certain Ryzen needs this - even on znver1 with double pumping of 256-bit ops.

In D134982#4015958, @RKSimon wrote:

@pengfei Please can you confirm that the Intel models are suitable for the TuningAllowLight256Bit flag?

I don't have such targets at hand. I think it should be good in theory, so we can land it first.

llvm/test/CodeGen/X86/vector-width-store-merge.ll
70 ↗	(On Diff #484962)	This patch changes the behavior the test expected, though it should no correctness issue for 256-bits. We should update the test to show it rather than hide it. Note, it will have correctness issue or build error if force to generate 512-bits instructions.

This revision now requires changes to proceed.Dec 24 2022, 4:42 AM

TokarIP added inline comments.Dec 27 2022, 1:18 PM

llvm/lib/Target/X86/X86.td
1290 ↗	(On Diff #484962)	I'm not sure I understand this comment. You mean since Ryzen doesn't have any frequency problems, so we don't care about prefer-vector-width=128 behavior? This is mostly here for a) completeness (since 256-ops don't seem to hurt on ryzen we do prefer 256 bit loads/stores) and b) for cases where users want znver tuning but still prefer good performance on intel sop they pass prefer-vector-width=128
llvm/test/CodeGen/X86/vector-width-store-merge.ll
70 ↗	(On Diff #484962)	We want to test 2 behaviors: 1)prefer-vector-width=128 and no TuningAllowLight256Bit should generate 128-bit load/store - this test 2)prefer-vector-width=128 and TuningAllowLight256Bit should generate 256-bit - memcpy-light-avx.ll Updating this test to check 256 case, means that we still need extra test for behavior #1, I'd rather keep the number of tests smaller with the same coverage.

pengfei added inline comments.Dec 27 2022, 5:19 PM

llvm/test/CodeGen/X86/memcpy-light-avx.ll
2	Why don't use `update_llc_test_checks.py` to generate the test?
llvm/test/CodeGen/X86/vector-width-store-merge.ll
70 ↗	(On Diff #484962)	You can add another RUN to test the behaviors in `llvm/test/CodeGen/X86/memcpy-light-avx.ll` if you like, e.g., `; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=haswell \| FileCheck %s --check-prefix=NO-256` We don't have a method to disable it on new targets, so no 2 behaviors here.

lebedev.ri added inline comments.Dec 27 2022, 5:27 PM

llvm/lib/Target/X86/X86.td
1290 ↗	(On Diff #484962)	I agree with @RKSimon here. I'm not really sure why anyone would want to use non-full vector width on Ryzens, so i don't think we support it there.

TokarIP updated this revision to Diff 485550.Dec 28 2022, 3:35 PM

TokarIP marked an inline comment as done.

TokarIP edited the summary of this revision. (Show Details)

TokarIP added inline comments.

llvm/lib/Target/X86/X86.td
1290 ↗	(On Diff #484962)	FWIW mtune=znver3 + mprefer-vector-width=128 often gives best results for a mixed (skylake+rome) server fleet.
llvm/test/CodeGen/X86/vector-width-store-merge.ll
70 ↗	(On Diff #484962)	Thanks for the suggestion! Now we have 2 runs one cpus without this tuning and one with.

Harbormaster completed remote builds in B205108: Diff 485550.Dec 28 2022, 4:15 PM

RKSimon added inline comments.Dec 30 2022, 3:36 AM

llvm/lib/Target/X86/X86.td
1290 ↗	(On Diff #484962)	Would -mtune=x86-64-v3 not be better for those cases?

tschuett added a subscriber: tschuett.Dec 30 2022, 4:43 AM

TokarIP added inline comments.Jan 5 2023, 3:49 PM

llvm/lib/Target/X86/X86.td
1290 ↗	(On Diff #484962)	Not really, x86-64-v3 is basically haswell, and it seems that ryzen benefits more from ryzen tuning than skylake from haswell tuning.

RKSimon added inline comments.Jan 6 2023, 3:38 AM

llvm/lib/Target/X86/X86.td
1290 ↗	(On Diff #484962)	OK - if you want to include this then please can you ensure you add znver test coverage below
llvm/lib/Target/X86/X86Subtarget.h
259 ↗	(On Diff #483356)	;;
llvm/test/CodeGen/X86/vector-width-store-merge.ll
3 ↗	(On Diff #485550)	Clean up the prefixes - LIGHT256 might be a better prefix to use here? ; RUN: llc < %s -mtriple=x86_64-- -mcpu=skylake\| FileCheck %s --check-prefixes=CHECK,PREFER256 ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=sandybridge\| FileCheck %s --check-prefixes=CHECK,LIGHT256

TokarIP updated this revision to Diff 486976.Jan 6 2023, 1:17 PM

TokarIP marked 2 inline comments as done.

TokarIP added inline comments.

llvm/lib/Target/X86/X86.td
1290 ↗	(On Diff #484962)	Added znver case to memcpy-light-avx.ll

Harbormaster completed remote builds in B206190: Diff 486976.Jan 6 2023, 2:04 PM

we still need to improve the test coverage a little I think

llvm/test/CodeGen/X86/memcpy-light-avx.ll
5	; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=avx2,+prefer-128-bit,-allow-light-256-bit \| FileCheck %s
llvm/test/CodeGen/X86/vector-width-store-merge.ll
2 ↗	(On Diff #486976)	Please can you add ryzen coverage here as well: RUN: llc < %s -mtriple=x86_64-- -mcpu=znver1\| FileCheck %s --check-prefixes=CHECK,PREFER256

TokarIP updated this revision to Diff 487910.Jan 10 2023, 11:36 AM

TokarIP marked an inline comment as done.

TokarIP added inline comments.

llvm/test/CodeGen/X86/memcpy-light-avx.ll
5	Since this forces 128-bit, added with NO256 prefix: ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=avx2,+prefer-128-bit,-allow-light-256-bit \| FileCheck --check-prefixes=NO256

Harbormaster completed remote builds in B206871: Diff 487910.Jan 10 2023, 2:42 PM

LGTM - cheers

@pengfei Are you OK to accept this?

LGTM, thanks!

This revision is now accepted and ready to land.Jan 15 2023, 3:36 AM

This will help with memset as well. Thx @TokarIP!

This revision was landed with ongoing or failed builds.Jan 24 2023, 2:03 PM

Closed by commit rGd7043e8c41bb: [X86] Add support for "light" AVX (authored by TokarIP). · Explain Why

This revision was automatically updated to reflect the committed changes.

TokarIP added a commit: rGd7043e8c41bb: [X86] Add support for "light" AVX.

This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add support for "light" AVX
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 464345

llvm/lib/Target/X86/X86ISelLowering.cpp

llvm/test/CodeGen/X86/memcpy-light-avx.ll

This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add support for "light" AVXClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 464345

llvm/lib/Target/X86/X86ISelLowering.cpp

llvm/test/CodeGen/X86/memcpy-light-avx.ll

[X86] Add support for "light" AVX
ClosedPublic