This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add support for "light" AVX
ClosedPublic

Authored by TokarIP on Sep 30 2022, 12:00 PM.

Details

Summary

AVX/AVX512 instructions may cause frequency drop on e.g. Skylake.
The magnitude of frequency/performance drop depends on instruction
(multiplication vs load/store) and vector width. Currently users,
that want to avoid this drop can specify -mprefer-vector-width=128.
However this also prevents generations of 256-bit wide instructions,
that have no associated frequency drop (mainly load/stores).

Add a tuning flag that allows generations of 256-bit AVX load/stores,
even when -mprefer-vector-width=128 is set, to speed-up memcpy&co.
Verified that running memcpy loop on all cores has no frequency impact and
zero CORE_POWER:LVL[12]_TURBO_LICENSE perf counters.

Makes coping memory faster:
BM_memcpy_aligned/256 80.7GB/s ± 3% 96.3GB/s ± 9% +19.33% (p=0.000 n=9+9

Diff Detail

Event Timeline

TokarIP created this revision.Sep 30 2022, 12:00 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 30 2022, 12:00 PM
TokarIP requested review of this revision.Sep 30 2022, 12:00 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 30 2022, 12:00 PM
pengfei added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
2691

Here the check for 256 was intended from rG47272217 authored by @echristo.
It looks to me it is the only difference between prefer-128-bit and prefer-256-bit. So I don't understand why using -mattr=prefer-128-bit -x86-light-avx=true rather than prefer-256-bit.

I don’t think -mprefer-vector-width=128 has effect on most instructions. The 256 version was heavily integrated into the type legalization to split operations. I don’t think that was ever done for 128.

Please do you have any more statistics on what range of machine(s) and test cases you've tried this on (compared to -mattr=prefer-128-bit/256-bit)?

If we think we need this it would definitely be better as a Tuning feature bit (prefer-256-bit-memcpy)?

I don’t think -mprefer-vector-width=128 has effect on most instructions. The 256 version was heavily integrated into the type legalization to split operations. I don’t think that was ever done for 128.

-mprefer-vector-width affects vectorizer decision to choose VL. Here is motivating example: https://godbolt.org/z/j8hrP5jhb

Please do you have any more statistics on what range of machine(s) and test cases you've tried this on (compared to -mattr=prefer-128-bit/256-bit)?

Tested (128 + this vs plain 128) on AMD rome:
BM_Memcpy/0/0 [llvm_libc::memcpy,memcpy Google A ] 19.2GB/s ± 3% 21.6GB/s ± 8% +12.44% (p=0.000 n=19+20)
BM_Memcpy/1/0 [
llvm_libc::memcpy,memcpy Google B ] 9.48GB/s ±11% 9.70GB/s ±10% ~ (p=0.228 n=18+20)
BM_Memcpy/2/0 [llvm_libc::memcpy,memcpy Google D ] 33.0GB/s ± 2% 45.3GB/s ± 3% +37.08% (p=0.000 n=20+20)
BM_Memcpy/3/0 [
llvm_libc::memcpy,memcpy Google L ] 5.90GB/s ±17% 5.96GB/s ±19% ~ (p=0.835 n=19+20)
BM_Memcpy/4/0 [llvm_libc::memcpy,memcpy Google M ] 6.55GB/s ±14% 6.87GB/s ±11% ~ (p=0.056 n=20+20)
BM_Memcpy/5/0 [
llvm_libc::memcpy,memcpy Google Q ] 3.74GB/s ±18% 3.55GB/s ±17% ~ (p=0.081 n=20+20)
BM_Memcpy/6/0 [llvm_libc::memcpy,memcpy Google S ] 8.74GB/s ± 8% 9.16GB/s ± 7% +4.70% (p=0.002 n=18+20)
BM_Memcpy/7/0 [
llvm_libc::memcpy,memcpy Google U ] 9.79GB/s ±12% 10.38GB/s ±14% +6.01% (p=0.010 n=20+20)
BM_Memcpy/8/0 [llvm_libc::memcpy,memcpy Google W ] 6.91GB/s ± 9% 7.24GB/s ± 8% +4.75% (p=0.001 n=19+20)
BM_Memcpy/9/0 [
llvm_libc::memcpy,uniform 384 to 4096 ] 43.2GB/s ± 1% 65.2GB/s ± 1% +50.69% (p=0.000 n=20+19)

Intel Skylake (server)
BM_Memcpy/0/0 [llvm_libc::memcpy,memcpy Google A ] 18.1GB/s ± 9% 20.9GB/s ± 8% +15.58% (p=0.000 n=18+19)
BM_Memcpy/1/0 [
llvm_libc::memcpy,memcpy Google B ] 8.43GB/s ±14% 8.74GB/s ±18% ~ (p=0.175 n=19+20)
BM_Memcpy/2/0 [llvm_libc::memcpy,memcpy Google D ] 34.5GB/s ± 3% 49.2GB/s ± 5% +42.88% (p=0.000 n=17+18)
BM_Memcpy/3/0 [
llvm_libc::memcpy,memcpy Google L ] 5.51GB/s ±29% 5.72GB/s ±19% ~ (p=0.461 n=20+19)
BM_Memcpy/4/0 [llvm_libc::memcpy,memcpy Google M ] 5.57GB/s ±18% 5.72GB/s ±20% ~ (p=0.529 n=20+20)
BM_Memcpy/5/0 [
llvm_libc::memcpy,memcpy Google Q ] 2.97GB/s ±12% 3.15GB/s ±11% +6.08% (p=0.007 n=20+19)
BM_Memcpy/6/0 [llvm_libc::memcpy,memcpy Google S ] 7.88GB/s ±15% 8.41GB/s ± 6% +6.68% (p=0.000 n=18+17)
BM_Memcpy/7/0 [
llvm_libc::memcpy,memcpy Google U ] 8.65GB/s ±19% 9.65GB/s ±17% +11.62% (p=0.001 n=20+20)
BM_Memcpy/8/0 [llvm_libc::memcpy,memcpy Google W ] 6.17GB/s ±15% 6.41GB/s ±10% +3.75% (p=0.038 n=17+18)
BM_Memcpy/9/0 [
llvm_libc::memcpy,uniform 384 to 4096 ] 44.5GB/s ± 2% 70.0GB/s ± 9% +57.38% (p=0.000 n=16+17)

And Intel Haswell
BM_Memcpy/0/0 [llvm_libc::memcpy,memcpy Google A ] 19.6GB/s ± 7% 22.5GB/s ± 8% +15.08% (p=0.000 n=20+20)
BM_Memcpy/1/0 [
llvm_libc::memcpy,memcpy Google B ] 9.15GB/s ± 5% 9.16GB/s ±13% ~ (p=0.798 n=17+20)
BM_Memcpy/2/0 [llvm_libc::memcpy,memcpy Google D ] 37.4GB/s ± 6% 53.5GB/s ± 6% +42.95% (p=0.000 n=20+20)
BM_Memcpy/3/0 [
llvm_libc::memcpy,memcpy Google L ] 6.74GB/s ±17% 6.88GB/s ±17% ~ (p=0.461 n=20+19)
BM_Memcpy/4/0 [llvm_libc::memcpy,memcpy Google M ] 6.56GB/s ± 5% 6.85GB/s ±16% ~ (p=0.105 n=18+20)
BM_Memcpy/5/0 [
llvm_libc::memcpy,memcpy Google Q ] 3.82GB/s ±18% 3.68GB/s ±24% ~ (p=0.253 n=20+20)
BM_Memcpy/6/0 [llvm_libc::memcpy,memcpy Google S ] 8.75GB/s ± 9% 9.00GB/s ±14% ~ (p=0.211 n=20+20)
BM_Memcpy/7/0 [
llvm_libc::memcpy,memcpy Google U ] 10.2GB/s ±16% 10.6GB/s ±16% ~ (p=0.157 n=20+20)
BM_Memcpy/8/0 [llvm_libc::memcpy,memcpy Google W ] 7.30GB/s ± 8% 7.42GB/s ±11% ~ (p=0.301 n=20+20)
BM_Memcpy/9/0 [
llvm_libc::memcpy,uniform 384 to 4096 ] 47.9GB/s ± 3% 77.3GB/s ± 6% +61.61% (p=0.000 n=19+20)

Internal loadtests shows 0.1-0.2% win vs -mprefer-vector-width=128. -mprefer-vector-width=256 causes several % performance regressions vs both this and plain 128.

llvm/lib/Target/X86/X86ISelLowering.cpp
2691

I'm not sure I understand the question. Building everything with prefer-256-bit means getting e.g 256-bit FMA and corresponding frequency penalty. I want to get 256-bit loads/stores because they are free performance win, but not "heavy" instructions.

I don’t think -mprefer-vector-width=128 has effect on most instructions. The 256 version was heavily integrated into the type legalization to split operations. I don’t think that was ever done for 128.

-mprefer-vector-width affects vectorizer decision to choose VL. Here is motivating example: https://godbolt.org/z/j8hrP5jhb

Thansk. I had forgotten it was used in X86TTIImpl::getRegisterBitWidth().

pengfei added inline comments.Oct 10 2022, 11:48 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
2691

Oh, I thought it is the only place we compared PreferVectorWidth with 256. (There're two other places in X86TargetTransformInfo.cpp)
I mentioned @echristo's patch, because I guess make load/store the same size as PreferVectorWidth is beneficial in the most cases. And do we need to consider the intensity of "heavy" instructions? E.g., if we load 256-bit vector, shuffle to 2 128-bit to do a single FMA, and then shuffle back to 256-bit to do the store. Is it really better than 2 128-bit load/store that might be folded into the FMA instruction?

TokarIP added inline comments.Nov 23 2022, 2:35 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
2691

AFAIK this won't happen. We either vectorize everything with 256-bit, so no FMA, just data movement, or with 128-bit and have no 256-bit at all. Reading the code and playing with examples didn't produce any mixed cases.

pengfei accepted this revision.Nov 24 2022, 4:05 AM

LGTM.

This revision is now accepted and ready to land.Nov 24 2022, 4:05 AM
RKSimon requested changes to this revision.Nov 24 2022, 4:12 AM

Still not a tuning flag

This revision now requires changes to proceed.Nov 24 2022, 4:12 AM

Still not a tuning flag

Just to make sure I understand you correctly: you want tuning flag (set for skylake/zen/haswell) and no command line flags?
I'm worried that this will cause some confusion, e.g .user requests -mprefer-vector-width=128, but sees ymm (256-bit) generated and files a bug, etc.
It is also easier to evaluate changes that are enabled by individual flags.

Matt added a subscriber: Matt.Dec 7 2022, 6:27 PM
TokarIP updated this revision to Diff 483356.Dec 15 2022, 2:53 PM

Still not a tuning flag

Added version with a tuning flag

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

I'll probably expand this to other "light" AVX instructions (like vpcmpeq for memcmp intrinsic) in the future.
Also we don't want the full width, 512-bit load/stores still cause some frequency drop on skylake.

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

I'll probably expand this to other "light" AVX instructions (like vpcmpeq for memcmp intrinsic) in the future.
Also we don't want the full width, 512-bit load/stores still cause some frequency drop on skylake.

Please don't consider this a blocking feedback, but that sounds
a lot more problematic than just dealing with loads/stores.
But i'll defer to @craig.topper / @RKSimon.

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

I'll probably expand this to other "light" AVX instructions (like vpcmpeq for memcmp intrinsic) in the future.
Also we don't want the full width, 512-bit load/stores still cause some frequency drop on skylake.

Do we have a definitive list of what intel considers "light" 256-bit instructions?

llvm/test/CodeGen/X86/memcpy-light-avx.ll
2

Test with generic settings as well (e.g. -mattr=avx2,+prefer-128-bit,+allow-light-avx)

Sorry I missed your previous message - yes this is the kind of tuning attribute I had in mind instead of a generic flag, although I think it needs to be tightened to be more specific (256-bit ops).

llvm/lib/Target/X86/X86.td
619

Maybe rename this "allow-light-256-bit"?

llvm/lib/Target/X86/X86Subtarget.h
259

Depending on how prevalent this will be used, we might adjust it so people don't need to remember to check Subtarget.getPreferVectorWidth() >= 256 as well:

bool useLightAVX256Instructions() const {
  return Subtarget.getPreferVectorWidth() >= 256 || AllowLightAVX;
}

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

I'll probably expand this to other "light" AVX instructions (like vpcmpeq for memcmp intrinsic) in the future.
Also we don't want the full width, 512-bit load/stores still cause some frequency drop on skylake.

Do we have a definitive list of what intel considers "light" 256-bit instructions?

I had read it from here https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/

Light instructions include integer operations other than multiplication, logical operations, data shuffling (such as vpermw and vpermd) and so forth. Heavy instructions are common in deep learning, numerical analysis, high performance computing, and some cryptography (i.e., multiplication-based hashing). Light instructions tend to dominate in text processing, fast compression routines, vectorized implementations of library routines such as memcpy in C or System.arrayCopy in Java, and so forth.
TokarIP updated this revision to Diff 484641.Dec 21 2022, 11:48 AM
TokarIP marked 3 inline comments as done.

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

I'll probably expand this to other "light" AVX instructions (like vpcmpeq for memcmp intrinsic) in the future.
Also we don't want the full width, 512-bit load/stores still cause some frequency drop on skylake.

Do we have a definitive list of what intel considers "light" 256-bit instructions?

Official? - don't think so. I remember seeing some unofficial lists, but can't find them right now. Considering new cpus released since 2016, retesting would make sense anyway.

llvm/lib/Target/X86/X86.td
619

Done for the feature name, should I also rename other mentions of light avx?

RKSimon added inline comments.Dec 22 2022, 2:54 AM
llvm/lib/Target/X86/X86.td
619

Yes please maintain a consistent naming if you can - AVX is redundant imo and will confuse things if we ever try to do this for 512-bit vectors as well.

TokarIP updated this revision to Diff 484962.Dec 22 2022, 2:19 PM
TokarIP edited the summary of this revision. (Show Details)
TokarIP added inline comments.
llvm/lib/Target/X86/X86.td
619

Done.

@pengfei Please can you confirm that the Intel models are suitable for the TuningAllowLight256Bit flag?

llvm/lib/Target/X86/X86.td
1290

I'm not certain Ryzen needs this - even on znver1 with double pumping of 256-bit ops.

pengfei requested changes to this revision.Dec 24 2022, 4:42 AM

@pengfei Please can you confirm that the Intel models are suitable for the TuningAllowLight256Bit flag?

I don't have such targets at hand. I think it should be good in theory, so we can land it first.

llvm/test/CodeGen/X86/vector-width-store-merge.ll
70

This patch changes the behavior the test expected, though it should no correctness issue for 256-bits.
We should update the test to show it rather than hide it.
Note, it will have correctness issue or build error if force to generate 512-bits instructions.

This revision now requires changes to proceed.Dec 24 2022, 4:42 AM
TokarIP added inline comments.Dec 27 2022, 1:18 PM
llvm/lib/Target/X86/X86.td
1290

I'm not sure I understand this comment. You mean since Ryzen doesn't have any frequency problems, so we don't care about prefer-vector-width=128 behavior? This is mostly here for a) completeness (since 256-ops don't seem to hurt on ryzen we do prefer 256 bit loads/stores) and b) for cases where users want znver tuning but still prefer good performance on intel sop they pass prefer-vector-width=128

llvm/test/CodeGen/X86/vector-width-store-merge.ll
70

We want to test 2 behaviors:
1)prefer-vector-width=128 and no TuningAllowLight256Bit should generate 128-bit load/store - this test
2)prefer-vector-width=128 and TuningAllowLight256Bit should generate 256-bit - memcpy-light-avx.ll

Updating this test to check 256 case, means that we still need extra test for behavior #1, I'd rather keep the number of tests smaller with the same coverage.

pengfei added inline comments.Dec 27 2022, 5:19 PM
llvm/test/CodeGen/X86/memcpy-light-avx.ll
1

Why don't use update_llc_test_checks.py to generate the test?

llvm/test/CodeGen/X86/vector-width-store-merge.ll
70

You can add another RUN to test the behaviors in llvm/test/CodeGen/X86/memcpy-light-avx.ll if you like, e.g.,
; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=haswell | FileCheck %s --check-prefix=NO-256

We don't have a method to disable it on new targets, so no 2 behaviors here.

lebedev.ri added inline comments.Dec 27 2022, 5:27 PM
llvm/lib/Target/X86/X86.td
1290

I agree with @RKSimon here. I'm not really sure why anyone would want to
use non-full vector width on Ryzens, so i don't think we support it there.

TokarIP updated this revision to Diff 485550.Dec 28 2022, 3:35 PM
TokarIP marked an inline comment as done.
TokarIP edited the summary of this revision. (Show Details)
TokarIP added inline comments.
llvm/lib/Target/X86/X86.td
1290

FWIW mtune=znver3 + mprefer-vector-width=128 often gives best results for a mixed (skylake+rome) server fleet.

llvm/test/CodeGen/X86/vector-width-store-merge.ll
70

Thanks for the suggestion! Now we have 2 runs one cpus without this tuning and one with.

RKSimon added inline comments.Dec 30 2022, 3:36 AM
llvm/lib/Target/X86/X86.td
1290

Would -mtune=x86-64-v3 not be better for those cases?

TokarIP added inline comments.Jan 5 2023, 3:49 PM
llvm/lib/Target/X86/X86.td
1290

Not really, x86-64-v3 is basically haswell, and it seems that ryzen benefits more from ryzen tuning than skylake from haswell tuning.

RKSimon added inline comments.Jan 6 2023, 3:38 AM
llvm/lib/Target/X86/X86.td
1290

OK - if you want to include this then please can you ensure you add znver test coverage below

llvm/lib/Target/X86/X86Subtarget.h
259

;;

llvm/test/CodeGen/X86/vector-width-store-merge.ll
3

Clean up the prefixes - LIGHT256 might be a better prefix to use here?

; RUN: llc < %s -mtriple=x86_64-- -mcpu=skylake| FileCheck %s --check-prefixes=CHECK,PREFER256
; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=sandybridge| FileCheck %s --check-prefixes=CHECK,LIGHT256
TokarIP updated this revision to Diff 486976.Jan 6 2023, 1:17 PM
TokarIP marked 2 inline comments as done.
TokarIP added inline comments.
llvm/lib/Target/X86/X86.td
1290

Added znver case to memcpy-light-avx.ll

we still need to improve the test coverage a little I think

llvm/test/CodeGen/X86/memcpy-light-avx.ll
5

; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=avx2,+prefer-128-bit,-allow-light-256-bit | FileCheck %s

llvm/test/CodeGen/X86/vector-width-store-merge.ll
2

Please can you add ryzen coverage here as well:

RUN: llc < %s -mtriple=x86_64-- -mcpu=znver1| FileCheck %s --check-prefixes=CHECK,PREFER256

TokarIP updated this revision to Diff 487910.Jan 10 2023, 11:36 AM
TokarIP marked an inline comment as done.
TokarIP added inline comments.
llvm/test/CodeGen/X86/memcpy-light-avx.ll
5

Since this forces 128-bit, added with NO256 prefix:
; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=avx2,+prefer-128-bit,-allow-light-256-bit | FileCheck --check-prefixes=NO256

RKSimon accepted this revision.Jan 14 2023, 1:49 PM

LGTM - cheers

@pengfei Are you OK to accept this?

pengfei accepted this revision.Jan 15 2023, 3:36 AM

LGTM, thanks!

This revision is now accepted and ready to land.Jan 15 2023, 3:36 AM

This will help with memset as well. Thx @TokarIP!

This revision was landed with ongoing or failed builds.Jan 24 2023, 2:03 PM
This revision was automatically updated to reflect the committed changes.