This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Target/X86/
-
Target/
-
X86/
7/12
X86.td
2/4
X86ISelLowering.cpp
2/2
X86Subtarget.h
-
X86TargetTransformInfo.h
-
test/CodeGen/X86/
-
CodeGen/
-
X86/
3/4
memcpy-light-avx.ll
4/6
vector-width-store-merge.ll

Differential D134982

[X86] Add support for "light" AVX
ClosedPublic

Authored by TokarIP on Sep 30 2022, 12:00 PM.

Download Raw Diff

Details

Reviewers

craig.topper
pengfei
RKSimon
LuoYuanke

Commits

rGd7043e8c41bb: [X86] Add support for "light" AVX

Summary

AVX/AVX512 instructions may cause frequency drop on e.g. Skylake.
The magnitude of frequency/performance drop depends on instruction
(multiplication vs load/store) and vector width. Currently users,
that want to avoid this drop can specify -mprefer-vector-width=128.
However this also prevents generations of 256-bit wide instructions,
that have no associated frequency drop (mainly load/stores).

Add a tuning flag that allows generations of 256-bit AVX load/stores,
even when -mprefer-vector-width=128 is set, to speed-up memcpy&co.
Verified that running memcpy loop on all cores has no frequency impact and
zero CORE_POWER:LVL[12]_TURBO_LICENSE perf counters.

Makes coping memory faster:
BM_memcpy_aligned/256 80.7GB/s ± 3% 96.3GB/s ± 9% +19.33% (p=0.000 n=9+9

Diff Detail

Event Timeline

TokarIP created this revision.Sep 30 2022, 12:00 PM

Herald added subscribers: StephenFan, pengfei, hiraditya. · View Herald TranscriptSep 30 2022, 12:00 PM

Herald added a project: Restricted Project. · View Herald TranscriptSep 30 2022, 12:00 PM

TokarIP requested review of this revision.Sep 30 2022, 12:00 PM

Herald added a project: Restricted Project. · View Herald TranscriptSep 30 2022, 12:00 PM

Herald added a subscriber: llvm-commits. · View Herald Transcript

Harbormaster completed remote builds in B189745: Diff 464345.Sep 30 2022, 1:11 PM

pengfei added a subscriber: echristo.Sep 30 2022, 8:46 PM

pengfei added inline comments.

llvm/lib/Target/X86/X86ISelLowering.cpp
2691	Here the check for 256 was intended from rG47272217 authored by @echristo. It looks to me it is the only difference between `prefer-128-bit` and `prefer-256-bit`. So I don't understand why using `-mattr=prefer-128-bit -x86-light-avx=true` rather than `prefer-256-bit`.

I don’t think -mprefer-vector-width=128 has effect on most instructions. The 256 version was heavily integrated into the type legalization to split operations. I don’t think that was ever done for 128.

craig.topper added reviewers: pengfei, RKSimon, LuoYuanke.Sep 30 2022, 11:26 PM

Please do you have any more statistics on what range of machine(s) and test cases you've tried this on (compared to -mattr=prefer-128-bit/256-bit)?

If we think we need this it would definitely be better as a Tuning feature bit (prefer-256-bit-memcpy)?

In D134982#3828850, @craig.topper wrote:

I don’t think -mprefer-vector-width=128 has effect on most instructions. The 256 version was heavily integrated into the type legalization to split operations. I don’t think that was ever done for 128.

-mprefer-vector-width affects vectorizer decision to choose VL. Here is motivating example: https://godbolt.org/z/j8hrP5jhb

In D134982#3829073, @RKSimon wrote:

Please do you have any more statistics on what range of machine(s) and test cases you've tried this on (compared to -mattr=prefer-128-bit/256-bit)?

Tested (128 + this vs plain 128) on AMD rome:
BM_Memcpy/0/0 [llvm_libc::memcpy,memcpy Google A ] 19.2GB/s ± 3% 21.6GB/s ± 8% +12.44% (p=0.000 n=19+20)
BM_Memcpy/1/0 [llvm_libc::memcpy,memcpy Google B ] 9.48GB/s ±11% 9.70GB/s ±10% ~ (p=0.228 n=18+20)
BM_Memcpy/2/0 [llvm_libc::memcpy,memcpy Google D ] 33.0GB/s ± 2% 45.3GB/s ± 3% +37.08% (p=0.000 n=20+20)
BM_Memcpy/3/0 [llvm_libc::memcpy,memcpy Google L ] 5.90GB/s ±17% 5.96GB/s ±19% ~ (p=0.835 n=19+20)
BM_Memcpy/4/0 [llvm_libc::memcpy,memcpy Google M ] 6.55GB/s ±14% 6.87GB/s ±11% ~ (p=0.056 n=20+20)
BM_Memcpy/5/0 [llvm_libc::memcpy,memcpy Google Q ] 3.74GB/s ±18% 3.55GB/s ±17% ~ (p=0.081 n=20+20)
BM_Memcpy/6/0 [llvm_libc::memcpy,memcpy Google S ] 8.74GB/s ± 8% 9.16GB/s ± 7% +4.70% (p=0.002 n=18+20)
BM_Memcpy/7/0 [llvm_libc::memcpy,memcpy Google U ] 9.79GB/s ±12% 10.38GB/s ±14% +6.01% (p=0.010 n=20+20)
BM_Memcpy/8/0 [llvm_libc::memcpy,memcpy Google W ] 6.91GB/s ± 9% 7.24GB/s ± 8% +4.75% (p=0.001 n=19+20)
BM_Memcpy/9/0 [llvm_libc::memcpy,uniform 384 to 4096 ] 43.2GB/s ± 1% 65.2GB/s ± 1% +50.69% (p=0.000 n=20+19)

Intel Skylake (server)
BM_Memcpy/0/0 [llvm_libc::memcpy,memcpy Google A ] 18.1GB/s ± 9% 20.9GB/s ± 8% +15.58% (p=0.000 n=18+19)
BM_Memcpy/1/0 [llvm_libc::memcpy,memcpy Google B ] 8.43GB/s ±14% 8.74GB/s ±18% ~ (p=0.175 n=19+20)
BM_Memcpy/2/0 [llvm_libc::memcpy,memcpy Google D ] 34.5GB/s ± 3% 49.2GB/s ± 5% +42.88% (p=0.000 n=17+18)
BM_Memcpy/3/0 [llvm_libc::memcpy,memcpy Google L ] 5.51GB/s ±29% 5.72GB/s ±19% ~ (p=0.461 n=20+19)
BM_Memcpy/4/0 [llvm_libc::memcpy,memcpy Google M ] 5.57GB/s ±18% 5.72GB/s ±20% ~ (p=0.529 n=20+20)
BM_Memcpy/5/0 [llvm_libc::memcpy,memcpy Google Q ] 2.97GB/s ±12% 3.15GB/s ±11% +6.08% (p=0.007 n=20+19)
BM_Memcpy/6/0 [llvm_libc::memcpy,memcpy Google S ] 7.88GB/s ±15% 8.41GB/s ± 6% +6.68% (p=0.000 n=18+17)
BM_Memcpy/7/0 [llvm_libc::memcpy,memcpy Google U ] 8.65GB/s ±19% 9.65GB/s ±17% +11.62% (p=0.001 n=20+20)
BM_Memcpy/8/0 [llvm_libc::memcpy,memcpy Google W ] 6.17GB/s ±15% 6.41GB/s ±10% +3.75% (p=0.038 n=17+18)
BM_Memcpy/9/0 [llvm_libc::memcpy,uniform 384 to 4096 ] 44.5GB/s ± 2% 70.0GB/s ± 9% +57.38% (p=0.000 n=16+17)

And Intel Haswell
BM_Memcpy/0/0 [llvm_libc::memcpy,memcpy Google A ] 19.6GB/s ± 7% 22.5GB/s ± 8% +15.08% (p=0.000 n=20+20)
BM_Memcpy/1/0 [llvm_libc::memcpy,memcpy Google B ] 9.15GB/s ± 5% 9.16GB/s ±13% ~ (p=0.798 n=17+20)
BM_Memcpy/2/0 [llvm_libc::memcpy,memcpy Google D ] 37.4GB/s ± 6% 53.5GB/s ± 6% +42.95% (p=0.000 n=20+20)
BM_Memcpy/3/0 [llvm_libc::memcpy,memcpy Google L ] 6.74GB/s ±17% 6.88GB/s ±17% ~ (p=0.461 n=20+19)
BM_Memcpy/4/0 [llvm_libc::memcpy,memcpy Google M ] 6.56GB/s ± 5% 6.85GB/s ±16% ~ (p=0.105 n=18+20)
BM_Memcpy/5/0 [llvm_libc::memcpy,memcpy Google Q ] 3.82GB/s ±18% 3.68GB/s ±24% ~ (p=0.253 n=20+20)
BM_Memcpy/6/0 [llvm_libc::memcpy,memcpy Google S ] 8.75GB/s ± 9% 9.00GB/s ±14% ~ (p=0.211 n=20+20)
BM_Memcpy/7/0 [llvm_libc::memcpy,memcpy Google U ] 10.2GB/s ±16% 10.6GB/s ±16% ~ (p=0.157 n=20+20)
BM_Memcpy/8/0 [llvm_libc::memcpy,memcpy Google W ] 7.30GB/s ± 8% 7.42GB/s ±11% ~ (p=0.301 n=20+20)
BM_Memcpy/9/0 [llvm_libc::memcpy,uniform 384 to 4096 ] 47.9GB/s ± 3% 77.3GB/s ± 6% +61.61% (p=0.000 n=19+20)

Internal loadtests shows 0.1-0.2% win vs -mprefer-vector-width=128. -mprefer-vector-width=256 causes several % performance regressions vs both this and plain 128.

llvm/lib/Target/X86/X86ISelLowering.cpp
2691	I'm not sure I understand the question. Building everything with prefer-256-bit means getting e.g 256-bit FMA and corresponding frequency penalty. I want to get 256-bit loads/stores because they are free performance win, but not "heavy" instructions.

In D134982#3848311, @TokarIP wrote:

In D134982#3828850, @craig.topper wrote:

I don’t think -mprefer-vector-width=128 has effect on most instructions. The 256 version was heavily integrated into the type legalization to split operations. I don’t think that was ever done for 128.

-mprefer-vector-width affects vectorizer decision to choose VL. Here is motivating example: https://godbolt.org/z/j8hrP5jhb

Thansk. I had forgotten it was used in X86TTIImpl::getRegisterBitWidth().

pengfei added inline comments.Oct 10 2022, 11:48 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
2691	Oh, I thought it is the only place we compared PreferVectorWidth with 256. (There're two other places in X86TargetTransformInfo.cpp) I mentioned @echristo's patch, because I guess make load/store the same size as PreferVectorWidth is beneficial in the most cases. And do we need to consider the intensity of "heavy" instructions? E.g., if we load 256-bit vector, shuffle to 2 128-bit to do a single FMA, and then shuffle back to 256-bit to do the store. Is it really better than 2 128-bit load/store that might be folded into the FMA instruction?

TokarIP added inline comments.Nov 23 2022, 2:35 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
2691	AFAIK this won't happen. We either vectorize everything with 256-bit, so no FMA, just data movement, or with 128-bit and have no 256-bit at all. Reading the code and playing with examples didn't produce any mixed cases.

LGTM.

This revision is now accepted and ready to land.Nov 24 2022, 4:05 AM

Still not a tuning flag

This revision now requires changes to proceed.Nov 24 2022, 4:12 AM

In D134982#3948950, @RKSimon wrote:

Still not a tuning flag

Just to make sure I understand you correctly: you want tuning flag (set for skylake/zen/haswell) and no command line flags?
I'm worried that this will cause some confusion, e.g .user requests -mprefer-vector-width=128, but sees ymm (256-bit) generated and files a bug, etc.
It is also easier to evaluate changes that are enabled by individual flags.

Matt added a subscriber: Matt.Dec 7 2022, 6:27 PM

In D134982#3948950, @RKSimon wrote:

Still not a tuning flag

Added version with a tuning flag

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

In D134982#3999680, @lebedev.ri wrote:

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

I'll probably expand this to other "light" AVX instructions (like vpcmpeq for memcmp intrinsic) in the future.
Also we don't want the full width, 512-bit load/stores still cause some frequency drop on skylake.

In D134982#3999687, @TokarIP wrote:

In D134982#3999680, @lebedev.ri wrote:

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

I'll probably expand this to other "light" AVX instructions (like vpcmpeq for memcmp intrinsic) in the future.
Also we don't want the full width, 512-bit load/stores still cause some frequency drop on skylake.

Please don't consider this a blocking feedback, but that sounds
a lot more problematic than just dealing with loads/stores.
But i'll defer to @craig.topper / @RKSimon.

Harbormaster completed remote builds in B203485: Diff 483356.Dec 15 2022, 7:39 PM

In D134982#3999687, @TokarIP wrote:

In D134982#3999680, @lebedev.ri wrote:

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

I'll probably expand this to other "light" AVX instructions (like vpcmpeq for memcmp intrinsic) in the future.
Also we don't want the full width, 512-bit load/stores still cause some frequency drop on skylake.

Do we have a definitive list of what intel considers "light" 256-bit instructions?

llvm/test/CodeGen/X86/memcpy-light-avx.ll
2	Test with generic settings as well (e.g. -mattr=avx2,+prefer-128-bit,+allow-light-avx)

Sorry I missed your previous message - yes this is the kind of tuning attribute I had in mind instead of a generic flag, although I think it needs to be tightened to be more specific (256-bit ops).

llvm/lib/Target/X86/X86.td
619	Maybe rename this "allow-light-256-bit"?
llvm/lib/Target/X86/X86Subtarget.h
259	Depending on how prevalent this will be used, we might adjust it so people don't need to remember to check Subtarget.getPreferVectorWidth() >= 256 as well: bool useLightAVX256Instructions() const { return Subtarget.getPreferVectorWidth() >= 256 \|\| AllowLightAVX; }

In D134982#4000705, @RKSimon wrote:

In D134982#3999687, @TokarIP wrote:

In D134982#3999680, @lebedev.ri wrote:

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

I'll probably expand this to other "light" AVX instructions (like vpcmpeq for memcmp intrinsic) in the future.
Also we don't want the full width, 512-bit load/stores still cause some frequency drop on skylake.

Do we have a definitive list of what intel considers "light" 256-bit instructions?

I had read it from here https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/

Light instructions include integer operations other than multiplication, logical operations, data shuffling (such as vpermw and vpermd) and so forth. Heavy instructions are common in deep learning, numerical analysis, high performance computing, and some cryptography (i.e., multiplication-based hashing). Light instructions tend to dominate in text processing, fast compression routines, vectorized implementations of library routines such as memcpy in C or System.arrayCopy in Java, and so forth.

In D134982#4000705, @RKSimon wrote:

In D134982#3999687, @TokarIP wrote:

In D134982#3999680, @lebedev.ri wrote:

I think LightAVX is a misnomer. If we want to
always utilize full potential of vector load-store unit,
then the Tuning should say as much.

I'll probably expand this to other "light" AVX instructions (like vpcmpeq for memcmp intrinsic) in the future.
Also we don't want the full width, 512-bit load/stores still cause some frequency drop on skylake.

Do we have a definitive list of what intel considers "light" 256-bit instructions?

Official? - don't think so. I remember seeing some unofficial lists, but can't find them right now. Considering new cpus released since 2016, retesting would make sense anyway.

llvm/lib/Target/X86/X86.td
619	Done for the feature name, should I also rename other mentions of light avx?

Harbormaster completed remote builds in B204429: Diff 484641.Dec 21 2022, 12:29 PM

RKSimon added inline comments.Dec 22 2022, 2:54 AM

llvm/lib/Target/X86/X86.td
619	Yes please maintain a consistent naming if you can - AVX is redundant imo and will confuse things if we ever try to do this for 512-bit vectors as well.

TokarIP updated this revision to Diff 484962.Dec 22 2022, 2:19 PM

TokarIP edited the summary of this revision. (Show Details)

TokarIP added inline comments.

llvm/lib/Target/X86/X86.td
619	Done.

Harbormaster completed remote builds in B204674: Diff 484962.Dec 22 2022, 3:56 PM

@pengfei Please can you confirm that the Intel models are suitable for the TuningAllowLight256Bit flag?

llvm/lib/Target/X86/X86.td
1290	I'm not certain Ryzen needs this - even on znver1 with double pumping of 256-bit ops.

In D134982#4015958, @RKSimon wrote:

@pengfei Please can you confirm that the Intel models are suitable for the TuningAllowLight256Bit flag?

I don't have such targets at hand. I think it should be good in theory, so we can land it first.

llvm/test/CodeGen/X86/vector-width-store-merge.ll
70	This patch changes the behavior the test expected, though it should no correctness issue for 256-bits. We should update the test to show it rather than hide it. Note, it will have correctness issue or build error if force to generate 512-bits instructions.

This revision now requires changes to proceed.Dec 24 2022, 4:42 AM

TokarIP added inline comments.Dec 27 2022, 1:18 PM

llvm/lib/Target/X86/X86.td
1290	I'm not sure I understand this comment. You mean since Ryzen doesn't have any frequency problems, so we don't care about prefer-vector-width=128 behavior? This is mostly here for a) completeness (since 256-ops don't seem to hurt on ryzen we do prefer 256 bit loads/stores) and b) for cases where users want znver tuning but still prefer good performance on intel sop they pass prefer-vector-width=128
llvm/test/CodeGen/X86/vector-width-store-merge.ll
70	We want to test 2 behaviors: 1)prefer-vector-width=128 and no TuningAllowLight256Bit should generate 128-bit load/store - this test 2)prefer-vector-width=128 and TuningAllowLight256Bit should generate 256-bit - memcpy-light-avx.ll Updating this test to check 256 case, means that we still need extra test for behavior #1, I'd rather keep the number of tests smaller with the same coverage.

pengfei added inline comments.Dec 27 2022, 5:19 PM

llvm/test/CodeGen/X86/memcpy-light-avx.ll
2	Why don't use `update_llc_test_checks.py` to generate the test?
llvm/test/CodeGen/X86/vector-width-store-merge.ll
70	You can add another RUN to test the behaviors in `llvm/test/CodeGen/X86/memcpy-light-avx.ll` if you like, e.g., `; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=haswell \| FileCheck %s --check-prefix=NO-256` We don't have a method to disable it on new targets, so no 2 behaviors here.

lebedev.ri added inline comments.Dec 27 2022, 5:27 PM

llvm/lib/Target/X86/X86.td
1290	I agree with @RKSimon here. I'm not really sure why anyone would want to use non-full vector width on Ryzens, so i don't think we support it there.

TokarIP updated this revision to Diff 485550.Dec 28 2022, 3:35 PM

TokarIP marked an inline comment as done.

TokarIP edited the summary of this revision. (Show Details)

TokarIP added inline comments.

llvm/lib/Target/X86/X86.td
1290	FWIW mtune=znver3 + mprefer-vector-width=128 often gives best results for a mixed (skylake+rome) server fleet.
llvm/test/CodeGen/X86/vector-width-store-merge.ll
70	Thanks for the suggestion! Now we have 2 runs one cpus without this tuning and one with.

Harbormaster completed remote builds in B205108: Diff 485550.Dec 28 2022, 4:15 PM

RKSimon added inline comments.Dec 30 2022, 3:36 AM

llvm/lib/Target/X86/X86.td
1290	Would -mtune=x86-64-v3 not be better for those cases?

tschuett added a subscriber: tschuett.Dec 30 2022, 4:43 AM

TokarIP added inline comments.Jan 5 2023, 3:49 PM

llvm/lib/Target/X86/X86.td
1290	Not really, x86-64-v3 is basically haswell, and it seems that ryzen benefits more from ryzen tuning than skylake from haswell tuning.

RKSimon added inline comments.Jan 6 2023, 3:38 AM

llvm/lib/Target/X86/X86.td
1290	OK - if you want to include this then please can you ensure you add znver test coverage below
llvm/lib/Target/X86/X86Subtarget.h
259	;;
llvm/test/CodeGen/X86/vector-width-store-merge.ll
3	Clean up the prefixes - LIGHT256 might be a better prefix to use here? ; RUN: llc < %s -mtriple=x86_64-- -mcpu=skylake\| FileCheck %s --check-prefixes=CHECK,PREFER256 ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=sandybridge\| FileCheck %s --check-prefixes=CHECK,LIGHT256

TokarIP updated this revision to Diff 486976.Jan 6 2023, 1:17 PM

TokarIP marked 2 inline comments as done.

TokarIP added inline comments.

llvm/lib/Target/X86/X86.td
1290	Added znver case to memcpy-light-avx.ll

Harbormaster completed remote builds in B206190: Diff 486976.Jan 6 2023, 2:04 PM

we still need to improve the test coverage a little I think

llvm/test/CodeGen/X86/memcpy-light-avx.ll
5	; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=avx2,+prefer-128-bit,-allow-light-256-bit \| FileCheck %s
llvm/test/CodeGen/X86/vector-width-store-merge.ll
2	Please can you add ryzen coverage here as well: RUN: llc < %s -mtriple=x86_64-- -mcpu=znver1\| FileCheck %s --check-prefixes=CHECK,PREFER256

TokarIP updated this revision to Diff 487910.Jan 10 2023, 11:36 AM

TokarIP marked an inline comment as done.

TokarIP added inline comments.

llvm/test/CodeGen/X86/memcpy-light-avx.ll
5	Since this forces 128-bit, added with NO256 prefix: ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=avx2,+prefer-128-bit,-allow-light-256-bit \| FileCheck --check-prefixes=NO256

Harbormaster completed remote builds in B206871: Diff 487910.Jan 10 2023, 2:42 PM

LGTM - cheers

@pengfei Are you OK to accept this?

LGTM, thanks!

This revision is now accepted and ready to land.Jan 15 2023, 3:36 AM

This will help with memset as well. Thx @TokarIP!

This revision was landed with ongoing or failed builds.Jan 24 2023, 2:03 PM

Closed by commit rGd7043e8c41bb: [X86] Add support for "light" AVX (authored by TokarIP). · Explain Why

This revision was automatically updated to reflect the committed changes.

TokarIP added a commit: rGd7043e8c41bb: [X86] Add support for "light" AVX.

Revision Contents

Path

Size

llvm/

lib/

Target/

X86/

X86.td

22 lines

X86ISelLowering.cpp

2 lines

X86Subtarget.h

4 lines

X86TargetTransformInfo.h

1 line

test/

CodeGen/

X86/

memcpy-light-avx.ll

15 lines

vector-width-store-merge.ll

2 lines

Diff 484641

llvm/lib/Target/X86/X86.td

Show First 20 Lines • Show All 609 Lines • ▼ Show 20 Lines
def TuningPrefer128Bit		def TuningPrefer128Bit
: SubtargetFeature<"prefer-128-bit", "Prefer128Bit", "true",		: SubtargetFeature<"prefer-128-bit", "Prefer128Bit", "true",
"Prefer 128-bit AVX instructions">;		"Prefer 128-bit AVX instructions">;

def TuningPrefer256Bit		def TuningPrefer256Bit
: SubtargetFeature<"prefer-256-bit", "Prefer256Bit", "true",		: SubtargetFeature<"prefer-256-bit", "Prefer256Bit", "true",
"Prefer 256-bit AVX instructions">;		"Prefer 256-bit AVX instructions">;

		def TuningAllowLightAVX
		: SubtargetFeature<"allow-light-256-bit", "AllowLightAVX", "true",
		RKSimonUnsubmitted Done Reply Inline Actions Maybe rename this "allow-light-256-bit"? RKSimon: Maybe rename this "allow-light-256-bit"?
		TokarIPAuthorUnsubmitted Done Reply Inline Actions Done for the feature name, should I also rename other mentions of light avx? TokarIP: Done for the feature name, should I also rename other mentions of light avx?
		RKSimonUnsubmitted Not Done Reply Inline Actions Yes please maintain a consistent naming if you can - AVX is redundant imo and will confuse things if we ever try to do this for 512-bit vectors as well. RKSimon: Yes please maintain a consistent naming if you can - AVX is redundant imo and will confuse…
		TokarIPAuthorUnsubmitted Done Reply Inline Actions Done. TokarIP: Done.
		"Enable generation of 256 AVX load/stores even if we prefer 128-bit">;

def TuningPreferMaskRegisters		def TuningPreferMaskRegisters
: SubtargetFeature<"prefer-mask-registers", "PreferMaskRegisters", "true",		: SubtargetFeature<"prefer-mask-registers", "PreferMaskRegisters", "true",
"Prefer AVX512 mask registers over PTEST/MOVMSK">;		"Prefer AVX512 mask registers over PTEST/MOVMSK">;

def TuningFastBEXTR : SubtargetFeature<"fast-bextr", "HasFastBEXTR", "true",		def TuningFastBEXTR : SubtargetFeature<"fast-bextr", "HasFastBEXTR", "true",
"Indicates that the BEXTR instruction is implemented as a single uop "		"Indicates that the BEXTR instruction is implemented as a single uop "
"with good throughput">;		"with good throughput">;

▲ Show 20 Lines • Show All 146 Lines • ▼ Show 20 Lines	list<SubtargetFeature> HSWTuning = [TuningMacroFusion,
TuningSlowDivide64,		TuningSlowDivide64,
TuningFastScalarFSQRT,		TuningFastScalarFSQRT,
TuningFastSHLDRotate,		TuningFastSHLDRotate,
TuningFast15ByteNOP,		TuningFast15ByteNOP,
TuningFastVariableCrossLaneShuffle,		TuningFastVariableCrossLaneShuffle,
TuningFastVariablePerLaneShuffle,		TuningFastVariablePerLaneShuffle,
TuningPOPCNTFalseDeps,		TuningPOPCNTFalseDeps,
TuningLZCNTFalseDeps,		TuningLZCNTFalseDeps,
TuningInsertVZEROUPPER];		TuningInsertVZEROUPPER,
		TuningAllowLightAVX];
list<SubtargetFeature> HSWFeatures =		list<SubtargetFeature> HSWFeatures =
!listconcat(IVBFeatures, HSWAdditionalFeatures);		!listconcat(IVBFeatures, HSWAdditionalFeatures);

// Broadwell		// Broadwell
list<SubtargetFeature> BDWAdditionalFeatures = [FeatureADX,		list<SubtargetFeature> BDWAdditionalFeatures = [FeatureADX,
FeatureRDSEED,		FeatureRDSEED,
FeaturePRFCHW];		FeaturePRFCHW];
list<SubtargetFeature> BDWTuning = HSWTuning;		list<SubtargetFeature> BDWTuning = HSWTuning;
Show All 11 Lines	list<SubtargetFeature> SKLTuning = [TuningFastGather,
TuningSlowDivide64,		TuningSlowDivide64,
TuningFastScalarFSQRT,		TuningFastScalarFSQRT,
TuningFastVectorFSQRT,		TuningFastVectorFSQRT,
TuningFastSHLDRotate,		TuningFastSHLDRotate,
TuningFast15ByteNOP,		TuningFast15ByteNOP,
TuningFastVariableCrossLaneShuffle,		TuningFastVariableCrossLaneShuffle,
TuningFastVariablePerLaneShuffle,		TuningFastVariablePerLaneShuffle,
TuningPOPCNTFalseDeps,		TuningPOPCNTFalseDeps,
TuningInsertVZEROUPPER];		TuningInsertVZEROUPPER,
		TuningAllowLightAVX];
list<SubtargetFeature> SKLFeatures =		list<SubtargetFeature> SKLFeatures =
!listconcat(BDWFeatures, SKLAdditionalFeatures);		!listconcat(BDWFeatures, SKLAdditionalFeatures);

// Skylake-AVX512		// Skylake-AVX512
list<SubtargetFeature> SKXAdditionalFeatures = [FeatureAES,		list<SubtargetFeature> SKXAdditionalFeatures = [FeatureAES,
FeatureXSAVEC,		FeatureXSAVEC,
FeatureXSAVES,		FeatureXSAVES,
FeatureCLFLUSHOPT,		FeatureCLFLUSHOPT,
Show All 11 Lines	list<SubtargetFeature> SKXTuning = [TuningFastGather,
TuningFastScalarFSQRT,		TuningFastScalarFSQRT,
TuningFastVectorFSQRT,		TuningFastVectorFSQRT,
TuningFastSHLDRotate,		TuningFastSHLDRotate,
TuningFast15ByteNOP,		TuningFast15ByteNOP,
TuningFastVariableCrossLaneShuffle,		TuningFastVariableCrossLaneShuffle,
TuningFastVariablePerLaneShuffle,		TuningFastVariablePerLaneShuffle,
TuningPrefer256Bit,		TuningPrefer256Bit,
TuningPOPCNTFalseDeps,		TuningPOPCNTFalseDeps,
TuningInsertVZEROUPPER];		TuningInsertVZEROUPPER,
		TuningAllowLightAVX];
list<SubtargetFeature> SKXFeatures =		list<SubtargetFeature> SKXFeatures =
!listconcat(BDWFeatures, SKXAdditionalFeatures);		!listconcat(BDWFeatures, SKXAdditionalFeatures);

// Cascadelake		// Cascadelake
list<SubtargetFeature> CLXAdditionalFeatures = [FeatureVNNI];		list<SubtargetFeature> CLXAdditionalFeatures = [FeatureVNNI];
list<SubtargetFeature> CLXTuning = SKXTuning;		list<SubtargetFeature> CLXTuning = SKXTuning;
list<SubtargetFeature> CLXFeatures =		list<SubtargetFeature> CLXFeatures =
!listconcat(SKXFeatures, CLXAdditionalFeatures);		!listconcat(SKXFeatures, CLXAdditionalFeatures);
Show All 20 Lines	list<SubtargetFeature> CNLTuning = [TuningFastGather,
TuningSlowDivide64,		TuningSlowDivide64,
TuningFastScalarFSQRT,		TuningFastScalarFSQRT,
TuningFastVectorFSQRT,		TuningFastVectorFSQRT,
TuningFastSHLDRotate,		TuningFastSHLDRotate,
TuningFast15ByteNOP,		TuningFast15ByteNOP,
TuningFastVariableCrossLaneShuffle,		TuningFastVariableCrossLaneShuffle,
TuningFastVariablePerLaneShuffle,		TuningFastVariablePerLaneShuffle,
TuningPrefer256Bit,		TuningPrefer256Bit,
TuningInsertVZEROUPPER];		TuningInsertVZEROUPPER,
		TuningAllowLightAVX];
list<SubtargetFeature> CNLFeatures =		list<SubtargetFeature> CNLFeatures =
!listconcat(SKLFeatures, CNLAdditionalFeatures);		!listconcat(SKLFeatures, CNLAdditionalFeatures);

// Icelake		// Icelake
list<SubtargetFeature> ICLAdditionalFeatures = [FeatureBITALG,		list<SubtargetFeature> ICLAdditionalFeatures = [FeatureBITALG,
FeatureVAES,		FeatureVAES,
FeatureVBMI2,		FeatureVBMI2,
FeatureVNNI,		FeatureVNNI,
FeatureVPCLMULQDQ,		FeatureVPCLMULQDQ,
FeatureVPOPCNTDQ,		FeatureVPOPCNTDQ,
FeatureGFNI,		FeatureGFNI,
FeatureRDPID,		FeatureRDPID,
FeatureFSRM];		FeatureFSRM];
list<SubtargetFeature> ICLTuning = [TuningFastGather,		list<SubtargetFeature> ICLTuning = [TuningFastGather,
TuningMacroFusion,		TuningMacroFusion,
TuningSlow3OpsLEA,		TuningSlow3OpsLEA,
TuningSlowDivide64,		TuningSlowDivide64,
TuningFastScalarFSQRT,		TuningFastScalarFSQRT,
TuningFastVectorFSQRT,		TuningFastVectorFSQRT,
TuningFastSHLDRotate,		TuningFastSHLDRotate,
TuningFast15ByteNOP,		TuningFast15ByteNOP,
TuningFastVariableCrossLaneShuffle,		TuningFastVariableCrossLaneShuffle,
TuningFastVariablePerLaneShuffle,		TuningFastVariablePerLaneShuffle,
TuningPrefer256Bit,		TuningPrefer256Bit,
TuningInsertVZEROUPPER];		TuningInsertVZEROUPPER,
		TuningAllowLightAVX];
list<SubtargetFeature> ICLFeatures =		list<SubtargetFeature> ICLFeatures =
!listconcat(CNLFeatures, ICLAdditionalFeatures);		!listconcat(CNLFeatures, ICLAdditionalFeatures);

// Icelake Server		// Icelake Server
list<SubtargetFeature> ICXAdditionalFeatures = [FeaturePCONFIG,		list<SubtargetFeature> ICXAdditionalFeatures = [FeaturePCONFIG,
FeatureCLWB,		FeatureCLWB,
FeatureWBNOINVD];		FeatureWBNOINVD];
list<SubtargetFeature> ICXTuning = ICLTuning;		list<SubtargetFeature> ICXTuning = ICLTuning;
▲ Show 20 Lines • Show All 365 Lines • ▼ Show 20 Lines	list<SubtargetFeature> ZNTuning = [TuningFastLZCNT,
TuningBranchFusion,		TuningBranchFusion,
TuningFastScalarFSQRT,		TuningFastScalarFSQRT,
TuningFastVectorFSQRT,		TuningFastVectorFSQRT,
TuningFastScalarShiftMasks,		TuningFastScalarShiftMasks,
TuningFastVariablePerLaneShuffle,		TuningFastVariablePerLaneShuffle,
TuningFastMOVBE,		TuningFastMOVBE,
TuningSlowSHLD,		TuningSlowSHLD,
TuningSBBDepBreaking,		TuningSBBDepBreaking,
TuningInsertVZEROUPPER];		TuningInsertVZEROUPPER,
		TuningAllowLightAVX];
		RKSimonUnsubmitted Not Done Reply Inline Actions I'm not certain Ryzen needs this - even on znver1 with double pumping of 256-bit ops. RKSimon: I'm not certain Ryzen needs this - even on znver1 with double pumping of 256-bit ops.
		TokarIPAuthorUnsubmitted Done Reply Inline Actions I'm not sure I understand this comment. You mean since Ryzen doesn't have any frequency problems, so we don't care about prefer-vector-width=128 behavior? This is mostly here for a) completeness (since 256-ops don't seem to hurt on ryzen we do prefer 256 bit loads/stores) and b) for cases where users want znver tuning but still prefer good performance on intel sop they pass prefer-vector-width=128 TokarIP: I'm not sure I understand this comment. You mean since Ryzen doesn't have any frequency…
		lebedev.riUnsubmitted Not Done Reply Inline Actions I agree with @RKSimon here. I'm not really sure why anyone would want to use non-full vector width on Ryzens, so i don't think we support it there. lebedev.ri: I agree with @RKSimon here. I'm not really sure why anyone would want to use non-full vector…
		TokarIPAuthorUnsubmitted Done Reply Inline Actions FWIW mtune=znver3 + mprefer-vector-width=128 often gives best results for a mixed (skylake+rome) server fleet. TokarIP: FWIW mtune=znver3 + mprefer-vector-width=128 often gives best results for a mixed…
		RKSimonUnsubmitted Not Done Reply Inline Actions Would -mtune=x86-64-v3 not be better for those cases? RKSimon: Would -mtune=x86-64-v3 not be better for those cases?
		TokarIPAuthorUnsubmitted Done Reply Inline Actions Not really, x86-64-v3 is basically haswell, and it seems that ryzen benefits more from ryzen tuning than skylake from haswell tuning. TokarIP: Not really, x86-64-v3 is basically haswell, and it seems that ryzen benefits more from ryzen…
		RKSimonUnsubmitted Not Done Reply Inline Actions OK - if you want to include this then please can you ensure you add znver test coverage below RKSimon: OK - if you want to include this then please can you ensure you add znver test coverage below
		TokarIPAuthorUnsubmitted Done Reply Inline Actions Added znver case to memcpy-light-avx.ll TokarIP: Added znver case to memcpy-light-avx.ll
list<SubtargetFeature> ZN2AdditionalFeatures = [FeatureCLWB,		list<SubtargetFeature> ZN2AdditionalFeatures = [FeatureCLWB,
FeatureRDPID,		FeatureRDPID,
FeatureRDPRU,		FeatureRDPRU,
FeatureWBNOINVD];		FeatureWBNOINVD];
list<SubtargetFeature> ZN2Tuning = ZNTuning;		list<SubtargetFeature> ZN2Tuning = ZNTuning;
list<SubtargetFeature> ZN2Features =		list<SubtargetFeature> ZN2Features =
!listconcat(ZNFeatures, ZN2AdditionalFeatures);		!listconcat(ZNFeatures, ZN2AdditionalFeatures);
list<SubtargetFeature> ZN3AdditionalFeatures = [FeatureFSRM,		list<SubtargetFeature> ZN3AdditionalFeatures = [FeatureFSRM,
▲ Show 20 Lines • Show All 421 Lines • Show Last 20 Lines

llvm/lib/Target/X86/X86ISelLowering.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 2,682 Lines • ▼ Show 20 Lines	if (Op.size() >= 16 &&
(!Subtarget.isUnalignedMem16Slow() \|\| Op.isAligned(Align(16)))) {		(!Subtarget.isUnalignedMem16Slow() \|\| Op.isAligned(Align(16)))) {
// FIXME: Check if unaligned 64-byte accesses are slow.		// FIXME: Check if unaligned 64-byte accesses are slow.
if (Op.size() >= 64 && Subtarget.hasAVX512() &&		if (Op.size() >= 64 && Subtarget.hasAVX512() &&
(Subtarget.getPreferVectorWidth() >= 512)) {		(Subtarget.getPreferVectorWidth() >= 512)) {
return Subtarget.hasBWI() ? MVT::v64i8 : MVT::v16i32;		return Subtarget.hasBWI() ? MVT::v64i8 : MVT::v16i32;
}		}
// FIXME: Check if unaligned 32-byte accesses are slow.		// FIXME: Check if unaligned 32-byte accesses are slow.
if (Op.size() >= 32 && Subtarget.hasAVX() &&		if (Op.size() >= 32 && Subtarget.hasAVX() &&
(Subtarget.getPreferVectorWidth() >= 256)) {		Subtarget.useLightAVX256Instructions()) {
		pengfeiUnsubmitted Not Done Reply Inline Actions Here the check for 256 was intended from rG47272217 authored by @echristo. It looks to me it is the only difference between `prefer-128-bit` and `prefer-256-bit`. So I don't understand why using `-mattr=prefer-128-bit -x86-light-avx=true` rather than `prefer-256-bit`. pengfei: Here the check for 256 was intended from rG47272217 authored by @echristo. It looks to me it is…
		TokarIPAuthorUnsubmitted Done Reply Inline Actions I'm not sure I understand the question. Building everything with prefer-256-bit means getting e.g 256-bit FMA and corresponding frequency penalty. I want to get 256-bit loads/stores because they are free performance win, but not "heavy" instructions. TokarIP: I'm not sure I understand the question. Building everything with prefer-256-bit means getting e.
		pengfeiUnsubmitted Not Done Reply Inline Actions Oh, I thought it is the only place we compared PreferVectorWidth with 256. (There're two other places in X86TargetTransformInfo.cpp) I mentioned @echristo's patch, because I guess make load/store the same size as PreferVectorWidth is beneficial in the most cases. And do we need to consider the intensity of "heavy" instructions? E.g., if we load 256-bit vector, shuffle to 2 128-bit to do a single FMA, and then shuffle back to 256-bit to do the store. Is it really better than 2 128-bit load/store that might be folded into the FMA instruction? pengfei: Oh, I thought it is the only place we compared PreferVectorWidth with 256. (There're two other…
		TokarIPAuthorUnsubmitted Done Reply Inline Actions AFAIK this won't happen. We either vectorize everything with 256-bit, so no FMA, just data movement, or with 128-bit and have no 256-bit at all. Reading the code and playing with examples didn't produce any mixed cases. TokarIP: AFAIK this won't happen. We either vectorize everything with 256-bit, so no FMA, just data…
// Although this isn't a well-supported type for AVX1, we'll let		// Although this isn't a well-supported type for AVX1, we'll let
// legalization and shuffle lowering produce the optimal codegen. If we		// legalization and shuffle lowering produce the optimal codegen. If we
// choose an optimal type with a vector element larger than a byte,		// choose an optimal type with a vector element larger than a byte,
// getMemsetStores() may create an intermediate splat (using an integer		// getMemsetStores() may create an intermediate splat (using an integer
// multiply) before we splat as a vector.		// multiply) before we splat as a vector.
return MVT::v32i8;		return MVT::v32i8;
}		}
if (Subtarget.hasSSE2() && (Subtarget.getPreferVectorWidth() >= 128))		if (Subtarget.hasSSE2() && (Subtarget.getPreferVectorWidth() >= 128))
▲ Show 20 Lines • Show All 54,692 Lines • Show Last 20 Lines

llvm/lib/Target/X86/X86Subtarget.h

Show First 20 Lines • Show All 249 Lines • ▼ Show 20 Lines	#include "X86GenSubtargetInfo.inc"
}		}

// If there are no 512-bit vectors and we prefer not to use 512-bit registers,		// If there are no 512-bit vectors and we prefer not to use 512-bit registers,
// disable them in the legalizer.		// disable them in the legalizer.
bool useAVX512Regs() const {		bool useAVX512Regs() const {
return hasAVX512() && (canExtendTo512DQ() \|\| RequiredVectorWidth > 256);		return hasAVX512() && (canExtendTo512DQ() \|\| RequiredVectorWidth > 256);
}		}

		bool useLightAVX256Instructions() const {
		return getPreferVectorWidth() >= 256 \|\| AllowLightAVX;;
		RKSimonUnsubmitted Done Reply Inline Actions Depending on how prevalent this will be used, we might adjust it so people don't need to remember to check Subtarget.getPreferVectorWidth() >= 256 as well: bool useLightAVX256Instructions() const { return Subtarget.getPreferVectorWidth() >= 256 \|\| AllowLightAVX; } RKSimon: Depending on how prevalent this will be used, we might adjust it so people don't need to…
		RKSimonUnsubmitted Done Reply Inline Actions ;; RKSimon: ;;
		}

bool useBWIRegs() const {		bool useBWIRegs() const {
return hasBWI() && useAVX512Regs();		return hasBWI() && useAVX512Regs();
}		}

bool isXRaySupported() const override { return is64Bit(); }		bool isXRaySupported() const override { return is64Bit(); }

/// Use clflush if we have SSE2 or we're on x86-64 (even if we asked for		/// Use clflush if we have SSE2 or we're on x86-64 (even if we asked for
/// no-sse2). There isn't any reason to disable it if the target processor		/// no-sse2). There isn't any reason to disable it if the target processor
▲ Show 20 Lines • Show All 160 Lines • Show Last 20 Lines

llvm/lib/Target/X86/X86TargetTransformInfo.h

Show First 20 Lines • Show All 86 Lines • ▼ Show 20 Lines	const FeatureBitset InlineFeatureIgnoreList = {
X86::TuningPreferMaskRegisters,		X86::TuningPreferMaskRegisters,
X86::TuningInsertVZEROUPPER,		X86::TuningInsertVZEROUPPER,
X86::TuningUseSLMArithCosts,		X86::TuningUseSLMArithCosts,
X86::TuningUseGLMDivSqrtCosts,		X86::TuningUseGLMDivSqrtCosts,

// Perf-tuning flags.		// Perf-tuning flags.
X86::TuningFastGather,		X86::TuningFastGather,
X86::TuningSlowUAMem32,		X86::TuningSlowUAMem32,
		X86::TuningAllowLightAVX,

// Based on whether user set the -mprefer-vector-width command line.		// Based on whether user set the -mprefer-vector-width command line.
X86::TuningPrefer128Bit,		X86::TuningPrefer128Bit,
X86::TuningPrefer256Bit,		X86::TuningPrefer256Bit,

// CPU name enums. These just follow CPU string.		// CPU name enums. These just follow CPU string.
X86::ProcIntelAtom		X86::ProcIntelAtom
};		};
▲ Show 20 Lines • Show All 182 Lines • Show Last 20 Lines

llvm/test/CodeGen/X86/memcpy-light-avx.ll

This file was added.

				; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=haswell -mattr=prefer-128-bit \| FileCheck %s
				; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=avx2,+prefer-128-bit,+allow-light-256-bit \| FileCheck %s
				RKSimonUnsubmitted Done Reply Inline Actions Test with generic settings as well (e.g. -mattr=avx2,+prefer-128-bit,+allow-light-avx) RKSimon: Test with generic settings as well (e.g. -mattr=avx2,+prefer-128-bit,+allow-light-avx)
				pengfeiUnsubmitted Done Reply Inline Actions Why don't use `update_llc_test_checks.py` to generate the test? pengfei: Why don't use `update_llc_test_checks.py` to generate the test?

				declare void @llvm.memcpy.p0.p0.i64(ptr nocapture, ptr nocapture, i64, i1) nounwind

				RKSimonUnsubmitted Not Done Reply Inline Actions ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=avx2,+prefer-128-bit,-allow-light-256-bit \| FileCheck %s RKSimon: ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=avx2,+prefer-128-bit,-allow-light-256…
				TokarIPAuthorUnsubmitted Done Reply Inline Actions Since this forces 128-bit, added with NO256 prefix: ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=avx2,+prefer-128-bit,-allow-light-256-bit \| FileCheck --check-prefixes=NO256 TokarIP: Since this forces 128-bit, added with NO256 prefix: ; RUN: llc < %s -mtriple=x86_64-unknown…
				define void @test1(ptr %a, ptr %b) nounwind {
				; CHECK-LABEL: test1:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovups (%rsi), %ymm0
				; CHECK-NEXT: vmovups %ymm0, (%rdi)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				tail call void @llvm.memcpy.p0.p0.i64(ptr %a, ptr %b, i64 32, i1 0 )
				ret void
				}

llvm/test/CodeGen/X86/vector-width-store-merge.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc < %s -mtriple=x86_64-- \| FileCheck %s			; RUN: llc < %s -mtriple=x86_64-- \| FileCheck %s
				RKSimonUnsubmitted Done Reply Inline Actions Please can you add ryzen coverage here as well: RUN: llc < %s -mtriple=x86_64-- -mcpu=znver1\| FileCheck %s --check-prefixes=CHECK,PREFER256 RKSimon: Please can you add ryzen coverage here as well: RUN: llc < %s -mtriple=x86_64-- -mcpu=znver1\|…

				RKSimonUnsubmitted Done Reply Inline Actions Clean up the prefixes - LIGHT256 might be a better prefix to use here? ; RUN: llc < %s -mtriple=x86_64-- -mcpu=skylake\| FileCheck %s --check-prefixes=CHECK,PREFER256 ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=sandybridge\| FileCheck %s --check-prefixes=CHECK,LIGHT256 RKSimon: Clean up the prefixes - LIGHT256 might be a better prefix to use here? ``` ; RUN: llc < %s…
	; This tests whether or not we generate vectors large than preferred vector width when			; This tests whether or not we generate vectors large than preferred vector width when
	; lowering memmove.			; lowering memmove.

	; Function Attrs: nounwind uwtable			; Function Attrs: nounwind uwtable
	define weak_odr dso_local void @A(ptr %src, ptr %dst) local_unnamed_addr #0 {			define weak_odr dso_local void @A(ptr %src, ptr %dst) local_unnamed_addr #0 {
	; CHECK-LABEL: A:			; CHECK-LABEL: A:
	; CHECK: # %bb.0: # %entry			; CHECK: # %bb.0: # %entry
	; CHECK-NEXT: vmovups (%rdi), %xmm0			; CHECK-NEXT: vmovups (%rdi), %xmm0
	▲ Show 20 Lines • Show All 50 Lines • ▼ Show 20 Lines
	entry:			entry:
	call void @llvm.memmove.p0.p0.i64(ptr align 1 %dst, ptr align 1 %src, i64 64, i1 false)			call void @llvm.memmove.p0.p0.i64(ptr align 1 %dst, ptr align 1 %src, i64 64, i1 false)
	ret void			ret void
	}			}

	; Function Attrs: argmemonly nounwind			; Function Attrs: argmemonly nounwind
	declare void @llvm.memmove.p0.p0.i64(ptr nocapture, ptr nocapture readonly, i64, i1 immarg) #1			declare void @llvm.memmove.p0.p0.i64(ptr nocapture, ptr nocapture readonly, i64, i1 immarg) #1

	attributes #0 = { nounwind uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "frame-pointer"="none" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "prefer-vector-width"="128" "stack-protector-buffer-size"="8" "target-cpu"="skylake-avx512" "target-features"="+adx,+aes,+avx,+avx2,+avx512bw,+avx512cd,+avx512dq,+avx512f,+avx512vl,+bmi,+bmi2,+clflushopt,+clwb,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+invpcid,+lzcnt,+mmx,+movbe,+pclmul,+pku,+popcnt,+prfchw,+rdrnd,+rdseed,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsavec,+xsaveopt,+xsaves" "unsafe-fp-math"="false" "use-soft-float"="false" }			attributes #0 = { nounwind uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "frame-pointer"="none" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "prefer-vector-width"="128" "stack-protector-buffer-size"="8" "target-cpu"="sandybridge" "target-features"="+adx,+aes,+avx,+avx2,+avx512bw,+avx512cd,+avx512dq,+avx512f,+avx512vl,+bmi,+bmi2,+clflushopt,+clwb,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+invpcid,+lzcnt,+mmx,+movbe,+pclmul,+pku,+popcnt,+prfchw,+rdrnd,+rdseed,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsavec,+xsaveopt,+xsaves" "unsafe-fp-math"="false" "use-soft-float"="false" }
				pengfeiUnsubmitted Not Done Reply Inline Actions This patch changes the behavior the test expected, though it should no correctness issue for 256-bits. We should update the test to show it rather than hide it. Note, it will have correctness issue or build error if force to generate 512-bits instructions. pengfei: This patch changes the behavior the test expected, though it should no correctness issue for…
				TokarIPAuthorUnsubmitted Done Reply Inline Actions We want to test 2 behaviors: 1)prefer-vector-width=128 and no TuningAllowLight256Bit should generate 128-bit load/store - this test 2)prefer-vector-width=128 and TuningAllowLight256Bit should generate 256-bit - memcpy-light-avx.ll Updating this test to check 256 case, means that we still need extra test for behavior #1, I'd rather keep the number of tests smaller with the same coverage. TokarIP: We want to test 2 behaviors: 1)prefer-vector-width=128 and no TuningAllowLight256Bit should…
				pengfeiUnsubmitted Not Done Reply Inline Actions You can add another RUN to test the behaviors in `llvm/test/CodeGen/X86/memcpy-light-avx.ll` if you like, e.g., `; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=haswell \| FileCheck %s --check-prefix=NO-256` We don't have a method to disable it on new targets, so no 2 behaviors here. pengfei: You can add another RUN to test the behaviors in `llvm/test/CodeGen/X86/memcpy-light-avx.ll` if…
				TokarIPAuthorUnsubmitted Done Reply Inline Actions Thanks for the suggestion! Now we have 2 runs one cpus without this tuning and one with. TokarIP: Thanks for the suggestion! Now we have 2 runs one cpus without this tuning and one with.
	attributes #1 = { argmemonly nounwind }			attributes #1 = { argmemonly nounwind }
	attributes #2 = { nounwind uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "frame-pointer"="none" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "prefer-vector-width"="256" "stack-protector-buffer-size"="8" "target-cpu"="skylake-avx512" "target-features"="+adx,+aes,+avx,+avx2,+avx512bw,+avx512cd,+avx512dq,+avx512f,+avx512vl,+bmi,+bmi2,+clflushopt,+clwb,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+invpcid,+lzcnt,+mmx,+movbe,+pclmul,+pku,+popcnt,+prfchw,+rdrnd,+rdseed,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsavec,+xsaveopt,+xsaves" "unsafe-fp-math"="false" "use-soft-float"="false" }			attributes #2 = { nounwind uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "frame-pointer"="none" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "prefer-vector-width"="256" "stack-protector-buffer-size"="8" "target-cpu"="skylake-avx512" "target-features"="+adx,+aes,+avx,+avx2,+avx512bw,+avx512cd,+avx512dq,+avx512f,+avx512vl,+bmi,+bmi2,+clflushopt,+clwb,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+invpcid,+lzcnt,+mmx,+movbe,+pclmul,+pku,+popcnt,+prfchw,+rdrnd,+rdseed,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsavec,+xsaveopt,+xsaves" "unsafe-fp-math"="false" "use-soft-float"="false" }

	!0 = !{i32 1, !"wchar_size", i32 4}			!0 = !{i32 1, !"wchar_size", i32 4}

This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add support for "light" AVXClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 484641

llvm/lib/Target/X86/X86.td

llvm/lib/Target/X86/X86ISelLowering.cpp

llvm/lib/Target/X86/X86Subtarget.h

llvm/lib/Target/X86/X86TargetTransformInfo.h

llvm/test/CodeGen/X86/memcpy-light-avx.ll

llvm/test/CodeGen/X86/vector-width-store-merge.ll

[X86] Add support for "light" AVX
ClosedPublic