This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Fix lookahead operand reordering for splat loads.
ClosedPublic

Authored by vporpo on Mar 9 2022, 9:45 PM.

Details

Summary

Splat loads are inexpensive on X86. For a 2-lane vector, we need just one
instruction: movddup (%reg), %xmm0. Using the standard Splat score leads
to worse code. This patch adds a new score dedicated to splat loads.

Please note that a splat usually takes three IR instructions:

  • Usually it is a load and two inserts:
%ld = load double, double* %gep
%ins1 = insertelement <2 x double> poison, double %ld, i32 0
%ins2 = insertelement <2 x double> %ins1, double %ld, i32 1
  • But it can also be a load, an insert and a shuffle:
%ld = load double, double* %gep
%ins = insertelement <2 x double> poison, double %ld, i32 0
%shf = shufflevector <2 x double> %ins, <2 x double> poison, <2 x i32> zeroinitializer

Because of this, some of the lit tests contain more IR instructions.
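For illustration, here is a minimal sketch of the kind of check the lookahead operand-reordering score can use; the helper name and the score values are assumptions, not the patch's exact code:

  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // Sketch: when both operand candidates are the same load, reward them with
  // a dedicated splat-load score, because the broadcast can be folded into
  // the load (e.g. movddup (%reg), %xmm0 on X86).
  static int getSplatScore(Value *V1, Value *V2) {
    constexpr int ScoreSplat = 1;      // assumed generic splat score
    constexpr int ScoreSplatLoads = 3; // assumed dedicated splat-load score
    if (V1 != V2)
      return 0;
    return isa<LoadInst>(V1) ? ScoreSplatLoads : ScoreSplat;
  }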

Diff Detail

Event Timeline

vporpo created this revision.Mar 9 2022, 9:45 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 9 2022, 9:45 PM
vporpo requested review of this revision.Mar 9 2022, 9:45 PM
RKSimon added inline comments.Mar 9 2022, 11:37 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1130

why test both V1 and V2 if we know they are the same value?

vporpo updated this revision to Diff 414286.Mar 10 2022, 12:21 AM
vporpo marked an inline comment as done.

Removed isa<LoadInst>(V2).

ABataev added inline comments.Mar 10 2022, 2:58 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1091

This is target-specific for x86. What about other targets?

lebedev.ri added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1130

This should be a lot more principled (i.e., query the actual cost model).
E.g., movddup is only for double, but I think this will also trigger for float?
Also, what about broadcasting loads that are available in AVX/AVX2/AVX512?

vporpo added inline comments.Mar 10 2022, 10:52 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1091

Please see the discussion below.

1130

I agree, this needs to be target-specific. I could add a new ShuffleKind, something like SK_BroadcastLoad. But this is not strictly a shuffle pattern; it is a combined Load + Shuffle. So this might need its own function, like TargetTransformInfo::getBroadcastLoadCost(VectorType *VecTy). What do you think?

RKSimon added inline comments.Mar 10 2022, 11:01 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1130

What about adding an IsLoad bool flag to TTI::getShuffleCost? Or a TTI::IsLegalBroadcastLoad (similar to the existing IsLegal*Load ops)?

vporpo added inline comments.Mar 10 2022, 11:09 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1130

Yeah, something like TTI::IsLegalBroadcastLoad may be more suitable for this since the score that we are using in the reordering is not an actual TTI cost.

vporpo updated this revision to Diff 414478.Mar 10 2022, 1:14 PM

Changed the logic to use TTI::isLegalBroadcastLoad().
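For context, a rough sketch of what such a hook and its X86 implementation could look like; the element-type/element-count signature is only suggested later in this review, so treat the exact form as an assumption:

  // TargetTransformInfo.h (sketch): can the target fold a broadcast of a
  // loaded element of type ElementTy into a single broadcast-load for
  // NumElements lanes?
  bool isLegalBroadcastLoad(Type *ElementTy, ElementCount NumElements) const;

  // X86TargetTransformInfo.cpp (sketch): movddup covers a <2 x double> splat.
  bool X86TTIImpl::isLegalBroadcastLoad(Type *ElementTy,
                                        ElementCount NumElements) const {
    return ST->hasSSE3() && !NumElements.isScalable() &&
           NumElements.getFixedValue() == 2 && ElementTy->isDoubleTy();
  }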

This is blocking our internal process, so any help in getting this reviewed would be greatly appreciated.

Ping (sorry for the repeated pings but this is quite urgent as it is blocking our process).

ABataev added inline comments.Mar 14 2022, 12:39 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
5117

I assume, you need to tweak the cost model for broadcast with loads.

5117–5121
return ST->hasSSSE3() && VecTy && VecTy->getElementCount().getKnownMinValue() == 2 &&
       VecTy->getElementType() == Type::getDoubleTy(VecTy->getContext());
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1090–1091

We don't care about the instruction count for SLP, but the throughput.

1124–1125

Maybe pass a scalar type and number of elements to avoid constructing vector type?

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
181–191

This block of code has throughput 2.5 instead of 2.0 before. I assume there are some other cases; this needs some extra investigation.

vporpo updated this revision to Diff 415261.Mar 14 2022, 4:13 PM
vporpo marked 2 inline comments as done.

Addressed comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1090–1091

Agreed, but usually code with fewer instructions is better for various reasons. How would you want me to rephrase this?

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
181–191

How did you come up with these throughput values?

The assembly code that comes out of llc for the original code is:

movl      8(%esp), %eax
movl      4(%esp), %ecx
vmovupd   (%ecx), %xmm0
vpermilpd $1, %xmm0, %xmm1    ## xmm1 = xmm0[1,0]
vmovq     %xmm0, %xmm0        ## xmm0 = xmm0[0],zero
vaddpd    %xmm1, %xmm0, %xmm0
vmovupd   %xmm0, (%eax)

The new code is:

movl      8(%esp), %eax
movl      4(%esp), %ecx
vmovsd    8(%ecx), %xmm0      ## xmm0 = mem[0],zero
vmovddup  (%ecx), %xmm1       ## xmm1 = mem[0,0]
vaddpd    %xmm1, %xmm0, %xmm0
vmovupd   %xmm0, (%eax)

I ran the function in a loop on a Skylake and the new code is 25% faster.

ABataev added inline comments.Mar 14 2022, 4:29 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1090–1091

Use the throughput, not number of instructions.

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
181–191

https://godbolt.org/z/3rhGajsaT

The first page is the result without the patch, the second with the patch.

vporpo updated this revision to Diff 415297.Mar 14 2022, 7:10 PM

Updated cost model to handle broadcast loads.
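As an illustration of the direction (names and values are assumptions; the table only gets its final name later in this review), the X86 shuffle cost tables could gain an entry for a broadcast that folds its load:

  // Sketch: a broadcast whose source is a foldable load is effectively free
  // as a shuffle, since movddup (%reg) does the load and the splat together.
  static const CostTblEntry SSE3BroadcastLoadTbl[] = {
      {TTI::SK_Broadcast, MVT::v2f64, 0}, // broadcast folded into movddup (mem)
  };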

RKSimon added inline comments.Mar 15 2022, 6:28 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1049

Why do we need this? Why not just use getShuffleCost?

vporpo added inline comments.Mar 15 2022, 9:54 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1049

We could use getShuffleCost() instead and introduce a new enumerator like ShuffleKind::SK_BroadcastLoad.
But I feel it is better to use a separate function, because getShuffleCost() is all about the cost of data shuffling (broadcast, reverse, select, transpose, etc.), while a BroadcastLoad includes both a load from memory and a data shuffle. I don't have a strong preference though; what do you think?

My concern is that we've never worked out how we should account for folded load costs, and this is setting a precedent that might backfire.

Wdyt about something like ShuffleKind::SK_BroadcastForBroadcastLoad that only takes into consideration the shuffle part of a broadcast load? A normal broadcast of a double to a 2-wide vector costs 1, so this could cost 0.

@ABataev wdyt about these options, any preference?

@ABataev wdyt about these options, any preference?

I agree with @RKSimon here. Better to teach existing functions about possibly foldable loads somehow. There might be some other cases where we'll need similar info, e.g. some AVX512 targets have slow gathers for registers but not for loads; we need to pass this info somehow too.

vporpo updated this revision to Diff 415690.Mar 15 2022, 9:07 PM

Removed getBroadcastLoadCost() and replaced it with getShuffleCost() which now has a new IsLoad argument.

ABataev added inline comments.Mar 16 2022, 6:56 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1056

I believe it is better to pass the instruction here, just like in other functions, and then do the analysis of this instruction if it was passed.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1124

Probably we need to also check the number of uses + external uses.

RKSimon added inline comments.Mar 16 2022, 8:39 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1056

Passing ArrayRef<const Value *> Args like we do for the arith ops makes sense here.

RKSimon added inline comments.Mar 16 2022, 8:41 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1558

Won't this fail on pre-SSE3 targets?

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
2

; RUN: opt < %s -basic-aa -slp-vectorizer -slp-threshold=-100 -instcombine -dce -S -mtriple=i386-apple-macosx10.8.0 -mattr=+sse2 | FileCheck %s

Add a pre-SSSE3 run?

vporpo updated this revision to Diff 415944.Mar 16 2022, 12:53 PM
vporpo marked 2 inline comments as done.

Replaced IsLoad argument with Args in getShuffleCost(), and addressed comments.
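A hedged sketch of how the Args-based interface might be consumed on the X86 side; the signature is approximated from the discussion above, and LT (the legalized type pair) plus the table name are assumed surrounding context:

  // TargetTransformInfo.h (sketch): Args carries the IR operands feeding the
  // shuffle, so targets can detect a foldable broadcast load.
  InstructionCost getShuffleCost(ShuffleKind Kind, VectorType *Tp,
                                 ArrayRef<int> Mask, int Index,
                                 VectorType *SubTp,
                                 ArrayRef<const Value *> Args = None) const;

  // X86TargetTransformInfo.cpp (sketch): if the broadcast source is a
  // single-use load, consult the broadcast-load table instead of the
  // generic shuffle tables.
  bool IsLoad = !Args.empty() && isa<LoadInst>(Args.front()) &&
                Args.front()->hasOneUse();
  if (Kind == TTI::SK_Broadcast && IsLoad && ST->hasSSE3())
    if (const auto *Entry =
            CostTableLookup(SSE3BroadcastLoadTbl, Kind, LT.second))
      return LT.first * Entry->Cost;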

vporpo added inline comments.Mar 16 2022, 12:55 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1558

Good catch, I added a check for SSE3.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1124

Good point, I added a check which should only allow this if we don't have any internal/external uses. I will add checks for internal uses in a follow-up patch, because it requires more changes + tests.
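Roughly the shape of the guard being described (a sketch; TTI, NumLanes, and the score constants are assumed surrounding context, not the patch's exact code):

  // Credit the splat-load score only when the loaded value has no other
  // uses, so the broadcast load does not end up coexisting with a scalar
  // load of the same address.
  auto *Ld = dyn_cast<LoadInst>(V1);
  if (Ld && Ld->hasOneUse() &&
      TTI.isLegalBroadcastLoad(Ld->getType(), ElementCount::getFixed(NumLanes)))
    return ScoreSplatLoads;
  return ScoreSplat;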

ABataev added inline comments.Mar 16 2022, 1:36 PM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1056

You can drop const in const Value *

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1556–1558

Maybe rename it to SSE3BroadcastLoadTbl?

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
181–191

What about this?

vporpo updated this revision to Diff 415963.Mar 16 2022, 1:58 PM
vporpo marked 2 inline comments as done.

Changed ArrayRef<const Value*> to ArrayRef<Value *> and renamed table to SSE3BroadcastLoadTbl.

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
181–191

I am not sure what to do about this; it may have lower throughput, but it has lower latency, so it runs faster. Are we always considering throughput? It looks like in TTI we are mostly counting instructions, at least from what I can see in getShuffleCost():

  {TTI::SK_PermuteSingleSrc, MVT::v2i64, 1},  // pshufd
  {TTI::SK_PermuteSingleSrc, MVT::v4i32, 1},  // pshufd
  {TTI::SK_PermuteSingleSrc, MVT::v8i16, 5},  // 2*pshuflw + 2*pshufhw + pshufd/unpck
  {TTI::SK_PermuteSingleSrc, MVT::v16i8, 10}, // 2*pshuflw + 2*pshufhw + 2*pshufd + 2*unpck + 2*packus
ABataev added inline comments.Mar 16 2022, 2:03 PM
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
181–191

It is still in terms of throughput.
Yeah, it may be faster on Skylake but not on corei7-avx. And there might be similar cases for other CPUs. We need to tweak the estimation criteria.

vporpo added inline comments.Mar 17 2022, 2:24 PM
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
181–191

What kind of tweaking are you proposing?

vporpo added inline comments.Mar 17 2022, 3:43 PM
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
181–191

As far as I understand the reason for the lower throughput is that we are adding pressure to the memory units ([4] and [5]). This is happening because we are loading twice, once for the scalar load, and once with vmovddup. This means that if we try to pack many instances of this code in parallel, then the pipeline would stall more than in the original code. But in any other case, like if this is part of some other code which is not heavy on the load units, it is the latency that matters, and this code is better with respect to latency.

Modeling throughput would require an accurate pipeline model, which we are not using in SLP. The cost modeling that we are doing looks more like a latency model than a throughput one, since it has no concept of pipeline resources. If we were actually modeling the pipeline, then we could check both the code latency and the code throughput and decide which one to choose in each case. Given that we are not actually doing this, I would argue against trying to fix a potential pipeline stall in an ad hoc way.

This revision is now accepted and ready to land.Mar 17 2022, 4:16 PM
This revision was landed with ongoing or failed builds.Mar 17 2022, 6:06 PM
This revision was automatically updated to reflect the committed changes.