This is an archive of the discontinued LLVM Phabricator instance.

Differential D121354

[SLP] Fix lookahead operand reordering for splat loads.
Closed, Public

Authored by vporpo on Mar 9 2022, 9:45 PM.

Details

Reviewers

ABataev
RKSimon
fhahn

Commits

rG5efa78985bf5: [SLP] Fix lookahead operand reordering for splat loads.

Summary

Splat loads are inexpensive on X86. For a 2-lane vector we need just one
instruction: movddup (%reg), xmm0. Using the standard Splat score leads
to worse code. This patch adds a new score dedicated to splat loads.

Please note that a splat usually consists of three IR instructions. Most often it is a load and two inserts:

%ld = load double, double* %gep
%ins1 = insertelement <2 x double> poison, double %ld, i32 0
%ins2 = insertelement <2 x double> %ins1, double %ld, i32 1

But it can also be a load, an insert and a shuffle:

%ld = load double, double* %gep
%ins = insertelement <2 x double> poison, double %ld, i32 0
%shf = shufflevector <2 x double> %ins, <2 x double> poison, <2 x i32> zeroinitializer

Because of this some of the lit tests contain more IR instructions.
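
To make the scoring idea in the summary concrete, here is a minimal sketch. It does not use LLVM's classes: `ToyValue` and `scorePair` are hypothetical names, and the score constants are illustrative placeholders rather than the values used by the patch. It only shows the shape of the change: a pair of identical operands normally gets the generic splat score, but gets a higher, dedicated score when the repeated value is a load.

```cpp
#include <cassert>

// Hypothetical stand-in for an IR value; this is not LLVM's Value class.
struct ToyValue {
  bool IsLoad; // true if the value is produced by a load instruction
  int Id;      // identity, so two operands can be compared for equality
};

// Illustrative scores only (assumed, not taken from the patch).
constexpr int ScoreFail = 0;
constexpr int ScoreSplat = 1;
constexpr int ScoreSplatLoads = 3;

// Score two operands that would end up in adjacent lanes.
// Only the splat case is modelled; the real heuristic has many more cases.
int scorePair(const ToyValue &V1, const ToyValue &V2) {
  if (V1.Id != V2.Id)
    return ScoreFail;       // different values, so not a splat
  if (V1.IsLoad)            // V1 and V2 are the same value, so checking V1 is enough
    return ScoreSplatLoads; // splat of a load: cheap on targets like X86
  return ScoreSplat;        // generic splat
}
```

With these placeholder values, a splat of a load outranks a generic splat, which is what lets the operand reordering prefer it.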

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

vporpo created this revision. Mar 9 2022, 9:45 PM

Herald added a project: Restricted Project. Mar 9 2022, 9:45 PM

Herald added subscribers: pengfei, hiraditya.

vporpo requested review of this revision. Mar 9 2022, 9:45 PM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B153485: Diff 414275. Mar 9 2022, 9:45 PM

vporpo added a parent revision: D121353: [SLP][NFC] This adds a test for a follow-up patch that fixes a look-ahead operand reordering issue. Mar 9 2022, 9:46 PM

vporpo edited the summary of this revision.

RKSimon added inline comments. Mar 9 2022, 11:37 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1189	why test both V1 and V2 if we know they are the same value?

Removed isa<LoadInst>(V2).

Harbormaster completed remote builds in B153493: Diff 414286. Mar 10 2022, 2:05 AM

ABataev added inline comments. Mar 10 2022, 2:58 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1147	This is target specific for x86. What about other targets?

lebedev.ri added a subscriber: lebedev.ri. Mar 10 2022, 4:23 AM

lebedev.ri added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1189	This should be a lot more principled than this (i.e., query the actual cost model). E.g., `movddup` is only for `double`, but I think this will also trigger for `float`? Also, what about broadcasting loads that are available in AVX/AVX2/AVX512?

vporpo added inline comments. Mar 10 2022, 10:52 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1147	Please see the discussion below.
1189	I agree, this needs to be target specific. I could add a new `ShuffleKind`, something like `SK_BroadcastLoad`. But this is not strictly a shuffle pattern, it is a combined Load + Shuffle. So this might need its own function, like `TargetTransformInfo::getBroadcastLoadCost(VectorType *VecTy)`. What do you think?

RKSimon added inline comments. Mar 10 2022, 11:01 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1189	What about adding an IsLoad bool flag to TTI::getShuffleCost? Or a TTI::IsLegalBroadcastLoad (similar to the existing IsLegal*Load ops)?

vporpo added inline comments. Mar 10 2022, 11:09 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1189	Yeah, something like `TTI::IsLegalBroadcastLoad` may be more suitable for this since the score that we are using in the reordering is not an actual TTI cost.

Changed the logic to use TTI::isLegalBroadcastLoad().
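
A legality query of this kind can be pictured with the following minimal sketch. It is not the code from the patch: `ToySubtarget`, `ElemKind`, and the free function are hypothetical stand-ins for the subtarget and `Type *` machinery in LLVM. The only case modelled is the `movddup` example from the summary (an SSE3 instruction that loads one `double` and duplicates it into two lanes); treating everything else as illegal is an assumption made for brevity.

```cpp
#include <cassert>

// Hypothetical element kinds; LLVM passes a Type* instead.
enum class ElemKind { Float, Double, Int32 };

// Hypothetical subtarget description with a single feature bit.
struct ToySubtarget {
  bool HasSSE3;
};

// Returns true if the target can load a scalar and broadcast it to a vector
// of type <NumElements x Elem> with one instruction.
// Assumption: only movddup (2 x double, requires SSE3) is modelled here.
bool isLegalBroadcastLoad(const ToySubtarget &ST, ElemKind Elem,
                          unsigned NumElements) {
  return ST.HasSSE3 && Elem == ElemKind::Double && NumElements == 2;
}
```

Because the reordering score is a heuristic and not a TTI cost, a yes/no query like this is enough for the vectorizer to decide whether a splat of a load deserves the dedicated score.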

Harbormaster completed remote builds in B153633: Diff 414478. Mar 10 2022, 2:05 PM

This is blocking our internal process, so any help in getting this reviewed would be greatly appreciated.

Ping (sorry for the repeated pings but this is quite urgent as it is blocking our process).

ABataev added inline comments. Mar 14 2022, 12:39 PM

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
5136	I assume you need to tweak the cost model for broadcast with loads.
5136–5140	return ST->hasSSSE3() && VecTy && VecTy->getElementCount().getKnownMinValue() == 2 && Ty->getElementType() == Type::getDoubleTy(Ty->getContext());
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1146–1147	We don't care about the instruction count for SLP, but about the throughput.
1181–1182	Maybe pass a scalar type and number of elements to avoid constructing vector type?
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
256–266	This block of code has throughput 2.5 instead of 2.0 before. I assume there are other such cases; this needs some extra investigation.

Addressed comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1146–1147	Agreed, but usually code with fewer instructions is better for various reasons. How would you want me to rephrase this?
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
256–266	How did you come up with these throughput values? The assembly code that comes out of llc for the original code is:

movl 8(%esp), %eax
movl 4(%esp), %ecx
vmovupd (%ecx), %xmm0
vpermilpd $1, %xmm0, %xmm1 ## xmm1 = xmm0[1,0]
vmovq %xmm0, %xmm0 ## xmm0 = xmm0[0],zero
vaddpd %xmm1, %xmm0, %xmm0
vmovupd %xmm0, (%eax)

The new code is:

movl 8(%esp), %eax
movl 4(%esp), %ecx
vmovsd 8(%ecx), %xmm0 ## xmm0 = mem[0],zero
vmovddup (%ecx), %xmm1 ## xmm1 = mem[0,0]
vaddpd %xmm1, %xmm0, %xmm0
vmovupd %xmm0, (%eax)

I ran the function in a loop on a Skylake and the new code is 25% faster.

ABataev added inline comments. Mar 14 2022, 4:29 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1146–1147	Use the throughput, not number of instructions.
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
256–266	https://godbolt.org/z/3rhGajsaT The first page is the result without patch, the second - with patch.

Harbormaster completed remote builds in B154211: Diff 415261. Mar 14 2022, 4:57 PM

Updated cost model to handle broadcast loads.

Harbormaster completed remote builds in B154238: Diff 415297. Mar 14 2022, 8:03 PM

RKSimon added inline comments. Mar 15 2022, 6:28 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1049	Why do we need this? Why not just use getShuffleCost?

vporpo added inline comments. Mar 15 2022, 9:54 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1049	We could use `getShuffleCost()` instead and introduce a new enumerator like `ShuffleKind::SK_BroadcastLoad`. But I feel it is better to use a separate function because `getShuffleCost()` is all about the cost of data shuffling (broadcast, reverse, select, transpose etc.), while a BroadcastLoad includes both a load from memory and a data shuffle. I don't have a strong preference though, what do you think?

My concern is that we've never worked out how we should account for folded load costs and this is setting a precedent that might back fire.

Wdyt about something like ShuffleKind::SK_BroadcastForBroadcastLoad that only takes into consideration the shuffle part of a broadcast load? A normal broadcast of a double to a 2-wide vector costs 1, so this could cost 0.

@ABataev wdyt about these options, any preference?

In D121354#3383755, @vporpo wrote:

@ABataev wdyt about these options, any preference?

I agree with @RKSimon here. It is better to teach existing functions about possibly foldable loads somehow. There might be other cases where we'll need similar info, e.g. some AVX512 targets have slow gathers for registers but not for loads, and we need to pass this info somehow too.

Removed getBroadcastLoadCost() and replaced it with getShuffleCost() which now has a new IsLoad argument.

Harbormaster completed remote builds in B154506: Diff 415690. Mar 15 2022, 9:07 PM

ABataev added inline comments. Mar 16 2022, 6:56 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1058	I believe it is better to pass the instruction here, just like in other functions, and then analyze that instruction if it was passed.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1181	Probably, need to check also for number of uses + external uses.

RKSimon added inline comments. Mar 16 2022, 8:39 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1058	Passing ArrayRef<const Value *> Args like we do for arith op makes sense here

RKSimon added inline comments. Mar 16 2022, 8:41 AM

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1559	Won't this fail on pre-SSSE3 targets?
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
2–3	; RUN: opt < %s -basic-aa -slp-vectorizer -slp-threshold=-100 -instcombine -dce -S -mtriple=i386-apple-macosx10.8.0 -mattr=+sse2 | FileCheck %s
Add a pre-SSSE3 run?

Replaced IsLoad argument with Args in getShuffleCost(), and addressed comments.
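
The effect of passing the shuffle operands through `Args` can be sketched as follows. This is a toy model, not the X86 cost table from the patch: `ToyValue`, `ShuffleKind`, and the boolean target flag are hypothetical. The costs follow the figures mentioned earlier in the discussion (a normal broadcast costs 1, the shuffle part of a broadcast load could cost 0); how a real target prices other shuffle kinds is not modelled.

```cpp
#include <cassert>
#include <vector>

// Hypothetical shuffle kinds; only Broadcast gets special treatment here.
enum class ShuffleKind { Broadcast, Reverse };

// Hypothetical stand-in for an IR value.
struct ToyValue {
  bool IsLoad; // true if the value is produced by a load instruction
};

// Sketch of a shuffle cost query that can inspect the shuffle operands.
// Assumption: every shuffle costs 1 unless it is a broadcast whose first
// operand is a load that the target can fold into the broadcast.
int getShuffleCost(ShuffleKind Kind, bool TargetFoldsBroadcastLoad,
                   const std::vector<const ToyValue *> &Args = {}) {
  const bool FirstArgIsLoad =
      !Args.empty() && Args[0] != nullptr && Args[0]->IsLoad;
  if (Kind == ShuffleKind::Broadcast && TargetFoldsBroadcastLoad &&
      FirstArgIsLoad)
    return 0; // the load performs the broadcast, so the shuffle is free
  return 1;
}
```

Callers that do not know the operands simply omit `Args` and get the unchanged cost, which mirrors the defaulted parameter in the diff.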

vporpo added inline comments. Mar 16 2022, 12:55 PM

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1559	Good catch, I added a check for SSE3.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1181	Good point, I added a check which should only allow this if we don't have any internal/external uses. I will add checks for internal uses in a follow up patch, because it requires more changes + tests.
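
One plausible way to express "no other uses" is to compare the load's use count with the number of lanes in the splat. The sketch below is an assumption about the shape of such a check, not the code that landed: `ToyLoad` and `hasOnlySplatUses` are hypothetical names, and it assumes each lane of the splat counts as exactly one use of the load.

```cpp
#include <cassert>

// Hypothetical stand-in for a load instruction; only the use count matters.
struct ToyLoad {
  unsigned NumUses; // total number of users of the loaded value
};

// Returns true if every use of the load is one lane of the splat, i.e. the
// load has no other internal or external users.
// Assumption: each lane of the splat contributes exactly one use.
bool hasOnlySplatUses(const ToyLoad &Ld, unsigned NumLanes) {
  assert(NumLanes > 0 && "a splat needs at least one lane");
  return Ld.NumUses == NumLanes;
}
```

If the load had an extra user, the scalar load would have to stay alive next to the broadcast load, so the splat would no longer be a single cheap instruction.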

Harbormaster completed remote builds in B154676: Diff 415944. Mar 16 2022, 1:11 PM

ABataev added inline comments. Mar 16 2022, 1:36 PM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1058	You can drop `const` in `const Value *`
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1557–1559	Maybe rename it to `SSE3BroadcastLoadTbl`?
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
256–266	What about this?

Changed ArrayRef<const Value*> to ArrayRef<Value *> and renamed table to SSE3BroadcastLoadTbl.

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
256–266	I am not sure what to do about this, it may have lower throughput but it has lower latency so it runs faster. Are we always considering throughput? It looks like in TTI we are mostly counting instructions, at least from what I can see in getShuffleCost():

{TTI::SK_PermuteSingleSrc, MVT::v2i64, 1}, // pshufd
{TTI::SK_PermuteSingleSrc, MVT::v4i32, 1}, // pshufd
{TTI::SK_PermuteSingleSrc, MVT::v8i16, 5}, // 2*pshuflw + 2*pshufhw
                                           // + pshufd/unpck
{TTI::SK_PermuteSingleSrc, MVT::v16i8, 10}, // 2*pshuflw + 2*pshufhw
                                            // + 2*pshufd + 2*unpck + 2*packus

Harbormaster completed remote builds in B154691: Diff 415963. Mar 16 2022, 2:00 PM

ABataev added inline comments. Mar 16 2022, 2:03 PM

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
256–266	It is still in terms of throughput. Yeah, it may be faster on skylake but not on corei7-avx. And there might be similar cases for other CPUs. The estimation criteria need tweaking.

vporpo added a child revision: D121939: [SLP][NFC] Added a test for a followup patch that enables handling splat loads with uses. Mar 17 2022, 11:47 AM

vporpo added inline comments. Mar 17 2022, 2:24 PM

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
256–266	What kind of tweaking are you proposing?

vporpo added inline comments. Mar 17 2022, 3:43 PM

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
256–266	As far as I understand the reason for the lower throughput is that we are adding pressure to the memory units ([4] and [5]). This is happening because we are loading twice, once for the scalar load, and once with `vmovddup`. This means that if we try to pack many instances of this code in parallel, then the pipeline would stall more than in the original code. But in any other case, like if this is part of some other code which is not heavy on the load units, it is the latency that matters, and this code is better with respect to latency. Modeling throughput would require an accurate pipeline model, which we are not using in SLP. The cost modeling that we are doing looks more like a latency model than a throughput one since it has no concept of pipeline resources. If we were actually modelling the pipeline, then we could check both the code latency and the code throughput and decide which one to choose in each case. Given that we are not actually doing this I would argue against trying to fix a potential pipeline stall in an ad hoc way.

This revision is now accepted and ready to land. Mar 17 2022, 4:16 PM

This revision was landed with ongoing or failed builds. Mar 17 2022, 6:06 PM

Closed by commit rG5efa78985bf5: [SLP] Fix lookahead operand reordering for splat loads. (authored by vporpo).

This revision was automatically updated to reflect the committed changes.

vporpo added a commit: rG5efa78985bf5: [SLP] Fix lookahead operand reordering for splat loads.

vporpo added a reverting change: rG9136145eb019: Revert "[SLP] Fix lookahead operand reordering for splat loads." due to build…. Mar 17 2022, 6:23 PM

vporpo mentioned this in D121973: Recommit "[SLP] Fix lookahead operand reordering for splat loads.". Mar 17 2022, 7:33 PM

vporpo mentioned this in rG79613185d305: Recommit "[SLP] Fix lookahead operand reordering for splat loads.". Mar 21 2022, 3:58 PM

vporpo mentioned this in rG27bd8f949282: Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 2…. Mar 22 2022, 5:09 PM

vporpo mentioned this in rG39aa202affd9: Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 3…. Mar 23 2022, 6:33 PM

vporpo mentioned this in D123638: [SLP][AArch64] Implement lookahead operand reordering score of splat loads for AArch64. Apr 12 2022, 3:24 PM

vporpo mentioned this in rG7ba702644bac: [SLP][AArch64] Implement lookahead operand reordering score of splat loads for…. Apr 22 2022, 7:48 AM

Revision Contents

Path	Size
llvm/include/llvm/Analysis/TargetTransformInfo.h	25 lines
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h	7 lines
llvm/include/llvm/CodeGen/BasicTTIImpl.h	3 lines
llvm/lib/Analysis/TargetTransformInfo.cpp	16 lines
llvm/lib/Target/X86/X86TargetTransformInfo.h	3 lines
llvm/lib/Target/X86/X86TargetTransformInfo.cpp	30 lines
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	24 lines
llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll	50 lines
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll	34 lines

Diff 416368

llvm/include/llvm/Analysis/TargetTransformInfo.h

@@ ... @@ public:
   /// Return true if the target supports masked load.
   bool isLegalMaskedLoad(Type *DataType, Align Alignment) const;

   /// Return true if the target supports nontemporal store.
   bool isLegalNTStore(Type *DataType, Align Alignment) const;
   /// Return true if the target supports nontemporal load.
   bool isLegalNTLoad(Type *DataType, Align Alignment) const;

+  /// \Returns true if the target supports broadcasting a load to a vector of
+  /// type <NumElements x ElementTy>.
+  bool isLegalBroadcastLoad(Type *ElementTy, unsigned NumElements) const;
+
   /// Return true if the target supports masked scatter.
   bool isLegalMaskedScatter(Type *DataType, Align Alignment) const;
   /// Return true if the target supports masked gather.
   bool isLegalMaskedGather(Type *DataType, Align Alignment) const;
   /// Return true if the target forces scalarizing of llvm.masked.gather
   /// intrinsics.
   bool forceScalarizeMaskedGather(VectorType *Type, Align Alignment) const;
   /// Return true if the target forces scalarizing of llvm.masked.scatter
@@ ... @@ InstructionCost getArithmeticInstrCost(
       OperandValueKind Opd2Info = OK_AnyValue,
       OperandValueProperties Opd1PropInfo = OP_None,
       OperandValueProperties Opd2PropInfo = OP_None,
       ArrayRef<const Value *> Args = ArrayRef<const Value *>(),
       const Instruction *CxtI = nullptr) const;

   /// \return The cost of a shuffle instruction of kind Kind and of type Tp.
   /// The exact mask may be passed as Mask, or else the array will be empty.
   /// The index and subtype parameters are used by the subvector insertion and
   /// extraction shuffle kinds to show the insert/extract point and the type of
-  /// the subvector being inserted/extracted.
+  /// the subvector being inserted/extracted. The operands of the shuffle can be
+  /// passed through \p Args, which helps improve the cost estimation in some
+  /// cases, like in broadcast loads.
   /// NOTE: For subvector extractions Tp represents the source type.
   InstructionCost getShuffleCost(ShuffleKind Kind, VectorType *Tp,
       ArrayRef<int> Mask = None, int Index = 0,
-      VectorType *SubTp = nullptr) const;
+      VectorType *SubTp = nullptr,
+      ArrayRef<Value *> Args = None) const;

   /// Represents a hint about the context in which a cast is used.
   ///
   /// For zext/sext, the context of the cast is the operand, which must be a
   /// load of some kind. For trunc, the context is of the cast is the single
   /// user of the instruction, which must be a store of some kind.
   ///
   /// This enum allows the vectorizer to give getCastInstrCost an idea of the
@@ ... @@ virtual bool canSaveCmp(Loop *L, BranchInst **BI, ScalarEvolution *SE,
       LoopInfo *LI, DominatorTree *DT, AssumptionCache *AC,
       TargetLibraryInfo *LibInfo) = 0;
   virtual AddressingModeKind
   getPreferredAddressingMode(const Loop *L, ScalarEvolution *SE) const = 0;
   virtual bool isLegalMaskedStore(Type *DataType, Align Alignment) = 0;
   virtual bool isLegalMaskedLoad(Type *DataType, Align Alignment) = 0;
   virtual bool isLegalNTStore(Type *DataType, Align Alignment) = 0;
   virtual bool isLegalNTLoad(Type *DataType, Align Alignment) = 0;
+  virtual bool isLegalBroadcastLoad(Type *ElementTy,
+      unsigned NumElements) const = 0;
   virtual bool isLegalMaskedScatter(Type *DataType, Align Alignment) = 0;
   virtual bool isLegalMaskedGather(Type *DataType, Align Alignment) = 0;
   virtual bool forceScalarizeMaskedGather(VectorType *DataType,
       Align Alignment) = 0;
   virtual bool forceScalarizeMaskedScatter(VectorType *DataType,
       Align Alignment) = 0;
   virtual bool isLegalMaskedCompressStore(Type *DataType) = 0;
   virtual bool isLegalMaskedExpandLoad(Type *DataType) = 0;
@@ ... @@ public:
   virtual unsigned getMaxInterleaveFactor(unsigned VF) = 0;
   virtual InstructionCost getArithmeticInstrCost(
       unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
       OperandValueKind Opd1Info, OperandValueKind Opd2Info,
       OperandValueProperties Opd1PropInfo, OperandValueProperties Opd2PropInfo,
       ArrayRef<const Value *> Args, const Instruction *CxtI = nullptr) = 0;
   virtual InstructionCost getShuffleCost(ShuffleKind Kind, VectorType *Tp,
       ArrayRef<int> Mask, int Index,
-      VectorType *SubTp) = 0;
+      VectorType *SubTp,
+      ArrayRef<Value *> Args) = 0;
   virtual InstructionCost getCastInstrCost(unsigned Opcode, Type *Dst,
       Type *Src, CastContextHint CCH,
       TTI::TargetCostKind CostKind,
       const Instruction *I) = 0;
   virtual InstructionCost getExtractWithExtendCost(unsigned Opcode, Type *Dst,
       VectorType *VecTy,
       unsigned Index) = 0;
   virtual InstructionCost getCFInstrCost(unsigned Opcode,
@@ ... @@ bool isLegalMaskedLoad(Type *DataType, Align Alignment) override {
     return Impl.isLegalMaskedLoad(DataType, Alignment);
   }
   bool isLegalNTStore(Type *DataType, Align Alignment) override {
     return Impl.isLegalNTStore(DataType, Alignment);
   }
   bool isLegalNTLoad(Type *DataType, Align Alignment) override {
     return Impl.isLegalNTLoad(DataType, Alignment);
   }
+  bool isLegalBroadcastLoad(Type *ElementTy,
+      unsigned NumElements) const override {
+    return Impl.isLegalBroadcastLoad(ElementTy, NumElements);
+  }
   bool isLegalMaskedScatter(Type *DataType, Align Alignment) override {
     return Impl.isLegalMaskedScatter(DataType, Alignment);
   }
   bool isLegalMaskedGather(Type *DataType, Align Alignment) override {
     return Impl.isLegalMaskedGather(DataType, Alignment);
   }
   bool forceScalarizeMaskedGather(VectorType *DataType,
       Align Alignment) override {
@@ ... @@ InstructionCost getArithmeticInstrCost(
       OperandValueProperties Opd1PropInfo, OperandValueProperties Opd2PropInfo,
       ArrayRef<const Value *> Args,
       const Instruction *CxtI = nullptr) override {
     return Impl.getArithmeticInstrCost(Opcode, Ty, CostKind, Opd1Info, Opd2Info,
         Opd1PropInfo, Opd2PropInfo, Args, CxtI);
   }
   InstructionCost getShuffleCost(ShuffleKind Kind, VectorType *Tp,
       ArrayRef<int> Mask, int Index,
-      VectorType *SubTp) override {
-    return Impl.getShuffleCost(Kind, Tp, Mask, Index, SubTp);
+      VectorType *SubTp,
+      ArrayRef<Value *> Args) override {
+    return Impl.getShuffleCost(Kind, Tp, Mask, Index, SubTp, Args);
   }
   InstructionCost getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src,
       CastContextHint CCH,
       TTI::TargetCostKind CostKind,
       const Instruction *I) override {
     return Impl.getCastInstrCost(Opcode, Dst, Src, CCH, CostKind, I);
   }
   InstructionCost getExtractWithExtendCost(unsigned Opcode, Type *Dst,

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

@@ ... @@ public:

   bool isLegalNTLoad(Type *DataType, Align Alignment) const {
     // By default, assume nontemporal memory loads are available for loads that
     // are aligned and have a size that is a power of 2.
     unsigned DataSize = DL.getTypeStoreSize(DataType);
     return Alignment >= DataSize && isPowerOf2_32(DataSize);
   }

+  bool isLegalBroadcastLoad(Type *ElementTy, unsigned NumElements) const {
+    return false;
+  }
+
   bool isLegalMaskedScatter(Type *DataType, Align Alignment) const {
     return false;
   }

   bool isLegalMaskedGather(Type *DataType, Align Alignment) const {
     return false;
   }

@@ ... @@ case Instruction::URem:
       // FIXME: Unlikely to be true for CodeSize.
       return TTI::TCC_Expensive;
     }
     return 1;
   }

   InstructionCost getShuffleCost(TTI::ShuffleKind Kind, VectorType *Ty,
       ArrayRef<int> Mask, int Index,
-      VectorType *SubTp) const {
+      VectorType *SubTp,
+      ArrayRef<Value *> Args = None) const {
     return 1;
   }

   InstructionCost getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src,
       TTI::CastContextHint CCH,
       TTI::TargetCostKind CostKind,
       const Instruction *I) const {
     switch (Opcode) {

llvm/include/llvm/CodeGen/BasicTTIImpl.h

@@ ... @@ TTI::ShuffleKind improveShuffleKindFromMask(TTI::ShuffleKind Kind,
     case TTI::SK_Splice:
       break;
     }
     return Kind;
   }

   InstructionCost getShuffleCost(TTI::ShuffleKind Kind, VectorType *Tp,
       ArrayRef<int> Mask, int Index,
-      VectorType *SubTp) {
+      VectorType *SubTp,
+      ArrayRef<Value *> Args = None) {

     switch (improveShuffleKindFromMask(Kind, Mask)) {
     case TTI::SK_Broadcast:
       if (auto *FVT = dyn_cast<FixedVectorType>(Tp))
         return getBroadcastShuffleOverhead(FVT);
       return InstructionCost::getInvalid();
     case TTI::SK_Select:
     case TTI::SK_Splice:

llvm/lib/Analysis/TargetTransformInfo.cpp

Show First 20 Lines • Show All 390 Lines • ▼ Show 20 Lines	bool TargetTransformInfo::isLegalNTStore(Type *DataType,
Align Alignment) const {		Align Alignment) const {
return TTIImpl->isLegalNTStore(DataType, Alignment);		return TTIImpl->isLegalNTStore(DataType, Alignment);
}		}

bool TargetTransformInfo::isLegalNTLoad(Type *DataType, Align Alignment) const {		bool TargetTransformInfo::isLegalNTLoad(Type *DataType, Align Alignment) const {
return TTIImpl->isLegalNTLoad(DataType, Alignment);		return TTIImpl->isLegalNTLoad(DataType, Alignment);
}		}

		bool TargetTransformInfo::isLegalBroadcastLoad(Type *ElementTy,
		unsigned NumElements) const {
		return TTIImpl->isLegalBroadcastLoad(ElementTy, NumElements);
		}

bool TargetTransformInfo::isLegalMaskedGather(Type *DataType,		bool TargetTransformInfo::isLegalMaskedGather(Type *DataType,
Align Alignment) const {		Align Alignment) const {
return TTIImpl->isLegalMaskedGather(DataType, Alignment);		return TTIImpl->isLegalMaskedGather(DataType, Alignment);
}		}

bool TargetTransformInfo::isLegalMaskedScatter(Type *DataType,		bool TargetTransformInfo::isLegalMaskedScatter(Type *DataType,
Align Alignment) const {		Align Alignment) const {
return TTIImpl->isLegalMaskedScatter(DataType, Alignment);		return TTIImpl->isLegalMaskedScatter(DataType, Alignment);
▲ Show 20 Lines • Show All 328 Lines • ▼ Show 20 Lines	InstructionCost TargetTransformInfo::getArithmeticInstrCost(
ArrayRef<const Value *> Args, const Instruction *CxtI) const {		ArrayRef<const Value *> Args, const Instruction *CxtI) const {
InstructionCost Cost =		InstructionCost Cost =
TTIImpl->getArithmeticInstrCost(Opcode, Ty, CostKind, Opd1Info, Opd2Info,		TTIImpl->getArithmeticInstrCost(Opcode, Ty, CostKind, Opd1Info, Opd2Info,
Opd1PropInfo, Opd2PropInfo, Args, CxtI);		Opd1PropInfo, Opd2PropInfo, Args, CxtI);
assert(Cost >= 0 && "TTI should not produce negative costs!");		assert(Cost >= 0 && "TTI should not produce negative costs!");
return Cost;		return Cost;
}		}

InstructionCost TargetTransformInfo::getShuffleCost(ShuffleKind Kind,		InstructionCost TargetTransformInfo::getShuffleCost(
VectorType *Ty,		ShuffleKind Kind, VectorType *Ty, ArrayRef<int> Mask, int Index,
ArrayRef<int> Mask,		VectorType *SubTp, ArrayRef<Value *> Args) const {
int Index,		InstructionCost Cost =
VectorType *SubTp) const {		TTIImpl->getShuffleCost(Kind, Ty, Mask, Index, SubTp, Args);
InstructionCost Cost = TTIImpl->getShuffleCost(Kind, Ty, Mask, Index, SubTp);
assert(Cost >= 0 && "TTI should not produce negative costs!");		assert(Cost >= 0 && "TTI should not produce negative costs!");
return Cost;		return Cost;
}		}

TTI::CastContextHint		TTI::CastContextHint
TargetTransformInfo::getCastContextHint(const Instruction *I) {		TargetTransformInfo::getCastContextHint(const Instruction *I) {
if (!I)		if (!I)
return CastContextHint::None;		return CastContextHint::None;

llvm/lib/Target/X86/X86TargetTransformInfo.h

Show First 20 Lines • Show All 125 Lines • ▼ Show 20 Lines	InstructionCost getArithmeticInstrCost(
TTI::OperandValueKind Opd1Info = TTI::OK_AnyValue,		TTI::OperandValueKind Opd1Info = TTI::OK_AnyValue,
TTI::OperandValueKind Opd2Info = TTI::OK_AnyValue,		TTI::OperandValueKind Opd2Info = TTI::OK_AnyValue,
TTI::OperandValueProperties Opd1PropInfo = TTI::OP_None,		TTI::OperandValueProperties Opd1PropInfo = TTI::OP_None,
TTI::OperandValueProperties Opd2PropInfo = TTI::OP_None,		TTI::OperandValueProperties Opd2PropInfo = TTI::OP_None,
ArrayRef<const Value *> Args = ArrayRef<const Value *>(),		ArrayRef<const Value *> Args = ArrayRef<const Value *>(),
const Instruction *CxtI = nullptr);		const Instruction *CxtI = nullptr);
InstructionCost getShuffleCost(TTI::ShuffleKind Kind, VectorType *Tp,		InstructionCost getShuffleCost(TTI::ShuffleKind Kind, VectorType *Tp,
ArrayRef<int> Mask, int Index,		ArrayRef<int> Mask, int Index,
VectorType *SubTp);		VectorType *SubTp, ArrayRef<Value *> = None);
InstructionCost getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src,		InstructionCost getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src,
TTI::CastContextHint CCH,		TTI::CastContextHint CCH,
TTI::TargetCostKind CostKind,		TTI::TargetCostKind CostKind,
const Instruction *I = nullptr);		const Instruction *I = nullptr);
InstructionCost getCmpSelInstrCost(unsigned Opcode, Type *ValTy, Type *CondTy,		InstructionCost getCmpSelInstrCost(unsigned Opcode, Type *ValTy, Type *CondTy,
CmpInst::Predicate VecPred,		CmpInst::Predicate VecPred,
TTI::TargetCostKind CostKind,		TTI::TargetCostKind CostKind,
const Instruction *I = nullptr);		const Instruction *I = nullptr);
▲ Show 20 Lines • Show All 78 Lines • ▼ Show 20 Lines	InstructionCost getIntImmCostIntrin(Intrinsic::ID IID, unsigned Idx,
TTI::TargetCostKind CostKind);		TTI::TargetCostKind CostKind);
bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,		bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,
TargetTransformInfo::LSRCost &C2);		TargetTransformInfo::LSRCost &C2);
bool canMacroFuseCmp();		bool canMacroFuseCmp();
bool isLegalMaskedLoad(Type *DataType, Align Alignment);		bool isLegalMaskedLoad(Type *DataType, Align Alignment);
bool isLegalMaskedStore(Type *DataType, Align Alignment);		bool isLegalMaskedStore(Type *DataType, Align Alignment);
bool isLegalNTLoad(Type *DataType, Align Alignment);		bool isLegalNTLoad(Type *DataType, Align Alignment);
bool isLegalNTStore(Type *DataType, Align Alignment);		bool isLegalNTStore(Type *DataType, Align Alignment);
		bool isLegalBroadcastLoad(Type *ElementTy, unsigned NumElements) const;
bool forceScalarizeMaskedGather(VectorType *VTy, Align Alignment);		bool forceScalarizeMaskedGather(VectorType *VTy, Align Alignment);
bool forceScalarizeMaskedScatter(VectorType *VTy, Align Alignment) {		bool forceScalarizeMaskedScatter(VectorType *VTy, Align Alignment) {
return forceScalarizeMaskedGather(VTy, Alignment);		return forceScalarizeMaskedGather(VTy, Alignment);
}		}
bool isLegalMaskedGather(Type *DataType, Align Alignment);		bool isLegalMaskedGather(Type *DataType, Align Alignment);
bool isLegalMaskedScatter(Type *DataType, Align Alignment);		bool isLegalMaskedScatter(Type *DataType, Align Alignment);
bool isLegalMaskedExpandLoad(Type *DataType);		bool isLegalMaskedExpandLoad(Type *DataType);
bool isLegalMaskedCompressStore(Type *DataType);		bool isLegalMaskedCompressStore(Type *DataType);

llvm/lib/Target/X86/X86TargetTransformInfo.cpp

Show First 20 Lines • Show All 1,079 Lines • ▼ Show 20 Lines	InstructionCost X86TTIImpl::getArithmeticInstrCost(

// Fallback to the default implementation.		// Fallback to the default implementation.
return BaseT::getArithmeticInstrCost(Opcode, Ty, CostKind, Op1Info, Op2Info);		return BaseT::getArithmeticInstrCost(Opcode, Ty, CostKind, Op1Info, Op2Info);
}		}

InstructionCost X86TTIImpl::getShuffleCost(TTI::ShuffleKind Kind,		InstructionCost X86TTIImpl::getShuffleCost(TTI::ShuffleKind Kind,
VectorType *BaseTp,		VectorType *BaseTp,
ArrayRef<int> Mask, int Index,		ArrayRef<int> Mask, int Index,
VectorType *SubTp) {		VectorType *SubTp,
		ArrayRef<Value *> Args) {
// 64-bit packed float vectors (v2f32) are widened to type v4f32.		// 64-bit packed float vectors (v2f32) are widened to type v4f32.
// 64-bit packed integer vectors (v2i32) are widened to type v4i32.		// 64-bit packed integer vectors (v2i32) are widened to type v4i32.
std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, BaseTp);		std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, BaseTp);

Kind = improveShuffleKindFromMask(Kind, Mask);		Kind = improveShuffleKindFromMask(Kind, Mask);
// Treat Transpose as 2-op shuffles - there's no difference in lowering.		// Treat Transpose as 2-op shuffles - there's no difference in lowering.
if (Kind == TTI::SK_Transpose)		if (Kind == TTI::SK_Transpose)
Kind = TTI::SK_PermuteTwoSrc;		Kind = TTI::SK_PermuteTwoSrc;
▲ Show 20 Lines • Show All 443 Lines • ▼ Show 20 Lines	static const CostTblEntry SSE2ShuffleTbl[] = {

{ TTI::SK_PermuteTwoSrc, MVT::v2f64, 1 }, // shufpd		{ TTI::SK_PermuteTwoSrc, MVT::v2f64, 1 }, // shufpd
{ TTI::SK_PermuteTwoSrc, MVT::v2i64, 1 }, // shufpd		{ TTI::SK_PermuteTwoSrc, MVT::v2i64, 1 }, // shufpd
{ TTI::SK_PermuteTwoSrc, MVT::v4i32, 2 }, // 2*{unpck,movsd,pshufd}		{ TTI::SK_PermuteTwoSrc, MVT::v4i32, 2 }, // 2*{unpck,movsd,pshufd}
{ TTI::SK_PermuteTwoSrc, MVT::v8i16, 8 }, // blend+permute		{ TTI::SK_PermuteTwoSrc, MVT::v8i16, 8 }, // blend+permute
{ TTI::SK_PermuteTwoSrc, MVT::v16i8, 13 }, // blend+permute		{ TTI::SK_PermuteTwoSrc, MVT::v16i8, 13 }, // blend+permute
};		};

if (ST->hasSSE2())		static const CostTblEntry SSE3BroadcastLoadTbl[] = {
		{TTI::SK_Broadcast, MVT::v2f64, 0}, // broadcast handled by movddup
		};

		if (ST->hasSSE2()) {
		bool IsLoad = !Args.empty() && llvm::all_of(Args, [](const Value *V) {
		return isa<LoadInst>(V);
		});
		if (ST->hasSSE3() && IsLoad)
		if (const auto *Entry =
		CostTableLookup(SSE3BroadcastLoadTbl, Kind, LT.second)) {
		RKSimon: Won't this fail on pre-SSSE3 targets?
		vporpo: Good catch, I added a check for SSE3.
		ABataev: Maybe rename it to `SSE3BroadcastLoadTbl`?
		assert(isLegalBroadcastLoad(
		BaseTp->getElementType(),
		cast<FixedVectorType>(BaseTp)->getNumElements()) &&
		"Table entry missing from isLegalBroadcastLoad()");
		return LT.first * Entry->Cost;
		}

if (const auto *Entry = CostTableLookup(SSE2ShuffleTbl, Kind, LT.second))		if (const auto *Entry = CostTableLookup(SSE2ShuffleTbl, Kind, LT.second))
return LT.first * Entry->Cost;		return LT.first * Entry->Cost;
		}

static const CostTblEntry SSE1ShuffleTbl[] = {		static const CostTblEntry SSE1ShuffleTbl[] = {
{ TTI::SK_Broadcast, MVT::v4f32, 1 }, // shufps		{ TTI::SK_Broadcast, MVT::v4f32, 1 }, // shufps
{ TTI::SK_Reverse, MVT::v4f32, 1 }, // shufps		{ TTI::SK_Reverse, MVT::v4f32, 1 }, // shufps
{ TTI::SK_Select, MVT::v4f32, 2 }, // 2*shufps		{ TTI::SK_Select, MVT::v4f32, 2 }, // 2*shufps
{ TTI::SK_PermuteSingleSrc, MVT::v4f32, 1 }, // shufps		{ TTI::SK_PermuteSingleSrc, MVT::v4f32, 1 }, // shufps
{ TTI::SK_PermuteTwoSrc, MVT::v4f32, 2 }, // 2*shufps		{ TTI::SK_PermuteTwoSrc, MVT::v4f32, 2 }, // 2*shufps
};		};
▲ Show 20 Lines • Show All 3,548 Lines • ▼ Show 20 Lines	bool X86TTIImpl::isLegalNTStore(Type *DataType, Align Alignment) {
// loads require AVX2).		// loads require AVX2).
if (DataSize == 32)		if (DataSize == 32)
return ST->hasAVX();		return ST->hasAVX();
if (DataSize == 16)		if (DataSize == 16)
return ST->hasSSE1();		return ST->hasSSE1();
return true;		return true;
}		}

		bool X86TTIImpl::isLegalBroadcastLoad(Type *ElementTy,
		unsigned NumElements) const {
		// movddup
		ABataev: I assume you need to tweak the cost model for broadcast with loads.
		return ST->hasSSSE3() && NumElements == 2 &&
		ElementTy == Type::getDoubleTy(ElementTy->getContext());
		}

		ABataev: `return ST->hasSSSE3() && VecTy && VecTy->getElementCount().getKnownMinValue() == 2 && Ty->getElementType() == Type::getDoubleTy(Ty->getContext());`
bool X86TTIImpl::isLegalMaskedExpandLoad(Type *DataTy) {		bool X86TTIImpl::isLegalMaskedExpandLoad(Type *DataTy) {
if (!isa<VectorType>(DataTy))		if (!isa<VectorType>(DataTy))
return false;		return false;

if (!ST->hasAVX512())		if (!ST->hasAVX512())
return false;		return false;

// The backend can't handle a single element vector.		// The backend can't handle a single element vector.
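The broadcast-load special case in `X86TTIImpl::getShuffleCost` above can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `ElemKind`, `VecKind`, `GenericBroadcastCost` and `broadcastCostModel` are hypothetical names that are not part of LLVM, the type-legalization multiplier (`LT.first`) is ignored, and the fallback cost of 1 is a placeholder because the relevant SSE2 table entry is not visible in this view.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical classification of the scalars being broadcast.
enum class ElemKind { Load, Other };
// Hypothetical stand-in for the legalized vector type.
enum class VecKind { v2f64, v4f32 };

// Assumed placeholder for the generic broadcast cost; the real value comes
// from a cost table that is elided in the diff above.
constexpr int GenericBroadcastCost = 1;

// Simplified model of the new lookup: a broadcast whose operands are all
// loads is treated as free for v2f64 when SSE3 is available, because the
// load and the broadcast fold into a single movddup.
inline int broadcastCostModel(VecKind Ty, const std::vector<ElemKind> &Args,
                              bool HasSSE3) {
  // Mirrors the IsLoad check in the patch: a non-empty operand list in which
  // every operand is a load.
  bool IsLoad = !Args.empty() &&
                std::all_of(Args.begin(), Args.end(),
                            [](ElemKind K) { return K == ElemKind::Load; });
  if (HasSSE3 && IsLoad && Ty == VecKind::v2f64)
    return 0; // Models the single SSE3BroadcastLoadTbl entry.
  return GenericBroadcastCost;
}
```

Callers that pass no operands get the generic cost, which matches the default `Args = None` behaviour in the patch.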

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp


Show First 20 Lines • Show All 1,136 Lines • ▼ Show 20 Lines	class VLOperands {
// However, sometimes we have to break ties. For example we may have to		// However, sometimes we have to break ties. For example we may have to
// choose between matching loads vs matching opcodes. This is what these		// choose between matching loads vs matching opcodes. This is what these
// scores are helping us with: they provide the order of preference. Also,		// scores are helping us with: they provide the order of preference. Also,
// this is important if the scalar is externally used or used in another		// this is important if the scalar is externally used or used in another
// tree entry node in the different lane.		// tree entry node in the different lane.

/// Loads from consecutive memory addresses, e.g. load(A[i]), load(A[i+1]).		/// Loads from consecutive memory addresses, e.g. load(A[i]), load(A[i+1]).
static const int ScoreConsecutiveLoads = 4;		static const int ScoreConsecutiveLoads = 4;
		/// The same load multiple times. This should have a better score than
		/// `ScoreSplat` because in x86 for a 2-lane vector we can represent it
		/// with `movddup (%reg), xmm0` which has a throughput of 0.5 versus 0.5 for
		ABataev: This is target specific for x86. What about other targets?
		vporpo: Please see the discussion below.
		ABataev: We don't care about the instruction count for SLP, but the throughput.
		vporpo: Agreed, but usually code with fewer instructions is better for various reasons. How would you want me to rephrase this?
		ABataev: Use the throughput, not number of instructions.
		/// a vector load and 1.0 for a broadcast.
		static const int ScoreSplatLoads = 3;
/// Loads from reversed memory addresses, e.g. load(A[i+1]), load(A[i]).		/// Loads from reversed memory addresses, e.g. load(A[i+1]), load(A[i]).
static const int ScoreReversedLoads = 3;		static const int ScoreReversedLoads = 3;
/// ExtractElementInst from same vector and consecutive indexes.		/// ExtractElementInst from same vector and consecutive indexes.
static const int ScoreConsecutiveExtracts = 4;		static const int ScoreConsecutiveExtracts = 4;
/// ExtractElementInst from same vector and reversed indices.		/// ExtractElementInst from same vector and reversed indices.
static const int ScoreReversedExtracts = 3;		static const int ScoreReversedExtracts = 3;
/// Constants.		/// Constants.
static const int ScoreConstants = 2;		static const int ScoreConstants = 2;
Show All 10 Lines	class VLOperands {
/// Score if all users are vectorized.		/// Score if all users are vectorized.
static const int ScoreAllUserVectorized = 1;		static const int ScoreAllUserVectorized = 1;

/// \returns the score of placing \p V1 and \p V2 in consecutive lanes.		/// \returns the score of placing \p V1 and \p V2 in consecutive lanes.
/// Also, checks if \p V1 and \p V2 are compatible with instructions in \p		/// Also, checks if \p V1 and \p V2 are compatible with instructions in \p
/// MainAltOps.		/// MainAltOps.
static int getShallowScore(Value *V1, Value *V2, const DataLayout &DL,		static int getShallowScore(Value *V1, Value *V2, const DataLayout &DL,
ScalarEvolution &SE, int NumLanes,		ScalarEvolution &SE, int NumLanes,
ArrayRef<Value *> MainAltOps) {		ArrayRef<Value *> MainAltOps,
if (V1 == V2)		const TargetTransformInfo *TTI) {
		if (V1 == V2) {
		if (isa<LoadInst>(V1)) {
		// A broadcast of a load can be cheaper on some targets.
		// TODO: For now accept a broadcast load with no other internal uses.
		ABataev: Probably, need to check also for number of uses + external uses.
		vporpo: Good point, I added a check which should only allow this if we don't have any internal/external uses. I will add checks for internal uses in a follow up patch, because it requires more changes + tests.
		if (TTI->isLegalBroadcastLoad(V1->getType(), NumLanes) &&
		ABataev: Maybe pass a scalar type and number of elements to avoid constructing vector type?
		(int)V1->getNumUses() == NumLanes)
		return VLOperands::ScoreSplatLoads;
		}
return VLOperands::ScoreSplat;		return VLOperands::ScoreSplat;
		}

auto *LI1 = dyn_cast<LoadInst>(V1);		auto *LI1 = dyn_cast<LoadInst>(V1);
		RKSimon: Why test both V1 and V2 if we know they are the same value?
		lebedev.ri: This should be a lot more principled than this (i.e., query the actual cost model). E.g., `movddup` is only for `double`, but I think this will also trigger for `float`? Also, what about broadcasting loads that are available in AVX/AVX2/AVX512?
		vporpo: I agree, this needs to be target specific. I could add a new `ShuffleKind`, something like `SK_BroadcastLoad`. But this is not strictly a shuffle pattern, it is a combined Load + Shuffle. So this might need its own function, like `TargetTransformInfo::getBroadcastLoadCost(VectorType *VecTy)`. What do you think?
		RKSimon: What about adding an IsLoad bool flag to the TTI::getShuffleCost? Or a TTI::IsLegalBroadcastLoad (similar to the existing IsLegalLoad ops)?
		vporpo: Yeah, something like `TTI::IsLegalBroadcastLoad` may be more suitable for this since the score that we are using in the reordering is not an actual TTI cost.
auto *LI2 = dyn_cast<LoadInst>(V2);		auto *LI2 = dyn_cast<LoadInst>(V2);
if (LI1 && LI2) {		if (LI1 && LI2) {
if (LI1->getParent() != LI2->getParent())		if (LI1->getParent() != LI2->getParent())
return VLOperands::ScoreFail;		return VLOperands::ScoreFail;

Optional<int> Dist = getPointersDiff(		Optional<int> Dist = getPointersDiff(
LI1->getType(), LI1->getPointerOperand(), LI2->getType(),		LI1->getType(), LI1->getPointerOperand(), LI2->getType(),
LI2->getPointerOperand(), DL, SE, /*StrictCheck=*/true);		LI2->getPointerOperand(), DL, SE, /*StrictCheck=*/true);
▲ Show 20 Lines • Show All 160 Lines • ▼ Show 20 Lines	class VLOperands {
/// Look-ahead SLP: Auto-vectorization in the presence of commutative		/// Look-ahead SLP: Auto-vectorization in the presence of commutative
/// operations, CGO 2018 by Vasileios Porpodas, Rodrigo C. O. Rocha,		/// operations, CGO 2018 by Vasileios Porpodas, Rodrigo C. O. Rocha,
/// Luís F. W. Góes		/// Luís F. W. Góes
int getScoreAtLevelRec(Value *LHS, Value *RHS, int CurrLevel, int MaxLevel,		int getScoreAtLevelRec(Value *LHS, Value *RHS, int CurrLevel, int MaxLevel,
ArrayRef<Value *> MainAltOps) {		ArrayRef<Value *> MainAltOps) {

// Get the shallow score of V1 and V2.		// Get the shallow score of V1 and V2.
int ShallowScoreAtThisLevel =		int ShallowScoreAtThisLevel =
getShallowScore(LHS, RHS, DL, SE, getNumLanes(), MainAltOps);		getShallowScore(LHS, RHS, DL, SE, getNumLanes(), MainAltOps, R.TTI);

// If reached MaxLevel,		// If reached MaxLevel,
// or if V1 and V2 are not instructions,		// or if V1 and V2 are not instructions,
// or if they are SPLAT,		// or if they are SPLAT,
// or if they are not consecutive,		// or if they are not consecutive,
// or if profitable to vectorize loads or extractelements, early return		// or if profitable to vectorize loads or extractelements, early return
// the current cost.		// the current cost.
auto *I1 = dyn_cast<Instruction>(LHS);		auto *I1 = dyn_cast<Instruction>(LHS);
▲ Show 20 Lines • Show All 3,872 Lines • ▼ Show 20 Lines	if ((E->getOpcode() == Instruction::ExtractElement ||
return Cost;		return Cost;
}		}
}		}
if (isSplat(VL)) {		if (isSplat(VL)) {
// Found the broadcasting of the single scalar, calculate the cost as the		// Found the broadcasting of the single scalar, calculate the cost as the
// broadcast.		// broadcast.
assert(VecTy == FinalVecTy &&		assert(VecTy == FinalVecTy &&
"No reused scalars expected for broadcast.");		"No reused scalars expected for broadcast.");
return TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy);		return TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy,
		/*Mask=*/None, /*Index=*/0,
		/*SubTp=*/nullptr, /*Args=*/VL);
}		}
InstructionCost ReuseShuffleCost = 0;		InstructionCost ReuseShuffleCost = 0;
if (NeedToShuffleReuses)		if (NeedToShuffleReuses)
ReuseShuffleCost = TTI->getShuffleCost(		ReuseShuffleCost = TTI->getShuffleCost(
TTI::SK_PermuteSingleSrc, FinalVecTy, E->ReuseShuffleIndices);		TTI::SK_PermuteSingleSrc, FinalVecTy, E->ReuseShuffleIndices);
// Improve gather cost for gather of loads, if we can group some of the		// Improve gather cost for gather of loads, if we can group some of the
// loads into vector loads.		// loads into vector loads.
if (VL.size() > 2 && E->getOpcode() == Instruction::Load &&		if (VL.size() > 2 && E->getOpcode() == Instruction::Load &&
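The same-value branch that this patch adds to `getShallowScore` can be summarized with a standalone model. It is a minimal sketch, not the LLVM implementation: `OperandInfo`, `isLegalBroadcastLoadModel` and `scoreSameValue` are hypothetical names, the `HasFeature` flag stands in for the subtarget check, and `ScoreSplat = 1` is an assumption because its definition is elided in this view (`ScoreSplatLoads = 3` is taken from the diff).

```cpp
#include <cassert>

// ScoreSplatLoads matches the constant added by the patch. ScoreSplat is not
// visible in this view, so the value 1 is an assumption.
constexpr int ScoreSplatLoads = 3;
constexpr int ScoreSplat = 1;

// Hypothetical, simplified description of the scalar placed in every lane.
struct OperandInfo {
  bool IsLoad;   // The scalar is produced by a load instruction.
  bool IsDouble; // The scalar type is double.
  int NumUses;   // Total number of uses of the scalar.
};

// Simplified model of the X86 hook: movddup only covers a 2 x double
// broadcast and requires the relevant subtarget feature.
constexpr bool isLegalBroadcastLoadModel(bool IsDouble, int NumLanes,
                                         bool HasFeature) {
  return HasFeature && NumLanes == 2 && IsDouble;
}

// Score for the V1 == V2 case: prefer the splat-load score only when the
// target can fold the broadcast into the load and the load has no uses other
// than the lanes being filled.
inline int scoreSameValue(const OperandInfo &Op, int NumLanes,
                          bool HasFeature) {
  assert(NumLanes > 0 && "expected at least one lane");
  if (Op.IsLoad &&
      isLegalBroadcastLoadModel(Op.IsDouble, NumLanes, HasFeature) &&
      Op.NumUses == NumLanes)
    return ScoreSplatLoads;
  return ScoreSplat;
}
```

The use-count comparison models the conservative check discussed in the review: a load with additional uses falls back to the ordinary splat score.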

llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll

Show First 20 Lines • Show All 585 Lines • ▼ Show 20 Lines	;
ret i1 %cmp.i185		ret i1 %cmp.i185
}		}

; Same as @ChecksExtractScores, but the extractelement vector operands do not match.		; Same as @ChecksExtractScores, but the extractelement vector operands do not match.
define void @ChecksExtractScores_different_vectors(double* %storeArray, double* %array, <2 x double>* %vecPtr1, <2 x double>* %vecPtr2, <2 x double>* %vecPtr3, <2 x double>* %vecPtr4) {		define void @ChecksExtractScores_different_vectors(double* %storeArray, double* %array, <2 x double>* %vecPtr1, <2 x double>* %vecPtr2, <2 x double>* %vecPtr3, <2 x double>* %vecPtr4) {
; CHECK-LABEL: @ChecksExtractScores_different_vectors(		; CHECK-LABEL: @ChecksExtractScores_different_vectors(
; CHECK-NEXT: [[IDX0:%.*]] = getelementptr inbounds double, double* [[ARRAY:%.*]], i64 0		; CHECK-NEXT: [[IDX0:%.*]] = getelementptr inbounds double, double* [[ARRAY:%.*]], i64 0
; CHECK-NEXT: [[IDX1:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 1		; CHECK-NEXT: [[IDX1:%.*]] = getelementptr inbounds double, double* [[ARRAY]], i64 1
; CHECK-NEXT: [[TMP1:%.*]] = bitcast double* [[IDX0]] to <2 x double>*		; CHECK-NEXT: [[LOADA0:%.*]] = load double, double* [[IDX0]], align 4
; CHECK-NEXT: [[TMP2:%.*]] = load <2 x double>, <2 x double>* [[TMP1]], align 4		; CHECK-NEXT: [[LOADA1:%.*]] = load double, double* [[IDX1]], align 4
; CHECK-NEXT: [[LOADVEC:%.*]] = load <2 x double>, <2 x double>* [[VECPTR1:%.*]], align 4		; CHECK-NEXT: [[LOADVEC:%.*]] = load <2 x double>, <2 x double>* [[VECPTR1:%.*]], align 4
; CHECK-NEXT: [[LOADVEC2:%.*]] = load <2 x double>, <2 x double>* [[VECPTR2:%.*]], align 4		; CHECK-NEXT: [[LOADVEC2:%.*]] = load <2 x double>, <2 x double>* [[VECPTR2:%.*]], align 4
; CHECK-NEXT: [[EXTRA0:%.*]] = extractelement <2 x double> [[LOADVEC]], i32 0		; CHECK-NEXT: [[EXTRA0:%.*]] = extractelement <2 x double> [[LOADVEC]], i32 0
; CHECK-NEXT: [[EXTRA1:%.*]] = extractelement <2 x double> [[LOADVEC2]], i32 1		; CHECK-NEXT: [[EXTRA1:%.*]] = extractelement <2 x double> [[LOADVEC2]], i32 1
; CHECK-NEXT: [[LOADVEC3:%.*]] = load <2 x double>, <2 x double>* [[VECPTR3:%.*]], align 4		; CHECK-NEXT: [[LOADVEC3:%.*]] = load <2 x double>, <2 x double>* [[VECPTR3:%.*]], align 4
; CHECK-NEXT: [[LOADVEC4:%.*]] = load <2 x double>, <2 x double>* [[VECPTR4:%.*]], align 4		; CHECK-NEXT: [[LOADVEC4:%.*]] = load <2 x double>, <2 x double>* [[VECPTR4:%.*]], align 4
; CHECK-NEXT: [[EXTRB0:%.*]] = extractelement <2 x double> [[LOADVEC3]], i32 0		; CHECK-NEXT: [[EXTRB0:%.*]] = extractelement <2 x double> [[LOADVEC3]], i32 0
; CHECK-NEXT: [[EXTRB1:%.*]] = extractelement <2 x double> [[LOADVEC4]], i32 1		; CHECK-NEXT: [[EXTRB1:%.*]] = extractelement <2 x double> [[LOADVEC4]], i32 1
; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> poison, double [[EXTRA1]], i32 0		; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> poison, double [[EXTRA0]], i32 0
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> [[TMP3]], double [[EXTRB0]], i32 1		; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> [[TMP1]], double [[EXTRA1]], i32 1
; CHECK-NEXT: [[TMP5:%.*]] = fmul <2 x double> [[TMP4]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> poison, double [[LOADA0]], i32 0
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP5]], <2 x double> poison, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> [[TMP3]], double [[LOADA0]], i32 1
; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> poison, double [[EXTRA0]], i32 0		; CHECK-NEXT: [[TMP5:%.*]] = fmul <2 x double> [[TMP2]], [[TMP4]]
		; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> poison, double [[EXTRB0]], i32 0
; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> [[TMP6]], double [[EXTRB1]], i32 1		; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> [[TMP6]], double [[EXTRB1]], i32 1
; CHECK-NEXT: [[TMP8:%.*]] = fmul <2 x double> [[TMP7]], [[TMP2]]		; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x double> poison, double [[LOADA1]], i32 0
; CHECK-NEXT: [[TMP9:%.*]] = fadd <2 x double> [[SHUFFLE]], [[TMP8]]		; CHECK-NEXT: [[TMP9:%.*]] = insertelement <2 x double> [[TMP8]], double [[LOADA1]], i32 1
		; CHECK-NEXT: [[TMP10:%.*]] = fmul <2 x double> [[TMP7]], [[TMP9]]
		; CHECK-NEXT: [[TMP11:%.*]] = fadd <2 x double> [[TMP5]], [[TMP10]]
; CHECK-NEXT: [[SIDX0:%.*]] = getelementptr inbounds double, double* [[STOREARRAY:%.*]], i64 0		; CHECK-NEXT: [[SIDX0:%.*]] = getelementptr inbounds double, double* [[STOREARRAY:%.*]], i64 0
; CHECK-NEXT: [[SIDX1:%.*]] = getelementptr inbounds double, double* [[STOREARRAY]], i64 1		; CHECK-NEXT: [[SIDX1:%.*]] = getelementptr inbounds double, double* [[STOREARRAY]], i64 1
; CHECK-NEXT: [[TMP10:%.*]] = bitcast double* [[SIDX0]] to <2 x double>*		; CHECK-NEXT: [[TMP12:%.*]] = bitcast double* [[SIDX0]] to <2 x double>*
; CHECK-NEXT: store <2 x double> [[TMP9]], <2 x double>* [[TMP10]], align 8		; CHECK-NEXT: store <2 x double> [[TMP11]], <2 x double>* [[TMP12]], align 8
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%idx0 = getelementptr inbounds double, double* %array, i64 0		%idx0 = getelementptr inbounds double, double* %array, i64 0
%idx1 = getelementptr inbounds double, double* %array, i64 1		%idx1 = getelementptr inbounds double, double* %array, i64 1
%loadA0 = load double, double* %idx0, align 4		%loadA0 = load double, double* %idx0, align 4
%loadA1 = load double, double* %idx1, align 4		%loadA1 = load double, double* %idx1, align 4

%loadVec = load <2 x double>, <2 x double>* %vecPtr1, align 4		%loadVec = load <2 x double>, <2 x double>* %vecPtr1, align 4
Show All 25 Lines
; CHECK-LABEL: @splat_loads(		; CHECK-LABEL: @splat_loads(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[GEP_1_0:%.*]] = getelementptr inbounds double, double* [[ARRAY1:%.*]], i64 0		; CHECK-NEXT: [[GEP_1_0:%.*]] = getelementptr inbounds double, double* [[ARRAY1:%.*]], i64 0
; CHECK-NEXT: [[GEP_1_1:%.*]] = getelementptr inbounds double, double* [[ARRAY1]], i64 1		; CHECK-NEXT: [[GEP_1_1:%.*]] = getelementptr inbounds double, double* [[ARRAY1]], i64 1
; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[GEP_1_0]] to <2 x double>*		; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[GEP_1_0]] to <2 x double>*
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8		; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8
; CHECK-NEXT: [[GEP_2_0:%.*]] = getelementptr inbounds double, double* [[ARRAY2:%.*]], i64 0		; CHECK-NEXT: [[GEP_2_0:%.*]] = getelementptr inbounds double, double* [[ARRAY2:%.*]], i64 0
; CHECK-NEXT: [[GEP_2_1:%.*]] = getelementptr inbounds double, double* [[ARRAY2]], i64 1		; CHECK-NEXT: [[GEP_2_1:%.*]] = getelementptr inbounds double, double* [[ARRAY2]], i64 1
; CHECK-NEXT: [[TMP2:%.*]] = bitcast double* [[GEP_2_0]] to <2 x double>*		; CHECK-NEXT: [[LD_2_0:%.*]] = load double, double* [[GEP_2_0]], align 8
; CHECK-NEXT: [[TMP3:%.*]] = load <2 x double>, <2 x double>* [[TMP2]], align 8		; CHECK-NEXT: [[LD_2_1:%.*]] = load double, double* [[GEP_2_1]], align 8
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP3]], <2 x double> poison, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[LD_2_0]], i32 0
; CHECK-NEXT: [[TMP4:%.*]] = fmul <2 x double> [[TMP1]], [[SHUFFLE]]		; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[LD_2_0]], i32 1
; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x double> [[SHUFFLE]], i32 1		; CHECK-NEXT: [[TMP4:%.*]] = fmul <2 x double> [[TMP1]], [[TMP3]]
; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> poison, double [[TMP5]], i32 0		; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> poison, double [[LD_2_1]], i32 0
; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x double> [[SHUFFLE]], i32 0		; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> [[TMP5]], double [[LD_2_1]], i32 1
; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x double> [[TMP6]], double [[TMP7]], i32 1		; CHECK-NEXT: [[TMP7:%.*]] = fmul <2 x double> [[TMP1]], [[TMP6]]
; CHECK-NEXT: [[TMP9:%.*]] = fmul <2 x double> [[TMP1]], [[TMP8]]		; CHECK-NEXT: [[TMP8:%.*]] = fadd <2 x double> [[TMP4]], [[TMP7]]
; CHECK-NEXT: [[TMP10:%.*]] = fadd <2 x double> [[TMP4]], [[TMP9]]		; CHECK-NEXT: [[TMP9:%.*]] = extractelement <2 x double> [[TMP8]], i32 0
; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x double> [[TMP10]], i32 0		; CHECK-NEXT: [[TMP10:%.*]] = extractelement <2 x double> [[TMP8]], i32 1
; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x double> [[TMP10]], i32 1		; CHECK-NEXT: [[ADD3:%.*]] = fadd double [[TMP9]], [[TMP10]]
; CHECK-NEXT: [[ADD3:%.*]] = fadd double [[TMP11]], [[TMP12]]
; CHECK-NEXT: ret double [[ADD3]]		; CHECK-NEXT: ret double [[ADD3]]
;		;
entry:		entry:
%gep_1_0 = getelementptr inbounds double, double* %array1, i64 0		%gep_1_0 = getelementptr inbounds double, double* %array1, i64 0
%gep_1_1 = getelementptr inbounds double, double* %array1, i64 1		%gep_1_1 = getelementptr inbounds double, double* %array1, i64 1
%ld_1_0 = load double, double* %gep_1_0, align 8		%ld_1_0 = load double, double* %gep_1_0, align 8
%ld_1_1 = load double, double* %gep_1_1, align 8		%ld_1_1 = load double, double* %gep_1_1, align 8


llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -basic-aa -slp-vectorizer -slp-threshold=-100 -instcombine -dce -S -mtriple=i386-apple-macosx10.8.0 -mcpu=corei7-avx | FileCheck %s			; RUN: opt < %s -basic-aa -slp-vectorizer -slp-threshold=-100 -instcombine -dce -S -mtriple=i386-apple-macosx10.8.0 -mcpu=corei7-avx | FileCheck %s
	; RUN: opt < %s -basic-aa -slp-vectorizer -slp-threshold=-100 -instcombine -dce -S -mtriple=i386-apple-macosx10.8.0 -mattr=+sse2 | FileCheck %s --check-prefix=SSE2			; RUN: opt < %s -basic-aa -slp-vectorizer -slp-threshold=-100 -instcombine -dce -S -mtriple=i386-apple-macosx10.8.0 -mattr=+sse2 | FileCheck %s --check-prefix=SSE2
				RKSimon: `; RUN: opt < %s -basic-aa -slp-vectorizer -slp-threshold=-100 -instcombine -dce -S -mtriple=i386-apple-macosx10.8.0 -mattr=+sse2 | FileCheck %s` Add a pre-SSSE3 run?

	target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128-n8:16:32-S128"			target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128-n8:16:32-S128"

	; Make sure we order the operands of commutative operations so that we get			; Make sure we order the operands of commutative operations so that we get
	; bigger vectorizable trees.			; bigger vectorizable trees.

	define void @shuffle_operands1(double * noalias %from, double * noalias %to, double %v1, double %v2) {			define void @shuffle_operands1(double * noalias %from, double * noalias %to, double %v1, double %v2) {
	; CHECK-LABEL: @shuffle_operands1(			; CHECK-LABEL: @shuffle_operands1(
	[... 236 lines not shown ...]
	}			}

	define void @vecload_vs_broadcast4(double * noalias %from, double * noalias %to, double %v1, double %v2) {			define void @vecload_vs_broadcast4(double * noalias %from, double * noalias %to, double %v1, double %v2) {
	; CHECK-LABEL: @vecload_vs_broadcast4(			; CHECK-LABEL: @vecload_vs_broadcast4(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: br label [[LP:%.*]]			; CHECK-NEXT: br label [[LP:%.*]]
	; CHECK: lp:			; CHECK: lp:
	; CHECK-NEXT: [[P:%.*]] = phi double [ 1.000000e+00, [[LP]] ], [ 0.000000e+00, [[ENTRY:%.*]] ]			; CHECK-NEXT: [[P:%.*]] = phi double [ 1.000000e+00, [[LP]] ], [ 0.000000e+00, [[ENTRY:%.*]] ]
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[FROM:%.*]] to <2 x double>*			; CHECK-NEXT: [[FROM_1:%.*]] = getelementptr double, double* [[FROM:%.*]], i32 1
	; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 4			; CHECK-NEXT: [[V0_1:%.*]] = load double, double* [[FROM]], align 4
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP1]], <2 x double> poison, <2 x i32> <i32 1, i32 0>			; CHECK-NEXT: [[V0_2:%.*]] = load double, double* [[FROM_1]], align 4
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> [[TMP1]], double [[P]], i64 1			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V0_2]], i64 0
	; CHECK-NEXT: [[TMP3:%.*]] = fadd <2 x double> [[TMP2]], [[SHUFFLE]]			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[P]], i64 1
	; CHECK-NEXT: [[TMP4:%.*]] = bitcast double* [[TO:%.*]] to <2 x double>*			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V0_1]], i64 0
	; CHECK-NEXT: store <2 x double> [[TMP3]], <2 x double>* [[TMP4]], align 4			; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x double> [[TMP2]], <2 x double> poison, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP4:%.*]] = fadd <2 x double> [[TMP1]], [[TMP3]]
				; CHECK-NEXT: [[TMP5:%.*]] = bitcast double* [[TO:%.*]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP4]], <2 x double>* [[TMP5]], align 4
	; CHECK-NEXT: br i1 undef, label [[LP]], label [[EXT:%.*]]			; CHECK-NEXT: br i1 undef, label [[LP]], label [[EXT:%.*]]
				ABataev (Not Done): This block of code has a throughput of 2.5 instead of the 2.0 it had before. I assume there are some other cases like this; it needs some extra investigation.
				vporpo (Author, Done): How did you come up with these throughput values? The assembly code that comes out of llc for the original code is:
				    movl 8(%esp), %eax
				    movl 4(%esp), %ecx
				    vmovupd (%ecx), %xmm0
				    vpermilpd $1, %xmm0, %xmm1 ## xmm1 = xmm0[1,0]
				    vmovq %xmm0, %xmm0 ## xmm0 = xmm0[0],zero
				    vaddpd %xmm1, %xmm0, %xmm0
				    vmovupd %xmm0, (%eax)
				The new code is:
				    movl 8(%esp), %eax
				    movl 4(%esp), %ecx
				    vmovsd 8(%ecx), %xmm0 ## xmm0 = mem[0],zero
				    vmovddup (%ecx), %xmm1 ## xmm1 = mem[0,0]
				    vaddpd %xmm1, %xmm0, %xmm0
				    vmovupd %xmm0, (%eax)
				I ran the function in a loop on a Skylake and the new code is 25% faster.
				ABataev (Not Done): https://godbolt.org/z/3rhGajsaT
				The first page is the result without the patch, the second is the result with the patch.
				ABataev (Not Done): What about this?
				vporpo (Author, Done): I am not sure what to do about this. It may have lower throughput, but it has lower latency, so it runs faster. Are we always considering throughput? It looks like in TTI we are mostly counting instructions, at least from what I can see in getShuffleCost():
				    {TTI::SK_PermuteSingleSrc, MVT::v2i64, 1}, // pshufd
				    {TTI::SK_PermuteSingleSrc, MVT::v4i32, 1}, // pshufd
				    {TTI::SK_PermuteSingleSrc, MVT::v8i16, 5}, // 2*pshuflw + 2*pshufhw
				                                               // + pshufd/unpck
				    { TTI::SK_PermuteSingleSrc, MVT::v16i8, 10 }, // 2*pshuflw + 2*pshufhw
				                                                  // + 2*pshufd + 2*unpck + 2*packus
				ABataev (Not Done): It is still in terms of throughput. Yeah, it may be faster on skylake but not on corei7-avx. And there might be similar cases for other CPUs. We need to tweak the estimation criteria.
				vporpo (Author, Done): What kind of tweaking are you proposing?
				vporpo (Author, Done): As far as I understand, the reason for the lower throughput is that we are adding pressure to the memory units ([4] and [5]). This is happening because we are loading twice: once for the scalar load, and once with `vmovddup`. This means that if we try to pack many instances of this code in parallel, the pipeline would stall more than with the original code. But in any other case, for example if this is part of some other code that is not heavy on the load units, it is the latency that matters, and this code is better with respect to latency.

				Modeling throughput would require an accurate pipeline model, which we are not using in SLP. The cost modeling that we are doing looks more like a latency model than a throughput one, since it has no concept of pipeline resources. If we were actually modeling the pipeline, then we could check both the code latency and the code throughput and decide which one to choose in each case. Given that we are not actually doing this, I would argue against trying to fix a potential pipeline stall in an ad hoc way.
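To make the "counting instructions" reading of the cost tables concrete, here is a minimal sketch of a table-driven lookup built from the four entries quoted earlier in this thread. The `ShuffleKind` and `VecType` enums, the `CostEntry` struct and the `lookupShuffleCost` function are hypothetical stand-ins written for illustration; they are not LLVM's actual TTI types or API.

```cpp
// Hypothetical stand-ins for TTI shuffle kinds and machine value types.
enum class ShuffleKind { Broadcast, PermuteSingleSrc };
enum class VecType { v2i64, v4i32, v8i16, v16i8 };

// One row of the table: a (kind, type) key and a single scalar cost.
struct CostEntry {
  ShuffleKind Kind;
  VecType Type;
  unsigned Cost;
};

// Entries copied from the getShuffleCost() excerpt quoted in the thread.
static const CostEntry ShuffleCosts[] = {
    {ShuffleKind::PermuteSingleSrc, VecType::v2i64, 1},  // pshufd
    {ShuffleKind::PermuteSingleSrc, VecType::v4i32, 1},  // pshufd
    {ShuffleKind::PermuteSingleSrc, VecType::v8i16, 5},  // 2*pshuflw + 2*pshufhw
                                                         // + pshufd/unpck
    {ShuffleKind::PermuteSingleSrc, VecType::v16i8, 10}, // 2*pshuflw + 2*pshufhw
                                                         // + 2*pshufd + 2*unpck + 2*packus
};

// Linear search over the table. Returns the matching entry, or nullptr
// when the table has no row for this (kind, type) pair.
const CostEntry *lookupShuffleCost(ShuffleKind Kind, VecType Type) {
  for (const CostEntry &E : ShuffleCosts)
    if (E.Kind == Kind && E.Type == Type)
      return &E;
  return nullptr;
}
```

In this sketch each lookup yields a single number per operation. Nothing in the table records which execution ports an instruction occupies, so a cost built by summing such entries cannot by itself tell the extra pressure on the load units apart from any other cost.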
	; CHECK: ext:			; CHECK: ext:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	; SSE2-LABEL: @vecload_vs_broadcast4(			; SSE2-LABEL: @vecload_vs_broadcast4(
	; SSE2-NEXT: entry:			; SSE2-NEXT: entry:
	; SSE2-NEXT: br label [[LP:%.*]]			; SSE2-NEXT: br label [[LP:%.*]]
	; SSE2: lp:			; SSE2: lp:
	; SSE2-NEXT: [[P:%.*]] = phi double [ 1.000000e+00, [[LP]] ], [ 0.000000e+00, [[ENTRY:%.*]] ]			; SSE2-NEXT: [[P:%.*]] = phi double [ 1.000000e+00, [[LP]] ], [ 0.000000e+00, [[ENTRY:%.*]] ]
	[... 29 lines not shown ...]


	define void @shuffle_nodes_match2(double * noalias %from, double * noalias %to, double %v1, double %v2) {			define void @shuffle_nodes_match2(double * noalias %from, double * noalias %to, double %v1, double %v2) {
	; CHECK-LABEL: @shuffle_nodes_match2(			; CHECK-LABEL: @shuffle_nodes_match2(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: br label [[LP:%.*]]			; CHECK-NEXT: br label [[LP:%.*]]
	; CHECK: lp:			; CHECK: lp:
	; CHECK-NEXT: [[P:%.*]] = phi double [ 1.000000e+00, [[LP]] ], [ 0.000000e+00, [[ENTRY:%.*]] ]			; CHECK-NEXT: [[P:%.*]] = phi double [ 1.000000e+00, [[LP]] ], [ 0.000000e+00, [[ENTRY:%.*]] ]
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[FROM:%.*]] to <2 x double>*			; CHECK-NEXT: [[FROM_1:%.*]] = getelementptr double, double* [[FROM:%.*]], i32 1
	; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 4			; CHECK-NEXT: [[V0_1:%.*]] = load double, double* [[FROM]], align 4
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP1]], <2 x double> poison, <2 x i32> <i32 1, i32 0>			; CHECK-NEXT: [[V0_2:%.*]] = load double, double* [[FROM_1]], align 4
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> [[TMP1]], double [[P]], i64 1			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V0_1]], i64 0
	; CHECK-NEXT: [[TMP3:%.*]] = fadd <2 x double> [[SHUFFLE]], [[TMP2]]			; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x double> [[TMP0]], <2 x double> poison, <2 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP4:%.*]] = bitcast double* [[TO:%.*]] to <2 x double>*			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V0_2]], i64 0
	; CHECK-NEXT: store <2 x double> [[TMP3]], <2 x double>* [[TMP4]], align 4			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[P]], i64 1
				; CHECK-NEXT: [[TMP4:%.*]] = fadd <2 x double> [[TMP1]], [[TMP3]]
				; CHECK-NEXT: [[TMP5:%.*]] = bitcast double* [[TO:%.*]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP4]], <2 x double>* [[TMP5]], align 4
	; CHECK-NEXT: br i1 undef, label [[LP]], label [[EXT:%.*]]			; CHECK-NEXT: br i1 undef, label [[LP]], label [[EXT:%.*]]
	; CHECK: ext:			; CHECK: ext:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	; SSE2-LABEL: @shuffle_nodes_match2(			; SSE2-LABEL: @shuffle_nodes_match2(
	; SSE2-NEXT: entry:			; SSE2-NEXT: entry:
	; SSE2-NEXT: br label [[LP:%.*]]			; SSE2-NEXT: br label [[LP:%.*]]
	; SSE2: lp:			; SSE2: lp:
	[... 320 lines not shown ...]