<< This patch demonstrates Strided Memory Access Vectorization >>
Currently LLVM does not fully support strided access vectorization; partial support is available via interleaved-access vectorization.
There are two main overheads with strided vectorization:
• An overhead of consolidating data into an operable vector.
• An overhead of distributing the data elements after the operations.
Because of these overheads, LLVM finds strided memory vectorization unprofitable and generates a serial code sequence instead of vectorized code.
GATHER & SCATTER instruction support for the consolidation & distribution operations is available only on a few (not all) targets.
In this approach we are trying to handle cases like this:
for (int i = 0; i < len; i++)
a[i*3] = b[i*2] + c[i*3];
We model strided memory loads & stores using shuffle and load/masked-store operations.
• A load is modeled as loads followed by a shuffle.
• A store is modeled as shuffles followed by masked stores.
• To minimize the number of load and store operations, we introduce a 'SkipFactor'.
‘SkipFactor’:
• Multiple load operations are required to consolidate data into an operable vector.
• Between successive loads we skip some elements so the consolidation is effective.
• SkipFactor is the number of additional elements to skip past the end address of the previous vector load to reach the start of the next vector load.
• SkipFactor allows all vector loads to be treated as identical (from a valid-element perspective).
• SkipFactor = (Stride - (VF % Stride)) % Stride
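For example, with VF = 4 and Stride = 3, SkipFactor = (3 - (4 % 3)) % 3 = 2, so the next vector load starts VF + SkipFactor = 6 elements after the previous one (the "i64 6" offset in the IR below). A minimal C++ sketch of this computation (the helper name is illustrative, not from the patch):

// Illustrative helper, not from the patch: elements to skip between two
// consecutive VF-wide loads so that every load looks identical from the
// valid-element perspective.
unsigned getSkipFactor(unsigned Stride, unsigned VF) {
  return (Stride - (VF % Stride)) % Stride;
}
// getSkipFactor(/*Stride=*/3, /*VF=*/4) == 2 -> next load at offset 4 + 2 = 6
// getSkipFactor(/*Stride=*/2, /*VF=*/4) == 0 -> loads are back to back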
How LOAD is modeled:
Load with stride 3 (i.e. load for b[3*i]):
; First VF-wide load; the valid strided elements sit in lanes 0 and 3.
%5 = getelementptr inbounds i32, i32* %b, i64 %.lhs
%6 = bitcast i32* %5 to <4 x i32>*
%stride.load27 = load <4 x i32>, <4 x i32>* %6, align 1, !tbaa !1
; Second load starts 6 (= VF + SkipFactor = 4 + 2) elements after the first.
%7 = getelementptr i32, i32* %5, i64 6
%8 = bitcast i32* %7 to <4 x i32>*
%stride.load28 = load <4 x i32>, <4 x i32>* %8, align 1, !tbaa !1
; Consolidate the valid lanes of both loads into one strided vector.
%strided.vec29 = shufflevector <4 x i32> %stride.load27, <4 x i32> %stride.load28, <4 x i32> <i32 0, i32 3, i32 4, i32 7>
How STORE is modeled:
Store with stride 3 (i.e. store to c[3*i]):
; Distribute the first two result elements into lanes 0 and 3 and store
; only those lanes via a masked store.
%10 = getelementptr inbounds i32, i32* %c, i64 %.lhs
%11 = bitcast i32* %10 to <4 x i32>*
%interleaved.vec = shufflevector <4 x i32> %StoreResult, <4 x i32> undef, <4 x i32> <i32 0, i32 undef, i32 undef, i32 1>
call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec, <4 x i32>* %11, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
; Next masked store starts 6 (= VF + SkipFactor) elements later and writes
; the remaining two result elements.
%12 = getelementptr i32, i32* %10, i64 6
%13 = bitcast i32* %12 to <4 x i32>*
%interleaved.vec30 = shufflevector <4 x i32> %StoreResult, <4 x i32> undef, <4 x i32> <i32 2, i32 undef, i32 undef, i32 3>
call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec30, <4 x i32>* %13, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
To enable this feature, the following LLVM changes are required:
- Identify strided memory accesses (this already exists; the interleave vectorizer does it).
- Costing changes (see the sketch after this list):
a. Identify the number of Load[s]/Store[s] and Shuffle[s] required to model the Load/Store operation, taking the SkipFactor into account.
b. Return the total cost as the sum of the Load[s]/Store[s] and Shuffle[s] costs.
- Transform:
a. Generate Shuffle[s] followed by Mask-Store[s] instructions to model a Store operation.
b. Generate Load[s] followed by Shuffle[s] instructions to model a Load operation.
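As a rough sketch of the costing step (a hypothetical helper, not the patch's actual code; a real implementation would take the per-operation costs from TargetTransformInfo):

// Hypothetical cost sketch for one strided load/store group, assuming
// Stride > 1 and VF >= Stride; MemOpCost/ShuffleCost stand in for the
// target's per-operation costs.
unsigned getStridedMemOpCost(unsigned VF, unsigned Stride, bool IsLoad,
                             unsigned MemOpCost, unsigned ShuffleCost) {
  // Valid strided elements covered by one VF-wide load/masked-store.
  unsigned ValidPerOp = (VF + Stride - 1) / Stride;   // ceil(VF / Stride)
  // VF-wide memory operations needed to cover all VF strided elements.
  unsigned NumMemOps = (VF + ValidPerOp - 1) / ValidPerOp;
  // Loads: NumMemOps loads are consolidated by a shuffle tree.
  // Stores: each masked store is fed by its own distributing shuffle.
  unsigned NumShuffles = IsLoad ? NumMemOps - 1 : NumMemOps;
  return NumMemOps * MemOpCost + NumShuffles * ShuffleCost;
}
// VF = 4, Stride = 3: NumMemOps = 2, with one shuffle for the load and two
// shuffles + two masked stores for the store, matching the IR above.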
Use the options below to enable this feature:
"-mllvm -enable-interleaved-mem-accesses -mllvm -enable-strided-vectorization"
Gains observed with the prototype:
TSVC kernel S111 1.15x
TSVC kernel S1111 1.42x
Thanks,
Ashutosh
Why not simply call getInterleavedMemoryOpCost() with a single Index of 0 instead of specializing it to getStridedMemoryOpCost()?
Yes, we need to implement it for x86.