This is an archive of the discontinued LLVM Phabricator instance.

Differential D106399

[VectorCombine] Widening of partial vector loads
Abandoned · Public

Authored by lebedev.ri on Jul 20 2021, 2:40 PM.

Details

Reviewers

spatel
RKSimon
fhahn

Summary

If we are loading a vector that we know will be legalized into (potentially
several) vector registers, and we load fewer bytes than the total size of those
registers, then the legalization will have a hard time. At worst, the load will
be scalarized, at least partially, and the scalar elements inserted one by one
to form the narrow vector.

But sometimes, if the vector will be widened, we can tell from dereferenceability
or alignment knowledge that we are allowed to load those extra 'padding' elements.

I think this approach, asking the legalization about the final vector size,
is the most straightforward.

I've checked, and I believe this is endianness-agnostic as per alive2.
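
The arithmetic described in the summary can be sketched in plain C++. This is only an illustration under stated assumptions: `extraEltsToLoad` and its parameters are hypothetical names that do not appear in the patch, and the register width and part count are passed in by hand instead of being queried from the target.

```cpp
#include <cassert>

// Hypothetical helper (not part of the patch): a vector of NumElts elements,
// each EltBits wide, is assumed to legalize into NumParts registers of
// PartBits bits each. Returns how many extra "padding" elements a widened
// load would read, or -1 when the padding is not a whole number of elements.
int extraEltsToLoad(unsigned NumElts, unsigned EltBits, unsigned NumParts,
                    unsigned PartBits) {
  assert(EltBits != 0 && "element must have a non-zero size");
  unsigned OrigBits = NumElts * EltBits;
  unsigned LegalBits = NumParts * PartBits;
  assert(LegalBits >= OrigBits && "legalization should not shrink the type");

  unsigned ExtraBits = LegalBits - OrigBits;
  if (ExtraBits % EltBits != 0)
    return -1; // Padding is not a multiple of the element size.
  return static_cast<int>(ExtraBits / EltBits);
}
```

For example, under these assumptions a `<3 x float>` load that occupies one 128-bit register would read one extra element, while a `<4 x float>` load already fills the register and needs no widening. Whether those extra bytes may actually be read is a separate question that depends on dereferenceability.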

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	260 ms	x64 debian > Clang.CodeGen::attr-arm-sve-vector-bits-bitcast.c
	300 ms	x64 debian > Clang.CodeGen::attr-arm-sve-vector-bits-globals.c
	2,880 ms	x64 debian > libarcher.critical::critical.c
	2,550 ms	x64 debian > libarcher.parallel::parallel-firstprivate.c
	2,500 ms	x64 debian > libarcher.parallel::parallel-simple.c
		View Full Test Results (22 Failed)

Event Timeline

lebedev.ri created this revision. Jul 20 2021, 2:40 PM

Herald added a subscriber: hiraditya. Jul 20 2021, 2:40 PM

lebedev.ri requested review of this revision. Jul 20 2021, 2:40 PM

lebedev.ri mentioned this in D106280: [X86][AVX] scalar_to_vector(load_scalar()) -> load_vector() for fast dereferencable loads.

lebedev.ri edited the summary of this revision. Jul 20 2021, 2:43 PM

Streamline the code.

Harbormaster completed remote builds in B115213: Diff 360300. Jul 20 2021, 5:51 PM

Thanks for looking at this - I'd delayed requesting something like this until we have a better idea of what the non-pow2 SLP IR from D57059 is likely to look like.

I also ended up wondering whether we should consider using the dereferenceable data in DAGTypeLegalizer::GenWidenVectorLoads? That would help with the float3 case on D106280 that raised concern.

In D106399#2892863, @RKSimon wrote:

Thanks for looking at this - I'd delayed requesting something like this until we have a better idea of what the non-pow2 SLP IR from D57059 is likely to look like.

Even if SLP learns to emit wide-enough loads (which it should, regardless of non-pow2 vectorization support, etc.),
I would guess we'd still want this, because here in IR we have much more information to deduce
the legality of such a transformation than the backend would have.

I also ended up wondering whether we should consider using the dereferenceable data in DAGTypeLegalizer::GenWidenVectorLoads? That would help with the float3 case on D106280 that raised concern.

I actually tried looking at exactly that before doing this, without much success.

Harbormaster completed remote builds in B115283: Diff 360412. Jul 21 2021, 7:01 AM

Thanks for working on this! I agree that this is useful independent of whatever we can/should do to improve SLP.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
290	worth -> width
312–313	Use the unary variant of `CreateShuffleVector` here. Could call this value `ExtractSubvector` or state that in the code comment. Can we always assume that the extract op is free, or should we add that potential cost into the equation?
llvm/test/Transforms/VectorCombine/X86/load-inseltpoison.ll
590–593	Can we preserve the better alignment?

spatel added inline comments. Jul 21 2021, 7:36 AM

llvm/test/Transforms/VectorCombine/X86/load-widening.ll
4	Hmm...I don't think I've ever tried that experiment. :) Did you confirm that the layout "wins" over the target to provide the coverage you expected?

D57059 looks to be performing masked loads when the wider range is dereferenceable (costs permitting) - see the llvm/test/Transforms/SLPVectorizer/X86/dot-product.ll test changes.

Thanks for taking a look!
Addressing nits.

In D106399#2893246, @RKSimon wrote:

D57059 looks to be performing masked loads when the wider range is dereferenceable (costs permitting) - see the llvm/test/Transforms/SLPVectorizer/X86/dot-product.ll test changes.

Then we'll also need a load unmasking transform.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
290	This is not a typo, but okay.
312–313	That subvector we're extracting here will be legalized (widened) into the full vector we've just loaded, so the shuffle just pretends that a few high elements of that legal vector don't exist. That is the whole assumption behind the transformation here. There isn't any actual subvector extraction here, so I don't really see why we'd need to check the cost of that.
llvm/test/Transforms/VectorCombine/X86/load-inseltpoison.ll
590–593	Some other transformation does this, I'll take a look.
llvm/test/Transforms/VectorCombine/X86/load-widening.ll
4	https://alive2.llvm.org/ce/z/yAGpMV
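
The reply on lines 312–313 above can be pictured with a small standalone sketch. It is not code from the patch: `lowSubvectorMask` and `applyMask` are hypothetical names, and a `std::vector<float>` stands in for the wide vector that was loaded. The point is only that the mask is the identity sequence over the low lanes, so the "extract" keeps lanes in place and drops the high ones.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical illustration: the mask that keeps the low OrigNumElts lanes of
// a wider vector is the identity sequence 0, 1, ..., OrigNumElts - 1.
std::vector<int> lowSubvectorMask(unsigned OrigNumElts) {
  std::vector<int> Mask(OrigNumElts);
  std::iota(Mask.begin(), Mask.end(), 0);
  return Mask;
}

// Apply such a mask to a plain array standing in for the wide vector.
std::vector<float> applyMask(const std::vector<float> &Wide,
                             const std::vector<int> &Mask) {
  std::vector<float> Narrow;
  Narrow.reserve(Mask.size());
  for (int Idx : Mask) {
    assert(Idx >= 0 && static_cast<std::size_t>(Idx) < Wide.size() &&
           "mask index must select an existing lane");
    Narrow.push_back(Wide[static_cast<std::size_t>(Idx)]);
  }
  return Narrow;
}
```

With a four-lane wide value and a three-lane mask, the result is simply the first three lanes in their original order.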

spatel added inline comments. Jul 21 2021, 8:28 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
312–313	Ok - please add a code comment with that explanation then. That way, we'll document why we don't factor cost of the shuffle.
llvm/test/Transforms/VectorCombine/X86/load-widening.ll
4	I'm not questioning the logic, just that there is no big-endian x86 target, so I don't know what is happening internally in LLVM in this situation. I think it would be better to add a different test file for a target that actually does support both modes (AArch64 or PowerPC?). We still want an x86 test file to verify that SSE vs. AVX is behaving as expected though.
194	Can you explain the difference between this test and vec_with_7elts_256bits for an SSE target? It's not obvious to me why we are ok widening to 256-bit if that's not legal, but not wider than that.

Something to keep in mind. Loading more data than was stored can prevent store to load forwarding. If there is a store in the store buffer that hasn't written to the cache yet, its data can forward to a load instead of waiting until the store gets written to the cache. This doesn't work if the load is larger than the size of the store. The load will have to wait until the store gets to the cache to merge with the surrounding data.
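
The store-forwarding concern above can be pictured with a deliberately simplified model. This is an assumption for illustration only, not a description of any particular CPU: `Access` and `canForward` are hypothetical names, and real forwarding rules vary between microarchitectures.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical byte range [Addr, Addr + Size) touched by a memory access.
struct Access {
  std::uint64_t Addr;
  std::uint64_t Size;
};

// Simplified model (an assumption, not a hardware specification): a load can
// be satisfied from a pending store only when the store covers every byte
// that the load reads.
bool canForward(const Access &Store, const Access &Load) {
  return Load.Addr >= Store.Addr &&
         Load.Addr + Load.Size <= Store.Addr + Store.Size;
}
```

In this model a 12-byte store followed by a 12-byte load of the same address can forward, but widening that load to 16 bytes makes it read bytes the store never wrote, so it can no longer forward.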

Harbormaster completed remote builds in B115323: Diff 360465. Jul 21 2021, 9:22 AM

a.elovikov added a subscriber: a.elovikov. Jul 21 2021, 10:37 AM

lebedev.ri marked an inline comment as done. Jul 21 2021, 12:04 PM

lebedev.ri marked an inline comment as done. Jul 21 2021, 12:10 PM

lebedev.ri added inline comments.

llvm/test/Transforms/VectorCombine/X86/load-widening.ll
172	We widen this to either 2x XMM or 1x YMM, and we know we can do this as per deref info.
194	We need to widen this to either 3x XMM or 2x YMM, but we don't know we can load that many bytes. There is another problem hiding here: even if we know we can load 3x XMM, we still try to widen to 4x XMM, because that's what the legalizer told us, since it only knows how to double.
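
The "only knows how to double" remark can be sketched numerically. The model below is an assumption made for illustration: it rounds the element count up to a power of two before splitting into registers, and the function names are hypothetical. The 9-element example is also an assumption and may not match the type used in the test at line 194.

```cpp
#include <cassert>

// Smallest power of two that is >= N (N must be non-zero).
unsigned nextPowerOfTwo(unsigned N) {
  assert(N != 0 && "expected a non-empty vector");
  unsigned P = 1;
  while (P < N)
    P *= 2;
  return P;
}

// Registers occupied if the element count is first rounded up to a power of
// two (the assumed "doubling" model), then split into RegBits-wide pieces.
unsigned partsWhenDoubling(unsigned NumElts, unsigned EltBits,
                           unsigned RegBits) {
  unsigned Bits = nextPowerOfTwo(NumElts) * EltBits;
  return (Bits + RegBits - 1) / RegBits;
}

// Registers strictly needed to hold the data, with no power-of-two rounding.
unsigned partsMinimal(unsigned NumElts, unsigned EltBits, unsigned RegBits) {
  return (NumElts * EltBits + RegBits - 1) / RegBits;
}
```

Under this model, nine 32-bit elements need three 128-bit registers at minimum but are reported as four, which is the mismatch described above; with 256-bit registers both counts come out as two.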

Matt added a subscriber: Matt. Jul 22 2021, 3:35 AM

I guess I need to mimic X86TTIImpl::getMemoryOpCost() more.

lebedev.ri abandoned this revision. Jan 17 2022, 2:35 PM

Revision Contents

Path	Size
llvm/include/llvm/Analysis/TargetTransformInfo.h	8 lines
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h	1 line
llvm/include/llvm/CodeGen/BasicTTIImpl.h	6 lines
llvm/lib/Analysis/TargetTransformInfo.cpp	4 lines
llvm/lib/Transforms/Vectorize/VectorCombine.cpp	85 lines
llvm/test/Transforms/VectorCombine/X86/load-inseltpoison.ll	6 lines
llvm/test/Transforms/VectorCombine/X86/load-widening.ll	51 lines
llvm/test/Transforms/VectorCombine/X86/load.ll	6 lines

Diff 360465

llvm/include/llvm/Analysis/TargetTransformInfo.h

Show First 20 Lines • Show All 1,177 Lines • ▼ Show 20 Lines
InstructionCost getCallInstrCost(		InstructionCost getCallInstrCost(
Function *F, Type *RetTy, ArrayRef<Type *> Tys,		Function *F, Type *RetTy, ArrayRef<Type *> Tys,
TTI::TargetCostKind CostKind = TTI::TCK_SizeAndLatency) const;		TTI::TargetCostKind CostKind = TTI::TCK_SizeAndLatency) const;

/// \returns The number of pieces into which the provided type must be		/// \returns The number of pieces into which the provided type must be
/// split during legalization. Zero is returned when the answer is unknown.		/// split during legalization. Zero is returned when the answer is unknown.
unsigned getNumberOfParts(Type *Tp) const;		unsigned getNumberOfParts(Type *Tp) const;

		/// \returns The type of the piece into which the provided type must be
		/// split during legalization.
		Type *getLegalizedPartType(Type *Tp) const;

/// \returns The cost of the address computation. For most targets this can be		/// \returns The cost of the address computation. For most targets this can be
/// merged into the instruction indexing mode. Some targets might want to		/// merged into the instruction indexing mode. Some targets might want to
/// distinguish between address computation for memory operations on vector		/// distinguish between address computation for memory operations on vector
/// types and scalar types. Such targets should override this function.		/// types and scalar types. Such targets should override this function.
/// The 'SE' parameter holds pointer for the scalar evolution object which		/// The 'SE' parameter holds pointer for the scalar evolution object which
/// is used in order to get the Ptr step value in case of constant stride.		/// is used in order to get the Ptr step value in case of constant stride.
/// The 'Ptr' parameter holds SCEV of the access pointer.		/// The 'Ptr' parameter holds SCEV of the access pointer.
InstructionCost getAddressComputationCost(Type *Ty,		InstructionCost getAddressComputationCost(Type *Ty,
▲ Show 20 Lines • Show All 433 Lines • ▼ Show 20 Lines	virtual InstructionCost getExtendedAddReductionCost(
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) = 0;		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) = 0;
virtual InstructionCost		virtual InstructionCost
getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,		getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
TTI::TargetCostKind CostKind) = 0;		TTI::TargetCostKind CostKind) = 0;
virtual InstructionCost getCallInstrCost(Function *F, Type *RetTy,		virtual InstructionCost getCallInstrCost(Function *F, Type *RetTy,
ArrayRef<Type *> Tys,		ArrayRef<Type *> Tys,
TTI::TargetCostKind CostKind) = 0;		TTI::TargetCostKind CostKind) = 0;
virtual unsigned getNumberOfParts(Type *Tp) = 0;		virtual unsigned getNumberOfParts(Type *Tp) = 0;
		virtual Type *getLegalizedPartType(Type *Tp) = 0;
virtual InstructionCost		virtual InstructionCost
getAddressComputationCost(Type *Ty, ScalarEvolution *SE, const SCEV *Ptr) = 0;		getAddressComputationCost(Type *Ty, ScalarEvolution *SE, const SCEV *Ptr) = 0;
virtual InstructionCost		virtual InstructionCost
getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) = 0;		getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) = 0;
virtual bool getTgtMemIntrinsic(IntrinsicInst *Inst,		virtual bool getTgtMemIntrinsic(IntrinsicInst *Inst,
MemIntrinsicInfo &Info) = 0;		MemIntrinsicInfo &Info) = 0;
virtual unsigned getAtomicMemIntrinsicMaxElementSize() const = 0;		virtual unsigned getAtomicMemIntrinsicMaxElementSize() const = 0;
virtual Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,		virtual Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
▲ Show 20 Lines • Show All 497 Lines • ▼ Show 20 Lines	public:
InstructionCost getCallInstrCost(Function *F, Type *RetTy,		InstructionCost getCallInstrCost(Function *F, Type *RetTy,
ArrayRef<Type *> Tys,		ArrayRef<Type *> Tys,
TTI::TargetCostKind CostKind) override {		TTI::TargetCostKind CostKind) override {
return Impl.getCallInstrCost(F, RetTy, Tys, CostKind);		return Impl.getCallInstrCost(F, RetTy, Tys, CostKind);
}		}
unsigned getNumberOfParts(Type *Tp) override {		unsigned getNumberOfParts(Type *Tp) override {
return Impl.getNumberOfParts(Tp);		return Impl.getNumberOfParts(Tp);
}		}
		Type *getLegalizedPartType(Type *Tp) override {
		return Impl.getLegalizedPartType(Tp);
		}
InstructionCost getAddressComputationCost(Type *Ty, ScalarEvolution *SE,		InstructionCost getAddressComputationCost(Type *Ty, ScalarEvolution *SE,
const SCEV *Ptr) override {		const SCEV *Ptr) override {
return Impl.getAddressComputationCost(Ty, SE, Ptr);		return Impl.getAddressComputationCost(Ty, SE, Ptr);
}		}
InstructionCost getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) override {		InstructionCost getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) override {
return Impl.getCostOfKeepingLiveOverCall(Tys);		return Impl.getCostOfKeepingLiveOverCall(Tys);
}		}
bool getTgtMemIntrinsic(IntrinsicInst *Inst,		bool getTgtMemIntrinsic(IntrinsicInst *Inst,
▲ Show 20 Lines • Show All 210 Lines • Show Last 20 Lines

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

Show First 20 Lines • Show All 609 Lines • ▼ Show 20 Lines	public:

InstructionCost getCallInstrCost(Function *F, Type *RetTy,		InstructionCost getCallInstrCost(Function *F, Type *RetTy,
ArrayRef<Type *> Tys,		ArrayRef<Type *> Tys,
TTI::TargetCostKind CostKind) const {		TTI::TargetCostKind CostKind) const {
return 1;		return 1;
}		}

unsigned getNumberOfParts(Type *Tp) const { return 0; }		unsigned getNumberOfParts(Type *Tp) const { return 0; }
		Type *getLegalizedPartType(Type *Tp) const { return nullptr; }

InstructionCost getAddressComputationCost(Type *Tp, ScalarEvolution *,		InstructionCost getAddressComputationCost(Type *Tp, ScalarEvolution *,
const SCEV *) const {		const SCEV *) const {
return 0;		return 0;
}		}

InstructionCost getArithmeticReductionCost(unsigned, VectorType *,		InstructionCost getArithmeticReductionCost(unsigned, VectorType *,
TTI::TargetCostKind) const {		TTI::TargetCostKind) const {
▲ Show 20 Lines • Show All 531 Lines • Show Last 20 Lines

llvm/include/llvm/CodeGen/BasicTTIImpl.h

	Show First 20 Lines • Show All 1,978 Lines • ▼ Show 20 Lines
	}			}

	unsigned getNumberOfParts(Type *Tp) {			unsigned getNumberOfParts(Type *Tp) {
	std::pair<InstructionCost, MVT> LT =			std::pair<InstructionCost, MVT> LT =
	getTLI()->getTypeLegalizationCost(DL, Tp);			getTLI()->getTypeLegalizationCost(DL, Tp);
	return *LT.first.getValue();			return *LT.first.getValue();
	}			}

				Type *getLegalizedPartType(Type *Tp) {
				std::pair<InstructionCost, MVT> LT =
				getTLI()->getTypeLegalizationCost(DL, Tp);
				return EVT(LT.second).getTypeForEVT(Tp->getContext());
				}

	InstructionCost getAddressComputationCost(Type *Ty, ScalarEvolution *,			InstructionCost getAddressComputationCost(Type *Ty, ScalarEvolution *,
	const SCEV *) {			const SCEV *) {
	return 0;			return 0;
	}			}

	/// Try to calculate arithmetic and shuffle op costs for reduction intrinsics.			/// Try to calculate arithmetic and shuffle op costs for reduction intrinsics.
	/// We're assuming that reduction operation are performing the following way:			/// We're assuming that reduction operation are performing the following way:
	///			///
	▲ Show 20 Lines • Show All 177 Lines • Show Last 20 Lines

llvm/lib/Analysis/TargetTransformInfo.cpp

Show First 20 Lines • Show All 873 Lines • ▼ Show 20 Lines	TargetTransformInfo::getCallInstrCost(Function *F, Type *RetTy,
assert(Cost >= 0 && "TTI should not produce negative costs!");		assert(Cost >= 0 && "TTI should not produce negative costs!");
return Cost;		return Cost;
}		}

unsigned TargetTransformInfo::getNumberOfParts(Type *Tp) const {		unsigned TargetTransformInfo::getNumberOfParts(Type *Tp) const {
return TTIImpl->getNumberOfParts(Tp);		return TTIImpl->getNumberOfParts(Tp);
}		}

		Type *TargetTransformInfo::getLegalizedPartType(Type *Tp) const {
		return TTIImpl->getLegalizedPartType(Tp);
		}

InstructionCost		InstructionCost
TargetTransformInfo::getAddressComputationCost(Type *Tp, ScalarEvolution *SE,		TargetTransformInfo::getAddressComputationCost(Type *Tp, ScalarEvolution *SE,
const SCEV *Ptr) const {		const SCEV *Ptr) const {
InstructionCost Cost = TTIImpl->getAddressComputationCost(Tp, SE, Ptr);		InstructionCost Cost = TTIImpl->getAddressComputationCost(Tp, SE, Ptr);
assert(Cost >= 0 && "TTI should not produce negative costs!");		assert(Cost >= 0 && "TTI should not produce negative costs!");
return Cost;		return Cost;
}		}

▲ Show 20 Lines • Show All 277 Lines • Show Last 20 Lines

llvm/lib/Transforms/Vectorize/VectorCombine.cpp

Show All 30 Lines
#include "llvm/Transforms/Utils/Local.h"		#include "llvm/Transforms/Utils/Local.h"
#include "llvm/Transforms/Vectorize.h"		#include "llvm/Transforms/Vectorize.h"

using namespace llvm;		using namespace llvm;
using namespace llvm::PatternMatch;		using namespace llvm::PatternMatch;

#define DEBUG_TYPE "vector-combine"		#define DEBUG_TYPE "vector-combine"
STATISTIC(NumVecLoad, "Number of vector loads formed");		STATISTIC(NumVecLoad, "Number of vector loads formed");
		STATISTIC(NumVecLoadWiden, "Number of vector loads widened");
STATISTIC(NumVecCmp, "Number of vector compares formed");		STATISTIC(NumVecCmp, "Number of vector compares formed");
STATISTIC(NumVecBO, "Number of vector binops formed");		STATISTIC(NumVecBO, "Number of vector binops formed");
STATISTIC(NumVecCmpBO, "Number of vector compare + binop formed");		STATISTIC(NumVecCmpBO, "Number of vector compare + binop formed");
STATISTIC(NumShufOfBitcast, "Number of shuffles moved after bitcast");		STATISTIC(NumShufOfBitcast, "Number of shuffles moved after bitcast");
STATISTIC(NumScalarBO, "Number of scalar binops formed");		STATISTIC(NumScalarBO, "Number of scalar binops formed");
STATISTIC(NumScalarCmp, "Number of scalar compares formed");		STATISTIC(NumScalarCmp, "Number of scalar compares formed");

static cl::opt<bool> DisableVectorCombine(		static cl::opt<bool> DisableVectorCombine(
Show All 23 Lines	private:
Function &F;		Function &F;
IRBuilder<> Builder;		IRBuilder<> Builder;
const TargetTransformInfo &TTI;		const TargetTransformInfo &TTI;
const DominatorTree &DT;		const DominatorTree &DT;
AAResults &AA;		AAResults &AA;
AssumptionCache &AC;		AssumptionCache &AC;

bool vectorizeLoadInsert(Instruction &I);		bool vectorizeLoadInsert(Instruction &I);
		bool widenPartialVectorLoad(Instruction &I);
ExtractElementInst *getShuffleExtract(ExtractElementInst *Ext0,		ExtractElementInst *getShuffleExtract(ExtractElementInst *Ext0,
ExtractElementInst *Ext1,		ExtractElementInst *Ext1,
unsigned PreferredExtractIndex) const;		unsigned PreferredExtractIndex) const;
bool isExtractExtractCheap(ExtractElementInst *Ext0, ExtractElementInst *Ext1,		bool isExtractExtractCheap(ExtractElementInst *Ext0, ExtractElementInst *Ext1,
unsigned Opcode,		unsigned Opcode,
ExtractElementInst *&ConvertToShuffle,		ExtractElementInst *&ConvertToShuffle,
unsigned PreferredExtractIndex);		unsigned PreferredExtractIndex);
void foldExtExtCmp(ExtractElementInst *Ext0, ExtractElementInst *Ext1,		void foldExtExtCmp(ExtractElementInst *Ext0, ExtractElementInst *Ext1,
▲ Show 20 Lines • Show All 140 Lines • ▼ Show 20 Lines	bool VectorCombine::vectorizeLoadInsert(Instruction &I) {
Value *VecLd = Builder.CreateAlignedLoad(MinVecTy, CastedPtr, Alignment);		Value *VecLd = Builder.CreateAlignedLoad(MinVecTy, CastedPtr, Alignment);
VecLd = Builder.CreateShuffleVector(VecLd, Mask);		VecLd = Builder.CreateShuffleVector(VecLd, Mask);

replaceValue(I, *VecLd);		replaceValue(I, *VecLd);
++NumVecLoad;		++NumVecLoad;
return true;		return true;
}		}

		bool VectorCombine::widenPartialVectorLoad(Instruction &I) {
		const DataLayout &DL = I.getModule()->getDataLayout();

		auto *Load = dyn_cast<LoadInst>(&I);
		if (!Load)
		return false;

		Value *OrigPtr = Load->getPointerOperand();
		Align Alignment = Load->getAlign();
		unsigned AS = Load->getPointerAddressSpace();

		// What vector type do we currently load?
		auto *OrigVecTy = dyn_cast<FixedVectorType>(Load->getType());
		if (!OrigVecTy)
		return false;

		Type *ScalarEltTy = OrigVecTy->getScalarType();
		unsigned OrigNumElts = OrigVecTy->getNumElements();
		unsigned NumBitsPerElt = DL.getTypeSizeInBits(ScalarEltTy);

		// How will that type be legalized? I.e. into what vector register
		// will it be loaded, and how many registers will be occupied?
		auto *LegalizedPartVecTy =
		dyn_cast_or_null<FixedVectorType>(TTI.getLegalizedPartType(OrigVecTy));
		unsigned NumOfLegalizedVecParts = TTI.getNumberOfParts(OrigVecTy);

		// If it doesn't legalize into (a number of) vector registers, don't bother.
		if (!LegalizedPartVecTy || !NumOfLegalizedVecParts)
		return false;

		unsigned OrigBitCount = DL.getTypeSizeInBits(OrigVecTy);
		unsigned LegalizedVecBitCount =
		NumOfLegalizedVecParts * DL.getTypeSizeInBits(LegalizedPartVecTy);
		assert(LegalizedVecBitCount >= OrigBitCount &&
		"Number of bits-to-be-loaded shouldn't decrease!");

		// Do we already load a multiple of the legalized type?
		if (OrigBitCount == LegalizedVecBitCount)
		return false;

		// How many more elements would we need to load?
		unsigned NumExtraBits = LegalizedVecBitCount - OrigBitCount;
		if (NumExtraBits % NumBitsPerElt != 0)
		return false; // Not a multiple of element size.
		// FIXME: might be able to handle some cases if they are multiple of byte.

		unsigned NumExtraElts = NumExtraBits / NumBitsPerElt;

		auto *WideVecTy =
		FixedVectorType::get(ScalarEltTy, OrigNumElts + NumExtraElts);
		assert(DL.getTypeSizeInBits(WideVecTy) == LegalizedVecBitCount &&
		"Failed to properly widen OrigVecTy to match the total legalized "
		"vector size?");

		// Okay, we currently load less than full width of the legalized vectors.
		// If we'd widen the load, would that be more costly than the current load?
		InstructionCost OldLoadCost =
		TTI.getMemoryOpCost(Instruction::Load, OrigVecTy, Alignment, AS);
		InstructionCost NewLoadCost =
		TTI.getMemoryOpCost(Instruction::Load, WideVecTy, Alignment, AS);
		if (NewLoadCost > OldLoadCost)
		return false;

		// It would not be more costly. But can we perform such a wide load?
		if (!isSafeToLoadUnconditionally(OrigPtr, WideVecTy, Align(1), DL, Load, &DT,
		/*TLI=*/nullptr))
		return false;

		IRBuilder<> Builder(Load);
		Value *CastedPtr =
		Builder.CreateBitCast(OrigPtr, WideVecTy->getPointerTo(AS));
		Value *WideVecLd = Builder.CreateAlignedLoad(WideVecTy, CastedPtr, Alignment);
		// We loaded some extra elements, we only need the low NumElts ones.
		// This is endianness-insensitive.
		SmallVector<int, 32> Mask(OrigNumElts);
		std::iota(Mask.begin(), Mask.end(), 0);
		Value *ExtractedLowSubvector = Builder.CreateShuffleVector(WideVecLd, Mask);
		replaceValue(I, *ExtractedLowSubvector);
		++NumVecLoadWiden;
		return true;
		}

/// Determine which, if any, of the inputs should be replaced by a shuffle		/// Determine which, if any, of the inputs should be replaced by a shuffle
/// followed by extract from a different index.		/// followed by extract from a different index.
ExtractElementInst *VectorCombine::getShuffleExtract(		ExtractElementInst *VectorCombine::getShuffleExtract(
ExtractElementInst *Ext0, ExtractElementInst *Ext1,		ExtractElementInst *Ext0, ExtractElementInst *Ext1,
unsigned PreferredExtractIndex = InvalidIndex) const {		unsigned PreferredExtractIndex = InvalidIndex) const {
assert(isa<ConstantInt>(Ext0->getIndexOperand()) &&		assert(isa<ConstantInt>(Ext0->getIndexOperand()) &&
isa<ConstantInt>(Ext1->getIndexOperand()) &&		isa<ConstantInt>(Ext1->getIndexOperand()) &&
"Expected constant extract indexes");		"Expected constant extract indexes");
▲ Show 20 Lines • Show All 720 Lines • ▼ Show 20 Lines	for (BasicBlock &BB : F) {
if (!DT.isReachableFromEntry(&BB))		if (!DT.isReachableFromEntry(&BB))
continue;		continue;
// Use early increment range so that we can erase instructions in loop.		// Use early increment range so that we can erase instructions in loop.
for (Instruction &I : make_early_inc_range(BB)) {		for (Instruction &I : make_early_inc_range(BB)) {
if (isa<DbgInfoIntrinsic>(I))		if (isa<DbgInfoIntrinsic>(I))
continue;		continue;
Builder.SetInsertPoint(&I);		Builder.SetInsertPoint(&I);
MadeChange |= vectorizeLoadInsert(I);		MadeChange |= vectorizeLoadInsert(I);
		MadeChange |= widenPartialVectorLoad(I);
MadeChange |= foldExtractExtract(I);		MadeChange |= foldExtractExtract(I);
MadeChange |= foldBitcastShuf(I);		MadeChange |= foldBitcastShuf(I);
MadeChange |= scalarizeBinopOrCmp(I);		MadeChange |= scalarizeBinopOrCmp(I);
MadeChange |= foldExtractedCmps(I);		MadeChange |= foldExtractedCmps(I);
MadeChange |= scalarizeLoadExtract(I);		MadeChange |= scalarizeLoadExtract(I);
MadeChange |= foldSingleElementStore(I);		MadeChange |= foldSingleElementStore(I);
}		}
}		}
▲ Show 20 Lines • Show All 70 Lines • Show Last 20 Lines

llvm/test/Transforms/VectorCombine/X86/load-inseltpoison.ll

Show First 20 Lines • Show All 581 Lines • ▼ Show 20 Lines	;
%result1 = insertelement <2 x float> %result0, float %t5, i32 1		%result1 = insertelement <2 x float> %result0, float %t5, i32 1
store <2 x float> %result1, <2 x float>* %resultptr, align 8		store <2 x float> %result1, <2 x float>* %resultptr, align 8
ret void		ret void
}		}

define <4 x float> @load_v2f32_extract_insert_v4f32(<2 x float>* align 16 dereferenceable(16) %p) nofree nosync {		define <4 x float> @load_v2f32_extract_insert_v4f32(<2 x float>* align 16 dereferenceable(16) %p) nofree nosync {
; CHECK-LABEL: @load_v2f32_extract_insert_v4f32(		; CHECK-LABEL: @load_v2f32_extract_insert_v4f32(
; CHECK-NEXT: [[TMP1:%.*]] = bitcast <2 x float>* [[P:%.*]] to <4 x float>*		; CHECK-NEXT: [[TMP1:%.*]] = bitcast <2 x float>* [[P:%.*]] to <4 x float>*
; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 16		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 4
; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x float> [[TMP2]], <4 x float> poison, <4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>		; CHECK-NEXT: [[L:%.*]] = shufflevector <4 x float> [[TMP2]], <4 x float> poison, <2 x i32> <i32 0, i32 1>
		; CHECK-NEXT: [[S:%.*]] = extractelement <2 x float> [[L]], i32 0
		; CHECK-NEXT: [[R:%.*]] = insertelement <4 x float> poison, float [[S]], i32 0
; CHECK-NEXT: ret <4 x float> [[R]]		; CHECK-NEXT: ret <4 x float> [[R]]
;		;
%l = load <2 x float>, <2 x float>* %p, align 4		%l = load <2 x float>, <2 x float>* %p, align 4
%s = extractelement <2 x float> %l, i32 0		%s = extractelement <2 x float> %l, i32 0
%r = insertelement <4 x float> poison, float %s, i32 0		%r = insertelement <4 x float> poison, float %s, i32 0
ret <4 x float> %r		ret <4 x float> %r
}		}

▲ Show 20 Lines • Show All 50 Lines • Show Last 20 Lines

llvm/test/Transforms/VectorCombine/X86/load-widening.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=sse2 --data-layout="e" | FileCheck %s --check-prefixes=CHECK		; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=sse2 --data-layout="e" | FileCheck %s --check-prefixes=CHECK
; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=avx2 --data-layout="e" | FileCheck %s --check-prefixes=CHECK		; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=avx2 --data-layout="e" | FileCheck %s --check-prefixes=CHECK
; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=sse2 --data-layout="E" | FileCheck %s --check-prefixes=CHECK		; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=sse2 --data-layout="E" | FileCheck %s --check-prefixes=CHECK
; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=avx2 --data-layout="E" | FileCheck %s --check-prefixes=CHECK		; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=avx2 --data-layout="E" | FileCheck %s --check-prefixes=CHECK

;-------------------------------------------------------------------------------		;-------------------------------------------------------------------------------
; Here we know we can load 128 bits as per dereferenceability and alignment.		; Here we know we can load 128 bits as per dereferenceability and alignment.

; We don't widen scalar loads per-se.		; We don't widen scalar loads per-se.
define <1 x float> @scalar(<1 x float>* align 16 dereferenceable(16) %p) {		define <1 x float> @scalar(<1 x float>* align 16 dereferenceable(16) %p) {
; CHECK-LABEL: @scalar(		; CHECK-LABEL: @scalar(
Show All 11 Lines
; CHECK-NEXT: ret <1 x float> [[R]]		; CHECK-NEXT: ret <1 x float> [[R]]
;		;
%r = load <1 x float>, <1 x float>* %p, align 16		%r = load <1 x float>, <1 x float>* %p, align 16
ret <1 x float> %r		ret <1 x float> %r
}		}

define <2 x float> @vec_with_2elts(<2 x float>* align 16 dereferenceable(16) %p) {		define <2 x float> @vec_with_2elts(<2 x float>* align 16 dereferenceable(16) %p) {
; CHECK-LABEL: @vec_with_2elts(		; CHECK-LABEL: @vec_with_2elts(
; CHECK-NEXT: [[R:%.*]] = load <2 x float>, <2 x float>* [[P:%.*]], align 16		; CHECK-NEXT: [[TMP1:%.*]] = bitcast <2 x float>* [[P:%.*]] to <4 x float>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 16
		; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x float> [[TMP2]], <4 x float> poison, <2 x i32> <i32 0, i32 1>
; CHECK-NEXT: ret <2 x float> [[R]]		; CHECK-NEXT: ret <2 x float> [[R]]
;		;
%r = load <2 x float>, <2 x float>* %p, align 16		%r = load <2 x float>, <2 x float>* %p, align 16
ret <2 x float> %r		ret <2 x float> %r
}		}

define <3 x float> @vec_with_3elts(<3 x float>* align 16 dereferenceable(16) %p) {		define <3 x float> @vec_with_3elts(<3 x float>* align 16 dereferenceable(16) %p) {
; CHECK-LABEL: @vec_with_3elts(		; CHECK-LABEL: @vec_with_3elts(
; CHECK-NEXT: [[R:%.*]] = load <3 x float>, <3 x float>* [[P:%.*]], align 16		; CHECK-NEXT: [[TMP1:%.*]] = bitcast <3 x float>* [[P:%.*]] to <4 x float>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 16
		; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x float> [[TMP2]], <4 x float> poison, <3 x i32> <i32 0, i32 1, i32 2>
; CHECK-NEXT: ret <3 x float> [[R]]		; CHECK-NEXT: ret <3 x float> [[R]]
;		;
%r = load <3 x float>, <3 x float>* %p, align 16		%r = load <3 x float>, <3 x float>* %p, align 16
ret <3 x float> %r		ret <3 x float> %r
}		}

; Full-vector load. All good already.		; Full-vector load. All good already.
define <4 x float> @vec_with_4elts(<4 x float>* align 16 dereferenceable(16) %p) {		define <4 x float> @vec_with_4elts(<4 x float>* align 16 dereferenceable(16) %p) {
[... 15 lines not shown ...]
ret <5 x float> %r		ret <5 x float> %r
}		}

;-------------------------------------------------------------------------------		;-------------------------------------------------------------------------------

; We can load 128 bits, and the fact that it's underaligned isn't relevant.		; We can load 128 bits, and the fact that it's underaligned isn't relevant.
define <3 x float> @vec_with_3elts_underaligned(<3 x float>* align 8 dereferenceable(16) %p) {		define <3 x float> @vec_with_3elts_underaligned(<3 x float>* align 8 dereferenceable(16) %p) {
; CHECK-LABEL: @vec_with_3elts_underaligned(		; CHECK-LABEL: @vec_with_3elts_underaligned(
; CHECK-NEXT: [[R:%.*]] = load <3 x float>, <3 x float>* [[P:%.*]], align 8		; CHECK-NEXT: [[TMP1:%.*]] = bitcast <3 x float>* [[P:%.*]] to <4 x float>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 8
		; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x float> [[TMP2]], <4 x float> poison, <3 x i32> <i32 0, i32 1, i32 2>
; CHECK-NEXT: ret <3 x float> [[R]]		; CHECK-NEXT: ret <3 x float> [[R]]
;		;
%r = load <3 x float>, <3 x float>* %p, align 8		%r = load <3 x float>, <3 x float>* %p, align 8
ret <3 x float> %r		ret <3 x float> %r
}		}

; We don't know we can load 128 bits, but since it's aligned, we still can do wide load.		; We don't know we can load 128 bits, but since it's aligned, we still can do wide load.
; FIXME: this should still get widened.		; FIXME: this should still get widened.
[... 25 lines not shown ...]
; CHECK-NEXT: ret <1 x float> [[R]]		; CHECK-NEXT: ret <1 x float> [[R]]
;		;
%r = load <1 x float>, <1 x float>* %p, align 32		%r = load <1 x float>, <1 x float>* %p, align 32
ret <1 x float> %r		ret <1 x float> %r
}		}

define <2 x float> @vec_with_2elts_256bits(<2 x float>* align 32 dereferenceable(32) %p) {		define <2 x float> @vec_with_2elts_256bits(<2 x float>* align 32 dereferenceable(32) %p) {
; CHECK-LABEL: @vec_with_2elts_256bits(		; CHECK-LABEL: @vec_with_2elts_256bits(
; CHECK-NEXT: [[R:%.*]] = load <2 x float>, <2 x float>* [[P:%.*]], align 32		; CHECK-NEXT: [[TMP1:%.*]] = bitcast <2 x float>* [[P:%.*]] to <4 x float>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 32
		; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x float> [[TMP2]], <4 x float> poison, <2 x i32> <i32 0, i32 1>
; CHECK-NEXT: ret <2 x float> [[R]]		; CHECK-NEXT: ret <2 x float> [[R]]
;		;
%r = load <2 x float>, <2 x float>* %p, align 32		%r = load <2 x float>, <2 x float>* %p, align 32
ret <2 x float> %r		ret <2 x float> %r
}		}

define <3 x float> @vec_with_3elts_256bits(<3 x float>* align 32 dereferenceable(32) %p) {		define <3 x float> @vec_with_3elts_256bits(<3 x float>* align 32 dereferenceable(32) %p) {
; CHECK-LABEL: @vec_with_3elts_256bits(		; CHECK-LABEL: @vec_with_3elts_256bits(
; CHECK-NEXT: [[R:%.*]] = load <3 x float>, <3 x float>* [[P:%.*]], align 32		; CHECK-NEXT: [[TMP1:%.*]] = bitcast <3 x float>* [[P:%.*]] to <4 x float>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 32
		; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x float> [[TMP2]], <4 x float> poison, <3 x i32> <i32 0, i32 1, i32 2>
; CHECK-NEXT: ret <3 x float> [[R]]		; CHECK-NEXT: ret <3 x float> [[R]]
;		;
%r = load <3 x float>, <3 x float>* %p, align 32		%r = load <3 x float>, <3 x float>* %p, align 32
ret <3 x float> %r		ret <3 x float> %r
}		}

define <4 x float> @vec_with_4elts_256bits(<4 x float>* align 32 dereferenceable(32) %p) {		define <4 x float> @vec_with_4elts_256bits(<4 x float>* align 32 dereferenceable(32) %p) {
; CHECK-LABEL: @vec_with_4elts_256bits(		; CHECK-LABEL: @vec_with_4elts_256bits(
; CHECK-NEXT: [[R:%.*]] = load <4 x float>, <4 x float>* [[P:%.*]], align 32		; CHECK-NEXT: [[R:%.*]] = load <4 x float>, <4 x float>* [[P:%.*]], align 32
; CHECK-NEXT: ret <4 x float> [[R]]		; CHECK-NEXT: ret <4 x float> [[R]]
;		;
%r = load <4 x float>, <4 x float>* %p, align 32		%r = load <4 x float>, <4 x float>* %p, align 32
ret <4 x float> %r		ret <4 x float> %r
}		}

define <5 x float> @vec_with_5elts_256bits(<5 x float>* align 32 dereferenceable(32) %p) {		define <5 x float> @vec_with_5elts_256bits(<5 x float>* align 32 dereferenceable(32) %p) {
; CHECK-LABEL: @vec_with_5elts_256bits(		; CHECK-LABEL: @vec_with_5elts_256bits(
; CHECK-NEXT: [[R:%.*]] = load <5 x float>, <5 x float>* [[P:%.*]], align 32		; CHECK-NEXT: [[TMP1:%.*]] = bitcast <5 x float>* [[P:%.*]] to <8 x float>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <8 x float>, <8 x float>* [[TMP1]], align 32
		; CHECK-NEXT: [[R:%.*]] = shufflevector <8 x float> [[TMP2]], <8 x float> poison, <5 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4>
; CHECK-NEXT: ret <5 x float> [[R]]		; CHECK-NEXT: ret <5 x float> [[R]]
;		;
%r = load <5 x float>, <5 x float>* %p, align 32		%r = load <5 x float>, <5 x float>* %p, align 32
ret <5 x float> %r		ret <5 x float> %r
}		}

define <6 x float> @vec_with_6elts_256bits(<6 x float>* align 32 dereferenceable(32) %p) {		define <6 x float> @vec_with_6elts_256bits(<6 x float>* align 32 dereferenceable(32) %p) {
; CHECK-LABEL: @vec_with_6elts_256bits(		; CHECK-LABEL: @vec_with_6elts_256bits(
; CHECK-NEXT: [[R:%.*]] = load <6 x float>, <6 x float>* [[P:%.*]], align 32		; CHECK-NEXT: [[TMP1:%.*]] = bitcast <6 x float>* [[P:%.*]] to <8 x float>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <8 x float>, <8 x float>* [[TMP1]], align 32
		; CHECK-NEXT: [[R:%.*]] = shufflevector <8 x float> [[TMP2]], <8 x float> poison, <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
; CHECK-NEXT: ret <6 x float> [[R]]		; CHECK-NEXT: ret <6 x float> [[R]]
;		;
%r = load <6 x float>, <6 x float>* %p, align 32		%r = load <6 x float>, <6 x float>* %p, align 32
ret <6 x float> %r		ret <6 x float> %r
}		}

define <7 x float> @vec_with_7elts_256bits(<7 x float>* align 32 dereferenceable(32) %p) {		define <7 x float> @vec_with_7elts_256bits(<7 x float>* align 32 dereferenceable(32) %p) {
		lebedev.ri (Author) [Done]: We widen this to either 2x XMM or 1x YMM, and we know we can do this as per deref info.
; CHECK-LABEL: @vec_with_7elts_256bits(		; CHECK-LABEL: @vec_with_7elts_256bits(
; CHECK-NEXT: [[R:%.*]] = load <7 x float>, <7 x float>* [[P:%.*]], align 32		; CHECK-NEXT: [[TMP1:%.*]] = bitcast <7 x float>* [[P:%.*]] to <8 x float>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <8 x float>, <8 x float>* [[TMP1]], align 32
		; CHECK-NEXT: [[R:%.*]] = shufflevector <8 x float> [[TMP2]], <8 x float> poison, <7 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6>
; CHECK-NEXT: ret <7 x float> [[R]]		; CHECK-NEXT: ret <7 x float> [[R]]
;		;
%r = load <7 x float>, <7 x float>* %p, align 32		%r = load <7 x float>, <7 x float>* %p, align 32
ret <7 x float> %r		ret <7 x float> %r
}		}

; Full-vector load. All good already.		; Full-vector load. All good already.
define <8 x float> @vec_with_8elts_256bits(<8 x float>* align 32 dereferenceable(32) %p) {		define <8 x float> @vec_with_8elts_256bits(<8 x float>* align 32 dereferenceable(32) %p) {
; CHECK-LABEL: @vec_with_8elts_256bits(		; CHECK-LABEL: @vec_with_8elts_256bits(
; CHECK-NEXT: [[R:%.*]] = load <8 x float>, <8 x float>* [[P:%.*]], align 32		; CHECK-NEXT: [[R:%.*]] = load <8 x float>, <8 x float>* [[P:%.*]], align 32
; CHECK-NEXT: ret <8 x float> [[R]]		; CHECK-NEXT: ret <8 x float> [[R]]
;		;
%r = load <8 x float>, <8 x float>* %p, align 32		%r = load <8 x float>, <8 x float>* %p, align 32
ret <8 x float> %r		ret <8 x float> %r
}		}

; We can't tell if we can load more than 256 bits.		; We can't tell if we can load more than 256 bits.
define <9 x float> @vec_with_9elts_256bits(<9 x float>* align 32 dereferenceable(32) %p) {		define <9 x float> @vec_with_9elts_256bits(<9 x float>* align 32 dereferenceable(32) %p) {
		spatel [Done]: Can you explain the difference between this test and vec_with_7elts_256bits for an SSE target? It's not obvious to me why we are ok widening to 256-bit if that's not legal, but not wider than that.
		lebedev.ri (Author) [Done]: We need to widen this to either 3x XMM or 2x YMM, but we don't know we can load that many bytes. There is another problem hiding here: if we know we can load 3x XMM, we still try to widen to 4x XMM, because that's what the legalizer told us, because it only knows how to double.
; CHECK-LABEL: @vec_with_9elts_256bits(		; CHECK-LABEL: @vec_with_9elts_256bits(
; CHECK-NEXT: [[R:%.*]] = load <9 x float>, <9 x float>* [[P:%.*]], align 32		; CHECK-NEXT: [[R:%.*]] = load <9 x float>, <9 x float>* [[P:%.*]], align 32
; CHECK-NEXT: ret <9 x float> [[R]]		; CHECK-NEXT: ret <9 x float> [[R]]
;		;
%r = load <9 x float>, <9 x float>* %p, align 32		%r = load <9 x float>, <9 x float>* %p, align 32
ret <9 x float> %r		ret <9 x float> %r
}		}
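
The register-count arithmetic from the inline comments on vec_with_7elts_256bits and vec_with_9elts_256bits can be sketched as follows. This is a hypothetical model, not code from the patch: the names `legalizedBytes` and `canWiden`, the fixed 16-byte (XMM-sized) base register, and the power-of-two rounding are assumptions chosen to match the remark that the legalizer "only knows how to double". Alignment and the single-element case are not modeled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: bytes occupied after legalization, assuming 16-byte
// (XMM-sized) registers whose total size is rounded up by repeated doubling.
inline uint64_t legalizedBytes(uint64_t LoadBytes) {
  uint64_t Bytes = 16; // one XMM register
  while (Bytes < LoadBytes)
    Bytes *= 2; // assumption: the legalizer only doubles
  return Bytes;
}

// True when the load is narrower than its legalized size and all of the
// legalized size is known to be dereferenceable.
inline bool canWiden(uint64_t LoadBytes, uint64_t DerefBytes) {
  uint64_t Wide = legalizedBytes(LoadBytes);
  return LoadBytes < Wide && Wide <= DerefBytes;
}
```

Under this model a <7 x float> load (28 bytes) with dereferenceable(32) widens to 32 bytes, while a <9 x float> load (36 bytes) rounds up to 64 bytes and is rejected even when 48 bytes (3x XMM) are known dereferenceable, which is the "other problem" mentioned in the comment.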

;-------------------------------------------------------------------------------		;-------------------------------------------------------------------------------

; Weird types we don't deal with		; Weird types we don't deal with

define <2 x i7> @vec_with_two_subbyte_elts(<2 x i7>* align 16 dereferenceable(16) %p) {		define <2 x i7> @vec_with_two_subbyte_elts(<2 x i7>* align 16 dereferenceable(16) %p) {
; CHECK-LABEL: @vec_with_two_subbyte_elts(		; CHECK-LABEL: @vec_with_two_subbyte_elts(
; CHECK-NEXT: [[R:%.*]] = load <2 x i7>, <2 x i7>* [[P:%.*]], align 16		; CHECK-NEXT: [[R:%.*]] = load <2 x i7>, <2 x i7>* [[P:%.*]], align 16
; CHECK-NEXT: ret <2 x i7> [[R]]		; CHECK-NEXT: ret <2 x i7> [[R]]
;		;
%r = load <2 x i7>, <2 x i7>* %p, align 16		%r = load <2 x i7>, <2 x i7>* %p, align 16
ret <2 x i7> %r		ret <2 x i7> %r
}		}
[... 13 lines not shown ...]
; CHECK-NEXT: ret <2 x i24> [[R]]		; CHECK-NEXT: ret <2 x i24> [[R]]
;		;
%r = load <2 x i24>, <2 x i24>* %p, align 16		%r = load <2 x i24>, <2 x i24>* %p, align 16
ret <2 x i24> %r		ret <2 x i24> %r
}		}

define <2 x float> @vec_with_2elts_addressspace(<2 x float> addrspace(2)* align 16 dereferenceable(16) %p) {		define <2 x float> @vec_with_2elts_addressspace(<2 x float> addrspace(2)* align 16 dereferenceable(16) %p) {
; CHECK-LABEL: @vec_with_2elts_addressspace(		; CHECK-LABEL: @vec_with_2elts_addressspace(
; CHECK-NEXT: [[R:%.*]] = load <2 x float>, <2 x float> addrspace(2)* [[P:%.*]], align 16		; CHECK-NEXT: [[TMP1:%.*]] = bitcast <2 x float> addrspace(2)* [[P:%.*]] to <4 x float> addrspace(2)*
		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float> addrspace(2)* [[TMP1]], align 16
		; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x float> [[TMP2]], <4 x float> poison, <2 x i32> <i32 0, i32 1>
; CHECK-NEXT: ret <2 x float> [[R]]		; CHECK-NEXT: ret <2 x float> [[R]]
;		;
%r = load <2 x float>, <2 x float> addrspace(2)* %p, align 16		%r = load <2 x float>, <2 x float> addrspace(2)* %p, align 16
ret <2 x float> %r		ret <2 x float> %r
}		}

;-------------------------------------------------------------------------------		;-------------------------------------------------------------------------------

; Widening these would change the legalized type, so leave them alone.		; Weird types we do deal with

define <2 x i1> @vec_with_2elts_128bits_i1(<2 x i1>* align 16 dereferenceable(16) %p) {		define <2 x i1> @vec_with_2elts_128bits_i1(<2 x i1>* align 16 dereferenceable(16) %p) {
; CHECK-LABEL: @vec_with_2elts_128bits_i1(		; CHECK-LABEL: @vec_with_2elts_128bits_i1(
; CHECK-NEXT: [[R:%.*]] = load <2 x i1>, <2 x i1>* [[P:%.*]], align 16		; CHECK-NEXT: [[TMP1:%.*]] = bitcast <2 x i1>* [[P:%.*]] to <128 x i1>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <128 x i1>, <128 x i1>* [[TMP1]], align 16
		; CHECK-NEXT: [[R:%.*]] = shufflevector <128 x i1> [[TMP2]], <128 x i1> poison, <2 x i32> <i32 0, i32 1>
; CHECK-NEXT: ret <2 x i1> [[R]]		; CHECK-NEXT: ret <2 x i1> [[R]]
;		;
%r = load <2 x i1>, <2 x i1>* %p, align 16		%r = load <2 x i1>, <2 x i1>* %p, align 16
ret <2 x i1> %r		ret <2 x i1> %r
}		}
define <2 x i2> @vec_with_2elts_128bits_i2(<2 x i2>* align 16 dereferenceable(16) %p) {		define <2 x i2> @vec_with_2elts_128bits_i2(<2 x i2>* align 16 dereferenceable(16) %p) {
; CHECK-LABEL: @vec_with_2elts_128bits_i2(		; CHECK-LABEL: @vec_with_2elts_128bits_i2(
; CHECK-NEXT: [[R:%.*]] = load <2 x i2>, <2 x i2>* [[P:%.*]], align 16		; CHECK-NEXT: [[TMP1:%.*]] = bitcast <2 x i2>* [[P:%.*]] to <64 x i2>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <64 x i2>, <64 x i2>* [[TMP1]], align 16
		; CHECK-NEXT: [[R:%.*]] = shufflevector <64 x i2> [[TMP2]], <64 x i2> poison, <2 x i32> <i32 0, i32 1>
; CHECK-NEXT: ret <2 x i2> [[R]]		; CHECK-NEXT: ret <2 x i2> [[R]]
;		;
%r = load <2 x i2>, <2 x i2>* %p, align 16		%r = load <2 x i2>, <2 x i2>* %p, align 16
ret <2 x i2> %r		ret <2 x i2> %r
}		}
define <2 x i4> @vec_with_2elts_128bits_i4(<2 x i4>* align 16 dereferenceable(16) %p) {		define <2 x i4> @vec_with_2elts_128bits_i4(<2 x i4>* align 16 dereferenceable(16) %p) {
; CHECK-LABEL: @vec_with_2elts_128bits_i4(		; CHECK-LABEL: @vec_with_2elts_128bits_i4(
; CHECK-NEXT: [[R:%.*]] = load <2 x i4>, <2 x i4>* [[P:%.*]], align 16		; CHECK-NEXT: [[TMP1:%.*]] = bitcast <2 x i4>* [[P:%.*]] to <32 x i4>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <32 x i4>, <32 x i4>* [[TMP1]], align 16
		; CHECK-NEXT: [[R:%.*]] = shufflevector <32 x i4> [[TMP2]], <32 x i4> poison, <2 x i32> <i32 0, i32 1>
; CHECK-NEXT: ret <2 x i4> [[R]]		; CHECK-NEXT: ret <2 x i4> [[R]]
;		;
%r = load <2 x i4>, <2 x i4>* %p, align 16		%r = load <2 x i4>, <2 x i4>* %p, align 16
ret <2 x i4> %r		ret <2 x i4> %r
}		}
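
The widened element counts in the sub-byte tests above (<128 x i1>, <64 x i2>, <32 x i4>) follow from dividing the 128-bit widened size by the element width. A minimal sketch of that arithmetic is below. The helper name `widenedNumElts` is hypothetical, and the divisibility check is an assumption that happens to agree with <2 x i7> being left unchanged in the checks; the patch's actual condition is not shown in this excerpt.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of elements in the widened vector type, or 0
// when the element width does not evenly divide the widened size
// (assumption: such types are left alone).
inline uint64_t widenedNumElts(uint64_t EltBits, uint64_t WideBits) {
  if (EltBits == 0 || WideBits % EltBits != 0)
    return 0;
  return WideBits / EltBits;
}
```

For example, `widenedNumElts(4, 128)` gives the 32 elements seen in the `<32 x i4>` check lines.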

llvm/test/Transforms/VectorCombine/X86/load.ll

[... 581 lines not shown ...]
%result1 = insertelement <2 x float> %result0, float %t5, i32 1		%result1 = insertelement <2 x float> %result0, float %t5, i32 1
store <2 x float> %result1, <2 x float>* %resultptr, align 8		store <2 x float> %result1, <2 x float>* %resultptr, align 8
ret void		ret void
}		}

define <4 x float> @load_v2f32_extract_insert_v4f32(<2 x float>* align 16 dereferenceable(16) %p) nofree nosync {		define <4 x float> @load_v2f32_extract_insert_v4f32(<2 x float>* align 16 dereferenceable(16) %p) nofree nosync {
; CHECK-LABEL: @load_v2f32_extract_insert_v4f32(		; CHECK-LABEL: @load_v2f32_extract_insert_v4f32(
; CHECK-NEXT: [[TMP1:%.*]] = bitcast <2 x float>* [[P:%.*]] to <4 x float>*		; CHECK-NEXT: [[TMP1:%.*]] = bitcast <2 x float>* [[P:%.*]] to <4 x float>*
; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 16		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 4
; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x float> [[TMP2]], <4 x float> poison, <4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>		; CHECK-NEXT: [[L:%.*]] = shufflevector <4 x float> [[TMP2]], <4 x float> poison, <2 x i32> <i32 0, i32 1>
		; CHECK-NEXT: [[S:%.*]] = extractelement <2 x float> [[L]], i32 0
		; CHECK-NEXT: [[R:%.*]] = insertelement <4 x float> undef, float [[S]], i32 0
; CHECK-NEXT: ret <4 x float> [[R]]		; CHECK-NEXT: ret <4 x float> [[R]]
;		;
%l = load <2 x float>, <2 x float>* %p, align 4		%l = load <2 x float>, <2 x float>* %p, align 4
%s = extractelement <2 x float> %l, i32 0		%s = extractelement <2 x float> %l, i32 0
%r = insertelement <4 x float> undef, float %s, i32 0		%r = insertelement <4 x float> undef, float %s, i32 0
ret <4 x float> %r		ret <4 x float> %r
}		}

[... 50 lines not shown ...]


Revision Contents

Diff 360465
