This is an archive of the discontinued LLVM Phabricator instance.

Teach instcombine to canonicalize "element extraction" from a load of an integer and "element insertion" into a store of an integer into actual element extraction, element insertion, and vector loads and stores.
Status: Closed (Public)

Authored by chandlerc on Dec 5 2014, 3:56 AM.


Details

Reviewers

majnemer
hfinkel

Commits

rG7415205113f8: Teach instcombine to canonicalize "element extraction" from a load of an…
rL223764: Teach instcombine to canonicalize "element extraction" from a load of an

Summary

Previously various parts of LLVM (including instcombine itself) would
introduce integer loads and stores into the code as a way of opaquely
loading and storing "bits". In some cases (such as a memcpy of a
std::complex<float> object) we will eventually end up using those bits
in non-integer types. In order for SROA to effectively promote the
allocas involved, it splits these "store a bag of bits" integer loads
and stores up into the constituent parts. However, for non-alloca loads
and stores which remain, it uses integer math to recombine the values
into a large integer to load or store.
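The following is a minimal C++ sketch of the "bag of bits" integer path for a std::complex<float>. It is not code from the patch. The sketch loads the object as one 64-bit integer and splits it with a shift, a truncation, and a bit-level cast. It then recombines the parts with a zero-extension, a shift, and an or. It assumes a little-endian host and 32-bit IEEE `float`; the function names are hypothetical.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <cstring>

// Assumes a little-endian host and 32-bit IEEE float, so the real part
// occupies the low 32 bits of the 64-bit integer.
static_assert(sizeof(std::complex<float>) == sizeof(std::uint64_t),
              "expects two packed 32-bit floats");

// Integer load: copy the object's bytes into one i64-like value.
inline std::uint64_t loadAsInteger(const std::complex<float>& c) {
  std::uint64_t bits;
  std::memcpy(&bits, &c, sizeof bits);
  return bits;
}

// Element extraction via integer math: lshr, trunc, then bitcast to float.
inline std::complex<float> splitBits(std::uint64_t bits) {
  const std::uint32_t lo = static_cast<std::uint32_t>(bits);        // trunc
  const std::uint32_t hi = static_cast<std::uint32_t>(bits >> 32);  // lshr + trunc
  float re, im;
  std::memcpy(&re, &lo, sizeof re);  // bitcast i32 -> float
  std::memcpy(&im, &hi, sizeof im);
  return {re, im};
}

// Element insertion via integer math: bitcast, zext, shl, or.
inline std::uint64_t combineBits(std::complex<float> c) {
  const float re = c.real(), im = c.imag();
  std::uint32_t lo, hi;
  std::memcpy(&lo, &re, sizeof lo);  // bitcast float -> i32
  std::memcpy(&hi, &im, sizeof hi);
  return (static_cast<std::uint64_t>(hi) << 32) | lo;  // zext + shl + or
}
```

The patch replaces this shift/truncate/or traffic with a `<2 x float>` load or store plus `extractelement`/`insertelement` when every piece has the same non-integer type.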

All of this would be "fine", except that it forces LLVM to go through
integer math to combine and split up values. While this makes perfect
sense for integers (and in fact is critical for bitfields to end up
lowering efficiently) it is *terrible* for non-integer types, especially
floating point types. We have a much more canonical way of representing
the act of concatenating the bits of two SSA values in LLVM: a vector
and insertelement. This patch teaches InstCombine to use this
representation.

With this patch applied, LLVM will no longer introduce integer math into
the critical path of every loop over std::complex<float> operations such
as those that make up the hot path of ... oh, most HPC code, Eigen, and
any other heavy linear algebra library.

For the record, I looked *extensively* at fixing this in other parts of
the compiler, but it just doesn't work:

We really do want to canonicalize memcpy and other bit-motion to integer loads and stores. SSA values are tremendously more powerful than "copy" intrinsics. Not doing this regresses massive amounts of LLVM's scalar optimizer.
We really do need to split up integer loads and stores of this form in SROA or every memcpy of a trivially copyable struct will prevent SSA formation of the members of that struct. It essentially turns off SROA.
The closest alternative is to actually split the loads and stores when partitioning with SROA, but this has all of the historically discussed downsides of splitting up loads and stores: the wide-store information is fundamentally lost. We would also see performance regressions for bitfield-heavy code and other places where the integers aren't really intended to be split.
We *can* effectively fix this in instcombine, so it isn't that hard of a choice to make IMO.

Diff Detail

Repository: rL LLVM

Event Timeline

chandlerc updated this revision to Diff 16983 (Dec 5 2014, 3:56 AM).

chandlerc retitled this revision to Teach instcombine to canonicalize "element extraction" from a load of an integer and "element insertion" into a store of an integer into actual element extraction, element insertion, and vector loads and stores.

chandlerc updated this object.

chandlerc edited the test plan for this revision.

chandlerc added reviewers: hfinkel, majnemer.

chandlerc added a subscriber: Unknown Object (MLST).

Fix a somewhat obvious goof with endianness and a bunch of formatting badness
that drifted into the patch.

Update the test case to be much more precise about what it checks and to
enforce that we get endianness correct now.
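The endianness question comes down to which vector lane a given shift amount selects. Below is a hypothetical C++ helper that mirrors the index formula in the final diff: `ShiftBits / ElementSize` on little-endian targets and `(BaseBits - ElementSize - ShiftBits) / ElementSize` on big-endian ones. It also checks the formula against the host's actual memory layout. `laneForShift` and the other names are illustrative only.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper mirroring the patch's index formula. On little-endian
// targets, the element at bit offset shiftBits is lane shiftBits / elementBits.
// On big-endian targets lane 0 holds the most-significant bits, so lanes are
// counted from the other end of the integer.
constexpr unsigned laneForShift(unsigned baseBits, unsigned elementBits,
                                unsigned shiftBits, bool littleEndian) {
  return (littleEndian ? shiftBits : baseBits - elementBits - shiftBits) /
         elementBits;
}

static_assert(laneForShift(64, 32, 32, /*littleEndian=*/true) == 1, "");
static_assert(laneForShift(64, 32, 32, /*littleEndian=*/false) == 0, "");

inline bool hostIsLittleEndian() {
  const std::uint16_t one = 1;
  unsigned char firstByte;
  std::memcpy(&firstByte, &one, 1);
  return firstByte == 1;
}

// Cross-check against real memory: store two distinct 32-bit lanes, reload
// them as one 64-bit integer, and confirm the formula names the lane that
// the high half (a shift of 32) came from.
inline bool formulaMatchesHost() {
  const std::uint32_t lanes[2] = {0x11111111u, 0x22222222u};
  std::uint64_t bits;
  std::memcpy(&bits, lanes, sizeof bits);
  const std::uint32_t highHalf = static_cast<std::uint32_t>(bits >> 32);
  return highHalf == lanes[laneForShift(64, 32, 32, hostIsLittleEndian())];
}
```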

(ping)

LGTM.

This will obviously lead to the formation of more vector types, many/most of which will be scalarized. We'll need to be careful re: code quality here, as the default expansion of extract_vector_elt, for example, stores the vector to the stack and loads one element (and will do this separately for each extraction). Nevertheless, this does seem to make sense as a canonical form.
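The following is a rough C++ model of the generic expansion described above. Its shape is an assumption, not the actual SelectionDAG code. Each extraction spills the whole vector to a stack temporary and reloads one element, so extracting every lane repeats the spill once per lane. The names and the spill counter are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical model of a stack-based extract_vector_elt expansion:
// spill the whole vector to a temporary, then reload a single element.
// `spills` counts how many times the vector was written out.
template <typename T, std::size_t N>
T extractViaStack(const std::array<T, N>& vec, std::size_t index,
                  std::size_t& spills) {
  assert(index < N);
  unsigned char slot[sizeof(T) * N];           // stand-in for the stack slot
  std::memcpy(slot, vec.data(), sizeof slot);  // store the full vector
  ++spills;
  T element;
  std::memcpy(&element, slot + index * sizeof(T), sizeof(T));  // load one lane
  return element;
}

// Scalarizing every lane repeats the spill once per extraction.
template <typename T, std::size_t N>
std::array<T, N> scalarizeViaStack(const std::array<T, N>& vec,
                                   std::size_t& spills) {
  std::array<T, N> out{};
  for (std::size_t i = 0; i < N; ++i)
    out[i] = extractViaStack(vec, i, spills);
  return out;
}
```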

It seems like the two primary types affected by this change will be floating-point types and pointers. You have tests with fp types, but none with pointers. Can you please add a test case with pointers? [The underlying logic is obviously the same, but we should have regression tests covering this basic facet of our canonical form]

lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
Line 390 (on Diff #16984): You have a more-explanatory comment for the store version; perhaps you could copy that here: "All of the elements extracted need to be the same type...."
Line 395 (on Diff #16984): So you will catch pointer types here and form vectors of pointers; this is likely worth a comment somewhere.

This revision is now accepted and ready to land (Dec 8 2014, 12:25 AM).

Thanks for the review! I've cleaned up the test a touch, added the pointer vector test, fixed the things you commented on.

Planning to commit. While I agree there may need to be more work if we start getting vectors of pointers and stuff, what I've tested already lowers better than the old code with the inputs we have today (floating point, and the existing backends).

lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
Line 390 (on Diff #16984): Done.
Line 395 (on Diff #16984): Sure.

Closed by commit rL223764 (authored by @chandlerc).

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp	390 lines
llvm/trunk/test/Transforms/InstCombine/loadstore-vector.ll	210 lines

Diff 17073

llvm/trunk/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp

//===- InstCombineLoadStoreAlloca.cpp -------------------------------------===//		//===- InstCombineLoadStoreAlloca.cpp -------------------------------------===//
//		//
// The LLVM Compiler Infrastructure		// The LLVM Compiler Infrastructure
//		//
// This file is distributed under the University of Illinois Open Source		// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.		// License. See LICENSE.TXT for details.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// This file implements the visit functions for load, store and alloca.		// This file implements the visit functions for load, store and alloca.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "InstCombine.h"		#include "InstCombine.h"
		#include "llvm/ADT/Optional.h"
#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
#include "llvm/Analysis/Loads.h"		#include "llvm/Analysis/Loads.h"
#include "llvm/IR/DataLayout.h"		#include "llvm/IR/DataLayout.h"
#include "llvm/IR/LLVMContext.h"		#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/IntrinsicInst.h"		#include "llvm/IR/IntrinsicInst.h"
		#include "llvm/IR/PatternMatch.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"		#include "llvm/Transforms/Utils/BasicBlockUtils.h"
#include "llvm/Transforms/Utils/Local.h"		#include "llvm/Transforms/Utils/Local.h"
using namespace llvm;		using namespace llvm;
		using namespace llvm::PatternMatch;

#define DEBUG_TYPE "instcombine"		#define DEBUG_TYPE "instcombine"

STATISTIC(NumDeadStore, "Number of dead stores eliminated");		STATISTIC(NumDeadStore, "Number of dead stores eliminated");
STATISTIC(NumGlobalCopies, "Number of allocas copied from constant global");		STATISTIC(NumGlobalCopies, "Number of allocas copied from constant global");

/// pointsToConstantGlobal - Return true if V (possibly indirectly) points to		/// pointsToConstantGlobal - Return true if V (possibly indirectly) points to
/// some part of a constant global variable. This intentionally only accepts		/// some part of a constant global variable. This intentionally only accepts
[309 unchanged lines not shown; context: case LLVMContext::MD_range:]
// FIXME: It would be nice to propagate this in some way, but the type		// FIXME: It would be nice to propagate this in some way, but the type
// conversions make it hard.		// conversions make it hard.
break;		break;
}		}
}		}
return NewLoad;		return NewLoad;
}		}

		/// \brief Combine integer loads to vector loads when the integer's bits are
		/// just a concatenation of non-integer (and non-vector) types.
		///
		/// This specifically matches the pattern of loading an integer, right-shifting,
		/// truncating, and casting it to a non-integer type. When the shift is an exact
		/// multiple of the result non-integer type's size, this is more naturally
		/// expressed as a load of a vector and an extractelement. This shows up largely
		/// because large integers are sometimes used to represent a "generic" load or
		/// store, and only later optimization may uncover that there is a more natural
		/// type to represent the load with.
		static Instruction *combineIntegerLoadToVectorLoad(InstCombiner &IC,
		LoadInst &LI) {
		// FIXME: This is probably a reasonable transform to make for atomic loads.
		assert(LI.isSimple() && "Do not call for non-simple loads!");

		const DataLayout &DL = *IC.getDataLayout();
		unsigned BaseBits = LI.getType()->getIntegerBitWidth();
		Type *ElementTy = nullptr;
		int ElementSize;

		// We match any number of element extractions from the loaded integer. Each of
		// these should be RAUW'ed with an actual extract element instruction at the
		// given index of a loaded vector.
		struct ExtractedElement {
		Instruction *Element;
		int Index;
		};
		SmallVector<ExtractedElement, 2> Elements;

		// Lambda to match the bit cast in the extracted element (which is the root
		// pattern matched). Accepts the instruction and shifted bits, returns false
		// if at any point we failed to match a suitable bitcast for element
		// extraction.
		auto MatchCast = [&](Instruction *I, unsigned ShiftBits) {
		// The truncate must be casted to some element type. This cast can only be
		// a bitcast or an inttoptr cast which is the same size.
		if (!isa<BitCastInst>(I)) {
		if (auto *PC = dyn_cast<IntToPtrInst>(I)) {
		// Ensure that the pointer and integer have the exact same size.
		if (PC->getOperand(0)->getType()->getIntegerBitWidth() !=
		DL.getTypeSizeInBits(PC->getType()))
		return false;
		} else {
		// We only support bitcast and inttoptr.
		return false;
		}
		}

		// All of the elements extracted need to be the same type. Either capture the
		// first element type or check this element type against the previous
		// element types.
		if (!ElementTy) {
		ElementTy = I->getType();
		// We don't handle integers, sub-vectors, or any aggregate types. We
		// handle pointers and floating point types.
		if (!ElementTy->isSingleValueType() || ElementTy->isIntegerTy() ||
		ElementTy->isVectorTy())
		return false;

		ElementSize = DL.getTypeSizeInBits(ElementTy);
		// The base integer size and the shift need to be multiples of the element
		// size in bits.
		if (BaseBits % ElementSize || ShiftBits % ElementSize)
		return false;
		} else if (ElementTy != I->getType()) {
		return false;
		}

		// Check the shift for every element, not only the first one: a shift that
		// is not a multiple of the element size would map to the wrong index.
		if (ShiftBits % ElementSize)
		return false;

		// Compute the vector index and store the element with it.
		int Index =
		(DL.isLittleEndian() ? ShiftBits : BaseBits - ElementSize - ShiftBits) /
		ElementSize;
		ExtractedElement E = {I, Index};
		Elements.push_back(std::move(E));
		return true;
		};

		// Lambda to match the truncate in the extracted element. Accepts the
		// instruction and shifted bits. Returns false if at any point we failed to
		// match a suitable truncate for element extraction.
		auto MatchTruncate = [&](Instruction *I, unsigned ShiftBits) {
		// Handle the truncate to the bit size of the element.
		auto *T = dyn_cast<TruncInst>(I);
		if (!T)
		return false;

		// Walk all the users of the truncate, which must all be bitcasts or inttoptr casts.
		for (User *TU : T->users())
		if (!MatchCast(cast<Instruction>(TU), ShiftBits))
		return false;
		return true;
		};

		for (User *U : LI.users()) {
		Instruction *I = cast<Instruction>(U);

		// Strip off a logical shift right and retain the shifted amount.
		ConstantInt *ShiftC;
		if (!match(I, m_LShr(m_Value(), m_ConstantInt(ShiftC)))) {
		// This must be a direct truncate.
		if (!MatchTruncate(I, 0))
		return nullptr;
		continue;
		}

		unsigned ShiftBits = ShiftC->getLimitedValue(BaseBits);
		// We can't handle shifts of more than the number of bits in the integer.
		if (ShiftBits == BaseBits)
		return nullptr;

		// Match all the element extraction users of the shift.
		for (User *IU : I->users())
		if (!MatchTruncate(cast<Instruction>(IU), ShiftBits))
		return nullptr;
		}

		// If we didn't find any extracted elements, there is nothing to do here.
		if (Elements.empty())
		return nullptr;

		// Create a vector load and rewrite all of the elements extracted as
		// extractelement instructions.
		VectorType *VTy = VectorType::get(ElementTy, BaseBits / ElementSize);
		LoadInst *NewLI = combineLoadToNewType(IC, LI, VTy);

		for (const auto &E : Elements) {
		IC.Builder->SetInsertPoint(E.Element);
		E.Element->replaceAllUsesWith(
		IC.Builder->CreateExtractElement(NewLI, IC.Builder->getInt32(E.Index)));
		IC.EraseInstFromFunction(*E.Element);
		}

		// Return the original load to indicate it has been combined away.
		return &LI;
		}

/// \brief Combine loads to match the type of value their uses after looking		/// \brief Combine loads to match the type of value their uses after looking
/// through intervening bitcasts.		/// through intervening bitcasts.
///		///
/// The core idea here is that if the result of a load is used in an operation,		/// The core idea here is that if the result of a load is used in an operation,
/// we should load the type most conducive to that operation. For example, when		/// we should load the type most conducive to that operation. For example, when
/// loading an integer and converting that immediately to a pointer, we should		/// loading an integer and converting that immediately to a pointer, we should
/// instead directly load a pointer.		/// instead directly load a pointer.
///		///
[12 unchanged lines not shown; context: static Instruction *combineLoadToOperationType(InstCombiner &IC, LoadInst &LI) {]
if (!LI.isSimple())		if (!LI.isSimple())
return nullptr;		return nullptr;

if (LI.use_empty())		if (LI.use_empty())
return nullptr;		return nullptr;


// Fold away bit casts of the loaded value by loading the desired type.		// Fold away bit casts of the loaded value by loading the desired type.
		// FIXME: We should also canonicalize loads of vectors when their elements are
		// cast to other types.
if (LI.hasOneUse())		if (LI.hasOneUse())
if (auto *BC = dyn_cast<BitCastInst>(LI.user_back())) {		if (auto *BC = dyn_cast<BitCastInst>(LI.user_back())) {
LoadInst *NewLoad = combineLoadToNewType(IC, LI, BC->getDestTy());		LoadInst *NewLoad = combineLoadToNewType(IC, LI, BC->getDestTy());
BC->replaceAllUsesWith(NewLoad);		BC->replaceAllUsesWith(NewLoad);
IC.EraseInstFromFunction(*BC);		IC.EraseInstFromFunction(*BC);
return &LI;		return &LI;
}		}

// FIXME: We should also canonicalize loads of vectors when their elements are		// Try to combine integer loads into vector loads when the integer is just
// cast to other types.		// loading a bag of bits that are casted into vector element chunks.
		if (LI.getType()->isIntegerTy())
		if (Instruction *R = combineIntegerLoadToVectorLoad(IC, LI))
		return R;

return nullptr;		return nullptr;
}		}

Instruction *InstCombiner::visitLoadInst(LoadInst &LI) {		Instruction *InstCombiner::visitLoadInst(LoadInst &LI) {
Value *Op = LI.getOperand(0);		Value *Op = LI.getOperand(0);

// Try to canonicalize the loaded type.		// Try to canonicalize the loaded type.
if (Instruction *Res = combineLoadToOperationType(*this, LI))		if (Instruction *Res = combineLoadToOperationType(*this, LI))
[92 unchanged lines not shown; context: if (SelectInst *SI = dyn_cast<SelectInst>(Op)) {]
LI.setOperand(0, SI->getOperand(1));		LI.setOperand(0, SI->getOperand(1));
return &LI;		return &LI;
}		}
}		}
}		}
return nullptr;		return nullptr;
}		}

/// \brief Combine stores to match the type of value being stored.		/// \brief Helper to combine a store to use a new value.
///		///
/// The core idea here is that the memory does not have any intrinsic type and		/// This just does the work of combining a store to use a new value, potentially
/// where we can we should match the type of a store to the type of value being		/// of a different type. It handles metadata, etc., and returns the new
/// stored.		/// instruction. The new value is stored to a bitcast of the pointer argument to
		/// the original store.
///		///
/// However, this routine must never change the width of a store or the number of		/// Note that this will create the instructions with whatever insert point the
/// stores as that would introduce a semantic change. This combine is expected to		/// \c InstCombiner currently is using.
/// be a semantic no-op which just allows stores to more closely model the types		static StoreInst *combineStoreToNewValue(InstCombiner &IC, StoreInst &OldSI,
/// of their incoming values.		Value *V) {
///		Value *Ptr = OldSI.getPointerOperand();
/// Currently, we also refuse to change the precise type used for an atomic or		unsigned AS = OldSI.getPointerAddressSpace();
/// volatile store. This is debatable, and might be reasonable to change later.
/// However, it is risky in case some backend or other part of LLVM is relying
/// on the exact type stored to select appropriate atomic operations.
///
/// \returns true if the store was successfully combined away. This indicates
/// the caller must erase the store instruction. We have to let the caller erase
/// the store instruction as otherwise there is no way to signal whether it was
/// combined or not: IC.EraseInstFromFunction returns a null pointer.
static bool combineStoreToValueType(InstCombiner &IC, StoreInst &SI) {
// FIXME: We could probably with some care handle both volatile and atomic
// stores here but it isn't clear that this is important.
if (!SI.isSimple())
return false;

Value *Ptr = SI.getPointerOperand();
Value *V = SI.getValueOperand();
unsigned AS = SI.getPointerAddressSpace();
SmallVector<std::pair<unsigned, MDNode *>, 8> MD;		SmallVector<std::pair<unsigned, MDNode *>, 8> MD;
SI.getAllMetadata(MD);		OldSI.getAllMetadata(MD);

// Fold away bit casts of the stored value by storing the original type.		StoreInst *NewSI = IC.Builder->CreateAlignedStore(
if (auto *BC = dyn_cast<BitCastInst>(V)) {
V = BC->getOperand(0);
StoreInst *NewStore = IC.Builder->CreateAlignedStore(
V, IC.Builder->CreateBitCast(Ptr, V->getType()->getPointerTo(AS)),		V, IC.Builder->CreateBitCast(Ptr, V->getType()->getPointerTo(AS)),
SI.getAlignment());		OldSI.getAlignment());
for (const auto &MDPair : MD) {		for (const auto &MDPair : MD) {
unsigned ID = MDPair.first;		unsigned ID = MDPair.first;
MDNode *N = MDPair.second;		MDNode *N = MDPair.second;
// Note, essentially every kind of metadata should be preserved here! This		// Note, essentially every kind of metadata should be preserved here! This
// routine is supposed to clone a store instruction changing *only its		// routine is supposed to clone a store instruction changing *only its
// type*. The only metadata it makes sense to drop is metadata which is		// type*. The only metadata it makes sense to drop is metadata which is
// invalidated when the pointer type changes. This should essentially		// invalidated when the pointer type changes. This should essentially
// never be the case in LLVM, but we explicitly switch over only known		// never be the case in LLVM, but we explicitly switch over only known
// metadata to be conservatively correct. If you are adding metadata to		// metadata to be conservatively correct. If you are adding metadata to
// LLVM which pertains to stores, you almost certainly want to add it		// LLVM which pertains to stores, you almost certainly want to add it
// here.		// here.
switch (ID) {		switch (ID) {
case LLVMContext::MD_dbg:		case LLVMContext::MD_dbg:
case LLVMContext::MD_tbaa:		case LLVMContext::MD_tbaa:
case LLVMContext::MD_prof:		case LLVMContext::MD_prof:
case LLVMContext::MD_fpmath:		case LLVMContext::MD_fpmath:
case LLVMContext::MD_tbaa_struct:		case LLVMContext::MD_tbaa_struct:
case LLVMContext::MD_alias_scope:		case LLVMContext::MD_alias_scope:
case LLVMContext::MD_noalias:		case LLVMContext::MD_noalias:
case LLVMContext::MD_nontemporal:		case LLVMContext::MD_nontemporal:
case LLVMContext::MD_mem_parallel_loop_access:		case LLVMContext::MD_mem_parallel_loop_access:
case LLVMContext::MD_nonnull:		case LLVMContext::MD_nonnull:
// All of these directly apply.		// All of these directly apply.
NewStore->setMetadata(ID, N);		NewSI->setMetadata(ID, N);
break;		break;

case LLVMContext::MD_invariant_load:		case LLVMContext::MD_invariant_load:
case LLVMContext::MD_range:		case LLVMContext::MD_range:
break;		break;
}		}
}		}
		return NewSI;
		}

		/// \brief Combine integer stores to vector stores when the integer's bits are
		/// just a concatenation of non-integer (and non-vector) types.
		///
		/// This specifically matches the pattern of taking a sequence of non-integer
		/// types, casting them to integers, extending, shifting, and or-ing them
		/// together to make a concatenation, and then storing the result. This shows up
		/// because large integers are sometimes used to represent a "generic" load or
		/// store, and only later optimization may uncover that there is a more natural
		/// type to represent the store with.
		///
		/// \returns true if the store was successfully combined away. This indicates
		/// the caller must erase the store instruction. We have to let the caller erase
		/// the store instruction as otherwise there is no way to signal whether it was
		/// combined or not: IC.EraseInstFromFunction returns a null pointer.
		static bool combineIntegerStoreToVectorStore(InstCombiner &IC, StoreInst &SI) {
		// FIXME: This is probably a reasonable transform to make for atomic stores.
		assert(SI.isSimple() && "Do not call for non-simple stores!");

		Instruction *OrigV = dyn_cast<Instruction>(SI.getValueOperand());
		if (!OrigV)
		return false;

		// We only handle values which are used entirely to store to memory. If the
		// value is used directly as an SSA value, then even if there are matching
		// element insertion and element extraction, we rely on basic integer
		// combining to forward the bits and delete the intermediate math. Here we
		// just need to clean up the places where it actually reaches memory.
		SmallVector<StoreInst *, 2> Stores;
		for (User *U : OrigV->users())
		if (auto *SIU = dyn_cast<StoreInst>(U))
		Stores.push_back(SIU);
		else
		return false;

		const DataLayout &DL = *IC.getDataLayout();
		unsigned BaseBits = OrigV->getType()->getIntegerBitWidth();
		Type *ElementTy = nullptr;
		int ElementSize;

		// We need to match some number of element insertions into an integer. Each
		// insertion takes the form of an element value (and type), index (multiple of
		// the bitwidth of the type) of insertion, and the base it was inserted into.
		struct InsertedElement {
		Value *Base;
		Value *Element;
		int Index;
		};
		auto MatchInsertedElement = [&](Value *V) -> Optional<InsertedElement> {
		// Handle a null input to make it easy to loop over bases.
		if (!V)
		return Optional<InsertedElement>();

		assert(!V->getType()->isVectorTy() && "Must not be a vector.");
		assert(V->getType()->isIntegerTy() && "Must be an integer value.");

		Value *Base = nullptr, *Cast;
		ConstantInt *ShiftC = nullptr;
		auto InsertPattern = m_CombineOr(
		m_Shl(m_OneUse(m_ZExt(m_OneUse(m_Value(Cast)))), m_ConstantInt(ShiftC)),
		m_ZExt(m_OneUse(m_Value(Cast))));
		if (!match(V, m_CombineOr(m_CombineOr(m_Or(m_OneUse(m_Value(Base)),
		m_OneUse(InsertPattern)),
		m_Or(m_OneUse(InsertPattern),
		m_OneUse(m_Value(Base)))),
		InsertPattern)))
		return Optional<InsertedElement>();

		Value *Element;
		if (auto *BC = dyn_cast<BitCastInst>(Cast)) {
		// Bit casts are trivially correct here.
		Element = BC->getOperand(0);
		} else if (auto *PC = dyn_cast<PtrToIntInst>(Cast)) {
		Element = PC->getOperand(0);
		// If this changes the bit width at all, reject it.
		if (PC->getType()->getIntegerBitWidth() !=
		DL.getTypeSizeInBits(Element->getType()))
		return Optional<InsertedElement>();
		} else {
		// All other casts are rejected.
		return Optional<InsertedElement>();
		}

		// We can't handle shifts wider than the number of bits in the integer.
		unsigned ShiftBits = ShiftC ? ShiftC->getLimitedValue(BaseBits) : 0;
		if (ShiftBits == BaseBits)
		return Optional<InsertedElement>();

		// All of the elements inserted need to be the same type. Either capture the
		// first element type or check this element type against the previous
		// element types.
		if (!ElementTy) {
		ElementTy = Element->getType();
		// The base integer size and the shift need to be multiples of the element
		// size in bits.
		ElementSize = DL.getTypeSizeInBits(ElementTy);
		if (BaseBits % ElementSize || ShiftBits % ElementSize)
		return Optional<InsertedElement>();
		} else if (ElementTy != Element->getType()) {
		return Optional<InsertedElement>();
		}

		// Check the shift for every element, not only the first one: a shift that
		// is not a multiple of the element size would map to the wrong index.
		if (ShiftBits % ElementSize)
		return Optional<InsertedElement>();

		// We don't handle integers, sub-vectors, or any aggregate types. We
		// handle pointers and floating point types.
		if (!ElementTy->isSingleValueType() || ElementTy->isIntegerTy() ||
		ElementTy->isVectorTy())
		return Optional<InsertedElement>();

		int Index =
		(DL.isLittleEndian() ? ShiftBits : BaseBits - ElementSize - ShiftBits) /
		ElementSize;
		InsertedElement Result = {Base, Element, Index};
		return Result;
		};

		SmallVector<InsertedElement, 2> Elements;
		Value *V = OrigV;
		while (Optional<InsertedElement> E = MatchInsertedElement(V)) {
		V = E->Base;
		Elements.push_back(std::move(*E));
		}
		// If searching for elements found none, or didn't terminate in either an
		// undef or a direct zext, we can't form a vector.
		if (Elements.empty() || (V && !isa<UndefValue>(V)))
		return false;

		// Build a storable vector by looping over the inserted elements.
		VectorType *VTy = VectorType::get(ElementTy, BaseBits / ElementSize);
		V = UndefValue::get(VTy);
		IC.Builder->SetInsertPoint(OrigV);
		for (const auto &E : Elements)
		V = IC.Builder->CreateInsertElement(V, E.Element,
		IC.Builder->getInt32(E.Index));

		for (StoreInst *OldSI : Stores) {
		IC.Builder->SetInsertPoint(OldSI);
		combineStoreToNewValue(IC, *OldSI, V);
		if (OldSI != &SI)
		IC.EraseInstFromFunction(*OldSI);
		}
return true;		return true;
}		}

		/// \brief Combine stores to match the type of value being stored.
		///
		/// The core idea here is that the memory does not have any intrinsic type and
		/// where we can we should match the type of a store to the type of value being
		/// stored.
		///
		/// However, this routine must never change the width of a store or the number of
		/// stores as that would introduce a semantic change. This combine is expected to
		/// be a semantic no-op which just allows stores to more closely model the types
		/// of their incoming values.
		///
		/// Currently, we also refuse to change the precise type used for an atomic or
		/// volatile store. This is debatable, and might be reasonable to change later.
		/// However, it is risky in case some backend or other part of LLVM is relying
		/// on the exact type stored to select appropriate atomic operations.
		///
		/// \returns true if the store was successfully combined away. This indicates
		/// the caller must erase the store instruction. We have to let the caller erase
		/// the store instruction as otherwise there is no way to signal whether it was
		/// combined or not: IC.EraseInstFromFunction returns a null pointer.
		static bool combineStoreToValueType(InstCombiner &IC, StoreInst &SI) {
		// FIXME: We could probably with some care handle both volatile and atomic
		// stores here but it isn't clear that this is important.
		if (!SI.isSimple())
		return false;

		Value *V = SI.getValueOperand();

		// Fold away bit casts of the stored value by storing the original type.
		if (auto *BC = dyn_cast<BitCastInst>(V)) {
		combineStoreToNewValue(IC, SI, BC->getOperand(0));
		return true;
		}

		// If this is an integer store and we have data layout, look for a pattern of
		// storing a vector as an integer (modeled as a bag of bits).
		if (V->getType()->isIntegerTy() && IC.getDataLayout() &&
		combineIntegerStoreToVectorStore(IC, SI))
		return true;

// FIXME: We should also canonicalize stores of vectors when their elements are		// FIXME: We should also canonicalize stores of vectors when their elements are
// cast to other types.		// cast to other types.
return false;		return false;
}		}

/// equivalentAddressValues - Test if A and B will obviously have the same		/// equivalentAddressValues - Test if A and B will obviously have the same
/// value. This includes recognizing that %t0 and %t1 will have the same		/// value. This includes recognizing that %t0 and %t1 will have the same
/// value in code like this:		/// value in code like this:
[275 unchanged lines not shown]
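The following is a small C++ model of the legality checks performed by `combineIntegerLoadToVectorLoad` and `combineIntegerStoreToVectorStore` in the diff above. It is a hypothetical simplification, not LLVM code. Each element is reduced to its shift amount, an opaque type id, and its size in bits. `planVectorization` returns the vector width and lane indices, or `std::nullopt` when the pattern does not qualify. The model checks the shift of every element.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical, simplified model of the checks made before rewriting
// integer element extraction/insertion as vector operations.
struct ElementUse {
  unsigned shiftBits;  // lshr/shl amount; 0 for an unshifted element
  int typeId;          // stands in for the element's llvm::Type*
  unsigned typeBits;   // stands in for DL.getTypeSizeInBits(ElementTy)
};

struct VectorPlan {
  unsigned numElements;           // BaseBits / ElementSize
  std::vector<unsigned> indices;  // vector lane for each ElementUse, in order
};

inline std::optional<VectorPlan> planVectorization(
    unsigned baseBits, const std::vector<ElementUse>& uses, bool littleEndian) {
  if (uses.empty())
    return std::nullopt;  // nothing to rewrite
  const int typeId = uses.front().typeId;
  const unsigned elemBits = uses.front().typeBits;
  if (elemBits == 0 || baseBits % elemBits != 0)
    return std::nullopt;  // the integer must be a whole number of elements

  VectorPlan plan{baseBits / elemBits, {}};
  for (const ElementUse& use : uses) {
    if (use.typeId != typeId)
      return std::nullopt;  // every element must have the same type
    if (use.shiftBits >= baseBits || use.shiftBits % elemBits != 0)
      return std::nullopt;  // shift must select a whole, in-range element
    const unsigned bitOffset =
        littleEndian ? use.shiftBits : baseBits - elemBits - use.shiftBits;
    plan.indices.push_back(bitOffset / elemBits);
  }
  return plan;
}
```

For the `{ float, float }` case in the test below, two 32-bit elements at shifts 0 and 32 map to lanes 0 and 1 on a little-endian target such as the one in the test's datalayout.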

llvm/trunk/test/Transforms/InstCombine/loadstore-vector.ll

				; RUN: opt -instcombine -S < %s | FileCheck %s
				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

				; Basic test for turning element extraction from integer loads and element
				; insertion into integer stores into extraction and insertion with vectors.
				define void @test1({ float, float }* %x, float %a, float %b, { float, float }* %out) {
				; CHECK-LABEL: @test1(
				entry:
				%x.cast = bitcast { float, float }* %x to i64*
				%x.load = load i64* %x.cast, align 4
				; CHECK-NOT: load i64*
				; CHECK: %[[LOAD:.*]] = load <2 x float>

				%lo.trunc = trunc i64 %x.load to i32
				%hi.shift = lshr i64 %x.load, 32
				%hi.trunc = trunc i64 %hi.shift to i32
				%hi.cast = bitcast i32 %hi.trunc to float
				%lo.cast = bitcast i32 %lo.trunc to float
				; CHECK-NOT: trunc
				; CHECK-NOT: lshr
				; CHECK: %[[HI:.*]] = extractelement <2 x float> %[[LOAD]], i32 1
				; CHECK: %[[LO:.*]] = extractelement <2 x float> %[[LOAD]], i32 0

				%add.i.i = fadd float %lo.cast, %a
				%add5.i.i = fadd float %hi.cast, %b
				; CHECK: %[[LO_SUM:.*]] = fadd float %[[LO]], %a
				; CHECK: %[[HI_SUM:.*]] = fadd float %[[HI]], %b

				%add.lo.cast = bitcast float %add.i.i to i32
				%add.hi.cast = bitcast float %add5.i.i to i32
				%add.hi.ext = zext i32 %add.hi.cast to i64
				%add.hi.shift = shl nuw i64 %add.hi.ext, 32
				%add.lo.ext = zext i32 %add.lo.cast to i64
				%add.lo.or = or i64 %add.hi.shift, %add.lo.ext
				; CHECK-NOT: zext i32
				; CHECK-NOT: shl {{.*}} i64
				; CHECK-NOT: or i64
				; CHECK: %[[INSERT1:.*]] = insertelement <2 x float> undef, float %[[LO_SUM]], i32 0
				; CHECK: %[[INSERT2:.*]] = insertelement <2 x float> %[[INSERT1]], float %[[HI_SUM]], i32 1

				%out.cast = bitcast { float, float }* %out to i64*
				store i64 %add.lo.or, i64* %out.cast, align 4
				; CHECK-NOT: store i64
				; CHECK: store <2 x float> %[[INSERT2]]

				ret void
				}

				define void @test2({ float, float }* %x, float %a, float %b, { float, float }* %out1, { float, float }* %out2) {
				; CHECK-LABEL: @test2(
				entry:
				%x.cast = bitcast { float, float }* %x to i64*
				%x.load = load i64* %x.cast, align 4
				; CHECK-NOT: load i64*
				; CHECK: %[[LOAD:.*]] = load <2 x float>

				%lo.trunc = trunc i64 %x.load to i32
				%hi.shift = lshr i64 %x.load, 32
				%hi.trunc = trunc i64 %hi.shift to i32
				%hi.cast = bitcast i32 %hi.trunc to float
				%lo.cast = bitcast i32 %lo.trunc to float
				; CHECK-NOT: trunc
				; CHECK-NOT: lshr
				; CHECK: %[[HI:.*]] = extractelement <2 x float> %[[LOAD]], i32 1
				; CHECK: %[[LO:.*]] = extractelement <2 x float> %[[LOAD]], i32 0

				%add.i.i = fadd float %lo.cast, %a
				%add5.i.i = fadd float %hi.cast, %b
				; CHECK: %[[LO_SUM:.*]] = fadd float %[[LO]], %a
				; CHECK: %[[HI_SUM:.*]] = fadd float %[[HI]], %b

				%add.lo.cast = bitcast float %add.i.i to i32
				%add.hi.cast = bitcast float %add5.i.i to i32
				%add.hi.ext = zext i32 %add.hi.cast to i64
				%add.hi.shift = shl nuw i64 %add.hi.ext, 32
				%add.lo.ext = zext i32 %add.lo.cast to i64
				%add.lo.or = or i64 %add.hi.shift, %add.lo.ext
				; CHECK-NOT: zext i32
				; CHECK-NOT: shl {{.*}} i64
				; CHECK-NOT: or i64
				; CHECK: %[[INSERT1:.*]] = insertelement <2 x float> undef, float %[[LO_SUM]], i32 0
				; CHECK: %[[INSERT2:.*]] = insertelement <2 x float> %[[INSERT1]], float %[[HI_SUM]], i32 1

				%out1.cast = bitcast { float, float }* %out1 to i64*
				store i64 %add.lo.or, i64* %out1.cast, align 4
				%out2.cast = bitcast { float, float }* %out2 to i64*
				store i64 %add.lo.or, i64* %out2.cast, align 4
				; CHECK-NOT: store i64
				; CHECK: store <2 x float> %[[INSERT2]]
				; CHECK-NOT: store i64
				; CHECK: store <2 x float> %[[INSERT2]]

				ret void
				}

				; We handle some cases where repeated insertion and extraction have been
				; partially, but not completely, CSE'd. The store side is not caught yet
				; because matching it reliably would require extreme heroics.
				define void @test3({ float, float, float }* %x, float %a, float %b, { float, float, float }* %out1, { float, float, float }* %out2) {
				; CHECK-LABEL: @test3(
				entry:
				%x.cast = bitcast { float, float, float }* %x to i96*
				%x.load = load i96* %x.cast, align 4
				; CHECK-NOT: load i96*
				; CHECK: %[[LOAD:.*]] = load <3 x float>*

				%lo.trunc = trunc i96 %x.load to i32
				%lo.cast = bitcast i32 %lo.trunc to float
				%mid.shift = lshr i96 %x.load, 32
				%mid.trunc = trunc i96 %mid.shift to i32
				%mid.cast = bitcast i32 %mid.trunc to float
				%mid.trunc2 = trunc i96 %mid.shift to i32
				%mid.cast2 = bitcast i32 %mid.trunc2 to float
				%hi.shift = lshr i96 %mid.shift, 32
				%hi.trunc = trunc i96 %hi.shift to i32
				%hi.cast = bitcast i32 %hi.trunc to float
				; CHECK-NOT: trunc
				; CHECK-NOT: lshr
				; CHECK: %[[LO:.*]] = extractelement <3 x float> %[[LOAD]], i32 0
				; CHECK: %[[MID1:.*]] = extractelement <3 x float> %[[LOAD]], i32 1
				; CHECK: %[[MID2:.*]] = extractelement <3 x float> %[[LOAD]], i32 1
				; CHECK: %[[HI:.*]] = extractelement <3 x float> %[[LOAD]], i32 2

				%add.lo = fadd float %lo.cast, %a
				%add.mid = fadd float %mid.cast, %b
				%add.hi = fadd float %hi.cast, %mid.cast2
				; CHECK: %[[LO_SUM:.*]] = fadd float %[[LO]], %a
				; CHECK: %[[MID_SUM:.*]] = fadd float %[[MID1]], %b
				; CHECK: %[[HI_SUM:.*]] = fadd float %[[HI]], %[[MID2]]

				%add.lo.cast = bitcast float %add.lo to i32
				%add.mid.cast = bitcast float %add.mid to i32
				%add.hi.cast = bitcast float %add.hi to i32
				%result.hi.ext = zext i32 %add.hi.cast to i96
				%result.hi.shift = shl nuw i96 %result.hi.ext, 32
				%result.mid.ext = zext i32 %add.mid.cast to i96
				%result.mid.or = or i96 %result.hi.shift, %result.mid.ext
				%result.mid.shift = shl nuw i96 %result.mid.or, 32
				%result.lo.ext = zext i32 %add.lo.cast to i96
				%result.lo.or = or i96 %result.mid.shift, %result.lo.ext
				; FIXME-NOT: zext i32
				; FIXME-NOT: shl {{.*}} i96
				; FIXME-NOT: or i96
				; FIXME: %[[INSERT1:.*]] = insertelement <3 x float> undef, float %[[HI_SUM]], i32 2
				; FIXME: %[[INSERT2:.*]] = insertelement <3 x float> %[[INSERT1]], float %[[MID_SUM]], i32 1
				; FIXME: %[[INSERT3:.*]] = insertelement <3 x float> %[[INSERT2]], float %[[LO_SUM]], i32 0

				%out1.cast = bitcast { float, float, float }* %out1 to i96*
				store i96 %result.lo.or, i96* %out1.cast, align 4
				; FIXME-NOT: store i96
				; FIXME: store <3 x float> %[[INSERT3]]

				%result2.lo.ext = zext i32 %add.lo.cast to i96
				%result2.lo.or = or i96 %result.mid.shift, %result2.lo.ext
				; FIXME-NOT: zext i32
				; FIXME-NOT: shl {{.*}} i96
				; FIXME-NOT: or i96
				; FIXME: %[[INSERT4:.*]] = insertelement <3 x float> %[[INSERT2]], float %[[LO_SUM]], i32 0

				%out2.cast = bitcast { float, float, float }* %out2 to i96*
				store i96 %result2.lo.or, i96* %out2.cast, align 4
				; FIXME-NOT: store i96
				; FIXME: store <3 x float> %[[INSERT4]]

				ret void
				}

				; Basic test that pointers work correctly as the element type.
				define void @test4({ i8*, i8* }* %x, i64 %a, i64 %b, { i8*, i8* }* %out) {
				; CHECK-LABEL: @test4(
				entry:
				%x.cast = bitcast { i8*, i8* }* %x to i128*
				%x.load = load i128* %x.cast, align 4
				; CHECK-NOT: load i128*
				; CHECK: %[[LOAD:.*]] = load <2 x i8*>* {{.*}}, align 4

				%lo.trunc = trunc i128 %x.load to i64
				%hi.shift = lshr i128 %x.load, 64
				%hi.trunc = trunc i128 %hi.shift to i64
				%hi.cast = inttoptr i64 %hi.trunc to i8*
				%lo.cast = inttoptr i64 %lo.trunc to i8*
				; CHECK-NOT: trunc
				; CHECK-NOT: lshr
				; CHECK: %[[HI:.*]] = extractelement <2 x i8*> %[[LOAD]], i32 1
				; CHECK: %[[LO:.*]] = extractelement <2 x i8*> %[[LOAD]], i32 0

				%gep.lo = getelementptr i8* %lo.cast, i64 %a
				%gep.hi = getelementptr i8* %hi.cast, i64 %b
				; CHECK: %[[LO_GEP:.*]] = getelementptr i8* %[[LO]], i64 %a
				; CHECK: %[[HI_GEP:.*]] = getelementptr i8* %[[HI]], i64 %b

				%gep.lo.cast = ptrtoint i8* %gep.lo to i64
				%gep.hi.cast = ptrtoint i8* %gep.hi to i64
				%gep.hi.ext = zext i64 %gep.hi.cast to i128
				%gep.hi.shift = shl nuw i128 %gep.hi.ext, 64
				%gep.lo.ext = zext i64 %gep.lo.cast to i128
				%gep.lo.or = or i128 %gep.hi.shift, %gep.lo.ext
				; CHECK-NOT: zext i64
				; CHECK-NOT: shl {{.*}} i128
				; CHECK-NOT: or i128
				; CHECK: %[[INSERT1:.*]] = insertelement <2 x i8*> undef, i8* %[[LO_GEP]], i32 0
				; CHECK: %[[INSERT2:.*]] = insertelement <2 x i8*> %[[INSERT1]], i8* %[[HI_GEP]], i32 1

				%out.cast = bitcast { i8*, i8* }* %out to i128*
				store i128 %gep.lo.or, i128* %out.cast, align 4
				; CHECK-NOT: store i128
				; CHECK: store <2 x i8*> %[[INSERT2]], <2 x i8*>* {{.*}}, align 4

				ret void
				}