This is an archive of the discontinued LLVM Phabricator instance.

[CodeGenPrepare] Move extractelement close to store if they can be combined.
ClosedPublic

Authored by qcolombet on Oct 22 2014, 3:55 PM.

Details

Summary

Hi,

This patch adds an optimization in CodeGenPrepare to move an extractelement right before a store when the target can combine them.
The optimization may promote scalar operations to vector operations along the way to make that possible.

Thanks for your feedback.

  • Context

Some targets use different register files for both vector and scalar operations. This means that transitioning from one domain to another may incur copy from one register file to another. These copies are not coalescable and may be expensive.
For example, according to the scheduling model, on cortex-A8 a vector to GPR move is 20 cycles.

  • Motivating Example

Let us consider an example:

define void @foo(<2 x i32>* %addr1, i32* %dest) {
  %in1 = load <2 x i32>* %addr1, align 8
  %extract = extractelement <2 x i32> %in1, i32 1
  %out = or i32 %extract, 1
  store i32 %out, i32* %dest, align 4
  ret void
}

As it is, this IR generates the following assembly on armv7:
vldr d16, [r0] @ vector load
vmov.32 r0, d16[1] @ cross-register-file copy: 20 cycles
orr r0, r0, #1 @ scalar bitwise or
str r0, [r1] @ scalar store
bx lr

Whereas we could generate much faster code:
vldr d16, [r0] @ vector load
vorr.i32 d16, #0x1 @ vector bitwise or
vst1.32 {d16[1]}, [r1:32] @ vector extract + store
bx lr

Half of the computation done in the vector domain is useless, but this allows us to get rid of the expensive cross-register-file copy.
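For reference, the promoted IR the pass aims for might look like the following (a hand-written sketch, not the pass's verbatim output; the filler lane of the constant uses a splat here, though later revisions of the patch use undef in unused lanes):

```llvm
define void @foo(<2 x i32>* %addr1, i32* %dest) {
  %in1 = load <2 x i32>* %addr1, align 8
  ; The scalar 'or' is promoted to a vector 'or'; lane 0 is a don't-care.
  %out = or <2 x i32> %in1, <i32 1, i32 1>
  ; The extractelement now feeds the store directly, so the target can
  ; combine the pair into a single lane store (vst1.32 on ARM).
  %extract = extractelement <2 x i32> %out, i32 1
  store i32 %extract, i32* %dest, align 4
  ret void
}
```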

  • Proposed Solution

To avoid this cross-register-file copy penalty, we promote the scalar operations to vector operations. The penalty is removed if we manage to promote the whole chain of computation into the vector domain.
Currently, we do that only when the chain of computation ends with a store and the target is able to combine an extract with a store.

Stores are the most likely candidates, because other instructions produce values that would need to be promoted and thus extracted at some point [1]. Moreover, it is common for targets to feature stores that perform a vector extract (see AArch64 and X86 for instance).

The proposed implementation relies on the TargetTransformInfo to decide whether or not it is beneficial to promote a chain of computation in the vector domain. Unfortunately, this interface is rather inaccurate for this level of details and although this optimization may be beneficial for X86 and AArch64, the inaccuracy will lead to the optimization being too aggressive.
Basically, in TargetTransformInfo everything that is legal has a cost of 1, whereas even if a vector type is legal, a vector operation is usually slightly more expensive than its scalar counterpart. That will lead to too many promotions that may not be counterbalanced by the saving of the cross-register-file copy. For instance, on AArch64 this penalty is just 4 cycles.

For now, the optimization is enabled only for ARM prior to v8, since those processors have a larger penalty on cross-register-file copies, and the scope is limited to basic blocks. These two factors limit the effects of the inaccuracy. Indeed, I did not want to build a fancy cost model with block frequency and everything on top of that.

[1] We can imagine targets that can combine an extractelement with instructions other than stores. If we want to go in that direction, the current interfaces must be augmented and, moreover, I think this becomes a global isel problem.

Thanks,
-Quentin

Diff Detail

Event Timeline

qcolombet updated this revision to Diff 15279.Oct 22 2014, 3:55 PM
qcolombet retitled this revision from to [CodeGenPrepare] Move extractelement close to store if they can be combined..
qcolombet updated this object.
qcolombet edited the test plan for this revision. (Show Details)
qcolombet added a reviewer: hfinkel.
qcolombet set the repository for this revision to rL LLVM.
hfinkel added inline comments.Oct 23 2014, 2:11 PM
include/llvm/Target/TargetLowering.h
268

I would really like this to be a callback that passes the type and the extraction index. For PowerPC, for example, I can make this free for some types, but only for index 0. For other indices, I can do it with a shuffle (which obviously has some extra cost, although not as much as the load/store-based domain crossing).

IMHO, a great interface for this could be:
int getStoreExtractCombineCost(Type *VectorTy, Value *Idx);

lib/CodeGen/CodeGenPrepare.cpp
3268

You should add: because the target has asserted that it can combine the extract with the store (for free).

3318

What is the complication to handling this case?

If you really can't test this on ARM, please explicitly note this as FIXME.

3328

You also need to exclude division (because introducing a division by undef is a bad idea), or make sure that if you do this for division, then the other fields are set to some non-zero quantity.
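To make the division concern concrete: if the value being promoted ends up on the right-hand side of a division, an undef filler lane may be read as zero, making the whole vector division undefined (hypothetical sketch; lane 1 carries the real computation, as in the motivating example):

```llvm
; Unsafe: the filler lane of the divisor is undef, which may be read as 0.
%bad = udiv <2 x i32> %num, <i32 undef, i32 7>
; Safe: fill the unused lane with a non-zero constant such as 1.
%ok = udiv <2 x i32> %num, <i32 1, i32 7>
```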

hfinkel added inline comments.Oct 23 2014, 2:13 PM
lib/CodeGen/CodeGenPrepare.cpp
3391

Not withstanding my previous comment on division, we could really put more undefs in this vector instead of always using a splat.

Hi Hal,

Thanks for the feedback, in particular for catching the corner case with division.

I’ll update the patch shortly; in the meantime, a couple of inline comments.

Cheers,
-Quentin

include/llvm/Target/TargetLowering.h
268

Although I like the direction of this, how do we get the cost of the original sequence?
In other words, what value do you propose we use to decide whether or not it is worth kicking this optimization given those two parameters?

Indeed, I expect that we will replace the condition that guards the optimization with something like:
TLI->getStoreExtractCombineCost(ExtractInst->getType(), ExtractInst->getOperand(1)) < CostOfIndividualStoreAndExtract

I assume you were thinking "== 0" since you talked about free. Is that what you had in mind?

lib/CodeGen/CodeGenPrepare.cpp
3268

Agreed.

3318

None, I just have not seen any motivating example.
I’ll see if I can come up with something, otherwise I will put a FIXME.

3328

Good point, the problem may arise if the non-constant vector is used on the right hand side.

3391

Although it would give more freedom to the lowering to use undef, I was afraid the backends would be able to simplify the sequence to a scalar operation. A quick test proved me wrong.
So I will update with more undefs (unless we are on the right-hand side of a division :)).
Are you seeing other cases where we shouldn’t use undef?

hfinkel added inline comments.Oct 23 2014, 10:38 PM
include/llvm/Target/TargetLowering.h
268

Thinking about this, the interface I proposed will be difficult to use (because it seems like you'd need to cost-model the entire transformation in order to figure anything out). How about:

bool canCombineStoreAndExtract(Type *VectorTy, Value *Idx, unsigned &Cost)

where, if true is returned and we move forward with modeling the transformation, we add Cost to VectorCost in isProfitableToPromote. This seems better.

qcolombet added inline comments.Oct 24 2014, 10:34 AM
include/llvm/Target/TargetLowering.h
268

Agreed.
Thanks!

qcolombet updated this revision to Diff 15446.Oct 24 2014, 3:41 PM
  • Do not introduce undefined behavior when the variable to be promoted is on the rhs of a division.
  • Support ConstantFP value for promotion.
  • Change the target hook to something more precise that takes the vector type and the index of the extract.
  • Use as many undefs as possible when materializing the vector constant.
  • Add a stress mode to cover all the added features even when they are not relevant for ARM.
  • Update the test cases: ConstantFP, division (def and undef), splat constant, and as many undefs as possible.

Hi Hal,

I think I have incorporated all your feedback with this update.
However, I would like your opinion on the target hook. In particular, I realized that we are missing a correlation with alignment.
Indeed, we can imagine targets that can perform the store(extract) combine only for certain alignments (which is not the case for ARM; it supports unaligned memory accesses for vectors).
I thought about adding another out parameter for the alignment, but some targets may have different costs for different alignments. Therefore, that wouldn't be sufficient.

I think we could either:

  1. Have an out parameter (or another getter) that is a map of alignment => cost (sounds expensive though).
  2. Have a second target hook that gives the actual cost when we know the alignment.

Assuming we go for #2, the general approach would be:

  • Check that the target supports the store(extract) combine for the given vector type and index (no cost involved).
  • Find an opportunity.
  • Check for the cost of this combine.
  • Promote or not according to the profitability.

What do you think?

Thanks,
-Quentin

hfinkel edited edge metadata.Oct 26 2014, 1:21 AM
In D5921#13, @qcolombet wrote:

Hi Hal,

I think I have incorporated all your feedback with this update.
However, I would like your opinion on the target hook. In particular, I realized that we are missing a correlation with alignment.
Indeed, we can imagine targets that can perform the store(extract) combine only for certain alignments (which is not the case for ARM; it supports unaligned memory accesses for vectors).
I thought about adding another out parameter for the alignment, but some targets may have different costs for different alignments. Therefore, that wouldn't be sufficient.

I think we could either:

  1. Have an out parameter (or another getter) that is a map of alignment => cost (sounds expensive though).
  2. Have a second target hook that gives the actual cost when we know the alignment.

Assuming we go for #2, the general approach would be:

  • Check that the target supports the store(extract) combine for the given vector type and index (no cost involved).
  • Find an opportunity.
  • Check for the cost of this combine.
  • Promote or not according to the profitability.

What do you think?

I think that, for the time being, you should call TLI's existing allowsMisalignedMemoryAccesses hook, and only perform the transformation if the function returns true and also Fast == true. Otherwise, abort. If someone needs more control than this, then we'll have a concrete use case and can design around it.


hfinkel added inline comments.Oct 26 2014, 1:32 AM
lib/CodeGen/CodeGenPrepare.cpp
3362

Also urem, srem, frem

For fdiv and frem, allow the transformation if the instruction has the nnan fast-math flag.

3446

Don't repeat this logic; extract it into a function.

qcolombet updated this revision to Diff 15496.Oct 27 2014, 11:37 AM
qcolombet edited edge metadata.
  • Refactor the logic to test whether or not we may introduce undefined behavior.
  • Add support for frem, urem, and srem.
  • Add support for fast-math.
  • Update tests.
qcolombet updated this revision to Diff 15497.Oct 27 2014, 11:46 AM
  • Check whether or not the store to be combined is properly aligned.

Hi Hal,

Thanks for the new inputs.
I’ve addressed the <u|s|f>rem point and added support for fast-math.
However, I haven’t used the Fast information together with the store alignment. Indeed, I am not sure it gets us anything. It tells us whether or not a store is slow, but since we are not changing the type of the store, it does not tell us whether we are slowing down the code or whether the combine is still supported.
We could assume "slow store" == "no combine", but I would prefer, as you said, to wait for actual use cases before doing heuristics in this domain.

Thoughts?

Thanks again,
-Quentin

However, I haven’t used the Fast information together with the store alignment. Indeed, I am not sure it gets us anything. It tells us whether or not a store is slow, but since we are not changing the type of the store, it does not tell us whether we are slowing down the code or whether the combine is still supported.
We could assume "slow store" == "no combine", but I would prefer, as you said, to wait for actual use cases before doing heuristics in this domain.

Thoughts?

I agree. Thinking about it, checking Fast is actually not really right (Fast == false likely indicates that the unaligned case is handled by some OS trap handler, for example, which really has nothing to do with instruction selection, and hurts the case where the data is actually aligned).

lib/CodeGen/CodeGenPrepare.cpp
3280

Why are you also checking against the ABI alignment?

3367

This should be !Use->hasNoNaNs(); (I'd like to keep us from over-generalizing these kinds of checks and just using 'unsafe algebra' for everything -- we can be more specific here).

qcolombet updated this revision to Diff 15597.Oct 30 2014, 6:00 PM
  • Remove a copy-paste-o.
  • Use a more specific fast-math flag.

Hi Hal,

Here is the updated version.

Thanks again for the careful review!

Ok to commit?

Cheers,
-Quentin

lib/CodeGen/CodeGenPrepare.cpp
3280

Copy-paste-o.

3367

Makes sense, thanks.

hfinkel accepted this revision.Oct 30 2014, 10:02 PM
hfinkel edited edge metadata.

Yes, LGTM. Thanks!

This revision is now accepted and ready to land.Oct 30 2014, 10:02 PM
qcolombet closed this revision.Oct 31 2014, 11:54 AM

Thanks Hal.

Committed r220978.