This is an archive of the discontinued LLVM Phabricator instance.

[CGP] Add support for sinking operands to their users, if they are free.
ClosedPublic

Authored by fhahn on Jan 29 2019, 3:27 AM.

Details

Reviewers

SjoerdMeijer
t.p.northover
samparker
efriedma
RKSimon
spatel

Commits

rG3b251963c303: [CGP] Add support for sinking operands to their users, if they are free.
rL353152: [CGP] Add support for sinking operands to their users, if they are free.

Summary

This patch improves code generation for some AArch64 ACLE intrinsics. It adds
support to CGP to duplicate and sink operands to their user, if they can be
folded into a target instruction, like zexts and sub into usubl. It adds a
TargetLowering hook shouldSinkOperands, which looks at the operands of
instructions to see if sinking is profitable.

I decided to add a new target hook, as for the sinking to be profitable,
at least on AArch64, we have to look at multiple operands of an
instruction, instead of looking at the users of a zext for example.

The sinking is done in CGP, because it works around an instruction
selection limitation. If instruction selection is not limited to a
single basic block, this patch should not be needed any longer.

Alternatively this could be done in the LoopSink pass, which tries to
undo LICM for instructions in blocks that are not executed frequently.

Note that we do not force the operands to sink to have a single user,
because we duplicate them before sinking. Therefore this is only
desirable if they really can be done for free. Additionally we could
consider the impact on live ranges later on.
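
As a rough illustration of the duplicate-and-sink idea described above, here is a minimal standalone sketch. It does not use LLVM; `ToyInst`, `ToyFunction`, `toyShouldSinkOperands` and `toySinkFreeOperands` are hypothetical names invented for this example, and the "free" test (an add/sub whose two operands are both extends) is only an assumed stand-in for what a target hook might decide.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins for IR concepts. None of these are LLVM classes.
struct ToyInst {
  std::string Opcode;
  int Block = 0;                   // id of the containing basic block
  std::vector<ToyInst *> Operands; // operands defined by other instructions
  int NumUsers = 0;
  bool Erased = false;
};

struct ToyFunction {
  std::vector<std::unique_ptr<ToyInst>> Insts;

  ToyInst *create(const std::string &Opcode, int Block,
                  const std::vector<ToyInst *> &Ops = {}) {
    auto I = std::make_unique<ToyInst>();
    I->Opcode = Opcode;
    I->Block = Block;
    I->Operands = Ops;
    for (ToyInst *Op : Ops)
      ++Op->NumUsers;
    Insts.push_back(std::move(I));
    return Insts.back().get();
  }
};

// Hypothetical stand-in for the target hook: an add/sub whose operands are
// both extends is treated as foldable into one widening instruction.
static bool toyShouldSinkOperands(const ToyInst &User) {
  if (User.Opcode != "add" && User.Opcode != "sub")
    return false;
  if (User.Operands.size() != 2)
    return false;
  for (const ToyInst *Op : User.Operands)
    if (Op->Opcode != "zext" && Op->Opcode != "sext")
      return false;
  return true;
}

// Duplicate every operand that lives in another block into User's block and
// mark the original as erased once nothing uses it. Returns the clone count.
static int toySinkFreeOperands(ToyFunction &F, ToyInst *User) {
  if (!toyShouldSinkOperands(*User))
    return 0;
  int NumSunk = 0;
  for (ToyInst *&Op : User->Operands) {
    if (Op->Block == User->Block)
      continue;
    ToyInst *Orig = Op;
    ToyInst *Clone = F.create(Orig->Opcode, User->Block, Orig->Operands);
    Op = Clone;
    ++Clone->NumUsers;
    if (--Orig->NumUsers == 0) {
      Orig->Erased = true;
      for (ToyInst *OrigOp : Orig->Operands)
        --OrigOp->NumUsers;
    }
    ++NumSunk;
  }
  return NumSunk;
}
```

Because the operand is cloned rather than moved, an extend shared by users in two different blocks ends up next to each user, and the original only disappears after its last user has been rewritten.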

This should fix https://bugs.llvm.org/show_bug.cgi?id=40025.

As for performance, we have internal code that uses intrinsics and can
be sped up by 10% by this change.

I would appreciate any feedback, especially related to where to best put
the target hook.

Diff Detail

Repository: rL LLVM

Event Timeline

fhahn created this revision. · Jan 29 2019, 3:27 AM

Herald added subscribers: hiraditya, kristof.beyls, javed.absar. · Jan 29 2019, 3:27 AM

fhahn mentioned this in D56668: [WIP][CodeGenPrepare] Duplicate and sink shuffles and extends if they can be done for free. · Jan 29 2019, 3:28 AM

Harbormaster completed remote builds in B27433: Diff 184056. · Jan 29 2019, 3:29 AM

Adding Simon and Sanjay in case this might be useful for X86 as well.

Hi Florian,

I've had a case where I have wanted to do this, so it looks useful to me! CGP seems like a reasonable place to perform this too and I think TargetLowering is an aptly named place to include the hook.

cheers,

llvm/lib/CodeGen/CodeGenPrepare.cpp
5984 ↗	(On Diff #184056)	Is this order of ops enforced?
5991 ↗	(On Diff #184056)	This doesn't need to be larger than OpsToSink; the same goes for MaybeDead.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8292 ↗	(On Diff #184056)	I'm assuming cast is all you need, otherwise you're missing a nullptr check.

We already have various sinking transforms in CGP, so this is the right approach IMO. I do wonder if we could adapt the existing (cmp/and-mask/etc) sinking that we already have in CGP to use this hook as a refinement, but that can be a follow-up of this patch.

I'm not sure yet if/how we'd use this with x86 shuffles, but there's always hope. :)
You might be interested in looking at some recent DAGCombiner and x86 patches that are motivated by narrowing the width of vector ops and shuffles:
D55126
D55866
D56604
D56875
D57156
D57336

Address Sam's comments. Thanks!

Harbormaster completed remote builds in B27501: Diff 184348. · Jan 30 2019, 12:02 PM

fhahn marked 2 inline comments as done. · Jan 30 2019, 12:06 PM

fhahn added inline comments.

llvm/lib/CodeGen/CodeGenPrepare.cpp
5984 ↗	(On Diff #184056)	Do you mean enforced as in by an assertion? Not at the moment. I think we could check it with OrderedInstructions, if you think it would be beneficial.

LGTM with one comment.

llvm/lib/CodeGen/CodeGenPrepare.cpp
5984 ↗	(On Diff #184056)	It's okay, I hadn't noticed that you've added this requirement in the documentation.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8292 ↗	(On Diff #184056)	the same for below.

This revision is now accepted and ready to land. · Jan 31 2019, 6:25 AM

Thanks Sam! I plan to commit this early next week, unless there are additional comments.

In D57377#1377221, @spatel wrote:

We already have various sinking transforms in CGP, so this is the right approach IMO. I do wonder if we could adapt the existing (cmp/and-mask/etc) sinking that we already have in CGP to use this hook as a refinement, but that can be a follow-up of this patch.

Thanks, I'll take a look at that as a follow-up.

I'm not sure yet if/how we'd use this with x86 shuffles, but there's always hope. :)

It is not really related to shuffles, but rather to any operations that could be done for free with a target instruction.

You might be interested in looking at some recent DAGCombiner and x86 patches that are motivated by narrowing the width of vector ops and shuffles:
D55126
D55866
D56604
D56875
D57156
D57336

Great, those look quite useful!

Closed by commit rL353152: [CGP] Add support for sinking operands to their users, if they are free. (authored by fhahn). · Feb 5 2019, 2:27 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. · Feb 5 2019, 2:27 AM

Revision Contents

Path	Size
llvm/trunk/include/llvm/CodeGen/TargetLowering.h	10 lines
llvm/trunk/lib/CodeGen/CodeGenPrepare.cpp	48 lines
llvm/trunk/lib/Target/AArch64/AArch64ISelLowering.h	3 lines
llvm/trunk/lib/Target/AArch64/AArch64ISelLowering.cpp	107 lines
llvm/trunk/test/Transforms/CodeGenPrepare/AArch64/sink-free-instructions.ll	236 lines

Diff 185260

llvm/trunk/include/llvm/CodeGen/TargetLowering.h

Show First 20 Lines • Show All 2,274 Lines • ▼ Show 20 Lines	public:
}		}

/// Return true if sign-extension from FromTy to ToTy is cheaper than		/// Return true if sign-extension from FromTy to ToTy is cheaper than
/// zero-extension.		/// zero-extension.
virtual bool isSExtCheaperThanZExt(EVT FromTy, EVT ToTy) const {		virtual bool isSExtCheaperThanZExt(EVT FromTy, EVT ToTy) const {
return false;		return false;
}		}

		/// Return true if sinking I's operands to the same basic block as I is
		/// profitable, e.g. because the operands can be folded into a target
		/// instruction during instruction selection. After calling the function
		/// \p Ops contains the Uses to sink ordered by dominance (dominating users
		/// come first).
		virtual bool shouldSinkOperands(Instruction *I,
		SmallVectorImpl<Use *> &Ops) const {
		return false;
		}

/// Return true if the target supplies and combines to a paired load		/// Return true if the target supplies and combines to a paired load
/// two loaded values of type LoadedType next to each other in memory.		/// two loaded values of type LoadedType next to each other in memory.
/// RequiredAlignment gives the minimal alignment constraints that must be met		/// RequiredAlignment gives the minimal alignment constraints that must be met
/// to be able to select this paired load.		/// to be able to select this paired load.
///		///
/// This information is not used to generate actual paired loads, but it is		/// This information is not used to generate actual paired loads, but it is
/// used to generate a sequence of loads that is easier to combine into a		/// used to generate a sequence of loads that is easier to combine into a
/// paired load.		/// paired load.
▲ Show 20 Lines • Show All 1,635 Lines • Show Last 20 Lines

llvm/trunk/lib/CodeGen/CodeGenPrepare.cpp

Show First 20 Lines • Show All 369 Lines • ▼ Show 20 Lines	private:
bool splitLargeGEPOffsets();		bool splitLargeGEPOffsets();
bool performAddressTypePromotion(		bool performAddressTypePromotion(
Instruction *&Inst,		Instruction *&Inst,
bool AllowPromotionWithoutCommonHeader,		bool AllowPromotionWithoutCommonHeader,
bool HasPromoted, TypePromotionTransaction &TPT,		bool HasPromoted, TypePromotionTransaction &TPT,
SmallVectorImpl<Instruction *> &SpeculativelyMovedExts);		SmallVectorImpl<Instruction *> &SpeculativelyMovedExts);
bool splitBranchCondition(Function &F);		bool splitBranchCondition(Function &F);
bool simplifyOffsetableRelocate(Instruction &I);		bool simplifyOffsetableRelocate(Instruction &I);

		bool tryToSinkFreeOperands(Instruction *I);
};		};

} // end anonymous namespace		} // end anonymous namespace

char CodeGenPrepare::ID = 0;		char CodeGenPrepare::ID = 0;

INITIALIZE_PASS_BEGIN(CodeGenPrepare, DEBUG_TYPE,		INITIALIZE_PASS_BEGIN(CodeGenPrepare, DEBUG_TYPE,
"Optimize for code generation", false, false)		"Optimize for code generation", false, false)
▲ Show 20 Lines • Show All 1,361 Lines • ▼ Show 20 Lines	case Intrinsic::aarch64_stxr: {
return false;		return false;
// Sink a zext feeding stlxr/stxr before it, so it can be folded into it.		// Sink a zext feeding stlxr/stxr before it, so it can be folded into it.
ExtVal->moveBefore(CI);		ExtVal->moveBefore(CI);
// Mark this instruction as "inserted by CGP", so that other		// Mark this instruction as "inserted by CGP", so that other
// optimizations don't touch it.		// optimizations don't touch it.
InsertedInsts.insert(ExtVal);		InsertedInsts.insert(ExtVal);
return true;		return true;
}		}

case Intrinsic::launder_invariant_group:		case Intrinsic::launder_invariant_group:
case Intrinsic::strip_invariant_group: {		case Intrinsic::strip_invariant_group: {
Value *ArgVal = II->getArgOperand(0);		Value *ArgVal = II->getArgOperand(0);
auto it = LargeOffsetGEPMap.find(II);		auto it = LargeOffsetGEPMap.find(II);
if (it != LargeOffsetGEPMap.end()) {		if (it != LargeOffsetGEPMap.end()) {
// Merge entries in LargeOffsetGEPMap to reflect the RAUW.		// Merge entries in LargeOffsetGEPMap to reflect the RAUW.
// Make sure not to have to deal with iterator invalidation		// Make sure not to have to deal with iterator invalidation
// after possibly adding ArgVal to LargeOffsetGEPMap.		// after possibly adding ArgVal to LargeOffsetGEPMap.
▲ Show 20 Lines • Show All 4,205 Lines • ▼ Show 20 Lines	bool CodeGenPrepare::optimizeShuffleVectorInst(ShuffleVectorInst *SVI) {
if (SVI->use_empty()) {		if (SVI->use_empty()) {
SVI->eraseFromParent();		SVI->eraseFromParent();
MadeChange = true;		MadeChange = true;
}		}

return MadeChange;		return MadeChange;
}		}

		bool CodeGenPrepare::tryToSinkFreeOperands(Instruction *I) {
		// If the operands of I can be folded into a target instruction together with
		// I, duplicate and sink them.
		SmallVector<Use *, 4> OpsToSink;
		if (!TLI || !TLI->shouldSinkOperands(I, OpsToSink))
		return false;

		// OpsToSink can contain multiple uses in a use chain (e.g.
		// (%u1 with %u1 = shufflevector), (%u2 with %u2 = zext %u1)). The dominating
		// uses must come first, which means they are sunk first, temporarily creating
		// invalid IR. This will be fixed once their dominated users are sunk and
		// updated.
		BasicBlock *TargetBB = I->getParent();
		bool Changed = false;
		SmallVector<Use *, 4> ToReplace;
		for (Use *U : OpsToSink) {
		auto *UI = cast<Instruction>(U->get());
		if (UI->getParent() == TargetBB || isa<PHINode>(UI))
		continue;
		ToReplace.push_back(U);
		}

		SmallPtrSet<Instruction *, 4> MaybeDead;
		for (Use *U : ToReplace) {
		auto *UI = cast<Instruction>(U->get());
		Instruction *NI = UI->clone();
		MaybeDead.insert(UI);
		LLVM_DEBUG(dbgs() << "Sinking " << *UI << " to user " << *I << "\n");
		NI->insertBefore(I);
		InsertedInsts.insert(NI);
		U->set(NI);
		Changed = true;
		}

		// Remove instructions that are dead after sinking.
		for (auto *I : MaybeDead)
		if (!I->hasNUsesOrMore(1))
		I->eraseFromParent();

		return Changed;
		}

bool CodeGenPrepare::optimizeSwitchInst(SwitchInst *SI) {		bool CodeGenPrepare::optimizeSwitchInst(SwitchInst *SI) {
if (!TLI || !DL)		if (!TLI || !DL)
return false;		return false;

Value *Cond = SI->getCondition();		Value *Cond = SI->getCondition();
Type *OldType = Cond->getType();		Type *OldType = Cond->getType();
LLVMContext &Context = Cond->getContext();		LLVMContext &Context = Cond->getContext();
MVT RegType = TLI->getRegisterType(Context, TLI->getValueType(*DL, OldType));		MVT RegType = TLI->getRegisterType(Context, TLI->getValueType(*DL, OldType));
▲ Show 20 Lines • Show All 798 Lines • ▼ Show 20 Lines	if (GEPI->hasAllZeroIndices()) {
return true;		return true;
}		}
if (tryUnmergingGEPsAcrossIndirectBr(GEPI, TTI)) {		if (tryUnmergingGEPsAcrossIndirectBr(GEPI, TTI)) {
return true;		return true;
}		}
return false;		return false;
}		}

		if (tryToSinkFreeOperands(I))
		return true;

if (CallInst *CI = dyn_cast<CallInst>(I))		if (CallInst *CI = dyn_cast<CallInst>(I))
return optimizeCallInst(CI, ModifiedDT);		return optimizeCallInst(CI, ModifiedDT);

if (SelectInst *SI = dyn_cast<SelectInst>(I))		if (SelectInst *SI = dyn_cast<SelectInst>(I))
return optimizeSelectInst(SI);		return optimizeSelectInst(SI);

if (ShuffleVectorInst *SVI = dyn_cast<ShuffleVectorInst>(I))		if (ShuffleVectorInst *SVI = dyn_cast<ShuffleVectorInst>(I))
return optimizeShuffleVectorInst(SVI);		return optimizeShuffleVectorInst(SVI);
▲ Show 20 Lines • Show All 297 Lines • Show Last 20 Lines
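
The comment in tryToSinkFreeOperands above says that dominating uses must come first in OpsToSink. The following standalone sketch tries to show why that order matters for a chain such as shufflevector -> zext -> add. It is a toy model, not LLVM code: `ToyNode`, `ToyUse`, `ToyGraph` and `toySinkInOrder` are hypothetical names, and the sketch assumes each sunk operand has a single user (it does not model what happens to other users of the original).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy value with an owning block and a list of operands (not llvm::Value).
struct ToyNode {
  std::string Name;
  int Block = 0;
  std::vector<ToyNode *> Ops;
};

// A "use" is a (user, operand index) pair, a stand-in for llvm::Use.
using ToyUse = std::pair<ToyNode *, std::size_t>;

struct ToyGraph {
  std::vector<std::unique_ptr<ToyNode>> Nodes;

  ToyNode *add(const std::string &Name, int Block,
               const std::vector<ToyNode *> &Ops = {}) {
    auto N = std::make_unique<ToyNode>();
    N->Name = Name;
    N->Block = Block;
    N->Ops = Ops;
    Nodes.push_back(std::move(N));
    return Nodes.back().get();
  }
};

// For each use, in the given order, clone the used operand into TargetBlock
// and rewrite the use to point at the clone. A clone copies the operand list
// its original has at that moment, so earlier rewrites are inherited.
static void toySinkInOrder(ToyGraph &G, const std::vector<ToyUse> &Uses,
                           int TargetBlock) {
  for (const ToyUse &U : Uses) {
    ToyNode *Orig = U.first->Ops[U.second];
    if (Orig->Block == TargetBlock)
      continue;
    ToyNode *Clone = G.add(Orig->Name + ".sunk", TargetBlock, Orig->Ops);
    U.first->Ops[U.second] = Clone;
  }
}
```

With the use of the shuffle (inside the extend) listed before the use of the extend (inside the add), the cloned extend picks up the cloned shuffle, so the whole chain lands in the target block. In the reverse order the cloned extend would still refer to the shuffle in the original block.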

llvm/trunk/lib/Target/AArch64/AArch64ISelLowering.h

Show First 20 Lines • Show All 321 Lines • ▼ Show 20 Lines	public:
bool isTruncateFree(EVT VT1, EVT VT2) const override;		bool isTruncateFree(EVT VT1, EVT VT2) const override;

bool isProfitableToHoist(Instruction *I) const override;		bool isProfitableToHoist(Instruction *I) const override;

bool isZExtFree(Type *Ty1, Type *Ty2) const override;		bool isZExtFree(Type *Ty1, Type *Ty2) const override;
bool isZExtFree(EVT VT1, EVT VT2) const override;		bool isZExtFree(EVT VT1, EVT VT2) const override;
bool isZExtFree(SDValue Val, EVT VT2) const override;		bool isZExtFree(SDValue Val, EVT VT2) const override;

		bool shouldSinkOperands(Instruction *I,
		SmallVectorImpl<Use *> &Ops) const override;

bool hasPairedLoad(EVT LoadedType, unsigned &RequiredAligment) const override;		bool hasPairedLoad(EVT LoadedType, unsigned &RequiredAligment) const override;

unsigned getMaxSupportedInterleaveFactor() const override { return 4; }		unsigned getMaxSupportedInterleaveFactor() const override { return 4; }

bool lowerInterleavedLoad(LoadInst *LI,		bool lowerInterleavedLoad(LoadInst *LI,
ArrayRef<ShuffleVectorInst *> Shuffles,		ArrayRef<ShuffleVectorInst *> Shuffles,
ArrayRef<unsigned> Indices,		ArrayRef<unsigned> Indices,
unsigned Factor) const override;		unsigned Factor) const override;
▲ Show 20 Lines • Show All 398 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/AArch64/AArch64ISelLowering.cpp

Show First 20 Lines • Show All 48 Lines • ▼ Show 20 Lines
#include "llvm/IR/DebugLoc.h"		#include "llvm/IR/DebugLoc.h"
#include "llvm/IR/DerivedTypes.h"		#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"		#include "llvm/IR/Function.h"
#include "llvm/IR/GetElementPtrTypeIterator.h"		#include "llvm/IR/GetElementPtrTypeIterator.h"
#include "llvm/IR/GlobalValue.h"		#include "llvm/IR/GlobalValue.h"
#include "llvm/IR/IRBuilder.h"		#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instruction.h"		#include "llvm/IR/Instruction.h"
#include "llvm/IR/Instructions.h"		#include "llvm/IR/Instructions.h"
		#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Intrinsics.h"		#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Module.h"		#include "llvm/IR/Module.h"
#include "llvm/IR/OperandTraits.h"		#include "llvm/IR/OperandTraits.h"
		#include "llvm/IR/PatternMatch.h"
#include "llvm/IR/Type.h"		#include "llvm/IR/Type.h"
#include "llvm/IR/Use.h"		#include "llvm/IR/Use.h"
#include "llvm/IR/Value.h"		#include "llvm/IR/Value.h"
#include "llvm/MC/MCRegisterInfo.h"		#include "llvm/MC/MCRegisterInfo.h"
#include "llvm/Support/Casting.h"		#include "llvm/Support/Casting.h"
#include "llvm/Support/CodeGen.h"		#include "llvm/Support/CodeGen.h"
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Compiler.h"		#include "llvm/Support/Compiler.h"
Show All 13 Lines
#include <cstdlib>		#include <cstdlib>
#include <iterator>		#include <iterator>
#include <limits>		#include <limits>
#include <tuple>		#include <tuple>
#include <utility>		#include <utility>
#include <vector>		#include <vector>

using namespace llvm;		using namespace llvm;
		using namespace llvm::PatternMatch;

#define DEBUG_TYPE "aarch64-lower"		#define DEBUG_TYPE "aarch64-lower"

STATISTIC(NumTailCalls, "Number of tail calls");		STATISTIC(NumTailCalls, "Number of tail calls");
STATISTIC(NumShiftInserts, "Number of vector shift inserts");		STATISTIC(NumShiftInserts, "Number of vector shift inserts");
STATISTIC(NumOptimizedImms, "Number of times immediates were optimized");		STATISTIC(NumOptimizedImms, "Number of times immediates were optimized");

static cl::opt<bool>		static cl::opt<bool>
▲ Show 20 Lines • Show All 8,168 Lines • ▼ Show 20 Lines	for (const Use &U : Ext->uses()) {
}		}

// At this point we can use the bfm family, so this extension is free		// At this point we can use the bfm family, so this extension is free
// for that use.		// for that use.
}		}
return true;		return true;
}		}

		/// Check if both Op1 and Op2 are shufflevector extracts of either the lower
		/// or upper half of the vector elements.
		static bool areExtractShuffleVectors(Value *Op1, Value *Op2) {
		auto areTypesHalfed = [](Value *FullV, Value *HalfV) {
		auto *FullVT = cast<VectorType>(FullV->getType());
		auto *HalfVT = cast<VectorType>(HalfV->getType());
		return FullVT->getBitWidth() == 2 * HalfVT->getBitWidth();
		};

		auto extractHalf = [](Value *FullV, Value *HalfV) {
		auto *FullVT = cast<VectorType>(FullV->getType());
		auto *HalfVT = cast<VectorType>(HalfV->getType());
		return FullVT->getNumElements() == 2 * HalfVT->getNumElements();
		};

		Constant *M1, *M2;
		Value *S1Op1, *S2Op1;
		if (!match(Op1, m_ShuffleVector(m_Value(S1Op1), m_Undef(), m_Constant(M1))) ||
		!match(Op2, m_ShuffleVector(m_Value(S2Op1), m_Undef(), m_Constant(M2))))
		return false;

		// Check that the operands are half as wide as the result and we extract
		// half of the elements of the input vectors.
		if (!areTypesHalfed(S1Op1, Op1) || !areTypesHalfed(S2Op1, Op2) ||
		!extractHalf(S1Op1, Op1) || !extractHalf(S2Op1, Op2))
		return false;

		// Check the mask extracts either the lower or upper half of vector
		// elements.
		int M1Start = -1;
		int M2Start = -1;
		int NumElements = cast<VectorType>(Op1->getType())->getNumElements() * 2;
		if (!ShuffleVectorInst::isExtractSubvectorMask(M1, NumElements, M1Start) ||
		!ShuffleVectorInst::isExtractSubvectorMask(M2, NumElements, M2Start) ||
		M1Start != M2Start || (M1Start != 0 && M2Start != (NumElements / 2)))
		return false;

		return true;
		}

		/// Check if Ext1 and Ext2 are extends of the same type, doubling the bitwidth
		/// of the vector elements.
		static bool areExtractExts(Value *Ext1, Value *Ext2) {
		auto areExtDoubled = [](Instruction *Ext) {
		return Ext->getType()->getScalarSizeInBits() ==
		2 * Ext->getOperand(0)->getType()->getScalarSizeInBits();
		};

		if (!match(Ext1, m_ZExtOrSExt(m_Value())) \|\|
		!match(Ext2, m_ZExtOrSExt(m_Value())) \|\|
		!areExtDoubled(cast<Instruction>(Ext1)) \|\|
		!areExtDoubled(cast<Instruction>(Ext2)))
		return false;

		return true;
		}

		/// Check if sinking \p I's operands to I's basic block is profitable, because
		/// the operands can be folded into a target instruction, e.g.
		/// shufflevectors extracts and/or sext/zext can be folded into (u,s)subl(2).
		bool AArch64TargetLowering::shouldSinkOperands(
		Instruction *I, SmallVectorImpl<Use *> &Ops) const {
		if (!I->getType()->isVectorTy())
		return false;

		if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(I)) {
		switch (II->getIntrinsicID()) {
		case Intrinsic::aarch64_neon_umull:
		if (!areExtractShuffleVectors(II->getOperand(0), II->getOperand(1)))
		return false;
		Ops.push_back(&II->getOperandUse(0));
		Ops.push_back(&II->getOperandUse(1));
		return true;
		default:
		return false;
		}
		}

		switch (I->getOpcode()) {
		case Instruction::Sub:
		case Instruction::Add: {
		if (!areExtractExts(I->getOperand(0), I->getOperand(1)))
		return false;

		// If the exts' operands extract either the lower or upper elements, we
		// can sink them too.
		auto Ext1 = cast<Instruction>(I->getOperand(0));
		auto Ext2 = cast<Instruction>(I->getOperand(1));
		if (areExtractShuffleVectors(Ext1, Ext2)) {
		Ops.push_back(&Ext1->getOperandUse(0));
		Ops.push_back(&Ext2->getOperandUse(0));
		}

		Ops.push_back(&I->getOperandUse(0));
		Ops.push_back(&I->getOperandUse(1));

		return true;
		}
		default:
		return false;
		}
		return false;
		}

bool AArch64TargetLowering::hasPairedLoad(EVT LoadedType,		bool AArch64TargetLowering::hasPairedLoad(EVT LoadedType,
unsigned &RequiredAligment) const {		unsigned &RequiredAligment) const {
if (!LoadedType.isSimple() ||		if (!LoadedType.isSimple() ||
(!LoadedType.isInteger() && !LoadedType.isFloatingPoint()))		(!LoadedType.isInteger() && !LoadedType.isFloatingPoint()))
return false;		return false;
// Cyclone supports unaligned accesses.		// Cyclone supports unaligned accesses.
RequiredAligment = 0;		RequiredAligment = 0;
unsigned NumBits = LoadedType.getSizeInBits();		unsigned NumBits = LoadedType.getSizeInBits();
▲ Show 20 Lines • Show All 3,598 Lines • Show Last 20 Lines
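
The mask check in areExtractShuffleVectors above relies on ShuffleVectorInst::isExtractSubvectorMask. As a rough standalone illustration of what "both masks extract the same lower or upper half" means, here is a sketch that works on plain integer masks. `toyExtractStart` and `toyExtractSameHalf` are hypothetical helpers written for this example; they do not handle undef mask elements and are not the LLVM implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the start index if Mask selects Mask.size() consecutive elements
// from a source vector with NumSrcElts elements, or -1 otherwise.
static int toyExtractStart(const std::vector<int> &Mask, int NumSrcElts) {
  const int NumMaskElts = static_cast<int>(Mask.size());
  if (NumMaskElts == 0 || NumMaskElts >= NumSrcElts)
    return -1;
  const int Start = Mask[0];
  if (Start < 0 || Start + NumMaskElts > NumSrcElts)
    return -1;
  for (int I = 0; I < NumMaskElts; ++I)
    if (Mask[static_cast<std::size_t>(I)] != Start + I)
      return -1;
  return Start;
}

// True if both masks extract the same half (lower or upper) of a source
// vector with NumSrcElts elements.
static bool toyExtractSameHalf(const std::vector<int> &M1,
                               const std::vector<int> &M2, int NumSrcElts) {
  if (static_cast<int>(M1.size()) * 2 != NumSrcElts || M2.size() != M1.size())
    return false;
  const int S1 = toyExtractStart(M1, NumSrcElts);
  const int S2 = toyExtractStart(M2, NumSrcElts);
  if (S1 < 0 || S1 != S2)
    return false;
  return S1 == 0 || S1 == NumSrcElts / 2;
}
```

The masks in the tests below correspond to these cases: `<0..7>` paired with `<0..7>` (or `<8..15>` with `<8..15>`) on a 16-element source is accepted, while a mask such as `<0, 1, 2, 3, 1, 5, 6, 7>` is rejected because its elements are not consecutive.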

llvm/trunk/test/Transforms/CodeGenPrepare/AArch64/sink-free-instructions.ll

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt < %s -codegenprepare -S | FileCheck %s

				target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64-unknown"

				define <8 x i16> @sink_zext(<8 x i8> %a, <8 x i8> %b, i1 %c) {
				; CHECK-LABEL: @sink_zext(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 [[C:%.*]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
				; CHECK: if.then:
				; CHECK-NEXT: [[ZB_1:%.*]] = zext <8 x i8> [[B:%.*]] to <8 x i16>
				; CHECK-NEXT: [[TMP0:%.*]] = zext <8 x i8> [[A:%.*]] to <8 x i16>
				; CHECK-NEXT: [[RES_1:%.*]] = add <8 x i16> [[TMP0]], [[ZB_1]]
				; CHECK-NEXT: ret <8 x i16> [[RES_1]]
				; CHECK: if.else:
				; CHECK-NEXT: [[ZB_2:%.*]] = zext <8 x i8> [[B]] to <8 x i16>
				; CHECK-NEXT: [[TMP1:%.*]] = zext <8 x i8> [[A]] to <8 x i16>
				; CHECK-NEXT: [[RES_2:%.*]] = sub <8 x i16> [[TMP1]], [[ZB_2]]
				; CHECK-NEXT: ret <8 x i16> [[RES_2]]
				;
				entry:
				%za = zext <8 x i8> %a to <8 x i16>
				br i1 %c, label %if.then, label %if.else

				if.then:
				%zb.1 = zext <8 x i8> %b to <8 x i16>
				%res.1 = add <8 x i16> %za, %zb.1
				ret <8 x i16> %res.1

				if.else:
				%zb.2 = zext <8 x i8> %b to <8 x i16>
				%res.2 = sub <8 x i16> %za, %zb.2
				ret <8 x i16> %res.2
				}

				define <8 x i16> @sink_sext(<8 x i8> %a, <8 x i8> %b, i1 %c) {
				; CHECK-LABEL: @sink_sext(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 [[C:%.*]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
				; CHECK: if.then:
				; CHECK-NEXT: [[ZB_1:%.*]] = sext <8 x i8> [[B:%.*]] to <8 x i16>
				; CHECK-NEXT: [[TMP0:%.*]] = sext <8 x i8> [[A:%.*]] to <8 x i16>
				; CHECK-NEXT: [[RES_1:%.*]] = add <8 x i16> [[TMP0]], [[ZB_1]]
				; CHECK-NEXT: ret <8 x i16> [[RES_1]]
				; CHECK: if.else:
				; CHECK-NEXT: [[ZB_2:%.*]] = sext <8 x i8> [[B]] to <8 x i16>
				; CHECK-NEXT: [[TMP1:%.*]] = sext <8 x i8> [[A]] to <8 x i16>
				; CHECK-NEXT: [[RES_2:%.*]] = sub <8 x i16> [[TMP1]], [[ZB_2]]
				; CHECK-NEXT: ret <8 x i16> [[RES_2]]
				;
				entry:
				%za = sext <8 x i8> %a to <8 x i16>
				br i1 %c, label %if.then, label %if.else

				if.then:
				%zb.1 = sext <8 x i8> %b to <8 x i16>
				%res.1 = add <8 x i16> %za, %zb.1
				ret <8 x i16> %res.1

				if.else:
				%zb.2 = sext <8 x i8> %b to <8 x i16>
				%res.2 = sub <8 x i16> %za, %zb.2
				ret <8 x i16> %res.2
				}

				define <8 x i16> @do_not_sink_nonfree_zext(<8 x i8> %a, <8 x i8> %b, i1 %c) {
				; CHECK-LABEL: @do_not_sink_nonfree_zext(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 [[C:%.*]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
				; CHECK: if.then:
				; CHECK-NEXT: [[ZB_1:%.*]] = sext <8 x i8> [[B:%.*]] to <8 x i16>
				; CHECK-NEXT: [[TMP0:%.*]] = sext <8 x i8> [[A:%.*]] to <8 x i16>
				; CHECK-NEXT: [[RES_1:%.*]] = add <8 x i16> [[TMP0]], [[ZB_1]]
				; CHECK-NEXT: ret <8 x i16> [[RES_1]]
				; CHECK: if.else:
				; CHECK-NEXT: [[ZB_2:%.*]] = sext <8 x i8> [[B]] to <8 x i16>
				; CHECK-NEXT: ret <8 x i16> [[ZB_2]]
				;
				entry:
				%za = sext <8 x i8> %a to <8 x i16>
				br i1 %c, label %if.then, label %if.else

				if.then:
				%zb.1 = sext <8 x i8> %b to <8 x i16>
				%res.1 = add <8 x i16> %za, %zb.1
				ret <8 x i16> %res.1

				if.else:
				%zb.2 = sext <8 x i8> %b to <8 x i16>
				ret <8 x i16> %zb.2
				}

				define <8 x i16> @do_not_sink_nonfree_sext(<8 x i8> %a, <8 x i8> %b, i1 %c) {
				; CHECK-LABEL: @do_not_sink_nonfree_sext(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 [[C:%.*]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
				; CHECK: if.then:
				; CHECK-NEXT: [[ZB_1:%.*]] = sext <8 x i8> [[B:%.*]] to <8 x i16>
				; CHECK-NEXT: [[TMP0:%.*]] = sext <8 x i8> [[A:%.*]] to <8 x i16>
				; CHECK-NEXT: [[RES_1:%.*]] = add <8 x i16> [[TMP0]], [[ZB_1]]
				; CHECK-NEXT: ret <8 x i16> [[RES_1]]
				; CHECK: if.else:
				; CHECK-NEXT: [[ZB_2:%.*]] = sext <8 x i8> [[B]] to <8 x i16>
				; CHECK-NEXT: ret <8 x i16> [[ZB_2]]
				;
				entry:
				%za = sext <8 x i8> %a to <8 x i16>
				br i1 %c, label %if.then, label %if.else

				if.then:
				%zb.1 = sext <8 x i8> %b to <8 x i16>
				%res.1 = add <8 x i16> %za, %zb.1
				ret <8 x i16> %res.1

				if.else:
				%zb.2 = sext <8 x i8> %b to <8 x i16>
				ret <8 x i16> %zb.2
				}

				; The masks used are suitable for umull, sink shufflevector to users.
				define <8 x i16> @sink_shufflevector_umull(<16 x i8> %a, <16 x i8> %b) {
				; CHECK-LABEL: @sink_shufflevector_umull(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 undef, label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
				; CHECK: if.then:
				; CHECK-NEXT: [[S2:%.*]] = shufflevector <16 x i8> [[B:%.*]], <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; CHECK-NEXT: [[TMP0:%.*]] = shufflevector <16 x i8> [[A:%.*]], <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; CHECK-NEXT: [[VMULL0:%.*]] = tail call <8 x i16> @llvm.aarch64.neon.umull.v8i16(<8 x i8> [[TMP0]], <8 x i8> [[S2]])
				; CHECK-NEXT: ret <8 x i16> [[VMULL0]]
				; CHECK: if.else:
				; CHECK-NEXT: [[S4:%.*]] = shufflevector <16 x i8> [[B]], <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <16 x i8> [[A]], <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				; CHECK-NEXT: [[VMULL1:%.*]] = tail call <8 x i16> @llvm.aarch64.neon.umull.v8i16(<8 x i8> [[TMP1]], <8 x i8> [[S4]])
				; CHECK-NEXT: ret <8 x i16> [[VMULL1]]
				;
				entry:
				%s1 = shufflevector <16 x i8> %a, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				%s3 = shufflevector <16 x i8> %a, <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				br i1 undef, label %if.then, label %if.else

				if.then:
				%s2 = shufflevector <16 x i8> %b, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				%vmull0 = tail call <8 x i16> @llvm.aarch64.neon.umull.v8i16(<8 x i8> %s1, <8 x i8> %s2) #3
				ret <8 x i16> %vmull0

				if.else:
				%s4 = shufflevector <16 x i8> %b, <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				%vmull1 = tail call <8 x i16> @llvm.aarch64.neon.umull.v8i16(<8 x i8> %s3, <8 x i8> %s4) #3
				ret <8 x i16> %vmull1
				}

				; Both exts and their shufflevector operands can be sunk.
				define <8 x i16> @sink_shufflevector_ext_subadd(<16 x i8> %a, <16 x i8> %b) {
				entry:
				%s1 = shufflevector <16 x i8> %a, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				%z1 = zext <8 x i8> %s1 to <8 x i16>
				%s3 = shufflevector <16 x i8> %a, <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				%z3 = sext <8 x i8> %s3 to <8 x i16>
				br i1 undef, label %if.then, label %if.else

				if.then:
				%s2 = shufflevector <16 x i8> %b, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				%z2 = zext <8 x i8> %s2 to <8 x i16>
				%res1 = add <8 x i16> %z1, %z2
				ret <8 x i16> %res1

				if.else:
				%s4 = shufflevector <16 x i8> %b, <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				%z4 = sext <8 x i8> %s4 to <8 x i16>
				%res2 = sub <8 x i16> %z3, %z4
				ret <8 x i16> %res2
				}


				declare void @user1(<8 x i16>)

				; Both exts and their shufflevector operands can be sunk.
				define <8 x i16> @sink_shufflevector_ext_subadd_multiuse(<16 x i8> %a, <16 x i8> %b) {
				entry:
				%s1 = shufflevector <16 x i8> %a, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				%z1 = zext <8 x i8> %s1 to <8 x i16>
				%s3 = shufflevector <16 x i8> %a, <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				%z3 = sext <8 x i8> %s3 to <8 x i16>
				call void @user1(<8 x i16> %z3)
				br i1 undef, label %if.then, label %if.else

				if.then:
				%s2 = shufflevector <16 x i8> %b, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				%z2 = zext <8 x i8> %s2 to <8 x i16>
				%res1 = add <8 x i16> %z1, %z2
				ret <8 x i16> %res1

				if.else:
				%s4 = shufflevector <16 x i8> %b, <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				%z4 = sext <8 x i8> %s4 to <8 x i16>
				%res2 = sub <8 x i16> %z3, %z4
				ret <8 x i16> %res2
				}


				; The masks used are not suitable for umull, do not sink.
				define <8 x i16> @no_sink_shufflevector_umull(<16 x i8> %a, <16 x i8> %b) {
				; CHECK-LABEL: @no_sink_shufflevector_umull(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[S1:%.*]] = shufflevector <16 x i8> [[A:%.*]], <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 1, i32 5, i32 6, i32 7>
				; CHECK-NEXT: [[S3:%.*]] = shufflevector <16 x i8> [[A]], <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				; CHECK-NEXT: br i1 undef, label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
				; CHECK: if.then:
				; CHECK-NEXT: [[S2:%.*]] = shufflevector <16 x i8> [[B:%.*]], <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; CHECK-NEXT: [[VMULL0:%.*]] = tail call <8 x i16> @llvm.aarch64.neon.umull.v8i16(<8 x i8> [[S1]], <8 x i8> [[S2]])
				; CHECK-NEXT: ret <8 x i16> [[VMULL0]]
				; CHECK: if.else:
				; CHECK-NEXT: [[S4:%.*]] = shufflevector <16 x i8> [[B]], <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 10, i32 12, i32 13, i32 14, i32 15>
				; CHECK-NEXT: [[VMULL1:%.*]] = tail call <8 x i16> @llvm.aarch64.neon.umull.v8i16(<8 x i8> [[S3]], <8 x i8> [[S4]])
				; CHECK-NEXT: ret <8 x i16> [[VMULL1]]
				;
				entry:
				%s1 = shufflevector <16 x i8> %a, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 1, i32 5, i32 6, i32 7>
				%s3 = shufflevector <16 x i8> %a, <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				br i1 undef, label %if.then, label %if.else

				if.then:
				%s2 = shufflevector <16 x i8> %b, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				%vmull0 = tail call <8 x i16> @llvm.aarch64.neon.umull.v8i16(<8 x i8> %s1, <8 x i8> %s2) #3
				ret <8 x i16> %vmull0

				if.else:
				%s4 = shufflevector <16 x i8> %b, <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 10, i32 12, i32 13, i32 14, i32 15>
				%vmull1 = tail call <8 x i16> @llvm.aarch64.neon.umull.v8i16(<8 x i8> %s3, <8 x i8> %s4) #3
				ret <8 x i16> %vmull1
				}


				; Function Attrs: nounwind readnone
				declare <8 x i16> @llvm.aarch64.neon.umull.v8i16(<8 x i8>, <8 x i8>) #2