This is an archive of the discontinued LLVM Phabricator instance.

Paths

lib/Target/ARM/ARMParallelDSP.cpp
test/CodeGen/ARM/ParallelDSP/complex_dot_prod.ll
test/CodeGen/ARM/ParallelDSP/exchange.ll
test/CodeGen/ARM/ParallelDSP/inner-full-unroll.ll
test/CodeGen/ARM/ParallelDSP/multi-use-loads.ll
test/CodeGen/ARM/ParallelDSP/overlapping.ll
test/CodeGen/ARM/ParallelDSP/pr43073.ll
test/CodeGen/ARM/ParallelDSP/smlad11.ll
test/CodeGen/ARM/ParallelDSP/smladx-1.ll
test/CodeGen/ARM/ParallelDSP/smlaldx-1.ll
test/CodeGen/ARM/ParallelDSP/smlaldx-2.ll
test/CodeGen/ARM/ParallelDSP/unroll-n-jam-smlad.ll
test/MC/AsmParser/preserve-comments-crlf.s

Differential D67392

[ARM][ParallelDSP] Change smlad insertion order
ClosedPublic

Authored by samparker on Sep 10 2019, 3:48 AM.

Details

Reviewers

efriedma
SjoerdMeijer
dmgreen

Commits

rG1c3ca61294de: [ARM][ParallelDSP] Change smlad insertion order
rL374981: [ARM][ParallelDSP] Change smlad insertion order

Summary

Instead of inserting everything after the 'root' of the reduction, insert all instructions as close to their operands as possible. This can help reduce register pressure.
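
The register-pressure argument can be illustrated with a toy liveness count. This is only a sketch under stated assumptions: the block model, `maxLiveValues`, `macsAfterAllLoads` and `macsNextToOperands` are hypothetical names that do not appear in the patch, and real register pressure also depends on the scheduler and register allocator. In this toy model, placing three chained multiply-accumulates after all six wide loads peaks at six live values, while placing each one next to its operands peaks at three.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy block: Block[I] lists the indices of earlier instructions whose
// results instruction I reads. Hypothetical model, not LLVM IR.
using Block = std::vector<std::vector<std::size_t>>;

// Largest number of values that are defined but still awaiting a later use.
std::size_t maxLiveValues(const Block &B) {
  std::vector<std::size_t> LastUse(B.size());
  for (std::size_t I = 0; I < B.size(); ++I) {
    LastUse[I] = I; // no user seen yet
    for (std::size_t Use : B[I]) {
      assert(Use < I && "operands must be defined earlier in the block");
      LastUse[Use] = I;
    }
  }
  std::size_t Max = 0;
  for (std::size_t Point = 0; Point < B.size(); ++Point) {
    std::size_t Live = 0;
    for (std::size_t V = 0; V <= Point; ++V)
      if (LastUse[V] > Point)
        ++Live;
    Max = std::max(Max, Live);
  }
  return Max;
}

// Six wide loads first, then three chained multiply-accumulates.
Block macsAfterAllLoads() {
  return {{}, {}, {}, {}, {}, {}, {0, 1}, {2, 3, 6}, {4, 5, 7}};
}

// Each multiply-accumulate placed right after its two loads.
Block macsNextToOperands() {
  return {{}, {}, {0, 1}, {}, {}, {3, 4, 2}, {}, {}, {6, 7, 5}};
}
```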

Note: I have no idea why git has decided that I've made a change to an MC test.

Event Timeline

samparker created this revision. Sep 10 2019, 3:48 AM

Herald added subscribers: mgrang, zzheng, kristof.beyls. Sep 10 2019, 3:48 AM

This can help reduce register pressure.

Is misched really so weak that this helps significantly? Or is it not enabled for the targets in question?

lib/Target/ARM/ARMParallelDSP.cpp
698	dominates() is linear in the length of the basic block; might want to use OrderedBasicBlock. (This is probably not the only place where this pass is quadratic in the size of the basic block, but just happened to spot it.)
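
The cost difference behind this comment can be sketched outside LLVM. The code below is a minimal illustration, not the OrderedBasicBlock implementation: `Inst`, `comesBeforeLinear` and `BlockOrder` are hypothetical names, and the numbering is built eagerly for simplicity. A linear query walks the block every time, whereas a numbered block pays one pass up front and then answers each query with a lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>

// Hypothetical stand-in for an instruction; only its identity matters here.
struct Inst {
  int Id;
};

// Linear query: walk the block until A or B is reached.
bool comesBeforeLinear(const std::list<Inst> &Block, const Inst *A,
                       const Inst *B) {
  for (const Inst &I : Block) {
    if (&I == B)
      return false; // B reached first (or A == B)
    if (&I == A)
      return true;
  }
  return false; // neither instruction is in this block
}

// Cached query: number every instruction once, then compare positions.
class BlockOrder {
  std::unordered_map<const Inst *, std::size_t> Position;

public:
  explicit BlockOrder(const std::list<Inst> &Block) {
    std::size_t Index = 0;
    for (const Inst &I : Block)
      Position[&I] = Index++;
  }

  bool comesBefore(const Inst *A, const Inst *B) const {
    auto ItA = Position.find(A);
    auto ItB = Position.find(B);
    assert(ItA != Position.end() && ItB != Position.end() &&
           "instruction is not in the numbered block");
    return ItA->second < ItB->second;
  }
};
```

The numbering in this sketch goes stale as soon as an instruction is inserted into or removed from the block, so it has to be rebuilt after each such change.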

Note: I have no idea why git has decided that I've made a change to an MC test.

Like the name suggests, preserve-comments-crlf.s has Windows line endings; maybe your text editor accidentally "fixed" them?

Thanks Eli, I had no idea that OrderedBasicBlock existed! I'm now using this in a couple of places. I've also updated a couple of tests to highlight the register pressure changes. The scheduler does a good job, but it appears to struggle with some of the large blocks that we chuck at it and every cycle counts for these DSP kernels.

efriedma added inline comments. Sep 12 2019, 1:34 PM

lib/Target/ARM/ARMParallelDSP.cpp
654	You decided not to modify this dominates() call?

samparker marked an inline comment as done. Sep 13 2019, 12:38 AM

samparker added inline comments.

lib/Target/ARM/ARMParallelDSP.cpp
654	I assumed it wouldn't be worth it... My understanding is that I'd have to create a new ordered block each time, because the block will change after each call.

efriedma added inline comments. Sep 13 2019, 12:08 PM

lib/Target/ARM/ARMParallelDSP.cpp
654	My general concern here would be that the pass is O(N^2) in the number of transformations in a given BB. (If you unroll a loop containing a transformable operation N times, for example.) This contributes to that, constructing the OrderedBB later in InsertParallelMACs contributes to that. But there are a bunch of other places with similar issues... for example, RecordMemoryOps has a loop that's O(N^2) in the number of loads. Actually, thinking about it a bit more, I have another concern; you might not be picking a legal insertion point. The instruction that produces the accumulator could be anything: for example, a phi, or an invoke, or an exception-handling instruction.

samparker marked an inline comment as done. Sep 16 2019, 7:19 AM

samparker added inline comments.

lib/Target/ARM/ARMParallelDSP.cpp
654	What about if I delayed the RecordMemoryOps logic until after we've discovered the reduction? Then, at least, we won't be paying the cost for the vast majority of cases. The accumulator could be a phi, but I don't think that's an issue here as it would dominate all other instructions. Only the original accumulator input value can be a phi or a non-instruction and, as we don't compare same values, I don't see how a phi could be in the insertion point. And please pardon my ignorance, but could you elaborate why an invoke would be an issue? And why exception handling needs to be considered?

I also hadn't thought much about complexity, but indeed, the function RecordMemoryOps, for example, is a bit of an expensive hobby.
Looking at it again, the bookkeeping looks essential, I don't see an easy way to reduce complexity. Delaying it may help a bit, but fundamentally that won't change much I think.
The usual way to deal with expensive hobbies is to introduce a threshold, and bail if it exceeds that.

The current implementation of comparing loads is quadratic, yes, but you could use a different algorithm, like splitting loads into a base pointer plus an offset, and constructing a map from base pointers to load offsets. Maybe not worthwhile, though; a threshold might be good enough.
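
One way to picture the suggested base-plus-offset map is the sketch below. It is an assumption-laden illustration rather than code from the pass: `LoadDesc` and `findSequentialPairs` are hypothetical, bases are plain integers standing in for base pointers, and offsets are counted in elements. Grouping by base and keeping offsets sorted lets each load be compared only with its neighbour instead of with every other load.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <utility>
#include <vector>

// Hypothetical description of a load: an id for its base pointer and an
// element offset from that base.
struct LoadDesc {
  int Base;
  int Offset;
};

// Returns index pairs (into Loads) of loads that share a base and sit at
// consecutive offsets. Building the maps costs O(N log N).
std::vector<std::pair<std::size_t, std::size_t>>
findSequentialPairs(const std::vector<LoadDesc> &Loads) {
  // Base id -> (offset -> index of the first load seen at that offset).
  std::map<int, std::map<int, std::size_t>> ByBase;
  for (std::size_t I = 0; I < Loads.size(); ++I)
    ByBase[Loads[I].Base].emplace(Loads[I].Offset, I);

  std::vector<std::pair<std::size_t, std::size_t>> Pairs;
  for (const auto &Entry : ByBase) {
    const std::map<int, std::size_t> &Offsets = Entry.second;
    for (auto It = Offsets.begin(); It != Offsets.end(); ++It) {
      auto NextIt = std::next(It);
      if (NextIt != Offsets.end() && NextIt->first == It->first + 1)
        Pairs.emplace_back(It->second, NextIt->second);
    }
  }
  return Pairs;
}
```

The result is a list of candidate pairs only; checks such as intervening writes would still have to be applied to each candidate.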

lib/Target/ARM/ARMParallelDSP.cpp
654	Oh, if the accumulator isn't an instruction in the same block as the multiply, we always choose the multiply as the insertion point? That makes sense... but it probably deserves a comment noting that invariant.
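
The invariant discussed here can be written down on a toy model. This is a sketch that assumes values are identified by their position in the block; `Position` and `insertIndexAfter` are hypothetical names and the code is not taken from the patch. A value that is not an instruction in the block has no position, so the other operand decides the insertion point.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical: the index of a value's defining instruction in the block,
// or std::nullopt when the value is not an instruction in this block
// (for example an argument, a constant, or a value from another block).
using Position = std::optional<std::size_t>;

// Index at which a new user of A and B can be placed: directly after
// whichever operand is defined later in the block.
std::size_t insertIndexAfter(Position A, Position B) {
  // Parenthesised so the message attaches to the whole condition.
  assert((A || B) && "expected at least one instruction");
  if (!A)
    return *B + 1;
  if (!B)
    return *A + 1;
  return std::max(*A, *B) + 1;
}
```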

The current implementation of comparing loads is quadratic, yes, but you could use a different algorithm, like splitting loads into a base pointer plus an offset, and constructing a map from base pointers to load offsets.

Ah, yes, nice one!

Thanks both. I've added a threshold to the number of loads that we can inspect, which causes us to bail before examining any loads. I've also added an early exit into the troublesome loop in RecordMemoryOps.
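
A threshold with an early exit might look roughly like the sketch below. Everything in it is an assumption for illustration: `MaxLoadsSketch`, `IndexPair` and `recordSequentialPairs` are hypothetical names, the limit of 16 is borrowed from the figure mentioned later in this review, and the review does not say which early exit was added, so stopping once a base load has found its partner is only one plausible form.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Limit borrowed from the figure of 16 loads mentioned in the review; the
// constant name is hypothetical.
constexpr std::size_t MaxLoadsSketch = 16;

using IndexPair = std::pair<std::size_t, std::size_t>;

// Addresses[I] is the element address of load I (a toy stand-in for a
// pointer). Returns false without inspecting any load when there are too
// many of them.
bool recordSequentialPairs(const std::vector<int> &Addresses,
                           std::vector<IndexPair> &Pairs) {
  Pairs.clear();
  if (Addresses.size() > MaxLoadsSketch)
    return false; // bail before the quadratic scan

  for (std::size_t Base = 0; Base < Addresses.size(); ++Base) {
    for (std::size_t Offset = 0; Offset < Addresses.size(); ++Offset) {
      if (Base == Offset || Addresses[Offset] != Addresses[Base] + 1)
        continue;
      Pairs.emplace_back(Base, Offset);
      break; // early exit: this base load has found its partner
    }
  }
  return true;
}
```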

SjoerdMeijer added inline comments. Oct 15 2019, 6:29 AM

lib/Target/ARM/ARMParallelDSP.cpp
373	Just curious, why did you change the iteration order, Loads/Writes vs. Writes/Loads?
test/CodeGen/ARM/ParallelDSP/blocks.ll
139 ↗	(On Diff #221770)	Double checking I understand this test: this test has 16 loads, so is within the limit of 16. So why don't we generate the 4th SMLAD?

samparker marked 2 inline comments as done. Oct 15 2019, 7:21 AM

samparker added inline comments.

lib/Target/ARM/ARMParallelDSP.cpp
373	We know that Loads isn't empty, but Writes may be, so we can skip the loop if it is.
test/CodeGen/ARM/ParallelDSP/blocks.ll
139 ↗	(On Diff #221770)	We're within the limit, but it looks like there's some limitation with the search algorithm... the adds are just ordered differently to 'usual'.

Thanks for clarifying.

I am happy with this if @efriedma is happy too.

LGTM, assuming you fix your tree so test/MC/AsmParser/preserve-comments-crlf.s isn't modified.

This revision is now accepted and ready to land. Oct 15 2019, 11:08 AM

Closed by commit rG1c3ca61294de: [ARM][ParallelDSP] Change smlad insertion order (authored by samparker). Oct 16 2019, 2:43 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. Oct 16 2019, 2:43 AM

Herald added a subscriber: hiraditya.

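Revision Contents

Path	Size
lib/Target/ARM/ARMParallelDSP.cpp	51 lines
test/CodeGen/ARM/ParallelDSP/complex_dot_prod.ll	57 lines
test/CodeGen/ARM/ParallelDSP/exchange.ll	12 lines
test/CodeGen/ARM/ParallelDSP/inner-full-unroll.ll	4 lines
test/CodeGen/ARM/ParallelDSP/multi-use-loads.ll	32 lines
test/CodeGen/ARM/ParallelDSP/overlapping.ll	18 lines
test/CodeGen/ARM/ParallelDSP/pr43073.ll	16 lines
test/CodeGen/ARM/ParallelDSP/smlad11.ll	4 lines
test/CodeGen/ARM/ParallelDSP/smladx-1.ll	9 lines
test/CodeGen/ARM/ParallelDSP/smlaldx-1.ll	9 lines
test/CodeGen/ARM/ParallelDSP/smlaldx-2.ll	9 lines
test/CodeGen/ARM/ParallelDSP/unroll-n-jam-smlad.ll	3 lines
test/MC/AsmParser/preserve-comments-crlf.s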

Diff 219671

lib/Target/ARM/ARMParallelDSP.cpp

[... 12 lines not shown ...]
/// This pass runs only when unaligned accesses is supported/enabled.		/// This pass runs only when unaligned accesses is supported/enabled.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
#include "llvm/ADT/SmallPtrSet.h"		#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/Analysis/AliasAnalysis.h"		#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/LoopAccessAnalysis.h"		#include "llvm/Analysis/LoopAccessAnalysis.h"
		#include "llvm/Analysis/OrderedBasicBlock.h"
#include "llvm/IR/Instructions.h"		#include "llvm/IR/Instructions.h"
#include "llvm/IR/NoFolder.h"		#include "llvm/IR/NoFolder.h"
#include "llvm/Transforms/Scalar.h"		#include "llvm/Transforms/Scalar.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"		#include "llvm/Transforms/Utils/BasicBlockUtils.h"
#include "llvm/Pass.h"		#include "llvm/Pass.h"
#include "llvm/PassRegistry.h"		#include "llvm/PassRegistry.h"
#include "llvm/PassSupport.h"		#include "llvm/PassSupport.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
[... 312 lines not shown ...]

/// Iterate through the block and record base, offset pairs of loads which can		/// Iterate through the block and record base, offset pairs of loads which can
/// be widened into a single load.		/// be widened into a single load.
bool ARMParallelDSP::RecordMemoryOps(BasicBlock *BB) {		bool ARMParallelDSP::RecordMemoryOps(BasicBlock *BB) {
SmallVector<LoadInst*, 8> Loads;		SmallVector<LoadInst*, 8> Loads;
SmallVector<Instruction*, 8> Writes;		SmallVector<Instruction*, 8> Writes;
LoadPairs.clear();		LoadPairs.clear();
WideLoads.clear();		WideLoads.clear();
		OrderedBasicBlock OrderedBB(BB);

// Collect loads and instruction that may write to memory. For now we only		// Collect loads and instruction that may write to memory. For now we only
// record loads which are simple, sign-extended and have a single user.		// record loads which are simple, sign-extended and have a single user.
// TODO: Allow zero-extended loads.		// TODO: Allow zero-extended loads.
for (auto &I : *BB) {		for (auto &I : *BB) {
if (I.mayWriteToMemory())		if (I.mayWriteToMemory())
Writes.push_back(&I);		Writes.push_back(&I);
auto *Ld = dyn_cast<LoadInst>(&I);		auto *Ld = dyn_cast<LoadInst>(&I);
if (!Ld || !Ld->isSimple() ||		if (!Ld || !Ld->isSimple() ||
!Ld->hasOneUse() || !isa<SExtInst>(Ld->user_back()))		!Ld->hasOneUse() || !isa<SExtInst>(Ld->user_back()))
continue;		continue;
Loads.push_back(Ld);		Loads.push_back(Ld);
}		}

using InstSet = std::set<Instruction*>;		using InstSet = std::set<Instruction*>;
using DepMap = std::map<Instruction*, InstSet>;		using DepMap = std::map<Instruction*, InstSet>;
DepMap RAWDeps;		DepMap RAWDeps;

// Record any writes that may alias a load.		// Record any writes that may alias a load.
const auto Size = LocationSize::unknown();		const auto Size = LocationSize::unknown();
for (auto Read : Loads) {		for (auto Read : Loads) {
for (auto Write : Writes) {		for (auto Write : Writes) {
MemoryLocation ReadLoc =		MemoryLocation ReadLoc =
MemoryLocation(Read->getPointerOperand(), Size);		MemoryLocation(Read->getPointerOperand(), Size);

if (!isModOrRefSet(intersectModRef(AA->getModRefInfo(Write, ReadLoc),		if (!isModOrRefSet(intersectModRef(AA->getModRefInfo(Write, ReadLoc),
ModRefInfo::ModRef)))		ModRefInfo::ModRef)))
continue;		continue;
if (DT->dominates(Write, Read))		if (OrderedBB.dominates(Write, Read))
RAWDeps[Read].insert(Write);		RAWDeps[Read].insert(Write);
}		}
}		}

// Check whether there's not a write between the two loads which would		// Check whether there's not a write between the two loads which would
// prevent them from being safely merged.		// prevent them from being safely merged.
auto SafeToPair = [&](LoadInst *Base, LoadInst *Offset) {		auto SafeToPair = [&](LoadInst *Base, LoadInst *Offset) {
LoadInst *Dominator = DT->dominates(Base, Offset) ? Base : Offset;		LoadInst *Dominator = OrderedBB.dominates(Base, Offset) ? Base : Offset;
LoadInst *Dominated = DT->dominates(Base, Offset) ? Offset : Base;		LoadInst *Dominated = OrderedBB.dominates(Base, Offset) ? Offset : Base;

if (RAWDeps.count(Dominated)) {		if (RAWDeps.count(Dominated)) {
InstSet &WritesBefore = RAWDeps[Dominated];		InstSet &WritesBefore = RAWDeps[Dominated];

for (auto Before : WritesBefore) {		for (auto Before : WritesBefore) {
// We can't move the second load backward, past a write, to merge		// We can't move the second load backward, past a write, to merge
// with the first load.		// with the first load.
if (DT->dominates(Dominator, Before))		if (OrderedBB.dominates(Dominator, Before))
return false;		return false;
}		}
}		}
return true;		return true;
};		};

// Record base, offset load pairs.		// Record base, offset load pairs.
for (auto *Base : Loads) {		for (auto *Base : Loads) {
[... 205 lines not shown; hunk context: for (unsigned j = 0; j < Elems; ++j) { ...]

if (CanPair(R, PMul0, PMul1))		if (CanPair(R, PMul0, PMul1))
break;		break;
}		}
}		}
return !R.getMulPairs().empty();		return !R.getMulPairs().empty();
}		}


void ARMParallelDSP::InsertParallelMACs(Reduction &R) {		void ARMParallelDSP::InsertParallelMACs(Reduction &R) {

auto CreateSMLAD = [&](LoadInst* WideLd0, LoadInst *WideLd1,		auto CreateSMLAD = [&](LoadInst* WideLd0, LoadInst *WideLd1,
Value *Acc, bool Exchange,		Value *Acc, bool Exchange,
Instruction *InsertAfter) {		Instruction *InsertAfter) {
// Replace the reduction chain with an intrinsic call		// Replace the reduction chain with an intrinsic call

Value* Args[] = { WideLd0, WideLd1, Acc };		Value* Args[] = { WideLd0, WideLd1, Acc };
Function *SMLAD = nullptr;		Function *SMLAD = nullptr;
if (Exchange)		if (Exchange)
SMLAD = Acc->getType()->isIntegerTy(32) ?		SMLAD = Acc->getType()->isIntegerTy(32) ?
Intrinsic::getDeclaration(M, Intrinsic::arm_smladx) :		Intrinsic::getDeclaration(M, Intrinsic::arm_smladx) :
Intrinsic::getDeclaration(M, Intrinsic::arm_smlaldx);		Intrinsic::getDeclaration(M, Intrinsic::arm_smlaldx);
else		else
SMLAD = Acc->getType()->isIntegerTy(32) ?		SMLAD = Acc->getType()->isIntegerTy(32) ?
Intrinsic::getDeclaration(M, Intrinsic::arm_smlad) :		Intrinsic::getDeclaration(M, Intrinsic::arm_smlad) :
Intrinsic::getDeclaration(M, Intrinsic::arm_smlald);		Intrinsic::getDeclaration(M, Intrinsic::arm_smlald);

IRBuilder<NoFolder> Builder(InsertAfter->getParent(),		IRBuilder<NoFolder> Builder(InsertAfter->getParent(),
++BasicBlock::iterator(InsertAfter));		BasicBlock::iterator(InsertAfter));
Instruction *Call = Builder.CreateCall(SMLAD, Args);		Instruction *Call = Builder.CreateCall(SMLAD, Args);
NumSMLAD++;		NumSMLAD++;
return Call;		return Call;
};		};

Instruction *InsertAfter = R.getRoot();		// Return the instruction after the dominated instruction.
		auto GetInsertPoint = [this](Value *A, Value *B) {
		assert(isa<Instruction>(A) || isa<Instruction>(B) &&
		"expected at least one instruction");

		Value *V = nullptr;
		if (!isa<Instruction>(A))
		V = B;
		else if (!isa<Instruction>(B))
		V = A;
		else
		V = DT->dominates(cast<Instruction>(A), cast<Instruction>(B)) ? B : A;

		return &*++BasicBlock::iterator(cast<Instruction>(V));
		};

Value *Acc = R.getAccumulator();		Value *Acc = R.getAccumulator();

// For any muls that were discovered but not paired, accumulate their values		// For any muls that were discovered but not paired, accumulate their values
// as before.		// as before.
IRBuilder<NoFolder> Builder(InsertAfter->getParent(),		IRBuilder<NoFolder> Builder(R.getRoot()->getParent());
++BasicBlock::iterator(InsertAfter));
MulCandList &MulCands = R.getMuls();		MulCandList &MulCands = R.getMuls();
for (auto &MulCand : MulCands) {		for (auto &MulCand : MulCands) {
if (MulCand->Paired)		if (MulCand->Paired)
continue;		continue;

Value *Mul = MulCand->Root;		Instruction *Mul = cast<Instruction>(MulCand->Root);
LLVM_DEBUG(dbgs() << "Accumulating unpaired mul: " << *Mul << "\n");		LLVM_DEBUG(dbgs() << "Accumulating unpaired mul: " << *Mul << "\n");

if (R.getType() != Mul->getType()) {		if (R.getType() != Mul->getType()) {
assert(R.is64Bit() && "expected 64-bit result");		assert(R.is64Bit() && "expected 64-bit result");
Mul = Builder.CreateSExt(Mul, R.getType());		Builder.SetInsertPoint(&*++BasicBlock::iterator(Mul));
		Mul = cast<Instruction>(Builder.CreateSExt(Mul, R.getRoot()->getType()));
}		}

if (!Acc) {		if (!Acc) {
Acc = Mul;		Acc = Mul;
continue;		continue;
}		}

		Builder.SetInsertPoint(GetInsertPoint(Mul, Acc));
Acc = Builder.CreateAdd(Mul, Acc);		Acc = Builder.CreateAdd(Mul, Acc);
InsertAfter = cast<Instruction>(Acc);
}		}

if (!Acc) {		if (!Acc) {
Acc = R.is64Bit() ?		Acc = R.is64Bit() ?
ConstantInt::get(IntegerType::get(M->getContext(), 64), 0) :		ConstantInt::get(IntegerType::get(M->getContext(), 64), 0) :
ConstantInt::get(IntegerType::get(M->getContext(), 32), 0);		ConstantInt::get(IntegerType::get(M->getContext(), 32), 0);
} else if (Acc->getType() != R.getType()) {		} else if (Acc->getType() != R.getType()) {
Builder.SetInsertPoint(R.getRoot());		Builder.SetInsertPoint(R.getRoot());
Acc = Builder.CreateSExt(Acc, R.getType());		Acc = Builder.CreateSExt(Acc, R.getType());
}		}

		// Roughly sort the mul pairs in their program order.
		OrderedBasicBlock OrderedBB(R.getRoot()->getParent());
		llvm::sort(R.getMulPairs(), [&OrderedBB](auto &PairA, auto &PairB) {
		const Instruction *A = PairA.first->Root;
		const Instruction *B = PairB.first->Root;
		return OrderedBB.dominates(A, B);
		});

IntegerType *Ty = IntegerType::get(M->getContext(), 32);		IntegerType *Ty = IntegerType::get(M->getContext(), 32);
for (auto &Pair : R.getMulPairs()) {		for (auto &Pair : R.getMulPairs()) {
MulCandidate *LHSMul = Pair.first;		MulCandidate *LHSMul = Pair.first;
MulCandidate *RHSMul = Pair.second;		MulCandidate *RHSMul = Pair.second;
LoadInst *BaseLHS = LHSMul->getBaseLoad();		LoadInst *BaseLHS = LHSMul->getBaseLoad();
LoadInst *BaseRHS = RHSMul->getBaseLoad();		LoadInst *BaseRHS = RHSMul->getBaseLoad();
LoadInst *WideLHS = WideLoads.count(BaseLHS) ?		LoadInst *WideLHS = WideLoads.count(BaseLHS) ?
WideLoads[BaseLHS]->getLoad() : CreateWideLoad(LHSMul->VecLd, Ty);		WideLoads[BaseLHS]->getLoad() : CreateWideLoad(LHSMul->VecLd, Ty);
LoadInst *WideRHS = WideLoads.count(BaseRHS) ?		LoadInst *WideRHS = WideLoads.count(BaseRHS) ?
WideLoads[BaseRHS]->getLoad() : CreateWideLoad(RHSMul->VecLd, Ty);		WideLoads[BaseRHS]->getLoad() : CreateWideLoad(RHSMul->VecLd, Ty);

		Instruction *InsertAfter = GetInsertPoint(WideLHS, WideRHS);
		InsertAfter = GetInsertPoint(InsertAfter, Acc);
Acc = CreateSMLAD(WideLHS, WideRHS, Acc, RHSMul->Exchange, InsertAfter);		Acc = CreateSMLAD(WideLHS, WideRHS, Acc, RHSMul->Exchange, InsertAfter);
InsertAfter = cast<Instruction>(Acc);
}		}
R.UpdateRoot(cast<Instruction>(Acc));		R.UpdateRoot(cast<Instruction>(Acc));
}		}

LoadInst* ARMParallelDSP::CreateWideLoad(MemInstList &Loads,		LoadInst* ARMParallelDSP::CreateWideLoad(MemInstList &Loads,
IntegerType *LoadTy) {		IntegerType *LoadTy) {
assert(Loads.size() == 2 && "currently only support widening two loads");		assert(Loads.size() == 2 && "currently only support widening two loads");

[... 83 lines not shown ...]

test/CodeGen/ARM/ParallelDSP/complex_dot_prod.ll

; RUN: llc -mtriple=thumbv7em -mcpu=cortex-m4 -O3 %s -o - | FileCheck %s		; RUN: llc -mtriple=thumbv7em -mcpu=cortex-m4 -O3 %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-LLC
		; RUN: opt -S -mtriple=armv7-a -arm-parallel-dsp -dce %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-OPT

; TODO: Think we should be able to use smlsdx/smlsldx here.		; TODO: Think we should be able to use smlsdx/smlsldx here.

; CHECK-LABEL: complex_dot_prod		; CHECK-LABEL: complex_dot_prod

; CHECK: smulbb		; CHECK-LLC: smlaldx
; CHECK: smultt		; CHECK-LLC: smulbb
; CHECK: smlalbb		; CHECK-LLC: smultt
; CHECK: smultt		; CHECK-LLC: smlalbb
; CHECK: smlaldx		; CHECK-LLC: smlaldx
; CHECK: smlalbb		; CHECK-LLC: smultt
; CHECK: smlaldx		; CHECK-LLC: smlalbb
; CHECK: smultt		; CHECK-LLC: smultt
; CHECK: smlalbb		; CHECK-LLC: smlalbb
; CHECK: smlaldx		; CHECK-LLC: smlaldx
; CHECK: smultt		; CHECK-LLC: smultt
; CHECK: pop.w {r4, r5, r6, r7, r8, r9, r10, r11, pc}		; CHECK-LCC: pop.w {r4, r5, r6, r7, r8, r9, r10, pc}

		; CHECK-OPT: [[ADDR_A:%[^ ]+]] = bitcast i16* %pSrcA to i32*
		; CHECK-OPT: [[A:%[^ ]+]] = load i32, i32* [[ADDR_A]], align 2
		; CHECK-OPT: [[ADDR_A_2:%[^ ]+]] = getelementptr inbounds i16, i16* %pSrcA, i32 2
		; CHECK-OPT: [[ADDR_B:%[^ ]+]] = bitcast i16* %pSrcB to i32*
		; CHECK-OPT: [[B:%[^ ]+]] = load i32, i32* [[ADDR_B]], align 2
		; CHECK-OPT: [[ACC0:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[A]], i32 [[B]], i64 0)
		; CHECK-OPT: [[ADDR_B_2:%[^ ]+]] = getelementptr inbounds i16, i16* %pSrcB, i32 2
		; CHECK-OPT: [[CAST_ADDR_A_2:%[^ ]+]] = bitcast i16* [[ADDR_A_2]] to i32*
		; CHECK-OPT: [[A_2:%[^ ]+]] = load i32, i32* [[CAST_ADDR_A_2]], align 2
		; CHECK-OPT: [[ADDR_A_4:%[^ ]+]] = getelementptr inbounds i16, i16* %pSrcA, i32 4
		; CHECK-OPT: [[CAST_ADDR_B_2:%[^ ]+]] = bitcast i16* [[ADDR_B_2]] to i32*
		; CHECK-OPT: [[B_2:%[^ ]+]] = load i32, i32* [[CAST_ADDR_B_2]], align 2
		; CHECK-OPT: [[ACC1:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[A_2]], i32 [[B_2]], i64 [[ACC0]])
		; CHECK-OPT: [[ADDR_B_4:%[^ ]+]] = getelementptr inbounds i16, i16* %pSrcB, i32 4
		; CHECK-OPT: [[CAST_ADDR_A_4:%[^ ]+]] = bitcast i16* [[ADDR_A_4]] to i32*
		; CHECK-OPT: [[A_4:%[^ ]+]] = load i32, i32* [[CAST_ADDR_A_4]], align 2
		; CHECK-OPT: [[ADDR_A_6:%[^ ]+]] = getelementptr inbounds i16, i16* %pSrcA, i32 6
		; CHECK-OPT: [[CAST_ADDR_B_4:%[^ ]+]] = bitcast i16* [[ADDR_B_4]] to i32*
		; CHECK-OPT: [[B_4:%[^ ]+]] = load i32, i32* [[CAST_ADDR_B_4]], align 2
		; CHECK-OPT: [[ACC2:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[A_4]], i32 [[B_4]], i64 [[ACC1]])
		; CHECK-OPT: [[ADDR_B_6:%[^ ]+]] = getelementptr inbounds i16, i16* %pSrcB, i32 6
		; CHECK-OPT: [[CAST_ADDR_A_6:%[^ ]+]] = bitcast i16* [[ADDR_A_6]] to i32*
		; CHECK-OPT: [[A_6:%[^ ]+]] = load i32, i32* [[CAST_ADDR_A_6]], align 2
		; CHECK-OPT: [[CAST_ADDR_B_6:%[^ ]+]] = bitcast i16* [[ADDR_B_6]] to i32*
		; CHECK-OPT: [[B_6:%[^ ]+]] = load i32, i32* [[CAST_ADDR_B_6]], align 2
		; CHECK-OPT: call i64 @llvm.arm.smlaldx(i32 [[A_6]], i32 [[B_6]], i64 [[ACC2]])

define dso_local arm_aapcscc void @complex_dot_prod(i16* nocapture readonly %pSrcA, i16* nocapture readonly %pSrcB, i32* nocapture %realResult, i32* nocapture %imagResult) {		define dso_local arm_aapcscc void @complex_dot_prod(i16* nocapture readonly %pSrcA, i16* nocapture readonly %pSrcB, i32* nocapture %realResult, i32* nocapture %imagResult) {
entry:		entry:
%incdec.ptr = getelementptr inbounds i16, i16* %pSrcA, i32 1		%incdec.ptr = getelementptr inbounds i16, i16* %pSrcA, i32 1
%0 = load i16, i16* %pSrcA, align 2		%0 = load i16, i16* %pSrcA, align 2
%incdec.ptr1 = getelementptr inbounds i16, i16* %pSrcA, i32 2		%incdec.ptr1 = getelementptr inbounds i16, i16* %pSrcA, i32 2
%1 = load i16, i16* %incdec.ptr, align 2		%1 = load i16, i16* %incdec.ptr, align 2
%incdec.ptr2 = getelementptr inbounds i16, i16* %pSrcB, i32 1		%incdec.ptr2 = getelementptr inbounds i16, i16* %pSrcB, i32 1
%2 = load i16, i16* %pSrcB, align 2		%2 = load i16, i16* %pSrcB, align 2
[... 75 lines not shown ...]
%conv78 = sext i16 %15 to i32		%conv78 = sext i16 %15 to i32
%mul79 = mul nsw i32 %conv78, %conv72		%mul79 = mul nsw i32 %conv78, %conv72
%conv80 = sext i32 %mul79 to i64		%conv80 = sext i32 %mul79 to i64
%conv82 = sext i16 %13 to i32		%conv82 = sext i16 %13 to i32
%mul84 = mul nsw i32 %conv78, %conv82		%mul84 = mul nsw i32 %conv78, %conv82
%conv85 = sext i32 %mul84 to i64		%conv85 = sext i32 %mul84 to i64
%sub86 = sub nsw i64 %add76, %conv85		%sub86 = sub nsw i64 %add76, %conv85
%mul89 = mul nsw i32 %conv73, %conv82		%mul89 = mul nsw i32 %conv73, %conv82
%conv90 = sext i32 %mul89 to i64		%conv90 = sext i32 %mul89 to i64
%add81 = add nsw i64 %add67, %conv90		%add81 = add nsw i64 %add67, %conv90
%add91 = add nsw i64 %add81, %conv80		%add91 = add nsw i64 %add81, %conv80
%16 = lshr i64 %sub86, 6		%16 = lshr i64 %sub86, 6
%conv92 = trunc i64 %16 to i32		%conv92 = trunc i64 %16 to i32
store i32 %conv92, i32* %realResult, align 4		store i32 %conv92, i32* %realResult, align 4
%17 = lshr i64 %add91, 6		%17 = lshr i64 %add91, 6
%conv94 = trunc i64 %17 to i32		%conv94 = trunc i64 %17 to i32
store i32 %conv94, i32* %imagResult, align 4		store i32 %conv94, i32* %imagResult, align 4
ret void		ret void
}		}

test/CodeGen/ARM/ParallelDSP/exchange.ll

[... 99 lines not shown ...]
ret i32 %res		ret i32 %res
}		}

; CHECK-LABEL: exchange_multi_use_1		; CHECK-LABEL: exchange_multi_use_1
; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*		; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*
; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]		; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]
; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*		; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*
; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]		; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]
		; CHECK: [[X:%[^ ]+]] = call i32 @llvm.arm.smladx(i32 [[LD_A]], i32 [[LD_B]], i32 %acc
; CHECK: [[GEP:%[^ ]+]] = getelementptr i16, i16* %a, i32 2		; CHECK: [[GEP:%[^ ]+]] = getelementptr i16, i16* %a, i32 2
; CHECK: [[CAST_A_2:%[^ ]+]] = bitcast i16* [[GEP]] to i32*		; CHECK: [[CAST_A_2:%[^ ]+]] = bitcast i16* [[GEP]] to i32*
; CHECK: [[LD_A_2:%[^ ]+]] = load i32, i32* [[CAST_A_2]]		; CHECK: [[LD_A_2:%[^ ]+]] = load i32, i32* [[CAST_A_2]]
; CHECK: [[X:%[^ ]+]] = call i32 @llvm.arm.smladx(i32 [[LD_A]], i32 [[LD_B]], i32 %acc
; CHECK: call i32 @llvm.arm.smlad(i32 [[LD_A_2]], i32 [[LD_B]], i32 [[X]])		; CHECK: call i32 @llvm.arm.smlad(i32 [[LD_A_2]], i32 [[LD_B]], i32 [[X]])
define i32 @exchange_multi_use_1(i16* %a, i16* %b, i32 %acc) {		define i32 @exchange_multi_use_1(i16* %a, i16* %b, i32 %acc) {
entry:		entry:
%addr.a.1 = getelementptr i16, i16* %a, i32 1		%addr.a.1 = getelementptr i16, i16* %a, i32 1
%addr.b.1 = getelementptr i16, i16* %b, i32 1		%addr.b.1 = getelementptr i16, i16* %b, i32 1
%ld.a.0 = load i16, i16* %a		%ld.a.0 = load i16, i16* %a
%sext.a.0 = sext i16 %ld.a.0 to i32		%sext.a.0 = sext i16 %ld.a.0 to i32
%ld.b.0 = load i16, i16* %b		%ld.b.0 = load i16, i16* %b
[... 19 lines not shown ...]
ret i32 %res		ret i32 %res
}		}

; CHECK-LABEL: exchange_multi_use_64_1		; CHECK-LABEL: exchange_multi_use_64_1
; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*		; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*
; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]		; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]
; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*		; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*
; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]		; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]
		; CHECK: [[X:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[LD_A]], i32 [[LD_B]], i64 %acc
; CHECK: [[GEP:%[^ ]+]] = getelementptr i16, i16* %a, i32 2		; CHECK: [[GEP:%[^ ]+]] = getelementptr i16, i16* %a, i32 2
; CHECK: [[CAST_A_2:%[^ ]+]] = bitcast i16* [[GEP]] to i32*		; CHECK: [[CAST_A_2:%[^ ]+]] = bitcast i16* [[GEP]] to i32*
; CHECK: [[LD_A_2:%[^ ]+]] = load i32, i32* [[CAST_A_2]]		; CHECK: [[LD_A_2:%[^ ]+]] = load i32, i32* [[CAST_A_2]]
; CHECK: [[X:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[LD_A]], i32 [[LD_B]], i64 %acc
; CHECK: call i64 @llvm.arm.smlald(i32 [[LD_A_2]], i32 [[LD_B]], i64 [[X]])		; CHECK: call i64 @llvm.arm.smlald(i32 [[LD_A_2]], i32 [[LD_B]], i64 [[X]])
define i64 @exchange_multi_use_64_1(i16* %a, i16* %b, i64 %acc) {		define i64 @exchange_multi_use_64_1(i16* %a, i16* %b, i64 %acc) {
entry:		entry:
%addr.a.1 = getelementptr i16, i16* %a, i32 1		%addr.a.1 = getelementptr i16, i16* %a, i32 1
%addr.b.1 = getelementptr i16, i16* %b, i32 1		%addr.b.1 = getelementptr i16, i16* %b, i32 1
%ld.a.0 = load i16, i16* %a		%ld.a.0 = load i16, i16* %a
%sext.a.0 = sext i16 %ld.a.0 to i32		%sext.a.0 = sext i16 %ld.a.0 to i32
%ld.b.0 = load i16, i16* %b		%ld.b.0 = load i16, i16* %b
[... 20 lines not shown ...]
ret i64 %res		ret i64 %res
}		}

; CHECK-LABEL: exchange_multi_use_64_2		; CHECK-LABEL: exchange_multi_use_64_2
; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*		; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*
; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]		; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]
; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*		; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*
; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]		; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]
		; CHECK: [[X:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[LD_A]], i32 [[LD_B]], i64 %acc
; CHECK: [[GEP:%[^ ]+]] = getelementptr i16, i16* %a, i32 2		; CHECK: [[GEP:%[^ ]+]] = getelementptr i16, i16* %a, i32 2
; CHECK: [[CAST_A_2:%[^ ]+]] = bitcast i16* [[GEP]] to i32*		; CHECK: [[CAST_A_2:%[^ ]+]] = bitcast i16* [[GEP]] to i32*
; CHECK: [[LD_A_2:%[^ ]+]] = load i32, i32* [[CAST_A_2]]		; CHECK: [[LD_A_2:%[^ ]+]] = load i32, i32* [[CAST_A_2]]
; CHECK: [[X:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[LD_A]], i32 [[LD_B]], i64 %acc
; CHECK: call i64 @llvm.arm.smlald(i32 [[LD_A_2]], i32 [[LD_B]], i64 [[X]])		; CHECK: call i64 @llvm.arm.smlald(i32 [[LD_A_2]], i32 [[LD_B]], i64 [[X]])
define i64 @exchange_multi_use_64_2(i16* %a, i16* %b, i64 %acc) {		define i64 @exchange_multi_use_64_2(i16* %a, i16* %b, i64 %acc) {
entry:		entry:
%addr.a.1 = getelementptr i16, i16* %a, i32 1		%addr.a.1 = getelementptr i16, i16* %a, i32 1
%addr.b.1 = getelementptr i16, i16* %b, i32 1		%addr.b.1 = getelementptr i16, i16* %b, i32 1
%ld.a.0 = load i16, i16* %a		%ld.a.0 = load i16, i16* %a
%sext.a.0 = sext i16 %ld.a.0 to i32		%sext.a.0 = sext i16 %ld.a.0 to i32
%ld.b.0 = load i16, i16* %b		%ld.b.0 = load i16, i16* %b
[... 21 lines not shown ...]	entry:
ret i64 %res		ret i64 %res
}		}

; CHECK-LABEL: exchange_multi_use_2		; CHECK-LABEL: exchange_multi_use_2
; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*		; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*
; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]		; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]
; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*		; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*
; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]		; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]
		; CHECK: [[X:%[^ ]+]] = call i32 @llvm.arm.smlad(i32 [[LD_A]], i32 [[LD_B]], i32 %acc
; CHECK: [[GEP:%[^ ]+]] = getelementptr i16, i16* %a, i32 2		; CHECK: [[GEP:%[^ ]+]] = getelementptr i16, i16* %a, i32 2
; CHECK: [[CAST_A_2:%[^ ]+]] = bitcast i16* [[GEP]] to i32*		; CHECK: [[CAST_A_2:%[^ ]+]] = bitcast i16* [[GEP]] to i32*
; CHECK: [[LD_A_2:%[^ ]+]] = load i32, i32* [[CAST_A_2]]		; CHECK: [[LD_A_2:%[^ ]+]] = load i32, i32* [[CAST_A_2]]
; CHECK: [[X:%[^ ]+]] = call i32 @llvm.arm.smlad(i32 [[LD_A]], i32 [[LD_B]], i32 %acc
; CHECK: call i32 @llvm.arm.smladx(i32 [[LD_B]], i32 [[LD_A_2]], i32 [[X]])		; CHECK: call i32 @llvm.arm.smladx(i32 [[LD_B]], i32 [[LD_A_2]], i32 [[X]])
define i32 @exchange_multi_use_2(i16* %a, i16* %b, i32 %acc) {		define i32 @exchange_multi_use_2(i16* %a, i16* %b, i32 %acc) {
entry:		entry:
%addr.a.1 = getelementptr i16, i16* %a, i32 1		%addr.a.1 = getelementptr i16, i16* %a, i32 1
%addr.b.1 = getelementptr i16, i16* %b, i32 1		%addr.b.1 = getelementptr i16, i16* %b, i32 1
%ld.a.0 = load i16, i16* %a		%ld.a.0 = load i16, i16* %a
%sext.a.0 = sext i16 %ld.a.0 to i32		%sext.a.0 = sext i16 %ld.a.0 to i32
%ld.b.0 = load i16, i16* %b		%ld.b.0 = load i16, i16* %b
[... 61 lines not shown ...]
; CHECK-LABEL: exchange_multi_use_64_3		; CHECK-LABEL: exchange_multi_use_64_3
; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*		; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*
; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]		; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]
; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*		; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*
; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]		; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]
; CHECK: [[GEP:%[^ ]+]] = getelementptr i16, i16* %a, i32 2		; CHECK: [[GEP:%[^ ]+]] = getelementptr i16, i16* %a, i32 2
; CHECK: [[CAST_A_2:%[^ ]+]] = bitcast i16* [[GEP]] to i32*		; CHECK: [[CAST_A_2:%[^ ]+]] = bitcast i16* [[GEP]] to i32*
; CHECK: [[LD_A_2:%[^ ]+]] = load i32, i32* [[CAST_A_2]]		; CHECK: [[LD_A_2:%[^ ]+]] = load i32, i32* [[CAST_A_2]]
; CHECK: [[ACC:%[^ ]+]] = call i64 @llvm.arm.smlald(i32 [[LD_A]], i32 [[LD_B]], i64 0)		; CHECK: [[ACC:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[LD_B]], i32 [[LD_A_2]], i64 0)
; CHECK: [[X:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[LD_B]], i32 [[LD_A_2]], i64 [[ACC]])		; CHECK: [[X:%[^ ]+]] = call i64 @llvm.arm.smlald(i32 [[LD_A]], i32 [[LD_B]], i64 [[ACC]])
define i64 @exchange_multi_use_64_3(i16* %a, i16* %b, i64 %acc) {		define i64 @exchange_multi_use_64_3(i16* %a, i16* %b, i64 %acc) {
entry:		entry:
%addr.a.1 = getelementptr i16, i16* %a, i32 1		%addr.a.1 = getelementptr i16, i16* %a, i32 1
%addr.b.1 = getelementptr i16, i16* %b, i32 1		%addr.b.1 = getelementptr i16, i16* %b, i32 1
%ld.a.0 = load i16, i16* %a		%ld.a.0 = load i16, i16* %a
%sext.a.0 = sext i16 %ld.a.0 to i32		%sext.a.0 = sext i16 %ld.a.0 to i32
%ld.b.0 = load i16, i16* %b		%ld.b.0 = load i16, i16* %b
%ld.a.1 = load i16, i16* %addr.a.1		%ld.a.1 = load i16, i16* %addr.a.1
[... 134 lines not shown ...]

test/CodeGen/ARM/ParallelDSP/inner-full-unroll.ll

	; RUN: opt -mtriple=thumbv7em -arm-parallel-dsp -dce -S %s -o - \| FileCheck %s			; RUN: opt -mtriple=thumbv7em -arm-parallel-dsp -dce -S %s -o - \| FileCheck %s

	; CHECK-LABEL: full_unroll			; CHECK-LABEL: full_unroll
	; CHECK: [[IV:%[^ ]+]] = phi i32			; CHECK: [[IV:%[^ ]+]] = phi i32
	; CHECK: [[AI:%[^ ]+]] = getelementptr inbounds i32, i32* %a, i32 [[IV]]			; CHECK: [[AI:%[^ ]+]] = getelementptr inbounds i32, i32* %a, i32 [[IV]]
	; CHECK: [[BI:%[^ ]+]] = getelementptr inbounds i16*, i16** %b, i32 [[IV]]			; CHECK: [[BI:%[^ ]+]] = getelementptr inbounds i16*, i16** %b, i32 [[IV]]
	; CHECK: [[BIJ:%[^ ]+]] = load i16*, i16** %arrayidx5, align 4			; CHECK: [[BIJ:%[^ ]+]] = load i16*, i16** %arrayidx5, align 4
	; CHECK: [[CI:%[^ ]+]] = getelementptr inbounds i16*, i16** %c, i32 [[IV]]			; CHECK: [[CI:%[^ ]+]] = getelementptr inbounds i16*, i16** %c, i32 [[IV]]
	; CHECK: [[CIJ:%[^ ]+]] = load i16*, i16** [[CI]], align 4			; CHECK: [[CIJ:%[^ ]+]] = load i16*, i16** [[CI]], align 4
	; CHECK: [[BIJ_CAST:%[^ ]+]] = bitcast i16* [[BIJ]] to i32*			; CHECK: [[BIJ_CAST:%[^ ]+]] = bitcast i16* [[BIJ]] to i32*
	; CHECK: [[BIJ_LD:%[^ ]+]] = load i32, i32* [[BIJ_CAST]], align 2			; CHECK: [[BIJ_LD:%[^ ]+]] = load i32, i32* [[BIJ_CAST]], align 2
	; CHECK: [[CIJ_CAST:%[^ ]+]] = bitcast i16* [[CIJ]] to i32*			; CHECK: [[CIJ_CAST:%[^ ]+]] = bitcast i16* [[CIJ]] to i32*
	; CHECK: [[CIJ_LD:%[^ ]+]] = load i32, i32* [[CIJ_CAST]], align 2			; CHECK: [[CIJ_LD:%[^ ]+]] = load i32, i32* [[CIJ_CAST]], align 2
				; CHECK: [[SMLAD0:%[^ ]+]] = call i32 @llvm.arm.smlad(i32 [[CIJ_LD]], i32 [[BIJ_LD]], i32 0)
	; CHECK: [[BIJ_2:%[^ ]+]] = getelementptr inbounds i16, i16* [[BIJ]], i32 2			; CHECK: [[BIJ_2:%[^ ]+]] = getelementptr inbounds i16, i16* [[BIJ]], i32 2
	; CHECK: [[BIJ_2_CAST:%[^ ]+]] = bitcast i16* [[BIJ_2]] to i32*			; CHECK: [[BIJ_2_CAST:%[^ ]+]] = bitcast i16* [[BIJ_2]] to i32*
	; CHECK: [[BIJ_2_LD:%[^ ]+]] = load i32, i32* [[BIJ_2_CAST]], align 2			; CHECK: [[BIJ_2_LD:%[^ ]+]] = load i32, i32* [[BIJ_2_CAST]], align 2
	; CHECK: [[CIJ_2:%[^ ]+]] = getelementptr inbounds i16, i16* [[CIJ]], i32 2			; CHECK: [[CIJ_2:%[^ ]+]] = getelementptr inbounds i16, i16* [[CIJ]], i32 2
	; CHECK: [[CIJ_2_CAST:%[^ ]+]] = bitcast i16* [[CIJ_2]] to i32*			; CHECK: [[CIJ_2_CAST:%[^ ]+]] = bitcast i16* [[CIJ_2]] to i32*
	; CHECK: [[CIJ_2_LD:%[^ ]+]] = load i32, i32* [[CIJ_2_CAST]], align 2			; CHECK: [[CIJ_2_LD:%[^ ]+]] = load i32, i32* [[CIJ_2_CAST]], align 2
	; CHECK: [[SMLAD0:%[^ ]+]] = call i32 @llvm.arm.smlad(i32 [[CIJ_2_LD]], i32 [[BIJ_2_LD]], i32 0)			; CHECK: [[SMLAD1:%[^ ]+]] = call i32 @llvm.arm.smlad(i32 [[CIJ_2_LD]], i32 [[BIJ_2_LD]], i32 [[SMLAD0]])
	; CHECK: [[SMLAD1:%[^ ]+]] = call i32 @llvm.arm.smlad(i32 [[CIJ_LD]], i32 [[BIJ_LD]], i32 [[SMLAD0]])
	; CHECK: store i32 [[SMLAD1]], i32* %arrayidx, align 4			; CHECK: store i32 [[SMLAD1]], i32* %arrayidx, align 4

	define void @full_unroll(i32* noalias nocapture %a, i16** noalias nocapture readonly %b, i16** noalias nocapture readonly %c, i32 %N) {			define void @full_unroll(i32* noalias nocapture %a, i16** noalias nocapture readonly %b, i16** noalias nocapture readonly %c, i32 %N) {
	entry:			entry:
	%cmp29 = icmp eq i32 %N, 0			%cmp29 = icmp eq i32 %N, 0
	br i1 %cmp29, label %for.cond.cleanup, label %for.body			br i1 %cmp29, label %for.cond.cleanup, label %for.body

	for.cond.cleanup: ; preds = %for.body, %entry			for.cond.cleanup: ; preds = %for.body, %entry
	[... 122 lines not shown ...]

test/CodeGen/ARM/ParallelDSP/multi-use-loads.ll

[... 14 lines not shown ...]
; CHECK-LE-NEXT: mov.w r12, #0		; CHECK-LE-NEXT: mov.w r12, #0
; CHECK-LE-NEXT: movs r1, #0		; CHECK-LE-NEXT: movs r1, #0
; CHECK-LE-NEXT: .p2align 2		; CHECK-LE-NEXT: .p2align 2
; CHECK-LE-NEXT: .LBB0_2: @ %for.body		; CHECK-LE-NEXT: .LBB0_2: @ %for.body
; CHECK-LE-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-LE-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-LE-NEXT: ldr lr, [r3, #2]!		; CHECK-LE-NEXT: ldr lr, [r3, #2]!
; CHECK-LE-NEXT: ldr r4, [r2, #2]!		; CHECK-LE-NEXT: ldr r4, [r2, #2]!
; CHECK-LE-NEXT: subs r0, #1		; CHECK-LE-NEXT: subs r0, #1
; CHECK-LE-NEXT: sxtah r1, r1, lr
; CHECK-LE-NEXT: smlad r12, r4, lr, r12		; CHECK-LE-NEXT: smlad r12, r4, lr, r12
		; CHECK-LE-NEXT: sxtah r1, r1, lr
; CHECK-LE-NEXT: bne .LBB0_2		; CHECK-LE-NEXT: bne .LBB0_2
; CHECK-LE-NEXT: @ %bb.3: @ %for.cond.cleanup		; CHECK-LE-NEXT: @ %bb.3: @ %for.cond.cleanup
; CHECK-LE-NEXT: add.w r0, r12, r1		; CHECK-LE-NEXT: add.w r0, r12, r1
; CHECK-LE-NEXT: pop {r4, pc}		; CHECK-LE-NEXT: pop {r4, pc}
; CHECK-LE-NEXT: .LBB0_4:		; CHECK-LE-NEXT: .LBB0_4:
; CHECK-LE-NEXT: mov.w r12, #0		; CHECK-LE-NEXT: mov.w r12, #0
; CHECK-LE-NEXT: movs r1, #0		; CHECK-LE-NEXT: movs r1, #0
; CHECK-LE-NEXT: add.w r0, r12, r1		; CHECK-LE-NEXT: add.w r0, r12, r1
[... 172 lines not shown ...]	for.body:
%count.next = mul i32 %conv4, %count		%count.next = mul i32 %conv4, %count
%exitcond = icmp ne i32 %add, %arg		%exitcond = icmp ne i32 %add, %arg
br i1 %exitcond, label %for.body, label %for.cond.cleanup		br i1 %exitcond, label %for.body, label %for.cond.cleanup
}		}

define i32 @mul_top_user(i32 %arg, i32* nocapture readnone %arg1, i16* nocapture readonly %arg2, i16* nocapture readonly %arg3) {		define i32 @mul_top_user(i32 %arg, i32* nocapture readnone %arg1, i16* nocapture readonly %arg2, i16* nocapture readonly %arg3) {
; CHECK-LE-LABEL: mul_top_user:		; CHECK-LE-LABEL: mul_top_user:
; CHECK-LE: @ %bb.0: @ %entry		; CHECK-LE: @ %bb.0: @ %entry
; CHECK-LE-NEXT: .save {r4, r5, r7, lr}		; CHECK-LE-NEXT: .save {r4, lr}
; CHECK-LE-NEXT: push {r4, r5, r7, lr}		; CHECK-LE-NEXT: push {r4, lr}
; CHECK-LE-NEXT: cmp r0, #1		; CHECK-LE-NEXT: cmp r0, #1
; CHECK-LE-NEXT: blt .LBB2_4		; CHECK-LE-NEXT: blt .LBB2_4
; CHECK-LE-NEXT: @ %bb.1: @ %for.body.preheader		; CHECK-LE-NEXT: @ %bb.1: @ %for.body.preheader
; CHECK-LE-NEXT: subs r2, #2		; CHECK-LE-NEXT: subs r2, #2
; CHECK-LE-NEXT: subs r3, #2		; CHECK-LE-NEXT: subs r3, #2
; CHECK-LE-NEXT: mov.w r12, #0		; CHECK-LE-NEXT: mov.w r12, #0
; CHECK-LE-NEXT: movs r1, #0		; CHECK-LE-NEXT: movs r1, #0
; CHECK-LE-NEXT: .p2align 2		; CHECK-LE-NEXT: .p2align 2
; CHECK-LE-NEXT: .LBB2_2: @ %for.body		; CHECK-LE-NEXT: .LBB2_2: @ %for.body
; CHECK-LE-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-LE-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-LE-NEXT: ldr r4, [r2, #2]!
; CHECK-LE-NEXT: ldr lr, [r3, #2]!		; CHECK-LE-NEXT: ldr lr, [r3, #2]!
; CHECK-LE-NEXT: asrs r5, r4, #16		; CHECK-LE-NEXT: ldr r4, [r2, #2]!
; CHECK-LE-NEXT: smlad r12, r4, lr, r12
; CHECK-LE-NEXT: subs r0, #1		; CHECK-LE-NEXT: subs r0, #1
; CHECK-LE-NEXT: mul r1, r5, r1		; CHECK-LE-NEXT: smlad r12, r4, lr, r12
		; CHECK-LE-NEXT: asr.w r4, r4, #16
		; CHECK-LE-NEXT: mul r1, r4, r1
; CHECK-LE-NEXT: bne .LBB2_2		; CHECK-LE-NEXT: bne .LBB2_2
; CHECK-LE-NEXT: @ %bb.3: @ %for.cond.cleanup		; CHECK-LE-NEXT: @ %bb.3: @ %for.cond.cleanup
; CHECK-LE-NEXT: add.w r0, r12, r1		; CHECK-LE-NEXT: add.w r0, r12, r1
; CHECK-LE-NEXT: pop {r4, r5, r7, pc}		; CHECK-LE-NEXT: pop {r4, pc}
; CHECK-LE-NEXT: .LBB2_4:		; CHECK-LE-NEXT: .LBB2_4:
; CHECK-LE-NEXT: mov.w r12, #0		; CHECK-LE-NEXT: mov.w r12, #0
; CHECK-LE-NEXT: movs r1, #0		; CHECK-LE-NEXT: movs r1, #0
; CHECK-LE-NEXT: add.w r0, r12, r1		; CHECK-LE-NEXT: add.w r0, r12, r1
; CHECK-LE-NEXT: pop {r4, r5, r7, pc}		; CHECK-LE-NEXT: pop {r4, pc}
;		;
; CHECK-BE-LABEL: mul_top_user:		; CHECK-BE-LABEL: mul_top_user:
; CHECK-BE: @ %bb.0: @ %entry		; CHECK-BE: @ %bb.0: @ %entry
; CHECK-BE-NEXT: .save {r4, r5, r6, lr}		; CHECK-BE-NEXT: .save {r4, r5, r6, lr}
; CHECK-BE-NEXT: push {r4, r5, r6, lr}		; CHECK-BE-NEXT: push {r4, r5, r6, lr}
; CHECK-BE-NEXT: cmp r0, #1		; CHECK-BE-NEXT: cmp r0, #1
; CHECK-BE-NEXT: blt .LBB2_4		; CHECK-BE-NEXT: blt .LBB2_4
; CHECK-BE-NEXT: @ %bb.1: @ %for.body.preheader		; CHECK-BE-NEXT: @ %bb.1: @ %for.body.preheader
[... 60 lines not shown ...]	for.body:
%count.next = mul i32 %conv7, %count		%count.next = mul i32 %conv7, %count
%exitcond = icmp ne i32 %add, %arg		%exitcond = icmp ne i32 %add, %arg
br i1 %exitcond, label %for.body, label %for.cond.cleanup		br i1 %exitcond, label %for.body, label %for.cond.cleanup
}		}

define i32 @and_user(i32 %arg, i32* nocapture readnone %arg1, i16* nocapture readonly %arg2, i16* nocapture readonly %arg3) {		define i32 @and_user(i32 %arg, i32* nocapture readnone %arg1, i16* nocapture readonly %arg2, i16* nocapture readonly %arg3) {
; CHECK-LE-LABEL: and_user:		; CHECK-LE-LABEL: and_user:
; CHECK-LE: @ %bb.0: @ %entry		; CHECK-LE: @ %bb.0: @ %entry
; CHECK-LE-NEXT: .save {r4, r5, r7, lr}		; CHECK-LE-NEXT: .save {r4, lr}
; CHECK-LE-NEXT: push {r4, r5, r7, lr}		; CHECK-LE-NEXT: push {r4, lr}
; CHECK-LE-NEXT: cmp r0, #1		; CHECK-LE-NEXT: cmp r0, #1
; CHECK-LE-NEXT: blt .LBB3_4		; CHECK-LE-NEXT: blt .LBB3_4
; CHECK-LE-NEXT: @ %bb.1: @ %for.body.preheader		; CHECK-LE-NEXT: @ %bb.1: @ %for.body.preheader
; CHECK-LE-NEXT: sub.w lr, r2, #2		; CHECK-LE-NEXT: sub.w lr, r2, #2
; CHECK-LE-NEXT: subs r3, #2		; CHECK-LE-NEXT: subs r3, #2
; CHECK-LE-NEXT: mov.w r12, #0		; CHECK-LE-NEXT: mov.w r12, #0
; CHECK-LE-NEXT: movs r1, #0		; CHECK-LE-NEXT: movs r1, #0
; CHECK-LE-NEXT: .p2align 2		; CHECK-LE-NEXT: .p2align 2
; CHECK-LE-NEXT: .LBB3_2: @ %for.body		; CHECK-LE-NEXT: .LBB3_2: @ %for.body
; CHECK-LE-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-LE-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-LE-NEXT: ldr r2, [r3, #2]!		; CHECK-LE-NEXT: ldr r2, [r3, #2]!
; CHECK-LE-NEXT: ldr r4, [lr, #2]!		; CHECK-LE-NEXT: ldr r4, [lr, #2]!
; CHECK-LE-NEXT: uxth r5, r2
; CHECK-LE-NEXT: smlad r12, r4, r2, r12
; CHECK-LE-NEXT: subs r0, #1		; CHECK-LE-NEXT: subs r0, #1
; CHECK-LE-NEXT: mul r1, r5, r1		; CHECK-LE-NEXT: smlad r12, r4, r2, r12
		; CHECK-LE-NEXT: uxth r2, r2
		; CHECK-LE-NEXT: mul r1, r2, r1
; CHECK-LE-NEXT: bne .LBB3_2		; CHECK-LE-NEXT: bne .LBB3_2
; CHECK-LE-NEXT: @ %bb.3: @ %for.cond.cleanup		; CHECK-LE-NEXT: @ %bb.3: @ %for.cond.cleanup
; CHECK-LE-NEXT: add.w r0, r12, r1		; CHECK-LE-NEXT: add.w r0, r12, r1
; CHECK-LE-NEXT: pop {r4, r5, r7, pc}		; CHECK-LE-NEXT: pop {r4, pc}
; CHECK-LE-NEXT: .LBB3_4:		; CHECK-LE-NEXT: .LBB3_4:
; CHECK-LE-NEXT: mov.w r12, #0		; CHECK-LE-NEXT: mov.w r12, #0
; CHECK-LE-NEXT: movs r1, #0		; CHECK-LE-NEXT: movs r1, #0
; CHECK-LE-NEXT: add.w r0, r12, r1		; CHECK-LE-NEXT: add.w r0, r12, r1
; CHECK-LE-NEXT: pop {r4, r5, r7, pc}		; CHECK-LE-NEXT: pop {r4, pc}
;		;
; CHECK-BE-LABEL: and_user:		; CHECK-BE-LABEL: and_user:
; CHECK-BE: @ %bb.0: @ %entry		; CHECK-BE: @ %bb.0: @ %entry
; CHECK-BE-NEXT: .save {r4, r5, r6, lr}		; CHECK-BE-NEXT: .save {r4, r5, r6, lr}
; CHECK-BE-NEXT: push {r4, r5, r6, lr}		; CHECK-BE-NEXT: push {r4, r5, r6, lr}
; CHECK-BE-NEXT: cmp r0, #1		; CHECK-BE-NEXT: cmp r0, #1
; CHECK-BE-NEXT: blt .LBB3_4		; CHECK-BE-NEXT: blt .LBB3_4
; CHECK-BE-NEXT: @ %bb.1: @ %for.body.preheader		; CHECK-BE-NEXT: @ %bb.1: @ %for.body.preheader
[... 175 lines not shown ...]

test/CodeGen/ARM/ParallelDSP/overlapping.ll

	; RUN: opt -arm-parallel-dsp -mtriple=armv7-a -S %s -o - \| FileCheck %s			; RUN: opt -arm-parallel-dsp -mtriple=armv7-a -S %s -o - \| FileCheck %s

	; CHECK-LABEL: overlap_1			; CHECK-LABEL: overlap_1
	; CHECK: [[ADDR_A_1:%[^ ]+]] = getelementptr i16, i16* %a, i32 1			; CHECK: [[ADDR_A_1:%[^ ]+]] = getelementptr i16, i16* %a, i32 1
	; CHECK: [[ADDR_B_1:%[^ ]+]] = getelementptr i16, i16* %b, i32 1			; CHECK: [[ADDR_B_1:%[^ ]+]] = getelementptr i16, i16* %b, i32 1
	; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*			; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*
	; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]			; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]
	; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*			; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*
	; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]			; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]
				; CHECK: [[ACC:%[^ ]+]] = call i32 @llvm.arm.smlad(i32 [[LD_A]], i32 [[LD_B]], i32 %acc)
	; CHECK: [[CAST_A_1:%[^ ]+]] = bitcast i16* [[ADDR_A_1]] to i32*			; CHECK: [[CAST_A_1:%[^ ]+]] = bitcast i16* [[ADDR_A_1]] to i32*
	; CHECK: [[LD_A_1:%[^ ]+]] = load i32, i32* [[CAST_A_1]]			; CHECK: [[LD_A_1:%[^ ]+]] = load i32, i32* [[CAST_A_1]]
	; CHECK: [[CAST_B_1:%[^ ]+]] = bitcast i16* [[ADDR_B_1]] to i32*			; CHECK: [[CAST_B_1:%[^ ]+]] = bitcast i16* [[ADDR_B_1]] to i32*
	; CHECK: [[LD_B_1:%[^ ]+]] = load i32, i32* [[CAST_B_1]]			; CHECK: [[LD_B_1:%[^ ]+]] = load i32, i32* [[CAST_B_1]]
	; CHECK: [[ACC:%[^ ]+]] = call i32 @llvm.arm.smlad(i32 [[LD_A_1]], i32 [[LD_B_1]], i32 %acc)			; CHECK: [[RES:%[^ ]+]] = call i32 @llvm.arm.smlad(i32 [[LD_A_1]], i32 [[LD_B_1]], i32 [[ACC]])
	; CHECK: [[RES:%[^ ]+]] = call i32 @llvm.arm.smlad(i32 [[LD_A]], i32 [[LD_B]], i32 [[ACC]])
	; CHECK: ret i32 [[RES]]			; CHECK: ret i32 [[RES]]
	define i32 @overlap_1(i16* %a, i16* %b, i32 %acc) {			define i32 @overlap_1(i16* %a, i16* %b, i32 %acc) {
	entry:			entry:
	%addr.a.1 = getelementptr i16, i16* %a, i32 1			%addr.a.1 = getelementptr i16, i16* %a, i32 1
	%addr.b.1 = getelementptr i16, i16* %b, i32 1			%addr.b.1 = getelementptr i16, i16* %b, i32 1
	%ld.a.0 = load i16, i16* %a			%ld.a.0 = load i16, i16* %a
	%sext.a.0 = sext i16 %ld.a.0 to i32			%sext.a.0 = sext i16 %ld.a.0 to i32
	%ld.b.0 = load i16, i16* %b			%ld.b.0 = load i16, i16* %b
	[... 22 lines not shown ...]
	; this just increase register pressure unnecessarily?			; this just increase register pressure unnecessarily?
	; CHECK-LABEL: overlap_64_1			; CHECK-LABEL: overlap_64_1
	; CHECK: [[ADDR_A_1:%[^ ]+]] = getelementptr i16, i16* %a, i32 1			; CHECK: [[ADDR_A_1:%[^ ]+]] = getelementptr i16, i16* %a, i32 1
	; CHECK: [[ADDR_B_1:%[^ ]+]] = getelementptr i16, i16* %b, i32 1			; CHECK: [[ADDR_B_1:%[^ ]+]] = getelementptr i16, i16* %b, i32 1
	; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*			; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*
	; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]			; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]
	; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*			; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*
	; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]			; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]
				; CHECK: [[ACC:%[^ ]+]] = call i64 @llvm.arm.smlald(i32 [[LD_A]], i32 [[LD_B]], i64 %acc)
	; CHECK: [[CAST_A_1:%[^ ]+]] = bitcast i16* [[ADDR_A_1]] to i32*			; CHECK: [[CAST_A_1:%[^ ]+]] = bitcast i16* [[ADDR_A_1]] to i32*
	; CHECK: [[LD_A_1:%[^ ]+]] = load i32, i32* [[CAST_A_1]]			; CHECK: [[LD_A_1:%[^ ]+]] = load i32, i32* [[CAST_A_1]]
	; CHECK: [[CAST_B_1:%[^ ]+]] = bitcast i16* [[ADDR_B_1]] to i32*			; CHECK: [[CAST_B_1:%[^ ]+]] = bitcast i16* [[ADDR_B_1]] to i32*
	; CHECK: [[LD_B_1:%[^ ]+]] = load i32, i32* [[CAST_B_1]]			; CHECK: [[LD_B_1:%[^ ]+]] = load i32, i32* [[CAST_B_1]]
	; CHECK: [[ACC:%[^ ]+]] = call i64 @llvm.arm.smlald(i32 [[LD_A_1]], i32 [[LD_B_1]], i64 %acc)			; CHECK: [[RES:%[^ ]+]] = call i64 @llvm.arm.smlald(i32 [[LD_A_1]], i32 [[LD_B_1]], i64 [[ACC]])
	; CHECK: [[RES:%[^ ]+]] = call i64 @llvm.arm.smlald(i32 [[LD_A]], i32 [[LD_B]], i64 [[ACC]])
	; CHECK: ret i64 [[RES]]			; CHECK: ret i64 [[RES]]
	define i64 @overlap_64_1(i16* %a, i16* %b, i64 %acc) {			define i64 @overlap_64_1(i16* %a, i16* %b, i64 %acc) {
	entry:			entry:
	%addr.a.1 = getelementptr i16, i16* %a, i32 1			%addr.a.1 = getelementptr i16, i16* %a, i32 1
	%addr.b.1 = getelementptr i16, i16* %b, i32 1			%addr.b.1 = getelementptr i16, i16* %b, i32 1
	%ld.a.0 = load i16, i16* %a			%ld.a.0 = load i16, i16* %a
	%sext.a.0 = sext i16 %ld.a.0 to i32			%sext.a.0 = sext i16 %ld.a.0 to i32
	%ld.b.0 = load i16, i16* %b			%ld.b.0 = load i16, i16* %b
	[... 60 lines not shown ...]
	}			}

	; CHECK-LABEL: overlap_3			; CHECK-LABEL: overlap_3
	; CHECK: [[GEP_B:%[^ ]+]] = getelementptr i16, i16* %b, i32 1			; CHECK: [[GEP_B:%[^ ]+]] = getelementptr i16, i16* %b, i32 1
	; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*			; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*
	; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]			; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]
	; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*			; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*
	; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]			; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]
				; CHECK: [[SMLAD:%[^ ]+]] = call i32 @llvm.arm.smlad(i32 [[LD_A]], i32 [[LD_B]], i32 %acc)
	; CHECK: [[CAST_B_1:%[^ ]+]] = bitcast i16* [[GEP_B]] to i32*			; CHECK: [[CAST_B_1:%[^ ]+]] = bitcast i16* [[GEP_B]] to i32*
	; CHECK: [[LD_B_1:%[^ ]+]] = load i32, i32* [[CAST_B_1]]			; CHECK: [[LD_B_1:%[^ ]+]] = load i32, i32* [[CAST_B_1]]
	; CHECK: [[GEP_A:%[^ ]+]] = getelementptr i16, i16* %a, i32 2			; CHECK: [[GEP_A:%[^ ]+]] = getelementptr i16, i16* %a, i32 2
	; CHECK: [[CAST_A_2:%[^ ]+]] = bitcast i16* [[GEP_A]] to i32*			; CHECK: [[CAST_A_2:%[^ ]+]] = bitcast i16* [[GEP_A]] to i32*
	; CHECK: [[LD_A_2:%[^ ]+]] = load i32, i32* [[CAST_A_2]]			; CHECK: [[LD_A_2:%[^ ]+]] = load i32, i32* [[CAST_A_2]]
	; CHECK: [[SMLAD:%[^ ]+]] = call i32 @llvm.arm.smlad(i32 [[LD_A_2]], i32 [[LD_B_1]], i32 %acc)			; CHECK: [[RES:%[^ ]+]] = call i32 @llvm.arm.smlad(i32 [[LD_A_2]], i32 [[LD_B_1]], i32 [[SMLAD]])
	; CHECK: call i32 @llvm.arm.smlad(i32 [[LD_A]], i32 [[LD_B]], i32 [[SMLAD]])			; CHECK: ret i32 [[RES]]
	define i32 @overlap_3(i16* %a, i16* %b, i32 %acc) {			define i32 @overlap_3(i16* %a, i16* %b, i32 %acc) {
	entry:			entry:
	%addr.a.1 = getelementptr i16, i16* %a, i32 1			%addr.a.1 = getelementptr i16, i16* %a, i32 1
	%addr.b.1 = getelementptr i16, i16* %b, i32 1			%addr.b.1 = getelementptr i16, i16* %b, i32 1
	%ld.a.0 = load i16, i16* %a			%ld.a.0 = load i16, i16* %a
	%sext.a.0 = sext i16 %ld.a.0 to i32			%sext.a.0 = sext i16 %ld.a.0 to i32
	%ld.b.0 = load i16, i16* %b			%ld.b.0 = load i16, i16* %b
	%ld.a.1 = load i16, i16* %addr.a.1			%ld.a.1 = load i16, i16* %addr.a.1
	[... 22 lines not shown ...]
	}			}

	; CHECK-LABEL: overlap_4			; CHECK-LABEL: overlap_4
	; CHECK: [[GEP_B:%[^ ]+]] = getelementptr i16, i16* %b, i32 1			; CHECK: [[GEP_B:%[^ ]+]] = getelementptr i16, i16* %b, i32 1
	; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*			; CHECK: [[CAST_A:%[^ ]+]] = bitcast i16* %a to i32*
	; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]			; CHECK: [[LD_A:%[^ ]+]] = load i32, i32* [[CAST_A]]
	; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*			; CHECK: [[CAST_B:%[^ ]+]] = bitcast i16* %b to i32*
	; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]			; CHECK: [[LD_B:%[^ ]+]] = load i32, i32* [[CAST_B]]
				; CHECK: [[SMLAD:%[^ ]+]] = call i32 @llvm.arm.smlad(i32 [[LD_A]], i32 [[LD_B]], i32 %acc)
	; CHECK: [[CAST_B_1:%[^ ]+]] = bitcast i16* [[GEP_B]] to i32*			; CHECK: [[CAST_B_1:%[^ ]+]] = bitcast i16* [[GEP_B]] to i32*
	; CHECK: [[LD_B_1:%[^ ]+]] = load i32, i32* [[CAST_B_1]]			; CHECK: [[LD_B_1:%[^ ]+]] = load i32, i32* [[CAST_B_1]]
	; CHECK: [[GEP_A:%[^ ]+]] = getelementptr i16, i16* %a, i32 2			; CHECK: [[GEP_A:%[^ ]+]] = getelementptr i16, i16* %a, i32 2
	; CHECK: [[CAST_A_2:%[^ ]+]] = bitcast i16* [[GEP_A]] to i32*			; CHECK: [[CAST_A_2:%[^ ]+]] = bitcast i16* [[GEP_A]] to i32*
	; CHECK: [[LD_A_2:%[^ ]+]] = load i32, i32* [[CAST_A_2]]			; CHECK: [[LD_A_2:%[^ ]+]] = load i32, i32* [[CAST_A_2]]
	; CHECK: [[SMLAD:%[^ ]+]] = call i32 @llvm.arm.smladx(i32 [[LD_A_2]], i32 [[LD_B_1]], i32 %acc)			; CHECK: [[RES:%[^ ]+]] = call i32 @llvm.arm.smladx(i32 [[LD_A_2]], i32 [[LD_B_1]], i32 [[SMLAD]])
	; CHECK: call i32 @llvm.arm.smlad(i32 [[LD_A]], i32 [[LD_B]], i32 [[SMLAD]])			; CHECK: ret i32 [[RES]]
	define i32 @overlap_4(i16* %a, i16* %b, i32 %acc) {			define i32 @overlap_4(i16* %a, i16* %b, i32 %acc) {
	entry:			entry:
	%addr.a.1 = getelementptr i16, i16* %a, i32 1			%addr.a.1 = getelementptr i16, i16* %a, i32 1
	%addr.b.1 = getelementptr i16, i16* %b, i32 1			%addr.b.1 = getelementptr i16, i16* %b, i32 1
	%ld.a.0 = load i16, i16* %a			%ld.a.0 = load i16, i16* %a
	%sext.a.0 = sext i16 %ld.a.0 to i32			%sext.a.0 = sext i16 %ld.a.0 to i32
	%ld.b.0 = load i16, i16* %b			%ld.b.0 = load i16, i16* %b
	%ld.a.1 = load i16, i16* %addr.a.1			%ld.a.1 = load i16, i16* %addr.a.1
	[... 23 lines not shown ...]

test/CodeGen/ARM/ParallelDSP/pr43073.ll

	[... 9 lines not shown ...]
	; CHECK: [[MUL0:%[^ ]+]] = mul nsw i32 [[B_PLUS_1]], [[IN_MINUS_1]]			; CHECK: [[MUL0:%[^ ]+]] = mul nsw i32 [[B_PLUS_1]], [[IN_MINUS_1]]
	; CHECK: [[ADD0:%[^ ]+]] = add i32 [[MUL0]], %call			; CHECK: [[ADD0:%[^ ]+]] = add i32 [[MUL0]], %call
	; CHECK: [[ADDR_IN_MINUS_3:%[^ ]+]] = getelementptr inbounds i16, i16* %in, i32 -3			; CHECK: [[ADDR_IN_MINUS_3:%[^ ]+]] = getelementptr inbounds i16, i16* %in, i32 -3
	; CHECK: [[CAST_ADDR_IN_MINUS_3:%[^ ]+]] = bitcast i16* [[ADDR_IN_MINUS_3]] to i32*			; CHECK: [[CAST_ADDR_IN_MINUS_3:%[^ ]+]] = bitcast i16* [[ADDR_IN_MINUS_3]] to i32*
	; CHECK: [[IN_MINUS_3:%[^ ]+]] = load i32, i32* [[CAST_ADDR_IN_MINUS_3]], align 2			; CHECK: [[IN_MINUS_3:%[^ ]+]] = load i32, i32* [[CAST_ADDR_IN_MINUS_3]], align 2
	; CHECK: [[ADDR_B_PLUS_2:%[^ ]+]] = getelementptr inbounds i16, i16* %b, i32 2			; CHECK: [[ADDR_B_PLUS_2:%[^ ]+]] = getelementptr inbounds i16, i16* %b, i32 2
	; CHECK: [[CAST_ADDR_B_PLUS_2:%[^ ]+]] = bitcast i16* [[ADDR_B_PLUS_2]] to i32*			; CHECK: [[CAST_ADDR_B_PLUS_2:%[^ ]+]] = bitcast i16* [[ADDR_B_PLUS_2]] to i32*
	; CHECK: [[B_PLUS_2:%[^ ]+]] = load i32, i32* [[CAST_ADDR_B_PLUS_2]], align 2			; CHECK: [[B_PLUS_2:%[^ ]+]] = load i32, i32* [[CAST_ADDR_B_PLUS_2]], align 2
				; CHECK: [[ACC:%[^ ]+]] = call i32 @llvm.arm.smladx(i32 [[IN_MINUS_3]], i32 [[B_PLUS_2]], i32 [[ADD0]])
	; CHECK: [[ADDR_IN_MINUS_5:%[^ ]+]] = getelementptr inbounds i16, i16* %in, i32 -5			; CHECK: [[ADDR_IN_MINUS_5:%[^ ]+]] = getelementptr inbounds i16, i16* %in, i32 -5
	; CHECK: [[CAST_ADDR_IN_MINUS_5:%[^ ]+]] = bitcast i16* [[ADDR_IN_MINUS_5]] to i32*			; CHECK: [[CAST_ADDR_IN_MINUS_5:%[^ ]+]] = bitcast i16* [[ADDR_IN_MINUS_5]] to i32*
	; CHECK: [[IN_MINUS_5:%[^ ]+]] = load i32, i32* [[CAST_ADDR_IN_MINUS_5]], align 2			; CHECK: [[IN_MINUS_5:%[^ ]+]] = load i32, i32* [[CAST_ADDR_IN_MINUS_5]], align 2
	; CHECK: [[ADDR_B_PLUS_4:%[^ ]+]] = getelementptr inbounds i16, i16* %b, i32 4			; CHECK: [[ADDR_B_PLUS_4:%[^ ]+]] = getelementptr inbounds i16, i16* %b, i32 4
	; CHECK: [[CAST_ADDR_B_PLUS_4:%[^ ]+]] = bitcast i16* [[ADDR_B_PLUS_4]] to i32*			; CHECK: [[CAST_ADDR_B_PLUS_4:%[^ ]+]] = bitcast i16* [[ADDR_B_PLUS_4]] to i32*
	; CHECK: [[B_PLUS_4:%[^ ]+]] = load i32, i32* [[CAST_ADDR_B_PLUS_4]], align 2			; CHECK: [[B_PLUS_4:%[^ ]+]] = load i32, i32* [[CAST_ADDR_B_PLUS_4]], align 2
	; CHECK: [[ACC:%[^ ]+]] = call i32 @llvm.arm.smladx(i32 [[IN_MINUS_5]], i32 [[B_PLUS_4]], i32 [[ADD0]])			; CHECK: [[RES:%[^ ]+]] = call i32 @llvm.arm.smladx(i32 [[IN_MINUS_5]], i32 [[B_PLUS_4]], i32 [[ACC]])
	; CHECK: [[RES:%[^ ]+]] = call i32 @llvm.arm.smladx(i32 [[IN_MINUS_3]], i32 [[B_PLUS_2]], i32 [[ACC]])
	; CHECK: ret i32 [[RES]]			; CHECK: ret i32 [[RES]]
	define i32 @first_mul_invalid(i16* nocapture readonly %in, i16* nocapture readonly %b) {			define i32 @first_mul_invalid(i16* nocapture readonly %in, i16* nocapture readonly %b) {
	entry:			entry:
	%0 = load i16, i16* %in, align 2			%0 = load i16, i16* %in, align 2
	%conv = sext i16 %0 to i32			%conv = sext i16 %0 to i32
	%1 = load i16, i16* %b, align 2			%1 = load i16, i16* %b, align 2
	%conv2 = sext i16 %1 to i32			%conv2 = sext i16 %1 to i32
	%call = tail call i32 @bar(i32 %conv, i32 %conv2)			%call = tail call i32 @bar(i32 %conv, i32 %conv2)
	[... 49 lines not shown ...]
	; CHECK: [[B_PLUS_1:%[^ ]+]] = sext i16 [[LD_B_PLUS_1]] to i32			; CHECK: [[B_PLUS_1:%[^ ]+]] = sext i16 [[LD_B_PLUS_1]] to i32
	; CHECK: [[MUL0:%[^ ]+]] = mul nsw i32 [[B_PLUS_1]], [[IN_MINUS_1]]			; CHECK: [[MUL0:%[^ ]+]] = mul nsw i32 [[B_PLUS_1]], [[IN_MINUS_1]]
	; CHECK: [[ADDR_IN_MINUS_3:%[^ ]+]] = getelementptr inbounds i16, i16* %in, i32 -3			; CHECK: [[ADDR_IN_MINUS_3:%[^ ]+]] = getelementptr inbounds i16, i16* %in, i32 -3
	; CHECK: [[CAST_ADDR_IN_MINUS_3:%[^ ]+]] = bitcast i16* [[ADDR_IN_MINUS_3]] to i32*			; CHECK: [[CAST_ADDR_IN_MINUS_3:%[^ ]+]] = bitcast i16* [[ADDR_IN_MINUS_3]] to i32*
	; CHECK: [[IN_MINUS_3:%[^ ]+]] = load i32, i32* [[CAST_ADDR_IN_MINUS_3]], align 2			; CHECK: [[IN_MINUS_3:%[^ ]+]] = load i32, i32* [[CAST_ADDR_IN_MINUS_3]], align 2
	; CHECK: [[ADDR_B_PLUS_2:%[^ ]+]] = getelementptr inbounds i16, i16* %b, i32 2			; CHECK: [[ADDR_B_PLUS_2:%[^ ]+]] = getelementptr inbounds i16, i16* %b, i32 2
	; CHECK: [[CAST_ADDR_B_PLUS_2:%[^ ]+]] = bitcast i16* [[ADDR_B_PLUS_2]] to i32*			; CHECK: [[CAST_ADDR_B_PLUS_2:%[^ ]+]] = bitcast i16* [[ADDR_B_PLUS_2]] to i32*
	; CHECK: [[B_PLUS_2:%[^ ]+]] = load i32, i32* [[CAST_ADDR_B_PLUS_2]], align 2			; CHECK: [[B_PLUS_2:%[^ ]+]] = load i32, i32* [[CAST_ADDR_B_PLUS_2]], align 2
				; CHECK: [[ACC:%[^ ]+]] = call i32 @llvm.arm.smladx(i32 [[IN_MINUS_3]], i32 [[B_PLUS_2]], i32 [[MUL0]])
	; CHECK: [[ADDR_IN_MINUS_5:%[^ ]+]] = getelementptr inbounds i16, i16* %in, i32 -5			; CHECK: [[ADDR_IN_MINUS_5:%[^ ]+]] = getelementptr inbounds i16, i16* %in, i32 -5
	; CHECK: [[CAST_ADDR_IN_MINUS_5:%[^ ]+]] = bitcast i16* [[ADDR_IN_MINUS_5]] to i32*			; CHECK: [[CAST_ADDR_IN_MINUS_5:%[^ ]+]] = bitcast i16* [[ADDR_IN_MINUS_5]] to i32*
	; CHECK: [[IN_MINUS_5:%[^ ]+]] = load i32, i32* [[CAST_ADDR_IN_MINUS_5]], align 2			; CHECK: [[IN_MINUS_5:%[^ ]+]] = load i32, i32* [[CAST_ADDR_IN_MINUS_5]], align 2
	; CHECK: [[ADDR_B_PLUS_4:%[^ ]+]] = getelementptr inbounds i16, i16* %b, i32 4			; CHECK: [[ADDR_B_PLUS_4:%[^ ]+]] = getelementptr inbounds i16, i16* %b, i32 4
	; CHECK: [[CAST_ADDR_B_PLUS_4:%[^ ]+]] = bitcast i16* [[ADDR_B_PLUS_4]] to i32*			; CHECK: [[CAST_ADDR_B_PLUS_4:%[^ ]+]] = bitcast i16* [[ADDR_B_PLUS_4]] to i32*
	; CHECK: [[B_PLUS_4:%[^ ]+]] = load i32, i32* [[CAST_ADDR_B_PLUS_4]], align 2			; CHECK: [[B_PLUS_4:%[^ ]+]] = load i32, i32* [[CAST_ADDR_B_PLUS_4]], align 2
	; CHECK: [[ACC:%[^ ]+]] = call i32 @llvm.arm.smladx(i32 [[IN_MINUS_5]], i32 [[B_PLUS_4]], i32 [[MUL0]])			; CHECK: [[RES:%[^ ]+]] = call i32 @llvm.arm.smladx(i32 [[IN_MINUS_5]], i32 [[B_PLUS_4]], i32 [[ACC]])
	; CHECK: [[RES:%[^ ]+]] = call i32 @llvm.arm.smladx(i32 [[IN_MINUS_3]], i32 [[B_PLUS_2]], i32 [[ACC]])
	; CHECK: ret i32 [[RES]]			; CHECK: ret i32 [[RES]]
	define i32 @with_no_acc_input(i16* nocapture readonly %in, i16* nocapture readonly %b) {			define i32 @with_no_acc_input(i16* nocapture readonly %in, i16* nocapture readonly %b) {
	entry:			entry:
	%arrayidx3 = getelementptr inbounds i16, i16* %in, i32 -1			%arrayidx3 = getelementptr inbounds i16, i16* %in, i32 -1
	%ld.2 = load i16, i16* %arrayidx3, align 2			%ld.2 = load i16, i16* %arrayidx3, align 2
	%conv4 = sext i16 %ld.2 to i32			%conv4 = sext i16 %ld.2 to i32
	%arrayidx5 = getelementptr inbounds i16, i16* %b, i32 1			%arrayidx5 = getelementptr inbounds i16, i16* %b, i32 1
	%ld.3 = load i16, i16* %arrayidx5, align 2			%ld.3 = load i16, i16* %arrayidx5, align 2
	[... 45 lines not shown ...]
	; CHECK: [[SEXT1:%[^ ]+]] = sext i32 [[MUL0]] to i64			; CHECK: [[SEXT1:%[^ ]+]] = sext i32 [[MUL0]] to i64
	; CHECK: [[ADD0:%[^ ]+]] = add i64 %sext.0, [[SEXT1]]			; CHECK: [[ADD0:%[^ ]+]] = add i64 %sext.0, [[SEXT1]]
	; CHECK: [[ADDR_IN_MINUS_3:%[^ ]+]] = getelementptr inbounds i16, i16* %in, i32 -3			; CHECK: [[ADDR_IN_MINUS_3:%[^ ]+]] = getelementptr inbounds i16, i16* %in, i32 -3
	; CHECK: [[CAST_ADDR_IN_MINUS_3:%[^ ]+]] = bitcast i16* [[ADDR_IN_MINUS_3]] to i32*			; CHECK: [[CAST_ADDR_IN_MINUS_3:%[^ ]+]] = bitcast i16* [[ADDR_IN_MINUS_3]] to i32*
	; CHECK: [[IN_MINUS_3:%[^ ]+]] = load i32, i32* [[CAST_ADDR_IN_MINUS_3]], align 2			; CHECK: [[IN_MINUS_3:%[^ ]+]] = load i32, i32* [[CAST_ADDR_IN_MINUS_3]], align 2
	; CHECK: [[ADDR_B_PLUS_2:%[^ ]+]] = getelementptr inbounds i16, i16* %b, i32 2			; CHECK: [[ADDR_B_PLUS_2:%[^ ]+]] = getelementptr inbounds i16, i16* %b, i32 2
	; CHECK: [[CAST_ADDR_B_PLUS_2:%[^ ]+]] = bitcast i16* [[ADDR_B_PLUS_2]] to i32*			; CHECK: [[CAST_ADDR_B_PLUS_2:%[^ ]+]] = bitcast i16* [[ADDR_B_PLUS_2]] to i32*
	; CHECK: [[B_PLUS_2:%[^ ]+]] = load i32, i32* [[CAST_ADDR_B_PLUS_2]], align 2			; CHECK: [[B_PLUS_2:%[^ ]+]] = load i32, i32* [[CAST_ADDR_B_PLUS_2]], align 2
				; CHECK: [[ACC:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[IN_MINUS_3]], i32 [[B_PLUS_2]], i64 [[ADD0]])
	; CHECK: [[ADDR_IN_MINUS_5:%[^ ]+]] = getelementptr inbounds i16, i16* %in, i32 -5			; CHECK: [[ADDR_IN_MINUS_5:%[^ ]+]] = getelementptr inbounds i16, i16* %in, i32 -5
	; CHECK: [[CAST_ADDR_IN_MINUS_5:%[^ ]+]] = bitcast i16* [[ADDR_IN_MINUS_5]] to i32*			; CHECK: [[CAST_ADDR_IN_MINUS_5:%[^ ]+]] = bitcast i16* [[ADDR_IN_MINUS_5]] to i32*
	; CHECK: [[IN_MINUS_5:%[^ ]+]] = load i32, i32* [[CAST_ADDR_IN_MINUS_5]], align 2			; CHECK: [[IN_MINUS_5:%[^ ]+]] = load i32, i32* [[CAST_ADDR_IN_MINUS_5]], align 2
	; CHECK: [[ADDR_B_PLUS_4:%[^ ]+]] = getelementptr inbounds i16, i16* %b, i32 4			; CHECK: [[ADDR_B_PLUS_4:%[^ ]+]] = getelementptr inbounds i16, i16* %b, i32 4
	; CHECK: [[CAST_ADDR_B_PLUS_4:%[^ ]+]] = bitcast i16* [[ADDR_B_PLUS_4]] to i32*			; CHECK: [[CAST_ADDR_B_PLUS_4:%[^ ]+]] = bitcast i16* [[ADDR_B_PLUS_4]] to i32*
	; CHECK: [[B_PLUS_4:%[^ ]+]] = load i32, i32* [[CAST_ADDR_B_PLUS_4]], align 2			; CHECK: [[B_PLUS_4:%[^ ]+]] = load i32, i32* [[CAST_ADDR_B_PLUS_4]], align 2
	; CHECK: [[ACC:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[IN_MINUS_5]], i32 [[B_PLUS_4]], i64 [[ADD0]])			; CHECK: [[RES:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[IN_MINUS_5]], i32 [[B_PLUS_4]], i64 [[ACC]])
	; CHECK: [[RES:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[IN_MINUS_3]], i32 [[B_PLUS_2]], i64 [[ACC]])
	; CHECK: ret i64 [[RES]]			; CHECK: ret i64 [[RES]]
	define i64 @with_64bit_acc(i16* nocapture readonly %in, i16* nocapture readonly %b) {			define i64 @with_64bit_acc(i16* nocapture readonly %in, i16* nocapture readonly %b) {
	entry:			entry:
	%0 = load i16, i16* %in, align 2			%0 = load i16, i16* %in, align 2
	%conv = sext i16 %0 to i32			%conv = sext i16 %0 to i32
	%1 = load i16, i16* %b, align 2			%1 = load i16, i16* %b, align 2
	%conv2 = sext i16 %1 to i32			%conv2 = sext i16 %1 to i32
	%call = tail call i32 @bar(i32 %conv, i32 %conv2)			%call = tail call i32 @bar(i32 %conv, i32 %conv2)
	; CHECK: [[SEXT_MUL0:%[^ ]+]] = sext i32 [[MUL0]] to i64			; CHECK: [[SEXT_MUL0:%[^ ]+]] = sext i32 [[MUL0]] to i64
	; CHECK: [[ADD_1:%[^ ]+]] = add nsw i64 %sum.3758.unr, [[SEXT_MUL0]]			; CHECK: [[ADD_1:%[^ ]+]] = add nsw i64 %sum.3758.unr, [[SEXT_MUL0]]
	; CHECK: [[X_PLUS_2:%[^ ]+]] = getelementptr inbounds i16, i16* %px.10756.unr, i32 2			; CHECK: [[X_PLUS_2:%[^ ]+]] = getelementptr inbounds i16, i16* %px.10756.unr, i32 2
	; CHECK: [[X_1:%[^ ]+]] = load i16, i16* [[ADDR_X_PLUS_1]], align 2			; CHECK: [[X_1:%[^ ]+]] = load i16, i16* [[ADDR_X_PLUS_1]], align 2
	; CHECK: [[SEXT_X_1:%[^ ]+]] = sext i16 [[X_1]] to i32			; CHECK: [[SEXT_X_1:%[^ ]+]] = sext i16 [[X_1]] to i32
	; CHECK: [[Y_1:%[^ ]+]] = load i16, i16* [[ADDR_Y_MINUS_1]], align 2			; CHECK: [[Y_1:%[^ ]+]] = load i16, i16* [[ADDR_Y_MINUS_1]], align 2
	; CHECK: [[SEXT_Y_1:%[^ ]+]] = sext i16 [[Y_1]] to i32			; CHECK: [[SEXT_Y_1:%[^ ]+]] = sext i16 [[Y_1]] to i32
	; CHECK: [[UNPAIRED:%[^ ]+]] = mul nsw i32 [[SEXT_Y_1]], [[SEXT_X_1]]			; CHECK: [[UNPAIRED:%[^ ]+]] = mul nsw i32 [[SEXT_Y_1]], [[SEXT_X_1]]
				; CHECK: [[SEXT:%[^ ]+]] = sext i32 [[UNPAIRED]] to i64
				; CHECK: [[ACC:%[^ ]+]] = add i64 [[SEXT]], [[ADD_1]]
	; CHECK: [[ADDR_X_PLUS_2:%[^ ]+]] = bitcast i16* [[X_PLUS_2]] to i32*			; CHECK: [[ADDR_X_PLUS_2:%[^ ]+]] = bitcast i16* [[X_PLUS_2]] to i32*
	; CHECK: [[X_2:%[^ ]+]] = load i32, i32* [[ADDR_X_PLUS_2]], align 2			; CHECK: [[X_2:%[^ ]+]] = load i32, i32* [[ADDR_X_PLUS_2]], align 2
	; CHECK: [[Y_MINUS_3:%[^ ]+]] = getelementptr inbounds i16, i16* %py.8757.unr, i32 -3			; CHECK: [[Y_MINUS_3:%[^ ]+]] = getelementptr inbounds i16, i16* %py.8757.unr, i32 -3
	; CHECK: [[ADDR_Y_MINUS_3:%[^ ]+]] = bitcast i16* [[Y_MINUS_3]] to i32*			; CHECK: [[ADDR_Y_MINUS_3:%[^ ]+]] = bitcast i16* [[Y_MINUS_3]] to i32*
	; CHECK: [[Y_3:%[^ ]+]] = load i32, i32* [[ADDR_Y_MINUS_3]], align 2			; CHECK: [[Y_3:%[^ ]+]] = load i32, i32* [[ADDR_Y_MINUS_3]], align 2
	; CHECK: [[SEXT:%[^ ]+]] = sext i32 [[UNPAIRED]] to i64
	; CHECK: [[ACC:%[^ ]+]] = add i64 [[SEXT]], [[ADD_1]]
	; CHECK: [[RES:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[Y_3]], i32 [[X_2]], i64 [[ACC]])			; CHECK: [[RES:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[Y_3]], i32 [[X_2]], i64 [[ACC]])
	; CHECK: ret i64 [[RES]]			; CHECK: ret i64 [[RES]]
	define i64 @with_64bit_add_acc(i16* nocapture readonly %px.10756.unr, i16* nocapture readonly %py.8757.unr, i32 %acc) {			define i64 @with_64bit_add_acc(i16* nocapture readonly %px.10756.unr, i16* nocapture readonly %py.8757.unr, i32 %acc) {
	entry:			entry:
	%sum.3758.unr = sext i32 %acc to i64			%sum.3758.unr = sext i32 %acc to i64
	br label %bb.1			br label %bb.1

	bb.1:			bb.1:

test/CodeGen/ARM/ParallelDSP/smlad11.ll

	; REQUIRES: asserts			; REQUIRES: asserts
	; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m33 < %s -arm-parallel-dsp -S -stats 2>&1 \| FileCheck %s			; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m33 < %s -arm-parallel-dsp -S -stats 2>&1 \| FileCheck %s
	;			;
	; A more complicated chain: 4 mul operations, so we expect 2 smlad calls.			; A more complicated chain: 4 mul operations, so we expect 2 smlad calls.
	;			;
	; CHECK: %mac1{{\.}}054 = phi i32 [ [[V17:%[0-9]+]], %for.body ], [ 0, %for.body.preheader ]			; CHECK: %mac1{{\.}}054 = phi i32 [ [[V17:%[0-9]+]], %for.body ], [ 0, %for.body.preheader ]
	; CHECK: [[V10:%[0-9]+]] = bitcast i16* %arrayidx to i32*			; CHECK: [[V10:%[0-9]+]] = bitcast i16* %arrayidx to i32*
	; CHECK: [[V11:%[0-9]+]] = load i32, i32* [[V10]], align 2			; CHECK: [[V11:%[0-9]+]] = load i32, i32* [[V10]], align 2
	; CHECK: [[V15:%[0-9]+]] = bitcast i16* %arrayidx4 to i32*			; CHECK: [[V15:%[0-9]+]] = bitcast i16* %arrayidx4 to i32*
	; CHECK: [[V16:%[0-9]+]] = load i32, i32* [[V15]], align 2			; CHECK: [[V16:%[0-9]+]] = load i32, i32* [[V15]], align 2
	; CHECK: [[V8:%[0-9]+]] = bitcast i16* %arrayidx8 to i32*			; CHECK: [[V8:%[0-9]+]] = bitcast i16* %arrayidx8 to i32*
	; CHECK: [[V9:%[0-9]+]] = load i32, i32* [[V8]], align 2			; CHECK: [[V9:%[0-9]+]] = load i32, i32* [[V8]], align 2
				; CHECK: [[ACC:%[0-9]+]] = call i32 @llvm.arm.smlad(i32 [[V9]], i32 [[V11]], i32 %mac1{{\.}}054)
	; CHECK: [[V13:%[0-9]+]] = bitcast i16* %arrayidx17 to i32*			; CHECK: [[V13:%[0-9]+]] = bitcast i16* %arrayidx17 to i32*
	; CHECK: [[V14:%[0-9]+]] = load i32, i32* [[V13]], align 2			; CHECK: [[V14:%[0-9]+]] = load i32, i32* [[V13]], align 2
	; CHECK: [[V12:%[0-9]+]] = call i32 @llvm.arm.smlad(i32 [[V14]], i32 [[V16]], i32 %mac1{{\.}}054)			; CHECK: [[V12:%[0-9]+]] = call i32 @llvm.arm.smlad(i32 [[V14]], i32 [[V16]], i32 [[ACC]])
	; CHECK: [[V17:%[0-9]+]] = call i32 @llvm.arm.smlad(i32 [[V9]], i32 [[V11]], i32 [[V12]])
	;			;
	; And we don't want to see a 3rd smlad:			; And we don't want to see a 3rd smlad:
	; CHECK-NOT: call i32 @llvm.arm.smlad			; CHECK-NOT: call i32 @llvm.arm.smlad
	;			;
	; CHECK: 2 arm-parallel-dsp - Number of smlad instructions generated			; CHECK: 2 arm-parallel-dsp - Number of smlad instructions generated
	;			;
	define dso_local i32 @test(i32 %arg, i32* nocapture readnone %arg1, i16* nocapture readonly %arg2, i16* nocapture readonly %arg3) {			define dso_local i32 @test(i32 %arg, i32* nocapture readnone %arg1, i16* nocapture readonly %arg2, i16* nocapture readonly %arg3) {
	entry:			entry:

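The smlad11.ll checks above change which `smlad` call consumes the loop-carried phi, not what the chain computes. A minimal sketch of why the two orders agree, assuming the usual reading of SMLAD (accumulator plus the two lane-wise signed 16-bit products, wrapping modulo 2^32). The names `pack`, `chain_old` and `chain_new` are hypothetical helpers for illustration only; they are not part of the test or of ARMParallelDSP.cpp.

```cpp
#include <cstdint>

// Pack two signed 16-bit lanes into one 32-bit word (lo in bits 0..15).
inline uint32_t pack(int16_t lo, int16_t hi) {
  return (static_cast<uint32_t>(static_cast<uint16_t>(hi)) << 16) |
         static_cast<uint32_t>(static_cast<uint16_t>(lo));
}

// Assumed reference model of SMLAD: acc + lo(a)*lo(b) + hi(a)*hi(b).
// The sum is formed in unsigned arithmetic so that it wraps modulo 2^32.
inline int32_t smlad(uint32_t a, uint32_t b, int32_t acc) {
  const int32_t a_lo = static_cast<int16_t>(a & 0xFFFFu);
  const int32_t a_hi = static_cast<int16_t>(a >> 16);
  const int32_t b_lo = static_cast<int16_t>(b & 0xFFFFu);
  const int32_t b_hi = static_cast<int16_t>(b >> 16);
  const uint32_t sum = static_cast<uint32_t>(acc) +
                       static_cast<uint32_t>(a_lo * b_lo) +
                       static_cast<uint32_t>(a_hi * b_hi);
  return static_cast<int32_t>(sum);
}

// Old CHECK order: the (V14, V16) pair takes the phi, then (V9, V11).
inline int32_t chain_old(uint32_t v9, uint32_t v11, uint32_t v14,
                         uint32_t v16, int32_t mac) {
  const int32_t v12 = smlad(v14, v16, mac);
  return smlad(v9, v11, v12);
}

// New CHECK order: (V9, V11) takes the phi as soon as its loads exist,
// and the (V14, V16) pair accumulates on top of that result.
inline int32_t chain_new(uint32_t v9, uint32_t v11, uint32_t v14,
                         uint32_t v16, int32_t mac) {
  const int32_t acc = smlad(v9, v11, mac);
  return smlad(v14, v16, acc);
}
```

Because the additions wrap, either order produces the same value; only the position of each call relative to its loads differs, which is the property the summary credits with possibly reducing register pressure.
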
test/CodeGen/ARM/ParallelDSP/smladx-1.ll

	; RUN: opt -mtriple=thumbv8m.main -mcpu=cortex-m33 -arm-parallel-dsp %s -S -o - \| FileCheck %s			; RUN: opt -mtriple=thumbv8m.main -mcpu=cortex-m33 -arm-parallel-dsp %s -S -o - \| FileCheck %s
	; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m0 < %s -arm-parallel-dsp -S \| FileCheck %s --check-prefix=CHECK-UNSUPPORTED			; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m0 < %s -arm-parallel-dsp -S \| FileCheck %s --check-prefix=CHECK-UNSUPPORTED
	; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m33 -mattr=-dsp < %s -arm-parallel-dsp -S \| FileCheck %s --check-prefix=CHECK-UNSUPPORTED			; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m33 -mattr=-dsp < %s -arm-parallel-dsp -S \| FileCheck %s --check-prefix=CHECK-UNSUPPORTED
	; RUN: opt -mtriple=armeb-arm-eabi -mcpu=cortex-m33 < %s -arm-parallel-dsp -S \| FileCheck %s --check-prefix=CHECK-UNSUPPORTED			; RUN: opt -mtriple=armeb-arm-eabi -mcpu=cortex-m33 < %s -arm-parallel-dsp -S \| FileCheck %s --check-prefix=CHECK-UNSUPPORTED

	define i32 @smladx(i16* nocapture readonly %pIn1, i16* nocapture readonly %pIn2, i32 %j, i32 %limit) {			define i32 @smladx(i16* nocapture readonly %pIn1, i16* nocapture readonly %pIn2, i32 %j, i32 %limit) {

	; CHECK-LABEL: smladx			; CHECK-LABEL: smladx
	; CHECK: = phi i32 [ 0, %for.body.preheader.new ],			; CHECK: = phi i32 [ 0, %for.body.preheader.new ],
	; CHECK: [[ACC0:%[^ ]+]] = phi i32 [ 0, %for.body.preheader.new ], [ [[ACC2:%[^ ]+]], %for.body ]			; CHECK: [[ACC0:%[^ ]+]] = phi i32 [ 0, %for.body.preheader.new ], [ [[ACC2:%[^ ]+]], %for.body ]
	; CHECK: [[PIN21:%[^ ]+]] = bitcast i16* %pIn2.1 to i32*			; CHECK: [[PIN21:%[^ ]+]] = bitcast i16* %pIn2.1 to i32*
	; CHECK: [[IN21:%[^ ]+]] = load i32, i32* [[PIN21]], align 2			; CHECK: [[IN21:%[^ ]+]] = load i32, i32* [[PIN21]], align 2
	; CHECK: [[PIN10:%[^ ]+]] = bitcast i16* %pIn1.0 to i32*			; CHECK: [[PIN10:%[^ ]+]] = bitcast i16* %pIn1.0 to i32*
	; CHECK: [[IN10:%[^ ]+]] = load i32, i32* [[PIN10]], align 2			; CHECK: [[IN10:%[^ ]+]] = load i32, i32* [[PIN10]], align 2
				; CHECK: [[ACC1:%[^ ]+]] = call i32 @llvm.arm.smladx(i32 [[IN21]], i32 [[IN10]], i32 [[ACC0]])

	; CHECK: [[PIN23:%[^ ]+]] = bitcast i16* %pIn2.3 to i32*			; CHECK: [[PIN23:%[^ ]+]] = bitcast i16* %pIn2.3 to i32*
	; CHECK: [[IN23:%[^ ]+]] = load i32, i32* [[PIN23]], align 2			; CHECK: [[IN23:%[^ ]+]] = load i32, i32* [[PIN23]], align 2
	; CHECK: [[PIN12:%[^ ]+]] = bitcast i16* %pIn1.2 to i32*			; CHECK: [[PIN12:%[^ ]+]] = bitcast i16* %pIn1.2 to i32*
	; CHECK: [[IN12:%[^ ]+]] = load i32, i32* [[PIN12]], align 2			; CHECK: [[IN12:%[^ ]+]] = load i32, i32* [[PIN12]], align 2
	; CHECK: [[ACC1:%[^ ]+]] = call i32 @llvm.arm.smladx(i32 [[IN23]], i32 [[IN12]], i32 [[ACC0]])			; CHECK: [[ACC2]] = call i32 @llvm.arm.smladx(i32 [[IN23]], i32 [[IN12]], i32 [[ACC1]])
	; CHECK: [[ACC2]] = call i32 @llvm.arm.smladx(i32 [[IN21]], i32 [[IN10]], i32 [[ACC1]])
	; CHECK-NOT: call i32 @llvm.arm.smlad			; CHECK-NOT: call i32 @llvm.arm.smlad
	; CHECK-UNSUPPORTED-NOT: call i32 @llvm.arm.smlad			; CHECK-UNSUPPORTED-NOT: call i32 @llvm.arm.smlad

	entry:			entry:
	%cmp9 = icmp eq i32 %limit, 0			%cmp9 = icmp eq i32 %limit, 0
	br i1 %cmp9, label %for.cond.cleanup, label %for.body.preheader			br i1 %cmp9, label %for.cond.cleanup, label %for.body.preheader

	for.body.preheader:			for.body.preheader:
	; CHECK: [[ACC0:%[^ ]+]] = phi i32 [ 0, %for.body.preheader.new ], [ [[ACC2:%[^ ]+]], %for.body ]			; CHECK: [[ACC0:%[^ ]+]] = phi i32 [ 0, %for.body.preheader.new ], [ [[ACC2:%[^ ]+]], %for.body ]

	; CHECK: [[PIN2_CAST:%[^ ]+]] = bitcast i16* [[PIN2]] to i32*			; CHECK: [[PIN2_CAST:%[^ ]+]] = bitcast i16* [[PIN2]] to i32*
	; CHECK: [[IN2:%[^ ]+]] = load i32, i32* [[PIN2_CAST]], align 2			; CHECK: [[IN2:%[^ ]+]] = load i32, i32* [[PIN2_CAST]], align 2

	; CHECK: [[PIN1_2:%[^ ]+]] = getelementptr i16, i16* [[PIN1]], i32 -2			; CHECK: [[PIN1_2:%[^ ]+]] = getelementptr i16, i16* [[PIN1]], i32 -2
	; CHECK: [[PIN1_2_CAST:%[^ ]+]] = bitcast i16* [[PIN1_2]] to i32*			; CHECK: [[PIN1_2_CAST:%[^ ]+]] = bitcast i16* [[PIN1_2]] to i32*
	; CHECK: [[IN1_2:%[^ ]+]] = load i32, i32* [[PIN1_2_CAST]], align 2			; CHECK: [[IN1_2:%[^ ]+]] = load i32, i32* [[PIN1_2_CAST]], align 2
				; CHECK: [[ACC1:%[^ ]+]] = call i32 @llvm.arm.smladx(i32 [[IN2]], i32 [[IN1_2]], i32 [[ACC0]])

	; CHECK: [[PIN2_2:%[^ ]+]] = getelementptr i16, i16* [[PIN2]], i32 -2			; CHECK: [[PIN2_2:%[^ ]+]] = getelementptr i16, i16* [[PIN2]], i32 -2
	; CHECK: [[PIN2_2_CAST:%[^ ]+]] = bitcast i16* [[PIN2_2]] to i32*			; CHECK: [[PIN2_2_CAST:%[^ ]+]] = bitcast i16* [[PIN2_2]] to i32*
	; CHECK: [[IN2_2:%[^ ]+]] = load i32, i32* [[PIN2_2_CAST]], align 2			; CHECK: [[IN2_2:%[^ ]+]] = load i32, i32* [[PIN2_2_CAST]], align 2

	; CHECK: [[PIN1_CAST:%[^ ]+]] = bitcast i16* [[PIN1]] to i32*			; CHECK: [[PIN1_CAST:%[^ ]+]] = bitcast i16* [[PIN1]] to i32*
	; CHECK: [[IN1:%[^ ]+]] = load i32, i32* [[PIN1_CAST]], align 2			; CHECK: [[IN1:%[^ ]+]] = load i32, i32* [[PIN1_CAST]], align 2

	; CHECK: [[ACC1:%[^ ]+]] = call i32 @llvm.arm.smladx(i32 [[IN2_2]], i32 [[IN1]], i32 [[ACC0]])			; CHECK: [[ACC2]] = call i32 @llvm.arm.smladx(i32 [[IN2_2]], i32 [[IN1]], i32 [[ACC1]])
	; CHECK: [[ACC2]] = call i32 @llvm.arm.smladx(i32 [[IN2]], i32 [[IN1_2]], i32 [[ACC1]])

	; CHECK: [[PIN1_NEXT]] = getelementptr i16, i16* [[PIN1]], i32 4			; CHECK: [[PIN1_NEXT]] = getelementptr i16, i16* [[PIN1]], i32 4
	; CHECK: [[PIN2_NEXT]] = getelementptr i16, i16* [[PIN2]], i32 -4			; CHECK: [[PIN2_NEXT]] = getelementptr i16, i16* [[PIN2]], i32 -4

	; CHECK-NOT: call i32 @llvm.arm.smlad			; CHECK-NOT: call i32 @llvm.arm.smlad
	; CHECK-UNSUPPORTED-NOT: call i32 @llvm.arm.smlad			; CHECK-UNSUPPORTED-NOT: call i32 @llvm.arm.smlad

	entry:			entry:

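The smladx-1.ll checks above, and the smlaldx tests that follow, use the exchanging variants of the intrinsic. A minimal sketch of those semantics, assuming the second operand's halves are swapped before the two multiplies and that `smlaldx` differs only in using a 64-bit accumulator. The helper names `lo16` and `hi16` are hypothetical and appear nowhere in the patch.

```cpp
#include <cstdint>

// Extract the signed low / high 16-bit lane of a packed 32-bit word.
inline int32_t lo16(uint32_t w) { return static_cast<int16_t>(w & 0xFFFFu); }
inline int32_t hi16(uint32_t w) { return static_cast<int16_t>(w >> 16); }

// Assumed model of SMLADX: acc + lo(a)*hi(b) + hi(a)*lo(b), wrapping mod 2^32.
inline int32_t smladx(uint32_t a, uint32_t b, int32_t acc) {
  const uint32_t sum = static_cast<uint32_t>(acc) +
                       static_cast<uint32_t>(lo16(a) * hi16(b)) +
                       static_cast<uint32_t>(hi16(a) * lo16(b));
  return static_cast<int32_t>(sum);
}

// Assumed model of SMLALDX: the same cross products, 64-bit accumulator.
inline int64_t smlaldx(uint32_t a, uint32_t b, int64_t acc) {
  const int64_t p0 = static_cast<int64_t>(lo16(a)) * hi16(b);
  const int64_t p1 = static_cast<int64_t>(hi16(a)) * lo16(b);
  const uint64_t sum = static_cast<uint64_t>(acc) +
                       static_cast<uint64_t>(p0) +
                       static_cast<uint64_t>(p1);
  return static_cast<int64_t>(sum);
}
```

As with the non-exchanging form, the accumulation is a plain wrapping sum, so threading `ACC0` through the first call rather than the second does not change the final `ACC2`.
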
test/CodeGen/ARM/ParallelDSP/smlaldx-1.ll

	; RUN: opt -mtriple=thumbv8m.main -mcpu=cortex-m33 -arm-parallel-dsp %s -S -o - \| FileCheck %s			; RUN: opt -mtriple=thumbv8m.main -mcpu=cortex-m33 -arm-parallel-dsp %s -S -o - \| FileCheck %s
	; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m0 < %s -arm-parallel-dsp -S \| FileCheck %s --check-prefix=CHECK-UNSUPPORTED			; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m0 < %s -arm-parallel-dsp -S \| FileCheck %s --check-prefix=CHECK-UNSUPPORTED
	; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m33 -mattr=-dsp < %s -arm-parallel-dsp -S \| FileCheck %s --check-prefix=CHECK-UNSUPPORTED			; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m33 -mattr=-dsp < %s -arm-parallel-dsp -S \| FileCheck %s --check-prefix=CHECK-UNSUPPORTED

	define i64 @smlaldx(i16* nocapture readonly %pIn1, i16* nocapture readonly %pIn2, i32 %j, i32 %limit) {			define i64 @smlaldx(i16* nocapture readonly %pIn1, i16* nocapture readonly %pIn2, i32 %j, i32 %limit) {

	; CHECK-LABEL: smlaldx			; CHECK-LABEL: smlaldx
	; CHECK: = phi i32 [ 0, %for.body.preheader.new ],			; CHECK: = phi i32 [ 0, %for.body.preheader.new ],
	; CHECK: [[ACC0:%[^ ]+]] = phi i64 [ 0, %for.body.preheader.new ], [ [[ACC2:%[^ ]+]], %for.body ]			; CHECK: [[ACC0:%[^ ]+]] = phi i64 [ 0, %for.body.preheader.new ], [ [[ACC2:%[^ ]+]], %for.body ]
	; CHECK: [[PIN21:%[^ ]+]] = bitcast i16* %pIn2.1 to i32*			; CHECK: [[PIN21:%[^ ]+]] = bitcast i16* %pIn2.1 to i32*
	; CHECK: [[IN21:%[^ ]+]] = load i32, i32* [[PIN21]], align 2			; CHECK: [[IN21:%[^ ]+]] = load i32, i32* [[PIN21]], align 2
	; CHECK: [[PIN10:%[^ ]+]] = bitcast i16* %pIn1.0 to i32*			; CHECK: [[PIN10:%[^ ]+]] = bitcast i16* %pIn1.0 to i32*
	; CHECK: [[IN10:%[^ ]+]] = load i32, i32* [[PIN10]], align 2			; CHECK: [[IN10:%[^ ]+]] = load i32, i32* [[PIN10]], align 2
				; CHECK: [[ACC1:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[IN21]], i32 [[IN10]], i64 [[ACC0]])
	; CHECK: [[PIN23:%[^ ]+]] = bitcast i16* %pIn2.3 to i32*			; CHECK: [[PIN23:%[^ ]+]] = bitcast i16* %pIn2.3 to i32*
	; CHECK: [[IN23:%[^ ]+]] = load i32, i32* [[PIN23]], align 2			; CHECK: [[IN23:%[^ ]+]] = load i32, i32* [[PIN23]], align 2
	; CHECK: [[PIN12:%[^ ]+]] = bitcast i16* %pIn1.2 to i32*			; CHECK: [[PIN12:%[^ ]+]] = bitcast i16* %pIn1.2 to i32*
	; CHECK: [[IN12:%[^ ]+]] = load i32, i32* [[PIN12]], align 2			; CHECK: [[IN12:%[^ ]+]] = load i32, i32* [[PIN12]], align 2
	; CHECK: [[ACC1:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[IN23]], i32 [[IN12]], i64 [[ACC0]])			; CHECK: [[ACC2]] = call i64 @llvm.arm.smlaldx(i32 [[IN23]], i32 [[IN12]], i64 [[ACC1]])
	; CHECK: [[ACC2]] = call i64 @llvm.arm.smlaldx(i32 [[IN21]], i32 [[IN10]], i64 [[ACC1]])
	; CHECK-NOT: call i64 @llvm.arm.smlad			; CHECK-NOT: call i64 @llvm.arm.smlad
	; CHECK-UNSUPPORTED-NOT: call i64 @llvm.arm.smlad			; CHECK-UNSUPPORTED-NOT: call i64 @llvm.arm.smlad

	entry:			entry:
	%cmp9 = icmp eq i32 %limit, 0			%cmp9 = icmp eq i32 %limit, 0
	br i1 %cmp9, label %for.cond.cleanup, label %for.body.preheader			br i1 %cmp9, label %for.cond.cleanup, label %for.body.preheader

	for.body.preheader:			for.body.preheader:
	; CHECK: [[ACC0:%[^ ]+]] = phi i64 [ 0, %for.body.preheader.new ], [ [[ACC2:%[^ ]+]], %for.body ]			; CHECK: [[ACC0:%[^ ]+]] = phi i64 [ 0, %for.body.preheader.new ], [ [[ACC2:%[^ ]+]], %for.body ]

	; CHECK: [[PIN2_CAST:%[^ ]+]] = bitcast i16* [[PIN2]] to i32*			; CHECK: [[PIN2_CAST:%[^ ]+]] = bitcast i16* [[PIN2]] to i32*
	; CHECK: [[IN2:%[^ ]+]] = load i32, i32* [[PIN2_CAST]], align 2			; CHECK: [[IN2:%[^ ]+]] = load i32, i32* [[PIN2_CAST]], align 2

	; CHECK: [[PIN1_2:%[^ ]+]] = getelementptr i16, i16* [[PIN1]], i32 -2			; CHECK: [[PIN1_2:%[^ ]+]] = getelementptr i16, i16* [[PIN1]], i32 -2
	; CHECK: [[PIN1_2_CAST:%[^ ]+]] = bitcast i16* [[PIN1_2]] to i32*			; CHECK: [[PIN1_2_CAST:%[^ ]+]] = bitcast i16* [[PIN1_2]] to i32*
	; CHECK: [[IN1_2:%[^ ]+]] = load i32, i32* [[PIN1_2_CAST]], align 2			; CHECK: [[IN1_2:%[^ ]+]] = load i32, i32* [[PIN1_2_CAST]], align 2
				; CHECK: [[ACC1:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[IN2]], i32 [[IN1_2]], i64 [[ACC0]])

	; CHECK: [[PIN2_2:%[^ ]+]] = getelementptr i16, i16* [[PIN2]], i32 -2			; CHECK: [[PIN2_2:%[^ ]+]] = getelementptr i16, i16* [[PIN2]], i32 -2
	; CHECK: [[PIN2_2_CAST:%[^ ]+]] = bitcast i16* [[PIN2_2]] to i32*			; CHECK: [[PIN2_2_CAST:%[^ ]+]] = bitcast i16* [[PIN2_2]] to i32*
	; CHECK: [[IN2_2:%[^ ]+]] = load i32, i32* [[PIN2_2_CAST]], align 2			; CHECK: [[IN2_2:%[^ ]+]] = load i32, i32* [[PIN2_2_CAST]], align 2

	; CHECK: [[PIN1_CAST:%[^ ]+]] = bitcast i16* [[PIN1]] to i32*			; CHECK: [[PIN1_CAST:%[^ ]+]] = bitcast i16* [[PIN1]] to i32*
	; CHECK: [[IN1:%[^ ]+]] = load i32, i32* [[PIN1_CAST]], align 2			; CHECK: [[IN1:%[^ ]+]] = load i32, i32* [[PIN1_CAST]], align 2
				; CHECK: [[ACC2]] = call i64 @llvm.arm.smlaldx(i32 [[IN2_2]], i32 [[IN1]], i64 [[ACC1]])
	; CHECK: [[ACC1:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[IN2_2]], i32 [[IN1]], i64 [[ACC0]])
	; CHECK: [[ACC2]] = call i64 @llvm.arm.smlaldx(i32 [[IN2]], i32 [[IN1_2]], i64 [[ACC1]])

	; CHECK: [[PIN1_NEXT]] = getelementptr i16, i16* [[PIN1]], i32 4			; CHECK: [[PIN1_NEXT]] = getelementptr i16, i16* [[PIN1]], i32 4
	; CHECK: [[PIN2_NEXT]] = getelementptr i16, i16* [[PIN2]], i32 -4			; CHECK: [[PIN2_NEXT]] = getelementptr i16, i16* [[PIN2]], i32 -4

	; CHECK-NOT: call i64 @llvm.arm.smlad			; CHECK-NOT: call i64 @llvm.arm.smlad
	; CHECK-UNSUPPORTED-NOT: call i64 @llvm.arm.smlad			; CHECK-UNSUPPORTED-NOT: call i64 @llvm.arm.smlad

	for.body:			for.body:

test/CodeGen/ARM/ParallelDSP/smlaldx-2.ll

	; RUN: opt -mtriple=thumbv8m.main -mcpu=cortex-m33 -arm-parallel-dsp %s -S -o - \| FileCheck %s			; RUN: opt -mtriple=thumbv8m.main -mcpu=cortex-m33 -arm-parallel-dsp %s -S -o - \| FileCheck %s
	; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m0 < %s -arm-parallel-dsp -S \| FileCheck %s --check-prefix=CHECK-UNSUPPORTED			; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m0 < %s -arm-parallel-dsp -S \| FileCheck %s --check-prefix=CHECK-UNSUPPORTED
	; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m33 -mattr=-dsp < %s -arm-parallel-dsp -S \| FileCheck %s --check-prefix=CHECK-UNSUPPORTED			; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m33 -mattr=-dsp < %s -arm-parallel-dsp -S \| FileCheck %s --check-prefix=CHECK-UNSUPPORTED

	define i64 @smlaldx(i16* nocapture readonly %pIn1, i16* nocapture readonly %pIn2, i32 %j, i32 %limit) {			define i64 @smlaldx(i16* nocapture readonly %pIn1, i16* nocapture readonly %pIn2, i32 %j, i32 %limit) {

	; CHECK-LABEL: smlaldx			; CHECK-LABEL: smlaldx
	; CHECK: = phi i32 [ 0, %for.body.preheader.new ],			; CHECK: = phi i32 [ 0, %for.body.preheader.new ],
	; CHECK: [[ACC0:%[^ ]+]] = phi i64 [ 0, %for.body.preheader.new ], [ [[ACC2:%[^ ]+]], %for.body ]			; CHECK: [[ACC0:%[^ ]+]] = phi i64 [ 0, %for.body.preheader.new ], [ [[ACC2:%[^ ]+]], %for.body ]
	; CHECK: [[PIN21:%[^ ]+]] = bitcast i16* %pIn2.1 to i32*			; CHECK: [[PIN21:%[^ ]+]] = bitcast i16* %pIn2.1 to i32*
	; CHECK: [[IN21:%[^ ]+]] = load i32, i32* [[PIN21]], align 2			; CHECK: [[IN21:%[^ ]+]] = load i32, i32* [[PIN21]], align 2
	; CHECK: [[PIN10:%[^ ]+]] = bitcast i16* %pIn1.0 to i32*			; CHECK: [[PIN10:%[^ ]+]] = bitcast i16* %pIn1.0 to i32*
	; CHECK: [[IN10:%[^ ]+]] = load i32, i32* [[PIN10]], align 2			; CHECK: [[IN10:%[^ ]+]] = load i32, i32* [[PIN10]], align 2
				; CHECK: [[ACC1:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[IN21]], i32 [[IN10]], i64 [[ACC0]])
	; CHECK: [[PIN23:%[^ ]+]] = bitcast i16* %pIn2.3 to i32*			; CHECK: [[PIN23:%[^ ]+]] = bitcast i16* %pIn2.3 to i32*
	; CHECK: [[IN23:%[^ ]+]] = load i32, i32* [[PIN23]], align 2			; CHECK: [[IN23:%[^ ]+]] = load i32, i32* [[PIN23]], align 2
	; CHECK: [[PIN12:%[^ ]+]] = bitcast i16* %pIn1.2 to i32*			; CHECK: [[PIN12:%[^ ]+]] = bitcast i16* %pIn1.2 to i32*
	; CHECK: [[IN12:%[^ ]+]] = load i32, i32* [[PIN12]], align 2			; CHECK: [[IN12:%[^ ]+]] = load i32, i32* [[PIN12]], align 2
	; CHECK: [[ACC1:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[IN23]], i32 [[IN12]], i64 [[ACC0]])			; CHECK: [[ACC2]] = call i64 @llvm.arm.smlaldx(i32 [[IN23]], i32 [[IN12]], i64 [[ACC1]])
	; CHECK: [[ACC2]] = call i64 @llvm.arm.smlaldx(i32 [[IN21]], i32 [[IN10]], i64 [[ACC1]])
	; CHECK-NOT: call i64 @llvm.arm.smlad			; CHECK-NOT: call i64 @llvm.arm.smlad
	; CHECK-UNSUPPORTED-NOT: call i64 @llvm.arm.smlad			; CHECK-UNSUPPORTED-NOT: call i64 @llvm.arm.smlad

	entry:			entry:
	%cmp9 = icmp eq i32 %limit, 0			%cmp9 = icmp eq i32 %limit, 0
	br i1 %cmp9, label %for.cond.cleanup, label %for.body.preheader			br i1 %cmp9, label %for.cond.cleanup, label %for.body.preheader

	for.body.preheader:			for.body.preheader:
	; CHECK: [[ACC0:%[^ ]+]] = phi i64 [ 0, %for.body.preheader.new ], [ [[ACC2:%[^ ]+]], %for.body ]			; CHECK: [[ACC0:%[^ ]+]] = phi i64 [ 0, %for.body.preheader.new ], [ [[ACC2:%[^ ]+]], %for.body ]

	; CHECK: [[PIN2_CAST:%[^ ]+]] = bitcast i16* [[PIN2]] to i32*			; CHECK: [[PIN2_CAST:%[^ ]+]] = bitcast i16* [[PIN2]] to i32*
	; CHECK: [[IN2:%[^ ]+]] = load i32, i32* [[PIN2_CAST]], align 2			; CHECK: [[IN2:%[^ ]+]] = load i32, i32* [[PIN2_CAST]], align 2

	; CHECK: [[PIN1_2:%[^ ]+]] = getelementptr i16, i16* [[PIN1]], i32 -2			; CHECK: [[PIN1_2:%[^ ]+]] = getelementptr i16, i16* [[PIN1]], i32 -2
	; CHECK: [[PIN1_2_CAST:%[^ ]+]] = bitcast i16* [[PIN1_2]] to i32*			; CHECK: [[PIN1_2_CAST:%[^ ]+]] = bitcast i16* [[PIN1_2]] to i32*
	; CHECK: [[IN1_2:%[^ ]+]] = load i32, i32* [[PIN1_2_CAST]], align 2			; CHECK: [[IN1_2:%[^ ]+]] = load i32, i32* [[PIN1_2_CAST]], align 2
				; CHECK: [[ACC1:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[IN2]], i32 [[IN1_2]], i64 [[ACC0]])

	; CHECK: [[PIN2_2:%[^ ]+]] = getelementptr i16, i16* [[PIN2]], i32 -2			; CHECK: [[PIN2_2:%[^ ]+]] = getelementptr i16, i16* [[PIN2]], i32 -2
	; CHECK: [[PIN2_2_CAST:%[^ ]+]] = bitcast i16* [[PIN2_2]] to i32*			; CHECK: [[PIN2_2_CAST:%[^ ]+]] = bitcast i16* [[PIN2_2]] to i32*
	; CHECK: [[IN2_2:%[^ ]+]] = load i32, i32* [[PIN2_2_CAST]], align 2			; CHECK: [[IN2_2:%[^ ]+]] = load i32, i32* [[PIN2_2_CAST]], align 2

	; CHECK: [[PIN1_CAST:%[^ ]+]] = bitcast i16* [[PIN1]] to i32*			; CHECK: [[PIN1_CAST:%[^ ]+]] = bitcast i16* [[PIN1]] to i32*
	; CHECK: [[IN1:%[^ ]+]] = load i32, i32* [[PIN1_CAST]], align 2			; CHECK: [[IN1:%[^ ]+]] = load i32, i32* [[PIN1_CAST]], align 2
				; CHECK: [[ACC2]] = call i64 @llvm.arm.smlaldx(i32 [[IN2_2]], i32 [[IN1]], i64 [[ACC1]])
	; CHECK: [[ACC1:%[^ ]+]] = call i64 @llvm.arm.smlaldx(i32 [[IN2_2]], i32 [[IN1]], i64 [[ACC0]])
	; CHECK: [[ACC2]] = call i64 @llvm.arm.smlaldx(i32 [[IN2]], i32 [[IN1_2]], i64 [[ACC1]])

	; CHECK: [[PIN1_NEXT]] = getelementptr i16, i16* [[PIN1]], i32 4			; CHECK: [[PIN1_NEXT]] = getelementptr i16, i16* [[PIN1]], i32 4
	; CHECK: [[PIN2_NEXT]] = getelementptr i16, i16* [[PIN2]], i32 -4			; CHECK: [[PIN2_NEXT]] = getelementptr i16, i16* [[PIN2]], i32 -4

	; CHECK-NOT: call i64 @llvm.arm.smlad			; CHECK-NOT: call i64 @llvm.arm.smlad
	; CHECK-UNSUPPORTED-NOT: call i64 @llvm.arm.smlad			; CHECK-UNSUPPORTED-NOT: call i64 @llvm.arm.smlad

	for.body:			for.body:

test/CodeGen/ARM/ParallelDSP/unroll-n-jam-smlad.ll

	; CHECK-REG-PRESSURE: .LBB0_1:			; CHECK-REG-PRESSURE: .LBB0_1:
	; CHECK-REG-PRESSURE: ldr{{.*}}, [sp			; CHECK-REG-PRESSURE: ldr{{.*}}, [sp
	; CHECK-REG-PRESSURE: ldr{{.*}}, [sp			; CHECK-REG-PRESSURE: ldr{{.*}}, [sp
	; CHECK-REG-PRESSURE: ldr{{.*}}, [sp			; CHECK-REG-PRESSURE: ldr{{.*}}, [sp
	; CHECK-REG-PRESSURE: ldr{{.*}}, [sp			; CHECK-REG-PRESSURE: ldr{{.*}}, [sp
	; CHECK-REG-PRESSURE: ldr{{.*}}, [sp			; CHECK-REG-PRESSURE: ldr{{.*}}, [sp
	; CHECK-REG-PRESSURE: ldr{{.*}}, [sp			; CHECK-REG-PRESSURE: ldr{{.*}}, [sp
	; CHECK-REG-PRESSURE: ldr{{.*}}, [sp			; CHECK-REG-PRESSURE: ldr{{.*}}, [sp
	; CHECK-REG-PRESSURE: ldr{{.*}}, [sp			; CHECK-REG-PRESSURE-NOT: ldr{{.*}}, [sp
	; CHECK-REG-PRESSURE: ldr{{.*}}, [sp
	; CHECK-REG-PRESSURE: bne .LBB0_1			; CHECK-REG-PRESSURE: bne .LBB0_1

	for.body:			for.body:
	%A3 = phi i32 [ %add9.us.i.3361.i, %for.body ], [ 0, %entry ]			%A3 = phi i32 [ %add9.us.i.3361.i, %for.body ], [ 0, %entry ]
	%j.026.us.i.i = phi i32 [ %inc.us.i.3362.i, %for.body ], [ 0, %entry ]			%j.026.us.i.i = phi i32 [ %inc.us.i.3362.i, %for.body ], [ 0, %entry ]
	%A4 = phi i32 [ %add9.us.i.1.3.i, %for.body ], [ 0, %entry ]			%A4 = phi i32 [ %add9.us.i.1.3.i, %for.body ], [ 0, %entry ]
	%A5 = phi i32 [ %add9.us.i.2.3.i, %for.body ], [ 0, %entry ]			%A5 = phi i32 [ %add9.us.i.2.3.i, %for.body ], [ 0, %entry ]
	%A6 = phi i32 [ %add9.us.i.3.3.i, %for.body ], [ 0, %entry ]			%A6 = phi i32 [ %add9.us.i.3.3.i, %for.body ], [ 0, %entry ]

test/MC/AsmParser/preserve-comments-crlf.s

This is a binary file.

Diff 219671