This is an archive of the discontinued LLVM Phabricator instance.

[ARM][MVE] Sink vector shift operand
ClosedPublic

Authored by samparker on Nov 29 2019, 12:55 AM.


Details

Reviewers

dmgreen
SjoerdMeijer
simon_tatham

Commits

rG1274ac3dc235: [ARM][MVE] Sink vector shift operand
rGe0b966643fc2: [ARM][MVE] Sink vector shift operand

Summary

The shift amount operand can be provided in a general purpose register, so sink it. Patterns have been added for lshr and ashr, where we use a negated shift value with a 'shift left' instruction.
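As a rough illustration of why a negated amount lets a 'shift left' instruction stand in for lshr, here is a minimal scalar sketch. It assumes a register-form shift that treats a negative amount as a right shift; the names `shiftLeftByReg`, `vshlByScalar` and `lshrSplat` are hypothetical and exist only for this example, and only the unsigned, in-range case is modelled.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec4 = std::array<std::uint32_t, 4>;

// Hypothetical model of one unsigned lane of a "shift left by register"
// operation: a negative amount shifts right instead (assumed semantics).
std::uint32_t shiftLeftByReg(std::uint32_t lane, int amount) {
  assert(amount > -32 && amount < 32 && "model only covers in-range shifts");
  return amount >= 0 ? lane << amount : lane >> -amount;
}

// Apply the same scalar amount to every lane, as a scalar-operand form would.
Vec4 vshlByScalar(Vec4 v, int amount) {
  for (auto &lane : v)
    lane = shiftLeftByReg(lane, amount);
  return v;
}

// Reference: a plain per-lane logical shift right by a splatted amount.
Vec4 lshrSplat(Vec4 v, unsigned amount) {
  for (auto &lane : v)
    lane >>= amount;
  return v;
}
```

Under these assumptions, `vshlByScalar(v, -n)` and `lshrSplat(v, n)` agree lane by lane, which is the equivalence the lshr tests rely on when they negate the scalar before the loop.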

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

samparker created this revision. Nov 29 2019, 12:55 AM

Herald added a project: Restricted Project. Nov 29 2019, 12:55 AM

Herald added subscribers: hiraditya, kristof.beyls.

LGTM

This revision is now accepted and ready to land. Nov 29 2019, 4:27 AM

dmgreen added inline comments. Nov 30 2019, 5:31 AM

llvm/lib/Target/ARM/ARMInstrMVE.td
4072	I think all of this should be happening in a DAG combine, to give the negate a chance to fold into something else. It's generally not a good idea to create multiple outputs in a tablegen pattern. Ideally they should be one/multi inputs to one output, with the other optimisations happening in DAG combine to allow extra folding to happen. I guess ideally this would happen much earlier, like in instcombine, to allow all that folding goodness. Not sure how to make that happen sensibly though.

samparker marked an inline comment as done. Dec 2 2019, 5:02 AM

samparker added inline comments.

llvm/lib/Target/ARM/ARMInstrMVE.td
4072	I thought that these negations were generated by us in lowering/combining? I haven't looked much into the lowering code, but I really would hope that we could remove some of the custom nodes... So I would be very hesitant to introduce yet more custom nodes, to get around the custom nodes that they've already introduced! I also don't see why more C++ would be better than having all the related patterns here, in a concise, readable format.

dmgreen added inline comments. Dec 3 2019, 12:19 AM

llvm/lib/Target/ARM/ARMISelLowering.cpp
14746	This bit looks like a good change on its own, if you want to split that out (with the tests) and commit it separately.
llvm/lib/Target/ARM/ARMInstrMVE.td
4072	I'm not sure which custom nodes you mean. There was a SPLATVECTOR added recently, which could replace ARMvdup. That would be good to use if it works well enough yet (when I looked it seemed like it might only work for scalable vectors, not sure). Shifts are a bit more awkward though, with the negated conditions. If you can make them work then that sounds useful, but I have a memory of trying that initially but eventually just using the custom nodes, the same as neon. My understanding is that, at least in isel, tablegen is considered only as the last step. There is no folding that happens after it, on machine instructions. (Which would be useful in places, but sounds like a lot of work, having to reimplement it for all targets). So the graph going into lowering needs to already be optimised. Unlike with legalising, you can't just make a mess and have it cleaned up after. If you generate a shift _and_ an rsb, that's what you end up with even if the rsb can be folded into something else. We just need to give it that chance to be folded and that can only happen in DAG combine. So I think there should be a (vdup(neg)) -> (neg(vdup)) combine. It shouldn't be too much code, as far as I understand. It might even be less than all this tablegen.

Now flipping the vdup and sub to reuse existing patterns.

Nice one. Yeah, that was what I meant. LGTM
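For readers following the thread, the fold that the final patch performs, `(sub (zero vector), (vdup x)) -> (vdup (sub 0, x))`, can be sketched on a toy expression tree. This is only an illustration under stated assumptions: `Node`, `make`, `combineSub` and `print` are hypothetical helpers and not LLVM's SelectionDAG API.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical miniature expression node; this is not LLVM's SDNode.
enum class Op { Scalar, ZeroScalar, ZeroVec, VDup, Sub };

struct Node {
  Op op;
  std::string name;               // only meaningful for Scalar leaves
  std::shared_ptr<Node> lhs, rhs; // operands; null when unused
};
using NodePtr = std::shared_ptr<Node>;

NodePtr make(Op op, NodePtr lhs = nullptr, NodePtr rhs = nullptr,
             std::string name = "") {
  return std::make_shared<Node>(
      Node{op, std::move(name), std::move(lhs), std::move(rhs)});
}

// Fold (sub zerovec, (vdup x)) -> (vdup (sub 0, x)).
// Returns null when the node does not match, mirroring a "no change" result.
NodePtr combineSub(const NodePtr &n) {
  if (n->op != Op::Sub || !n->lhs || !n->rhs)
    return nullptr;
  if (n->lhs->op != Op::ZeroVec || n->rhs->op != Op::VDup || !n->rhs->lhs)
    return nullptr;
  // Negate the scalar first, then splat it, so the negate stays scalar.
  NodePtr negate = make(Op::Sub, make(Op::ZeroScalar), n->rhs->lhs);
  return make(Op::VDup, negate);
}

std::string print(const NodePtr &n) {
  switch (n->op) {
  case Op::Scalar:     return n->name;
  case Op::ZeroScalar: return "0";
  case Op::ZeroVec:    return "zerovec";
  case Op::VDup:       return "(vdup " + print(n->lhs) + ")";
  case Op::Sub:        return "(sub " + print(n->lhs) + ", " + print(n->rhs) + ")";
  }
  return "?";
}
```

Once the splat sits on the outside, a pattern that already matches a shift by `(vdup reg)` can apply unchanged, and the scalar negate remains visible to any later scalar folding.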

llvm/lib/Target/ARM/ARMISelLowering.cpp
11748	We have an isZeroVector in here now, which may help simplify some of this.

samparker marked an inline comment as done. Dec 11 2019, 11:36 PM

Closed by commit rGe0b966643fc2: [ARM][MVE] Sink vector shift operand (authored by samparker). Dec 11 2019, 11:39 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/lib/Target/ARM/ARMISelLowering.cpp	30 lines
llvm/lib/Target/ARM/ARMInstrMVE.td	1 line
llvm/test/CodeGen/Thumb2/mve-shifts-scalar.ll	422 lines
llvm/test/CodeGen/Thumb2/mve-shifts.ll	30 lines

Diff 233515

llvm/lib/Target/ARM/ARMISelLowering.cpp


[...]	case ARMISD::CMP:
break;		break;
}		}
}		}

if (N->getOpcode() != ISD::ADD && N->getOpcode() != ISD::OR &&		if (N->getOpcode() != ISD::ADD && N->getOpcode() != ISD::OR &&
N->getOpcode() != ISD::XOR && N->getOpcode() != ISD::AND)		N->getOpcode() != ISD::XOR && N->getOpcode() != ISD::AND)
return SDValue();		return SDValue();

if (N->getOperand(0).getOpcode() != ISD::SHL)		if (N->getOperand(0).getOpcode() != ISD::SHL)
return SDValue();		return SDValue();

SDValue SHL = N->getOperand(0);		SDValue SHL = N->getOperand(0);

auto *C1ShlC2 = dyn_cast<ConstantSDNode>(N->getOperand(1));		auto *C1ShlC2 = dyn_cast<ConstantSDNode>(N->getOperand(1));
auto *C2 = dyn_cast<ConstantSDNode>(SHL.getOperand(1));		auto *C2 = dyn_cast<ConstantSDNode>(SHL.getOperand(1));
if (!C1ShlC2 || !C2)		if (!C1ShlC2 || !C2)
return SDValue();		return SDValue();
[...]	static SDValue PerformADDCombine(SDNode *N,

// If that didn't work, try again with the operands commuted.		// If that didn't work, try again with the operands commuted.
return PerformADDCombineWithOperands(N, N1, N0, DCI, Subtarget);		return PerformADDCombineWithOperands(N, N1, N0, DCI, Subtarget);
}		}

/// PerformSUBCombine - Target-specific dag combine xforms for ISD::SUB.		/// PerformSUBCombine - Target-specific dag combine xforms for ISD::SUB.
///		///
static SDValue PerformSUBCombine(SDNode *N,		static SDValue PerformSUBCombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI) {		TargetLowering::DAGCombinerInfo &DCI,
		const ARMSubtarget *Subtarget) {
SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);		SDValue N1 = N->getOperand(1);

// fold (sub x, (select cc, 0, c)) -> (select cc, x, (sub, x, c))		// fold (sub x, (select cc, 0, c)) -> (select cc, x, (sub, x, c))
if (N1.getNode()->hasOneUse())		if (N1.getNode()->hasOneUse())
if (SDValue Result = combineSelectAndUse(N, N1, N0, DCI))		if (SDValue Result = combineSelectAndUse(N, N1, N0, DCI))
return Result;		return Result;

		if (!Subtarget->hasMVEIntegerOps() || !N->getValueType(0).isVector())
		return SDValue();

		// Fold (sub (ARMvmovImm 0), (ARMvdup x)) -> (ARMvdup (sub 0, x))
		// so that we can readily pattern match more mve instructions which can use
		// a scalar operand.
		SDValue VDup = N->getOperand(1);
		if (VDup->getOpcode() != ARMISD::VDUP)
return SDValue();		return SDValue();

		SDValue VMov = N->getOperand(0);
		if (VMov->getOpcode() == ISD::BITCAST)
		VMov = VMov->getOperand(0);

		if (VMov->getOpcode() != ARMISD::VMOVIMM || !isZeroVector(VMov))
		return SDValue();

		SDLoc dl(N);
		SDValue Negate = DCI.DAG.getNode(ISD::SUB, dl, MVT::i32, VMov->getOperand(0),
		VDup->getOperand(0));
		return DCI.DAG.getNode(ARMISD::VDUP, dl, N->getValueType(0), Negate);
}		}

/// PerformVMULCombine		/// PerformVMULCombine
/// Distribute (A + B) * C to (A * C) + (B * C) to take advantage of the		/// Distribute (A + B) * C to (A * C) + (B * C) to take advantage of the
/// special multiplier accumulator forwarding.		/// special multiplier accumulator forwarding.
/// vmul d3, d0, d2		/// vmul d3, d0, d2
/// vmla d3, d1, d2		/// vmla d3, d1, d2
/// is faster than		/// is faster than
[...]
SDValue ARMTargetLowering::PerformDAGCombine(SDNode *N,		SDValue ARMTargetLowering::PerformDAGCombine(SDNode *N,
DAGCombinerInfo &DCI) const {		DAGCombinerInfo &DCI) const {
switch (N->getOpcode()) {		switch (N->getOpcode()) {
default: break;		default: break;
case ISD::ABS: return PerformABSCombine(N, DCI, Subtarget);		case ISD::ABS: return PerformABSCombine(N, DCI, Subtarget);
case ARMISD::ADDE: return PerformADDECombine(N, DCI, Subtarget);		case ARMISD::ADDE: return PerformADDECombine(N, DCI, Subtarget);
case ARMISD::UMLAL: return PerformUMLALCombine(N, DCI.DAG, Subtarget);		case ARMISD::UMLAL: return PerformUMLALCombine(N, DCI.DAG, Subtarget);
case ISD::ADD: return PerformADDCombine(N, DCI, Subtarget);		case ISD::ADD: return PerformADDCombine(N, DCI, Subtarget);
case ISD::SUB: return PerformSUBCombine(N, DCI);		case ISD::SUB: return PerformSUBCombine(N, DCI, Subtarget);
case ISD::MUL: return PerformMULCombine(N, DCI, Subtarget);		case ISD::MUL: return PerformMULCombine(N, DCI, Subtarget);
case ISD::OR: return PerformORCombine(N, DCI, Subtarget);		case ISD::OR: return PerformORCombine(N, DCI, Subtarget);
case ISD::XOR: return PerformXORCombine(N, DCI, Subtarget);		case ISD::XOR: return PerformXORCombine(N, DCI, Subtarget);
case ISD::AND: return PerformANDCombine(N, DCI, Subtarget);		case ISD::AND: return PerformANDCombine(N, DCI, Subtarget);
case ISD::BRCOND:		case ISD::BRCOND:
case ISD::BR_CC: return PerformHWLoopCombine(N, DCI, Subtarget);		case ISD::BR_CC: return PerformHWLoopCombine(N, DCI, Subtarget);
case ARMISD::ADDC:		case ARMISD::ADDC:
case ARMISD::SUBC: return PerformAddcSubcCombine(N, DCI, Subtarget);		case ARMISD::SUBC: return PerformAddcSubcCombine(N, DCI, Subtarget);
[...]	bool ARMTargetLowering::allowsMisalignedMemoryAccesses(EVT VT, unsigned,
if (Ty == MVT::v16i8 || Ty == MVT::v8i16 || Ty == MVT::v8f16 ||		if (Ty == MVT::v16i8 || Ty == MVT::v8i16 || Ty == MVT::v8f16 ||
Ty == MVT::v4i32 || Ty == MVT::v4f32 || Ty == MVT::v2i64 ||		Ty == MVT::v4i32 || Ty == MVT::v4f32 || Ty == MVT::v2i64 ||
Ty == MVT::v2f64) {		Ty == MVT::v2f64) {
if (Fast)		if (Fast)
*Fast = true;		*Fast = true;
return true;		return true;
}		}

return false;		return false;
}		}

static bool memOpAlign(unsigned DstAlign, unsigned SrcAlign,		static bool memOpAlign(unsigned DstAlign, unsigned SrcAlign,
unsigned AlignCheck) {		unsigned AlignCheck) {
return ((SrcAlign == 0 || SrcAlign % AlignCheck == 0) &&		return ((SrcAlign == 0 || SrcAlign % AlignCheck == 0) &&
(DstAlign == 0 || DstAlign % AlignCheck == 0));		(DstAlign == 0 || DstAlign % AlignCheck == 0));
}		}

[...]	if (!Subtarget->hasMVEIntegerOps())
return false;		return false;

auto IsSinker = [](Instruction *I, int Operand) {		auto IsSinker = [](Instruction *I, int Operand) {
switch (I->getOpcode()) {		switch (I->getOpcode()) {
case Instruction::Add:		case Instruction::Add:
case Instruction::Mul:		case Instruction::Mul:
return true;		return true;
case Instruction::Sub:		case Instruction::Sub:
		case Instruction::Shl:
		case Instruction::LShr:
		case Instruction::AShr:
return Operand == 1;		return Operand == 1;
default:		default:
return false;		return false;
}		}
};		};

int Op = 0;		int Op = 0;
if (!isa<ShuffleVectorInst>(I->getOperand(Op)))		if (!isa<ShuffleVectorInst>(I->getOperand(Op)))
[...]
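The operand rule in the `IsSinker` lambda above can be restated as a standalone sketch. This is a simplified model, assuming a hypothetical `Opcode` enum in place of LLVM's `Instruction` opcodes: commutative operations accept a sunk splat in either position, while sub and the three shifts accept it only as operand 1 (the subtrahend or the shift amount).

```cpp
#include <cassert>

// Hypothetical stand-in for the instruction opcodes the lambda switches on.
enum class Opcode { Add, Mul, Sub, Shl, LShr, AShr, Other };

// Returns true if a splat used as operand `operand` of `op` is a candidate
// for sinking next to its user, following the rule in the diff above.
bool isSinker(Opcode op, int operand) {
  assert((operand == 0 || operand == 1) && "binary operators have two operands");
  switch (op) {
  case Opcode::Add:
  case Opcode::Mul:
    return true; // commutative: either operand may be the scalar
  case Opcode::Sub:
  case Opcode::Shl:
  case Opcode::LShr:
  case Opcode::AShr:
    return operand == 1; // only the second operand may be the scalar
  default:
    return false;
  }
}
```

For example, `isSinker(Opcode::Shl, 1)` holds because the shift amount can come from a scalar, whereas `isSinker(Opcode::Shl, 0)` does not, since the value being shifted must remain a vector.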

llvm/lib/Target/ARM/ARMInstrMVE.td

[...]
}		}

defm MVE_VCADDi8 : MVE_VxCADD_m<"vcadd", MVE_v16i8, 0b1>;		defm MVE_VCADDi8 : MVE_VxCADD_m<"vcadd", MVE_v16i8, 0b1>;
defm MVE_VCADDi16 : MVE_VxCADD_m<"vcadd", MVE_v8i16, 0b1>;		defm MVE_VCADDi16 : MVE_VxCADD_m<"vcadd", MVE_v8i16, 0b1>;
defm MVE_VCADDi32 : MVE_VxCADD_m<"vcadd", MVE_v4i32, 0b1, "@earlyclobber $Qd">;		defm MVE_VCADDi32 : MVE_VxCADD_m<"vcadd", MVE_v4i32, 0b1, "@earlyclobber $Qd">;

defm MVE_VHCADDs8 : MVE_VxCADD_m<"vhcadd", MVE_v16s8, 0b0>;		defm MVE_VHCADDs8 : MVE_VxCADD_m<"vhcadd", MVE_v16s8, 0b0>;
defm MVE_VHCADDs16 : MVE_VxCADD_m<"vhcadd", MVE_v8s16, 0b0>;		defm MVE_VHCADDs16 : MVE_VxCADD_m<"vhcadd", MVE_v8s16, 0b0>;
defm MVE_VHCADDs32 : MVE_VxCADD_m<"vhcadd", MVE_v4s32, 0b0, "@earlyclobber $Qd">;		defm MVE_VHCADDs32 : MVE_VxCADD_m<"vhcadd", MVE_v4s32, 0b0, "@earlyclobber $Qd">;

class MVE_VADCSBC<string iname, bit I, bit subtract,		class MVE_VADCSBC<string iname, bit I, bit subtract,
dag carryin, list<dag> pattern=[]>		dag carryin, list<dag> pattern=[]>
: MVE_qDest_qSrc<iname, "i32", (outs MQPR:$Qd, cl_FPSCR_NZCV:$carryout),		: MVE_qDest_qSrc<iname, "i32", (outs MQPR:$Qd, cl_FPSCR_NZCV:$carryout),
!con((ins MQPR:$Qn, MQPR:$Qm), carryin),		!con((ins MQPR:$Qn, MQPR:$Qm), carryin),
"$Qd, $Qn, $Qm", vpred_r, "", pattern> {		"$Qd, $Qn, $Qm", vpred_r, "", pattern> {
bits<4> Qn;		bits<4> Qn;

[...]	multiclass MVE_VxSHL_qr_types<string iname, bit bit_7, bit bit_17> {
def u32 : MVE_VxSHL_qr<iname, "u32", 0b1, 0b10, bit_7, bit_17>;		def u32 : MVE_VxSHL_qr<iname, "u32", 0b1, 0b10, bit_7, bit_17>;
}		}

defm MVE_VSHL_qr : MVE_VxSHL_qr_types<"vshl", 0b0, 0b0>;		defm MVE_VSHL_qr : MVE_VxSHL_qr_types<"vshl", 0b0, 0b0>;
defm MVE_VRSHL_qr : MVE_VxSHL_qr_types<"vrshl", 0b0, 0b1>;		defm MVE_VRSHL_qr : MVE_VxSHL_qr_types<"vrshl", 0b0, 0b1>;
defm MVE_VQSHL_qr : MVE_VxSHL_qr_types<"vqshl", 0b1, 0b0>;		defm MVE_VQSHL_qr : MVE_VxSHL_qr_types<"vqshl", 0b1, 0b0>;
defm MVE_VQRSHL_qr : MVE_VxSHL_qr_types<"vqrshl", 0b1, 0b1>;		defm MVE_VQRSHL_qr : MVE_VxSHL_qr_types<"vqrshl", 0b1, 0b1>;


let Predicates = [HasMVEInt] in {		let Predicates = [HasMVEInt] in {
def : Pat<(v4i32 (ARMvshlu (v4i32 MQPR:$Qm), (v4i32 (ARMvdup GPR:$Rm)))),		def : Pat<(v4i32 (ARMvshlu (v4i32 MQPR:$Qm), (v4i32 (ARMvdup GPR:$Rm)))),
(v4i32 (MVE_VSHL_qru32 (v4i32 MQPR:$Qm), GPR:$Rm))>;		(v4i32 (MVE_VSHL_qru32 (v4i32 MQPR:$Qm), GPR:$Rm))>;
def : Pat<(v8i16 (ARMvshlu (v8i16 MQPR:$Qm), (v8i16 (ARMvdup GPR:$Rm)))),		def : Pat<(v8i16 (ARMvshlu (v8i16 MQPR:$Qm), (v8i16 (ARMvdup GPR:$Rm)))),
(v8i16 (MVE_VSHL_qru16 (v8i16 MQPR:$Qm), GPR:$Rm))>;		(v8i16 (MVE_VSHL_qru16 (v8i16 MQPR:$Qm), GPR:$Rm))>;
def : Pat<(v16i8 (ARMvshlu (v16i8 MQPR:$Qm), (v16i8 (ARMvdup GPR:$Rm)))),		def : Pat<(v16i8 (ARMvshlu (v16i8 MQPR:$Qm), (v16i8 (ARMvdup GPR:$Rm)))),
(v16i8 (MVE_VSHL_qru8 (v16i8 MQPR:$Qm), GPR:$Rm))>;		(v16i8 (MVE_VSHL_qru8 (v16i8 MQPR:$Qm), GPR:$Rm))>;

[...]

llvm/test/CodeGen/Thumb2/mve-shifts-scalar.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -O3 -mtriple=thumbv8.1m.main-arm-none-eabi -mattr=+mve %s -o - | FileCheck %s

				define dso_local arm_aapcs_vfpcc void @sink_shl_i32(i32* nocapture readonly %in, i32* noalias nocapture %out, i32 %shift, i32 %N) {
				; CHECK-LABEL: sink_shl_i32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: bic r3, r3, #3
				; CHECK-NEXT: subs r0, #16
				; CHECK-NEXT: sub.w r12, r3, #4
				; CHECK-NEXT: movs r3, #1
				; CHECK-NEXT: subs r1, #16
				; CHECK-NEXT: add.w lr, r3, r12, lsr #2
				; CHECK-NEXT: dls lr, lr
				; CHECK-NEXT: .LBB0_1: @ %vector.body
				; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrw.u32 q0, [r0, #16]!
				; CHECK-NEXT: vshl.u32 q0, r2
				; CHECK-NEXT: vstrb.8 q0, [r1, #16]!
				; CHECK-NEXT: le lr, .LBB0_1
				; CHECK-NEXT: @ %bb.2: @ %exit
				; CHECK-NEXT: pop {r7, pc}
				entry:
				br label %vector.ph

				vector.ph:
				%n.vec = and i32 %N, -4
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %shift, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%gep.in = getelementptr inbounds i32, i32* %in, i32 %index
				%cast.in = bitcast i32* %gep.in to <4 x i32>*
				%wide.load = load <4 x i32>, <4 x i32>* %cast.in, align 4
				%res = shl <4 x i32> %wide.load, %broadcast.splat11
				%gep.out = getelementptr inbounds i32, i32* %out, i32 %index
				%cast.out = bitcast i32* %gep.out to <4 x i32>*
				store <4 x i32> %res, <4 x i32>* %cast.out, align 4
				%index.next = add i32 %index, 4
				%cmp = icmp eq i32 %index.next, %n.vec
				br i1 %cmp, label %exit, label %vector.body

				exit:
				ret void
				}

				define dso_local arm_aapcs_vfpcc void @sink_shl_i16(i16* nocapture readonly %in, i16* noalias nocapture %out, i16 %shift, i32 %N) {
				; CHECK-LABEL: sink_shl_i16:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: bic r3, r3, #3
				; CHECK-NEXT: subs r0, #8
				; CHECK-NEXT: sub.w r12, r3, #4
				; CHECK-NEXT: movs r3, #1
				; CHECK-NEXT: subs r1, #8
				; CHECK-NEXT: add.w lr, r3, r12, lsr #2
				; CHECK-NEXT: dls lr, lr
				; CHECK-NEXT: .LBB1_1: @ %vector.body
				; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrw.u32 q0, [r0, #8]!
				; CHECK-NEXT: vshl.u16 q0, r2
				; CHECK-NEXT: vstrb.8 q0, [r1, #8]!
				; CHECK-NEXT: le lr, .LBB1_1
				; CHECK-NEXT: @ %bb.2: @ %exit
				; CHECK-NEXT: pop {r7, pc}
				entry:
				br label %vector.ph

				vector.ph:
				%n.vec = and i32 %N, -4
				%broadcast.splatinsert10 = insertelement <8 x i16> undef, i16 %shift, i32 0
				%broadcast.splat11 = shufflevector <8 x i16> %broadcast.splatinsert10, <8 x i16> undef, <8 x i32> zeroinitializer
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%gep.in = getelementptr inbounds i16, i16* %in, i32 %index
				%cast.in = bitcast i16* %gep.in to <8 x i16>*
				%wide.load = load <8 x i16>, <8 x i16>* %cast.in, align 4
				%res = shl <8 x i16> %wide.load, %broadcast.splat11
				%gep.out = getelementptr inbounds i16, i16* %out, i32 %index
				%cast.out = bitcast i16* %gep.out to <8 x i16>*
				store <8 x i16> %res, <8 x i16>* %cast.out, align 4
				%index.next = add i32 %index, 4
				%cmp = icmp eq i32 %index.next, %n.vec
				br i1 %cmp, label %exit, label %vector.body

				exit:
				ret void
				}

				define dso_local arm_aapcs_vfpcc void @sink_shl_i8(i8* nocapture readonly %in, i8* noalias nocapture %out, i8 %shift, i32 %N) {
				; CHECK-LABEL: sink_shl_i8:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: bic r3, r3, #3
				; CHECK-NEXT: subs r0, #4
				; CHECK-NEXT: sub.w r12, r3, #4
				; CHECK-NEXT: movs r3, #1
				; CHECK-NEXT: subs r1, #4
				; CHECK-NEXT: add.w lr, r3, r12, lsr #2
				; CHECK-NEXT: dls lr, lr
				; CHECK-NEXT: .LBB2_1: @ %vector.body
				; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrw.u32 q0, [r0, #4]!
				; CHECK-NEXT: vshl.u8 q0, r2
				; CHECK-NEXT: vstrb.8 q0, [r1, #4]!
				; CHECK-NEXT: le lr, .LBB2_1
				; CHECK-NEXT: @ %bb.2: @ %exit
				; CHECK-NEXT: pop {r7, pc}
				entry:
				br label %vector.ph

				vector.ph:
				%n.vec = and i32 %N, -4
				%broadcast.splatinsert10 = insertelement <16 x i8> undef, i8 %shift, i32 0
				%broadcast.splat11 = shufflevector <16 x i8> %broadcast.splatinsert10, <16 x i8> undef, <16 x i32> zeroinitializer
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%gep.in = getelementptr inbounds i8, i8* %in, i32 %index
				%cast.in = bitcast i8* %gep.in to <16 x i8>*
				%wide.load = load <16 x i8>, <16 x i8>* %cast.in, align 4
				%res = shl <16 x i8> %wide.load, %broadcast.splat11
				%gep.out = getelementptr inbounds i8, i8* %out, i32 %index
				%cast.out = bitcast i8* %gep.out to <16 x i8>*
				store <16 x i8> %res, <16 x i8>* %cast.out, align 4
				%index.next = add i32 %index, 4
				%cmp = icmp eq i32 %index.next, %n.vec
				br i1 %cmp, label %exit, label %vector.body

				exit:
				ret void
				}

				define dso_local arm_aapcs_vfpcc void @sink_lshr_i32(i32* nocapture readonly %in, i32* noalias nocapture %out, i32 %shift, i32 %N) {
				; CHECK-LABEL: sink_lshr_i32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: bic r3, r3, #3
				; CHECK-NEXT: subs r0, #16
				; CHECK-NEXT: sub.w r12, r3, #4
				; CHECK-NEXT: movs r3, #1
				; CHECK-NEXT: subs r1, #16
				; CHECK-NEXT: subs r2, #0, r2
				; CHECK-NEXT: add.w lr, r3, r12, lsr #2
				; CHECK-NEXT: dls lr, lr
				; CHECK-NEXT: .LBB3_1: @ %vector.body
				; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrw.u32 q0, [r0, #16]!
				; CHECK-NEXT: vshl.u32 q0, r2
				; CHECK-NEXT: vstrb.8 q0, [r1, #16]!
				; CHECK-NEXT: le lr, .LBB3_1
				; CHECK-NEXT: @ %bb.2: @ %exit
				; CHECK-NEXT: pop {r7, pc}
				entry:
				br label %vector.ph

				vector.ph:
				%n.vec = and i32 %N, -4
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %shift, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%gep.in = getelementptr inbounds i32, i32* %in, i32 %index
				%cast.in = bitcast i32* %gep.in to <4 x i32>*
				%wide.load = load <4 x i32>, <4 x i32>* %cast.in, align 4
				%res = lshr <4 x i32> %wide.load, %broadcast.splat11
				%gep.out = getelementptr inbounds i32, i32* %out, i32 %index
				%cast.out = bitcast i32* %gep.out to <4 x i32>*
				store <4 x i32> %res, <4 x i32>* %cast.out, align 4
				%index.next = add i32 %index, 4
				%cmp = icmp eq i32 %index.next, %n.vec
				br i1 %cmp, label %exit, label %vector.body

				exit:
				ret void
				}

				define dso_local arm_aapcs_vfpcc void @sink_lshr_i16(i16* nocapture readonly %in, i16* noalias nocapture %out, i16 %shift, i32 %N) {
				; CHECK-LABEL: sink_lshr_i16:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: bic r3, r3, #3
				; CHECK-NEXT: subs r0, #8
				; CHECK-NEXT: sub.w r12, r3, #4
				; CHECK-NEXT: movs r3, #1
				; CHECK-NEXT: subs r1, #8
				; CHECK-NEXT: subs r2, #0, r2
				; CHECK-NEXT: add.w lr, r3, r12, lsr #2
				; CHECK-NEXT: dls lr, lr
				; CHECK-NEXT: .LBB4_1: @ %vector.body
				; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrw.u32 q0, [r0, #8]!
				; CHECK-NEXT: vshl.u16 q0, r2
				; CHECK-NEXT: vstrb.8 q0, [r1, #8]!
				; CHECK-NEXT: le lr, .LBB4_1
				; CHECK-NEXT: @ %bb.2: @ %exit
				; CHECK-NEXT: pop {r7, pc}
				entry:
				br label %vector.ph

				vector.ph:
				%n.vec = and i32 %N, -4
				%broadcast.splatinsert10 = insertelement <8 x i16> undef, i16 %shift, i32 0
				%broadcast.splat11 = shufflevector <8 x i16> %broadcast.splatinsert10, <8 x i16> undef, <8 x i32> zeroinitializer
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%gep.in = getelementptr inbounds i16, i16* %in, i32 %index
				%cast.in = bitcast i16* %gep.in to <8 x i16>*
				%wide.load = load <8 x i16>, <8 x i16>* %cast.in, align 4
				%res = lshr <8 x i16> %wide.load, %broadcast.splat11
				%gep.out = getelementptr inbounds i16, i16* %out, i32 %index
				%cast.out = bitcast i16* %gep.out to <8 x i16>*
				store <8 x i16> %res, <8 x i16>* %cast.out, align 4
				%index.next = add i32 %index, 4
				%cmp = icmp eq i32 %index.next, %n.vec
				br i1 %cmp, label %exit, label %vector.body

				exit:
				ret void
				}

				define dso_local arm_aapcs_vfpcc void @sink_lshr_i8(i8* nocapture readonly %in, i8* noalias nocapture %out, i8 %shift, i32 %N) {
				; CHECK-LABEL: sink_lshr_i8:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: bic r3, r3, #3
				; CHECK-NEXT: subs r0, #4
				; CHECK-NEXT: sub.w r12, r3, #4
				; CHECK-NEXT: movs r3, #1
				; CHECK-NEXT: subs r1, #4
				; CHECK-NEXT: subs r2, #0, r2
				; CHECK-NEXT: add.w lr, r3, r12, lsr #2
				; CHECK-NEXT: dls lr, lr
				; CHECK-NEXT: .LBB5_1: @ %vector.body
				; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrw.u32 q0, [r0, #4]!
				; CHECK-NEXT: vshl.u8 q0, r2
				; CHECK-NEXT: vstrb.8 q0, [r1, #4]!
				; CHECK-NEXT: le lr, .LBB5_1
				; CHECK-NEXT: @ %bb.2: @ %exit
				; CHECK-NEXT: pop {r7, pc}
				entry:
				br label %vector.ph

				vector.ph:
				%n.vec = and i32 %N, -4
				%broadcast.splatinsert10 = insertelement <16 x i8> undef, i8 %shift, i32 0
				%broadcast.splat11 = shufflevector <16 x i8> %broadcast.splatinsert10, <16 x i8> undef, <16 x i32> zeroinitializer
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%gep.in = getelementptr inbounds i8, i8* %in, i32 %index
				%cast.in = bitcast i8* %gep.in to <16 x i8>*
				%wide.load = load <16 x i8>, <16 x i8>* %cast.in, align 4
				%res = lshr <16 x i8> %wide.load, %broadcast.splat11
				%gep.out = getelementptr inbounds i8, i8* %out, i32 %index
				%cast.out = bitcast i8* %gep.out to <16 x i8>*
				store <16 x i8> %res, <16 x i8>* %cast.out, align 4
				%index.next = add i32 %index, 4
				%cmp = icmp eq i32 %index.next, %n.vec
				br i1 %cmp, label %exit, label %vector.body

				exit:
				ret void
				}

				define dso_local arm_aapcs_vfpcc void @sink_ashr_i32(i32* nocapture readonly %in, i32* noalias nocapture %out, i32 %shift, i32 %N) {
				; CHECK-LABEL: sink_ashr_i32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: bic r3, r3, #3
				; CHECK-NEXT: subs r0, #16
				; CHECK-NEXT: sub.w r12, r3, #4
				; CHECK-NEXT: movs r3, #1
				; CHECK-NEXT: subs r1, #16
				; CHECK-NEXT: subs r2, #0, r2
				; CHECK-NEXT: add.w lr, r3, r12, lsr #2
				; CHECK-NEXT: dls lr, lr
				; CHECK-NEXT: .LBB6_1: @ %vector.body
				; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrw.u32 q0, [r0, #16]!
				; CHECK-NEXT: vshl.s32 q0, r2
				; CHECK-NEXT: vstrb.8 q0, [r1, #16]!
				; CHECK-NEXT: le lr, .LBB6_1
				; CHECK-NEXT: @ %bb.2: @ %exit
				; CHECK-NEXT: pop {r7, pc}
				entry:
				br label %vector.ph

				vector.ph:
				%n.vec = and i32 %N, -4
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %shift, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%gep.in = getelementptr inbounds i32, i32* %in, i32 %index
				%cast.in = bitcast i32* %gep.in to <4 x i32>*
				%wide.load = load <4 x i32>, <4 x i32>* %cast.in, align 4
				%res = ashr <4 x i32> %wide.load, %broadcast.splat11
				%gep.out = getelementptr inbounds i32, i32* %out, i32 %index
				%cast.out = bitcast i32* %gep.out to <4 x i32>*
				store <4 x i32> %res, <4 x i32>* %cast.out, align 4
				%index.next = add i32 %index, 4
				%cmp = icmp eq i32 %index.next, %n.vec
				br i1 %cmp, label %exit, label %vector.body

				exit:
				ret void
				}

				define dso_local arm_aapcs_vfpcc void @sink_ashr_i16(i16* nocapture readonly %in, i16* noalias nocapture %out, i16 %shift, i32 %N) {
				; CHECK-LABEL: sink_ashr_i16:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: bic r3, r3, #3
				; CHECK-NEXT: subs r0, #8
				; CHECK-NEXT: sub.w r12, r3, #4
				; CHECK-NEXT: movs r3, #1
				; CHECK-NEXT: subs r1, #8
				; CHECK-NEXT: subs r2, #0, r2
				; CHECK-NEXT: add.w lr, r3, r12, lsr #2
				; CHECK-NEXT: dls lr, lr
				; CHECK-NEXT: .LBB7_1: @ %vector.body
				; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrw.u32 q0, [r0, #8]!
				; CHECK-NEXT: vshl.s16 q0, r2
				; CHECK-NEXT: vstrb.8 q0, [r1, #8]!
				; CHECK-NEXT: le lr, .LBB7_1
				; CHECK-NEXT: @ %bb.2: @ %exit
				; CHECK-NEXT: pop {r7, pc}
				entry:
				br label %vector.ph

				vector.ph:
				%n.vec = and i32 %N, -4
				%broadcast.splatinsert10 = insertelement <8 x i16> undef, i16 %shift, i32 0
				%broadcast.splat11 = shufflevector <8 x i16> %broadcast.splatinsert10, <8 x i16> undef, <8 x i32> zeroinitializer
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%gep.in = getelementptr inbounds i16, i16* %in, i32 %index
				%cast.in = bitcast i16* %gep.in to <8 x i16>*
				%wide.load = load <8 x i16>, <8 x i16>* %cast.in, align 4
				%res = ashr <8 x i16> %wide.load, %broadcast.splat11
				%gep.out = getelementptr inbounds i16, i16* %out, i32 %index
				%cast.out = bitcast i16* %gep.out to <8 x i16>*
				store <8 x i16> %res, <8 x i16>* %cast.out, align 4
				%index.next = add i32 %index, 4
				%cmp = icmp eq i32 %index.next, %n.vec
				br i1 %cmp, label %exit, label %vector.body

				exit:
				ret void
				}

				define dso_local arm_aapcs_vfpcc void @sink_ashr_i8(i8* nocapture readonly %in, i8* noalias nocapture %out, i8 %shift, i32 %N) {
				; CHECK-LABEL: sink_ashr_i8:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: bic r3, r3, #3
				; CHECK-NEXT: subs r0, #4
				; CHECK-NEXT: sub.w r12, r3, #4
				; CHECK-NEXT: movs r3, #1
				; CHECK-NEXT: subs r1, #4
				; CHECK-NEXT: subs r2, #0, r2
				; CHECK-NEXT: add.w lr, r3, r12, lsr #2
				; CHECK-NEXT: dls lr, lr
				; CHECK-NEXT: .LBB8_1: @ %vector.body
				; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrw.u32 q0, [r0, #4]!
				; CHECK-NEXT: vshl.s8 q0, r2
				; CHECK-NEXT: vstrb.8 q0, [r1, #4]!
				; CHECK-NEXT: le lr, .LBB8_1
				; CHECK-NEXT: @ %bb.2: @ %exit
				; CHECK-NEXT: pop {r7, pc}
				entry:
				br label %vector.ph

				vector.ph:
				%n.vec = and i32 %N, -4
				%broadcast.splatinsert10 = insertelement <16 x i8> undef, i8 %shift, i32 0
				%broadcast.splat11 = shufflevector <16 x i8> %broadcast.splatinsert10, <16 x i8> undef, <16 x i32> zeroinitializer
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%gep.in = getelementptr inbounds i8, i8* %in, i32 %index
				%cast.in = bitcast i8* %gep.in to <16 x i8>*
				%wide.load = load <16 x i8>, <16 x i8>* %cast.in, align 4
				%res = ashr <16 x i8> %wide.load, %broadcast.splat11
				%gep.out = getelementptr inbounds i8, i8* %out, i32 %index
				%cast.out = bitcast i8* %gep.out to <16 x i8>*
				store <16 x i8> %res, <16 x i8>* %cast.out, align 4
				%index.next = add i32 %index, 4
				%cmp = icmp eq i32 %index.next, %n.vec
				br i1 %cmp, label %exit, label %vector.body

				exit:
				ret void
				}
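
The CHECK lines in the tests above negate the scalar shift amount once outside the loop (`subs rN, #0, rN`) and then use `vshl` with a general purpose register inside it. As a rough illustration of why that is equivalent to `lshr`/`ashr`, here is a minimal Python sketch. It is a model, not the compiler's or the hardware's implementation: the function names are hypothetical, and it assumes that the register form of `vshl` reads the amount as a signed byte and shifts right when that amount is negative.

```python
def vshl_by_register(lanes, shift, bits, signed):
    """Model a per-lane shift-left-by-register (hypothetical helper).

    Assumption: the amount is the signed bottom byte of the register,
    and a negative amount means "shift right by the magnitude".
    """
    mask = (1 << bits) - 1
    amount = shift & 0xFF
    if amount >= 0x80:
        amount -= 0x100  # reinterpret the bottom byte as signed
    out = []
    for lane in lanes:
        value = lane & mask
        if signed and value >> (bits - 1):
            value -= 1 << bits  # sign-extend the lane for the .sN forms
        if amount >= 0:
            value <<= amount
        else:
            value >>= -amount  # arithmetic for negative ints, logical otherwise
        out.append(value & mask)
    return out


def lshr(lanes, shift, bits):
    """Reference logical shift right on unsigned lane values."""
    mask = (1 << bits) - 1
    return [(lane & mask) >> shift for lane in lanes]


def ashr(lanes, shift, bits):
    """Reference arithmetic shift right on two's-complement lane values."""
    mask = (1 << bits) - 1
    out = []
    for lane in lanes:
        value = lane & mask
        if value >> (bits - 1):
            value -= 1 << bits
        out.append((value >> shift) & mask)
    return out


lanes = [0x80, 0x7F, 0xFF, 0x01]
negated = -3  # what the scalar negation computes for a shift amount of 3
lshr_via_vshl = vshl_by_register(lanes, negated, 8, signed=False)
ashr_via_vshl = vshl_by_register(lanes, negated, 8, signed=True)
```

Under those assumptions `lshr_via_vshl` matches `lshr(lanes, 3, 8)` and `ashr_via_vshl` matches `ashr(lanes, 3, 8)`, which is the property the `vshl.uN` and `vshl.sN` CHECK lines rely on.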

llvm/test/CodeGen/Thumb2/mve-shifts.ll

entry:
%0 = shl <2 x i64> %src1, %s		%0 = shl <2 x i64> %src1, %s
ret <2 x i64> %0		ret <2 x i64> %0
}		}


define arm_aapcs_vfpcc <16 x i8> @shru_qr_int8_t(<16 x i8> %src1, i8 %src2) {		define arm_aapcs_vfpcc <16 x i8> @shru_qr_int8_t(<16 x i8> %src1, i8 %src2) {
; CHECK-LABEL: shru_qr_int8_t:		; CHECK-LABEL: shru_qr_int8_t:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: vdup.8 q1, r0		; CHECK-NEXT: subs r0, #0, r0
; CHECK-NEXT: vneg.s8 q1, q1		; CHECK-NEXT: vshl.u8 q0, r0
; CHECK-NEXT: vshl.u8 q0, q0, q1
; CHECK-NEXT: bx lr		; CHECK-NEXT: bx lr
entry:		entry:
%i = insertelement <16 x i8> undef, i8 %src2, i32 0		%i = insertelement <16 x i8> undef, i8 %src2, i32 0
%s = shufflevector <16 x i8> %i, <16 x i8> undef, <16 x i32> zeroinitializer		%s = shufflevector <16 x i8> %i, <16 x i8> undef, <16 x i32> zeroinitializer
%0 = lshr <16 x i8> %src1, %s		%0 = lshr <16 x i8> %src1, %s
ret <16 x i8> %0		ret <16 x i8> %0
}		}

define arm_aapcs_vfpcc <8 x i16> @shru_qr_int16_t(<8 x i16> %src1, i16 %src2) {		define arm_aapcs_vfpcc <8 x i16> @shru_qr_int16_t(<8 x i16> %src1, i16 %src2) {
; CHECK-LABEL: shru_qr_int16_t:		; CHECK-LABEL: shru_qr_int16_t:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: vdup.16 q1, r0		; CHECK-NEXT: subs r0, #0, r0
; CHECK-NEXT: vneg.s16 q1, q1		; CHECK-NEXT: vshl.u16 q0, r0
; CHECK-NEXT: vshl.u16 q0, q0, q1
; CHECK-NEXT: bx lr		; CHECK-NEXT: bx lr
entry:		entry:
%i = insertelement <8 x i16> undef, i16 %src2, i32 0		%i = insertelement <8 x i16> undef, i16 %src2, i32 0
%s = shufflevector <8 x i16> %i, <8 x i16> undef, <8 x i32> zeroinitializer		%s = shufflevector <8 x i16> %i, <8 x i16> undef, <8 x i32> zeroinitializer
%0 = lshr <8 x i16> %src1, %s		%0 = lshr <8 x i16> %src1, %s
ret <8 x i16> %0		ret <8 x i16> %0
}		}

define arm_aapcs_vfpcc <4 x i32> @shru_qr_int32_t(<4 x i32> %src1, i32 %src2) {		define arm_aapcs_vfpcc <4 x i32> @shru_qr_int32_t(<4 x i32> %src1, i32 %src2) {
; CHECK-LABEL: shru_qr_int32_t:		; CHECK-LABEL: shru_qr_int32_t:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: vdup.32 q1, r0		; CHECK-NEXT: subs r0, #0, r0
; CHECK-NEXT: vneg.s32 q1, q1		; CHECK-NEXT: vshl.u32 q0, r0
; CHECK-NEXT: vshl.u32 q0, q0, q1
; CHECK-NEXT: bx lr		; CHECK-NEXT: bx lr
entry:		entry:
%i = insertelement <4 x i32> undef, i32 %src2, i32 0		%i = insertelement <4 x i32> undef, i32 %src2, i32 0
%s = shufflevector <4 x i32> %i, <4 x i32> undef, <4 x i32> zeroinitializer		%s = shufflevector <4 x i32> %i, <4 x i32> undef, <4 x i32> zeroinitializer
%0 = lshr <4 x i32> %src1, %s		%0 = lshr <4 x i32> %src1, %s
ret <4 x i32> %0		ret <4 x i32> %0
}		}

entry:
%0 = lshr <2 x i64> %src1, %s		%0 = lshr <2 x i64> %src1, %s
ret <2 x i64> %0		ret <2 x i64> %0
}		}


define arm_aapcs_vfpcc <16 x i8> @shrs_qr_int8_t(<16 x i8> %src1, i8 %src2) {		define arm_aapcs_vfpcc <16 x i8> @shrs_qr_int8_t(<16 x i8> %src1, i8 %src2) {
; CHECK-LABEL: shrs_qr_int8_t:		; CHECK-LABEL: shrs_qr_int8_t:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: vdup.8 q1, r0		; CHECK-NEXT: subs r0, #0, r0
; CHECK-NEXT: vneg.s8 q1, q1		; CHECK-NEXT: vshl.s8 q0, r0
; CHECK-NEXT: vshl.s8 q0, q0, q1
; CHECK-NEXT: bx lr		; CHECK-NEXT: bx lr
entry:		entry:
%i = insertelement <16 x i8> undef, i8 %src2, i32 0		%i = insertelement <16 x i8> undef, i8 %src2, i32 0
%s = shufflevector <16 x i8> %i, <16 x i8> undef, <16 x i32> zeroinitializer		%s = shufflevector <16 x i8> %i, <16 x i8> undef, <16 x i32> zeroinitializer
%0 = ashr <16 x i8> %src1, %s		%0 = ashr <16 x i8> %src1, %s
ret <16 x i8> %0		ret <16 x i8> %0
}		}

define arm_aapcs_vfpcc <8 x i16> @shrs_qr_int16_t(<8 x i16> %src1, i16 %src2) {		define arm_aapcs_vfpcc <8 x i16> @shrs_qr_int16_t(<8 x i16> %src1, i16 %src2) {
; CHECK-LABEL: shrs_qr_int16_t:		; CHECK-LABEL: shrs_qr_int16_t:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: vdup.16 q1, r0		; CHECK-NEXT: subs r0, #0, r0
; CHECK-NEXT: vneg.s16 q1, q1		; CHECK-NEXT: vshl.s16 q0, r0
; CHECK-NEXT: vshl.s16 q0, q0, q1
; CHECK-NEXT: bx lr		; CHECK-NEXT: bx lr
entry:		entry:
%i = insertelement <8 x i16> undef, i16 %src2, i32 0		%i = insertelement <8 x i16> undef, i16 %src2, i32 0
%s = shufflevector <8 x i16> %i, <8 x i16> undef, <8 x i32> zeroinitializer		%s = shufflevector <8 x i16> %i, <8 x i16> undef, <8 x i32> zeroinitializer
%0 = ashr <8 x i16> %src1, %s		%0 = ashr <8 x i16> %src1, %s
ret <8 x i16> %0		ret <8 x i16> %0
}		}

define arm_aapcs_vfpcc <4 x i32> @shrs_qr_int32_t(<4 x i32> %src1, i32 %src2) {		define arm_aapcs_vfpcc <4 x i32> @shrs_qr_int32_t(<4 x i32> %src1, i32 %src2) {
; CHECK-LABEL: shrs_qr_int32_t:		; CHECK-LABEL: shrs_qr_int32_t:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: vdup.32 q1, r0		; CHECK-NEXT: subs r0, #0, r0
; CHECK-NEXT: vneg.s32 q1, q1		; CHECK-NEXT: vshl.s32 q0, r0
; CHECK-NEXT: vshl.s32 q0, q0, q1
; CHECK-NEXT: bx lr		; CHECK-NEXT: bx lr
entry:		entry:
%i = insertelement <4 x i32> undef, i32 %src2, i32 0		%i = insertelement <4 x i32> undef, i32 %src2, i32 0
%s = shufflevector <4 x i32> %i, <4 x i32> undef, <4 x i32> zeroinitializer		%s = shufflevector <4 x i32> %i, <4 x i32> undef, <4 x i32> zeroinitializer
%0 = ashr <4 x i32> %src1, %s		%0 = ashr <4 x i32> %src1, %s
ret <4 x i32> %0		ret <4 x i32> %0
}		}
