
[ARM] [NEON] Add ROTR/ROTL lowering
Needs Review · Public

Authored by easyaspi314 on Jan 8 2019, 9:14 PM.

Details

Summary

This patch adds support for converting ISD::ROTR and ISD::ROTL into either a vshl/vsri or a vrevN for ARM32 NEON.

The same lowering would also work for AArch64; I will probably add that later when I get a chance.

vshl/vsri vs. vshl/vshr/vorr saves one instruction, and vrevN is a single-cycle rotl for an N/2-bit rotation (e.g. a 32-bit rotation on a 64-bit lane).

Diff Detail

Event Timeline

easyaspi314 created this revision. Jan 8 2019, 9:14 PM

I'm not sure this is really the best approach... essentially, there are two relevant transforms here:

  1. A rotate by a multiple of 8 can be transformed into a shuffle. I guess the only case that's really relevant on ARM is vrev, since there aren't any other single-instruction shuffles that correspond to a rotate, so maybe it's okay to just special-case here.
  2. (OR X, (SRL Y, N)) can be transformed to VSRI if X has enough known trailing zeros. You can special-case rotates (or slightly more generally, FSHL/FSHR), but it's not much harder to handle the general case.

I'm also a little concerned that the VSRI could actually be slower in certain cases... if you look at timings for a Cortex-A57, 128-bit VSRI takes two cycles throughput to execute, as opposed to one for a regular shift.

lib/Target/ARM/ARMISelLowering.cpp
8066

llvm_unreachable; there isn't any other possible vector type.

easyaspi314 planned changes to this revision. Edited Jan 9 2019, 2:29 PM
easyaspi314 marked an inline comment as done.

I'm not sure this is really the best approach... essentially, there are two relevant transforms here:

  1. A rotate by a multiple of 8 can be transformed into a shuffle. I guess the only case that's really relevant on ARM is vrev, since there aren't any other single-instruction shuffles that correspond to a rotate, so maybe it's okay to just special-case here.

Yeah, and a general shuffle also needs its index pattern loaded: one cycle to load the literal and another to do the shuffle. Additionally, ARMv7-A only supports shuffling 64-bit vectors, so using it on a 128-bit vector would require two shuffles.

  2. (OR X, (SRL Y, N)) can be transformed to VSRI if X has enough known trailing zeros. You can special-case rotates (or slightly more generally, FSHL/FSHR), but it's not much harder to handle the general case.

True. I should probably implement that.

I'm also a little concerned that the VSRI could actually be slower in certain cases... if you look at timings for a Cortex-A57, 128-bit VSRI takes two cycles throughput to execute, as opposed to one for a regular shift.

Well, VSRI replaces both the VSHR and the VORR, which take a single cycle each. So even if it has no latency benefit, it saves an instruction.

lib/Target/ARM/ARMISelLowering.cpp
8040
SDValue Temporary = DAG.getNode(Left ? ISD::SHL : ISD::SRL, DL, VT, Value, Amount);
Value = DAG.getNode(Left ? ISD::SHL : ISD::SRL, DL, VT, Value, Amount);

Both lines use Left ? ISD::SHL : ISD::SRL, so the two halves of the rotate shift in the same direction; the second one should use the opposite opcode.

Wow. I'm an idiot.

Huh. Does clang even emit FSHL/FSHR instructions?

u32x4 fshr32_13(u32x4 val, u32x4 val2)
{
    return (val << (32 - (13 % 32))) | (val2 >> (13 % 32));
}

That is literally the contents of the code comment describing it.

I can do this, though:

declare <4 x i32> @llvm.fshr.v4i32(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c)
define <4 x i32> @fshr32_13(<4 x i32> %val1, <4 x i32> %val2) nounwind
{
  %r = call <4 x i32> @llvm.fshr.v4i32(<4 x i32> %val1, <4 x i32> %val2, <4 x i32> <i32 13, i32 13, i32 13, i32 13>)
  ret <4 x i32> %r
}

I do want to mention that VSLI is not beneficial when the result is required to end up in the same register as before, since this pattern places the result in a different register.

The only ways I can see that being an issue are if we are using every single NEON register (we have at least 16), or if we are using it in a standalone function. However, since Clang has a (bad?) habit of moving every vector passed or returned through normal registers, and functions that directly take SIMD vectors should be inlined anyway, it isn't a huge deal.

Does clang even emit FSHL/FSHR instructions

This is an area of active development; llvm.fshl got added recently. But it's not really a priority to form llvm.fshl for constant shifts; it's easy to analyze anyway.

easyaspi314 added a comment. Edited Jan 9 2019, 6:48 PM

Huh. Apparently, vshr/vsli is actually faster.

I used the benchmark tool on my NEON-optimized xxHash variant to test this, as it is a real-life use of the rotate pattern.

Compiled on my LG G3 (Quad core Snapdragon 801/Cortex-A15 underclocked to 1.8 GHz) with Clang 7.0.1 from the Termux repos, and benchmarked in the Termux app while tapping the screen to maintain a stable frequency.
clang -march=native -O2

The main loop basically looks like this:

    uint32x4_t v;
    const uint32x4_t prime1, prime2; // literals
    const uint8_t *p, *limit; // unaligned data pointers
    ...
    do {
        /* note: vld1q_u8 is to work around alignment bug */
        const uint32x4_t inp = vreinterpretq_u32_u8(vld1q_u8(p));
        v = vmlaq_u32(v, prime2, inp);
#ifdef VSLI
        v = vsliq_n_u32(vshrq_n_u32(v, 19), v, 13);
#else 
        v = vorrq_u32(vshrq_n_u32(v, 19), vshlq_n_u32(v, 13));
#endif 
        v = vmulq_u32(v, prime1);
        p += 16;
    } while (p < limit);

The benchmark gets 4.1 GB/s with -DVSLI, but only 3.7 GB/s with vshl/vorr. Similarly, the variation using two vectors in parallel gets 5.7 GB/s with -DVSLI, but only 5.3 GB/s without.

Considering that all the other variables are the same, I presume writeback latency is to blame.

Fixed my incredibly stupid typo, added FSHL/FSHR support, and used llvm_unreachable instead of the ugly goto.

I didn't add additional tests for FSHL/FSHR yet.

Definitely the result/writeback cycles.

                             @ Cy / Re / Wr
vld1.8 { inpLo, inpHi }, [p] @  2 /  2 /  6
vmla.i32 v, inp, prime2      @  4 /  9 /  9
vshr.u32 v2, v, #19          @  1 /  3 /  6
vsli.32 v2, v, #13           @  2 /  4 /  7
vmul.i32 v, v2, prime1       @  4 /  9 /  9
                             @ 13 / 27 / 37
vld1.8 { inpLo, inpHi }, [p] @  2 /  2 /  6
vmla.i32 v, inp, prime2      @  4 /  9 /  9
vshr.u32 tmp, v, #19         @  1 /  3 /  6
vshl.i32 v, v, #13           @  1 /  3 /  6
vorr v, v, tmp               @  1 /  3 /  6
vmul.i32 v, v, prime1        @  4 /  9 /  9
                             @ 13 / 29 / 42

If we count result cycles, we get 29 cycles with vshr/vshl/vorr and 27 cycles with vshr/vsli: 29/27 = 1.074. If we count writeback cycles, we get 42/37 = 1.135. That checks out with the 1.10x ratio I saw in the benchmark, as it lands right in that range.

This comment was removed by RKSimon.
efriedma added inline comments. Jan 14 2019, 6:35 PM
lib/Target/ARM/ARMISelLowering.cpp
8048

You can just return SDValue(); here, rather than explicitly expanding it.

8055

The shift amount is modulo the size of the elements.

RKSimon added inline comments. Jan 15 2019, 1:20 AM
lib/Target/ARM/ARMISelLowering.cpp
851

If you're doing this, please add suitable fshl/fshr test coverage.

for reference:
llvm\test\CodeGen\X86\vector-fshl-128.ll
llvm\test\CodeGen\X86\vector-fshr-128.ll
llvm\test\CodeGen\X86\vector-fshl-rot-128.ll
llvm\test\CodeGen\X86\vector-fshr-rot-128.ll

Huh. Sorry about the idle time.

I noticed that there is an even faster way of doing a rotate, which is replacing vshr/vsli with vshl/vsra.

It appears to save at least one cycle, and in the xxhash benchmark, I go from 5.7 GB/s to almost 6.2 GB/s.

Doing this also fixes this bug:

vsraq_n_u32(vshlq_n_u32(val, 13), val, 19);

should not generate

vshl.i32 tmp, val, #13 
vshr.u32 val, val, #19 
vorr val, val, tmp

@easyaspi314 What happened with this?

RKSimon resigned from this revision.Jan 25 2020, 1:59 AM