This is an archive of the discontinued LLVM Phabricator instance.

[ARM]: Add optimized NEON uint64x2_t multiply routine.
Needs Review · Public

Authored by easyaspi314 on Dec 27 2018, 8:48 PM.

Details

Summary

Patch to fix bug 39967

There are a few further optimizations that can be made, but this is many times better than scalarizing.

These take up to 11 cycles; however, the smallest possible case is 7 cycles with two pre-interleaved values, or 4 cycles if a single vmull suffices.

This automatically tries to select the proper routine.
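
For reference, both routines build on the usual 32-bit decomposition of a 64-bit product, in which the high*high term falls entirely out of the low 64 bits. A minimal scalar sketch (illustrative names only, not part of the patch):

#include <stdint.h>

uint64_t mul64_by_halves(uint64_t a, uint64_t b) {
  uint64_t aLo = (uint32_t)a, aHi = a >> 32;
  uint64_t bLo = (uint32_t)b, bHi = b >> 32;
  /* aHi * bHi is shifted out of the low 64 bits entirely. */
  return aLo * bLo + ((aLo * bHi + aHi * bLo) << 32);
}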

twomul is used for multiplies that could not be simplified. It takes 7 instructions and 11 cycles. It looks like this:

vmovn.i64 topLo, top
vmovn.i64 botLo, bot
vrev64.32 bot, bot

vmul.i32 bot, bot, top
vpaddl.u32 bot, bot
vshl.i64 top, bot, #32
vmlal.u32 top, botLo, topLo
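
The same sequence expressed with NEON intrinsics, purely as an illustrative sketch (the names mirror the assembly above; the patch itself emits DAG nodes, not intrinsics): vrev64.32 swaps the 32-bit halves of each lane so a single vmul.i32 produces both cross products, and vpaddl.u32 sums each adjacent pair into a 64-bit cross term.

#include <arm_neon.h>

uint64x2_t twomul(uint64x2_t top, uint64x2_t bot) {
  uint32x2_t topLo = vmovn_u64(top);                            /* low 32 bits of each lane */
  uint32x2_t botLo = vmovn_u64(bot);
  uint32x4_t botRev = vrev64q_u32(vreinterpretq_u32_u64(bot));  /* swap halves in each lane */
  uint32x4_t cross = vmulq_u32(botRev, vreinterpretq_u32_u64(top)); /* hi*lo and lo*hi */
  uint64x2_t sum = vpaddlq_u32(cross);                          /* add adjacent cross products */
  uint64x2_t high = vshlq_n_u64(sum, 32);                       /* shift cross terms into place */
  return vmlal_u32(high, botLo, topLo);                         /* add the full lo*lo product */
}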

ssemul is simpler, clearer, and more modular. It has the same timing as twomul, but it has one more instruction.

However, when something can be simplified, such as pre-interleaving a load/constant or removing multiplies on bits that are known to be zero, ssemul ends up being faster and shorter.

vmovn.i64 topLo, top
vshrn.i64 topHi, top, #32
vmovn.i64 botLo, bot
vshrn.i64 botHi, bot, #32

vmull.u32 ret, topLo, botHi
vmlal.u32 ret, topHi, botLo
vshl.i64  ret, ret, #32 
vmlal.u32 ret, topLo, botLo
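
ssemul in intrinsics form, again only as a sketch using the names from the listing above (not what the patch emits):

#include <arm_neon.h>

uint64x2_t ssemul(uint64x2_t top, uint64x2_t bot) {
  uint32x2_t topLo = vmovn_u64(top);
  uint32x2_t topHi = vshrn_n_u64(top, 32);
  uint32x2_t botLo = vmovn_u64(bot);
  uint32x2_t botHi = vshrn_n_u64(bot, 32);

  uint64x2_t ret = vmull_u32(topLo, botHi);   /* lo * hi           */
  ret = vmlal_u32(ret, topHi, botLo);         /* += hi * lo        */
  ret = vshlq_n_u64(ret, 32);                 /* cross terms << 32 */
  return vmlal_u32(ret, topLo, botLo);        /* += lo * lo        */
}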

Some missing optimizations:

  1. Masking a uint64x2_t/v2i64 by 0xFFFFFFFF shouldn't actually generate a vand; it should instead remove one multiply and one vshrn.
  2. Constant interleaving is emitted as two vldr instructions. This might not be the most efficient, as I want a single vld1.64. I don't know how adr comes into play, but I think the load could run in parallel with the shifting on the other operand.
  3. Load interleaving should be implemented. If an operand is used only once and is loaded from a pointer, replacing vld1.64 with vld2.32 and using the two uint32x2_t/v2i32 values it produces also avoids the vshrn/vmovn (see the sketch after this list).
  4. The cost model is pretty weird for v2i64 on NEON. We should probably fix the "add 4 to the cost" hack, as NEON v2i64 vectors are not as expensive as that huge penalty makes them out to be.
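
A sketch of the load-interleaving idea from item 3 (hypothetical helper, assuming a little-endian in-memory layout): vld2.32 de-interleaves the low and high 32-bit halves as it loads, so the vmovn/vshrn pair for that operand disappears.

#include <arm_neon.h>

uint64x2_t mul_by_loaded(uint64x2_t top, const uint64_t *p) {
  uint32x2x2_t bot = vld2_u32((const uint32_t *)p);  /* val[0] = low halves, val[1] = high halves */
  uint32x2_t topLo = vmovn_u64(top);
  uint32x2_t topHi = vshrn_n_u64(top, 32);

  uint64x2_t ret = vmull_u32(topLo, bot.val[1]);     /* lo * hi           */
  ret = vmlal_u32(ret, topHi, bot.val[0]);           /* += hi * lo        */
  ret = vshlq_n_u64(ret, 32);                        /* cross terms << 32 */
  return vmlal_u32(ret, topLo, bot.val[0]);          /* += lo * lo        */
}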

Diff Detail

Repository
rL LLVM

Event Timeline

easyaspi314 created this revision.Dec 27 2018, 8:48 PM
easyaspi314 added inline comments.Dec 27 2018, 9:01 PM
lib/Target/ARM/ARMISelLowering.cpp
7485

Can someone help me here? I can't figure this one out.

easyaspi314 retitled this revision from [ARM]: WIP: Add optimized uint64x2_t multiply routine. to [ARM]: WIP: Add optimized NEON uint64x2_t multiply routine..Dec 27 2018, 9:09 PM
easyaspi314 edited the summary of this revision. (Show Details)
easyaspi314 edited the summary of this revision. (Show Details)
easyaspi314 retitled this revision from [ARM]: WIP: Add optimized NEON uint64x2_t multiply routine. to [ARM]: Add optimized NEON uint64x2_t multiply routine..

Actually, the remaining optimizations can be done later. At the very least, this is usable. I left the notes for anyone who wants to implement them, but I think that, at least for now, it is far better than it was.

I don't know who I should put for a reviewer, but I think it is ready for review.

*facepalms* Wow, I am an idiot.

Note to self: Check your PWD before running svn diff.

lib/Target/ARM/ARMISelLowering.cpp
7465

Alternatively, we could try to handle simplifications by adding a DAGCombine for VSHRu and VMULLu nodes with constant operands, instead of trying to avoid emitting them. The end result is roughly the same (it's probably unlikely to construct a trivial VSHRu or VMULLu any other way), but it might be easier to understand. But I guess it would get tricky if we tried to emit the form using "vrev64". Probably okay as-is.

Needs more test coverage for various cases where the upper/lower half of an operand is zero.

lib/Target/ARM/ARMTargetTransformInfo.cpp
528

Did you really mean UMULO (llvm.umul.with.overflow)? I mean, I guess it's theoretically possible to implement, but I think we'll just scalarize it at the moment.

(The cost model changes need test coverage in test/Analysis/CostModel/ARM/ .)

test/CodeGen/ARM/vmul.ll
69

DAGCombine should be able to catch the redundant AND... but it looks like DAGCombiner::visitTRUNCATE doesn't try to handle demanded bits for vectors. (I guess it didn't get updated when other operations got support for vector operands?)

craig.topper added inline comments.
test/CodeGen/ARM/vmul.ll
69

PR39689 mentions this is disabled for vectors. Maybe @RKSimon or @spatel are working on it?

easyaspi314 planned changes to this revision.Dec 29 2018, 5:55 PM
easyaspi314 marked 3 inline comments as done.

I am going to add constant interleaving and more tests, and call it done. I think twomul and interleaved loads are much smaller, lower-priority optimizations, and something I am not comfortable enough to do myself.

lib/Target/ARM/ARMISelLowering.cpp
7465

This is the exact same method used in X86ISelLowering.cpp/LowerMUL, except for the early return to work around missing optimizations later on.

If we use the twomul/vrev64 version, we would only use it when neither operand has known-zero halves and neither could be pre-interleaved. Once things get interleaved and/or we eliminate one of the multiplies, we lose the benefit.

lib/Target/ARM/ARMTargetTransformInfo.cpp
528

Oh yeah, I did that wrong. Should be normal multiply.

test/CodeGen/ARM/vmul.ll
69

That would explain why it was choking on this when X86 does not.

easyaspi314 updated this revision to Diff 179784. Edited Dec 31 2018, 4:58 PM
easyaspi314 edited the summary of this revision. (Show Details)

Fixed cost model and added cost model tests. I also made a larger diff.

I can't get constant interleaving working, as I can't seem to figure out how to detect a literal vector. Everything I tried returned false. I don't want to spend too much time on it. If any of you want to finish it, go ahead. It is still better than it was.

easyaspi314 marked an inline comment as done.Dec 31 2018, 4:59 PM
easyaspi314 planned changes to this revision.Jan 2 2019, 4:57 PM

I figured things out and I am on track to add twomul and constant interleaving.

This is what I have now. Constant swapping and twomul are now implemented, and I will update the patch once I finish the documentation and tests, as well as double-check the return values on my phone.

typedef unsigned long long v2i64 __attribute__((vector_size(16)));

v2i64 mult(v2i64 v1, v2i64 v2) {
  return v1 * v2;
}

v2i64 mult_lo(v2i64 v1, v2i64 v2) {
  return (v1 & 0xFFFFFFFF) * (v2 & 0xFFFFFFFF);
}

v2i64 mult_lo_lohi(v2i64 v1, v2i64 v2) {
  return (v1 & 0xFFFFFFFF) * v2;
}

v2i64 mult_constant(v2i64 v1) {
  return 1234567889904ULL * v1;
}

v2i64 mult_lo_constant(v2i64 v1) {
  return v1 * 1234ULL;
}
mult:
	vmov	d17, r2, r3
	mov	r12, sp
	vmov	d16, r0, r1
	vld1.64	{d18, d19}, [r12]
	vrev64.32	q10, q8
	vmovn.i64	d16, q8
	vmovn.i64	d17, q9
	vmul.i32	q10, q10, q9
	vpaddl.u32	q10, q10
	vshl.i64	q9, q10, #32
	vmlal.u32	q9, d17, d16
	vmov	r0, r1, d18
	vmov	r2, r3, d19
	bx	lr

mult_lo:
	vmov	d19, r2, r3
	vmov.i64	q8, #0xffffffff
	vmov	d18, r0, r1
	mov	r0, sp
	vand	q9, q9, q8
	vld1.64	{d20, d21}, [r0]
	vand	q8, q10, q8
	vmovn.i64	d18, q9
	vmovn.i64	d16, q8
	vmull.u32	q8, d16, d18
	vmov	r0, r1, d16
	vmov	r2, r3, d17
	bx	lr

mult_lo_lohi:
	vmov	d19, r2, r3
	vmov.i64	q8, #0xffffffff
	vmov	d18, r0, r1
	mov	r0, sp
	vld1.64	{d20, d21}, [r0]
	vand	q8, q9, q8
	vshrn.i64	d18, q10, #32
	vmovn.i64	d16, q8
	vmovn.i64	d17, q10
	vmull.u32	q9, d16, d18
	vshl.i64	q9, q9, #32
	vmlal.u32	q9, d16, d17
	vmov	r0, r1, d18
	vmov	r2, r3, d19
	bx	lr

mult_constant:
	adr	r12, .LCPI3_0
	vmov	d17, r2, r3
	vld1.64	{d18, d19}, [r12:128]
	vmov	d16, r0, r1
	vmul.i32	q9, q8, q9
	vldr	d20, .LCPI3_1
	vmovn.i64	d16, q8
	vpaddl.u32	q9, q9
	vshl.i64	q9, q9, #32
	vmlal.u32	q9, d16, d20
	vmov	r0, r1, d18
	vmov	r2, r3, d19
	bx	lr
.LCPI3_0:
	.long	1912275952
	.long	1912275952
	.long	287
	.long	287
.LCPI3_1:
	.long	1912275952
	.long	1912275952

mult_lo_constant:
        vmov    d17, r2, r3
        vldr    d18, .LCPI4_0
        vmov    d16, r0, r1
        vshrn.i64       d19, q8, #32
        vmovn.i64       d16, q8
        vmull.u32       q10, d19, d18
        vshl.i64        q10, q10, #32
        vmlal.u32       q10, d16, d18
        vmov    r0, r1, d20
        vmov    r2, r3, d21
        bx      lr
.LCPI4_0:
        .long   1234
        .long   1234

If someone can optimize that redundant load in mult_constant, I would appreciate it.

This is the code I want:

mult_constant:
	adr	r12, .LCPI3_0
	vmov	d17, r2, r3
	vld1.64	{d22, d23}, [r12:128]
	vmov	d16, r0, r1
	vmul.i32	q9, q8, q11
	vmovn.i64	d16, q8
	vpaddl.u32	q9, q9
	vshl.i64	q9, q9, #32
	vmlal.u32	q9, d16, d22
	vmov	r0, r1, d18
	vmov	r2, r3, d19
	bx	lr
.LCPI3_0:
	.long	1912275952
	.long	1912275952
	.long	287
	.long	287
easyaspi314 added a comment. Edited Jan 2 2019, 7:28 PM

Oops, that's not right. mult_constant isn't matching.

I think I am going to fall back to ssemul for all constants like I originally planned.

easyaspi314 added a comment. Edited Jan 3 2019, 9:28 AM

vmul.i32 Qd,Qn,Qm actually takes 4 cycles, which means twomul has the same timing as ssemul, 11 cycles.
@efriedma that explains why twomul wasn't visibly faster in my tests.

However, twomul saves an instruction, so I am keeping it.

twomul:
	vrev64.32	q10, q8        @ 1 cycle,  total 1
	vmovn.i64	d16, q8        @ 1 cycle,  total 2
	vmovn.i64	d17, q9        @ 1 cycle,  total 3
	vmul.i32	q10, q10, q9   @ 4 cycles, total 7
	vpaddl.u32	q10, q10       @ 1 cycle,  total 8
	vshl.i64	q9, q10, #32   @ 1 cycle,  total 9
	vmlal.u32	q9, d17, d16   @ 2 cycles, total 11
ssemul:
	vshrn.i64       d20, q8, #32   @ 1 cycle,  total 1
	vmovn.i64       d16, q8        @ 1 cycle,  total 2
	vmovn.i64       d21, q9        @ 1 cycle,  total 3
	vmull.u32       q11, d21, d20  @ 2 cycles, total 5
	vshrn.i64       d17, q9, #32   @ 1 cycle,  total 6
	vmlal.u32       q11, d17, d16  @ 2 cycles, total 8
	vshl.i64        q9, q11, #32   @ 1 cycle,  total 9
	vmlal.u32       q9, d21, d16   @ 2 cycles, total 11
easyaspi314 updated this revision to Diff 180192. Edited Jan 3 2019, 7:37 PM

Ok, done. I fixed twomul and constant interleaving, and added a couple more tests.

I also lowered the cost of multiplication so it is relative to a normal mul i64, which lets LLVM properly autovectorize it.

easyaspi314 edited the summary of this revision. (Show Details)Jan 3 2019, 8:27 PM
RKSimon added inline comments.Jan 6 2019, 12:51 PM
test/Analysis/CostModel/ARM/mult.ll
2

You might find the utils\update_analyze_test_checks.py script useful to make this more maintainable - see X86\arith.ll for examples.

test/CodeGen/ARM/vmul.ll
41

Please add these new tests to trunk with current codegen now, then rebase this patch so it shows the changes to codegen.

69

PR39689 mentions this is disabled for vectors. Maybe @RKSimon or @spatel are working on it?

I've been putting it off as there's a load of yak shaving to be done for it - but I will look again.

easyaspi314 marked 2 inline comments as done.Jan 6 2019, 7:03 PM
v2i64 mul5(v2i64 val)
{
    return val * 5;
}
mul5:
    vmov    d17, r2, r3
    vmov    d16, r0, r1
    vmov.i32    d18, #0x5
    vshrn.i64   d19, q8, #32
    vmovn.i64   d16, q8
    vmull.u32   q10, d19, d18
    vshl.i64    q10, q10, #32
    vmlal.u32   q10, d16, d18
    vmov    r0, r1, d20
    vmov    r2, r3, d21
    bx  lr

Ummmmmm…that should DEFINITELY be a shift+add. That would be much cheaper.
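
For comparison, this is the kind of lowering mul5 could get instead (a sketch with a hypothetical helper name, not something the patch currently produces): val * 5 == (val << 2) + val, i.e. one vshl.i64 plus one vadd.i64.

#include <arm_neon.h>

uint64x2_t mul5_shift_add(uint64x2_t val) {
  return vaddq_u64(vshlq_n_u64(val, 2), val);   /* vshl.i64 + vadd.i64 */
}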

The cost model is definitely messed up…nope, something else. I set the multiply cost to 80000, and it still chose it over shifts!

mul5:
    movdqa  xmm1, xmmword ptr [rip + LCPI0_0]
    movdqa  xmm2, xmm0
    pmuludq xmm2, xmm1
    psrlq   xmm0, 32
    pmuludq xmm0, xmm1
    psllq   xmm0, 32
    paddq   xmm0, xmm2
    ret

What?! We should be shifting + adding here too! Are we just not doing shift + adds for vectors?

What the heck?

test/Analysis/CostModel/ARM/mult.ll
2

Ok, will take a look at that.

test/CodeGen/ARM/vmul.ll
41

Okay, I will try to do that.

I just run svn update, right? I don't know why I chose SVN.

Updated tests with the generation tools, changed multiply cost to 8.

Each time I was about to submit it I got distracted and had to update and build again. ¯\_(ツ)_/¯

@easyaspi314 What happened with this?

Herald added a project: Restricted Project. Dec 23 2019, 9:59 PM
RKSimon resigned from this revision.Jan 25 2020, 1:58 AM