This is an archive of the discontinued LLVM Phabricator instance.

[ARM]: Add optimized NEON uint64x2_t multiply routine.
Needs Review · Public

Authored by easyaspi314 on Dec 27 2018, 8:48 PM.

Details

Summary

Patch to fix bug 39967

There are a few further optimizations that can be made, but this is many times better than scalarizing.

These take up to 11 cycles; however, the smallest possible case is 7 cycles with two pre-interleaved values, or 4 cycles if a single vmull suffices.

This automatically tries to select the proper routine.
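
For reference, both routines build on the usual 32-bit decomposition of a 64-bit product, in which the high*high term falls entirely out of the low 64 bits. A minimal scalar sketch (illustrative names only, not part of the patch):

#include <stdint.h>

uint64_t mul64_by_halves(uint64_t a, uint64_t b) {
  uint64_t aLo = (uint32_t)a, aHi = a >> 32;
  uint64_t bLo = (uint32_t)b, bHi = b >> 32;
  /* aHi * bHi is shifted out of the low 64 bits entirely. */
  return aLo * bLo + ((aLo * bHi + aHi * bLo) << 32);
}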

twomul is used for multiplies that could not be simplified. It takes 7 instructions and 11 cycles. It looks like this:

vmovn.i64 topLo, top
vmovn.i64 botLo, bot
vrev64.32 bot, bot

vmul.i32 bot, bot, top
vpaddl.u32 bot, bot
vshl.i64 top, bot, #32
vmlal.u32 top, botLo, topLo
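
The same sequence expressed with NEON intrinsics, purely as an illustrative sketch (the names mirror the assembly above; the patch itself emits DAG nodes, not intrinsics): vrev64.32 swaps the 32-bit halves of each lane so a single vmul.i32 produces both cross products, and vpaddl.u32 sums each adjacent pair into a 64-bit cross term.

#include <arm_neon.h>

uint64x2_t twomul(uint64x2_t top, uint64x2_t bot) {
  uint32x2_t topLo = vmovn_u64(top);                            /* low 32 bits of each lane */
  uint32x2_t botLo = vmovn_u64(bot);
  uint32x4_t botRev = vrev64q_u32(vreinterpretq_u32_u64(bot));  /* swap halves in each lane */
  uint32x4_t cross = vmulq_u32(botRev, vreinterpretq_u32_u64(top)); /* hi*lo and lo*hi */
  uint64x2_t sum = vpaddlq_u32(cross);                          /* add adjacent cross products */
  uint64x2_t high = vshlq_n_u64(sum, 32);                       /* shift cross terms into place */
  return vmlal_u32(high, botLo, topLo);                         /* add the full lo*lo product */
}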

ssemul is simpler, clearer, and more modular. It has the same timing as twomul, but it has one more instruction.

However, when something can be simplified, such as pre-interleaving a load/constant or removing multiplies on bits that are known to be zero, ssemul ends up being faster and shorter.

vmovn.i64 topLo, top
vshrn.i64 topHi, top, #32
vmovn.i64 botLo, bot
vshrn.i64 botHi, bot, #32

vmull.u32 ret, topLo, botHi
vmlal.u32 ret, topHi, botLo
vshl.i64  ret, ret, #32 
vmlal.u32 ret, topLo, botLo
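
ssemul in intrinsics form, again only as a sketch using the names from the listing above (not what the patch emits):

#include <arm_neon.h>

uint64x2_t ssemul(uint64x2_t top, uint64x2_t bot) {
  uint32x2_t topLo = vmovn_u64(top);
  uint32x2_t topHi = vshrn_n_u64(top, 32);
  uint32x2_t botLo = vmovn_u64(bot);
  uint32x2_t botHi = vshrn_n_u64(bot, 32);

  uint64x2_t ret = vmull_u32(topLo, botHi);   /* lo * hi           */
  ret = vmlal_u32(ret, topHi, botLo);         /* += hi * lo        */
  ret = vshlq_n_u64(ret, 32);                 /* cross terms << 32 */
  return vmlal_u32(ret, topLo, botLo);        /* += lo * lo        */
}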

Some missing optimizations:

  1. Masking a uint64x2_t/v2i64 by 0xFFFFFFFF shouldn't actually generate a vand; it should instead remove one multiply and one vshrn.
  2. Constant interleaving is emitted as two vldr instructions. This might not be the most efficient, as I want a single vld1.64. I don't know how adr comes into play, but I think the load could run in parallel with the shifting on the other operand.
  3. Load interleaving should be implemented. If an operand is used only once and is loaded from a pointer, replacing vld1.64 with vld2.32 and using the two uint32x2_t/v2i32 values it produces also avoids the vshrn/vmovn (see the sketch after this list).
  4. The cost model is pretty weird for v2i64 on NEON. We should probably fix the "add 4 to the cost" hack, as NEON v2i64 vectors are not as expensive as that huge penalty makes them out to be.
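
A sketch of the load-interleaving idea from item 3 (hypothetical helper, assuming a little-endian in-memory layout): vld2.32 de-interleaves the low and high 32-bit halves as it loads, so the vmovn/vshrn pair for that operand disappears.

#include <arm_neon.h>

uint64x2_t mul_by_loaded(uint64x2_t top, const uint64_t *p) {
  uint32x2x2_t bot = vld2_u32((const uint32_t *)p);  /* val[0] = low halves, val[1] = high halves */
  uint32x2_t topLo = vmovn_u64(top);
  uint32x2_t topHi = vshrn_n_u64(top, 32);

  uint64x2_t ret = vmull_u32(topLo, bot.val[1]);     /* lo * hi           */
  ret = vmlal_u32(ret, topHi, bot.val[0]);           /* += hi * lo        */
  ret = vshlq_n_u64(ret, 32);                        /* cross terms << 32 */
  return vmlal_u32(ret, topLo, bot.val[0]);          /* += lo * lo        */
}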

Diff Detail

Repository
rL LLVM

Event Timeline

easyaspi314 created this revision.Dec 27 2018, 8:48 PM
easyaspi314 added inline comments.Dec 27 2018, 9:01 PM
lib/Target/ARM/ARMISelLowering.cpp
7485

Can someone help me here? I can't figure this one out.

easyaspi314 retitled this revision from [ARM]: WIP: Add optimized uint64x2_t multiply routine. to [ARM]: WIP: Add optimized NEON uint64x2_t multiply routine..Dec 27 2018, 9:09 PM
easyaspi314 edited the summary of this revision. (Show Details)
easyaspi314 edited the summary of this revision. (Show Details)
easyaspi314 retitled this revision from [ARM]: WIP: Add optimized NEON uint64x2_t multiply routine. to [ARM]: Add optimized NEON uint64x2_t multiply routine..

Actually, the remaining optimizations can be done later. At the very least, this is usable. I left the notes for anyone who wants to implement them, but I think that, at least for now, it is far better than it was.

I don't know who I should put for a reviewer, but I think it is ready for review.

*facepalms* Wow, I am an idiot.

Note to self: Check your PWD before running svn diff.

lib/Target/ARM/ARMISelLowering.cpp
7465

Alternatively, we could try to handle simplifications by adding a DAGCombine for VSHRu and VMULLu nodes with constant operands, instead of trying to avoid emitting them. The end result is roughly the same (it's probably unlikely to construct a trivial VSHRu or VMULLu any other way), but it might be easier to understand. But I guess it would get tricky if we tried to emit the form using "vrev64". Probably okay as-is.

Needs more test coverage for various cases where the upper/lower half of an operand is zero.

lib/Target/ARM/ARMTargetTransformInfo.cpp
528

Did you really mean UMULO (llvm.umul.with.overflow)? I mean, I guess it's theoretically possible to implement, but I think we'll just scalarize it at the moment.

(The cost model changes need test coverage in test/Analysis/CostModel/ARM/ .)

test/CodeGen/ARM/vmul.ll
69

DAGCombine should be able to catch the redundant AND... but it looks like DAGCombiner::visitTRUNCATE doesn't try to handle demanded bits for vectors. (I guess it didn't get updated when other operations got support for vector operands?)

craig.topper added inline comments.
test/CodeGen/ARM/vmul.ll
69

PR39689 mentions this is disabled for vectors. Maybe @RKSimon or @spatel are working on it?

easyaspi314 planned changes to this revision.Dec 29 2018, 5:55 PM
easyaspi314 marked 3 inline comments as done.

I am going to add constant interleaving and more tests, and call it done. I think twomul and interleaved loads are much smaller, lower-priority optimizations, and something I am not comfortable enough to do myself.

lib/Target/ARM/ARMISelLowering.cpp
7465

This is the exact same method used in X86ISelLowering.cpp/LowerMUL, except for the early return to work around missing optimizations later on.

If we use the twomul/vrev64 version, we would only use it when neither operand has known-zero halves and neither could be pre-interleaved. Once things get interleaved and/or we eliminate one of the multiplies, we lose the benefit.

lib/Target/ARM/ARMTargetTransformInfo.cpp
528

Oh yeah, I did that wrong. Should be normal multiply.

test/CodeGen/ARM/vmul.ll
69

That would explain why it was choking on this when X86 does not.

easyaspi314 updated this revision to Diff 179784. Edited Dec 31 2018, 4:58 PM
easyaspi314 edited the summary of this revision. (Show Details)

Fixed cost model and added cost model tests. I also made a larger diff.

I can't get constant interleaving working, as I can't seem to figure out how to detect a literal vector. Everything I tried returned false. I don't want to spend too much time on it. If any of you want to finish it, go ahead. It is still better than it was.

easyaspi314 marked an inline comment as done.Dec 31 2018, 4:59 PM
easyaspi314 planned changes to this revision.Jan 2 2019, 4:57 PM

I figured things out and I am on track to add twomul and constant interleaving.

This is what I have now. Constant swapping and twomul are now implemented, and I will update the patch once I finish the documentation and tests, as well as double-check the return values on my phone.

typedef unsigned long long v2i64 __attribute__((vector_size(16)));

v2i64 mult(v2i64 v1, v2i64 v2) {
  return v1 * v2;
}

v2i64 mult_lo(v2i64 v1, v2i64 v2) {
  return (v1 & 0xFFFFFFFF) * (v2 & 0xFFFFFFFF);
}

v2i64 mult_lo_lohi(v2i64 v1, v2i64 v2) {
  return (v1 & 0xFFFFFFFF) * v2;
}

v2i64 mult_constant(v2i64 v1) {
  return 1234567889904ULL * v1;
}

v2i64 mult_lo_constant(v2i64 v1) {
  return v1 * 1234ULL;
}
mult:
	vmov	d17, r2, r3
	mov	r12, sp
	vmov	d16, r0, r1
	vld1.64	{d18, d19}, [r12]
	vrev64.32	q10, q8
	vmovn.i64	d16, q8
	vmovn.i64	d17, q9
	vmul.i32	q10, q10, q9
	vpaddl.u32	q10, q10
	vshl.i64	q9, q10, #32
	vmlal.u32	q9, d17, d16
	vmov	r0, r1, d18
	vmov	r2, r3, d19
	bx	lr

mult_lo:
	vmov	d19, r2, r3
	vmov.i64	q8, #0xffffffff
	vmov	d18, r0, r1
	mov	r0, sp
	vand	q9, q9, q8
	vld1.64	{d20, d21}, [r0]
	vand	q8, q10, q8
	vmovn.i64	d18, q9
	vmovn.i64	d16, q8
	vmull.u32	q8, d16, d18
	vmov	r0, r1, d16
	vmov	r2, r3, d17
	bx	lr

mult_lo_lohi:
	vmov	d19, r2, r3
	vmov.i64	q8, #0xffffffff
	vmov	d18, r0, r1
	mov	r0, sp
	vld1.64	{d20, d21}, [r0]
	vand	q8, q9, q8
	vshrn.i64	d18, q10, #32
	vmovn.i64	d16, q8
	vmovn.i64	d17, q10
	vmull.u32	q9, d16, d18
	vshl.i64	q9, q9, #32
	vmlal.u32	q9, d16, d17
	vmov	r0, r1, d18
	vmov	r2, r3, d19
	bx	lr

mult_constant:
	adr	r12, .LCPI3_0
	vmov	d17, r2, r3
	vld1.64	{d18, d19}, [r12:128]
	vmov	d16, r0, r1
	vmul.i32	q9, q8, q9
	vldr	d20, .LCPI3_1
	vmovn.i64	d16, q8
	vpaddl.u32	q9, q9
	vshl.i64	q9, q9, #32
	vmlal.u32	q9, d16, d20
	vmov	r0, r1, d18
	vmov	r2, r3, d19
	bx	lr
.LCPI3_0:
	.long	1912275952
	.long	1912275952
	.long	287
	.long	287
.LCPI3_1:
	.long	1912275952
	.long	1912275952

mult_lo_constant:
        vmov    d17, r2, r3
        vldr    d18, .LCPI4_0
        vmov    d16, r0, r1
        vshrn.i64       d19, q8, #32
        vmovn.i64       d16, q8
        vmull.u32       q10, d19, d18
        vshl.i64        q10, q10, #32
        vmlal.u32       q10, d16, d18
        vmov    r0, r1, d20
        vmov    r2, r3, d21
        bx      lr
.LCPI4_0:
        .long   1234
        .long   1234

If someone can optimize that redundant load in mult_constant, I would appreciate it.

This is the code I want:

mult_constant:
	adr	r12, .LCPI3_0
	vmov	d17, r2, r3
	vld1.64	{d22, d23}, [r12:128]
	vmov	d16, r0, r1
	vmul.i32	q9, q8, q11
	vmovn.i64	d16, q8
	vpaddl.u32	q9, q9
	vshl.i64	q9, q9, #32
	vmlal.u32	q9, d16, d22
	vmov	r0, r1, d18
	vmov	r2, r3, d19
	bx	lr
.LCPI3_0:
	.long	1912275952
	.long	1912275952
	.long	287
	.long	287
easyaspi314 added a comment. Edited Jan 2 2019, 7:28 PM

Oops, that's not right. mult_constant isn't matching.

I think I am going to fall back to ssemul for all constants like I originally planned.

easyaspi314 added a comment. Edited Jan 3 2019, 9:28 AM

vmul.i32 Qd,Qn,Qm actually takes 4 cycles, which means twomul has the same timing as ssemul, 11 cycles.
@efriedma that explains why twomul wasn't visibly faster in my tests.

However, twomul saves an instruction, so I am keeping it.

twomul:
	vrev64.32	q10, q8        @ 1 cycle,  total 1
	vmovn.i64	d16, q8        @ 1 cycle,  total 2
	vmovn.i64	d17, q9        @ 1 cycle,  total 3
	vmul.i32	q10, q10, q9   @ 4 cycles, total 7
	vpaddl.u32	q10, q10       @ 1 cycle,  total 8
	vshl.i64	q9, q10, #32   @ 1 cycle,  total 9
	vmlal.u32	q9, d17, d16   @ 2 cycles, total 11
ssemul:
	vshrn.i64       d20, q8, #32   @ 1 cycle,  total 1
	vmovn.i64       d16, q8        @ 1 cycle,  total 2
	vmovn.i64       d21, q9        @ 1 cycle,  total 3
	vmull.u32       q11, d21, d20  @ 2 cycles, total 5
	vshrn.i64       d17, q9, #32   @ 1 cycle,  total 6
	vmlal.u32       q11, d17, d16  @ 2 cycles, total 8
	vshl.i64        q9, q11, #32   @ 1 cycle,  total 9
	vmlal.u32       q9, d21, d16   @ 2 cycles, total 11
easyaspi314 updated this revision to Diff 180192. Edited Jan 3 2019, 7:37 PM

Ok, done. I fixed twomul and constant interleaving, and added a couple more tests.

I also lowered the cost of multiplication so it is relative to a normal mul i64, which lets LLVM properly autovectorize it.

easyaspi314 edited the summary of this revision. (Show Details)Jan 3 2019, 8:27 PM
RKSimon added inline comments.Jan 6 2019, 12:51 PM
test/Analysis/CostModel/ARM/mult.ll
2

You might find the utils\update_analyze_test_checks.py script useful to make this more maintainable - see X86\arith.ll for examples.

test/CodeGen/ARM/vmul.ll
41

Please add these new tests to trunk with current codegen now, then rebase this patch so it shows the changes to codegen.

69

PR39689 mentions this is disabled for vectors. Maybe @RKSimon or @spatel are working on it?

I've been putting it off as there's a load of yak shaving to be done for it - but I will look again.

easyaspi314 marked 2 inline comments as done.Jan 6 2019, 7:03 PM
v2i64 mul5(v2i64 val)
{
    return val * 5;
}
mul5:
    vmov    d17, r2, r3
    vmov    d16, r0, r1
    vmov.i32    d18, #0x5
    vshrn.i64   d19, q8, #32
    vmovn.i64   d16, q8
    vmull.u32   q10, d19, d18
    vshl.i64    q10, q10, #32
    vmlal.u32   q10, d16, d18
    vmov    r0, r1, d20
    vmov    r2, r3, d21
    bx  lr

Ummmmmm…that should DEFINITELY be a shift+add. That would be much cheaper.
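
For comparison, this is the kind of lowering mul5 could get instead (a sketch with a hypothetical helper name, not something the patch currently produces): val * 5 == (val << 2) + val, i.e. one vshl.i64 plus one vadd.i64.

#include <arm_neon.h>

uint64x2_t mul5_shift_add(uint64x2_t val) {
  return vaddq_u64(vshlq_n_u64(val, 2), val);   /* vshl.i64 + vadd.i64 */
}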

The cost model is definitely messed up…nope, something else. I set the multiply cost to 80000, and it still chose it over shifts!

mul5:
    movdqa  xmm1, xmmword ptr [rip + LCPI0_0]
    movdqa  xmm2, xmm0
    pmuludq xmm2, xmm1
    psrlq   xmm0, 32
    pmuludq xmm0, xmm1
    psllq   xmm0, 32
    paddq   xmm0, xmm2
    ret

What?! We should be shifting + adding here too! Are we just not doing shift + adds for vectors?

What the heck?

test/Analysis/CostModel/ARM/mult.ll
2

Ok, will take a look at that.

test/CodeGen/ARM/vmul.ll
41

Okay, I will try to do that.

I just run svn update, right? I don't know why I chose SVN.

Updated tests with the generation tools, changed multiply cost to 8.

Each time I was about to submit it I got distracted and had to update and build again. ¯\_(ツ)_/¯

@easyaspi314 What happened with this?

Herald added a project: Restricted Project. Dec 23 2019, 9:59 PM
RKSimon resigned from this revision.Jan 25 2020, 1:58 AM