
[ARM] [NEON] Add ROTR/ROTL lowering
Needs Review · Public

Authored by easyaspi314 on Jan 8 2019, 9:14 PM.

Details

Summary

This patch adds support for converting ISD::ROTR and ISD::ROTL into either a vshl/vsri or a vrevN for ARM32 NEON.

The same lowering would also work for AArch64; I will probably add that later when I get a chance.

vshl/vsri vs. vshl/vshr/vorr saves one instruction, and vrevN is a single-cycle rotl for an N/2-bit rotation (e.g. a 32-bit rotation on a 64-bit lane).

Diff Detail

Event Timeline

easyaspi314 created this revision. Jan 8 2019, 9:14 PM

I'm not sure this is really the best approach... essentially, there are two relevant transforms here:

  1. A rotate by a multiple of 8 can be transformed into a shuffle. I guess the only case that's really relevant on ARM is vrev, since there aren't any other single-instruction shuffles that correspond to a rotate, so maybe it's okay to just special-case here.
  2. (OR X, (SRL Y, N)) can be transformed to VSRI if X has enough known trailing zeros. You can special-case rotates (or slightly more generally, FSHL/FSHR), but it's not much harder to handle the general case.

I'm also a little concerned that the VSRI could actually be slower in certain cases... if you look at timings for a Cortex-A57, 128-bit VSRI takes two cycles throughput to execute, as opposed to one for a regular shift.

lib/Target/ARM/ARMISelLowering.cpp
8066

llvm_unreachable; there isn't any other possible vector type.

easyaspi314 planned changes to this revision. Edited Jan 9 2019, 2:29 PM
easyaspi314 marked an inline comment as done.

I'm not sure this is really the best approach... essentially, there are two relevant transforms here:

  1. A rotate by a multiple of 8 can be transformed into a shuffle. I guess the only case that's really relevant on ARM is vrev, since there aren't any other single-instruction shuffles that correspond to a rotate, so maybe it's okay to just special-case here.

Yeah, and a general shuffle also needs its index pattern loaded: one cycle to load the literal and another to do the shuffle. Additionally, ARMv7-A only supports shuffling 64-bit vectors, so using it on a 128-bit vector would require two shuffles.

  2. (OR X, (SRL Y, N)) can be transformed to VSRI if X has enough known trailing zeros. You can special-case rotates (or slightly more generally, FSHL/FSHR), but it's not much harder to handle the general case.

True. I should probably implement that.

I'm also a little concerned that the VSRI could actually be slower in certain cases... if you look at timings for a Cortex-A57, 128-bit VSRI takes two cycles throughput to execute, as opposed to one for a regular shift.

Well, VSRI replaces both the VSHR and the VORR, which take a single cycle each. So even if it has no latency benefit, it saves an instruction.

lib/Target/ARM/ARMISelLowering.cpp
8040
SDValue Temporary = DAG.getNode(Left ? ISD::SHL : ISD::SRL, DL, VT, Value, Amount);
Value = DAG.getNode(Left ? ISD::SHL : ISD::SRL, DL, VT, Value, Amount);

Both lines use Left ? ISD::SHL : ISD::SRL, so the two halves of the rotate shift in the same direction; the second one should use the opposite opcode.

Wow. I'm an idiot.

Huh. Does clang even emit FSHL/FSHR instructions?

u32x4 fshr32_13(u32x4 val, u32x4 val2)
{
    return (val << (32 - (13 % 32))) | (val2 >> (13 % 32));
}

That is literally the contents of the code comment describing it.

I can do this, though:

declare <4 x i32> @llvm.fshr.v4i32(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c)
define <4 x i32> @fshr32_13(<4 x i32> %val1, <4 x i32> %val2) nounwind
{
  %r = call <4 x i32> @llvm.fshr.v4i32(<4 x i32> %val1, <4 x i32> %val2, <4 x i32> <i32 13, i32 13, i32 13, i32 13>)
  ret <4 x i32> %r
}

I do want to mention that VSLI is not beneficial when the result is required to end up in the same register as before, since this pattern places the result in a different register.

The only ways I can see that being an issue are if we are using every single NEON register (we have at least 16), or if we are using it in a standalone function. However, since Clang has a (bad?) habit of moving every vector passed or returned through normal registers, and functions that directly take SIMD vectors should be inlined anyway, it isn't a huge deal.

Does clang even emit FSHL/FSHR instructions

This is an area of active development; llvm.fshl got added recently. But it's not really a priority to form llvm.fshl for constant shifts; it's easy to analyze anyway.

easyaspi314 added a comment. Edited Jan 9 2019, 6:48 PM

Huh. Apparently, vshr/vsli is actually faster.

I used the benchmark tool on my NEON-optimized xxHash variant to test this, as it is a real-life use of the rotate pattern.

Compiled on my LG G3 (Quad core Snapdragon 801/Cortex-A15 underclocked to 1.8 GHz) with Clang 7.0.1 from the Termux repos, and benchmarked in the Termux app while tapping the screen to maintain a stable frequency.
clang -march=native -O2

The main loop basically looks like this:

    uint32x4_t v;
    const uint32x4_t prime1, prime2; // literals
    const uint8_t *p, *limit; // unaligned data pointers
    ...
    do {
        /* note: vld1q_u8 is to work around alignment bug */
        const uint32x4_t inp = vreinterpretq_u32_u8(vld1q_u8(p));
        v = vmlaq_u32(v, prime2, inp);
#ifdef VSLI
        v = vsliq_n_u32(vshrq_n_u32(v, 19), v, 13);
#else 
        v = vorrq_u32(vshrq_n_u32(v, 19), vshlq_n_u32(v, 13));
#endif 
        v = vmulq_u32(v, prime1);
        p += 16;
    } while (p < limit);

The benchmark gets 4.1 GB/s with -DVSLI, but only 3.7 GB/s with vshl/vorr. Similarly, the variation using two vectors in parallel gets 5.7 GB/s with -DVSLI, but only 5.3 GB/s without.

Considering that all the other variables are the same, I presume writeback latency is to blame.

Fixed my incredibly stupid typo, added FSHL/FSHR support, and used llvm_unreachable instead of the ugly goto.

I didn't add additional tests for FSHL/FSHR yet.

Definitely the result/writeback cycles.

                             @ Cy / Re / Wr
vld1.8 { inpLo, inpHi }, [p] @  2 /  2 /  6
vmla.i32 v, inp, prime2      @  4 /  9 /  9
vshr.u32 v2, v, #19          @  1 /  3 /  6
vsli.32 v2, v, #13           @  2 /  4 /  7
vmul.i32 v, v2, prime1       @  4 /  9 /  9
                             @ 13 / 27 / 37
vld1.8 { inpLo, inpHi }, [p] @  2 /  2 /  6
vmla.i32 v, inp, prime2      @  4 /  9 /  9
vshr.u32 tmp, v, #19         @  1 /  3 /  6
vshl.i32 v, v, #13           @  1 /  3 /  6
vorr v, v, tmp               @  1 /  3 /  6
vmul.i32 v, v, prime1        @  4 /  9 /  9
                             @ 13 / 29 / 42

If we count result cycles, we get 29 cycles with vshr/vshl/vorr and 27 cycles with vshr/vsli: 29/27 = 1.074. If we count writeback cycles, we get 42/37 = 1.135. That checks out with the 1.10x ratio I saw in the benchmark, as it lands right in that range.

This comment was removed by RKSimon.
efriedma added inline comments. Jan 14 2019, 6:35 PM
lib/Target/ARM/ARMISelLowering.cpp
8048

You can just return SDValue(); here, rather than explicitly expanding it.

8055

The shift amount is modulo the size of the elements.

RKSimon added inline comments. Jan 15 2019, 1:20 AM
lib/Target/ARM/ARMISelLowering.cpp
851

If you're doing this, please add suitable fshl/fshr test coverage.

for reference:
llvm\test\CodeGen\X86\vector-fshl-128.ll
llvm\test\CodeGen\X86\vector-fshr-128.ll
llvm\test\CodeGen\X86\vector-fshl-rot-128.ll
llvm\test\CodeGen\X86\vector-fshr-rot-128.ll

Huh. Sorry about the idle time.

I noticed that there is an even faster way of doing a rotate, which is replacing vshr/vsli with vshl/vsra.

It appears to save at least one cycle, and in the xxhash benchmark, I go from 5.7 GB/s to almost 6.2 GB/s.

Doing this also fixes this bug:

vsraq_n_u32(vshlq_n_u32(val, 13), val, 19);

should not generate

vshl.i32 tmp, val, #13 
vshr.u32 val, val, #19 
vorr val, val, tmp

@easyaspi314 What happened with this?

RKSimon resigned from this revision.Jan 25 2020, 1:59 AM