This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Improve shuffles of i1 vectors (WIP)
Needs Review · Public

Authored by rjj on Jun 13 2023, 9:22 AM.

Details

Summary

If the operands of a shuffle are concatenations of i1 vectors and undef,
this patch tries to rewrite the operands without the concatenation by doing a
bitcast instead. This avoids BUILD_VECTORs during legalisation, which
ultimately leads to better codegen.

This is WIP because two tests are currently failing and need looking into, but
in the meantime I wanted to get your opinion on the approach, @dmgreen.

Event Timeline

rjj created this revision. Jun 13 2023, 9:22 AM
Herald added a project: Restricted Project. Jun 13 2023, 9:22 AM
rjj requested review of this revision. Jun 13 2023, 9:22 AM

Just for completeness, this is related to: https://reviews.llvm.org/D142359

It (partially) fixes a regression that shows up when we reduce the vector insertion costs: the "distance" example from cmsisdsp: https://godbolt.org/z/1xz7GP3fM.

Matt added a subscriber: Matt. Jun 13 2023, 12:57 PM

Thanks for uploading this. It does look very similar to the case I was looking at recently, but goes further than the optimization I was trying. (That was about altering anyext(buildvector(..)), not doing it earlier which looks like it allows the shuffles to lower more efficiently). I will try to take a look, to get my head around how this works.

rjj added a comment. Jun 14 2023, 7:28 AM

> Thanks for uploading this. It does look very similar to the case I was looking at recently, but goes further than the optimization I was trying. (That was about altering anyext(buildvector(..)), not doing it earlier which looks like it allows the shuffles to lower more efficiently). I will try to take a look, to get my head around how this works.

Thanks for taking a look! To give a bit of context, this was brought up by the following example (from cmsisdsp iirc):

define <4 x i1> @t1(<2 x i1> %a, <2 x i1> %b) {
  %r = shufflevector <2 x i1> %a, <2 x i1> %b, <4 x i32> <i32 0, i32 3, i32 0, i32 3>
  ret <4 x i1> %r
}

This was getting lowered to:

t1:                                     // @t1
        fmov    d2, d0
        mov     w8, v1.s[1]
        mov     v0.16b, v2.16b
        mov     v0.h[1], w8
        mov     v0.h[2], v2.h[0]
        mov     v0.h[3], w8
        ret

which is essentially an element-by-element insert.

However, by taking advantage of the symmetry of the mask, <(0, 3), (0, 3)>, we could do better if we used a bigger element type to replicate the first half of the mask into the second, e.g.:

t1:
        mov     v0.h[1], v1.h[2]
        mov     v0.s[1], v0.s[0]
        ret
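
To make the symmetry check concrete, here is a minimal sketch (the helper and its name are made up for illustration, this is not code from the patch):

#include "llvm/ADT/ArrayRef.h"

// A mask like <0,3,0,3> repeats its first half, so once the first half of
// the result has been built, the second half is a single copy at twice the
// element width (the "mov v0.s[1], v0.s[0]" above duplicates two i16 lanes
// as one i32 lane).
static bool maskRepeatsFirstHalf(llvm::ArrayRef<int> Mask) {
  unsigned Half = Mask.size() / 2;
  for (unsigned I = 0; I != Half; ++I)
    if (Mask[I] != Mask[I + Half])
      return false;
  return true;
}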

As I was working on this, I noticed the compiler was actually able to do something similar if we used i32s (for example) instead of i1s, but it was getting stuck with the latter: its initial SelectionDAG consisted of vector_shuffle (concat (tx, undef), concat (ty, undef)), which would get type-legalised into BUILD_VECTORs and eventually lead to a sequence of inserts.

Since during legalisation the CopyFromReg nodes were getting bitcast into smaller types (v4i16 in this case), I thought I could cut out the middleman and cast them earlier in the pipeline (making the necessary adjustments), thus avoiding the undef concatenations. The rest of the optimisations fell out of this.
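
To make the shape of the rewrite concrete, here is a minimal sketch of such a combine (my illustration, not the patch's code; for simplicity it folds the two undef concats into one real concat with a remapped mask, whereas the patch rewrites the operands with a bitcast):

#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/SelectionDAG.h"

using namespace llvm;

// Sketch only: match shuffle(concat(X, undef), concat(Y, undef)) where X
// and Y are narrow i1 vectors, and rebuild the shuffle so that no
// undef-padded operand survives into type legalisation.
static SDValue combineShuffleOfConcatUndef(ShuffleVectorSDNode *SVN,
                                           SelectionDAG &DAG) {
  EVT VT = SVN->getValueType(0);
  SDValue N0 = SVN->getOperand(0), N1 = SVN->getOperand(1);

  // Both shuffle operands must be concat(x, undef).
  auto IsConcatWithUndef = [](SDValue V) {
    return V.getOpcode() == ISD::CONCAT_VECTORS && V.getNumOperands() == 2 &&
           V.getOperand(1).isUndef();
  };
  if (!IsConcatWithUndef(N0) || !IsConcatWithUndef(N1))
    return SDValue();

  SDValue X = N0.getOperand(0), Y = N1.getOperand(0);
  EVT HalfVT = X.getValueType();
  unsigned NumElts = VT.getVectorNumElements();
  unsigned NumHalf = HalfVT.getVectorNumElements();
  if (Y.getValueType() != HalfVT || NumElts != 2 * NumHalf ||
      HalfVT.getVectorElementType() != MVT::i1)
    return SDValue();

  // Lanes [0, NumHalf) of operand 0 come from X; lanes
  // [NumElts, NumElts + NumHalf) of operand 1 come from Y, which sits
  // right after X in concat(X, Y). Everything else was undef padding.
  SmallVector<int, 8> NewMask;
  for (int M : SVN->getMask()) {
    if (M >= 0 && M < (int)NumHalf)
      NewMask.push_back(M);
    else if (M >= (int)NumElts && M < (int)(NumElts + NumHalf))
      NewMask.push_back(M - (int)NumElts + (int)NumHalf);
    else
      NewMask.push_back(-1);
  }

  SDLoc DL(SVN);
  SDValue XY = DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, X, Y);
  return DAG.getVectorShuffle(VT, DL, XY, DAG.getUNDEF(VT), NewMask);
}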

In the end, for the example above we got the following assembly:

t1:
        mov     v0.h[2], v1.h[2]
        uzp1    v0.4h, v0.4h, v0.4h
        ret

which, despite being a bit less obvious, is functionally equivalent to the assembly with two moves and should have the same performance on the V2.

You can check this example on Alive2 here: https://alive2.llvm.org/ce/z/297iBB, where I've essentially written the IR that corresponds to the initial SelectionDAGs before and after the patch.

Hello. Sorry for the delay. This seems to work OK for AArch64, but I'm not sure how much other architectures would like it. It might be worth adding it to AArch64ISelLowering instead, as it is easier there to make assumptions about which types would be better.

I was trying it with the motivating example, and whilst it did help by reducing the instruction count by a lot, it wasn't enough for it to become profitable compared to scalar. There seems to just be too much shuffling going on. I am attempting to see if I can stop the vectorization in that case instead. We've seen this come up a few times though, so it would be good to get an improvement like this in if we can.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:24742

This creates a setcc with the result type set to the input type? What happens for targets that want to use a native vxi1 datatype because they have native predicate registers?

There is a getSetCCResultType method in which the target specifies what the setcc result type should become. There is also a TargetLowering::BooleanContent that specifies whether the bits of the predicate produced by a setcc should be 0/1, 0/-1, or whether the top bits are undefined.
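
As a minimal sketch of those two hooks (the function and variable names are illustrative, not from the patch):

#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/TargetLowering.h"

using namespace llvm;

// Illustrative only: build a setcc whose result type comes from the
// target rather than being forced to the input type.
static SDValue buildTargetAwareSetCC(SelectionDAG &DAG, const SDLoc &DL,
                                     SDValue LHS, SDValue RHS) {
  const TargetLowering &TLI = DAG.getTargetLoweringInfo();

  // Targets with native predicate registers can return a vxi1 type here.
  EVT CCVT = TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(),
                                    LHS.getValueType());

  // BooleanContent says what the predicate bits look like: 0/1, 0/-1, or
  // "only the low bit is defined"; a real combine would extend or mask
  // its result accordingly.
  TargetLowering::BooleanContent BC =
      TLI.getBooleanContents(LHS.getValueType());
  (void)BC;

  return DAG.getSetCC(DL, CCVT, LHS, RHS, ISD::SETEQ);
}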

> There seems to just be too much shuffling going on. I am attempting to see if I can stop the vectorization in that case instead.

Ok, so park this and instead focus on reining in vectorisation? I guess that makes sense and is another way of doing it.
I can't remember why we went for fixing the codegen. Perhaps it was just a fun exercise, and it avoids looking at the cost-modelling again if we manage to get the vector code on par with the scalar code (that would be good enough for this case).

It certainly looks like the code is better, and that this would be an improvement. But the scores didn't move as much as expected when I tried it. (The example has a pretty long SLP vectorization chain that reaches over multiple loops.)

This would be useful if you did want to continue it, but I don't think it should be a blocker for anything.

Ok, got it, cheers.