Page MenuHomePhabricator

[WIP][X86] Introduce 'blend with broadcast' shuffle lowering strategy (PR50971)

Authored by lebedev.ri on Aug 17 2021, 4:01 PM.



This needs more tests, but something along these lines is needed to catch the last motivational case from

        .globl  _Z6float1Dv4_dS_                # -- Begin function _Z6float1Dv4_dS_
        .p2align        4, 0x90
        .type   _Z6float1Dv4_dS_,@function
_Z6float1Dv4_dS_:                       # @_Z6float1Dv4_dS_
# %bb.0:
        vbroadcastsd    %xmm1, %ymm1
        vblendps        $192, %ymm1, %ymm0, %ymm0       # ymm0 = ymm0[0,1,2,3,4,5],ymm1[6,7]
        .size   _Z6float1Dv4_dS_, .Lfunc_end0-_Z6float1Dv4_dS_
                                        # -- End function
        .ident  "clang version 14.0.0 ( 41e3ac398c3ae9dfba5a57d80c420c122c1ec700)"
        .section        ".note.GNU-stack","",@progbits

Diff Detail

Event Timeline

lebedev.ri created this revision.Aug 17 2021, 4:01 PM
lebedev.ri requested review of this revision.Aug 17 2021, 4:01 PM

Fixed the obvious miscompile w/ input picking

Ok, this is mostly it.
Interestingly, the originally-reported pattern appears to be the only problematic one
out of an exhaustive test coverage in copy-low-subvec-elt-to-high-subvec-elt.ll.

Now, am i wildly off-base here, should i be simply using/extending some existing lowering?

Fix typo in comments, NFC.

Change the implementation approach, mostly piggy-back on lowerShuffleAsDecomposedShuffleMerge() - this seems cleaner?

Split preparatory change into D108382, NFC.

Rebased, NFC.

Rebased ontop of D108411.

Rebased, NFC.

lebedev.ri updated this revision to Diff 370082.Sep 1 2021, 3:10 PM

Rebased, no code changes.

lebedev.ri edited reviewers, added: spatel; removed: adpatel6.Sep 1 2021, 3:10 PM

Rebased, ping.
*Something* like this is still needed to address the last motivational pattern
(float1()) from

RKSimon added a comment.EditedSep 19 2021, 8:38 AM
This comment has been deleted.

Any thoughts on the approach? Is this conceptually wrong?

pengfei added inline comments.Sep 28 2021, 7:30 PM

Why is these been affected?


Is this a regression?


Not sure if there is the right direction. Does the vmovlhps + vpermpd have better performance for cases AVX512VL-SLOW and AVX512VL-FAST-PERLANE?
Besides, this does show relationship with broadcast.

RKSimon added inline comments.Sep 29 2021, 2:50 AM

lowerShuffleAsDecomposedShuffleMerge does some of this already - can we not extend that?

lebedev.ri added inline comments.Sep 29 2021, 6:49 AM

I'm not following. That is exactly what i'm doing here, extending lowerShuffleAsDecomposedShuffleMerge().
The point of lowerShuffleAsBlendOfIdentities() is to be able to run the relevant portion
of lowerShuffleAsDecomposedShuffleMerge() at some earlier point
(before the full ​lowerShuffleAsDecomposedShuffleMerge() invocation),
catching the target pattern only.

RKSimon added inline comments.Sep 29 2021, 9:29 AM

I'm sorry, that was very badly worded - I was looking at the MVT::v4i64 case which is calling lowerShuffleAsBlendOfIdentities directly before lowerShuffleAsDecomposedShuffleMerge - not sure why I put the comment up here instead.

lebedev.ri marked 2 inline comments as done.

Rebased, dropped redundant code.



RKSimon added inline comments.Sep 29 2021, 10:22 AM

We're doing something very similar here as well - but I have a suspicion that the lowerShuffleAsLanePermuteAndSHUFP is placed very specifically .....

Matt added a subscriber: Matt.Oct 6 2021, 9:13 AM
lebedev.ri abandoned this revision.Jan 17 2022, 2:37 PM