This is an archive of the discontinued LLVM Phabricator instance.

[X86] Lower insertions into upper half of an 256-bit vector as broadcast+blend (PR50971)
ClosedPublic

Authored by lebedev.ri on Jul 3 2021, 1:07 AM.

Diff Detail

Event Timeline

lebedev.ri created this revision.Jul 3 2021, 1:07 AM
lebedev.ri requested review of this revision.Jul 3 2021, 1:07 AM

I think the premise is sound, but creating variable shuffle/blend masks isn't great - it's also uncovering a number of other poor codegen issues that need addressing.

llvm/lib/Target/X86/X86ISelLowering.cpp
19196

Yes - it's very tricky to see the effect of a domain-crossing penalty on targets capable of broadcasts, so casts are fine.

llvm/test/CodeGen/X86/avx512-insert-extract.ll
13

Is it really worth loading a variable shuffle mask?

llvm/test/CodeGen/X86/insertelement-shuffle.ll
44

Any idea what's going on here?

llvm/test/CodeGen/X86/masked_load.ll
5649

This definitely looks like a regression.

I think the premise is sound, but creating variable shuffle/blend masks isn't great - it's also uncovering a number of other poor codegen issues that need addressing.

My main question: presumably we only want to do this iff that is the only insertion into that 128-bit-wide subreg?

lebedev.ri marked 3 inline comments as done.

Addressing review notes:

  1. Allow i32/i64 for AVX (just pretend they are f32/f64; see the sketch after this list)
  2. Only allow YMM vectors, disallow ZMM vectors
  3. Disallow i8 even though we can handle it - we'd have to load the mask
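
For illustration, a minimal sketch of the bitcast trick from note 1, assuming a hypothetical lowerAsBroadcastBlend() helper for the actual broadcast+blend emission; the patch's real code path and names differ:

static SDValue lowerIntInsertViaFP(SDValue IntVec, SDValue IntElt,
                                   unsigned IdxVal, const SDLoc &DL,
                                   SelectionDAG &DAG) {
  // Sketch: reuse the f32/f64 lowering for i32/i64 elements by
  // round-tripping through bitcasts. AVX1 has FP blends (vblendps /
  // vblendpd) on 256-bit vectors but no integer ones, and per the earlier
  // review note the domain-crossing penalty is hard to observe on
  // broadcast-capable targets. lowerAsBroadcastBlend() is an illustrative
  // placeholder, not a real helper.
  assert(IntVec.getValueType() == MVT::v8i32 && "Sketch only handles v8i32");
  SDValue FpVec = DAG.getBitcast(MVT::v8f32, IntVec);
  SDValue FpElt = DAG.getBitcast(MVT::f32, IntElt);
  SDValue FpRes = lowerAsBroadcastBlend(FpVec, FpElt, IdxVal, DL, DAG);
  return DAG.getBitcast(MVT::v8i32, FpRes);
}
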
lebedev.ri updated this revision to Diff 356355.Jul 3 2021, 1:08 PM
lebedev.ri marked an inline comment as done.

Add broadcast(extract_vector_elt(x, 0)) -> broadcast(x) fold, to address one more regression.

llvm/test/CodeGen/X86/insertelement-shuffle.ll
44
Optimized legalized selection DAG: %bb.0 'insert_subvector_512:'
SelectionDAG has 24 nodes:
  t0: ch = EntryToken
      t6: v4i64,ch = CopyFromReg t0, Register:v4i64 %2
                t2: i32,ch = CopyFromReg t0, Register:i32 %0
              t41: v4i32 = scalar_to_vector t2
              t4: i32,ch = CopyFromReg t0, Register:i32 %1
            t43: v4i32 = insert_vector_elt t41, t4, Constant:i64<1>
          t35: v2i64 = bitcast t43
        t36: i64 = extract_vector_elt t35, Constant:i64<0>
      t47: v4i64 = X86ISD::VBROADCAST t36
    t45: v4i64 = X86ISD::BLENDI t6, t47, TargetConstant:i8<4>
  t26: ch,glue = CopyToReg t0, Register:v4i64 $ymm0, t45
    t8: v4i64,ch = CopyFromReg t0, Register:v4i64 %3
  t28: ch,glue = CopyToReg t26, Register:v4i64 $ymm1, t8, t26:1
  t29: ch = X86ISD::RET_FLAG t28, TargetConstant:i32<0>, Register:v4i64 $ymm0, Register:v4i64 $ymm1, t28:1

We were missing the broadcast(extract_vector_elt(x, 0)) -> broadcast(x) fold.
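
For reference, a minimal sketch of such a fold as it might sit in the X86ISD::VBROADCAST handling of combineTargetShuffle() in X86ISelLowering.cpp, assuming Src, DL and VT are the broadcast's source operand, debug location and result type in the surrounding combine; the exact guards in the committed patch may differ:

// Sketch: broadcast(extract_vector_elt(x, 0)) -> broadcast(x).
// X86ISD::VBROADCAST splats lane 0 of a vector source, so extracting
// lane 0 to a scalar first is redundant. The isTypeLegal() guard matches
// the legal-type check requested in the inline review note below.
if (Src.getOpcode() == ISD::EXTRACT_VECTOR_ELT &&
    isNullConstant(Src.getOperand(1)) &&
    DAG.getTargetLoweringInfo().isTypeLegal(Src.getOperand(0).getValueType()))
  return DAG.getNode(X86ISD::VBROADCAST, DL, VT, Src.getOperand(0));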

lebedev.ri retitled this revision from [X86] Lower insertions into non-0'th 128-bit subvector as broadcast+blend (PR50971) to [X86] Lower insertions into upper half of an 256-bit vector as broadcast+blend (PR50971).

Actually, AVX1 has no from-register broadcasts, only 32/64-bit from-memory broadcasts.
Not sure why I thought otherwise.
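
A sketch of what that restriction could look like in the lowering entry point; the exact condition shape is an assumption, and MayFoldLoad() is the existing helper suggested in the inline comments below:

// AVX1's vbroadcastss/vbroadcastsd only take a memory operand; the
// register-source forms (and the 8/16-bit vpbroadcastb/vpbroadcastw)
// were added in AVX2. So without AVX2, only proceed when the scalar can
// be folded into a load and thus broadcast straight from memory.
if (!Subtarget.hasAVX2() && !MayFoldLoad(Scalar))
  return SDValue();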

RKSimon added inline comments.Jul 5 2021, 9:01 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
19193

Move the assert(isPowerOf2_32(NumEltsIn128)) as well? And add an assert message to match the style guide.

19199

Use MayFoldLoad?

37998

We might need a legal type check on Src.getOperand(0) before introducing a target opcode?

lebedev.ri updated this revision to Diff 356515.Jul 5 2021, 9:28 AM
lebedev.ri marked 3 inline comments as done.

Addressing nits.

craig.topper added inline comments.Jul 6 2021, 10:24 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
19203

X86 doesn't use ISD::SPLAT_VECTOR. I think this should be DAG.getSplatBuildVector.
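
A one-line illustration of the suggestion, assuming Scalar holds the inserted element and VT/dl come from the surrounding lowering code:

// Build the splat as a BUILD_VECTOR of repeated scalar operands, which the
// X86 backend pattern-matches; ISD::SPLAT_VECTOR is (as of this review)
// only used for scalable vectors on other targets.
SDValue Splat = DAG.getSplatBuildVector(VT, dl, Scalar);
// rather than:
// SDValue Splat = DAG.getNode(ISD::SPLAT_VECTOR, dl, VT, Scalar);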

lebedev.ri marked an inline comment as done.

Addressing review nit.

craig.topper added inline comments.Jul 6 2021, 11:32 AM
llvm/test/CodeGen/X86/avx512-insert-extract.ll
690–707

Is this really better? I assume this is what we get for an AVX2 target too, not just KNL?

lebedev.ri added inline comments.Jul 6 2021, 12:32 PM
llvm/test/CodeGen/X86/avx512-insert-extract.ll
690–707

The multi-insert case does seem questionable, yes. We could improve this via:

define <16 x i16> @src(<16 x i16> %x, i16 %y, i16* %ptr) {
  %val = load i16, i16* %ptr
  %r1 = insertelement <16 x i16> %x, i16 %val, i32 1
  %r2 = insertelement <16 x i16> %r1, i16 %y, i32 9
  ret <16 x i16> %r2
}
define <16 x i16> @tgt(<16 x i16> %x, i16 %y, i16* %ptr) {
  %val = load i16, i16* %ptr
  %r1 = insertelement <16 x i16> undef, i16 %val, i32 1
  %r2 = insertelement <16 x i16> %r1, i16 %y, i32 9
  %r3 = select <16 x i1> <i1 1, i1 0, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 0, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1>, <16 x i16> %x, <16 x i16> %r2
  ret <16 x i16> %r3
}

then we get

        .text
        .file   "test.ll"
        .globl  src                             # -- Begin function src
        .p2align        4, 0x90
        .type   src,@function
src:                                    # @src
        .cfi_startproc
# %bb.0:
        vpbroadcastw    (%rsi), %xmm1
        vpblendw        $2, %xmm1, %xmm0, %xmm1         # xmm1 = xmm0[0],xmm1[1],xmm0[2,3,4,5,6,7]
        vmovd   %edi, %xmm2
        vpbroadcastw    %xmm2, %ymm2
        vpblendw        $2, %ymm2, %ymm0, %ymm0         # ymm0 = ymm0[0],ymm2[1],ymm0[2,3,4,5,6,7,8],ymm2[9],ymm0[10,11,12,13,14,15]
        vpblendd        $240, %ymm0, %ymm1, %ymm0       # ymm0 = ymm1[0,1,2,3],ymm0[4,5,6,7]
        retq
.Lfunc_end0:
        .size   src, .Lfunc_end0-src
        .cfi_endproc
                                        # -- End function
        .globl  tgt                             # -- Begin function tgt
        .p2align        4, 0x90
        .type   tgt,@function
tgt:                                    # @tgt
        .cfi_startproc
# %bb.0:
        vpbroadcastw    (%rsi), %xmm1
        vmovd   %edi, %xmm2
        vpslld  $16, %xmm2, %xmm2
        vinserti128     $1, %xmm2, %ymm1, %ymm1
        vpblendw        $2, %ymm1, %ymm0, %ymm0         # ymm0 = ymm0[0],ymm1[1],ymm0[2,3,4,5,6,7,8],ymm1[9],ymm0[10,11,12,13,14,15]
        retq
.Lfunc_end1:
        .size   tgt, .Lfunc_end1-tgt
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits
lebedev.ri added inline comments.Jul 6 2021, 3:09 PM
llvm/test/CodeGen/X86/avx512-insert-extract.ll
690–707

... something like D105514, but clearly that is also not as straightforward.
Thoughts?

craig.topper added inline comments.Jul 7 2021, 12:23 PM
llvm/test/CodeGen/X86/avx512-insert-extract.ll
690–707

I was more questioning trading 3 instructions for the scalar-to-vector copy, broadcast, and 2 blends. But it turns out vpinsrw is slower than I realized on Haswell.

706

vpermi2w is 3 uops, 2 of which have 3-cycle latency and are serialized. I think the two blends we got on AVX2 would be better. That's probably a separate issue in shuffle lowering/combining.

820

Again, I'd expect 2 blends to be better.

lebedev.ri marked an inline comment as done.Jul 7 2021, 12:32 PM
lebedev.ri added inline comments.
llvm/test/CodeGen/X86/avx512-insert-extract.ll
690–707

Ah, so we agree that this is good for the upper subvector in general.
Should we perhaps be doing this for the lower subvector too?

706

Right. This is a separate problem, in combineX86ShufflesRecursively() I would guess.

lebedev.ri marked an inline comment as done.Jul 13 2021, 3:28 AM

ping

RKSimon added inline comments.Jul 13 2021, 5:29 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
38000

Do we get any changes in the current tests if we pull this out as a preliminary patch?

llvm/test/CodeGen/X86/avx512-insert-extract.ll
706

The 'AllowBWIVPERMV3' logic in combineX86ShuffleChain is probably slightly off.

lebedev.ri marked an inline comment as done.Jul 13 2021, 5:30 AM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
38000

Sadly, I tried that originally and we do not; otherwise I would have made this a separate patch :/

RKSimon added inline comments.Jul 19 2021, 7:46 AM
llvm/test/CodeGen/X86/masked_gather.ll
1420

Just noticed this on D106280 - I don't suppose you know why we fail to merge these identical broadcasts?

lebedev.ri marked an inline comment as done.Jul 19 2021, 7:53 AM
lebedev.ri added inline comments.
llvm/test/CodeGen/X86/masked_gather.ll
1420

I'm not sure I follow. This inserts c+28(%rip) into the 4th 32-bit element of ymm0.
How/what would you expect it to look like?

RKSimon added inline comments.Jul 19 2021, 9:28 AM
llvm/test/CodeGen/X86/masked_gather.ll
1420

Aren't all the "broadcastss c+28(%rip), XXXX" cases broadcasting the same memory location? The IR looks like the gep is splatting element 3 of the pointer array to every gather address.

lebedev.ri added inline comments.Jul 19 2021, 10:06 AM
llvm/test/CodeGen/X86/masked_gather.ll
1420

Right. Well, I'm not sure where we'd do that. And what do you mean by merge?

They are scalarized by the Scalarize Masked Memory Intrinsics (scalarize-masked-mem-intrin) pass, which is a codegen pass. I'm not sure how we could do that in DAGCombine, since we only see a single basic block at a time and we don't have any heavy-lifting passes this late.

Is this waiting on some changes from my side?

Do we have any test coverage of repeated insertions of the same scalar into different elements? Either same subvector or different subvectors.

llvm/test/CodeGen/X86/avx512-insert-extract.ll
690–707

Did you have any luck testing broadcasts into the lower subvector?

RKSimon added inline comments.Jul 25 2021, 6:23 AM
llvm/test/CodeGen/X86/avx512-insert-extract.ll
706

rG15b883f45771 should address this.

lebedev.ri marked 3 inline comments as done.Jul 25 2021, 6:59 AM
lebedev.ri marked an inline comment as done.Jul 25 2021, 7:21 AM
lebedev.ri added inline comments.
llvm/test/CodeGen/X86/avx512-insert-extract.ll
690–707

I briefly looked at the test changes without the high-subvector limitation, and they aren't obviously better, so I'm not planning on touching that here.

lebedev.ri marked an inline comment as done.

Do we have any test coverage of repeated insertions of the same scalar into different elements? Either same subvector or different subvectors.

We do now, added to avx-insertelt.ll.

RKSimon added inline comments.Jul 25 2021, 8:28 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
19199

Maybe generalize the IdxVal >= NumEltsIn128 limit to insert with broadcast if the scalar is already used in (a) another insertelement/pinsrw/pinsrb, (b) scalar_to_vector or (c) broadcast.
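
For illustration, the suggested check might look something like the helper below; the helper name and exact opcode list are hypothetical, not from the patch:

// Hypothetical helper: does the inserted scalar already feed another vector
// insertion, a scalar_to_vector, or a broadcast? If so, the scalar->vector
// domain crossing is paid for anyway, making the broadcast+blend lowering
// more likely to be profitable even for the low 128-bit subvector.
static bool isScalarAlreadyVectorized(SDValue Scalar, const SDNode *Ignore) {
  for (SDNode *User : Scalar->uses()) {
    if (User == Ignore)
      continue;
    switch (User->getOpcode()) {
    case ISD::INSERT_VECTOR_ELT:
    case ISD::SCALAR_TO_VECTOR:
    case X86ISD::VBROADCAST:
    case X86ISD::PINSRW:
    case X86ISD::PINSRB:
      return true;
    default:
      break;
    }
  }
  return false;
}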

Generalize the profitability check somewhat.

llvm/lib/Target/X86/X86ISelLowering.cpp
19199

How about now?

Fix up check prefixes in vselect.ll.

Also accept insertions into the non-low part of a YMM. I don't know if the code is better, but if it is worse, then we also need to filter out YMM ops in the multi-use check.

This comment was removed by lebedev.ri.

(sorry, wrong patch)

RKSimon added inline comments.Jul 27 2021, 2:23 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
19201

It might be easier to read if you pull this out into a helper function/lambda.

llvm/test/CodeGen/X86/vselect.ll
617 ↗(On Diff #361525)

What's going on here?

lebedev.ri updated this revision to Diff 362063 (Edited).Jul 27 2021, 9:26 AM

@RKSimon thank you for taking a look!
Thinking about it more, I am not okay with following the https://reviews.llvm.org/D105390#inline-1016171 suggestion ("Maybe generalize the IdxVal >= NumEltsIn128 limit to insert with broadcast if the scalar is already used in (a) another insertelement/pinsrw/pinsrb, (b) scalar_to_vector or (c) broadcast").
While that mostly results in improvements, not all of the additional changes are wins per se, which is counter-productive. The relaxation is clearly separable from the existing diff; I see no reason why it must be done all at once.

I'm interested in looking into that later, but at the same time I'm perfectly okay with not proceeding with this patch at all.
Thanks.

lebedev.ri updated this revision to Diff 364102.Aug 4 2021, 7:31 AM

Rebased, NFC.

RKSimon accepted this revision.Aug 17 2021, 8:06 AM

Sorry for the delay - I'm happy for this to go in as a first step. Are you intending to continue investigating multiple insertions?

llvm/lib/Target/X86/X86ISelLowering.cpp
19202

EltSizeInBits >= 32 ?

This revision is now accepted and ready to land.Aug 17 2021, 8:06 AM

Sorry for the delay - I'm happy for this to go in as a first step.

Thank you for the review!

are you intending to continue investigating multiple insertions?

I would like to look further into this, but right now I'm not sure what the best way to deal with those cases would be.

This revision was landed with ongoing or failed builds.Aug 17 2021, 8:45 AM
This revision was automatically updated to reflect the committed changes.

@lebedev.ri Are you looking at the https://bugs.llvm.org/show_bug.cgi?id=51615 regression due to this patch?

@lebedev.ri Are you looking at the https://bugs.llvm.org/show_bug.cgi?id=51615 regression due to this patch?

Yes, looking into this now.