Broadcast is not worse than extract+insert of subvector.
https://godbolt.org/z/aPq98G6Yh
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I think the premise is sound, but creating variable shuffle/blend masks isn't great - its also uncovering a number of other poor codegen issues that need addressing.
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 19196 | Yes - its very tricky to see the effect of a domain crossing penalty on targets capable of broadcasts, so casts are fine | |
| llvm/test/CodeGen/X86/avx512-insert-extract.ll | ||
| 13 | Is it really worth loading a variable shuffle mask? | |
| llvm/test/CodeGen/X86/insertelement-shuffle.ll | ||
| 44 | Any idea whats going on here? | |
| llvm/test/CodeGen/X86/masked_load.ll | ||
| 5649 | This definitely looks like a regression | |
My main question is, presumably we only want to do this iff that is the only insertion into that 128-bit-wide subreg?
Addressing review notes:
- Allow i32/i64 for AVX (just pretend they are f32/f64)
- Only allow YMM vectors, disallow ZMM vectors
- Disallow i8 even if we can handle it - we have to load mask
Add broadcast(extract_vector_elt(x, 0)) -> broadcast(x) fold, to address one more regression.
| llvm/test/CodeGen/X86/insertelement-shuffle.ll | ||
|---|---|---|
| 44 | Optimized legalized selection DAG: %bb.0 'insert_subvector_512:'
SelectionDAG has 24 nodes:
  t0: ch = EntryToken
      t6: v4i64,ch = CopyFromReg t0, Register:v4i64 %2
                t2: i32,ch = CopyFromReg t0, Register:i32 %0
              t41: v4i32 = scalar_to_vector t2
              t4: i32,ch = CopyFromReg t0, Register:i32 %1
            t43: v4i32 = insert_vector_elt t41, t4, Constant:i64<1>
          t35: v2i64 = bitcast t43
        t36: i64 = extract_vector_elt t35, Constant:i64<0>
      t47: v4i64 = X86ISD::VBROADCAST t36
    t45: v4i64 = X86ISD::BLENDI t6, t47, TargetConstant:i8<4>
  t26: ch,glue = CopyToReg t0, Register:v4i64 $ymm0, t45
    t8: v4i64,ch = CopyFromReg t0, Register:v4i64 %3
  t28: ch,glue = CopyToReg t26, Register:v4i64 $ymm1, t8, t26:1
  t29: ch = X86ISD::RET_FLAG t28, TargetConstant:i32<0>, Register:v4i64 $ymm0, Register:v4i64 $ymm1, t28:1We were missing broadcast(extract_vector_elt(x, 0)) -> broadcast(x) fold. | |
Actually, AVX1 has no from-register broadcasts, only 32/64-bit from-memory broadcasts.
Not sure why i thought otherwise.
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 19203 | X86 doesn't use ISD::SPLAT_VECTOR. I think this should be DAG.getSplatBuildVector. | |
| llvm/test/CodeGen/X86/avx512-insert-extract.ll | ||
|---|---|---|
| 690–707 | Is this really better? I assume this what we get for an AVX2 target too. Not just KNL? | |
| llvm/test/CodeGen/X86/avx512-insert-extract.ll | ||
|---|---|---|
| 690–707 | Multi-insert case does seem questionable, yes. We could improve this via: define <16 x i16> @src(<16 x i16> %x, i16 %y, i16* %ptr) {
  %val = load i16, i16* %ptr
  %r1 = insertelement <16 x i16> %x, i16 %val, i32 1
  %r2 = insertelement <16 x i16> %r1, i16 %y, i32 9
  ret <16 x i16> %r2
}
define <16 x i16> @tgt(<16 x i16> %x, i16 %y, i16* %ptr) {
  %val = load i16, i16* %ptr
  %r1 = insertelement <16 x i16> undef, i16 %val, i32 1
  %r2 = insertelement <16 x i16> %r1, i16 %y, i32 9
  %r3 = select <16 x i1> <i1 1, i1 0, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 0, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1>, <16 x i16> %x, <16 x i16> %r2
  ret <16 x i16> %r3
}then we get         .text
        .file   "test.ll"
        .globl  src                             # -- Begin function src
        .p2align        4, 0x90
        .type   src,@function
src:                                    # @src
        .cfi_startproc
# %bb.0:
        vpbroadcastw    (%rsi), %xmm1
        vpblendw        $2, %xmm1, %xmm0, %xmm1         # xmm1 = xmm0[0],xmm1[1],xmm0[2,3,4,5,6,7]
        vmovd   %edi, %xmm2
        vpbroadcastw    %xmm2, %ymm2
        vpblendw        $2, %ymm2, %ymm0, %ymm0         # ymm0 = ymm0[0],ymm2[1],ymm0[2,3,4,5,6,7,8],ymm2[9],ymm0[10,11,12,13,14,15]
        vpblendd        $240, %ymm0, %ymm1, %ymm0       # ymm0 = ymm1[0,1,2,3],ymm0[4,5,6,7]
        retq
.Lfunc_end0:
        .size   src, .Lfunc_end0-src
        .cfi_endproc
                                        # -- End function
        .globl  tgt                             # -- Begin function tgt
        .p2align        4, 0x90
        .type   tgt,@function
tgt:                                    # @tgt
        .cfi_startproc
# %bb.0:
        vpbroadcastw    (%rsi), %xmm1
        vmovd   %edi, %xmm2
        vpslld  $16, %xmm2, %xmm2
        vinserti128     $1, %xmm2, %ymm1, %ymm1
        vpblendw        $2, %ymm1, %ymm0, %ymm0         # ymm0 = ymm0[0],ymm1[1],ymm0[2,3,4,5,6,7,8],ymm1[9],ymm0[10,11,12,13,14,15]
        retq
.Lfunc_end1:
        .size   tgt, .Lfunc_end1-tgt
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits | |
| llvm/test/CodeGen/X86/avx512-insert-extract.ll | ||
|---|---|---|
| 690–707 | I was more questioning the trading of 3 instructions for the scalar to vector copy, broadcast and 2 blends. But it turns out vpinsrw is slower than I realized on Haswell. | |
| 706 | vpermi2w is 3 uops, 2 of which are 3 cycles that are serialized. I think the two blends we got on avx2 would be better. That's probably a separate issue in shuffle lowering/combining. | |
| 820 | Again, I'd expect 2 blends to be better. | |
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 38000 | Sadly, i tried that originally, and we do not, otherwise i would have had this as a separate patch :/ | |
| llvm/test/CodeGen/X86/masked_gather.ll | ||
|---|---|---|
| 1420 | I'm not sure i follow. this inserts c+28(%rip) into the 4'th 32-bit element of ymm0. | |
| llvm/test/CodeGen/X86/masked_gather.ll | ||
|---|---|---|
| 1420 | Aren't all the "broadcastss c+28(%rip), XXXX" cases broadcasting the same memory location? The IR looks like the gep is splatting the element 3 of the pointer array to every gather address. | |
| llvm/test/CodeGen/X86/masked_gather.ll | ||
|---|---|---|
| 1420 | Right. Well, i'm not sure where we'd do that. And what do you mean by merge? They are scalarized by Scalarize Masked Memory Intrinsics (scalarize-masked-mem-intrin) pass, | |
Do we have any test coverage of repeated insertions of the same scalar into different elements? Either same subvector or different subvectors.
| llvm/test/CodeGen/X86/avx512-insert-extract.ll | ||
|---|---|---|
| 690–707 | Did you have any luck testing broadcasts into lower subvector? | |
| llvm/test/CodeGen/X86/avx512-insert-extract.ll | ||
|---|---|---|
| 706 | rG15b883f45771 should address this | |
| llvm/test/CodeGen/X86/avx512-insert-extract.ll | ||
|---|---|---|
| 690–707 | I briefly looked at the test changes without high-subvector limitation, | |
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 19199 | Maybe generalize the IdxVal >= NumEltsIn128 limit to insert with broadcast if the scalar is already used in a (a) another insertelement/psinrw/pinsrb, (b) scalar_to_vector or (c) broadcast. | |
Generalize profitability check somewhat.
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 19199 | How about now? | |
Also accept insertions into non-low part of YMM.
I don't know if the code is better, but if it is worse,
then we also need to filter out YMM ops in the multi-use check.
@RKSimon thank you for taking a look!
Thinking about it more, i am not okay with following https://reviews.llvm.org/D105390#inline-1016171
Maybe generalize the IdxVal >= NumEltsIn128 limit to insert with broadcast if the scalar is already used in a (a) another insertelement/psinrw/pinsrb, (b) scalar_to_vector or (c) broadcast. suggestion.
While that clearly results in improvements mostly, not all additional changes are wins per-se.
This is counter-productive. The relaxation is clearly separable from the existing diff.
I see no reason why it must be done at once.
I'm interested in looking into that later, but at the same time
i'm perfectly okay with not proceeding with this patch at all.
Thanks.
Sorry for the delay - I'm happy for this to go in as a first step, are you intending to continue investigating multiple insertions?
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 19202 | EltSizeInBits >= 32 ? | |
Thank you for the review!
are you intending to continue investigating multiple insertions?
I would like to look further into this, but right now i'm not sure what would be the best way to deal with those cases.
@lebedev.ri Are you looking at the https://bugs.llvm.org/show_bug.cgi?id=51615 regression due to this patch?
Move the assert(isPowerOf2_32(NumEltsIn128)) as well? And add an assert message to match style guide.