This is an archive of the discontinued LLVM Phabricator instance.

Differential D99980

[SLP]Improve cost model for the vectorized extractelements.
ClosedPublic

Authored by ABataev on Apr 6 2021, 11:09 AM.

Details

Reviewers

spatel
RKSimon
dtemirbulatov
anton-afanasyev
vdmitrie

Commits

rGe99b98cb1bca: [SLP]Improve cost model for the vectorized extractelements.

Summary

No need to call areAllUsersVectorized as later the cost is calculated only if the instruction has one use and gets vectorized.
Need to calculate the cost of the dead extractelement more precisely, taking the vector type of the vector operand, not the resulting vector type.

Part of D57059.
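
To make the second point of the summary concrete, here is a minimal sketch in standalone C++. It does not use LLVM: `VecType` and `toyExtractCost` are hypothetical names, and the cost rule (one extra unit for a lane that lives above the first 128-bit chunk of a wide vector) is an assumption chosen only for illustration. It shows how costing a dead extractelement with the narrow result type can differ from costing it with the type of the vector operand.

```cpp
#include <cassert>

// Hypothetical fixed vector type: element count and element width in bits.
struct VecType {
  unsigned NumElts;
  unsigned EltBits;
};

// Toy stand-in for a target cost query (an assumption, not LLVM's TTI):
// extracting a lane costs 1, or 2 if the vector spans more than one 128-bit
// register and the lane lives above the first register-sized chunk.
unsigned toyExtractCost(VecType Ty, unsigned Idx) {
  unsigned EltsPerReg = 128 / Ty.EltBits;
  bool FitsInOneReg = Ty.NumElts <= EltsPerReg;
  return (FitsInOneReg || Idx < EltsPerReg) ? 1 : 2;
}

// Credit for an extract that becomes dead after vectorization, computed once
// with the bundle's result type and once with the vector operand's type.
struct DeadExtractCredit {
  unsigned WithResultTy;
  unsigned WithOperandTy;
};

DeadExtractCredit creditFor(VecType ResultTy, VecType OperandTy, unsigned Idx) {
  return {toyExtractCost(ResultTy, Idx), toyExtractCost(OperandTy, Idx)};
}
```

Under this toy model, an extract of lane 5 from an <8 x i32> operand that feeds a <4 x i32> bundle is credited 1 when costed with the result type and 2 when costed with the operand type, which is the kind of discrepancy the summary refers to.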

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ABataev created this revision. · Apr 6 2021, 11:09 AM

Herald added a subscriber: hiraditya. · Apr 6 2021, 11:09 AM

ABataev requested review of this revision. · Apr 6 2021, 11:09 AM

Herald added a project: Restricted Project. · Apr 6 2021, 11:09 AM

Harbormaster completed remote builds in B97365: Diff 335602. · Apr 6 2021, 1:08 PM

Improve cost of other extractelement instructions.

RKSimon added inline comments. · Apr 7 2021, 5:37 AM

llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
123	Why do we have both a v4i32 and v8i32 shl in here?

ABataev added inline comments. · Apr 7 2021, 5:43 AM

llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll

123

That's because of two main factors: the maximum size of the vector register and the final insertelement instructions. These insertelement instructions lead to the emission of <8 x i32> vectors, while the other instructions are limited by the 128-bit size of the vector register.
SLP vectorizer generates this code:

define <8 x i32> @ashr_shl_v8i32(<8 x i32> %a, <8 x i32> %b) #0 {
  %a0 = extractelement <8 x i32> %a, i32 0
  %a1 = extractelement <8 x i32> %a, i32 1
  %a2 = extractelement <8 x i32> %a, i32 2
  %a3 = extractelement <8 x i32> %a, i32 3
  %a4 = extractelement <8 x i32> %a, i32 4
  %a5 = extractelement <8 x i32> %a, i32 5
  %a6 = extractelement <8 x i32> %a, i32 6
  %a7 = extractelement <8 x i32> %a, i32 7
  %b0 = extractelement <8 x i32> %b, i32 0
  %b1 = extractelement <8 x i32> %b, i32 1
  %b2 = extractelement <8 x i32> %b, i32 2
  %b3 = extractelement <8 x i32> %b, i32 3
  %b4 = extractelement <8 x i32> %b, i32 4
  %b5 = extractelement <8 x i32> %b, i32 5
  %b6 = extractelement <8 x i32> %b, i32 6
  %b7 = extractelement <8 x i32> %b, i32 7
  %ab0 = ashr i32 %a0, %b0
  %ab1 = ashr i32 %a1, %b1
  %1 = insertelement <4 x i32> poison, i32 %a2, i32 0
  %2 = insertelement <4 x i32> %1, i32 %a3, i32 1
  %3 = insertelement <4 x i32> %2, i32 %a4, i32 2
  %4 = insertelement <4 x i32> %3, i32 %a5, i32 3
  %5 = insertelement <4 x i32> poison, i32 %b2, i32 0
  %6 = insertelement <4 x i32> %5, i32 %b3, i32 1
  %7 = insertelement <4 x i32> %6, i32 %b4, i32 2
  %8 = insertelement <4 x i32> %7, i32 %b5, i32 3
  %9 = ashr <4 x i32> %4, %8
  %10 = shl <4 x i32> %4, %8
  %11 = shufflevector <4 x i32> %9, <4 x i32> %10, <4 x i32> <i32 0, i32 1, i32 6, i32 7>
  %12 = insertelement <2 x i32> poison, i32 %a6, i32 0
  %13 = insertelement <2 x i32> %12, i32 %a7, i32 1
  %14 = insertelement <2 x i32> poison, i32 %b6, i32 0
  %15 = insertelement <2 x i32> %14, i32 %b7, i32 1
  %16 = shl <2 x i32> %13, %15
  %r0 = insertelement <8 x i32> undef, i32 %ab0, i32 0
  %r1 = insertelement <8 x i32> %r0, i32 %ab1, i32 1
  %17 = extractelement <4 x i32> %11, i32 0
  %r2 = insertelement <8 x i32> %r1, i32 %17, i32 2
  %18 = extractelement <4 x i32> %11, i32 1
  %r3 = insertelement <8 x i32> %r2, i32 %18, i32 3
  %19 = extractelement <4 x i32> %11, i32 2
  %r4 = insertelement <8 x i32> %r3, i32 %19, i32 4
  %20 = extractelement <4 x i32> %11, i32 3
  %r5 = insertelement <8 x i32> %r4, i32 %20, i32 5
  %21 = extractelement <2 x i32> %16, i32 0
  %r6 = insertelement <8 x i32> %r5, i32 %21, i32 6
  %22 = extractelement <2 x i32> %16, i32 1
  %r7 = insertelement <8 x i32> %r6, i32 %22, i32 7
  ret <8 x i32> %r7
}

InstCombiner optimizes this to <8 x i32> vector operations.

Harbormaster completed remote builds in B97493: Diff 335783. · Apr 7 2021, 6:05 AM

RKSimon added inline comments. · Apr 7 2021, 6:05 AM

llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll

123

I'm still seeing 128/256 shifts: https://c.godbolt.org/z/4q364s1fT

define <8 x i32> @ashr_shl_v8i32(<8 x i32> %a, <8 x i32> %b) {
  %1 = ashr <8 x i32> %a, %b
  %2 = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
  %3 = shufflevector <8 x i32> %b, <8 x i32> undef, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
  %4 = ashr <4 x i32> %2, %3
  %5 = shufflevector <4 x i32> %4, <4 x i32> undef, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %6 = shl <4 x i32> %2, %3
  %7 = shufflevector <4 x i32> %6, <4 x i32> undef, <8 x i32> <i32 undef, i32 undef, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %8 = shl <8 x i32> %a, %b
  %r3 = shufflevector <8 x i32> %1, <8 x i32> %5, <8 x i32> <i32 0, i32 1, i32 8, i32 9, i32 undef, i32 undef, i32 undef, i32 undef>
  %r5 = shufflevector <8 x i32> %r3, <8 x i32> %7, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 10, i32 11, i32 undef, i32 undef>
  %r7 = shufflevector <8 x i32> %r5, <8 x i32> %8, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 14, i32 15>
  ret <8 x i32> %r7
}

ABataev added inline comments. · Apr 7 2021, 6:15 AM

llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
123	The original function operates on <8 x i32> parameters and return value. The SLP vectorizer is limited by the 128-bit vector register size and generates <2 x i32> and <4 x i32> vector instructions. InstCombiner then combines some of these vector instructions and produces <4 x i32> and <8 x i32> instructions, but only because the function parameters and return value are <8 x i32>. The test runs both SLP and InstCombiner, and it is InstCombiner that produces the <8 x i32> instructions.

Rebase

Harbormaster completed remote builds in B98755: Diff 337550. · Apr 14 2021, 3:00 PM

RKSimon requested changes to this revision. · Apr 15 2021, 3:15 AM

RKSimon added inline comments.

llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
123	I'm still concerned about this codegen - TMP1 and TMP2 have subvector extractions that aren't even on a subvector boundary, and we're performing ashr on vector elements (4 + 5) that aren't required - but scalarizing 2 elements (0 + 1) that are.

This revision now requires changes to proceed. · Apr 15 2021, 3:15 AM

ABataev added inline comments. · Apr 15 2021, 6:18 AM

llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll

123

Looks like it reveals the problem in InstCombiner. For some reason, it extends shl <2 x i32> %13, %15 generated by SLP to shl <8 x i32> [[A]], [[B]] though it should not.
As to shuffles

%9 = ashr <4 x i32> %4, %8
%10 = shl <4 x i32> %4, %8

It is caused by the altopcode vectorization algorithm. Looks like we detect vector bundle:

%ab2 = ashr i32 %a2, %b2
%ab3 = ashr i32 %a3, %b3
%ab4 = shl  i32 %a4, %b4
%ab5 = shl  i32 %a5, %b5

and we generate something like this for it:

%1 = insertelement <4 x i32> poison, i32 %a2, i32 0
%2 = insertelement <4 x i32> %1, i32 %a3, i32 1
%3 = insertelement <4 x i32> %2, i32 %a4, i32 2
%4 = insertelement <4 x i32> %3, i32 %a5, i32 3
%5 = insertelement <4 x i32> poison, i32 %b2, i32 0
%6 = insertelement <4 x i32> %5, i32 %b3, i32 1
%7 = insertelement <4 x i32> %6, i32 %b4, i32 2
%8 = insertelement <4 x i32> %7, i32 %b5, i32 3
%9 = ashr <4 x i32> %4, %8
%10 = shl <4 x i32> %4, %8
%11 = shufflevector <4 x i32> %9, <4 x i32> %10, <4 x i32> <i32 0, i32 1, i32 6, i32 7>

%1 = insertelement <4 x i32> poison, i32 %a2, i32 0
%2 = insertelement <4 x i32> %1, i32 %a3, i32 1
%3 = insertelement <4 x i32> %2, i32 %a4, i32 2
%4 = insertelement <4 x i32> %3, i32 %a5, i32 3
%5 = insertelement <4 x i32> poison, i32 %b2, i32 0
%6 = insertelement <4 x i32> %5, i32 %b3, i32 1
%7 = insertelement <4 x i32> %6, i32 %b4, i32 2
%8 = insertelement <4 x i32> %7, i32 %b5, i32 3

These insertelement chains are just subvector extracts. What I missed in the patch is adding a cost for subvector extracts/inserts; I will add it.
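
As an illustration of the alternate-opcode lowering discussed in this comment, the following standalone C++ sketch emulates it on plain arrays. It is not the vectorizer's code: `V4`, `shuffle`, and `ashrShlBundle` are hypothetical names, and shift amounts are assumed to be in the range 0..31. Both opcodes are applied to all four lanes, and the results are blended with the same mask as the shufflevector above, <0, 1, 6, 7>.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V4 = std::array<int32_t, 4>;

// Two-source shuffle in the style of LLVM's shufflevector: mask values 0..3
// select lanes of A, mask values 4..7 select lanes of B.
V4 shuffle(const V4 &A, const V4 &B, const std::array<int, 4> &Mask) {
  V4 R{};
  for (int I = 0; I < 4; ++I)
    R[I] = Mask[I] < 4 ? A[Mask[I]] : B[Mask[I] - 4];
  return R;
}

// Alternate-opcode bundle {ashr, ashr, shl, shl}: compute both opcodes on
// every lane, then keep lanes 0-1 from the ashr result and lanes 2-3 from the
// shl result.
V4 ashrShlBundle(const V4 &X, const V4 &Amt) {
  V4 Ashr{}, Shl{};
  for (int I = 0; I < 4; ++I) {
    // >> on a negative signed value is an arithmetic shift in C++20 and on
    // mainstream compilers before that.
    Ashr[I] = X[I] >> Amt[I];
    // Shift left on the unsigned representation to avoid signed overflow.
    Shl[I] = static_cast<int32_t>(static_cast<uint32_t>(X[I]) << Amt[I]);
  }
  return shuffle(Ashr, Shl, {0, 1, 6, 7});
}
```

The sketch also makes the cost concern visible: lanes 2-3 of the ashr and lanes 0-1 of the shl are computed and then discarded by the blend.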

Rebase + improvements

Cleanup

ABataev added inline comments. · Apr 15 2021, 2:10 PM

llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
123	Investigated test case more closely. Everything is correct. Before we just vectorized `4 shl` instructions, which were extended to `<8x` by Instcombiner. Currently, we're vectorizing `2 ashr + 2 shl` and `2 shl` (again extended to `<8x` by Instcombiner). The test will be improved further by the patch for vectorization of `InsertElement` instructions, it will end up with `ashr 8x + shl 8x + shuffle` just like for other targets.

Harbormaster completed remote builds in B99017: Diff 337893. · Apr 15 2021, 2:23 PM

Harbormaster completed remote builds in B99019: Diff 337895. · Apr 15 2021, 2:32 PM

RKSimon added inline comments. · Apr 16 2021, 2:58 AM

llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
123	Would you be OK with waiting until D98714 has landed?

ABataev added inline comments. · Apr 16 2021, 5:31 AM

llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
123	It blocks some of the patches for non-power-of-2 vectorization in SLP, so it would be good to land it ASAP. Plus, it still improves the situation compared to the existing codegen.

Rebase

Harbormaster completed remote builds in B99809: Diff 338999. · Apr 20 2021, 2:58 PM

OK - LGTM - cheers

llvm/test/Transforms/SLPVectorizer/X86/extractelement.ll
53	Missed fadd reduction opportunity

This revision is now accepted and ready to land. · Apr 22 2021, 3:55 AM

ABataev added inline comments. · Apr 22 2021, 7:21 AM

llvm/test/Transforms/SLPVectorizer/X86/extractelement.ll
53	We do not even try to detect reductions if we have fewer than 4 elements; here we're going to have shuffle + vector fadd + extractelement.

This revision was landed with ongoing or failed builds. · Apr 22 2021, 7:41 AM

Closed by commit rGe99b98cb1bca: [SLP]Improve cost model for the vectorized extractelements. (authored by ABataev).

This revision was automatically updated to reflect the committed changes.

ABataev added a commit: rGe99b98cb1bca: [SLP]Improve cost model for the vectorized extractelements..

Hi, we're seeing some build failures after this patch. If you build the following reduced test case with clang -cc1 -emit-obj -O3 -vectorize-slp reduced.cc

# 1 "" 3
typedef int a __attribute__((__mode__(__DI__)));
struct b {
  __attribute__((__vector_size__(4 * sizeof(long)))) long c;
  a operator[](int) const;
};
a b::operator[](int d) const {
  if (d)
    return c[d];
  return;
}
b e;
a au[8];
void __attribute__((target("avx"))) f() {
  for (int g = 0; g < 8; ++g) {
    long bc = e[g];
    au[g] = e[bc & 7];
  }
}

it results in this error:

fatal error: error in backend: Cannot select: 0x578dff9323a8: v2i64 = truncate 0x578dff926208
  0x578dff926208: v4i32 = and 0x578dff926750, 0x578dff926a90
    0x578dff926750: v4i32 = X86ISD::UNPCKL 0x578dff926958, 0x578dff926b60
      0x578dff926958: v4i32 = X86ISD::PSHUFD 0x578dff9322d8, TargetConstant:i8<-18>
        0x578dff9322d8: v4i32,ch = load<(dereferenceable load 16 from `<4 x i64>* getelementptr inbounds (%struct.b, %struct.b* @e, i64 0, i32 0)`, align 32)> 0x578dffc25c68, 0x578dff932b60, undef:i64
          0x578dff932b60: i64 = X86ISD::WrapperRIP TargetGlobalAddress:i64<%struct.b* @e> 0
            0x578dff932888: i64 = TargetGlobalAddress<%struct.b* @e> 0
          0x578dff9324e0: i64 = undef
        0x578dff9268f0: i8 = TargetConstant<-18>
      0x578dff926b60: v4i32 = bitcast 0x578dff932410
        0x578dff932410: v2i64,ch = load<(dereferenceable load 16 from `<4 x i64>* getelementptr inbounds (%struct.b, %struct.b* @e, i64 0, i32 0)` + 16, basealign 32)> 0x578dffc25c68, 0x578dff932068, undef:i64
          0x578dff932068: i64 = add 0x578dff932b60, Constant:i64<16>
            0x578dff932b60: i64 = X86ISD::WrapperRIP TargetGlobalAddress:i64<%struct.b* @e> 0
              0x578dff932888: i64 = TargetGlobalAddress<%struct.b* @e> 0
            0x578dff926548: i64 = Constant<16>
          0x578dff9324e0: i64 = undef
    0x578dff926a90: v4i32,ch = load<(load 16 from constant-pool)> 0x578dffc25c68, 0x578dff932820, undef:i64
      0x578dff932820: i64 = X86ISD::WrapperRIP TargetConstantPool:i64<<4 x i32> <i32 7, i32 7, i32 undef, i32 undef>> 0
        0x578dff932618: i64 = TargetConstantPool<<4 x i32> <i32 7, i32 7, i32 undef, i32 undef>> 0
      0x578dff9324e0: i64 = undef
In function: _Z1fv

A debug build of clang hits this assertion instead:

clang: /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp:4842: llvm::SDValue llvm::SelectionDAG::getNode(unsigned int, const llvm::SDLoc &, llvm::EVT, llvm::SDValue, const llvm::SDNodeFlags): Assertion `(!VT.isVector() || VT.getVectorElementCount() == Operand.getValueType().getVectorElementCount()) && "Vector element count mismatch!"' failed.

Thanks for the report, I'll investigate it

Looks like a bug in X86ISelLowering, will try to make a patch to fix this

Must be fixed in 13a51e017c09ce449ba2ec0024baf356d6dfcbad

In D99980#2738628, @ABataev wrote:

Must be fixed in 13a51e017c09ce449ba2ec0024baf356d6dfcbad

Awesome! Thanks a lot :)

RKSimon mentioned this in D124769: [SLP] AdjustExtractsCost - remove redundant subvector extraction code. · May 2 2022, 6:54 AM

Revision Contents

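Path                                                                            Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp                                 96 lines
llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll            23 lines
llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll                         23 lines
llvm/test/Transforms/SLPVectorizer/AArch64/vectorize-free-extracts-inserts.ll   61 lines
llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll            28 lines
llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll                         20 lines
llvm/test/Transforms/SLPVectorizer/X86/extractelement.ll                        12 lines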

Diff 339629

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 3,537 Lines • ▼ Show 20 Lines	InstructionCost BoUpSLP::getEntryCost(TreeEntry *E) {
unsigned ReuseShuffleNumbers = E->ReuseShuffleIndices.size();		unsigned ReuseShuffleNumbers = E->ReuseShuffleIndices.size();
bool NeedToShuffleReuses = !E->ReuseShuffleIndices.empty();		bool NeedToShuffleReuses = !E->ReuseShuffleIndices.empty();
InstructionCost ReuseShuffleCost = 0;		InstructionCost ReuseShuffleCost = 0;
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
ReuseShuffleCost =		ReuseShuffleCost =
TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, VecTy,		TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, VecTy,
E->ReuseShuffleIndices);		E->ReuseShuffleIndices);
}		}
		auto &&AdjustExtractsCost = [this, CostKind, VL, VecTy](InstructionCost &Cost,
		bool IsGather) {
		DenseMap<Value *, int> ExtractVectorsTys;
		for (auto *V : VL) {
		// If all users of instruction are going to be vectorized and this
		// instruction itself is not going to be vectorized, consider this
		// instruction as dead and remove its cost from the final cost of the
		// vectorized tree.
		if (IsGather && (!areAllUsersVectorized(cast<Instruction>(V)) \|\|
		ScalarToTreeEntry.count(V)))
		continue;
		auto *EE = cast<ExtractElementInst>(V);
		unsigned Idx = *getExtractIndex(EE);
		if (TTI->getNumberOfParts(VecTy) !=
		TTI->getNumberOfParts(EE->getVectorOperandType())) {
		auto It =
		ExtractVectorsTys.try_emplace(EE->getVectorOperand(), Idx).first;
		It->getSecond() = std::min<int>(It->second, Idx);
		}
		// Take credit for instruction that will become dead.
		if (EE->hasOneUse()) {
		Instruction *Ext = EE->user_back();
		if ((isa<SExtInst>(Ext) \|\| isa<ZExtInst>(Ext)) &&
		all_of(Ext->users(),
		[](User *U) { return isa<GetElementPtrInst>(U); })) {
		// Use getExtractWithExtendCost() to calculate the cost of
		// extractelement/ext pair.
		Cost -=
		TTI->getExtractWithExtendCost(Ext->getOpcode(), Ext->getType(),
		EE->getVectorOperandType(), Idx);
		// Add back the cost of s\|zext which is subtracted separately.
		Cost += TTI->getCastInstrCost(
		Ext->getOpcode(), Ext->getType(), EE->getType(),
		TTI::getCastContextHint(Ext), CostKind, Ext);
		continue;
		}
		}
		Cost -= TTI->getVectorInstrCost(Instruction::ExtractElement,
		EE->getVectorOperandType(), Idx);
		}
		// Add a cost for subvector extracts/inserts if required.
		for (const auto &Data : ExtractVectorsTys) {
		auto *EEVTy = cast<FixedVectorType>(Data.first->getType());
		unsigned NumElts = VecTy->getNumElements();
		if (TTI->getNumberOfParts(EEVTy) > TTI->getNumberOfParts(VecTy))
		Cost +=
		TTI->getShuffleCost(TargetTransformInfo::SK_ExtractSubvector, EEVTy,
		None, (Data.second / NumElts) * NumElts, VecTy);
		else
		Cost += TTI->getShuffleCost(TargetTransformInfo::SK_InsertSubvector,
		VecTy, None, 0, EEVTy);
		}
		};
if (E->State == TreeEntry::NeedToGather) {		if (E->State == TreeEntry::NeedToGather) {
if (allConstant(VL))		if (allConstant(VL))
return 0;		return 0;
if (isSplat(VL)) {		if (isSplat(VL)) {
return ReuseShuffleCost +		return ReuseShuffleCost +
TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy, None,		TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy, None,
0);		0);
}		}
if (E->getOpcode() == Instruction::ExtractElement &&		if (E->getOpcode() == Instruction::ExtractElement &&
allSameType(VL) && allSameBlock(VL)) {		allSameType(VL) && allSameBlock(VL)) {
SmallVector<int> Mask;		SmallVector<int> Mask;
Optional<TargetTransformInfo::ShuffleKind> ShuffleKind =		Optional<TargetTransformInfo::ShuffleKind> ShuffleKind =
isShuffle(VL, Mask);		isShuffle(VL, Mask);
if (ShuffleKind.hasValue()) {		if (ShuffleKind.hasValue()) {
InstructionCost Cost =		InstructionCost Cost =
computeExtractCost(VL, VecTy, ShuffleKind, Mask, TTI);		computeExtractCost(VL, VecTy, ShuffleKind, Mask, TTI);
for (auto *V : VL) {		AdjustExtractsCost(Cost, /IsGather=/true);
// If all users of instruction are going to be vectorized and this
// instruction itself is not going to be vectorized, consider this
// instruction as dead and remove its cost from the final cost of the
// vectorized tree.
if (areAllUsersVectorized(cast<Instruction>(V)) &&
!ScalarToTreeEntry.count(V)) {
auto *IO = cast<ConstantInt>(
cast<ExtractElementInst>(V)->getIndexOperand());
Cost -= TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy,
IO->getZExtValue());
}
}
return ReuseShuffleCost + Cost;		return ReuseShuffleCost + Cost;
}		}
}		}
InstructionCost GatherCost = 0;		InstructionCost GatherCost = 0;
SmallVector<int> Mask;		SmallVector<int> Mask;
SmallVector<const TreeEntry *> Entries;		SmallVector<const TreeEntry *> Entries;
Optional<TargetTransformInfo::ShuffleKind> Shuffle =		Optional<TargetTransformInfo::ShuffleKind> Shuffle =
isGatherShuffledEntry(E, Mask, Entries);		isGatherShuffledEntry(E, Mask, Entries);
Show All 29 Lines	switch (ShuffleOrOp) {
case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
// The common cost of removal ExtractElement/ExtractValue instructions +		// The common cost of removal ExtractElement/ExtractValue instructions +
// the cost of shuffles, if required to resuffle the original vector.		// the cost of shuffles, if required to resuffle the original vector.
InstructionCost CommonCost = 0;		InstructionCost CommonCost = 0;
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
unsigned Idx = 0;		unsigned Idx = 0;
for (unsigned I : E->ReuseShuffleIndices) {		for (unsigned I : E->ReuseShuffleIndices) {
if (ShuffleOrOp == Instruction::ExtractElement) {		if (ShuffleOrOp == Instruction::ExtractElement) {
auto *IO = cast<ConstantInt>(		auto *EE = cast<ExtractElementInst>(VL[I]);
cast<ExtractElementInst>(VL[I])->getIndexOperand());
Idx = IO->getZExtValue();
ReuseShuffleCost -= TTI->getVectorInstrCost(		ReuseShuffleCost -= TTI->getVectorInstrCost(
Instruction::ExtractElement, VecTy, Idx);		Instruction::ExtractElement, EE->getVectorOperandType(),
		*getExtractIndex(EE));
} else {		} else {
ReuseShuffleCost -= TTI->getVectorInstrCost(		ReuseShuffleCost -= TTI->getVectorInstrCost(
Instruction::ExtractElement, VecTy, Idx);		Instruction::ExtractElement, VecTy, Idx);
++Idx;		++Idx;
}		}
}		}
Idx = ReuseShuffleNumbers;		Idx = ReuseShuffleNumbers;
for (Value *V : VL) {		for (Value *V : VL) {
if (ShuffleOrOp == Instruction::ExtractElement) {		if (ShuffleOrOp == Instruction::ExtractElement) {
auto *IO = cast<ConstantInt>(		auto *EE = cast<ExtractElementInst>(V);
cast<ExtractElementInst>(V)->getIndexOperand());		ReuseShuffleCost += TTI->getVectorInstrCost(
Idx = IO->getZExtValue();		Instruction::ExtractElement, EE->getVectorOperandType(),
		*getExtractIndex(EE));
} else {		} else {
--Idx;		--Idx;
		ReuseShuffleCost += TTI->getVectorInstrCost(
		Instruction::ExtractElement, VecTy, Idx);
}		}
ReuseShuffleCost +=
TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, Idx);
}		}
CommonCost = ReuseShuffleCost;		CommonCost = ReuseShuffleCost;
} else if (!E->ReorderIndices.empty()) {		} else if (!E->ReorderIndices.empty()) {
SmallVector<int> NewMask;		SmallVector<int> NewMask;
inversePermutation(E->ReorderIndices, NewMask);		inversePermutation(E->ReorderIndices, NewMask);
CommonCost = TTI->getShuffleCost(		CommonCost = TTI->getShuffleCost(
TargetTransformInfo::SK_PermuteSingleSrc, VecTy, NewMask);		TargetTransformInfo::SK_PermuteSingleSrc, VecTy, NewMask);
}		}
		if (ShuffleOrOp == Instruction::ExtractValue) {
for (unsigned I = 0, E = VL.size(); I < E; ++I) {		for (unsigned I = 0, E = VL.size(); I < E; ++I) {
Instruction *EI = cast<Instruction>(VL[I]);		auto *EI = cast<Instruction>(VL[I]);
// If all users are going to be vectorized, instruction can be
// considered as dead.
// The same, if have only one user, it will be vectorized for sure.
if (areAllUsersVectorized(EI)) {
// Take credit for instruction that will become dead.		// Take credit for instruction that will become dead.
if (EI->hasOneUse()) {		if (EI->hasOneUse()) {
Instruction *Ext = EI->user_back();		Instruction *Ext = EI->user_back();
if ((isa<SExtInst>(Ext) \|\| isa<ZExtInst>(Ext)) &&		if ((isa<SExtInst>(Ext) \|\| isa<ZExtInst>(Ext)) &&
all_of(Ext->users(),		all_of(Ext->users(),
[](User *U) { return isa<GetElementPtrInst>(U); })) {		[](User *U) { return isa<GetElementPtrInst>(U); })) {
// Use getExtractWithExtendCost() to calculate the cost of		// Use getExtractWithExtendCost() to calculate the cost of
// extractelement/ext pair.		// extractelement/ext pair.
CommonCost -= TTI->getExtractWithExtendCost(		CommonCost -= TTI->getExtractWithExtendCost(
Ext->getOpcode(), Ext->getType(), VecTy, I);		Ext->getOpcode(), Ext->getType(), VecTy, I);
// Add back the cost of s\|zext which is subtracted separately.		// Add back the cost of s\|zext which is subtracted separately.
CommonCost += TTI->getCastInstrCost(		CommonCost += TTI->getCastInstrCost(
Ext->getOpcode(), Ext->getType(), EI->getType(),		Ext->getOpcode(), Ext->getType(), EI->getType(),
TTI::getCastContextHint(Ext), CostKind, Ext);		TTI::getCastContextHint(Ext), CostKind, Ext);
continue;		continue;
}		}
}		}
CommonCost -=		CommonCost -=
TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, I);		TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, I);
}		}
		} else {
		AdjustExtractsCost(CommonCost, /IsGather=/false);
}		}
return CommonCost;		return CommonCost;
}		}
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
case Instruction::FPToUI:		case Instruction::FPToUI:
case Instruction::FPToSI:		case Instruction::FPToSI:
case Instruction::FPExt:		case Instruction::FPExt:
▲ Show 20 Lines • Show All 4,364 Lines • Show Last 20 Lines
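
The right-hand (new) side of the hunk above can be read as a small accounting rule: each scalar extract in the bundle earns a credit because it is expected to become dead, and an extract whose only user is a sext/zext feeding nothing but GEPs is credited as a fused extract-plus-extend, with the cast cost added back. The following is a minimal sketch of that rule only, not the actual SLPVectorizer code. `ExtractInfo`, `ToyCosts`, `creditDeadExtracts`, and the flat cost numbers are hypothetical stand-ins for values that the real code obtains from `TargetTransformInfo`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical summary of one scalar extract in the bundle.
struct ExtractInfo {
  bool HasOneUse;            // the extract has exactly one user
  bool UserIsExtFeedingGEPs; // that user is a sext/zext used only by GEPs
};

// Hypothetical flat costs; the real values are target dependent.
struct ToyCosts {
  int Extract = 1;           // plain extract
  int ExtractWithExtend = 2; // fused extract + s|zext
  int Cast = 1;              // standalone s|zext
};

// Returns the (negative) adjustment applied to the common cost.
int creditDeadExtracts(const std::vector<ExtractInfo> &Bundle,
                       const ToyCosts &C) {
  int CommonCost = 0;
  for (const ExtractInfo &EI : Bundle) {
    if (EI.HasOneUse && EI.UserIsExtFeedingGEPs) {
      // Credit the extract/ext pair as one unit.
      CommonCost -= C.ExtractWithExtend;
      // Add back the cast, which is subtracted separately elsewhere.
      CommonCost += C.Cast;
      continue;
    }
    // Otherwise take credit for the plain extract.
    CommonCost -= C.Extract;
  }
  return CommonCost;
}
```

With the assumed costs above, a bundle with one extract/ext pair and two plain extracts yields an adjustment of -3: (-2 + 1) for the pair and -1 for each plain extract.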

llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll

Show First 20 Lines • Show All 139 Lines • ▼ Show 20 Lines	;
%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1		%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1
%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.0, i32 2		%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.0, i32 2
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.1, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.1, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}

define <4 x i32> @build_vec_v4i32_reuse_1(<2 x i32> %v0, <2 x i32> %v1) {		define <4 x i32> @build_vec_v4i32_reuse_1(<2 x i32> %v0, <2 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32_reuse_1(		; CHECK-LABEL: @build_vec_v4i32_reuse_1(
; CHECK-NEXT: [[V0_0:%.*]] = extractelement <2 x i32> [[V0:%.*]], i32 0		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <2 x i32> [[V1:%.*]], i32 1
; CHECK-NEXT: [[V0_1:%.*]] = extractelement <2 x i32> [[V0]], i32 1		; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x i32> [[V1]], i32 0
; CHECK-NEXT: [[V1_0:%.*]] = extractelement <2 x i32> [[V1:%.*]], i32 0		; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x i32> [[V0:%.*]], i32 1
; CHECK-NEXT: [[V1_1:%.*]] = extractelement <2 x i32> [[V1]], i32 1		; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x i32> [[V0]], i32 0
; CHECK-NEXT: [[TMP0_0:%.*]] = add i32 [[V0_0]], [[V1_0]]		; CHECK-NEXT: [[TMP0_0:%.*]] = add i32 [[TMP4]], [[TMP2]]
; CHECK-NEXT: [[TMP0_1:%.*]] = add i32 [[V0_1]], [[V1_1]]		; CHECK-NEXT: [[TMP0_1:%.*]] = add i32 [[TMP3]], [[TMP1]]
; CHECK-NEXT: [[TMP0_2:%.*]] = xor i32 [[V0_0]], [[V1_0]]		; CHECK-NEXT: [[TMP5:%.*]] = xor <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP0_3:%.*]] = xor i32 [[V0_1]], [[V1_1]]
; CHECK-NEXT: [[TMP1_0:%.*]] = sub i32 [[TMP0_0]], [[TMP0_1]]		; CHECK-NEXT: [[TMP1_0:%.*]] = sub i32 [[TMP0_0]], [[TMP0_1]]
; CHECK-NEXT: [[TMP1_1:%.*]] = sub i32 [[TMP0_0]], [[TMP0_1]]		; CHECK-NEXT: [[TMP1_1:%.*]] = sub i32 [[TMP0_0]], [[TMP0_1]]
; CHECK-NEXT: [[TMP1_2:%.*]] = sub i32 [[TMP0_2]], [[TMP0_3]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP5]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>
; CHECK-NEXT: [[TMP1_3:%.*]] = sub i32 [[TMP0_3]], [[TMP0_2]]		; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i32> [[TMP5]], [[TMP6]]
		; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
; CHECK-NEXT: [[TMP2_0:%.*]] = insertelement <4 x i32> poison, i32 [[TMP1_0]], i32 0		; CHECK-NEXT: [[TMP2_0:%.*]] = insertelement <4 x i32> poison, i32 [[TMP1_0]], i32 0
; CHECK-NEXT: [[TMP2_1:%.*]] = insertelement <4 x i32> [[TMP2_0]], i32 [[TMP1_1]], i32 1		; CHECK-NEXT: [[TMP2_1:%.*]] = insertelement <4 x i32> [[TMP2_0]], i32 [[TMP1_1]], i32 1
; CHECK-NEXT: [[TMP2_2:%.*]] = insertelement <4 x i32> [[TMP2_1]], i32 [[TMP1_2]], i32 2		; CHECK-NEXT: [[TMP2_3:%.*]] = shufflevector <4 x i32> [[TMP2_1]], <4 x i32> [[TMP8]], <4 x i32> <i32 0, i32 1, i32 4, i32 5>
; CHECK-NEXT: [[TMP2_3:%.*]] = insertelement <4 x i32> [[TMP2_2]], i32 [[TMP1_3]], i32 3
; CHECK-NEXT: ret <4 x i32> [[TMP2_3]]		; CHECK-NEXT: ret <4 x i32> [[TMP2_3]]
;		;
%v0.0 = extractelement <2 x i32> %v0, i32 0		%v0.0 = extractelement <2 x i32> %v0, i32 0
%v0.1 = extractelement <2 x i32> %v0, i32 1		%v0.1 = extractelement <2 x i32> %v0, i32 1
%v1.0 = extractelement <2 x i32> %v1, i32 0		%v1.0 = extractelement <2 x i32> %v1, i32 0
%v1.1 = extractelement <2 x i32> %v1, i32 1		%v1.1 = extractelement <2 x i32> %v1, i32 1
%tmp0.0 = add i32 %v0.0, %v1.0		%tmp0.0 = add i32 %v0.0, %v1.0
%tmp0.1 = add i32 %v0.1, %v1.1		%tmp0.1 = add i32 %v0.1, %v1.1

llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll

Show First 20 Lines • Show All 139 Lines • ▼ Show 20 Lines	;
%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1		%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1
%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.0, i32 2		%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.0, i32 2
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.1, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.1, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}

define <4 x i32> @build_vec_v4i32_reuse_1(<2 x i32> %v0, <2 x i32> %v1) {		define <4 x i32> @build_vec_v4i32_reuse_1(<2 x i32> %v0, <2 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32_reuse_1(		; CHECK-LABEL: @build_vec_v4i32_reuse_1(
; CHECK-NEXT: [[V0_0:%.*]] = extractelement <2 x i32> [[V0:%.*]], i32 0		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <2 x i32> [[V1:%.*]], i32 1
; CHECK-NEXT: [[V0_1:%.*]] = extractelement <2 x i32> [[V0]], i32 1		; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x i32> [[V1]], i32 0
; CHECK-NEXT: [[V1_0:%.*]] = extractelement <2 x i32> [[V1:%.*]], i32 0		; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x i32> [[V0:%.*]], i32 1
; CHECK-NEXT: [[V1_1:%.*]] = extractelement <2 x i32> [[V1]], i32 1		; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x i32> [[V0]], i32 0
; CHECK-NEXT: [[TMP0_0:%.*]] = add i32 [[V0_0]], [[V1_0]]		; CHECK-NEXT: [[TMP0_0:%.*]] = add i32 [[TMP4]], [[TMP2]]
; CHECK-NEXT: [[TMP0_1:%.*]] = add i32 [[V0_1]], [[V1_1]]		; CHECK-NEXT: [[TMP0_1:%.*]] = add i32 [[TMP3]], [[TMP1]]
; CHECK-NEXT: [[TMP0_2:%.*]] = xor i32 [[V0_0]], [[V1_0]]		; CHECK-NEXT: [[TMP5:%.*]] = xor <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP0_3:%.*]] = xor i32 [[V0_1]], [[V1_1]]
; CHECK-NEXT: [[TMP1_0:%.*]] = sub i32 [[TMP0_0]], [[TMP0_1]]		; CHECK-NEXT: [[TMP1_0:%.*]] = sub i32 [[TMP0_0]], [[TMP0_1]]
; CHECK-NEXT: [[TMP1_1:%.*]] = sub i32 [[TMP0_0]], [[TMP0_1]]		; CHECK-NEXT: [[TMP1_1:%.*]] = sub i32 [[TMP0_0]], [[TMP0_1]]
; CHECK-NEXT: [[TMP1_2:%.*]] = sub i32 [[TMP0_2]], [[TMP0_3]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP5]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>
; CHECK-NEXT: [[TMP1_3:%.*]] = sub i32 [[TMP0_3]], [[TMP0_2]]		; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i32> [[TMP5]], [[TMP6]]
		; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
; CHECK-NEXT: [[TMP2_0:%.*]] = insertelement <4 x i32> undef, i32 [[TMP1_0]], i32 0		; CHECK-NEXT: [[TMP2_0:%.*]] = insertelement <4 x i32> undef, i32 [[TMP1_0]], i32 0
; CHECK-NEXT: [[TMP2_1:%.*]] = insertelement <4 x i32> [[TMP2_0]], i32 [[TMP1_1]], i32 1		; CHECK-NEXT: [[TMP2_1:%.*]] = insertelement <4 x i32> [[TMP2_0]], i32 [[TMP1_1]], i32 1
; CHECK-NEXT: [[TMP2_2:%.*]] = insertelement <4 x i32> [[TMP2_1]], i32 [[TMP1_2]], i32 2		; CHECK-NEXT: [[TMP2_3:%.*]] = shufflevector <4 x i32> [[TMP2_1]], <4 x i32> [[TMP8]], <4 x i32> <i32 0, i32 1, i32 4, i32 5>
; CHECK-NEXT: [[TMP2_3:%.*]] = insertelement <4 x i32> [[TMP2_2]], i32 [[TMP1_3]], i32 3
; CHECK-NEXT: ret <4 x i32> [[TMP2_3]]		; CHECK-NEXT: ret <4 x i32> [[TMP2_3]]
;		;
%v0.0 = extractelement <2 x i32> %v0, i32 0		%v0.0 = extractelement <2 x i32> %v0, i32 0
%v0.1 = extractelement <2 x i32> %v0, i32 1		%v0.1 = extractelement <2 x i32> %v0, i32 1
%v1.0 = extractelement <2 x i32> %v1, i32 0		%v1.0 = extractelement <2 x i32> %v1, i32 0
%v1.1 = extractelement <2 x i32> %v1, i32 1		%v1.1 = extractelement <2 x i32> %v1, i32 1
%tmp0.0 = add i32 %v0.0, %v1.0		%tmp0.0 = add i32 %v0.0, %v1.0
%tmp0.1 = add i32 %v0.1, %v1.1		%tmp0.1 = add i32 %v0.1, %v1.1

llvm/test/Transforms/SLPVectorizer/AArch64/vectorize-free-extracts-inserts.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -slp-vectorizer -S %s | FileCheck %s			; RUN: opt -slp-vectorizer -S %s | FileCheck %s

	target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"			target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
	target triple = "arm64-apple-darwin"			target triple = "arm64-apple-darwin"

	declare void @use(double)			declare void @use(double)

	; The extracts %v1.lane.0 and %v1.lane.1 should be considered free during SLP,			; The extracts %v1.lane.0 and %v1.lane.1 should be considered free during SLP,
	; because they will be directly in a vector register on AArch64.			; because they will be directly in a vector register on AArch64.
	define void @noop_extracts_first_2_lanes(<2 x double>* %ptr.1, <4 x double>* %ptr.2) {			define void @noop_extracts_first_2_lanes(<2 x double>* %ptr.1, <4 x double>* %ptr.2) {
	; CHECK-LABEL: @noop_extracts_first_2_lanes(			; CHECK-LABEL: @noop_extracts_first_2_lanes(
	; CHECK-NEXT: bb:			; CHECK-NEXT: bb:
	; CHECK-NEXT: [[V_1:%.*]] = load <2 x double>, <2 x double>* [[PTR_1:%.*]], align 8			; CHECK-NEXT: [[V_1:%.*]] = load <2 x double>, <2 x double>* [[PTR_1:%.*]], align 8
				; CHECK-NEXT: [[V1_LANE_0:%.*]] = extractelement <2 x double> [[V_1]], i32 0
				; CHECK-NEXT: [[V1_LANE_1:%.*]] = extractelement <2 x double> [[V_1]], i32 1
	; CHECK-NEXT: [[V_2:%.*]] = load <4 x double>, <4 x double>* [[PTR_2:%.*]], align 16			; CHECK-NEXT: [[V_2:%.*]] = load <4 x double>, <4 x double>* [[PTR_2:%.*]], align 16
	; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2			; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2
	; CHECK-NEXT: [[V2_LANE_3:%.*]] = extractelement <4 x double> [[V_2]], i32 3			; CHECK-NEXT: [[V2_LANE_3:%.*]] = extractelement <4 x double> [[V_2]], i32 3
	; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V2_LANE_2]], i32 0			; CHECK-NEXT: [[A_LANE_0:%.*]] = fmul double [[V1_LANE_0]], [[V2_LANE_2]]
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[V2_LANE_3]], i32 1			; CHECK-NEXT: [[A_LANE_1:%.*]] = fmul double [[V1_LANE_1]], [[V2_LANE_3]]
	; CHECK-NEXT: [[TMP2:%.*]] = fmul <2 x double> [[V_1]], [[TMP1]]			; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <2 x double> undef, double [[A_LANE_0]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x double> [[TMP2]], i32 0			; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <2 x double> [[A_INS_0]], double [[A_LANE_1]], i32 1
	; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <2 x double> undef, double [[TMP3]], i32 0			; CHECK-NEXT: call void @use(double [[V1_LANE_0]])
	; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x double> [[TMP2]], i32 1			; CHECK-NEXT: call void @use(double [[V1_LANE_1]])
	; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <2 x double> [[A_INS_0]], double [[TMP4]], i32 1
	; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x double> [[V_1]], i32 0
	; CHECK-NEXT: call void @use(double [[TMP5]])
	; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x double> [[V_1]], i32 1
	; CHECK-NEXT: call void @use(double [[TMP6]])
	; CHECK-NEXT: store <2 x double> [[A_INS_1]], <2 x double>* [[PTR_1]], align 8			; CHECK-NEXT: store <2 x double> [[A_INS_1]], <2 x double>* [[PTR_1]], align 8
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	bb:			bb:
	%v.1 = load <2 x double>, <2 x double>* %ptr.1, align 8			%v.1 = load <2 x double>, <2 x double>* %ptr.1, align 8
	%v1.lane.0 = extractelement <2 x double> %v.1, i32 0			%v1.lane.0 = extractelement <2 x double> %v.1, i32 0
	%v1.lane.1 = extractelement <2 x double> %v.1, i32 1			%v1.lane.1 = extractelement <2 x double> %v.1, i32 1

	▲ Show 20 Lines • Show All 188 Lines • ▼ Show 20 Lines
	; CHECK-NEXT: [[V1_LANE_0:%.*]] = extractelement <9 x double> [[V_1]], i32 0			; CHECK-NEXT: [[V1_LANE_0:%.*]] = extractelement <9 x double> [[V_1]], i32 0
	; CHECK-NEXT: [[V1_LANE_1:%.*]] = extractelement <9 x double> [[V_1]], i32 1			; CHECK-NEXT: [[V1_LANE_1:%.*]] = extractelement <9 x double> [[V_1]], i32 1
	; CHECK-NEXT: [[V1_LANE_2:%.*]] = extractelement <9 x double> [[V_1]], i32 2			; CHECK-NEXT: [[V1_LANE_2:%.*]] = extractelement <9 x double> [[V_1]], i32 2
	; CHECK-NEXT: [[V1_LANE_3:%.*]] = extractelement <9 x double> [[V_1]], i32 3			; CHECK-NEXT: [[V1_LANE_3:%.*]] = extractelement <9 x double> [[V_1]], i32 3
	; CHECK-NEXT: [[V_2:%.*]] = load <4 x double>, <4 x double>* [[PTR_2:%.*]], align 16			; CHECK-NEXT: [[V_2:%.*]] = load <4 x double>, <4 x double>* [[PTR_2:%.*]], align 16
	; CHECK-NEXT: [[V2_LANE_0:%.*]] = extractelement <4 x double> [[V_2]], i32 0			; CHECK-NEXT: [[V2_LANE_0:%.*]] = extractelement <4 x double> [[V_2]], i32 0
	; CHECK-NEXT: [[V2_LANE_1:%.*]] = extractelement <4 x double> [[V_2]], i32 1			; CHECK-NEXT: [[V2_LANE_1:%.*]] = extractelement <4 x double> [[V_2]], i32 1
	; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2			; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2
	; CHECK-NEXT: [[TMP0:%.*]] = insertelement <4 x double> poison, double [[V1_LANE_2]], i32 0			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V1_LANE_2]], i32 0
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x double> [[TMP0]], double [[V1_LANE_3]], i32 1			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[V1_LANE_3]], i32 1
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x double> [[TMP1]], double [[V1_LANE_0]], i32 2			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V2_LANE_2]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x double> [[TMP2]], double [[V1_LANE_1]], i32 3			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[V2_LANE_2]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> poison, double [[V2_LANE_2]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = fmul <2 x double> [[TMP1]], [[TMP3]]
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> [[TMP4]], double [[V2_LANE_0]], i32 1			; CHECK-NEXT: [[A_LANE_2:%.*]] = fmul double [[V1_LANE_0]], [[V2_LANE_2]]
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP5]], <2 x double> poison, <4 x i32> <i32 0, i32 0, i32 0, i32 1>			; CHECK-NEXT: [[A_LANE_3:%.*]] = fmul double [[V1_LANE_1]], [[V2_LANE_0]]
	; CHECK-NEXT: [[TMP6:%.*]] = fmul <4 x double> [[TMP3]], [[SHUFFLE]]			; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x double> [[TMP4]], i32 0
	; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x double> [[TMP6]], i32 0			; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <9 x double> undef, double [[TMP5]], i32 0
	; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <9 x double> undef, double [[TMP7]], i32 0			; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x double> [[TMP4]], i32 1
	; CHECK-NEXT: [[TMP8:%.*]] = extractelement <4 x double> [[TMP6]], i32 1			; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <9 x double> [[A_INS_0]], double [[TMP6]], i32 1
	; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <9 x double> [[A_INS_0]], double [[TMP8]], i32 1			; CHECK-NEXT: [[A_INS_2:%.*]] = insertelement <9 x double> [[A_INS_1]], double [[A_LANE_2]], i32 2
	; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x double> [[TMP6]], i32 2			; CHECK-NEXT: [[A_INS_3:%.*]] = insertelement <9 x double> [[A_INS_2]], double [[A_LANE_3]], i32 3
	; CHECK-NEXT: [[A_INS_2:%.*]] = insertelement <9 x double> [[A_INS_1]], double [[TMP9]], i32 2
	; CHECK-NEXT: [[TMP10:%.*]] = extractelement <4 x double> [[TMP6]], i32 3
	; CHECK-NEXT: [[A_INS_3:%.*]] = insertelement <9 x double> [[A_INS_2]], double [[TMP10]], i32 3
	; CHECK-NEXT: call void @use(double [[V1_LANE_0]])			; CHECK-NEXT: call void @use(double [[V1_LANE_0]])
	; CHECK-NEXT: call void @use(double [[V1_LANE_1]])			; CHECK-NEXT: call void @use(double [[V1_LANE_1]])
	; CHECK-NEXT: call void @use(double [[V1_LANE_2]])			; CHECK-NEXT: call void @use(double [[V1_LANE_2]])
	; CHECK-NEXT: call void @use(double [[V1_LANE_3]])			; CHECK-NEXT: call void @use(double [[V1_LANE_3]])
	; CHECK-NEXT: store <9 x double> [[A_INS_3]], <9 x double>* [[PTR_1]], align 8			; CHECK-NEXT: store <9 x double> [[A_INS_3]], <9 x double>* [[PTR_1]], align 8
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	bb:			bb:
	Show All 32 Lines
	; CHECK-NEXT: [[V1_LANE_1:%.*]] = extractelement <9 x double> [[V_1]], i32 1			; CHECK-NEXT: [[V1_LANE_1:%.*]] = extractelement <9 x double> [[V_1]], i32 1
	; CHECK-NEXT: [[V1_LANE_2:%.*]] = extractelement <9 x double> [[V_1]], i32 2			; CHECK-NEXT: [[V1_LANE_2:%.*]] = extractelement <9 x double> [[V_1]], i32 2
	; CHECK-NEXT: [[V1_LANE_3:%.*]] = extractelement <9 x double> [[V_1]], i32 3			; CHECK-NEXT: [[V1_LANE_3:%.*]] = extractelement <9 x double> [[V_1]], i32 3
	; CHECK-NEXT: [[V_2:%.*]] = load <4 x double>, <4 x double>* [[PTR_2:%.*]], align 16			; CHECK-NEXT: [[V_2:%.*]] = load <4 x double>, <4 x double>* [[PTR_2:%.*]], align 16
	; CHECK-NEXT: [[V2_LANE_0:%.*]] = extractelement <4 x double> [[V_2]], i32 0			; CHECK-NEXT: [[V2_LANE_0:%.*]] = extractelement <4 x double> [[V_2]], i32 0
	; CHECK-NEXT: [[V2_LANE_1:%.*]] = extractelement <4 x double> [[V_2]], i32 1			; CHECK-NEXT: [[V2_LANE_1:%.*]] = extractelement <4 x double> [[V_2]], i32 1
	; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2			; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2
	; CHECK-NEXT: [[A_LANE_0:%.*]] = fmul double [[V1_LANE_0]], [[V2_LANE_2]]			; CHECK-NEXT: [[A_LANE_0:%.*]] = fmul double [[V1_LANE_0]], [[V2_LANE_2]]
	; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V1_LANE_2]], i32 0			; CHECK-NEXT: [[A_LANE_1:%.*]] = fmul double [[V1_LANE_2]], [[V2_LANE_1]]
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[V1_LANE_1]], i32 1			; CHECK-NEXT: [[A_LANE_2:%.*]] = fmul double [[V1_LANE_1]], [[V2_LANE_2]]
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V2_LANE_1]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[V2_LANE_2]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = fmul <2 x double> [[TMP1]], [[TMP3]]
	; CHECK-NEXT: [[A_LANE_3:%.*]] = fmul double [[V1_LANE_3]], [[V2_LANE_0]]			; CHECK-NEXT: [[A_LANE_3:%.*]] = fmul double [[V1_LANE_3]], [[V2_LANE_0]]
	; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <9 x double> undef, double [[A_LANE_0]], i32 0			; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <9 x double> undef, double [[A_LANE_0]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x double> [[TMP4]], i32 0			; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <9 x double> [[A_INS_0]], double [[A_LANE_1]], i32 1
	; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <9 x double> [[A_INS_0]], double [[TMP5]], i32 1			; CHECK-NEXT: [[A_INS_2:%.*]] = insertelement <9 x double> [[A_INS_1]], double [[A_LANE_2]], i32 2
	; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x double> [[TMP4]], i32 1
	; CHECK-NEXT: [[A_INS_2:%.*]] = insertelement <9 x double> [[A_INS_1]], double [[TMP6]], i32 2
	; CHECK-NEXT: [[A_INS_3:%.*]] = insertelement <9 x double> [[A_INS_2]], double [[A_LANE_3]], i32 3			; CHECK-NEXT: [[A_INS_3:%.*]] = insertelement <9 x double> [[A_INS_2]], double [[A_LANE_3]], i32 3
	; CHECK-NEXT: call void @use(double [[V1_LANE_0]])			; CHECK-NEXT: call void @use(double [[V1_LANE_0]])
	; CHECK-NEXT: call void @use(double [[V1_LANE_1]])			; CHECK-NEXT: call void @use(double [[V1_LANE_1]])
	; CHECK-NEXT: call void @use(double [[V1_LANE_2]])			; CHECK-NEXT: call void @use(double [[V1_LANE_2]])
	; CHECK-NEXT: call void @use(double [[V1_LANE_3]])			; CHECK-NEXT: call void @use(double [[V1_LANE_3]])
	; CHECK-NEXT: store <9 x double> [[A_INS_3]], <9 x double>* [[PTR_1]], align 8			; CHECK-NEXT: store <9 x double> [[A_INS_3]], <9 x double>* [[PTR_1]], align 8
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;

llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll

	Show First 20 Lines • Show All 104 Lines • ▼ Show 20 Lines
	; SSE-NEXT: [[TMP1:%.*]] = ashr <8 x i32> [[A:%.*]], [[B:%.*]]			; SSE-NEXT: [[TMP1:%.*]] = ashr <8 x i32> [[A:%.*]], [[B:%.*]]
	; SSE-NEXT: [[TMP2:%.*]] = shl <8 x i32> [[A]], [[B]]			; SSE-NEXT: [[TMP2:%.*]] = shl <8 x i32> [[A]], [[B]]
	; SSE-NEXT: [[R7:%.*]] = shufflevector <8 x i32> [[TMP1]], <8 x i32> [[TMP2]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 12, i32 13, i32 14, i32 15>			; SSE-NEXT: [[R7:%.*]] = shufflevector <8 x i32> [[TMP1]], <8 x i32> [[TMP2]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 12, i32 13, i32 14, i32 15>
	; SSE-NEXT: ret <8 x i32> [[R7]]			; SSE-NEXT: ret <8 x i32> [[R7]]
	;			;
	; AVX1-LABEL: @ashr_shl_v8i32(			; AVX1-LABEL: @ashr_shl_v8i32(
	; AVX1-NEXT: [[A0:%.*]] = extractelement <8 x i32> [[A:%.*]], i32 0			; AVX1-NEXT: [[A0:%.*]] = extractelement <8 x i32> [[A:%.*]], i32 0
	; AVX1-NEXT: [[A1:%.*]] = extractelement <8 x i32> [[A]], i32 1			; AVX1-NEXT: [[A1:%.*]] = extractelement <8 x i32> [[A]], i32 1
	; AVX1-NEXT: [[A2:%.*]] = extractelement <8 x i32> [[A]], i32 2
	; AVX1-NEXT: [[A3:%.*]] = extractelement <8 x i32> [[A]], i32 3
	; AVX1-NEXT: [[B0:%.*]] = extractelement <8 x i32> [[B:%.*]], i32 0			; AVX1-NEXT: [[B0:%.*]] = extractelement <8 x i32> [[B:%.*]], i32 0
	; AVX1-NEXT: [[B1:%.*]] = extractelement <8 x i32> [[B]], i32 1			; AVX1-NEXT: [[B1:%.*]] = extractelement <8 x i32> [[B]], i32 1
	; AVX1-NEXT: [[B2:%.*]] = extractelement <8 x i32> [[B]], i32 2
	; AVX1-NEXT: [[B3:%.*]] = extractelement <8 x i32> [[B]], i32 3
	; AVX1-NEXT: [[AB0:%.*]] = ashr i32 [[A0]], [[B0]]			; AVX1-NEXT: [[AB0:%.*]] = ashr i32 [[A0]], [[B0]]
	; AVX1-NEXT: [[AB1:%.*]] = ashr i32 [[A1]], [[B1]]			; AVX1-NEXT: [[AB1:%.*]] = ashr i32 [[A1]], [[B1]]
	; AVX1-NEXT: [[AB2:%.*]] = ashr i32 [[A2]], [[B2]]			; AVX1-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A]], <8 x i32> undef, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
	; AVX1-NEXT: [[AB3:%.*]] = ashr i32 [[A3]], [[B3]]			; AVX1-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B]], <8 x i32> undef, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
	; AVX1-NEXT: [[TMP1:%.*]] = shl <8 x i32> [[A]], [[B]]			; AVX1-NEXT: [[TMP3:%.*]] = ashr <4 x i32> [[TMP1]], [[TMP2]]
				; AVX1-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> undef, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				; AVX1-NEXT: [[TMP5:%.*]] = shl <4 x i32> [[TMP1]], [[TMP2]]
				; AVX1-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> undef, <8 x i32> <i32 undef, i32 undef, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
				; AVX1-NEXT: [[TMP7:%.*]] = shl <8 x i32> [[A]], [[B]]
RKSimon: Why do we have both a v4i32 and v8i32 shl in here?

ABataev: That's because of two main factors here: the maximum size of the vector register and the final insertelement instructions. These insertelement instructions lead to the emission of <8 x i32> vectors, while the other instructions are limited by the 128-bit size of the vector register. The SLP vectorizer generates this code:

    define <8 x i32> @ashr_shl_v8i32(<8 x i32> %a, <8 x i32> %b) #0 {
      %a0 = extractelement <8 x i32> %a, i32 0
      %a1 = extractelement <8 x i32> %a, i32 1
      %a2 = extractelement <8 x i32> %a, i32 2
      %a3 = extractelement <8 x i32> %a, i32 3
      %a4 = extractelement <8 x i32> %a, i32 4
      %a5 = extractelement <8 x i32> %a, i32 5
      %a6 = extractelement <8 x i32> %a, i32 6
      %a7 = extractelement <8 x i32> %a, i32 7
      %b0 = extractelement <8 x i32> %b, i32 0
      %b1 = extractelement <8 x i32> %b, i32 1
      %b2 = extractelement <8 x i32> %b, i32 2
      %b3 = extractelement <8 x i32> %b, i32 3
      %b4 = extractelement <8 x i32> %b, i32 4
      %b5 = extractelement <8 x i32> %b, i32 5
      %b6 = extractelement <8 x i32> %b, i32 6
      %b7 = extractelement <8 x i32> %b, i32 7
      %ab0 = ashr i32 %a0, %b0
      %ab1 = ashr i32 %a1, %b1
      %1 = insertelement <4 x i32> poison, i32 %a2, i32 0
      %2 = insertelement <4 x i32> %1, i32 %a3, i32 1
      %3 = insertelement <4 x i32> %2, i32 %a4, i32 2
      %4 = insertelement <4 x i32> %3, i32 %a5, i32 3
      %5 = insertelement <4 x i32> poison, i32 %b2, i32 0
      %6 = insertelement <4 x i32> %5, i32 %b3, i32 1
      %7 = insertelement <4 x i32> %6, i32 %b4, i32 2
      %8 = insertelement <4 x i32> %7, i32 %b5, i32 3
      %9 = ashr <4 x i32> %4, %8
      %10 = shl <4 x i32> %4, %8
      %11 = shufflevector <4 x i32> %9, <4 x i32> %10, <4 x i32> <i32 0, i32 1, i32 6, i32 7>
      %12 = insertelement <2 x i32> poison, i32 %a6, i32 0
      %13 = insertelement <2 x i32> %12, i32 %a7, i32 1
      %14 = insertelement <2 x i32> poison, i32 %b6, i32 0
      %15 = insertelement <2 x i32> %14, i32 %b7, i32 1
      %16 = shl <2 x i32> %13, %15
      %r0 = insertelement <8 x i32> undef, i32 %ab0, i32 0
      %r1 = insertelement <8 x i32> %r0, i32 %ab1, i32 1
      %17 = extractelement <4 x i32> %11, i32 0
      %r2 = insertelement <8 x i32> %r1, i32 %17, i32 2
      %18 = extractelement <4 x i32> %11, i32 1
      %r3 = insertelement <8 x i32> %r2, i32 %18, i32 3
      %19 = extractelement <4 x i32> %11, i32 2
      %r4 = insertelement <8 x i32> %r3, i32 %19, i32 4
      %20 = extractelement <4 x i32> %11, i32 3
      %r5 = insertelement <8 x i32> %r4, i32 %20, i32 5
      %21 = extractelement <2 x i32> %16, i32 0
      %r6 = insertelement <8 x i32> %r5, i32 %21, i32 6
      %22 = extractelement <2 x i32> %16, i32 1
      %r7 = insertelement <8 x i32> %r6, i32 %22, i32 7
      ret <8 x i32> %r7
    }

InstCombiner optimizes this to <8 x i32> vector operations.

RKSimon: I'm still seeing 128/256 shifts: https://c.godbolt.org/z/4q364s1fT

    define <8 x i32> @ashr_shl_v8i32(<8 x i32> %a, <8 x i32> %b) {
      %1 = ashr <8 x i32> %a, %b
      %2 = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
      %3 = shufflevector <8 x i32> %b, <8 x i32> undef, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
      %4 = ashr <4 x i32> %2, %3
      %5 = shufflevector <4 x i32> %4, <4 x i32> undef, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
      %6 = shl <4 x i32> %2, %3
      %7 = shufflevector <4 x i32> %6, <4 x i32> undef, <8 x i32> <i32 undef, i32 undef, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
      %8 = shl <8 x i32> %a, %b
      %r3 = shufflevector <8 x i32> %1, <8 x i32> %5, <8 x i32> <i32 0, i32 1, i32 8, i32 9, i32 undef, i32 undef, i32 undef, i32 undef>
      %r5 = shufflevector <8 x i32> %r3, <8 x i32> %7, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 10, i32 11, i32 undef, i32 undef>
      %r7 = shufflevector <8 x i32> %r5, <8 x i32> %8, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 14, i32 15>
      ret <8 x i32> %r7
    }

ABataev: The original function operates on <8 x i32> params/return value. The SLP vectorizer is limited by the 128-bit vector register size and generates <2 x i32> and <4 x i32> vector instructions. InstCombiner combines some of the vector instructions and produces <4 x i32> and <8 x i32> instructions, but only because the function parameters and return value are <8 x i32>. The test uses both SLP and InstCombiner; InstCombiner produces the <8 x i32> instructions.

RKSimon: I'm still concerned about this codegen: TMP1 and TMP2 have subvector extractions that aren't even on a subvector boundary, and we're performing ashr on vector elements (4 + 5) that aren't required, but scalarizing 2 elements (0 + 1) that are.

ABataev: Looks like it reveals a problem in InstCombiner. For some reason, it extends `shl <2 x i32> %13, %15` generated by SLP to `shl <8 x i32> [[A]], [[B]]`, though it should not. As to the shuffles

    %9 = ashr <4 x i32> %4, %8
    %10 = shl <4 x i32> %4, %8

this is caused by the alternate-opcode vectorization algorithm. Looks like we detect the vector bundle:

    %ab2 = ashr i32 %a2, %b2
    %ab3 = ashr i32 %a3, %b3
    %ab4 = shl i32 %a4, %b4
    %ab5 = shl i32 %a5, %b5

and we generate something like this for it:

    %1 = insertelement <4 x i32> poison, i32 %a2, i32 0
    %2 = insertelement <4 x i32> %1, i32 %a3, i32 1
    %3 = insertelement <4 x i32> %2, i32 %a4, i32 2
    %4 = insertelement <4 x i32> %3, i32 %a5, i32 3
    %5 = insertelement <4 x i32> poison, i32 %b2, i32 0
    %6 = insertelement <4 x i32> %5, i32 %b3, i32 1
    %7 = insertelement <4 x i32> %6, i32 %b4, i32 2
    %8 = insertelement <4 x i32> %7, i32 %b5, i32 3
    %9 = ashr <4 x i32> %4, %8
    %10 = shl <4 x i32> %4, %8
    %11 = shufflevector <4 x i32> %9, <4 x i32> %10, <4 x i32> <i32 0, i32 1, i32 6, i32 7>

The insertelement instructions

    %1 = insertelement <4 x i32> poison, i32 %a2, i32 0
    %2 = insertelement <4 x i32> %1, i32 %a3, i32 1
    %3 = insertelement <4 x i32> %2, i32 %a4, i32 2
    %4 = insertelement <4 x i32> %3, i32 %a5, i32 3
    %5 = insertelement <4 x i32> poison, i32 %b2, i32 0
    %6 = insertelement <4 x i32> %5, i32 %b3, i32 1
    %7 = insertelement <4 x i32> %6, i32 %b4, i32 2
    %8 = insertelement <4 x i32> %7, i32 %b5, i32 3

are just subvector extracts. What I missed in the patch is adding a cost for subvector extracts/inserts; I will add it.

ABataev: Investigated the test case more closely. Everything is correct. Before, we just vectorized `4 shl` instructions, which were extended to `<8x` by InstCombiner. Currently, we're vectorizing `2 ashr + 2 shl` and `2 shl` (again extended to `<8x` by InstCombiner). The test will be improved further by the patch for vectorization of `InsertElement` instructions; it will end up with `ashr 8x + shl 8x + shuffle`, just like for other targets.

RKSimon: Would you be OK with waiting until D98714 has landed?

ABataev: It blocks some of the patches for non-power-2 vectorization in SLP, so it would be good to land it ASAP. Plus, it still improves the situation compared to the existing codegen.
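
The alternate-opcode pattern discussed in this thread (an `ashr` and a `shl` over the same 4-lane operands, blended by a `shufflevector` with mask `<0, 1, 6, 7>`) can be mimicked on plain arrays. This is only an illustrative sketch of the lane semantics, not code from the patch. `Vec4`, `ashr4`, `shl4`, `shuffle2`, and `altOpcodeBundle` are hypothetical names, and shift amounts are assumed to lie in [0, 31].

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec4 = std::array<int32_t, 4>;

// Lane-wise arithmetic shift right. Right-shifting a negative value is
// arithmetic on mainstream compilers and guaranteed so since C++20.
Vec4 ashr4(const Vec4 &A, const Vec4 &B) {
  Vec4 R{};
  for (std::size_t I = 0; I < 4; ++I)
    R[I] = A[I] >> B[I];
  return R;
}

// Lane-wise shift left, done on unsigned values to avoid signed overflow.
Vec4 shl4(const Vec4 &A, const Vec4 &B) {
  Vec4 R{};
  for (std::size_t I = 0; I < 4; ++I)
    R[I] = static_cast<int32_t>(static_cast<uint32_t>(A[I]) << B[I]);
  return R;
}

// Two-source shuffle: indices 0..3 select from X, 4..7 select from Y.
Vec4 shuffle2(const Vec4 &X, const Vec4 &Y, const std::array<int, 4> &Mask) {
  Vec4 R{};
  for (std::size_t I = 0; I < 4; ++I)
    R[I] = Mask[I] < 4 ? X[Mask[I]] : Y[Mask[I] - 4];
  return R;
}

// Bundle {ashr, ashr, shl, shl}: compute both opcodes on every lane,
// then keep lanes 0-1 from the ashr result and lanes 2-3 from the shl result.
Vec4 altOpcodeBundle(const Vec4 &A, const Vec4 &B) {
  return shuffle2(ashr4(A, B), shl4(A, B), {0, 1, 6, 7});
}
```

For example, `altOpcodeBundle(Vec4{64, 16, 3, 5}, Vec4{1, 2, 3, 4})` yields `{32, 4, 24, 80}`. Note that both opcodes are computed on all four lanes even though only two lanes of each result survive the blend, which is the redundancy raised in the review.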
	; AVX1-NEXT: [[R0:%.*]] = insertelement <8 x i32> poison, i32 [[AB0]], i32 0			; AVX1-NEXT: [[R0:%.*]] = insertelement <8 x i32> poison, i32 [[AB0]], i32 0
	; AVX1-NEXT: [[R1:%.*]] = insertelement <8 x i32> [[R0]], i32 [[AB1]], i32 1			; AVX1-NEXT: [[R1:%.*]] = insertelement <8 x i32> [[R0]], i32 [[AB1]], i32 1
	; AVX1-NEXT: [[R2:%.*]] = insertelement <8 x i32> [[R1]], i32 [[AB2]], i32 2			; AVX1-NEXT: [[R3:%.*]] = shufflevector <8 x i32> [[R1]], <8 x i32> [[TMP4]], <8 x i32> <i32 0, i32 1, i32 8, i32 9, i32 undef, i32 undef, i32 undef, i32 undef>
	; AVX1-NEXT: [[R3:%.*]] = insertelement <8 x i32> [[R2]], i32 [[AB3]], i32 3			; AVX1-NEXT: [[R5:%.*]] = shufflevector <8 x i32> [[R3]], <8 x i32> [[TMP6]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 10, i32 11, i32 undef, i32 undef>
	; AVX1-NEXT: [[R7:%.*]] = shufflevector <8 x i32> [[R3]], <8 x i32> [[TMP1]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 12, i32 13, i32 14, i32 15>			; AVX1-NEXT: [[R7:%.*]] = shufflevector <8 x i32> [[R5]], <8 x i32> [[TMP7]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 14, i32 15>
	; AVX1-NEXT: ret <8 x i32> [[R7]]			; AVX1-NEXT: ret <8 x i32> [[R7]]
	;			;
	; AVX2-LABEL: @ashr_shl_v8i32(			; AVX2-LABEL: @ashr_shl_v8i32(
	; AVX2-NEXT: [[TMP1:%.*]] = ashr <8 x i32> [[A:%.*]], [[B:%.*]]			; AVX2-NEXT: [[TMP1:%.*]] = ashr <8 x i32> [[A:%.*]], [[B:%.*]]
	; AVX2-NEXT: [[TMP2:%.*]] = shl <8 x i32> [[A]], [[B]]			; AVX2-NEXT: [[TMP2:%.*]] = shl <8 x i32> [[A]], [[B]]
	; AVX2-NEXT: [[R7:%.*]] = shufflevector <8 x i32> [[TMP1]], <8 x i32> [[TMP2]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 12, i32 13, i32 14, i32 15>			; AVX2-NEXT: [[R7:%.*]] = shufflevector <8 x i32> [[TMP1]], <8 x i32> [[TMP2]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 12, i32 13, i32 14, i32 15>
	; AVX2-NEXT: ret <8 x i32> [[R7]]			; AVX2-NEXT: ret <8 x i32> [[R7]]
	;			;
	; CHECK-NEXT: [[A6:%.*]] = extractelement <8 x i32> [[A]], i32 6			; CHECK-NEXT: [[A6:%.*]] = extractelement <8 x i32> [[A]], i32 6
	; CHECK-NEXT: [[A7:%.*]] = extractelement <8 x i32> [[A]], i32 7			; CHECK-NEXT: [[A7:%.*]] = extractelement <8 x i32> [[A]], i32 7
	; CHECK-NEXT: [[AB1:%.*]] = sdiv i32 [[A1]], 4			; CHECK-NEXT: [[AB1:%.*]] = sdiv i32 [[A1]], 4
	; CHECK-NEXT: [[AB2:%.*]] = sdiv i32 [[A2]], 8			; CHECK-NEXT: [[AB2:%.*]] = sdiv i32 [[A2]], 8
	; CHECK-NEXT: [[AB3:%.*]] = sdiv i32 [[A3]], 16			; CHECK-NEXT: [[AB3:%.*]] = sdiv i32 [[A3]], 16
	; CHECK-NEXT: [[AB5:%.*]] = sdiv i32 [[A5]], 4			; CHECK-NEXT: [[AB5:%.*]] = sdiv i32 [[A5]], 4
	; CHECK-NEXT: [[AB6:%.*]] = sdiv i32 [[A6]], 8			; CHECK-NEXT: [[AB6:%.*]] = sdiv i32 [[A6]], 8
	; CHECK-NEXT: [[AB7:%.*]] = sdiv i32 [[A7]], 16			; CHECK-NEXT: [[AB7:%.*]] = sdiv i32 [[A7]], 16
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <8 x i32> poison, i32 [[AB1]], i32 1			; CHECK-NEXT: [[R1:%.*]] = insertelement <8 x i32> poison, i32 [[AB1]], i32 1
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i32> [[TMP1]], i32 [[AB2]], i32 2			; CHECK-NEXT: [[R2:%.*]] = insertelement <8 x i32> [[R1]], i32 [[AB2]], i32 2
	; CHECK-NEXT: [[R4:%.*]] = insertelement <8 x i32> [[TMP2]], i32 [[AB3]], i32 3			; CHECK-NEXT: [[R3:%.*]] = insertelement <8 x i32> [[R2]], i32 [[AB3]], i32 3
	; CHECK-NEXT: [[R5:%.*]] = insertelement <8 x i32> [[R4]], i32 [[AB5]], i32 5			; CHECK-NEXT: [[R5:%.*]] = insertelement <8 x i32> [[R3]], i32 [[AB5]], i32 5
	; CHECK-NEXT: [[R6:%.*]] = insertelement <8 x i32> [[R5]], i32 [[AB6]], i32 6			; CHECK-NEXT: [[R6:%.*]] = insertelement <8 x i32> [[R5]], i32 [[AB6]], i32 6
	; CHECK-NEXT: [[R7:%.*]] = insertelement <8 x i32> [[R6]], i32 [[AB7]], i32 7			; CHECK-NEXT: [[R7:%.*]] = insertelement <8 x i32> [[R6]], i32 [[AB7]], i32 7
	; CHECK-NEXT: ret <8 x i32> [[R7]]			; CHECK-NEXT: ret <8 x i32> [[R7]]
	;			;
	%a0 = extractelement <8 x i32> %a, i32 0			%a0 = extractelement <8 x i32> %a, i32 0
	%a1 = extractelement <8 x i32> %a, i32 1			%a1 = extractelement <8 x i32> %a, i32 1
	%a2 = extractelement <8 x i32> %a, i32 2			%a2 = extractelement <8 x i32> %a, i32 2
	%a3 = extractelement <8 x i32> %a, i32 3			%a3 = extractelement <8 x i32> %a, i32 3

llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll

	; SSE-NEXT: [[TMP1:%.*]] = ashr <8 x i32> [[A:%.*]], [[B:%.*]]			; SSE-NEXT: [[TMP1:%.*]] = ashr <8 x i32> [[A:%.*]], [[B:%.*]]
	; SSE-NEXT: [[TMP2:%.*]] = shl <8 x i32> [[A]], [[B]]			; SSE-NEXT: [[TMP2:%.*]] = shl <8 x i32> [[A]], [[B]]
	; SSE-NEXT: [[R7:%.*]] = shufflevector <8 x i32> [[TMP1]], <8 x i32> [[TMP2]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 12, i32 13, i32 14, i32 15>			; SSE-NEXT: [[R7:%.*]] = shufflevector <8 x i32> [[TMP1]], <8 x i32> [[TMP2]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 12, i32 13, i32 14, i32 15>
	; SSE-NEXT: ret <8 x i32> [[R7]]			; SSE-NEXT: ret <8 x i32> [[R7]]
	;			;
	; AVX1-LABEL: @ashr_shl_v8i32(			; AVX1-LABEL: @ashr_shl_v8i32(
	; AVX1-NEXT: [[A0:%.*]] = extractelement <8 x i32> [[A:%.*]], i32 0			; AVX1-NEXT: [[A0:%.*]] = extractelement <8 x i32> [[A:%.*]], i32 0
	; AVX1-NEXT: [[A1:%.*]] = extractelement <8 x i32> [[A]], i32 1			; AVX1-NEXT: [[A1:%.*]] = extractelement <8 x i32> [[A]], i32 1
	; AVX1-NEXT: [[A2:%.*]] = extractelement <8 x i32> [[A]], i32 2
	; AVX1-NEXT: [[A3:%.*]] = extractelement <8 x i32> [[A]], i32 3
	; AVX1-NEXT: [[B0:%.*]] = extractelement <8 x i32> [[B:%.*]], i32 0			; AVX1-NEXT: [[B0:%.*]] = extractelement <8 x i32> [[B:%.*]], i32 0
	; AVX1-NEXT: [[B1:%.*]] = extractelement <8 x i32> [[B]], i32 1			; AVX1-NEXT: [[B1:%.*]] = extractelement <8 x i32> [[B]], i32 1
	; AVX1-NEXT: [[B2:%.*]] = extractelement <8 x i32> [[B]], i32 2
	; AVX1-NEXT: [[B3:%.*]] = extractelement <8 x i32> [[B]], i32 3
	; AVX1-NEXT: [[AB0:%.*]] = ashr i32 [[A0]], [[B0]]			; AVX1-NEXT: [[AB0:%.*]] = ashr i32 [[A0]], [[B0]]
	; AVX1-NEXT: [[AB1:%.*]] = ashr i32 [[A1]], [[B1]]			; AVX1-NEXT: [[AB1:%.*]] = ashr i32 [[A1]], [[B1]]
	; AVX1-NEXT: [[AB2:%.*]] = ashr i32 [[A2]], [[B2]]			; AVX1-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A]], <8 x i32> undef, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
	; AVX1-NEXT: [[AB3:%.*]] = ashr i32 [[A3]], [[B3]]			; AVX1-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B]], <8 x i32> undef, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
	; AVX1-NEXT: [[TMP1:%.*]] = shl <8 x i32> [[A]], [[B]]			; AVX1-NEXT: [[TMP3:%.*]] = ashr <4 x i32> [[TMP1]], [[TMP2]]
				; AVX1-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> undef, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				; AVX1-NEXT: [[TMP5:%.*]] = shl <4 x i32> [[TMP1]], [[TMP2]]
				; AVX1-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> undef, <8 x i32> <i32 undef, i32 undef, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
				; AVX1-NEXT: [[TMP7:%.*]] = shl <8 x i32> [[A]], [[B]]
	; AVX1-NEXT: [[R0:%.*]] = insertelement <8 x i32> undef, i32 [[AB0]], i32 0			; AVX1-NEXT: [[R0:%.*]] = insertelement <8 x i32> undef, i32 [[AB0]], i32 0
	; AVX1-NEXT: [[R1:%.*]] = insertelement <8 x i32> [[R0]], i32 [[AB1]], i32 1			; AVX1-NEXT: [[R1:%.*]] = insertelement <8 x i32> [[R0]], i32 [[AB1]], i32 1
	; AVX1-NEXT: [[R2:%.*]] = insertelement <8 x i32> [[R1]], i32 [[AB2]], i32 2			; AVX1-NEXT: [[R3:%.*]] = shufflevector <8 x i32> [[R1]], <8 x i32> [[TMP4]], <8 x i32> <i32 0, i32 1, i32 8, i32 9, i32 undef, i32 undef, i32 undef, i32 undef>
	; AVX1-NEXT: [[R3:%.*]] = insertelement <8 x i32> [[R2]], i32 [[AB3]], i32 3			; AVX1-NEXT: [[R5:%.*]] = shufflevector <8 x i32> [[R3]], <8 x i32> [[TMP6]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 10, i32 11, i32 undef, i32 undef>
	; AVX1-NEXT: [[R7:%.*]] = shufflevector <8 x i32> [[R3]], <8 x i32> [[TMP1]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 12, i32 13, i32 14, i32 15>			; AVX1-NEXT: [[R7:%.*]] = shufflevector <8 x i32> [[R5]], <8 x i32> [[TMP7]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 14, i32 15>
	; AVX1-NEXT: ret <8 x i32> [[R7]]			; AVX1-NEXT: ret <8 x i32> [[R7]]
	;			;
	; AVX2-LABEL: @ashr_shl_v8i32(			; AVX2-LABEL: @ashr_shl_v8i32(
	; AVX2-NEXT: [[TMP1:%.*]] = ashr <8 x i32> [[A:%.*]], [[B:%.*]]			; AVX2-NEXT: [[TMP1:%.*]] = ashr <8 x i32> [[A:%.*]], [[B:%.*]]
	; AVX2-NEXT: [[TMP2:%.*]] = shl <8 x i32> [[A]], [[B]]			; AVX2-NEXT: [[TMP2:%.*]] = shl <8 x i32> [[A]], [[B]]
	; AVX2-NEXT: [[R7:%.*]] = shufflevector <8 x i32> [[TMP1]], <8 x i32> [[TMP2]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 12, i32 13, i32 14, i32 15>			; AVX2-NEXT: [[R7:%.*]] = shufflevector <8 x i32> [[TMP1]], <8 x i32> [[TMP2]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 12, i32 13, i32 14, i32 15>
	; AVX2-NEXT: ret <8 x i32> [[R7]]			; AVX2-NEXT: ret <8 x i32> [[R7]]
	;			;

llvm/test/Transforms/SLPVectorizer/X86/extractelement.ll

	; CHECK-NEXT: [[X1:%.*]] = extractelement <2 x float> [[X]], i32 1			; CHECK-NEXT: [[X1:%.*]] = extractelement <2 x float> [[X]], i32 1
	; CHECK-NEXT: [[X0X0:%.*]] = fmul float [[X0]], [[X0]]			; CHECK-NEXT: [[X0X0:%.*]] = fmul float [[X0]], [[X0]]
	; CHECK-NEXT: [[X1X1:%.*]] = fmul float [[X1]], [[X1]]			; CHECK-NEXT: [[X1X1:%.*]] = fmul float [[X1]], [[X1]]
	; CHECK-NEXT: [[ADD:%.*]] = fadd float [[X0X0]], [[X1X1]]			; CHECK-NEXT: [[ADD:%.*]] = fadd float [[X0X0]], [[X1X1]]
	; CHECK-NEXT: store float [[ADD]], float* @a, align 4			; CHECK-NEXT: store float [[ADD]], float* @a, align 4
	; CHECK-NEXT: ret float [[X0]]			; CHECK-NEXT: ret float [[X0]]
	;			;
	; THRESH1-LABEL: @f_used_out_of_tree(			; THRESH1-LABEL: @f_used_out_of_tree(
	; THRESH1-NEXT: [[X0:%.*]] = extractelement <2 x float> [[X:%.*]], i32 0			; THRESH1-NEXT: [[TMP1:%.*]] = extractelement <2 x float> [[X:%.*]], i32 0
	; THRESH1-NEXT: [[X1:%.*]] = extractelement <2 x float> [[X]], i32 1			; THRESH1-NEXT: [[TMP2:%.*]] = fmul <2 x float> [[X]], [[X]]
	; THRESH1-NEXT: [[X0X0:%.*]] = fmul float [[X0]], [[X0]]			; THRESH1-NEXT: [[TMP3:%.*]] = extractelement <2 x float> [[TMP2]], i32 0
	; THRESH1-NEXT: [[X1X1:%.*]] = fmul float [[X1]], [[X1]]			; THRESH1-NEXT: [[TMP4:%.*]] = extractelement <2 x float> [[TMP2]], i32 1
	; THRESH1-NEXT: [[ADD:%.*]] = fadd float [[X0X0]], [[X1X1]]			; THRESH1-NEXT: [[ADD:%.*]] = fadd float [[TMP3]], [[TMP4]]
				RKSimon: Missed fadd reduction opportunity.
				ABataev (author): We do not even try to detect reductions if we have fewer than 4 elements; here we would end up with shuffle + vector fadd + extractelement.
	; THRESH1-NEXT: store float [[ADD]], float* @a, align 4			; THRESH1-NEXT: store float [[ADD]], float* @a, align 4
	; THRESH1-NEXT: ret float [[X0]]			; THRESH1-NEXT: ret float [[TMP1]]
	;			;
	; THRESH2-LABEL: @f_used_out_of_tree(			; THRESH2-LABEL: @f_used_out_of_tree(
	; THRESH2-NEXT: [[TMP1:%.*]] = extractelement <2 x float> [[X:%.*]], i32 0			; THRESH2-NEXT: [[TMP1:%.*]] = extractelement <2 x float> [[X:%.*]], i32 0
	; THRESH2-NEXT: [[TMP2:%.*]] = fmul <2 x float> [[X]], [[X]]			; THRESH2-NEXT: [[TMP2:%.*]] = fmul <2 x float> [[X]], [[X]]
	; THRESH2-NEXT: [[TMP3:%.*]] = extractelement <2 x float> [[TMP2]], i32 0			; THRESH2-NEXT: [[TMP3:%.*]] = extractelement <2 x float> [[TMP2]], i32 0
	; THRESH2-NEXT: [[TMP4:%.*]] = extractelement <2 x float> [[TMP2]], i32 1			; THRESH2-NEXT: [[TMP4:%.*]] = extractelement <2 x float> [[TMP2]], i32 1
	; THRESH2-NEXT: [[ADD:%.*]] = fadd float [[TMP3]], [[TMP4]]			; THRESH2-NEXT: [[ADD:%.*]] = fadd float [[TMP3]], [[TMP4]]
	; THRESH2-NEXT: store float [[ADD]], float* @a, align 4			; THRESH2-NEXT: store float [[ADD]], float* @a, align 4
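For readers following the fadd reduction exchange in extractelement.ll above, here is a rough C++ sketch of the two forms being compared for a 2-element vector. It is only an illustration and not code from SLPVectorizer.cpp: `Vec2`, `scalarSum`, `reducedSum`, and the two instruction-count constants are hypothetical names, and the counts are plain instruction counts rather than target costs from the cost model.

```cpp
#include <array>
#include <cassert>

// Illustrative 2-lane vector; stands in for <2 x float>, not an LLVM type.
using Vec2 = std::array<float, 2>;

// Scalar form, as in the THRESH1/THRESH2 check lines:
// two extractelements feeding one scalar fadd.
inline float scalarSum(const Vec2 &v) {
  float lane0 = v[0]; // extractelement ..., i32 0
  float lane1 = v[1]; // extractelement ..., i32 1
  return lane0 + lane1; // fadd float
}

// Reduction form described in the reply:
// shuffle + vector fadd + extractelement.
inline float reducedSum(const Vec2 &v) {
  Vec2 swapped{v[1], v[0]};                        // shufflevector, mask <1, 0>
  Vec2 sum{v[0] + swapped[0], v[1] + swapped[1]};  // fadd <2 x float>
  return sum[0];                                   // extractelement ..., i32 0
}

// Plain instruction counts for each form (assumption: counts only, no
// target-specific costs).
constexpr int ScalarFormInstrs = 3;    // 2 extractelement + 1 fadd
constexpr int ReductionFormInstrs = 3; // 1 shuffle + 1 vector fadd + 1 extract
```

Both forms compute the same value, and with only two lanes the reduction sequence has no fewer instructions than the scalar one, which is consistent with the reply that reductions are not attempted below 4 elements.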

Diff 339629