This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Improve cost model for the vectorized extractelements.
ClosedPublic

Authored by ABataev on Apr 6 2021, 11:09 AM.

Details

Summary
  1. No need to call areAllUsersVectorized as later the cost is calculated only if the instruction has one use and gets vectorized.
  2. Need to calculate the cost of the dead extractelement more precisely, taking the vector type of the vector operand, not the resulting vector type.

Part of D57059.

Diff Detail

Event Timeline

ABataev created this revision.Apr 6 2021, 11:09 AM
ABataev requested review of this revision.Apr 6 2021, 11:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 6 2021, 11:09 AM
ABataev updated this revision to Diff 335783.Apr 7 2021, 5:31 AM

Improve cost of other extractelement instructions.

RKSimon added inline comments.Apr 7 2021, 5:37 AM
llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
123

Why do we have both a v4i32 and v8i32 shl in here?

ABataev added inline comments.Apr 7 2021, 5:43 AM
llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
123

That's because of 2 main factors here: max size of the vector register and final insertinstruction instructions. These insertinstruction leads to the emission of <8x i32> vectors while other instructions are limited by the 128 bit size of the vector register.
SLP vectorizer generates this code:

define <8 x i32> @ashr_shl_v8i32(<8 x i32> %a, <8 x i32> %b) #0 {
  %a0 = extractelement <8 x i32> %a, i32 0
  %a1 = extractelement <8 x i32> %a, i32 1
  %a2 = extractelement <8 x i32> %a, i32 2
  %a3 = extractelement <8 x i32> %a, i32 3
  %a4 = extractelement <8 x i32> %a, i32 4
  %a5 = extractelement <8 x i32> %a, i32 5
  %a6 = extractelement <8 x i32> %a, i32 6
  %a7 = extractelement <8 x i32> %a, i32 7
  %b0 = extractelement <8 x i32> %b, i32 0
  %b1 = extractelement <8 x i32> %b, i32 1
  %b2 = extractelement <8 x i32> %b, i32 2
  %b3 = extractelement <8 x i32> %b, i32 3
  %b4 = extractelement <8 x i32> %b, i32 4
  %b5 = extractelement <8 x i32> %b, i32 5
  %b6 = extractelement <8 x i32> %b, i32 6
  %b7 = extractelement <8 x i32> %b, i32 7
  %ab0 = ashr i32 %a0, %b0
  %ab1 = ashr i32 %a1, %b1
  %1 = insertelement <4 x i32> poison, i32 %a2, i32 0
  %2 = insertelement <4 x i32> %1, i32 %a3, i32 1
  %3 = insertelement <4 x i32> %2, i32 %a4, i32 2
  %4 = insertelement <4 x i32> %3, i32 %a5, i32 3
  %5 = insertelement <4 x i32> poison, i32 %b2, i32 0
  %6 = insertelement <4 x i32> %5, i32 %b3, i32 1
  %7 = insertelement <4 x i32> %6, i32 %b4, i32 2
  %8 = insertelement <4 x i32> %7, i32 %b5, i32 3
  %9 = ashr <4 x i32> %4, %8
  %10 = shl <4 x i32> %4, %8
  %11 = shufflevector <4 x i32> %9, <4 x i32> %10, <4 x i32> <i32 0, i32 1, i32 6, i32 7>
  %12 = insertelement <2 x i32> poison, i32 %a6, i32 0
  %13 = insertelement <2 x i32> %12, i32 %a7, i32 1
  %14 = insertelement <2 x i32> poison, i32 %b6, i32 0
  %15 = insertelement <2 x i32> %14, i32 %b7, i32 1
  %16 = shl <2 x i32> %13, %15
  %r0 = insertelement <8 x i32> undef, i32 %ab0, i32 0
  %r1 = insertelement <8 x i32> %r0, i32 %ab1, i32 1
  %17 = extractelement <4 x i32> %11, i32 0
  %r2 = insertelement <8 x i32> %r1, i32 %17, i32 2
  %18 = extractelement <4 x i32> %11, i32 1
  %r3 = insertelement <8 x i32> %r2, i32 %18, i32 3
  %19 = extractelement <4 x i32> %11, i32 2
  %r4 = insertelement <8 x i32> %r3, i32 %19, i32 4
  %20 = extractelement <4 x i32> %11, i32 3
  %r5 = insertelement <8 x i32> %r4, i32 %20, i32 5
  %21 = extractelement <2 x i32> %16, i32 0
  %r6 = insertelement <8 x i32> %r5, i32 %21, i32 6
  %22 = extractelement <2 x i32> %16, i32 1
  %r7 = insertelement <8 x i32> %r6, i32 %22, i32 7
  ret <8 x i32> %r7
}

InstCombiner optimizes this to <8 x i32> vector operations.

RKSimon added inline comments.Apr 7 2021, 6:05 AM
llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
123

I'm still seeing 128/256 shifts: https://c.godbolt.org/z/4q364s1fT

define <8 x i32> @ashr_shl_v8i32(<8 x i32> %a, <8 x i32> %b) {
  %1 = ashr <8 x i32> %a, %b
  %2 = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
  %3 = shufflevector <8 x i32> %b, <8 x i32> undef, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
  %4 = ashr <4 x i32> %2, %3
  %5 = shufflevector <4 x i32> %4, <4 x i32> undef, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %6 = shl <4 x i32> %2, %3
  %7 = shufflevector <4 x i32> %6, <4 x i32> undef, <8 x i32> <i32 undef, i32 undef, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %8 = shl <8 x i32> %a, %b
  %r3 = shufflevector <8 x i32> %1, <8 x i32> %5, <8 x i32> <i32 0, i32 1, i32 8, i32 9, i32 undef, i32 undef, i32 undef, i32 undef>
  %r5 = shufflevector <8 x i32> %r3, <8 x i32> %7, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 10, i32 11, i32 undef, i32 undef>
  %r7 = shufflevector <8 x i32> %r5, <8 x i32> %8, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 14, i32 15>
  ret <8 x i32> %r7
}
ABataev added inline comments.Apr 7 2021, 6:15 AM
llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
123

The original function operates on <8 x i32> params/return value. SLP vectorizer is limited by 128 bit vector register size and generates <2 x i32> and <4 x i32> vector instructions. instcombiner combines some of the vector instructions and produces <4 x i32> and <8 x i32> instructions. But only becaus ethe function paramters and return value are <8 x i32>.

The test uses both SLP and instcombiner, instcombiner produces <8 x i32> instructions.

RKSimon requested changes to this revision.Apr 15 2021, 3:15 AM
RKSimon added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
123

I'm still concerned about this codegen - TMP1 and TMP2 have subvector extractions that aren't even on a subvector boundary, and we're performing ashr on vector elements (4 + 5) that aren't required - but scalarizing 2 elements (0 + 1) that are.

This revision now requires changes to proceed.Apr 15 2021, 3:15 AM
ABataev added inline comments.Apr 15 2021, 6:18 AM
llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
123

Looks like it reveals the problem in InstCombiner. For some reason, it extends shl <2 x i32> %13, %15 generated by SLP to shl <8 x i32> [[A]], [[B]] though it should not.
As to shuffles

%9 = ashr <4 x i32> %4, %8
%10 = shl <4 x i32> %4, %8

It is caused by the altopcode vectorization algorithm. Looks like we detect vector bundle:

%ab2 = ashr i32 %a2, %b2
%ab3 = ashr i32 %a3, %b3
%ab4 = shl  i32 %a4, %b4
%ab5 = shl  i32 %a5, %b5

and we generate something like this for it:

%1 = insertelement <4 x i32> poison, i32 %a2, i32 0
%2 = insertelement <4 x i32> %1, i32 %a3, i32 1
%3 = insertelement <4 x i32> %2, i32 %a4, i32 2
%4 = insertelement <4 x i32> %3, i32 %a5, i32 3
%5 = insertelement <4 x i32> poison, i32 %b2, i32 0
%6 = insertelement <4 x i32> %5, i32 %b3, i32 1
%7 = insertelement <4 x i32> %6, i32 %b4, i32 2
%8 = insertelement <4 x i32> %7, i32 %b5, i32 3
%9 = ashr <4 x i32> %4, %8
%10 = shl <4 x i32> %4, %8
%11 = shufflevector <4 x i32> %9, <4 x i32> %10, <4 x i32> <i32 0, i32 1, i32 6, i32 7>

.

%1 = insertelement <4 x i32> poison, i32 %a2, i32 0
%2 = insertelement <4 x i32> %1, i32 %a3, i32 1
%3 = insertelement <4 x i32> %2, i32 %a4, i32 2
%4 = insertelement <4 x i32> %3, i32 %a5, i32 3
%5 = insertelement <4 x i32> poison, i32 %b2, i32 0
%6 = insertelement <4 x i32> %5, i32 %b3, i32 1
%7 = insertelement <4 x i32> %6, i32 %b4, i32 2
%8 = insertelement <4 x i32> %7, i32 %b5, i32 3

are just subvector extracts. What I missed in the patch is adding a cost for subvector extracts/inserts, will add it.

ABataev updated this revision to Diff 337893.Apr 15 2021, 2:01 PM

Rebase + improvements

ABataev added inline comments.Apr 15 2021, 2:10 PM
llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
123

Investigated test case more closely. Everything is correct.
Before we just vectorized 4 shl instructions, which were extended to <8x by Instcombiner. Currently, we're vectorizing 2 ashr + 2 shl and 2 shl (again extended to <8x by Instcombiner). The test will be improved further by the patch for vectorization of InsertElement instructions, it will end up with ashr 8x + shl 8x + shuffle just like for other targets.

RKSimon added inline comments.Apr 16 2021, 2:58 AM
llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
123

Would you be OK with waiting until D98714 has landed?

ABataev added inline comments.Apr 16 2021, 5:31 AM
llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
123

It blocks some of the patches for non-power-2 vectorization in SLP, would be good to land it ASAP. Plus, it still improves the situation comparing to existing codegen.

RKSimon accepted this revision.Apr 22 2021, 3:55 AM

OK - LGTM - cheers

llvm/test/Transforms/SLPVectorizer/X86/extractelement.ll
53

Missed fadd reduction opportunity

This revision is now accepted and ready to land.Apr 22 2021, 3:55 AM
ABataev added inline comments.Apr 22 2021, 7:21 AM
llvm/test/Transforms/SLPVectorizer/X86/extractelement.ll
53

We even do not try to detect reductions if we have less than 4 elements, here we're going to have shuffle + vector fadd + extractelement.

This revision was landed with ongoing or failed builds.Apr 22 2021, 7:41 AM
This revision was automatically updated to reflect the committed changes.
jgorbe added a subscriber: jgorbe.May 3 2021, 4:33 PM

Hi, we're seeing some build failures after this patch. If you build the following reduced test case with clang -cc1 -emit-obj -O3 -vectorize-slp reduced.cc

# 1 "" 3
typedef int a __attribute__((__mode__(__DI__)));
struct b {
  __attribute__((__vector_size__(4 * sizeof(long)))) long c;
  a operator[](int) const;
};
a b::operator[](int d) const {
  if (d)
    return c[d];
  return;
}
b e;
a au[8];
void __attribute__((target("avx"))) f() {
  for (int g = 0; g < 8; ++g) {
    long bc = e[g];
    au[g] = e[bc & 7];
  }
}

it results in this error:

fatal error: error in backend: Cannot select: 0x578dff9323a8: v2i64 = truncate 0x578dff926208
  0x578dff926208: v4i32 = and 0x578dff926750, 0x578dff926a90
    0x578dff926750: v4i32 = X86ISD::UNPCKL 0x578dff926958, 0x578dff926b60
      0x578dff926958: v4i32 = X86ISD::PSHUFD 0x578dff9322d8, TargetConstant:i8<-18>
        0x578dff9322d8: v4i32,ch = load<(dereferenceable load 16 from `<4 x i64>* getelementptr inbounds (%struct.b, %struct.b* @e, i64 0, i32 0)`, align 32)> 0x578dffc25c68, 0x578dff932b60, undef:i64
          0x578dff932b60: i64 = X86ISD::WrapperRIP TargetGlobalAddress:i64<%struct.b* @e> 0
            0x578dff932888: i64 = TargetGlobalAddress<%struct.b* @e> 0
          0x578dff9324e0: i64 = undef
        0x578dff9268f0: i8 = TargetConstant<-18>
      0x578dff926b60: v4i32 = bitcast 0x578dff932410
        0x578dff932410: v2i64,ch = load<(dereferenceable load 16 from `<4 x i64>* getelementptr inbounds (%struct.b, %struct.b* @e, i64 0, i32 0)` + 16, basealign 32)> 0x578dffc25c68, 0x578dff932068, undef:i64
          0x578dff932068: i64 = add 0x578dff932b60, Constant:i64<16>
            0x578dff932b60: i64 = X86ISD::WrapperRIP TargetGlobalAddress:i64<%struct.b* @e> 0
              0x578dff932888: i64 = TargetGlobalAddress<%struct.b* @e> 0
            0x578dff926548: i64 = Constant<16>
          0x578dff9324e0: i64 = undef
    0x578dff926a90: v4i32,ch = load<(load 16 from constant-pool)> 0x578dffc25c68, 0x578dff932820, undef:i64
      0x578dff932820: i64 = X86ISD::WrapperRIP TargetConstantPool:i64<<4 x i32> <i32 7, i32 7, i32 undef, i32 undef>> 0
        0x578dff932618: i64 = TargetConstantPool<<4 x i32> <i32 7, i32 7, i32 undef, i32 undef>> 0
      0x578dff9324e0: i64 = undef
In function: _Z1fv

A debug build of clang hits this assertion instead:

clang: /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp:4842: llvm::SDValue llvm::SelectionDAG::getNode(unsigned int, const llvm::SDLoc &, llvm::EVT, llvm::SDValue, const llvm::SDNodeFlags): Assertion `(!VT.isVector() || VT.getVectorElementCount() == Operand.getValueType().getVectorElementCount()) && "Vector element count mismatch!"' failed.

Hi, we're seeing some build failures after this patch. If you build the following reduced test case with clang -cc1 -emit-obj -O3 -vectorize-slp reduced.cc

# 1 "" 3
typedef int a __attribute__((__mode__(__DI__)));
struct b {
  __attribute__((__vector_size__(4 * sizeof(long)))) long c;
  a operator[](int) const;
};
a b::operator[](int d) const {
  if (d)
    return c[d];
  return;
}
b e;
a au[8];
void __attribute__((target("avx"))) f() {
  for (int g = 0; g < 8; ++g) {
    long bc = e[g];
    au[g] = e[bc & 7];
  }
}

it results in this error:

fatal error: error in backend: Cannot select: 0x578dff9323a8: v2i64 = truncate 0x578dff926208
  0x578dff926208: v4i32 = and 0x578dff926750, 0x578dff926a90
    0x578dff926750: v4i32 = X86ISD::UNPCKL 0x578dff926958, 0x578dff926b60
      0x578dff926958: v4i32 = X86ISD::PSHUFD 0x578dff9322d8, TargetConstant:i8<-18>
        0x578dff9322d8: v4i32,ch = load<(dereferenceable load 16 from `<4 x i64>* getelementptr inbounds (%struct.b, %struct.b* @e, i64 0, i32 0)`, align 32)> 0x578dffc25c68, 0x578dff932b60, undef:i64
          0x578dff932b60: i64 = X86ISD::WrapperRIP TargetGlobalAddress:i64<%struct.b* @e> 0
            0x578dff932888: i64 = TargetGlobalAddress<%struct.b* @e> 0
          0x578dff9324e0: i64 = undef
        0x578dff9268f0: i8 = TargetConstant<-18>
      0x578dff926b60: v4i32 = bitcast 0x578dff932410
        0x578dff932410: v2i64,ch = load<(dereferenceable load 16 from `<4 x i64>* getelementptr inbounds (%struct.b, %struct.b* @e, i64 0, i32 0)` + 16, basealign 32)> 0x578dffc25c68, 0x578dff932068, undef:i64
          0x578dff932068: i64 = add 0x578dff932b60, Constant:i64<16>
            0x578dff932b60: i64 = X86ISD::WrapperRIP TargetGlobalAddress:i64<%struct.b* @e> 0
              0x578dff932888: i64 = TargetGlobalAddress<%struct.b* @e> 0
            0x578dff926548: i64 = Constant<16>
          0x578dff9324e0: i64 = undef
    0x578dff926a90: v4i32,ch = load<(load 16 from constant-pool)> 0x578dffc25c68, 0x578dff932820, undef:i64
      0x578dff932820: i64 = X86ISD::WrapperRIP TargetConstantPool:i64<<4 x i32> <i32 7, i32 7, i32 undef, i32 undef>> 0
        0x578dff932618: i64 = TargetConstantPool<<4 x i32> <i32 7, i32 7, i32 undef, i32 undef>> 0
      0x578dff9324e0: i64 = undef
In function: _Z1fv

A debug build of clang hits this assertion instead:

clang: /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp:4842: llvm::SDValue llvm::SelectionDAG::getNode(unsigned int, const llvm::SDLoc &, llvm::EVT, llvm::SDValue, const llvm::SDNodeFlags): Assertion `(!VT.isVector() || VT.getVectorElementCount() == Operand.getValueType().getVectorElementCount()) && "Vector element count mismatch!"' failed.

Thanks for the report, I'll investigate it

Hi, we're seeing some build failures after this patch. If you build the following reduced test case with clang -cc1 -emit-obj -O3 -vectorize-slp reduced.cc

# 1 "" 3
typedef int a __attribute__((__mode__(__DI__)));
struct b {
  __attribute__((__vector_size__(4 * sizeof(long)))) long c;
  a operator[](int) const;
};
a b::operator[](int d) const {
  if (d)
    return c[d];
  return;
}
b e;
a au[8];
void __attribute__((target("avx"))) f() {
  for (int g = 0; g < 8; ++g) {
    long bc = e[g];
    au[g] = e[bc & 7];
  }
}

it results in this error:

fatal error: error in backend: Cannot select: 0x578dff9323a8: v2i64 = truncate 0x578dff926208
  0x578dff926208: v4i32 = and 0x578dff926750, 0x578dff926a90
    0x578dff926750: v4i32 = X86ISD::UNPCKL 0x578dff926958, 0x578dff926b60
      0x578dff926958: v4i32 = X86ISD::PSHUFD 0x578dff9322d8, TargetConstant:i8<-18>
        0x578dff9322d8: v4i32,ch = load<(dereferenceable load 16 from `<4 x i64>* getelementptr inbounds (%struct.b, %struct.b* @e, i64 0, i32 0)`, align 32)> 0x578dffc25c68, 0x578dff932b60, undef:i64
          0x578dff932b60: i64 = X86ISD::WrapperRIP TargetGlobalAddress:i64<%struct.b* @e> 0
            0x578dff932888: i64 = TargetGlobalAddress<%struct.b* @e> 0
          0x578dff9324e0: i64 = undef
        0x578dff9268f0: i8 = TargetConstant<-18>
      0x578dff926b60: v4i32 = bitcast 0x578dff932410
        0x578dff932410: v2i64,ch = load<(dereferenceable load 16 from `<4 x i64>* getelementptr inbounds (%struct.b, %struct.b* @e, i64 0, i32 0)` + 16, basealign 32)> 0x578dffc25c68, 0x578dff932068, undef:i64
          0x578dff932068: i64 = add 0x578dff932b60, Constant:i64<16>
            0x578dff932b60: i64 = X86ISD::WrapperRIP TargetGlobalAddress:i64<%struct.b* @e> 0
              0x578dff932888: i64 = TargetGlobalAddress<%struct.b* @e> 0
            0x578dff926548: i64 = Constant<16>
          0x578dff9324e0: i64 = undef
    0x578dff926a90: v4i32,ch = load<(load 16 from constant-pool)> 0x578dffc25c68, 0x578dff932820, undef:i64
      0x578dff932820: i64 = X86ISD::WrapperRIP TargetConstantPool:i64<<4 x i32> <i32 7, i32 7, i32 undef, i32 undef>> 0
        0x578dff932618: i64 = TargetConstantPool<<4 x i32> <i32 7, i32 7, i32 undef, i32 undef>> 0
      0x578dff9324e0: i64 = undef
In function: _Z1fv

A debug build of clang hits this assertion instead:

clang: /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp:4842: llvm::SDValue llvm::SelectionDAG::getNode(unsigned int, const llvm::SDLoc &, llvm::EVT, llvm::SDValue, const llvm::SDNodeFlags): Assertion `(!VT.isVector() || VT.getVectorElementCount() == Operand.getValueType().getVectorElementCount()) && "Vector element count mismatch!"' failed.

Looks like a bug in X86ISelLowering, will try to make a patch to fix this

Hi, we're seeing some build failures after this patch. If you build the following reduced test case with clang -cc1 -emit-obj -O3 -vectorize-slp reduced.cc

# 1 "" 3
typedef int a __attribute__((__mode__(__DI__)));
struct b {
  __attribute__((__vector_size__(4 * sizeof(long)))) long c;
  a operator[](int) const;
};
a b::operator[](int d) const {
  if (d)
    return c[d];
  return;
}
b e;
a au[8];
void __attribute__((target("avx"))) f() {
  for (int g = 0; g < 8; ++g) {
    long bc = e[g];
    au[g] = e[bc & 7];
  }
}

it results in this error:

fatal error: error in backend: Cannot select: 0x578dff9323a8: v2i64 = truncate 0x578dff926208
  0x578dff926208: v4i32 = and 0x578dff926750, 0x578dff926a90
    0x578dff926750: v4i32 = X86ISD::UNPCKL 0x578dff926958, 0x578dff926b60
      0x578dff926958: v4i32 = X86ISD::PSHUFD 0x578dff9322d8, TargetConstant:i8<-18>
        0x578dff9322d8: v4i32,ch = load<(dereferenceable load 16 from `<4 x i64>* getelementptr inbounds (%struct.b, %struct.b* @e, i64 0, i32 0)`, align 32)> 0x578dffc25c68, 0x578dff932b60, undef:i64
          0x578dff932b60: i64 = X86ISD::WrapperRIP TargetGlobalAddress:i64<%struct.b* @e> 0
            0x578dff932888: i64 = TargetGlobalAddress<%struct.b* @e> 0
          0x578dff9324e0: i64 = undef
        0x578dff9268f0: i8 = TargetConstant<-18>
      0x578dff926b60: v4i32 = bitcast 0x578dff932410
        0x578dff932410: v2i64,ch = load<(dereferenceable load 16 from `<4 x i64>* getelementptr inbounds (%struct.b, %struct.b* @e, i64 0, i32 0)` + 16, basealign 32)> 0x578dffc25c68, 0x578dff932068, undef:i64
          0x578dff932068: i64 = add 0x578dff932b60, Constant:i64<16>
            0x578dff932b60: i64 = X86ISD::WrapperRIP TargetGlobalAddress:i64<%struct.b* @e> 0
              0x578dff932888: i64 = TargetGlobalAddress<%struct.b* @e> 0
            0x578dff926548: i64 = Constant<16>
          0x578dff9324e0: i64 = undef
    0x578dff926a90: v4i32,ch = load<(load 16 from constant-pool)> 0x578dffc25c68, 0x578dff932820, undef:i64
      0x578dff932820: i64 = X86ISD::WrapperRIP TargetConstantPool:i64<<4 x i32> <i32 7, i32 7, i32 undef, i32 undef>> 0
        0x578dff932618: i64 = TargetConstantPool<<4 x i32> <i32 7, i32 7, i32 undef, i32 undef>> 0
      0x578dff9324e0: i64 = undef
In function: _Z1fv

A debug build of clang hits this assertion instead:

clang: /usr/local/google/home/jgorbe/code/llvm/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp:4842: llvm::SDValue llvm::SelectionDAG::getNode(unsigned int, const llvm::SDLoc &, llvm::EVT, llvm::SDValue, const llvm::SDNodeFlags): Assertion `(!VT.isVector() || VT.getVectorElementCount() == Operand.getValueType().getVectorElementCount()) && "Vector element count mismatch!"' failed.

Must be fixed in 13a51e017c09ce449ba2ec0024baf356d6dfcbad

Awesome! Thanks a lot :)