This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Do not emit extract elements for insertelements users, replace with shuffles directly.
ClosedPublic

Authored by ABataev on Aug 12 2021, 8:26 AM.

Details

Summary

The SLP vectorizer emits extracts for externally used vectorized scalars and
estimates the cost of each such extract. In many cases, however, these
scalars feed insertelement instructions that form a buildvector. Instead of
an extractelement/insertelement pair we can cost-estimate and emit a series
of shuffles, which can be further optimized.
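
As a rough, hypothetical illustration of the idea (the value names, element type and vector widths below are made up for this note and are not taken from the patch or its tests), instead of

  %v  = add <2 x i32> %x, %y
  %e0 = extractelement <2 x i32> %v, i32 0
  %e1 = extractelement <2 x i32> %v, i32 1
  %b0 = insertelement <4 x i32> poison, i32 %e0, i32 0
  %b1 = insertelement <4 x i32> %b0, i32 %e1, i32 1

the vectorizer can cost and emit a single width-changing shuffle that fills the buildvector lanes directly:

  %v  = add <2 x i32> %x, %y
  %b1 = shufflevector <2 x i32> %v, <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>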

Tested using the test-suite (+SPEC2017); the tests passed. SLP was able to
generate/vectorize more instructions in many cases, and the change reduced the
number of re-vectorization attempts (where we could otherwise try to vectorize
buildvector insertelements again and again).

Diff Detail

Event Timeline

ABataev created this revision.Aug 12 2021, 8:26 AM
ABataev requested review of this revision.Aug 12 2021, 8:26 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 12 2021, 8:26 AM
ABataev updated this revision to Diff 368855.Aug 26 2021, 5:24 AM

Rebased. Checked that the test SLPVectorizer/X86/remark_extract_broadcast.ll (mentioned in D108703) is updated as requested.

lebedev.ri added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/remark_extract_broadcast.ll
19–20

Thanks! This is clearly an improvement, but these two shuffles are still redundant,
because in either case you end up with the 0th element of LD in some elements of the output.
In this case you could simply drop the first shuffle and do the second one directly.

ABataev added inline comments.Aug 26 2021, 5:34 AM
llvm/test/Transforms/SLPVectorizer/X86/remark_extract_broadcast.ll
19–20

I think in codegen the first shuffle will simply be dropped (it is an identity shuffle), but I'll check what can be improved here.

lebedev.ri added inline comments.Aug 26 2021, 5:37 AM
llvm/test/Transforms/SLPVectorizer/X86/remark_extract_broadcast.ll
19–20

Ignoring more complicated cases, perhaps the key point here
is that TMP0 is an identity (=> single-source), non-width-changing shuffle,
so it can naturally be dropped. ShuffleVectorInst::isIdentityMask() might be relevant.
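
For reference, a hypothetical sketch of the pattern being discussed (not copied from the test file): an identity, single-source, non-width-changing shuffle followed by a broadcast. The first shuffle reproduces its operand exactly, so the broadcast can read from %ld directly.

  %tmp0 = shufflevector <4 x i32> %ld, <4 x i32> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3> ; identity mask
  %tmp1 = shufflevector <4 x i32> %tmp0, <4 x i32> poison, <4 x i32> zeroinitializer            ; broadcast lane 0
  ; %tmp0 can be dropped and the broadcast done in one step:
  %tmp1 = shufflevector <4 x i32> %ld, <4 x i32> poison, <4 x i32> zeroinitializer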

ABataev added inline comments.Aug 26 2021, 5:40 AM
llvm/test/Transforms/SLPVectorizer/X86/remark_extract_broadcast.ll
19–20

Agree. Will check what can be done here to improve it.

ABataev updated this revision to Diff 368885.Aug 26 2021, 8:00 AM

Address comments

Please can you rebase?

Please can you rebase?

Sure, will do, just need to finish my work with other patches.

vporpo added a subscriber: vporpo.Nov 11 2021, 7:58 PM
RKSimon added inline comments.Dec 1 2021, 8:44 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4688

The (almost NFC) change to areTwoInsertFromSameBuildVector looks like it can be pulled out to simplify this patch.

RKSimon added inline comments.Dec 14 2021, 7:24 AM
llvm/test/Transforms/SLPVectorizer/X86/pr42022.ll
222

These look like NFC changes by the update script that can probably be pre-committed to reduce the patch?

It looks OK - but it's a LOT of dense code, which makes it very difficult to grok - better comments and possibly a simplification pass might be a good idea

It looks OK - but it's a LOT of dense code, which makes it very difficult to grok - better comments and possibly a simplification pass might be a good idea

Will try to split it.

rebase? not sure how big this is now

Herald added a project: Restricted Project. · View Herald TranscriptMay 5 2022, 9:36 AM

rebase? not sure how big this is now

Working on it.

RKSimon added inline comments.May 13 2022, 7:20 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6025

I find this control flow very confusing - is the 'cast<InsertElementInst>(Base)' guaranteed to match IEBase? We break after the if() above, so we can't get here from there.

ABataev added inline comments.May 13 2022, 7:34 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6025

We just iterate through the insertelements that are not part of the vectorized buildvector.
For example:

%0 = insertelement %..., %a, 0
%1 = insertelement %0, %b, 1
%2 = insertelement %1, %c, 2

If %c is vectorized, we start looking through the buildvector, trying to find the vectorized base.
Start from %2: getTreeEntry(%2) returns nullptr. Go to %1.
getTreeEntry(%1) returns nullptr too (it is not a part of the vectorized buildvector). Go to %0.
getTreeEntry(%0) is vectorized and returns E. Iterate through all vectorized insertelements and build a mask.
Put %2 on the list of insertelements that must be transformed to shuffles.

Later, we analyze all inserts between %1 and %2 (including the boundaries). Those that must be replaced with shuffles are replaced with shuffles; the other insertelements remain as is, we just rebase them properly onto the shuffles.
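
For illustration only (the types and widths below are assumptions, not taken from the patch): if the buildvector is <4 x i32>, %c lives in lane 0 of an already-vectorized <4 x i32> value %vec, and %0/%1 remain as inserts, then %2 can be rewritten as a two-source shuffle instead of an extractelement/insertelement pair:

  %2 = shufflevector <4 x i32> %1, <4 x i32> %vec, <4 x i32> <i32 0, i32 1, i32 4, i32 3>
  ; lanes 0, 1 and 3 are taken from %1, lane 2 comes directly from lane 0 of %vec,
  ; so no extractelement of %c is needed.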

RKSimon accepted this revision.May 14 2022, 2:40 AM

LGTM

This revision is now accepted and ready to land.May 14 2022, 2:40 AM
This revision was landed with ongoing or failed builds.May 20 2022, 6:00 AM
This revision was automatically updated to reflect the committed changes.
fhahn added a subscriber: fhahn.May 21 2022, 12:57 PM

It looks like this patch is causing SLPVectorizer to crash with the following IR. This blocks building SPEC on X86, so I'll go ahead and revert this for now to unblock testing.

target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx"

define i64 @foo(ptr %arg, i32 %arg1) unnamed_addr #0 {
bb:
  %tmp = sub i32 undef, undef
  %tmp2 = sub nsw i32 undef, %tmp
  %tmp3 = add i32 undef, %tmp2
  %tmp4 = xor i32 %tmp3, undef
  %tmp5 = add i32 undef, %tmp4
  %tmp6 = sub i32 undef, undef
  %tmp7 = load i32, ptr undef, align 4
  %tmp8 = sub i32 %tmp7, undef
  %tmp9 = sub nsw i32 0, undef
  %tmp10 = add nsw i32 %tmp8, %tmp6
  %tmp11 = sub nsw i32 %tmp6, %tmp8
  %tmp12 = add i32 undef, %tmp10
  %tmp13 = xor i32 %tmp12, undef
  %tmp14 = add i32 undef, %tmp9
  %tmp15 = xor i32 %tmp14, undef
  %tmp16 = add i32 undef, %tmp11
  %tmp17 = xor i32 %tmp16, undef
  %tmp18 = add i32 %tmp13, %tmp5
  %tmp19 = add i32 %tmp18, undef
  %tmp20 = add i32 %tmp19, %tmp15
  %tmp21 = add i32 %tmp20, %tmp17
  %tmp22 = sub i32 undef, undef
  %tmp23 = add i32 undef, undef
  %tmp24 = sub i32 undef, undef
  %tmp25 = add nsw i32 %tmp23, undef
  %tmp26 = add nsw i32 %tmp24, %tmp22
  %tmp27 = sub nsw i32 %tmp22, %tmp24
  %tmp28 = add i32 undef, %tmp25
  %tmp29 = xor i32 %tmp28, undef
  %tmp30 = add i32 undef, %tmp26
  %tmp31 = xor i32 %tmp30, undef
  %tmp32 = add i32 undef, %tmp27
  %tmp33 = xor i32 %tmp32, undef
  %tmp34 = add i32 %tmp31, %tmp21
  %tmp35 = add i32 %tmp34, %tmp29
  %tmp36 = add i32 %tmp35, undef
  %tmp37 = add i32 %tmp36, %tmp33
  %tmp38 = sub nsw i32 undef, undef
  %tmp39 = add i32 undef, %tmp38
  %tmp40 = xor i32 %tmp39, undef
  %tmp41 = add i32 undef, %tmp37
  %tmp42 = add i32 %tmp41, 0
  %tmp43 = add i32 %tmp42, %tmp40
  %tmp44 = add i32 %tmp43, undef
  %tmp45 = add i32 undef, %tmp44
  %tmp46 = add i32 %tmp45, undef
  %tmp47 = add i32 %tmp46, undef
  %tmp48 = add i32 %tmp47, 0
  %tmp49 = add i32 undef, %tmp48
  %tmp50 = add i32 %tmp49, undef
  %tmp51 = add i32 %tmp50, undef
  %tmp52 = add i32 %tmp51, 0
  %tmp53 = add i32 undef, %tmp52
  %tmp54 = add i32 %tmp53, undef
  %tmp55 = add i32 %tmp54, undef
  %tmp56 = add i32 %tmp55, 0
  %tmp57 = add i32 0, %tmp56
  %tmp58 = add i32 %tmp57, 0
  %tmp59 = add i32 %tmp58, 0
  %tmp60 = add i32 %tmp59, 0
  %tmp61 = lshr i32 %tmp60, 16
  %tmp62 = add nuw nsw i32 undef, %tmp61
  %tmp63 = sub nsw i32 %tmp62, undef
  %tmp64 = zext i32 %tmp63 to i64
  %tmp65 = shl nuw i64 %tmp64, 32
  %tmp66 = add i64 %tmp65, undef
  ret i64 %tmp66
}

attributes #0 = { "target-features"="+64bit,+adx,+aes,+avx,+avx2" }
fhahn added a comment.May 24 2022, 1:25 AM

Unfortunately the latest version is still causing crashes when building SPEC2017 on X86. Reproducer below:

target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx"

%struct.hoge = type { [7 x i32 (i8*, i32, i8*, i32)*], [7 x i32 (i8*, i32, i8*, i32)*], [7 x i32 (i8*, i32, i8*, i32)*], [7 x i32 (i8*, i32, i8*, i32)*], [4 x i32 (i8*, i32, i8*, i32)*], [7 x i32 (i8*, i32, i8*, i32)*], [7 x i32 (i8*, i32, i8*, i32)*], [7 x i32 (i8*, i32, i8*, i32)*], [7 x void (i8*, i8*, i8*, i8*, i32, i32*)*], [7 x void (i8*, i8*, i8*, i8*, i8*, i32, i32*)*], [7 x i32 (i8*, i32, i8*, i32)*], i32 (i8*, i32, i8*, i32, i32*)*, [4 x i64 (i8*, i32)*], [4 x i64 (i8*, i32)*], void (i8*, i32, i8*, i32, [4 x i32]*)*, float ([4 x i32]*, [4 x i32]*, i32)*, [7 x void (i8*, i8*, i8*, i8*, i32, i32*)*], [7 x void (i8*, i8*, i8*, i8*, i8*, i32, i32*)*], [7 x void (i8*, i8*, i8*, i8*, i32, i32*)*], [7 x void (i8*, i8*, i8*, i8*, i8*, i32, i32*)*], [7 x i32 (i32*, i16*, i32, i16*, i16*, i32, i32)*], void (i8*, i8*, i32*)*, void (i8*, i8*, i32*)*, void (i8*, i8*, i32*)*, void (i8*, i8*, i32*)*, void (i8*, i8*, i32*)*, void (i8*, i8*, i32*)*, void (i8*, i8*, i32*)*, void (i8*, i8*, i32*)*, void (i8*, i8*, i32*)*, void (i8*, i8*, i32*)*, void (i8*, i8*, i32*)*, void (i8*, i8*, i32*)* }

define i64 @quux.51(i8* %arg, i32 %arg1) unnamed_addr #0 {
bb:
  %tmp = add i32 undef, undef
  %tmp2 = sub i32 undef, undef
  %tmp3 = add i32 undef, undef
  %tmp4 = sub i32 undef, undef
  %tmp5 = add nsw i32 %tmp3, %tmp
  %tmp6 = sub nsw i32 %tmp, %tmp3
  %tmp7 = add nsw i32 %tmp4, %tmp2
  %tmp8 = sub nsw i32 %tmp2, %tmp4
  %tmp9 = add i32 undef, %tmp5
  %tmp10 = xor i32 %tmp9, undef
  %tmp11 = add i32 undef, %tmp7
  %tmp12 = xor i32 %tmp11, undef
  %tmp13 = add i32 undef, %tmp6
  %tmp14 = xor i32 %tmp13, undef
  %tmp15 = add i32 undef, %tmp8
  %tmp16 = xor i32 %tmp15, undef
  %tmp17 = add i32 %tmp12, %tmp10
  %tmp18 = add i32 %tmp17, %tmp14
  %tmp19 = add i32 %tmp18, %tmp16
  %tmp20 = add i32 undef, undef
  %tmp21 = sub i32 undef, undef
  %tmp22 = add i32 undef, undef
  %tmp23 = sub i32 undef, undef
  %tmp24 = add nsw i32 %tmp22, %tmp20
  %tmp25 = sub nsw i32 %tmp20, %tmp22
  %tmp26 = add nsw i32 %tmp23, %tmp21
  %tmp27 = sub nsw i32 %tmp21, %tmp23
  %tmp28 = add i32 undef, %tmp24
  %tmp29 = xor i32 %tmp28, undef
  %tmp30 = add i32 undef, %tmp26
  %tmp31 = xor i32 %tmp30, undef
  %tmp32 = add i32 0, %tmp25
  %tmp33 = xor i32 %tmp32, 0
  %tmp34 = add i32 undef, %tmp27
  %tmp35 = xor i32 %tmp34, undef
  %tmp36 = add i32 %tmp31, %tmp19
  %tmp37 = add i32 %tmp36, %tmp29
  %tmp38 = add i32 %tmp37, %tmp33
  %tmp39 = add i32 %tmp38, %tmp35
  %tmp40 = add i32 undef, undef
  %tmp41 = sub i32 undef, undef
  %tmp42 = add i32 undef, undef
  %tmp43 = sub i32 undef, undef
  %tmp44 = add nsw i32 %tmp42, %tmp40
  %tmp45 = sub nsw i32 %tmp40, %tmp42
  %tmp46 = add nsw i32 %tmp43, %tmp41
  %tmp47 = sub nsw i32 %tmp41, %tmp43
  %tmp48 = add i32 undef, %tmp44
  %tmp49 = xor i32 %tmp48, undef
  %tmp50 = add i32 undef, %tmp46
  %tmp51 = xor i32 %tmp50, undef
  %tmp52 = add i32 undef, %tmp45
  %tmp53 = xor i32 %tmp52, undef
  %tmp54 = add i32 undef, %tmp47
  %tmp55 = xor i32 %tmp54, undef
  %tmp56 = add i32 %tmp51, %tmp39
  %tmp57 = add i32 %tmp56, %tmp49
  %tmp58 = add i32 %tmp57, %tmp53
  %tmp59 = add i32 %tmp58, %tmp55
  %tmp60 = load i32, i32* undef, align 4
  %tmp61 = add i32 undef, %tmp60
  %tmp62 = sub i32 %tmp60, undef
  %tmp63 = add i32 undef, undef
  %tmp64 = sub i32 undef, undef
  %tmp65 = add nsw i32 %tmp63, %tmp61
  %tmp66 = sub nsw i32 %tmp61, %tmp63
  %tmp67 = add nsw i32 %tmp64, %tmp62
  %tmp68 = sub nsw i32 %tmp62, %tmp64
  %tmp69 = add i32 undef, %tmp65
  %tmp70 = xor i32 %tmp69, undef
  %tmp71 = add i32 undef, %tmp67
  %tmp72 = xor i32 %tmp71, undef
  %tmp73 = add i32 undef, %tmp66
  %tmp74 = xor i32 %tmp73, undef
  %tmp75 = add i32 undef, %tmp68
  %tmp76 = xor i32 %tmp75, undef
  %tmp77 = add i32 %tmp72, %tmp59
  %tmp78 = add i32 %tmp77, %tmp70
  %tmp79 = add i32 %tmp78, %tmp74
  %tmp80 = add i32 %tmp79, %tmp76
  %tmp81 = add i32 undef, undef
  %tmp82 = sub i32 undef, undef
  %tmp83 = add i32 undef, undef
  %tmp84 = sub i32 undef, undef
  %tmp85 = add nsw i32 %tmp83, %tmp81
  %tmp86 = sub nsw i32 %tmp81, %tmp83
  %tmp87 = add nsw i32 %tmp84, %tmp82
  %tmp88 = sub nsw i32 %tmp82, %tmp84
  %tmp89 = add i32 undef, %tmp85
  %tmp90 = xor i32 %tmp89, undef
  %tmp91 = add i32 undef, %tmp87
  %tmp92 = xor i32 %tmp91, undef
  %tmp93 = add i32 undef, %tmp86
  %tmp94 = xor i32 %tmp93, undef
  %tmp95 = add i32 undef, %tmp88
  %tmp96 = xor i32 %tmp95, undef
  %tmp97 = add i32 %tmp92, %tmp80
  %tmp98 = add i32 %tmp97, %tmp90
  %tmp99 = add i32 %tmp98, %tmp94
  %tmp100 = add i32 %tmp99, %tmp96
  %tmp101 = add i32 undef, undef
  %tmp102 = sub i32 undef, undef
  %tmp103 = add i32 undef, undef
  %tmp104 = sub i32 undef, undef
  %tmp105 = add nsw i32 %tmp103, %tmp101
  %tmp106 = sub nsw i32 %tmp101, %tmp103
  %tmp107 = add nsw i32 %tmp104, %tmp102
  %tmp108 = sub nsw i32 %tmp102, %tmp104
  %tmp109 = add i32 undef, %tmp105
  %tmp110 = xor i32 %tmp109, undef
  %tmp111 = add i32 undef, %tmp107
  %tmp112 = xor i32 %tmp111, undef
  %tmp113 = add i32 undef, %tmp106
  %tmp114 = xor i32 %tmp113, undef
  %tmp115 = add i32 undef, %tmp108
  %tmp116 = xor i32 %tmp115, undef
  %tmp117 = add i32 %tmp112, %tmp100
  %tmp118 = add i32 %tmp117, %tmp110
  %tmp119 = add i32 %tmp118, %tmp114
  %tmp120 = add i32 %tmp119, %tmp116
  %tmp121 = add i32 undef, undef
  %tmp122 = sub i32 undef, undef
  %tmp123 = add i32 undef, undef
  %tmp124 = sub i32 undef, undef
  %tmp125 = add nsw i32 %tmp123, %tmp121
  %tmp126 = sub nsw i32 %tmp121, %tmp123
  %tmp127 = add nsw i32 %tmp124, %tmp122
  %tmp128 = sub nsw i32 %tmp122, %tmp124
  %tmp129 = add i32 undef, %tmp125
  %tmp130 = xor i32 %tmp129, undef
  %tmp131 = add i32 undef, %tmp127
  %tmp132 = xor i32 %tmp131, undef
  %tmp133 = add i32 undef, %tmp126
  %tmp134 = xor i32 %tmp133, undef
  %tmp135 = add i32 undef, %tmp128
  %tmp136 = xor i32 %tmp135, undef
  %tmp137 = add i32 %tmp132, %tmp120
  %tmp138 = add i32 %tmp137, %tmp130
  %tmp139 = add i32 %tmp138, %tmp134
  %tmp140 = add i32 %tmp139, %tmp136
  %tmp141 = add i32 undef, undef
  %tmp142 = sub i32 undef, undef
  %tmp143 = add i32 undef, undef
  %tmp144 = sub i32 undef, undef
  %tmp145 = add nsw i32 %tmp143, %tmp141
  %tmp146 = sub nsw i32 %tmp141, %tmp143
  %tmp147 = add nsw i32 %tmp144, %tmp142
  %tmp148 = sub nsw i32 %tmp142, %tmp144
  %tmp149 = add i32 undef, %tmp145
  %tmp150 = xor i32 %tmp149, undef
  %tmp151 = add i32 undef, %tmp147
  %tmp152 = xor i32 %tmp151, undef
  %tmp153 = add i32 undef, %tmp146
  %tmp154 = xor i32 %tmp153, undef
  %tmp155 = add i32 undef, %tmp148
  %tmp156 = xor i32 %tmp155, undef
  %tmp157 = add i32 %tmp152, %tmp140
  %tmp158 = add i32 %tmp157, %tmp150
  %tmp159 = add i32 %tmp158, %tmp154
  %tmp160 = add i32 %tmp159, %tmp156
  %tmp161 = and i32 %tmp160, 65535
  %tmp162 = add nuw nsw i32 %tmp161, undef
  %tmp163 = sub nsw i32 %tmp162, undef
  %tmp164 = zext i32 %tmp163 to i64
  %tmp165 = shl nuw i64 %tmp164, 32
  %tmp166 = add i64 %tmp165, undef
  ret i64 %tmp166
}

attributes #0 = { "target-features"="+64bit,+adx,+aes,+avx,+avx2" }

Unfortunately the latest version is still causing crashes when building SPEC2017 on X86. Reproducer below:

Hi Florian, tried to reproduce but was unable to:

opt -slp-vectorizer -S ./repro1.ll
; ModuleID = './repro1.ll'
source_filename = "./repro1.ll"
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx"

define i64 @quux.51(i8* %arg, i32 %arg1) unnamed_addr #0 {
bb:
  %tmp60 = load i32, i32* undef, align 4
  %0 = insertelement <32 x i32> poison, i32 %tmp60, i32 0
  %shuffle = shufflevector <32 x i32> %0, <32 x i32> poison, <32 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 0, i32 0, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %1 = add <32 x i32> %shuffle, poison
  %2 = sub <32 x i32> %shuffle, poison
  %3 = shufflevector <32 x i32> %1, <32 x i32> %2, <32 x i32> <i32 0, i32 33, i32 2, i32 35, i32 36, i32 5, i32 6, i32 39, i32 40, i32 9, i32 10, i32 43, i32 44, i32 13, i32 14, i32 47, i32 48, i32 17, i32 18, i32 51, i32 52, i32 21, i32 22, i32 55, i32 56, i32 25, i32 26, i32 59, i32 60, i32 29, i32 30, i32 63>
  %4 = shufflevector <32 x i32> %3, <32 x i32> poison, <32 x i32> <i32 2, i32 3, i32 0, i32 1, i32 7, i32 6, i32 5, i32 4, i32 11, i32 10, i32 9, i32 8, i32 15, i32 14, i32 13, i32 12, i32 19, i32 18, i32 17, i32 16, i32 23, i32 22, i32 21, i32 20, i32 27, i32 26, i32 25, i32 24, i32 31, i32 30, i32 29, i32 28>
  %5 = add nsw <32 x i32> %3, %4
  %6 = sub nsw <32 x i32> %3, %4
  %7 = shufflevector <32 x i32> %5, <32 x i32> %6, <32 x i32> <i32 0, i32 1, i32 34, i32 35, i32 4, i32 5, i32 38, i32 39, i32 8, i32 9, i32 42, i32 43, i32 12, i32 13, i32 46, i32 47, i32 16, i32 17, i32 50, i32 51, i32 20, i32 21, i32 54, i32 55, i32 24, i32 25, i32 58, i32 59, i32 28, i32 29, i32 62, i32 63>
  %8 = add <32 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 0, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>, %7
  %9 = xor <32 x i32> %8, <i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 0, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
  %10 = call i32 @llvm.vector.reduce.add.v32i32(<32 x i32> %9)
  %tmp161 = and i32 %10, 65535
  %tmp162 = add nuw nsw i32 %tmp161, undef
  %tmp163 = sub nsw i32 %tmp162, undef
  %tmp164 = zext i32 %tmp163 to i64
  %tmp165 = shl nuw i64 %tmp164, 32
  %tmp166 = add i64 %tmp165, undef
  ret i64 %tmp166
}

; Function Attrs: nocallback nofree nosync nounwind readnone willreturn
declare i32 @llvm.vector.reduce.add.v32i32(<32 x i32>) #1

attributes #0 = { "target-features"="+64bit,+adx,+aes,+avx,+avx2" }
attributes #1 = { nocallback nofree nosync nounwind readnone willreturn }

Could you check one more time, please?

fhahn added a comment.May 24 2022, 4:09 AM

Could you check one more time, please?

Yeah I just checked and this crashes for me with a release + assert build (commit is 96323c9f4c10bef5cb5d527970cabc73eab8aa21)

The assertion is: Assertion failed: (II && "Must be an insertelement instruction."), function vectorizeTree, file SLPVectorizer.cpp, line 8543.

Could you check one more time, please?

Yeah I just checked and this crashes for me with a release + assert build (commit is 96323c9f4c10bef5cb5d527970cabc73eab8aa21)

The assertion is: Assertion failed: (II && "Must be an insertelement instruction."), function vectorizeTree, file SLPVectorizer.cpp, line 8543.

Checked on the debug build, will check with rel+assert

Could you check one more time, please?

Yeah I just checked and this crashes for me with a release + assert build (commit is 96323c9f4c10bef5cb5d527970cabc73eab8aa21)

The assertion is: Assertion failed: (II && "Must be an insertelement instruction."), function vectorizeTree, file SLPVectorizer.cpp, line 8543.

Still unable to reproduce but I'll try to investigate it.

fhahn added a comment.May 24 2022, 5:09 AM

Could you check one more time, please?

Yeah I just checked and this crashes for me with a release + assert build (commit is 96323c9f4c10bef5cb5d527970cabc73eab8aa21)

The assertion is: Assertion failed: (II && "Must be an insertelement instruction."), function vectorizeTree, file SLPVectorizer.cpp, line 8543.

Still unable to reproduce but I'll try to investigate it.

I'm building on macOS, which defaults to using libc++. This may be the reason why you are not seeing the crash. I left an inline comment on a sort call; replacing it with stable_sort fixes the crash.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6205

Is it possible that the relative order of elements that compare as equal matters in the code below? With stable_sort, I am not seeing the crash.

ABataev added inline comments.May 24 2022, 5:36 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6205

Let me check; yeah, this is most probably caused by the libc++ difference.
I used sort here because I expected there to be no difference between the sort and stable_sort results.

ABataev added inline comments.May 24 2022, 6:09 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6205

Could you check again after f9c806ae5c53c990a935c46ba351cdcfb1271c58?

fhahn added inline comments.May 27 2022, 5:03 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6205

It doesn't crash any longer, thanks!

ABataev added inline comments.May 27 2022, 5:27 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6205

Great!

scui added a subscriber: scui.EditedJun 16 2022, 6:36 AM

Our SPEC build on PowerPC failed due to this patch. The following reproducer (gd.ll) is extracted from the gcc_r build:

target datalayout = "E-m:a-p:32:32-i64:64-n32"
target triple = "powerpc-ibm-aix7.2.0.0"

%union.tree_node = type { %struct.tree_optimization_option }
%struct.tree_optimization_option = type { %struct.tree_common, %struct.cl_optimization }
%struct.tree_common = type { %struct.tree_base, %union.tree_node*, %union.tree_node* }
%struct.tree_base = type { i64 }
%struct.cl_optimization = type { i32 }
%struct.c_declarator = type { i32, %struct.c_declarator*, i32, %union.anon.1 }
%union.anon.1 = type { %struct.anon.443 }
%struct.anon.443 = type { %union.tree_node*, i32, %union.tree_node*, i8 }
%struct.c_declspecs = type { %union.tree_node*, %union.tree_node*, %union.tree_node*, %union.tree_node*, i32, i32, i8, i32, i16, i8 }

@flag_isoc99 = internal unnamed_addr global i1 false, align 4
@pedantic = internal global i32 0, align 4

; Function Attrs: nounwind
define fastcc %union.tree_node* @grokdeclarator(%struct.c_declarator* noundef readonly %declarator, %struct.c_declspecs* nocapture noundef %declspecs) unnamed_addr #0 {
entry:
  %type = getelementptr inbounds %struct.c_declspecs, %struct.c_declspecs* %declspecs, i32 0, i32 0
  %thread_p = getelementptr inbounds %struct.c_declspecs, %struct.c_declspecs* %declspecs, i32 0, i32 8
  %p0 = bitcast %struct.c_declarator* %declarator to i64*
  %t0 = load i64, i64* %p0, align 8
  %cmp00 = icmp eq i64 %t0, 0
  br i1 %cmp00, label %if.end10, label %cleanup

if.end10:                                         ; preds = %entry
  %t1 = load %union.tree_node*, %union.tree_node** %type, align 4
  %t2 = getelementptr %union.tree_node, %union.tree_node* %t1, i32 0, i32 0, i32 0, i32 0, i32 0
  %bf.load1 = load i64, i64* %t2, align 8
  %bf.lshr.mask5.i = and i64 %bf.load1, -281474976710656
  %cmp10 = icmp eq i64 %bf.lshr.mask5.i, 4222124650659840
  %extract.t814 = trunc i64 %bf.load1 to i8
  %extract.t817 = trunc i64 %bf.load1 to i32
  %extract819 = lshr i64 %bf.load1, 43
  %extract.t820 = trunc i64 %extract819 to i32
  %extract823 = lshr i64 %bf.load1, 44
  %extract.t824 = trunc i64 %extract823 to i32
  br i1 %cmp10, label %if.then20, label %if.else20

if.then20:                                        ; preds = %if.end10
  %type1.i33 = getelementptr inbounds %union.tree_node, %union.tree_node* %t1, i32 0, i32 0, i32 0, i32 2
  %t3 = load %union.tree_node*, %union.tree_node** %type1.i33, align 4
  %t4 = getelementptr %union.tree_node, %union.tree_node* %t3, i32 0, i32 0, i32 0, i32 0, i32 0
  %bf.load2 = load i64, i64* %t4, align 8
  %extract.t = trunc i64 %bf.load2 to i8
  %extract.t816 = trunc i64 %bf.load2 to i32
  %extract = lshr i64 %bf.load2, 43
  %extract.t818 = trunc i64 %extract to i32
  %extract821 = lshr i64 %bf.load2, 44
  %extract.t822 = trunc i64 %extract821 to i32
  br label %if.else20

if.else20:                                        ; preds = %if.then20, %if.end10
  %bf.load.off0 = phi i8 [ %extract.t, %if.then20 ], [ %extract.t814, %if.end10 ]
  %bf.load.off0815 = phi i32 [ %extract.t816, %if.then20 ], [ %extract.t817, %if.end10 ]
  %bf.load.off43 = phi i32 [ %extract.t818, %if.then20 ], [ %extract.t820, %if.end10 ]
  %bf.load.off44 = phi i32 [ %extract.t822, %if.then20 ], [ %extract.t824, %if.end10 ]
  %type.addr.0.lcssa.i = phi %union.tree_node* [ %t3, %if.then20 ], [ %t1, %if.end10 ]
  %p5 = getelementptr inbounds %union.tree_node, %union.tree_node* %type.addr.0.lcssa.i, i32 0, i32 0, i32 1, i32 0
  %p9 = getelementptr inbounds %struct.c_declspecs, %struct.c_declspecs* %declspecs, i32 0, i32 9
  %bf.load154 = load i16, i16* %thread_p, align 4
  %bf.lshr155 = lshr i16 %bf.load154, 7
  %bf.clear156 = and i16 %bf.lshr155, 1
  %bf.cast157 = zext i16 %bf.clear156 to i32
  %bf.cast162 = and i32 %bf.load.off43, 1
  %add = add nuw nsw i32 %bf.cast162, %bf.cast157
  %bf.load168 = load i32, i32* %p5, align 4
  %bf.lshr169 = lshr i32 %bf.load168, 18
  %t6 = insertelement <2 x i16> poison, i16 %bf.load154, i64 0
  %t7 = shufflevector <2 x i16> %t6, <2 x i16> poison, <2 x i32> zeroinitializer
  %t8 = lshr <2 x i16> %t7, <i16 5, i16 6>
  %t9 = and <2 x i16> %t8, <i16 1, i16 1>
  %t10 = zext <2 x i16> %t9 to <2 x i32>
  %t11 = insertelement <2 x i32> poison, i32 %bf.lshr169, i64 0
  %t12 = insertelement <2 x i32> %t11, i32 %bf.load.off44, i64 1
  %t13 = and <2 x i32> %t12, <i32 1, i32 1>
  %t14 = add nuw nsw <2 x i32> %t13, %t10
  %t15 = load i8, i8* %p9, align 2
  %conv188 = zext i8 %t15 to i32
  %cmp20 = icmp eq i8 %t15, 0
  %conv192 = and i32 %bf.load.off0815, 255
  %cond196 = select i1 %cmp20, i32 %bf.load.off0815, i32 %conv188
  %t16 = load i32, i32* @pedantic, align 4
  %cmp30 = icmp eq i32 %t16, 0
  %.b28 = load i1, i1* @flag_isoc99, align 4
  %t17 = insertelement <2 x i1> poison, i1 %cmp20, i64 0
  %t18 = insertelement <2 x i1> %t17, i1 %cmp30, i64 1
  %t19 = zext <2 x i1> %t18 to <2 x i64>
  %or.cond1969 = select i1 %cmp30, i1 true, i1 %.b28
  br i1 %or.cond1969, label %cleanup, label %if.else30

if.else30:                                        ; preds = %if.else20
  %cmp40 = icmp ugt i32 %add, 1
  br i1 %cmp40, label %if.then40, label %if.end40

if.then40:                                        ; preds = %if.else30
  br label %if.end40

if.end40:                                         ; preds = %if.then40, %if.else30
  %t20 = extractelement <2 x i32> %t14, i64 0
  %cmp50 = icmp ugt i32 %t20, 1
  br i1 %cmp50, label %if.then50, label %if.end50

if.then50:                                        ; preds = %if.end40
  br label %if.end50

if.end50:                                         ; preds = %if.then50, %if.end40
  br label %cleanup

cleanup:                                          ; preds = %if.end50, %if.else20, %entry
  ret %union.tree_node* null
}

attributes #0 = { nounwind "approx-func-fp-math"="true" "frame-pointer"="none" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="pwr10" "target-features"="+altivec,+bpermd,+crbits,+crypto,+direct-move,+extdiv,+isa-v206-instructions,+isa-v207-instructions,+isa-v30-instructions,+isa-v31-instructions,+mma,+paired-vector-memops,+pcrelative-memops,+power10-vector,+power8-vector,+power9-vector,+prefix-instrs,+vsx,-htm,-privileged,-quadword-atomics,-rop-protect,-spe" }

Here is the crash dump with the latest SLPVectorizer.cpp (as of June 16). To reproduce:

opt  -slp-vectorizer gd.ll

opt: llvm/main/llvm-project/llvm/lib/IR/Instructions.cpp:2012: llvm::ShuffleVectorInst::ShuffleVectorInst(llvm::Value *, llvm::Value *, ArrayRef<int>, const llvm::Twine &, llvm::Instruction *): Assertion `isValidOperands(V1, V2, Mask) && "Invalid shuffle vector instruction operands!"' failed.

Stack dump:
0. Program arguments: llvm/main/build/bin/opt -slp-vectorizer gd.ll
#0 0x0000000012ea16d4 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (llvm/main/build/bin/opt+0x12ea16d4)
#1 0x0000000012ea1af4 PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
#2 0x0000000012e9e818 llvm::sys::RunSignalHandlers() (llvm/main/build/bin/opt+0x12e9e818)
#3 0x0000000012ea1dbc SignalHandler(int) Signals.cpp:0:0
#4 0x00007d17768b04c8 (linux-vdso64.so.1+0x4c8)
#5 0x00007d1776130468 libc_signal_restore_set /build/glibc-tRXAGY/glibc-2.31/signal/../sysdeps/unix/sysv/linux/internal-signals.h:86:3
#6 0x00007d1776130468 raise /build/glibc-tRXAGY/glibc-2.31/signal/../sysdeps/unix/sysv/linux/raise.c:48:3
#7 0x00007d1776107cd0 abort /build/glibc-tRXAGY/glibc-2.31/stdlib/abort.c:79:7
#8 0x00007d177611f5dc assert_fail_base /build/glibc-tRXAGY/glibc-2.31/assert/assert.c:92:3
#9 0x00007d177611f680 __assert_fail /build/glibc-tRXAGY/glibc-2.31/assert/assert.c:101:3
#10 0x00000000124870cc llvm::ShuffleVectorInst::ShuffleVectorInst(llvm::Value*, llvm::Value*, llvm::ArrayRef<int>, llvm::Twine const&, llvm::Instruction*) (llvm/main/build/bin/opt+0x124870cc)
#11 0x000000001064b62c llvm::IRBuilderBase::CreateShuffleVector(llvm::Value*, llvm::Value*, llvm::ArrayRef<int>, llvm::Twine const&) (llvm/main/build/bin/opt+0x1064b62c)
#12 0x000000001318a698 llvm::slpvectorizer::BoUpSLP::vectorizeTree(llvm::MapVector<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u>, llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int>>, std::vector<std::pair<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u>>, std::allocator<std::pair<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u>>>>>&)::$_69::operator()(llvm::Value*, llvm::Value*, llvm::ArrayRef<int>) const SLPVectorizer.cpp:0:0
#13 0x000000001314098c llvm::slpvectorizer::BoUpSLP::vectorizeTree(llvm::MapVector<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u>, llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int>>, std::vector<std::pair<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u>>, std::allocator<std::pair<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u>>>>>&) (llvm/main/build/bin/opt+0x1314098c)
#14 0x0000000013150de0 llvm::SLPVectorizerPass::tryToVectorizeList(llvm::ArrayRef<llvm::Value*>, llvm::slpvectorizer::BoUpSLP&, bool) (llvm/main/build/bin/opt+0x13150de0)
......

Can you please take a look? Thanks!

There is another issue which I tracked down to this patch, but it is somewhat hidden. To reveal it, please apply the attached patch (it basically enables expensive checks and adds a verifyFunction call right after the vectorized code is generated).

Crash looks like this:
Instruction does not dominate all uses!

%41 = insertelement <4 x i32> %40, i32 %32, i32 1
%39 = insertelement <4 x i32> %41, i32 poison, i32 2

opt: /path/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:8404: llvm::Value* llvm::slpvectorizer::BoUpSLP::vectorizeTree(): Assertion `!verifyFunction(*F, &dbgs()) && "Broken after vec"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0. Program arguments: bin/opt -slp-vectorizer -mcpu=skylake -disable-output reduced.ll

There is another issue which I tracked down to this patch, but it is somewhat hidden. To reveal it, please apply the attached patch (it basically enables expensive checks and adds a verifyFunction call right after the vectorized code is generated).

Crash looks like this:
Instruction does not dominate all uses!

%41 = insertelement <4 x i32> %40, i32 %32, i32 1
%39 = insertelement <4 x i32> %41, i32 poison, i32 2

opt: /path/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:8404: llvm::Value* llvm::slpvectorizer::BoUpSLP::vectorizeTree(): Assertion `!verifyFunction(*F, &dbgs()) && "Broken after vec"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0. Program arguments: bin/opt -slp-vectorizer -mcpu=skylake -disable-output reduced.ll

Hi Valery, thanks for the report, will prepare the fix later today or tomorrow

There is another issue which I tracked down to this patch, but it is somewhat hidden. To reveal it, please apply the attached patch (it basically enables expensive checks and adds a verifyFunction call right after the vectorized code is generated).

Crash looks like this:
Instruction does not dominate all uses!

%41 = insertelement <4 x i32> %40, i32 %32, i32 1
%39 = insertelement <4 x i32> %41, i32 poison, i32 2

opt: /path/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:8404: llvm::Value* llvm::slpvectorizer::BoUpSLP::vectorizeTree(): Assertion `!verifyFunction(*F, &dbgs()) && "Broken after vec"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0. Program arguments: bin/opt -slp-vectorizer -mcpu=skylake -disable-output reduced.ll

Investigated. This is not quite a bug, but some junk is left behind that requires cleanup. I'll add code to do this extra cleanup to avoid any problems; plus, I believe it may improve compile time in some cases.