This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Improve handling of compensate external uses cost.
ClosedPublic

Authored by ABataev on Apr 29 2021, 11:09 AM.

Details

Summary

External insertelement users can be represented as the result of a shuffle
of the vectorized element, including non-consecutive insertelements. Added
support for handling non-consecutive insertelements.
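
As a rough illustration (a hypothetical example, not taken from the patch or its tests), a scalar build sequence whose results land in non-consecutive lanes of a destination vector can now still be modelled as one vectorized operation plus a single shuffle into the destination, rather than per-lane extract/insert pairs:

// Hypothetical illustration only - not code from the patch or its tests.
// The two scalar sums are inserted into lanes 0 and 2 of the destination,
// i.e. non-consecutive insertelements.
#include <x86intrin.h>

__m128 build_nonconsecutive(__m128 dst, const float *p) {
  dst[0] = p[0] + p[4]; // lane 0
  dst[2] = p[1] + p[5]; // lane 2 - skips lane 1
  return dst;
}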

Diff Detail

Event Timeline

ABataev created this revision.Apr 29 2021, 11:09 AM
ABataev requested review of this revision.Apr 29 2021, 11:09 AM
Herald added a project: Restricted Project.Apr 29 2021, 11:09 AM
Matt added a subscriber: Matt.Apr 29 2021, 12:48 PM
RKSimon added inline comments.May 4 2021, 2:42 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
636

Explain the purpose of InsertUses in the doxygen

ABataev added inline comments.May 4 2021, 2:48 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
636

Will add, thanks!

ABataev updated this revision to Diff 343018.May 5 2021, 5:48 AM

Address comments.

RKSimon added inline comments.May 10 2021, 4:57 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4408

IsIdentity &= ?

llvm/test/Transforms/SLPVectorizer/X86/hsub.ll
174 ↗(On Diff #343018)

These regressions look like we need to do more in the shuffle costs to recognise when the shuffles don't cross subvector boundaries, either for illegal types like this or across 128-bit subvector boundaries on AVX.
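
For reference, a minimal sketch of the distinction being described (illustrative only, not from the patch): on AVX, an <8 x float> shuffle whose elements stay within their own 128-bit half lowers to cheap in-lane instructions, while one that moves elements across the 128-bit boundary needs vperm2f128-style instructions and arguably deserves a different cost.

#include <x86intrin.h>

// Illustrative only. The first mask keeps every element inside its own
// 128-bit half; the second moves elements across the 128-bit boundary.
__m256 in_lane(__m256 v) {
  return __builtin_shufflevector(v, v, 1, 0, 3, 2, 5, 4, 7, 6);
}
__m256 cross_lane(__m256 v) {
  return __builtin_shufflevector(v, v, 4, 5, 6, 7, 0, 1, 2, 3);
}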

ABataev added inline comments.May 10 2021, 5:00 AM
llvm/test/Transforms/SLPVectorizer/X86/hsub.ll
174 ↗(On Diff #343018)

Yes, we need to subtract the scalarization overhead for the insertelement instruction; I'm trying to handle it correctly in the patch that vectorizes InsertElement instructions. I'm going to abandon this patch once vectorization of InsertElements lands. Keeping it just in case.

ABataev updated this revision to Diff 345545.May 14 2021, 1:33 PM

Rework after handling of insertelements

ABataev edited the summary of this revision. (Show Details)May 14 2021, 1:34 PM
ABataev updated this revision to Diff 346224.May 18 2021, 11:29 AM
ABataev edited the summary of this revision. (Show Details)

Rebase + improved build vector detection.

RKSimon added inline comments.May 18 2021, 1:29 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2911

would this be better as a count_if ?

2916

Any good way to merge the SourceVectors set with the VectorOperands list?

ABataev added inline comments.May 18 2021, 1:41 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2911

We do not need a simple count here; we're filling the list of operands and counting the source vectors at the same time. I rather doubt count_if will help here.
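
(A generic sketch of what is meant, with made-up names rather than the actual SLPVectorizer code: the single loop both collects the scalar operands and counts the distinct source vectors, which a plain count_if over VL could not do in one traversal.)

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Sketch only; assumes every value in VL is an InsertElementInst.
static unsigned collectInsertOperands(ArrayRef<Value *> VL,
                                      SmallVectorImpl<Value *> &Operands) {
  SmallPtrSet<Value *, 4> SourceVectors;
  for (Value *V : VL) {
    auto *IE = cast<InsertElementInst>(V);
    SourceVectors.insert(IE->getOperand(0)); // distinct destination vectors
    Operands.push_back(IE->getOperand(1));   // the inserted scalars
  }
  return SourceVectors.size();
}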

2916

I thought about that too; I'll try to improve it somehow.

ABataev updated this revision to Diff 346422.May 19 2021, 5:53 AM

Rebase + address comments

A few minors

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2913

SmallVector seems unnecessary - why not just ValueList VectorOperands[NumOps] ? Even NumOps seems a bit too much.

3821

V is used only once

getInsertIndex(VL[I], 0)
3833

assert(Offset < UINT_MAX && "Failed to find vector index offset") ? Or should it be Offset < NumScalars ?

4387

auto

4417

Can this be replaced with an if (none_of(FirstUsers)) pattern? You might be able to merge AreFromSingleVector into the lambda as well, although that might get too unwieldy?

4987

assert(Offset < UINT_MAX && "Failed to find vector index offset") ? Or should it be Offset < NumScalars ?

ABataev updated this revision to Diff 346735.May 20 2021, 7:50 AM

Address Comments.

RKSimon added inline comments.May 20 2021, 8:27 AM
llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll
72 ↗(On Diff #346735)

Still not performing fptoui on the entire <4 x i32>?

RKSimon accepted this revision.May 21 2021, 1:18 AM

LGTM

llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll
72 ↗(On Diff #346735)

This is purely a cost-model issue - fptoui for 2f32 is 8 but 4f32 is 18 (it looks like the model assumes they scalarize, which they don't) - these costs are really wrong, but that shouldn't stop this patch.

This revision is now accepted and ready to land.May 21 2021, 1:18 AM
RKSimon added inline comments.May 21 2021, 4:18 AM
llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll
72 ↗(On Diff #346735)
ABataev added inline comments.May 21 2021, 4:45 AM
llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll
72 ↗(On Diff #346735)

OK, thanks, I'll check it. Sorry for the delay with the answers; I've been busy with other regressions.

This revision was automatically updated to reflect the committed changes.

We're seeing some test failures that bisected to this patch, possibly a miscompile. The test failure is in the unit test for this file: https://github.com/google/tink/blob/master/cc/subtle/aes_eax_aesni.cc. Are there already any known issues with this patch?

No, there are not. It would help if you could provide the reproducer and exact compile command to check if the problem exists.

I was unsuccessful in getting it to repro directly from the open source repo. However, I reduced it to this, which shows the issue:

$ cat repro.cc
#include <xmmintrin.h>

#include <cstdint>
#include <cstdio>
#include <cstring>

// https://github.com/google/tink/blob/a72c9d542cd1dd8b58b2620ab52585cf5544f212/cc/subtle/aes_eax_aesni.cc#L79
inline __m128i Add(__m128i x, uint64_t y) {
  // Convert to a vector of two uint64_t.
  uint64_t vec[2];
  _mm_storeu_si128(reinterpret_cast<__m128i *>(vec), x);
  // Perform the addition on the vector.
  vec[0] += y;
  if (y > vec[0]) {
    vec[1]++;
  }
  // Convert back to xmm.
  return _mm_loadu_si128(reinterpret_cast<__m128i *>(vec));
}

void print128(__m128i var) {
  uint64_t parts[2];
  memcpy(parts, &var, sizeof(parts));
  printf("%lu %lu\n", parts[0], parts[1]);
}

template <class T>
void DoNotOptimize(const T &var) {
  asm volatile("" : "+m"(const_cast<T &>(var)));
}

int main() {
  __m128i x = _mm_setzero_si128();
  DoNotOptimize(x);
  __m128i y = Add(x, 1);
  print128(x);
  print128(y);
}
$ clang++ repro.cc -o /tmp/miscompile -O2 -fno-slp-vectorize && /tmp/miscompile
0 0
1 0
$ clang++ repro.cc -o /tmp/miscompile -O2 && /tmp/miscompile
0 0
1 1

Prior to this patch, there was no difference between enabling and disabling -fslp-vectorize. The issue seems to be how this patch optimizes Add:

vec[0] += y;
if (y > vec[0]) {  // This effectively evaluates to true
  vec[1]++;
}
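
(For reference, a minimal standalone check of the intended scalar carry semantics for the reduced case x = 0, y = 1 - an illustration added here, not part of the original report.)

#include <cassert>
#include <cstdint>

int main() {
  uint64_t vec[2] = {0, 0}; // x = 0
  uint64_t y = 1;
  vec[0] += y;              // vec[0] == 1
  if (y > vec[0])           // 1 > 1 is false: no unsigned overflow
    vec[1]++;               // must not execute
  assert(vec[0] == 1 && vec[1] == 0); // expected output "1 0", not "1 1"
}
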
dyung added a subscriber: dyung.May 26 2021, 2:21 AM
This comment was removed by dyung.
dyung added a comment.May 26 2021, 2:31 AM

Hi, we are noticing a regression in the quality of the code generated by the compiler for btver2 after this change.

Consider the following code (ymm-1undef-add_ps_002.cpp):

#include <x86intrin.h>

__attribute__((noinline))
__m256 add_ps_002(__m256 a, __m256 b) {
  __m256 r = (__m256){ a[0] + a[1], a[2] + a[3], a[4] + a[5], a[6] + a[7],
                       b[0] + b[1], b[2] + b[3], b[4] + b[5], b[6] + b[7] };
  return __builtin_shufflevector(r, a, 0, -1, 2, 3, 4, 5, 6, 7);
}

Prior to this change, when compiled with "-g0 -O3 -march=btver2" the compiler would generate the following assembly:

# %bb.0:                                # %entry                                                                                         
        vhaddps %xmm0, %xmm0, %xmm2                                                                                                      
        vextractf128    $1, %ymm0, %xmm0                                                                                                 
        vhaddps %xmm0, %xmm1, %xmm3                                                                                                      
        vinsertf128     $1, %xmm3, %ymm0, %ymm3                                                                                          
        vhaddps %ymm0, %ymm1, %ymm0                                                                                                      
        vblendps        $3, %ymm2, %ymm3, %ymm2         # ymm2 = ymm2[0,1],ymm3[2,3,4,5,6,7]
        vshufpd $2, %ymm0, %ymm2, %ymm0         # ymm0 = ymm2[0],ymm0[1],ymm2[2],ymm0[2]
        retq

With the following characteristics according to llvm-mca:

Iterations:        100
Instructions:      800
Total Cycles:      902
Total uOps:        1200

Dispatch Width:    2
uOps Per Cycle:    1.33
IPC:               0.89
Block RThroughput: 6.0

But after this change, the compiler is now producing the following assembly for the same code:

# %bb.0:                                # %entry
        vextractf128    $1, %ymm0, %xmm2
        vmovlhps        %xmm2, %xmm0, %xmm3             # xmm3 = xmm0[0],xmm2[0]                                                         
        vshufps $17, %xmm2, %xmm0, %xmm0        # xmm0 = xmm0[1,0],xmm2[1,0]                                                             
        vshufps $232, %xmm2, %xmm3, %xmm3       # xmm3 = xmm3[0,2],xmm2[2,3]                                                             
        vshufps $248, %xmm2, %xmm0, %xmm0       # xmm0 = xmm0[0,2],xmm2[3,3]                                                             
        vextractf128    $1, %ymm1, %xmm2                            
        vinsertps       $48, %xmm1, %xmm3, %xmm3 # xmm3 = xmm3[0,1,2],xmm1[0]                                                            
        vinsertps       $112, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[1]                                                           
        vhaddps %xmm2, %xmm1, %xmm1                                                                                                      
        vhaddps %xmm2, %xmm2, %xmm2                                                                                                      
        vaddps  %xmm0, %xmm3, %xmm0                                 
        vpermilps       $148, %xmm0, %xmm3      # xmm3 = xmm0[0,1,1,2]                                                                   
        vinsertps       $200, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[3],xmm1[1,2],zero                                                        
        vinsertps       $112, %xmm2, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm2[1]                                                           
        vinsertf128     $1, %xmm0, %ymm3, %ymm0                                                                                          
        retq

Which has the following characteristics according to llvm-mca:

Iterations:        100                                                                                                                   
Instructions:      1600                                                                                                                  
Total Cycles:      1007                                                                                                                  
Total uOps:        1700         

Dispatch Width:    2
uOps Per Cycle:    1.69
IPC:               1.59
Block RThroughput: 8.5

With some help from @RKSimon in understanding the llvm-mca output, I understand that the increased RThroughput number is bad for hot loops, while the increase in total cycles is worse for straight-line code.

Could you take a look?

Thanks for the reports; I will investigate them all and fix them ASAP.

Here is a fix for the reported miscompile: D103164

Regarding the btver2 regression: it looks like codegen or some other later passes previously recognized the pattern, while the SLP vectorizer did not. Actually, without the SLP vectorizer I'm getting just this:

vperm2f128      $49, %ymm1, %ymm0, %ymm2 # ymm2 = ymm0[2,3],ymm1[2,3]
vinsertf128     $1, %xmm1, %ymm0, %ymm0
vhaddps %ymm2, %ymm0, %ymm0
retq

I assume SLP will be able to produce something similar (or even better) once we start supporting vectorization of non-power-of-2 vectors. Here we have a pattern that hits this exactly:

return __builtin_shufflevector(r, a, 0, -1, 2, 3, 4, 5, 6, 7);

The -1 causes the optimizer to optimize out the a[2] + a[3] operation, and SLP does not recognize vectorization of the remaining 7 addition operations. This is the price we have to pay until non-power-of-2 vectorization lands. I will try to speed this up.
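
For reference, the lane mapping of that shufflevector (mask index -1 = undef), which shows why only 7 of the 8 additions survive:

out[0]    = r[0] = a[0] + a[1]
out[1]    = undef              (the -1 in the mask; r[1] = a[2] + a[3] is dropped)
out[2]    = r[2] = a[4] + a[5]
out[3]    = r[3] = a[6] + a[7]
out[4..7] = r[4..7] = b[0] + b[1], b[2] + b[3], b[4] + b[5], b[6] + b[7]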