External insertelement users can be represented as the result of a shuffle of the vectorized element, including non-consecutive insertelements. Added support for handling non-consecutive insertelements.
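As a rough, hypothetical illustration (not taken from the patch or its tests), the following C++ snippet would, under clang's usual lowering of vector subscripts, create insertelement users at non-consecutive lanes of the result vector; the function name and lane choice are invented for this example:

```cpp
#include <xmmintrin.h>

// Hypothetical example: the two scalar sums feed insertelement instructions
// into lanes 0 and 2 of the result, i.e. non-consecutive insertelement users
// of the pair of additions that SLP may vectorize.
__m128 insert_non_consecutive(__m128 a, __m128 b, __m128 c) {
  __m128 r = c;
  r[0] = a[0] + b[0];  // insertelement into lane 0
  r[2] = a[2] + b[2];  // insertelement into lane 2 (lane 1 is skipped)
  return r;
}
```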
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
| Line | Comment |
| --- | --- |
| 636 | Explain the purpose of InsertUses in the doxygen |

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
| Line | Comment |
| --- | --- |
| 636 | Will add, thanks! |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
| Line | Comment |
| --- | --- |
| 4408 | IsIdentity &= ? |
llvm/test/Transforms/SLPVectorizer/X86/hsub.ll
| Line | Diff | Comment |
| --- | --- | --- |
| 174 ↗ | (On Diff #343018) | These regressions look like we need to do more in the shuffle costs to recognise when the shuffles don't cross subvector boundaries? Either for illegal types like this, or across 128-bit subvector boundaries on AVX. |
llvm/test/Transforms/SLPVectorizer/X86/hsub.ll
| Line | Diff | Comment |
| --- | --- | --- |
| 174 ↗ | (On Diff #343018) | Yes, we need to subtract the scalarization overhead for the insertelement instruction; I'm trying to handle it correctly in the patch for vectorization of InsertElement instructions. I'm going to abandon this patch once vectorization of InsertElements lands. Keeping it just in case. |
A few minor comments:
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
| Line | Comment |
| --- | --- |
| 2913 | SmallVector seems unnecessary - why not just ValueList VectorOperands[NumOps]? Even NumOps seems a bit too much. |
| 3821 | V is used only once - just use getInsertIndex(VL[I], 0). |
| 3833 | assert(Offset < UINT_MAX && "Failed to find vector index offset")? Or should it be Offset < NumScalars? |
| 4387 | auto |
| 4417 | Can this be replaced with an if (none_of(FirstUsers)) pattern? You might be able to merge AreFromSingleVector into the lambda as well, although that might get too unwieldy? |
| 4987 | assert(Offset < UINT_MAX && "Failed to find vector index offset")? Or should it be Offset < NumScalars? |
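A minimal sketch of the `none_of` idiom suggested for line 4417 (the range contents and the predicate body are assumptions, not code from the patch):

```cpp
// Hypothetical sketch: FirstUsers is assumed to be a range of candidate
// user values, and the lambda stands in for whatever the AreFromSingleVector
// flag currently computes. llvm::none_of comes from llvm/ADT/STLExtras.h.
if (llvm::none_of(FirstUsers, [&](Value *U) {
      return comesFromSingleVector(U); // hypothetical predicate
    })) {
  // ... handle the case where no first user comes from a single vector ...
}
```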
llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll
| Line | Diff | Comment |
| --- | --- | --- |
| 72 ↗ | (On Diff #346735) | Still not performing fptoui on the entire <4 x i32>? |
LGTM
llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll
| Line | Diff | Comment |
| --- | --- | --- |
| 72 ↗ | (On Diff #346735) | This is purely a cost-model issue - the fptoui cost for 2f32 is 8 but for 4f32 it is 18 (it looks like the model assumes they scalarize, which they don't). These costs are really wrong, but that shouldn't stop this patch. |
llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll
| Line | Diff | Comment |
| --- | --- | --- |
| 72 ↗ | (On Diff #346735) | @ABataev If you rebase, this should be fixed after rGeb6429d0fb94fd467e03d229177ae6ff3a44e3cc + rG3ae7f7ae0a33961be48948205981aea91920d3aa |
llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll
| Line | Diff | Comment |
| --- | --- | --- |
| 72 ↗ | (On Diff #346735) | OK, thanks, I'll check it. Sorry for the delay with the answers; I've been busy with other regressions. |
We're seeing some test failures that bisected to this patch, possibly a miscompile. The failure is in the unit test for this file: https://github.com/google/tink/blob/master/cc/subtle/aes_eax_aesni.cc. Are there any known issues with this patch?
No, there are not. It would help if you could provide the reproducer and exact compile command to check if the problem exists.
I was unsuccessful in getting it to reproduce directly from the open-source repo. However, I reduced it to the following, which shows the issue:
```
$ cat repro.cc
#include <xmmintrin.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

// https://github.com/google/tink/blob/a72c9d542cd1dd8b58b2620ab52585cf5544f212/cc/subtle/aes_eax_aesni.cc#L79
inline __m128i Add(__m128i x, uint64_t y) {
  // Convert to a vector of two uint64_t.
  uint64_t vec[2];
  _mm_storeu_si128(reinterpret_cast<__m128i *>(vec), x);

  // Perform the addition on the vector.
  vec[0] += y;
  if (y > vec[0]) {
    vec[1]++;
  }

  // Convert back to xmm.
  return _mm_loadu_si128(reinterpret_cast<__m128i *>(vec));
}

void print128(__m128i var) {
  uint64_t parts[2];
  memcpy(parts, &var, sizeof(parts));
  printf("%lu %lu\n", parts[0], parts[1]);
}

template <class T>
void DoNotOptimize(const T &var) {
  asm volatile("" : "+m"(const_cast<T &>(var)));
}

int main() {
  __m128i x = _mm_setzero_si128();
  DoNotOptimize(x);
  __m128i y = Add(x, 1);
  print128(x);
  print128(y);
}

$ clang++ repro.cc -o /tmp/miscompile -O2 -fno-slp-vectorize && /tmp/miscompile
0 0
1 0
$ clang++ repro.cc -o /tmp/miscompile -O2 && /tmp/miscompile
0 0
1 1
```
Prior to this patch, there is no difference between enabling and disabling -fslp-vectorize. The issue seems to be how this patch optimizes Add:
```cpp
vec[0] += y;
if (y > vec[0]) {  // This effectively evaluates to true.
  vec[1]++;
}
```
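For reference, these are the scalar semantics the vectorized code must preserve: after `vec[0] += y`, the comparison `y > vec[0]` is the unsigned-overflow (carry) check for the low 64 bits. A minimal scalar walk-through with the reproducer's values (x all zeros, y == 1):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  uint64_t lo = 0, hi = 0;  // the two 64-bit halves of x = 0
  uint64_t y = 1;

  lo += y;       // lo becomes 1
  if (y > lo)    // carry check: 1 > 1 is false, so no carry into hi
    ++hi;

  // Expected: lo == 1, hi == 0 ("1 0"), matching the -fno-slp-vectorize
  // output; the vectorized build instead prints "1 1".
  printf("%lu %lu\n", lo, hi);
  return 0;
}
```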
Hi, we are noticing a regression in the quality of the code generated by the compiler for btver2 after this change.
Consider the following code (ymm-1undef-add_ps_002.cpp):
```cpp
#include <x86intrin.h>

__attribute__((noinline)) __m256 add_ps_002(__m256 a, __m256 b) {
  __m256 r = (__m256){ a[0] + a[1], a[2] + a[3], a[4] + a[5], a[6] + a[7],
                       b[0] + b[1], b[2] + b[3], b[4] + b[5], b[6] + b[7] };
  return __builtin_shufflevector(r, a, 0, -1, 2, 3, 4, 5, 6, 7);
}
```
Prior to this change, when compiled with "-g0 -O3 -march=btver2" the compiler would generate the following assembly:
```
# %bb.0:                               # %entry
        vhaddps      %xmm0, %xmm0, %xmm2
        vextractf128 $1, %ymm0, %xmm0
        vhaddps      %xmm0, %xmm1, %xmm3
        vinsertf128  $1, %xmm3, %ymm0, %ymm3
        vhaddps      %ymm0, %ymm1, %ymm0
        vblendps     $3, %ymm2, %ymm3, %ymm2    # ymm2 = ymm2[0,1],ymm3[2,3,4,5,6,7]
        vshufpd      $2, %ymm0, %ymm2, %ymm0    # ymm0 = ymm2[0],ymm0[1],ymm2[2],ymm0[2]
        retq
```
With the following characteristics according to llvm-mca:
```
Iterations:        100
Instructions:      800
Total Cycles:      902
Total uOps:        1200

Dispatch Width:    2
uOps Per Cycle:    1.33
IPC:               0.89
Block RThroughput: 6.0
```
But after this change, the compiler is now producing the following assembly for the same code:
```
# %bb.0:                               # %entry
        vextractf128 $1, %ymm0, %xmm2
        vmovlhps     %xmm2, %xmm0, %xmm3        # xmm3 = xmm0[0],xmm2[0]
        vshufps      $17, %xmm2, %xmm0, %xmm0   # xmm0 = xmm0[1,0],xmm2[1,0]
        vshufps      $232, %xmm2, %xmm3, %xmm3  # xmm3 = xmm3[0,2],xmm2[2,3]
        vshufps      $248, %xmm2, %xmm0, %xmm0  # xmm0 = xmm0[0,2],xmm2[3,3]
        vextractf128 $1, %ymm1, %xmm2
        vinsertps    $48, %xmm1, %xmm3, %xmm3   # xmm3 = xmm3[0,1,2],xmm1[0]
        vinsertps    $112, %xmm1, %xmm0, %xmm0  # xmm0 = xmm0[0,1,2],xmm1[1]
        vhaddps      %xmm2, %xmm1, %xmm1
        vhaddps      %xmm2, %xmm2, %xmm2
        vaddps       %xmm0, %xmm3, %xmm0
        vpermilps    $148, %xmm0, %xmm3         # xmm3 = xmm0[0,1,1,2]
        vinsertps    $200, %xmm0, %xmm1, %xmm0  # xmm0 = xmm0[3],xmm1[1,2],zero
        vinsertps    $112, %xmm2, %xmm0, %xmm0  # xmm0 = xmm0[0,1,2],xmm2[1]
        vinsertf128  $1, %xmm0, %ymm3, %ymm0
        retq
```
Which has the following characteristics according to llvm-mca:
```
Iterations:        100
Instructions:      1600
Total Cycles:      1007
Total uOps:        1700

Dispatch Width:    2
uOps Per Cycle:    1.69
IPC:               1.59
Block RThroughput: 8.5
```
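(The exact llvm-mca invocation is not given in the report; presumably it was along these lines, reusing the flags above:)

```
$ clang++ -g0 -O3 -march=btver2 -S -o - ymm-1undef-add_ps_002.cpp | llvm-mca -mcpu=btver2
```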
With some help from @RKSimon in understanding the llvm-mca output, I understand that the increased Block RThroughput is bad for hot loops, while the increase in total cycles is worse for straight-line code.
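Concretely, comparing the two llvm-mca runs above: Block RThroughput rises from 6.0 to 8.5 (about 1.4x worse for throughput-bound loops) and total uOps from 1200 to 1700, while total cycles for this straight-line block rise from 902 to 1007 (about 12% worse); the higher IPC only reflects that twice as many instructions are being executed.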
Could you take a look?
Looks like codegen or some other later pass previously recognized the pattern, while the SLP vectorizer did not. Actually, without the SLP vectorizer I'm getting just this:
```
        vperm2f128  $49, %ymm1, %ymm0, %ymm2    # ymm2 = ymm0[2,3],ymm1[2,3]
        vinsertf128 $1, %xmm1, %ymm0, %ymm0
        vhaddps     %ymm2, %ymm0, %ymm0
        retq
```
I assume SLP will be able to produce something similar (or even better) once we start supporting vectorization of non-power-of-2 vectors. Here we have a pattern that hits this exactly:
```cpp
return __builtin_shufflevector(r, a, 0, -1, 2, 3, 4, 5, 6, 7);
```
The -1 causes the optimizer to optimize out the a[2] + a[3] operation, and SLP does not recognize vectorization of the remaining 7 addition operations. This is the price we have to pay until non-power-of-2 vectorization lands. I will try to speed this up.
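To illustrate (my own variant, not from the report): with no -1 in the mask, all eight lane sums stay live and the body is a full 8-wide horizontal-add pattern that SLP can presumably vectorize; the undef lane leaves only 7 live additions, a non-power-of-2 bundle that SLP currently gives up on.

```cpp
#include <x86intrin.h>

// Hypothetical variant of add_ps_002 with no undef lane in the shuffle mask.
// All eight lane sums survive, so the additions form a power-of-2 bundle.
__attribute__((noinline)) __m256 add_ps_all(__m256 a, __m256 b) {
  __m256 r = (__m256){ a[0] + a[1], a[2] + a[3], a[4] + a[5], a[6] + a[7],
                       b[0] + b[1], b[2] + b[3], b[4] + b[5], b[6] + b[7] };
  return __builtin_shufflevector(r, a, 0, 1, 2, 3, 4, 5, 6, 7);
}
```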