This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Improve handling of compensate external uses cost.
ClosedPublic

Authored by ABataev on Apr 29 2021, 11:09 AM.

Details

Summary

External insertelement users can be represented as the result of a shuffle
of the vectorized element, including non-consecutive insertelements. Added
support for handling non-consecutive insertelements.
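
As a rough illustration (a hypothetical example, not taken from the patch or its tests), a scalar build sequence whose results land in non-consecutive lanes of a destination vector can now still be modelled as one vectorized operation plus a single shuffle into the destination, rather than per-lane extract/insert pairs:

// Hypothetical illustration only - not code from the patch or its tests.
// The two scalar sums are inserted into lanes 0 and 2 of the destination,
// i.e. non-consecutive insertelements.
#include <x86intrin.h>

__m128 build_nonconsecutive(__m128 dst, const float *p) {
  dst[0] = p[0] + p[4]; // lane 0
  dst[2] = p[1] + p[5]; // lane 2 - skips lane 1
  return dst;
}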

Diff Detail

Event Timeline

ABataev created this revision.Apr 29 2021, 11:09 AM
ABataev requested review of this revision.Apr 29 2021, 11:09 AM
Herald added a project: Restricted Project.Apr 29 2021, 11:09 AM
Matt added a subscriber: Matt.Apr 29 2021, 12:48 PM
RKSimon added inline comments.May 4 2021, 2:42 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
636

Explain the purpose of InsertUses in the doxygen

ABataev added inline comments.May 4 2021, 2:48 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
636

Will add, thanks!

ABataev updated this revision to Diff 343018.May 5 2021, 5:48 AM

Address comments.

RKSimon added inline comments.May 10 2021, 4:57 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4408

IsIdentity &= ?

llvm/test/Transforms/SLPVectorizer/X86/hsub.ll
174 ↗(On Diff #343018)

These regressions look like we need to do more in the shuffle costs to recognise when the shuffles don't cross subvector boundaries, either for illegal types like this or across 128-bit subvector boundaries on AVX.
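
For reference, a minimal sketch of the distinction being described (illustrative only, not from the patch): on AVX, an <8 x float> shuffle whose elements stay within their own 128-bit half lowers to cheap in-lane instructions, while one that moves elements across the 128-bit boundary needs vperm2f128-style instructions and arguably deserves a different cost.

#include <x86intrin.h>

// Illustrative only. The first mask keeps every element inside its own
// 128-bit half; the second moves elements across the 128-bit boundary.
__m256 in_lane(__m256 v) {
  return __builtin_shufflevector(v, v, 1, 0, 3, 2, 5, 4, 7, 6);
}
__m256 cross_lane(__m256 v) {
  return __builtin_shufflevector(v, v, 4, 5, 6, 7, 0, 1, 2, 3);
}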

ABataev added inline comments.May 10 2021, 5:00 AM
llvm/test/Transforms/SLPVectorizer/X86/hsub.ll
174 ↗(On Diff #343018)

Yes, we need to subtract the scalarization overhead for the insertelement instruction; I'm trying to handle it correctly in the patch that vectorizes InsertElement instructions. I'm going to abandon this patch once vectorization of InsertElements lands. Keeping it just in case.

ABataev updated this revision to Diff 345545.May 14 2021, 1:33 PM

Rework after handling of insertelements

ABataev edited the summary of this revision. (Show Details)May 14 2021, 1:34 PM
ABataev updated this revision to Diff 346224.May 18 2021, 11:29 AM
ABataev edited the summary of this revision. (Show Details)

Rebase + improved build vector detection.

RKSimon added inline comments.May 18 2021, 1:29 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2911

would this be better as a count_if ?

2916

Any good way to merge the SourceVectors set with the VectorOperands list?

ABataev added inline comments.May 18 2021, 1:41 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2911

We do not need a simple count here; we're filling the list of operands and counting the source vectors at the same time. I rather doubt count_if will help here.
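
(A generic sketch of what is meant, with made-up names rather than the actual SLPVectorizer code: the single loop both collects the scalar operands and counts the distinct source vectors, which a plain count_if over VL could not do in one traversal.)

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Sketch only; assumes every value in VL is an InsertElementInst.
static unsigned collectInsertOperands(ArrayRef<Value *> VL,
                                      SmallVectorImpl<Value *> &Operands) {
  SmallPtrSet<Value *, 4> SourceVectors;
  for (Value *V : VL) {
    auto *IE = cast<InsertElementInst>(V);
    SourceVectors.insert(IE->getOperand(0)); // distinct destination vectors
    Operands.push_back(IE->getOperand(1));   // the inserted scalars
  }
  return SourceVectors.size();
}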

2916

I thought about that too; I'll try to improve it somehow.

ABataev updated this revision to Diff 346422.May 19 2021, 5:53 AM

Rebase + address comments

A few minors

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2913

SmallVector seems unnecessary - why not just ValueList VectorOperands[NumOps] ? Even NumOps seems a bit too much.

3821

V is used only once

getInsertIndex(VL[I], 0)
3833

assert(Offset < UINT_MAX && "Failed to find vector index offset") ? Or should it be Offset < NumScalars ?

4387

auto

4417

Can this be replaced with an if (none_of(FirstUsers)) pattern? You might be able to merge AreFromSingleVector into the lambda as well, although that might get too unwieldy?

4987

assert(Offset < UINT_MAX && "Failed to find vector index offset") ? Or should it be Offset < NumScalars ?

ABataev updated this revision to Diff 346735.May 20 2021, 7:50 AM

Address Comments.

RKSimon added inline comments.May 20 2021, 8:27 AM
llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll
72 ↗(On Diff #346735)

Still not performing fptoui on the entire <4 x i32>?

RKSimon accepted this revision.May 21 2021, 1:18 AM

LGTM

llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll
72 ↗(On Diff #346735)

This is purely a cost-model issue - fptoui for 2f32 is 8 but 4f32 is 18 (it looks like the model assumes they scalarize, which they don't) - these costs are really wrong, but that shouldn't stop this patch.

This revision is now accepted and ready to land.May 21 2021, 1:18 AM
RKSimon added inline comments.May 21 2021, 4:18 AM
llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll
72 ↗(On Diff #346735)
ABataev added inline comments.May 21 2021, 4:45 AM
llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll
72 ↗(On Diff #346735)

OK, thanks, I'll check it. Sorry for the delay with the answers; I've been busy with other regressions.

This revision was automatically updated to reflect the committed changes.

We're seeing some test failures that bisected to this patch, possibly a miscompile. The test failure is in the unit test for this file: https://github.com/google/tink/blob/master/cc/subtle/aes_eax_aesni.cc. Are there already any known issues with this patch?

No, there are not. It would help if you could provide the reproducer and exact compile command to check if the problem exists.

I was unsuccessful in getting it to repro directly from the open source repo. However, I reduced it to this, which shows the issue:

$ cat repro.cc
#include <xmmintrin.h>

#include <cstdint>
#include <cstdio>
#include <cstring>

// https://github.com/google/tink/blob/a72c9d542cd1dd8b58b2620ab52585cf5544f212/cc/subtle/aes_eax_aesni.cc#L79
inline __m128i Add(__m128i x, uint64_t y) {
  // Convert to a vector of two uint64_t.
  uint64_t vec[2];
  _mm_storeu_si128(reinterpret_cast<__m128i *>(vec), x);
  // Perform the addition on the vector.
  vec[0] += y;
  if (y > vec[0]) {
    vec[1]++;
  }
  // Convert back to xmm.
  return _mm_loadu_si128(reinterpret_cast<__m128i *>(vec));
}

void print128(__m128i var) {
  uint64_t parts[2];
  memcpy(parts, &var, sizeof(parts));
  printf("%lu %lu\n", parts[0], parts[1]);
}

template <class T>
void DoNotOptimize(const T &var) {
  asm volatile("" : "+m"(const_cast<T &>(var)));
}

int main() {
  __m128i x = _mm_setzero_si128();
  DoNotOptimize(x);
  __m128i y = Add(x, 1);
  print128(x);
  print128(y);
}
$ clang++ repro.cc -o /tmp/miscompile -O2 -fno-slp-vectorize && /tmp/miscompile
0 0
1 0
$ clang++ repro.cc -o /tmp/miscompile -O2 && /tmp/miscompile
0 0
1 1

Prior to this patch, there was no difference between enabling and disabling -fslp-vectorize. The issue seems to be how this patch optimizes Add:

vec[0] += y;
if (y > vec[0]) {  // This effectively evaluates to true
  vec[1]++;
}
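
(For reference, a minimal standalone check of the intended scalar carry semantics for the reduced case x = 0, y = 1 - an illustration added here, not part of the original report.)

#include <cassert>
#include <cstdint>

int main() {
  uint64_t vec[2] = {0, 0}; // x = 0
  uint64_t y = 1;
  vec[0] += y;              // vec[0] == 1
  if (y > vec[0])           // 1 > 1 is false: no unsigned overflow
    vec[1]++;               // must not execute
  assert(vec[0] == 1 && vec[1] == 0); // expected output "1 0", not "1 1"
}
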
dyung added a subscriber: dyung.May 26 2021, 2:21 AM
This comment was removed by dyung.
dyung added a comment.May 26 2021, 2:31 AM

Hi, we are noticing a regression in the quality of the code generated by the compiler for btver2 after this change.

Consider the following code (ymm-1undef-add_ps_002.cpp):

#include <x86intrin.h>

__attribute__((noinline))
__m256 add_ps_002(__m256 a, __m256 b) {
  __m256 r = (__m256){ a[0] + a[1], a[2] + a[3], a[4] + a[5], a[6] + a[7],
                       b[0] + b[1], b[2] + b[3], b[4] + b[5], b[6] + b[7] };
  return __builtin_shufflevector(r, a, 0, -1, 2, 3, 4, 5, 6, 7);
}

Prior to this change, when compiled with "-g0 -O3 -march=btver2" the compiler would generate the following assembly:

# %bb.0:                                # %entry                                                                                         
        vhaddps %xmm0, %xmm0, %xmm2                                                                                                      
        vextractf128    $1, %ymm0, %xmm0                                                                                                 
        vhaddps %xmm0, %xmm1, %xmm3                                                                                                      
        vinsertf128     $1, %xmm3, %ymm0, %ymm3                                                                                          
        vhaddps %ymm0, %ymm1, %ymm0                                                                                                      
        vblendps        $3, %ymm2, %ymm3, %ymm2         # ymm2 = ymm2[0,1],ymm3[2,3,4,5,6,7]
        vshufpd $2, %ymm0, %ymm2, %ymm0         # ymm0 = ymm2[0],ymm0[1],ymm2[2],ymm0[2]
        retq

With the following characteristics according to llvm-mca:

Iterations:        100
Instructions:      800
Total Cycles:      902
Total uOps:        1200

Dispatch Width:    2
uOps Per Cycle:    1.33
IPC:               0.89
Block RThroughput: 6.0

But after this change, the compiler is now producing the following assembly for the same code:

# %bb.0:                                # %entry
        vextractf128    $1, %ymm0, %xmm2
        vmovlhps        %xmm2, %xmm0, %xmm3             # xmm3 = xmm0[0],xmm2[0]                                                         
        vshufps $17, %xmm2, %xmm0, %xmm0        # xmm0 = xmm0[1,0],xmm2[1,0]                                                             
        vshufps $232, %xmm2, %xmm3, %xmm3       # xmm3 = xmm3[0,2],xmm2[2,3]                                                             
        vshufps $248, %xmm2, %xmm0, %xmm0       # xmm0 = xmm0[0,2],xmm2[3,3]                                                             
        vextractf128    $1, %ymm1, %xmm2                            
        vinsertps       $48, %xmm1, %xmm3, %xmm3 # xmm3 = xmm3[0,1,2],xmm1[0]                                                            
        vinsertps       $112, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[1]                                                           
        vhaddps %xmm2, %xmm1, %xmm1                                                                                                      
        vhaddps %xmm2, %xmm2, %xmm2                                                                                                      
        vaddps  %xmm0, %xmm3, %xmm0                                 
        vpermilps       $148, %xmm0, %xmm3      # xmm3 = xmm0[0,1,1,2]                                                                   
        vinsertps       $200, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[3],xmm1[1,2],zero                                                        
        vinsertps       $112, %xmm2, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm2[1]                                                           
        vinsertf128     $1, %xmm0, %ymm3, %ymm0                                                                                          
        retq

Which has the following characteristics according to llvm-mca:

Iterations:        100                                                                                                                   
Instructions:      1600                                                                                                                  
Total Cycles:      1007                                                                                                                  
Total uOps:        1700         

Dispatch Width:    2
uOps Per Cycle:    1.69
IPC:               1.59
Block RThroughput: 8.5

With some help from @RKSimon in understanding the llvm-mca output, I understand that the increased RThroughput number is bad for hot loops, while the increase in total cycles is worse for straight-line code.

Could you take a look?

Thanks for the reports; I will investigate them all and fix them ASAP.

Here is a fix for the reported miscompile: D103164

Regarding the btver2 regression: it looks like codegen or some other later passes previously recognized the pattern, while the SLP vectorizer did not. Actually, without the SLP vectorizer I'm getting just this:

vperm2f128      $49, %ymm1, %ymm0, %ymm2 # ymm2 = ymm0[2,3],ymm1[2,3]
vinsertf128     $1, %xmm1, %ymm0, %ymm0
vhaddps %ymm2, %ymm0, %ymm0
retq

I assume SLP will be able to produce something similar (or even better) once we start supporting vectorization of non-power-of-2 vectors. Here we have a pattern that hits this exactly:

return __builtin_shufflevector(r, a, 0, -1, 2, 3, 4, 5, 6, 7);

The -1 causes the optimizer to optimize out the a[2] + a[3] operation, and SLP does not recognize vectorization of the remaining 7 addition operations. This is the price we have to pay until non-power-of-2 vectorization lands. I will try to speed this up.
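
For reference, the lane mapping of that shufflevector (mask index -1 = undef), which shows why only 7 of the 8 additions survive:

out[0]    = r[0] = a[0] + a[1]
out[1]    = undef              (the -1 in the mask; r[1] = a[2] + a[3] is dropped)
out[2]    = r[2] = a[4] + a[5]
out[3]    = r[3] = a[6] + a[7]
out[4..7] = r[4..7] = b[0] + b[1], b[2] + b[3], b[4] + b[5], b[6] + b[7]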