Page MenuHomePhabricator

[SLP] Initial support for the vectorization of the non-power-of-2 vectors.
Needs ReviewPublic

Authored by ABataev on Jan 22 2019, 8:30 AM.

Details

Summary

Possibly vectorized operations are extended to the power-of-2 number with UndefValues to allow to use regular vector operations.

For SPEC CPU2017 it gives ~7% perf gain for 526.blender_r (AVX512,
O3+LTOi, -march=native), ~2% gain for 538.imagick_r and 638.imagick_s,
~2% gain for 525.x264_r and 625.x264_s, ~2% gain for 526.blender_r (AVX2
, O3+LTO, -march=native), ~11% gain 526.blender_r, ~3% gain for
544.nab_r and 644.nab_s (AVX512, O3+LTO), ~3% gain
for 526.blender_r, ~2% gain for 544.nab_r and 644.nab_s (AVX2, O3+LTO).

Compile and link time are pretty the same:

AVX512, O3+LTO, -march=native
Metric: compile_time
Geomean difference -0.1% (-1.85 sec)

Metric: link_time
Geomean difference +1.2% (+14.46 sec)

AVX512, O3+LTO
Metric: compile_time
Geomean difference -0.2% (-4.71 sec)

Metric: link_time
Geomean difference -3.6% (-54.53 sec)

AVX2, O3+LTO, -march=native
Metric: compile_time
Geomean difference +0.3% (+10.56 sec)

Metric: link_time
Geomean difference -0.1% (-2.18 sec)

AVX2, O3+LTO
Metric: compile_time
Geomean difference +0.2% (+5.73 sec)

Metric: link_time
Geomean difference -3.4% (-67.45 sec)

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ABataev added inline comments.Nov 20 2020, 11:01 AM
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
21–23

Yes, looks like the issue with the cost of @llvm.masked.gather for masked gather with some undefs in the mask

craig.topper added inline comments.Nov 20 2020, 11:07 AM
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
21–23

Gather is slow on CPUs prior to AVX512. And its cost is proportional to the number of elements. I don't think the value of the mask should be a factor.

ABataev added inline comments.Nov 20 2020, 11:18 AM
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
21–23

True, but in some cases it can be optimized into gather + shuffle instead of wide gather, if there are undefs in mask.

ABataev updated this revision to Diff 307098.Nov 23 2020, 9:09 AM

Rebase + improve handling of masked gathers.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2896

Could it be actually no one Instruction in UseEntry or should it be assert?

2900

"Lane 0" seems outdated here, but not sure about better description.

3080

assert(NumberOfInstructions != 0 && "...") and if (NumberOfInstructions == 1)?

3899

Comment typo: aggrgate.

5644–5645

emplace_back()

7069

typo: "indeces"

ABataev updated this revision to Diff 307204.Nov 23 2020, 2:45 PM

Fixed according to comments

ABataev updated this revision to Diff 307205.Nov 23 2020, 3:02 PM

Fixed function name.

Btw, I've observed significant compile-time regression with this patch: http://llvm-compile-time-tracker.com/compare.php?from=99d82412f822190a6caa3e3a5b9f87b71f56de47&to=81b636bae72c967f526bcd18de45a6f4a76daa41&stat=instructions (thanks to @nikic for awesome service). This could be justified in case of comparable performance improvements but have you done any benchmarking?

dtemirbulatov added a comment.EditedDec 1 2020, 3:37 PM

Btw, I've observed significant compile-time regression with this patch: http://llvm-compile-time-tracker.com/compare.php?from=99d82412f822190a6caa3e3a5b9f87b71f56de47&to=81b636bae72c967f526bcd18de45a6f4a76daa41&stat=instructions (thanks to @nikic for awesome service). This could be justified in case of comparable performance improvements but have you done any benchmarking?

I have done a while back with SPECINT 2006 and as I remember results were good, but I am not sure that I could find those now. Yes, for me, having this new functionality with presented compile-time regression looks ok.

I dont think (geomean) 0.20% is significant compile time problem. TBH, I expected bigger CT regressions - up to 0.5% is fine IMHO.

AFAICT the only outstanding question is whether the compile time increase is acceptable?

AFAICT the only outstanding question is whether the compile time increase is acceptable?

I'd agree that geomean = 0.2% is acceptable for the change with such awesome performance impact, just noted that changed time compilation is significant in comparision with other changes. Generally it looks good to me apart from one minor unaddressed comment.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3602

Well, "unsortable" or "unprocessable" term would be more precise. But why did we change if (sortPtrAccesses)... to opposite condition? This change just duplicate debug output, since we didn't differentiate it. Also I'd prefer to see the same if-else structure as for the load case.

nikic added a comment.Dec 9 2020, 5:19 AM

AFAICT the only outstanding question is whether the compile time increase is acceptable?

I'd agree that geomean = 0.2% is acceptable for the change with such awesome performance impact, just noted that changed time compilation is significant in comparision with other changes. Generally it looks good to me apart from one minor unaddressed comment.

Could the summary of the revision be updated with the performance data? The thread is very long and I didn't spot where the measurements are, so it's hard to say what we're trading off here...

Generally though this level of geomean regression is fine, the issue is usually in the outliers. For example, I spot this this 4-5% regression:

CMakeFiles/lencod.dir/transform8x8.c.o 	3053M 	3188M (+4.41%)

It may be worthwhile to briefly check it in case something can be improved.

AFAICT the only outstanding question is whether the compile time increase is acceptable?

I'd agree that geomean = 0.2% is acceptable for the change with such awesome performance impact, just noted that changed time compilation is significant in comparision with other changes. Generally it looks good to me apart from one minor unaddressed comment.

Could the summary of the revision be updated with the performance data? The thread is very long and I didn't spot where the measurements are, so it's hard to say what we're trading off here...

Will try to run the benchmarks and get fresh data.

Generally though this level of geomean regression is fine, the issue is usually in the outliers. For example, I spot this this 4-5% regression:

CMakeFiles/lencod.dir/transform8x8.c.o 	3053M 	3188M (+4.41%)

It may be worthwhile to briefly check it in case something can be improved.

I'll check what can be improved in terms of compile time. Not sure that will be able to improve it significantly since the patch itself does not add extensive analysis/transformations, just adds an extra 1 iteration for wider vector analysis. But I'll check it anyway and will try to improve things where possible.

While reviewing the latest update, I think I spotted SLP compile-time failure in SingleSource/Benchmarks/Misc/oourafft.c, here is the reduced testcase to reporduce:
source_filename = "/home/dtemirbulatov/llvm/test-suite/SingleSource/Benchmarks/Misc/oourafft.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local fastcc void @cft1st(double* %a) unnamed_addr #0 {
entry:

%0 = or i64 16, 2
%arrayidx107 = getelementptr inbounds double, double* %a, i64 %0
%1 = or i64 16, 3
%arrayidx114 = getelementptr inbounds double, double* %a, i64 %1
%2 = or i64 16, 4
%arrayidx131 = getelementptr inbounds double, double* %a, i64 %2
%3 = or i64 16, 6
%arrayidx134 = getelementptr inbounds double, double* %a, i64 %3
%4 = load double, double* %arrayidx134, align 8
%5 = or i64 16, 5
%arrayidx138 = getelementptr inbounds double, double* %a, i64 %5
%6 = or i64 16, 7
%arrayidx141 = getelementptr inbounds double, double* %a, i64 %6
%7 = load double, double* %arrayidx141, align 8
%sub149 = fsub double undef, %4
%sub156 = fsub double undef, %7
store double undef, double* %arrayidx131, align 8
store double undef, double* %arrayidx138, align 8
%sub178 = fsub double undef, %sub156
%add179 = fadd double undef, %sub149
%mul180 = fmul double undef, %sub178
%sub182 = fsub double %mul180, undef
store double %sub182, double* %arrayidx107, align 8
%mul186 = fmul double undef, %add179
%add188 = fadd double %mul186, undef
store double %add188, double* %arrayidx114, align 8
unreachable

}

attributes #0 = { "target-features"="+avx,+avx2,+bmi,+bmi2,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+invpcid,+lzcnt,+mmx,+movbe,+pclmul,+popcnt,+rdrnd,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsaveopt" }

!llvm.ident = !{!0}

!0 = !{!"clang version 12.0.0 (https://github.com/llvm/llvm-project.git aaa925795f93c389a96ee01bab73bc2b6b771cbb)"}

While reviewing the latest update, I think I spotted SLP compile-time failure in SingleSource/Benchmarks/Misc/oourafft.c, here is the reduced testcase to reporduce:
source_filename = "/home/dtemirbulatov/llvm/test-suite/SingleSource/Benchmarks/Misc/oourafft.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local fastcc void @cft1st(double* %a) unnamed_addr #0 {
entry:

%0 = or i64 16, 2
%arrayidx107 = getelementptr inbounds double, double* %a, i64 %0
%1 = or i64 16, 3
%arrayidx114 = getelementptr inbounds double, double* %a, i64 %1
%2 = or i64 16, 4
%arrayidx131 = getelementptr inbounds double, double* %a, i64 %2
%3 = or i64 16, 6
%arrayidx134 = getelementptr inbounds double, double* %a, i64 %3
%4 = load double, double* %arrayidx134, align 8
%5 = or i64 16, 5
%arrayidx138 = getelementptr inbounds double, double* %a, i64 %5
%6 = or i64 16, 7
%arrayidx141 = getelementptr inbounds double, double* %a, i64 %6
%7 = load double, double* %arrayidx141, align 8
%sub149 = fsub double undef, %4
%sub156 = fsub double undef, %7
store double undef, double* %arrayidx131, align 8
store double undef, double* %arrayidx138, align 8
%sub178 = fsub double undef, %sub156
%add179 = fadd double undef, %sub149
%mul180 = fmul double undef, %sub178
%sub182 = fsub double %mul180, undef
store double %sub182, double* %arrayidx107, align 8
%mul186 = fmul double undef, %add179
%add188 = fadd double %mul186, undef
store double %add188, double* %arrayidx114, align 8
unreachable

}

attributes #0 = { "target-features"="+avx,+avx2,+bmi,+bmi2,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+invpcid,+lzcnt,+mmx,+movbe,+pclmul,+popcnt,+rdrnd,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsaveopt" }

!llvm.ident = !{!0}

!0 = !{!"clang version 12.0.0 (https://github.com/llvm/llvm-project.git aaa925795f93c389a96ee01bab73bc2b6b771cbb)"}

Do you mean compile time increasing? With this patch?

Do you mean compile time increasing? With this patch?

no, just compile-time error.

Do you mean compile time increasing? With this patch?

no, just compile-time error.

Crash or incorrect code?

Do you mean compile time increasing? With this patch?

no, just compile-time error.

Crash or incorrect code?

Crash.

wxiao3 added a subscriber: wxiao3.Dec 14 2020, 7:09 AM
ABataev edited the summary of this revision. (Show Details)Feb 10 2021, 6:01 AM

Extra numbers:

AVX512, O3+LTO, -march=native
Metric: SLP.NumVectorInstructions

Program                                                                         lhs      rhs      diff
  test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test   146.00   148.00   1.4%
 test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test   146.00   148.00   1.4%
 test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test    34.00    34.00   0.0%
             test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test    11.00    11.00   0.0%
  test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test    34.00    34.00   0.0%
              test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test    11.00    11.00   0.0%
  test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test  5587.00  5560.00  -0.5%
 test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test  5587.00  5560.00  -0.5%
             test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test  7384.00  7341.00  -0.6%
         test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  9607.00  9359.00  -2.6%
         test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  5340.00  5178.00  -3.0%
            test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  1053.00  1006.00  -4.5%
           test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  1053.00  1006.00  -4.5%
          test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test   141.00   134.00  -5.0%
         test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test   141.00   134.00  -5.0%
      test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  3996.00  3563.00 -10.8%
       test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  3996.00  3563.00 -10.8%
              test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   862.00   767.00 -11.0%
             test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   862.00   767.00 -11.0%
      test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   524.00   463.00 -11.6%
     test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test   524.00   463.00 -11.6%
              test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   426.00   370.00 -13.1%
               test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   426.00   370.00 -13.1%
       test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 15945.00 12573.00 -21.1%
               test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test      NaN    16.00   nan%
              test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test      NaN    16.00   nan%
                                                             Geomean difference                     nan%

AVX512, O3+LTO
Metric: SLP.NumVectorInstructions

Program                                                                         lhs      rhs      diff
         test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test    22.00    60.00 172.7%
          test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test    22.00    60.00 172.7%
 test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test    68.00    72.00   5.9%
  test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test    68.00    72.00   5.9%
              test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test    11.00    11.00   0.0%
 test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test    10.00    10.00   0.0%
 test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test  3396.00  3396.00   0.0%
             test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test    11.00    11.00   0.0%
  test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test    10.00    10.00   0.0%
  test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test  3396.00  3396.00   0.0%
     test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test   499.00   497.00  -0.4%
      test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   499.00   497.00  -0.4%
            test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   838.00   826.00  -1.4%
           test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   838.00   826.00  -1.4%
         test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  6090.00  5906.00  -3.0%
              test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test   131.00   127.00  -3.1%
               test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test   131.00   127.00  -3.1%
             test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test  8815.00  8452.00  -4.1%
      test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  2864.00  2712.00  -5.3%
       test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  2864.00  2712.00  -5.3%
       test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 16049.00 14753.00  -8.1%
              test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   686.00   621.00  -9.5%
             test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   686.00   621.00  -9.5%
              test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   551.00   473.00 -14.2%
               test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   551.00   473.00 -14.2%
         test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 16240.00 13941.00 -14.2%
                                                             Geomean difference                     4.3%

AVX2, O3+LTO, -march=native
Metric: SLP.NumVectorInstructions

Program                                                                         lhs      rhs      diff
             test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test  7309.00  7341.00   0.4%
              test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test    11.00    11.00   0.0%
 test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test    34.00    34.00   0.0%
 test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test  5490.00  5490.00   0.0%
             test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test    11.00    11.00   0.0%
  test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test    34.00    34.00   0.0%
  test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test  5490.00  5490.00   0.0%
     test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test   462.00   455.00  -1.5%
      test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   462.00   455.00  -1.5%
         test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  9508.00  9347.00  -1.7%
         test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  5393.00  5190.00  -3.8%
            test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  1066.00   968.00  -9.2%
           test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  1066.00   968.00  -9.2%
          test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test   151.00   134.00 -11.3%
         test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test   151.00   134.00 -11.3%
  test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test   160.00   141.00 -11.9%
 test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test   160.00   141.00 -11.9%
             test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   820.00   722.00 -12.0%
              test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   820.00   722.00 -12.0%
       test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  3605.00  3173.00 -12.0%
      test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  3605.00  3173.00 -12.0%
               test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   438.00   370.00 -15.5%
              test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   438.00   370.00 -15.5%
       test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 14842.00 12463.00 -16.0%
              test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test   106.00    79.00 -25.5%
               test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test   106.00    79.00 -25.5%
                                                             Geomean difference                    -8.7%

AVX2, O3+LTO
Metric: SLP.NumVectorInstructions

Program                                                                         lhs      rhs      diff
         test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test    22.00    60.00 172.7%
          test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test    22.00    60.00 172.7%
 test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test    68.00    72.00   5.9%
  test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test    68.00    72.00   5.9%
              test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test    11.00    11.00   0.0%
 test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test    10.00    10.00   0.0%
 test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test  3396.00  3396.00   0.0%
             test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test    11.00    11.00   0.0%
  test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test    10.00    10.00   0.0%
  test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test  3396.00  3396.00   0.0%
     test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test   499.00   497.00  -0.4%
      test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   499.00   497.00  -0.4%
            test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   838.00   826.00  -1.4%
           test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   838.00   826.00  -1.4%
              test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test   131.00   127.00  -3.1%
               test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test   131.00   127.00  -3.1%
         test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  6094.00  5906.00  -3.1%
             test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test  8734.00  8452.00  -3.2%
      test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  2867.00  2712.00  -5.4%
       test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  2867.00  2712.00  -5.4%
       test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 16026.00 14753.00  -7.9%
              test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   686.00   621.00  -9.5%
             test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   686.00   621.00  -9.5%
         test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 16241.00 13941.00 -14.2%
              test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   559.00   473.00 -15.4%
               test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   559.00   473.00 -15.4%
                                                             Geomean difference                     4.2%

Will update the patch soon.

ABataev updated this revision to Diff 322816.Feb 10 2021, 1:43 PM
ABataev edited the summary of this revision. (Show Details)

Rework, bug fixes, rebase

This is an integral patch, going to split it into several smaller patches.

Btw, how could it be explained NumVectorInstructions stat reducing after this patch?

Extra numbers:

AVX512, O3+LTO, -march=native
Metric: SLP.NumVectorInstructions

...
      test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  3996.00  3563.00 -10.8%
       test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  3996.00  3563.00 -10.8%
              test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   862.00   767.00 -11.0%
             test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   862.00   767.00 -11.0%
      test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   524.00   463.00 -11.6%
...

This is an integral patch, going to split it into several smaller patches.

Are you planning to send for review all of this patches or just to commit them after this integral review? I.e. should we go on with review here?

Btw, how could it be explained NumVectorInstructions stat reducing after this patch?

Extra numbers:

AVX512, O3+LTO, -march=native
Metric: SLP.NumVectorInstructions

...
      test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  3996.00  3563.00 -10.8%
       test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  3996.00  3563.00 -10.8%
              test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   862.00   767.00 -11.0%
             test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   862.00   767.00 -11.0%
      test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   524.00   463.00 -11.6%
...

Actually, it is not reducing. This is how test-suite python script works. So, here lhs - number of instructions after this patch, rhs - before. And the less relative number, the more vector instructions we actually generate.

This is an integral patch, going to split it into several smaller patches.

Are you planning to send for review all of this patches or just to commit them after this integral review? I.e. should we go on with review here?

You can publish your comments here, no need to wait for small patches.

Actually, it is not reducing. This is how test-suite python script works. So, here lhs - number of instructions after this patch, rhs - before. And the less relative number, the more vector instructions we actually generate.

Oh, I see. There are still several reducing cases though.

Actually, it is not reducing. This is how test-suite python script works. So, here lhs - number of instructions after this patch, rhs - before. And the less relative number, the more vector instructions we actually generate.

Oh, I see. There are still several reducing cases though.

Actually, no. I compared the resulting IR files for these cases - they are absolutely the same as before. It is just we generating fewer ExtractElement/InsertElement instructions and directly emit Shuffle instructions in some cases. That's why it seems it generates fewer vector instructions though it is not.

ABataev updated this revision to Diff 331873.Mar 19 2021, 7:41 AM

Removed logical reductions conversion code.

What is the status of this patch? any blockers? or just lack of reviewers?

What is the status of this patch? any blockers? or just lack of reviewers?

The patch is just too big, I'm splitting it into smaller chunks and commit them step by step. The final chunk would be largest one. But before need to commit other improvements that can be separated from the patch.

Hi Alexey! Is this patch ready for reviewing or will other patches be splitted from this one?

Hi Alexey! Is this patch ready for reviewing or will other patches be splitted from this one?

Just like I said before, it must be split into several smaller patches. I'm doing this step by step, there is D101109, which is part of this big patch. I want to commit it, then rebase this patch and split it again, there are some other parts that can be committed independently.

Is it worth rebasing this to show the remaining diffs that still need to go in?

Is it worth rebasing this to show the remaining diffs that still need to go in?

There were not many commits for it, need to commit some extra patches to fix some regressions, but any way I've started rebasing. We'll try to rebase it next week