This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Initial support for the vectorization of the non-power-of-2 vectors.
Needs ReviewPublic

Authored by ABataev on Jan 22 2019, 8:30 AM.

Details

Summary

Possibly vectorized operations are extended to the power-of-2 number with UndefValues to allow to use regular vector operations.

For SPEC CPU2017 it gives ~7% perf gain for 526.blender_r (AVX512,
O3+LTOi, -march=native), ~2% gain for 538.imagick_r and 638.imagick_s,
~2% gain for 525.x264_r and 625.x264_s, ~2% gain for 526.blender_r (AVX2
, O3+LTO, -march=native), ~11% gain 526.blender_r, ~3% gain for
544.nab_r and 644.nab_s (AVX512, O3+LTO), ~3% gain
for 526.blender_r, ~2% gain for 544.nab_r and 644.nab_s (AVX2, O3+LTO).

Compile and link time are pretty the same:

AVX512, O3+LTO, -march=native
Metric: compile_time
Geomean difference -0.1% (-1.85 sec)

Metric: link_time
Geomean difference +1.2% (+14.46 sec)

AVX512, O3+LTO
Metric: compile_time
Geomean difference -0.2% (-4.71 sec)

Metric: link_time
Geomean difference -3.6% (-54.53 sec)

AVX2, O3+LTO, -march=native
Metric: compile_time
Geomean difference +0.3% (+10.56 sec)

Metric: link_time
Geomean difference -0.1% (-2.18 sec)

AVX2, O3+LTO
Metric: compile_time
Geomean difference +0.2% (+5.73 sec)

Metric: link_time
Geomean difference -3.4% (-67.45 sec)

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ABataev marked an inline comment as done.Sep 22 2020, 8:57 AM

Rebase

some very minor style comments - a general comment would be to try and pre-commit the style/NFC refactor/cleanup changes so the size of this patch is smaller

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2745

Do these trivial style refactors separately now to reduce the size of the patch?

2761

Do these trivial style refactors separately now to reduce the size of the patch?

3574

duplicate cast

3603

duplicate cast

3617

duplicate cast

3671

duplicate cast

4244–4249

trivial style refactor - pull out of patch?

RKSimon added inline comments.Sep 23 2020, 4:24 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
198–199

Are we going to have a problem if VL[0] is UndefValue?

ABataev added inline comments.Sep 23 2020, 4:30 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
198–199

Yeah, will fix it.

ABataev updated this revision to Diff 293701.Sep 23 2020, 5:30 AM

Rebase + fix

RKSimon added inline comments.Sep 23 2020, 5:52 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16

These "feel" like regressions to me - any idea whats going on?

ABataev added inline comments.Sep 23 2020, 5:55 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16

The cost model problem, if I recall it correctly. I investigated it before and found out that the cost model for AArch64 is not defined for long vectors in some cases and we fall back to the generic cost model evaluation which is not quite correct in many cases. Need to tweak the cost model for AArch64.

RKSimon added inline comments.Sep 23 2020, 6:15 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16

Any instruction cost type (extract/shuffle/store?) in particular that needs better costs? It'd be good to at least raise a specific bug report to the aarch64 team

ABataev added inline comments.Sep 23 2020, 6:31 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16

Do not remember already, need some time to investigate it again. Hope to do it by the end of this week.
PS. There was a question about this test already.

spatel added inline comments.Sep 23 2020, 7:18 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7432

Is it necessary to copy these? If so, it would be better to name this function something like "getCopyOfExtraArgValues" to make that explicit.

If not, we can just make this a standard 'get' method:

const MapVector<Instruction *, Value *> &getExtraArgs() const {
  return ExtraArgs;
}

And then access the 'second' data in the user code?

ABataev added inline comments.Sep 24 2020, 6:26 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7432

We don't need to expose the first element of the MapVector here, it is not good from the general design point of view. I'll rename the member function.

ABataev updated this revision to Diff 294040.Sep 24 2020, 6:33 AM

Rebase + rename

ABataev added inline comments.Sep 24 2020, 7:19 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16

Found the reason. It is the cost of shuffle of TTI::SK_PermuteSingleSrc kind. Before this patch, the test operated with the vector <2 x i16>, which is transformed to llvm::MVT::v2i32 by type legalization function and the cost of this shuffle is tweaked to be 1 (see llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp, AArch64TTIImpl::getShuffleCost). The cost of this operation is 1, per table.

With this patch, the original vector type is <4 x i16> which is transformed to llvm::MVT::v4i16 and there is no optimized value for TTI::SK_PermuteSingleSrc in the table for this type and the function falls back to the pessimistic cost model and returns 18.

There are several TODOs int the file already about fixing the cost model for different shuffle operations.

Does anyone have any more comments?

spatel added inline comments.Sep 30 2020, 6:23 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3172–3174

Use isValidElementType() or check for undef directly?
I still can't tell from the debug statement exactly what we are guarding against. Should the type check already be here even without this patch?

3839

Are we always creating a masked load for a vector with 2 elements? This logic needs a code comment to explain the cases.

4717

Please add code comment/example to explain what the difference is between these 2 clauses.

4743

Is Passthrough a full vector of undef elements? If so, it should be created/named that way (or directly in the call to CreateMaskedLoad()) rather than in the loop.

4815–4816

Similar to above (so can we add a helper function to avoid duplicating the code?):
Please add code comment/example to explain what the difference is between these 2 clauses.

reverse ping

reverse ping

Will update the patch as soon as I'm back to work, in 2-3 weeks.

reverse ping?

reverse ping?

Need some time to setup my dev environment, will update ASAP

ABataev added inline comments.Nov 9 2020, 10:37 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3172–3174

I was just trying to protect the code and try to support it only for simple types at first. There are some doubts that the cost for masked loads/stores is completed and I protected it to make it work only for simple types. I can remove this check if the cost model for masked ops is good enough.

3839

No, no need to do it for 2 elements, removed it.

4717

Fixed it, thanks.

4743

Fixed

4815–4816

Fixed, thanks!

RKSimon added inline comments.Nov 9 2020, 11:14 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3172–3174

masked load/store costs for constant masks should be good enough now (getScalarizationOverhead should now provide us with a reasonable fallback)

ABataev updated this revision to Diff 304196.Nov 10 2020, 8:11 AM

Rebase, updates and fixes

All of my comments were addressed, so LGTM. But please wait for an official 'accept' from at least 1 other reviewer.

RKSimon added inline comments.Nov 12 2020, 9:35 AM
llvm/test/Transforms/SLPVectorizer/X86/insert-element-build-vector.ll
280

There isn't a ANY check-prefix atm (it was cleaned out in rG119e4550ddedc75e4 as part of the unused prefix cleanup) - please can you review?

ABataev added inline comments.Nov 12 2020, 9:39 AM
llvm/test/Transforms/SLPVectorizer/X86/insert-element-build-vector.ll
280

Yes, need to remove it, I think. Most probably, caused but not quite clean merge.

ABataev updated this revision to Diff 304968.Nov 12 2020, 1:49 PM

Rebase, test cleanup + small code improvements.

xbolva00 added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
26 ↗(On Diff #306735)

Regression on avx?

ABataev added inline comments.Nov 20 2020, 11:01 AM
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
26 ↗(On Diff #306735)

Yes, looks like the issue with the cost of @llvm.masked.gather for masked gather with some undefs in the mask

craig.topper added inline comments.Nov 20 2020, 11:07 AM
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
26 ↗(On Diff #306735)

Gather is slow on CPUs prior to AVX512. And its cost is proportional to the number of elements. I don't think the value of the mask should be a factor.

ABataev added inline comments.Nov 20 2020, 11:18 AM
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
26 ↗(On Diff #306735)

True, but in some cases it can be optimized into gather + shuffle instead of wide gather, if there are undefs in mask.

ABataev updated this revision to Diff 307098.Nov 23 2020, 9:09 AM

Rebase + improve handling of masked gathers.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2555

Could it be actually no one Instruction in UseEntry or should it be assert?

2559

"Lane 0" seems outdated here, but not sure about better description.

2705

assert(NumberOfInstructions != 0 && "...") and if (NumberOfInstructions == 1)?

3460–3463

Comment typo: aggrgate.

4727–4728

emplace_back()

6159

typo: "indeces"

ABataev updated this revision to Diff 307204.Nov 23 2020, 2:45 PM

Fixed according to comments

ABataev updated this revision to Diff 307205.Nov 23 2020, 3:02 PM

Fixed function name.

Btw, I've observed significant compile-time regression with this patch: http://llvm-compile-time-tracker.com/compare.php?from=99d82412f822190a6caa3e3a5b9f87b71f56de47&to=81b636bae72c967f526bcd18de45a6f4a76daa41&stat=instructions (thanks to @nikic for awesome service). This could be justified in case of comparable performance improvements but have you done any benchmarking?

dtemirbulatov added a comment.EditedDec 1 2020, 3:37 PM

Btw, I've observed significant compile-time regression with this patch: http://llvm-compile-time-tracker.com/compare.php?from=99d82412f822190a6caa3e3a5b9f87b71f56de47&to=81b636bae72c967f526bcd18de45a6f4a76daa41&stat=instructions (thanks to @nikic for awesome service). This could be justified in case of comparable performance improvements but have you done any benchmarking?

I have done a while back with SPECINT 2006 and as I remember results were good, but I am not sure that I could find those now. Yes, for me, having this new functionality with presented compile-time regression looks ok.

I dont think (geomean) 0.20% is significant compile time problem. TBH, I expected bigger CT regressions - up to 0.5% is fine IMHO.

AFAICT the only outstanding question is whether the compile time increase is acceptable?

AFAICT the only outstanding question is whether the compile time increase is acceptable?

I'd agree that geomean = 0.2% is acceptable for the change with such awesome performance impact, just noted that changed time compilation is significant in comparision with other changes. Generally it looks good to me apart from one minor unaddressed comment.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3178

Well, "unsortable" or "unprocessable" term would be more precise. But why did we change if (sortPtrAccesses)... to opposite condition? This change just duplicate debug output, since we didn't differentiate it. Also I'd prefer to see the same if-else structure as for the load case.

nikic added a comment.Dec 9 2020, 5:19 AM

AFAICT the only outstanding question is whether the compile time increase is acceptable?

I'd agree that geomean = 0.2% is acceptable for the change with such awesome performance impact, just noted that changed time compilation is significant in comparision with other changes. Generally it looks good to me apart from one minor unaddressed comment.

Could the summary of the revision be updated with the performance data? The thread is very long and I didn't spot where the measurements are, so it's hard to say what we're trading off here...

Generally though this level of geomean regression is fine, the issue is usually in the outliers. For example, I spot this this 4-5% regression:

CMakeFiles/lencod.dir/transform8x8.c.o 	3053M 	3188M (+4.41%)

It may be worthwhile to briefly check it in case something can be improved.

AFAICT the only outstanding question is whether the compile time increase is acceptable?

I'd agree that geomean = 0.2% is acceptable for the change with such awesome performance impact, just noted that changed time compilation is significant in comparision with other changes. Generally it looks good to me apart from one minor unaddressed comment.

Could the summary of the revision be updated with the performance data? The thread is very long and I didn't spot where the measurements are, so it's hard to say what we're trading off here...

Will try to run the benchmarks and get fresh data.

Generally though this level of geomean regression is fine, the issue is usually in the outliers. For example, I spot this this 4-5% regression:

CMakeFiles/lencod.dir/transform8x8.c.o 	3053M 	3188M (+4.41%)

It may be worthwhile to briefly check it in case something can be improved.

I'll check what can be improved in terms of compile time. Not sure that will be able to improve it significantly since the patch itself does not add extensive analysis/transformations, just adds an extra 1 iteration for wider vector analysis. But I'll check it anyway and will try to improve things where possible.

While reviewing the latest update, I think I spotted SLP compile-time failure in SingleSource/Benchmarks/Misc/oourafft.c, here is the reduced testcase to reporduce:
source_filename = "/home/dtemirbulatov/llvm/test-suite/SingleSource/Benchmarks/Misc/oourafft.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local fastcc void @cft1st(double* %a) unnamed_addr #0 {
entry:

%0 = or i64 16, 2
%arrayidx107 = getelementptr inbounds double, double* %a, i64 %0
%1 = or i64 16, 3
%arrayidx114 = getelementptr inbounds double, double* %a, i64 %1
%2 = or i64 16, 4
%arrayidx131 = getelementptr inbounds double, double* %a, i64 %2
%3 = or i64 16, 6
%arrayidx134 = getelementptr inbounds double, double* %a, i64 %3
%4 = load double, double* %arrayidx134, align 8
%5 = or i64 16, 5
%arrayidx138 = getelementptr inbounds double, double* %a, i64 %5
%6 = or i64 16, 7
%arrayidx141 = getelementptr inbounds double, double* %a, i64 %6
%7 = load double, double* %arrayidx141, align 8
%sub149 = fsub double undef, %4
%sub156 = fsub double undef, %7
store double undef, double* %arrayidx131, align 8
store double undef, double* %arrayidx138, align 8
%sub178 = fsub double undef, %sub156
%add179 = fadd double undef, %sub149
%mul180 = fmul double undef, %sub178
%sub182 = fsub double %mul180, undef
store double %sub182, double* %arrayidx107, align 8
%mul186 = fmul double undef, %add179
%add188 = fadd double %mul186, undef
store double %add188, double* %arrayidx114, align 8
unreachable

}

attributes #0 = { "target-features"="+avx,+avx2,+bmi,+bmi2,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+invpcid,+lzcnt,+mmx,+movbe,+pclmul,+popcnt,+rdrnd,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsaveopt" }

!llvm.ident = !{!0}

!0 = !{!"clang version 12.0.0 (https://github.com/llvm/llvm-project.git aaa925795f93c389a96ee01bab73bc2b6b771cbb)"}

While reviewing the latest update, I think I spotted SLP compile-time failure in SingleSource/Benchmarks/Misc/oourafft.c, here is the reduced testcase to reporduce:
source_filename = "/home/dtemirbulatov/llvm/test-suite/SingleSource/Benchmarks/Misc/oourafft.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local fastcc void @cft1st(double* %a) unnamed_addr #0 {
entry:

%0 = or i64 16, 2
%arrayidx107 = getelementptr inbounds double, double* %a, i64 %0
%1 = or i64 16, 3
%arrayidx114 = getelementptr inbounds double, double* %a, i64 %1
%2 = or i64 16, 4
%arrayidx131 = getelementptr inbounds double, double* %a, i64 %2
%3 = or i64 16, 6
%arrayidx134 = getelementptr inbounds double, double* %a, i64 %3
%4 = load double, double* %arrayidx134, align 8
%5 = or i64 16, 5
%arrayidx138 = getelementptr inbounds double, double* %a, i64 %5
%6 = or i64 16, 7
%arrayidx141 = getelementptr inbounds double, double* %a, i64 %6
%7 = load double, double* %arrayidx141, align 8
%sub149 = fsub double undef, %4
%sub156 = fsub double undef, %7
store double undef, double* %arrayidx131, align 8
store double undef, double* %arrayidx138, align 8
%sub178 = fsub double undef, %sub156
%add179 = fadd double undef, %sub149
%mul180 = fmul double undef, %sub178
%sub182 = fsub double %mul180, undef
store double %sub182, double* %arrayidx107, align 8
%mul186 = fmul double undef, %add179
%add188 = fadd double %mul186, undef
store double %add188, double* %arrayidx114, align 8
unreachable

}

attributes #0 = { "target-features"="+avx,+avx2,+bmi,+bmi2,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+invpcid,+lzcnt,+mmx,+movbe,+pclmul,+popcnt,+rdrnd,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsaveopt" }

!llvm.ident = !{!0}

!0 = !{!"clang version 12.0.0 (https://github.com/llvm/llvm-project.git aaa925795f93c389a96ee01bab73bc2b6b771cbb)"}

Do you mean compile time increasing? With this patch?

Do you mean compile time increasing? With this patch?

no, just compile-time error.

Do you mean compile time increasing? With this patch?

no, just compile-time error.

Crash or incorrect code?

Do you mean compile time increasing? With this patch?

no, just compile-time error.

Crash or incorrect code?

Crash.

wxiao3 added a subscriber: wxiao3.Dec 14 2020, 7:09 AM
ABataev edited the summary of this revision. (Show Details)Feb 10 2021, 6:01 AM

Extra numbers:

AVX512, O3+LTO, -march=native
Metric: SLP.NumVectorInstructions

Program                                                                         lhs      rhs      diff
  test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test   146.00   148.00   1.4%
 test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test   146.00   148.00   1.4%
 test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test    34.00    34.00   0.0%
             test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test    11.00    11.00   0.0%
  test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test    34.00    34.00   0.0%
              test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test    11.00    11.00   0.0%
  test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test  5587.00  5560.00  -0.5%
 test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test  5587.00  5560.00  -0.5%
             test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test  7384.00  7341.00  -0.6%
         test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  9607.00  9359.00  -2.6%
         test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  5340.00  5178.00  -3.0%
            test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  1053.00  1006.00  -4.5%
           test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  1053.00  1006.00  -4.5%
          test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test   141.00   134.00  -5.0%
         test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test   141.00   134.00  -5.0%
      test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  3996.00  3563.00 -10.8%
       test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  3996.00  3563.00 -10.8%
              test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   862.00   767.00 -11.0%
             test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   862.00   767.00 -11.0%
      test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   524.00   463.00 -11.6%
     test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test   524.00   463.00 -11.6%
              test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   426.00   370.00 -13.1%
               test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   426.00   370.00 -13.1%
       test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 15945.00 12573.00 -21.1%
               test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test      NaN    16.00   nan%
              test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test      NaN    16.00   nan%
                                                             Geomean difference                     nan%

AVX512, O3+LTO
Metric: SLP.NumVectorInstructions

Program                                                                         lhs      rhs      diff
         test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test    22.00    60.00 172.7%
          test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test    22.00    60.00 172.7%
 test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test    68.00    72.00   5.9%
  test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test    68.00    72.00   5.9%
              test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test    11.00    11.00   0.0%
 test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test    10.00    10.00   0.0%
 test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test  3396.00  3396.00   0.0%
             test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test    11.00    11.00   0.0%
  test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test    10.00    10.00   0.0%
  test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test  3396.00  3396.00   0.0%
     test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test   499.00   497.00  -0.4%
      test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   499.00   497.00  -0.4%
            test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   838.00   826.00  -1.4%
           test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   838.00   826.00  -1.4%
         test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  6090.00  5906.00  -3.0%
              test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test   131.00   127.00  -3.1%
               test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test   131.00   127.00  -3.1%
             test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test  8815.00  8452.00  -4.1%
      test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  2864.00  2712.00  -5.3%
       test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  2864.00  2712.00  -5.3%
       test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 16049.00 14753.00  -8.1%
              test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   686.00   621.00  -9.5%
             test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   686.00   621.00  -9.5%
              test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   551.00   473.00 -14.2%
               test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   551.00   473.00 -14.2%
         test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 16240.00 13941.00 -14.2%
                                                             Geomean difference                     4.3%

AVX2, O3+LTO, -march=native
Metric: SLP.NumVectorInstructions

Program                                                                         lhs      rhs      diff
             test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test  7309.00  7341.00   0.4%
              test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test    11.00    11.00   0.0%
 test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test    34.00    34.00   0.0%
 test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test  5490.00  5490.00   0.0%
             test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test    11.00    11.00   0.0%
  test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test    34.00    34.00   0.0%
  test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test  5490.00  5490.00   0.0%
     test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test   462.00   455.00  -1.5%
      test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   462.00   455.00  -1.5%
         test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  9508.00  9347.00  -1.7%
         test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  5393.00  5190.00  -3.8%
            test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  1066.00   968.00  -9.2%
           test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  1066.00   968.00  -9.2%
          test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test   151.00   134.00 -11.3%
         test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test   151.00   134.00 -11.3%
  test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test   160.00   141.00 -11.9%
 test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test   160.00   141.00 -11.9%
             test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   820.00   722.00 -12.0%
              test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   820.00   722.00 -12.0%
       test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  3605.00  3173.00 -12.0%
      test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  3605.00  3173.00 -12.0%
               test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   438.00   370.00 -15.5%
              test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   438.00   370.00 -15.5%
       test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 14842.00 12463.00 -16.0%
              test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test   106.00    79.00 -25.5%
               test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test   106.00    79.00 -25.5%
                                                             Geomean difference                    -8.7%

AVX2, O3+LTO
Metric: SLP.NumVectorInstructions

Program                                                                         lhs      rhs      diff
         test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test    22.00    60.00 172.7%
          test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test    22.00    60.00 172.7%
 test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test    68.00    72.00   5.9%
  test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test    68.00    72.00   5.9%
              test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test    11.00    11.00   0.0%
 test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test    10.00    10.00   0.0%
 test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test  3396.00  3396.00   0.0%
             test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test    11.00    11.00   0.0%
  test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test    10.00    10.00   0.0%
  test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test  3396.00  3396.00   0.0%
     test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test   499.00   497.00  -0.4%
      test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   499.00   497.00  -0.4%
            test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   838.00   826.00  -1.4%
           test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   838.00   826.00  -1.4%
              test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test   131.00   127.00  -3.1%
               test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test   131.00   127.00  -3.1%
         test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  6094.00  5906.00  -3.1%
             test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test  8734.00  8452.00  -3.2%
      test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  2867.00  2712.00  -5.4%
       test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  2867.00  2712.00  -5.4%
       test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 16026.00 14753.00  -7.9%
              test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   686.00   621.00  -9.5%
             test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   686.00   621.00  -9.5%
         test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 16241.00 13941.00 -14.2%
              test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   559.00   473.00 -15.4%
               test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   559.00   473.00 -15.4%
                                                             Geomean difference                     4.2%

Will update the patch soon.

ABataev updated this revision to Diff 322816.Feb 10 2021, 1:43 PM
ABataev edited the summary of this revision. (Show Details)

Rework, bug fixes, rebase

This is an integral patch, going to split it into several smaller patches.

Btw, how could it be explained NumVectorInstructions stat reducing after this patch?

Extra numbers:

AVX512, O3+LTO, -march=native
Metric: SLP.NumVectorInstructions

...
      test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  3996.00  3563.00 -10.8%
       test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  3996.00  3563.00 -10.8%
              test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   862.00   767.00 -11.0%
             test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   862.00   767.00 -11.0%
      test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   524.00   463.00 -11.6%
...

This is an integral patch, going to split it into several smaller patches.

Are you planning to send for review all of this patches or just to commit them after this integral review? I.e. should we go on with review here?

Btw, how could it be explained NumVectorInstructions stat reducing after this patch?

Extra numbers:

AVX512, O3+LTO, -march=native
Metric: SLP.NumVectorInstructions

...
      test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  3996.00  3563.00 -10.8%
       test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  3996.00  3563.00 -10.8%
              test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   862.00   767.00 -11.0%
             test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   862.00   767.00 -11.0%
      test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   524.00   463.00 -11.6%
...

Actually, it is not reducing. This is how test-suite python script works. So, here lhs - number of instructions after this patch, rhs - before. And the less relative number, the more vector instructions we actually generate.

This is an integral patch, going to split it into several smaller patches.

Are you planning to send for review all of this patches or just to commit them after this integral review? I.e. should we go on with review here?

You can publish your comments here, no need to wait for small patches.

Actually, it is not reducing. This is how test-suite python script works. So, here lhs - number of instructions after this patch, rhs - before. And the less relative number, the more vector instructions we actually generate.

Oh, I see. There are still several reducing cases though.

Actually, it is not reducing. This is how test-suite python script works. So, here lhs - number of instructions after this patch, rhs - before. And the less relative number, the more vector instructions we actually generate.

Oh, I see. There are still several reducing cases though.

Actually, no. I compared the resulting IR files for these cases - they are absolutely the same as before. It is just we generating fewer ExtractElement/InsertElement instructions and directly emit Shuffle instructions in some cases. That's why it seems it generates fewer vector instructions though it is not.

ABataev updated this revision to Diff 331873.Mar 19 2021, 7:41 AM

Removed logical reductions conversion code.

What is the status of this patch? any blockers? or just lack of reviewers?

What is the status of this patch? any blockers? or just lack of reviewers?

The patch is just too big, I'm splitting it into smaller chunks and commit them step by step. The final chunk would be largest one. But before need to commit other improvements that can be separated from the patch.

Hi Alexey! Is this patch ready for reviewing or will other patches be splitted from this one?

Hi Alexey! Is this patch ready for reviewing or will other patches be splitted from this one?

Just like I said before, it must be split into several smaller patches. I'm doing this step by step, there is D101109, which is part of this big patch. I want to commit it, then rebase this patch and split it again, there are some other parts that can be committed independently.

Is it worth rebasing this to show the remaining diffs that still need to go in?

Is it worth rebasing this to show the remaining diffs that still need to go in?

There were not many commits for it, need to commit some extra patches to fix some regressions, but any way I've started rebasing. We'll try to rebase it next week

Is it worth rebasing this to show the remaining diffs that still need to go in?

There were not many commits for it, need to commit some extra patches to fix some regressions, but any way I've started rebasing. We'll try to rebase it next week

Any chance that you could refresh this patch with your rebase please? I'm investigating a lot of 'float3' style performance issues at the moment (PR50920, PR51075, PR51091) and I'd like to get a better idea of how close the non-pow2 slp support will get us. Thanks.

Is it worth rebasing this to show the remaining diffs that still need to go in?

There were not many commits for it, need to commit some extra patches to fix some regressions, but any way I've started rebasing. We'll try to rebase it next week

Any chance that you could refresh this patch with your rebase please? I'm investigating a lot of 'float3' style performance issues at the moment (PR50920, PR51075, PR51091) and I'd like to get a better idea of how close the non-pow2 slp support will get us. Thanks.

I will try to do it ASAP.

That's awesome thanks, it'll definitely help improve the feedback I can give on the patch series.

ABataev updated this revision to Diff 360420.Jul 21 2021, 6:30 AM

Rebase. Did not test it thoroughly, just rebased and fixed test cases.

Thank you, checked this patch after rebase, trying to fix PR49933. It works well for it, reported to https://bugs.llvm.org/show_bug.cgi?id=49933.

nick added a subscriber: nick.Sep 4 2021, 4:30 PM
vporpo added a subscriber: vporpo.Nov 11 2021, 8:01 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 17 2022, 7:18 PM

Current status?

Current status?

Still requires several patches to commit.

Current status?

Still requires several patches to commit.

Are those patches linked somewhere?

Current status?

Still requires several patches to commit.

Are those patches linked somewhere?

Almost all my current SLP patches are related to non-power-of-2 support. But even these patches are not the full list. Need to add several others after.