This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Improve stores vectorization.
ClosedPublic

Authored by ABataev on Jul 13 2023, 4:08 PM.

Details

Summary

Use O(nlogn) instead of O(N2) (N <= 32) sorting approach and do not try
to revectorize all possible combinations of stores, if they
definitely cannot be combined because of mem/data dependencies.
Compile time (O3 + lto, skylake_avx512):
External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test 117.15 120.11 2.5%
External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 203.67 207.42 1.8%
External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 232.43 235.01 1.1%
External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 205.49 207.25 0.9%
External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 310.46 306.23 -1.4%

Link time (O3+lto, skylake_avx512):
External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 1383.69 1475.94 6.7%

Other changes are too small, cannot rely on them.

size..text
Program size..text

                                                                         results     results0    diff
      test-suite :: SingleSource/Regression/C/Regression-C-sumarray.test      392.00     1439.00 267.1%
            test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   394258.00   394818.00   0.1%
            test-suite :: MultiSource/Applications/JM/lencod/lencod.test   846355.00   847075.00   0.1%
       test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   782816.00   783360.00   0.1%
      test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test   779667.00   779923.00   0.0%
          test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test   224398.00   224446.00   0.0%
               test-suite :: MultiSource/Applications/oggenc/oggenc.test   185019.00   185035.00   0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12487610.00 12488010.00   0.0%
           test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1051772.00  1051804.00   0.0%
                 test-suite :: MultiSource/Applications/SPASS/SPASS.test   529586.00   529602.00   0.0%
   test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1084684.00  1084716.00   0.0%
         test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1014245.00  1014261.00   0.0%

 test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test   223494.00   223478.00  -0.0%
    test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   660843.00   660795.00  -0.0%
     test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   660843.00   660795.00  -0.0%
             test-suite :: MultiSource/Applications/ClamAV/clamscan.test   568824.00   568760.00  -0.0%

espresso - 2 more stores vectorized
x264 - small number of changes in 3-4 functions, generated a bit more
vector stores (2 4x zeroinitializer stores + some other small variations).
clamscan - emitted 32xi8 store instead of several scalar stores + several 4x-8x stores.

Diff Detail

Event Timeline

ABataev created this revision.Jul 13 2023, 4:08 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 13 2023, 4:08 PM
ABataev requested review of this revision.Jul 13 2023, 4:08 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 13 2023, 4:08 PM
Herald added a subscriber: wangpc. · View Herald Transcript
vdmitrie added inline comments.Jul 14 2023, 5:14 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
12395–12396

Please add a description of what is stored in the set.

12396

Could you please specify actual type of the argument here and also add some description (and specifically what captured data it updates)?

12489

Need a description of what is stored in the container, a quick intro of how to interpret data.

12490

Please add a description of the lambda?

12538

I believe we can avoid abusing this c++ feature here. This is okay for template libraries for reason but here it makes very difficult to read code as it forcing to wade in code in search for the actual type.

ABataev updated this revision to Diff 541070.Jul 17 2023, 8:55 AM

Rebase + address comments

Thanks for the update, Alexey. It was very much helpful.
I just have few questions

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
12406–12412

How about the following?

"Distance to the first store in the set"
-->
"Distance of the store address relative to base store address"

whether distance is measured in units or bytes? (condition check "Data.second - PrevDist == 1" suggests units).

12481–12535

May I suggest to rewrite:
"first index of the store in the Stores, set for store index/dist to the very first store"
-->
"first: index of the store into Stores array ref, address of which taken as base,
second: sorted set of pairs {index, dist}, which are indices of stores in the set and their
store location distances relative to the base address." ?

Since "fist index" is perhaps the index of store we started analysis with but it can actually end up not the first in the set (like in your example where it is the last one).

12500–12501

I'd expect the element should actually be like this (if I understand it correctly):
{5, {4, -3}, {2, -2}, {3, -1}, {5, 0}}

hence should try to vectorize "4,2,3,5"
Or sequence in the example should be:

// 1. store x, %p
// 2. store y, %p+2
// 3. store z, %p+1
// 4. store a, %p
// 5. store b, %p+3

Could you please confirm there is a typo here or clarify otherwise?
btw, extra {} would be bit more helpful too: {5, {{4, -3}, {2, -2}, {3, -1}, {5, 0}}}

12510–12512

Hm, I'm confused. "comes after" means store location comes after or "index in the Stores comes after"?
From the code it looks like the latter (as store indices are compared).
Besides, #4 has already been vectorized. Why we would even consider it?

And in either case it is not clear how #5 (%p+3) would come before #4 (%p) ?

Could you please clarify?

ABataev added inline comments.Jul 18 2023, 6:27 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
12406–12412

In units

12500–12501

Yes, it was a typo.

12510–12512
  1. Here comes after - comes after in the order, we process stores (we process them in reverse order).
  1. > Besides, #4 has already been vectorized. Why we would even consider it? This store still exist in LLVM IR (will be removed in SLP destrutor), though marked for the deletion. #4 just have the same distance to the base address and this is a mark, that we need to start a new stores sequence, because we have a conflict of distances (it does not matter, actually, if #4 is vectorized or not).

So, for #4 we try to vectorize sequence {2,3,4,5}, for #1 - only {1,2,3} (if #3 was not vectorized) or {1,2} ( if #3 was vectorized), because most likely #5 'depends' (in some way, not directly, but from the programmers logic, this most likely may happen after loop unrolling) on the store #4. And we can exclude it from the chain for #1 to save compile time.

  1. It comes before in the reverse order, we process stores.
ABataev updated this revision to Diff 541511.Jul 18 2023, 6:57 AM

Rebase + address comments

vdmitrie added inline comments.Jul 18 2023, 5:28 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
12507

Should the sequence be "4,2,3,5" ? Otherwise it does not match the element.
Could you please also check accuracy of the comment after this point. It looks like the original typo caused some inaccuracy.
For example, I'd expect #3 and #5 might not vectorize (as a tail) but not #2 and #5 (since #4 and #3 are not unit stride stores).

12510–12512

Here you operate (and refer) with indices into Stores array ref and it was not quite obvious that you rely on the fact that stores order in the array ref actually directly related to the order they come in input IR. Thanks for explanation.

I actually was not worried about accessing a deleted instruction (I know that SLP delays their actual removal).
The question was related to this part of the code:

if (Pair.first <= It->first ||
    VectorizedStores.contains(Stores[Pair.first]))
  break;
ABataev added inline comments.Jul 19 2023, 6:09 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
12510–12512

Yep, this code check if the store was already vectorized or comes before the store with the same distance (#4 in the example)

ABataev updated this revision to Diff 542007.Jul 19 2023, 7:04 AM

Address comments

ABataev updated this revision to Diff 544490.Jul 26 2023, 1:18 PM

Rebase, ping!

ABataev updated this revision to Diff 545623.Jul 31 2023, 6:11 AM

Rebase, ping!

This revision is now accepted and ready to land.Aug 3 2023, 11:27 AM
This revision was automatically updated to reflect the committed changes.