This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Extend vectorization for scatter vectorize nodes.
Closed, Public

Authored by ABataev on Jun 7 2022, 8:09 AM.

Details

Summary

Currently, scatter vectorize nodes can be emitted only for GEPs with
constant indices. But we can also emit such nodes for GEPs with the same
ptr and non-constant vectorizable/gathered indices, if profitable. The patch
adds support for such nodes and tries to improve handling of GEPs with
non-constant indices for such nodes.
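
For illustration only, the kind of source pattern described above might look
like the C++ sketch below: four loads through the same base pointer whose
indices are read at run time rather than being constants. The function name
and the data are hypothetical and are not taken from the patch or its tests.
Whether the SLP vectorizer actually turns such code into a masked gather
depends on the target and on the cost model.

```cpp
#include <cassert>

// Hypothetical example, not taken from the patch.
// All four loads use the same base pointer `base`; the indices are loaded
// from `idx`, so they are not compile-time constants.
int sum4_indexed(const int *base, const int *idx) {
  assert(base && idx);
  int a = base[idx[0]];
  int b = base[idx[1]];
  int c = base[idx[2]];
  int d = base[idx[3]];
  return a + b + c + d;
}
```

In this sketch the four index loads from `idx` are consecutive, which is the
"vectorizable indices" case; indices coming from unrelated scalars would
correspond to the "gathered indices" case.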

Metric: SLP.NumVectorInstructions

Program                                                                                       SLP.NumVectorInstructions
                                                                                              results                   results0 diff
                    test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  5243.00                   5240.00  -0.1%
                     test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  5243.00                   5240.00  -0.1%
                     test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 27550.00                  27507.00  -0.2%
                               test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  5395.00                   5380.00  -0.3%
                       test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  5389.00                   5374.00  -0.3%
                    test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   961.00                    958.00  -0.3%
                   test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test   961.00                    958.00  -0.3%
                               test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test  5664.00                   5643.00  -0.4%
                       test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 13202.00                  13127.00  -0.6%
                                test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   212.00                    207.00  -2.4%
                                test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test   890.00                    850.00  -4.5%
                            test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  1695.00                   1581.00  -6.7%
                                 test-suite :: MultiSource/Applications/JM/lencod/lencod.test  2338.00                   2140.00  -8.5%
                                  test-suite :: SingleSource/UnitTests/matrix-types-spec.test    63.00                     55.00 -12.7%
                             test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test   468.00                    356.00 -23.9%
                                                                           Geomean difference                                     -0.3%

All numbers show an increased number of generated vector instructions.

Diff:
SingleSource/Benchmarks/Adobe-C++/loop_unroll - better without LTO, but
needs extra analysis with LTO (with LTO the compiler generates
masked_gather, while before regular loads were emitted because of extra
data available at LTO time).
SingleSource/UnitTests/matrix-types-spec - more vector code.
MultiSource/Applications/JM/lencod/lencod - same.
External/SPEC/CINT2006/464.h264ref/464.h264ref - same.
MultiSource/Benchmarks/7zip/7zip-benchmark - same.
External/SPEC/CINT2006/445.gobmk/445.gobmk - no changes.
External/SPEC/CFP2017rate/510.parest_r/510.parest_r - more vector code.
External/SPEC/CFP2006/447.dealII/447.dealII - same
External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s - same
External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp - same
External/SPEC/CFP2017rate/511.povray_r/511.povray - same
External/SPEC/CFP2006/453.povray/453.povray - same
External/SPEC/CFP2017rate/526.blender_r/526.blender_r - same
External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r - same
External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s - same

Event Timeline

ABataev created this revision. Jun 7 2022, 8:09 AM
Herald added a project: Restricted Project. Jun 7 2022, 8:09 AM
ABataev requested review of this revision. Jun 7 2022, 8:09 AM
vdmitrie accepted this revision. Jun 8 2022, 10:41 AM

LGTM

BTW. One interesting thing I noticed from the test case.
We might want to special case this:

i32 *%addr
%gep3 = getelementptr inbounds i32, i32* %addr, i32 2
%gep5 = getelementptr inbounds i32, i32* %addr, i32 4

...

%gep15 = getelementptr inbounds i32, i32* %addr, i32 14

to be vectorized as if VL0 was "getelementptr inbounds i32, i32* %addr, i32 0"
Now we end up gathering this node.

This revision is now accepted and ready to land. Jun 8 2022, 10:41 AM

Yep, thought about something like this, need to improve the analysis and emit an extra shuffle.
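
One reading of the suggestion above is that a bare base pointer in the first
lane could be treated as a GEP with index 0, so that all lanes share the
"base + constant index" shape. The toy model below sketches that
normalization under this assumption. `PtrOperand` and `normalizeOffsets` are
hypothetical names for illustration only; they are not LLVM data structures
or APIs, and the sketch does not model the extra shuffle mentioned in the
reply.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Toy model (hypothetical, not LLVM's data structures): a pointer operand is
// either the bare base pointer or a GEP off it with a constant element index.
struct PtrOperand {
  int baseId;                        // identifies the base pointer, e.g. %addr
  std::optional<std::int64_t> index; // nullopt means "the bare base pointer"
};

// Returns the per-lane element offsets if every operand uses the same base,
// treating a bare base pointer as if it were a GEP with index 0.
// Returns nullopt when the bases differ or the list is empty.
std::optional<std::vector<std::int64_t>>
normalizeOffsets(const std::vector<PtrOperand> &ops) {
  if (ops.empty())
    return std::nullopt;
  std::vector<std::int64_t> offsets;
  offsets.reserve(ops.size());
  for (const PtrOperand &op : ops) {
    if (op.baseId != ops.front().baseId)
      return std::nullopt;
    offsets.push_back(op.index.value_or(0));
  }
  assert(offsets.size() == ops.size());
  return offsets;
}
```

With the operands from the comment (`%addr` followed by GEPs at 2, 4, ..., 14),
this model yields offsets starting at 0 for the first lane instead of
rejecting the list because its first element is not a GEP.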

ABataev updated this revision to Diff 436837. Jun 14 2022, 10:01 AM

Improve checks to avoid regressions

ABataev edited the summary of this revision. Jun 14 2022, 10:02 AM
This revision was landed with ongoing or failed builds. Jun 16 2022, 6:53 AM
This revision was automatically updated to reflect the committed changes.