This is an archive of the discontinued LLVM Phabricator instance.

[LoopDist] Distribute vectorizable loops
Needs Review · Public

Authored by sanwou01 on Mar 30 2021, 7:36 AM.

Details

Reviewers

fhahn
dmgreen
anemet
SjoerdMeijer
jdoerfert
qcolombet
davide
lebedev.ri
nikic

Summary

Loop distribute bails out early if a loop is already vectorizable. As a
first attempt to make the LoopDistribute pass more generally
useful (with the eventual aim of enabling loop distribute by default at
-O3), this patch removes that restriction.

This pass was originally designed to separate the vectorizable parts of
a loop from its non-vectorizable parts, so that some of the resulting
loops can be vectorized. Loop distribution could be more generally
useful, for example, by improving the cache locality of accesses in each
loop.
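
As a rough illustration of the transformation described above (hand-written C++, not the pass's IR-level output), the sketch below splits a loop whose first statement carries a dependence between iterations from a second statement that does not. The names fusedLoop and distributedLoops and the array shapes are made up for this example, and it assumes the arrays do not overlap in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fused loop: the first statement carries a dependence between iterations
// (a[i + 1] needs a[i]); the second statement is independent per iteration.
void fusedLoop(std::vector<int> &a, const std::vector<int> &b,
               std::vector<int> &c, const std::vector<int> &d) {
  assert(a.size() == b.size() + 1);
  assert(c.size() == b.size() && d.size() == b.size());
  for (std::size_t i = 0; i < b.size(); ++i) {
    a[i + 1] = a[i] + b[i];
    c[i] = d[i] * 2;
  }
}

// Hand-distributed form (illustrative only): each statement gets its own
// loop. This is only valid under the assumption that a, b, c and d do not
// overlap in memory.
void distributedLoops(std::vector<int> &a, const std::vector<int> &b,
                      std::vector<int> &c, const std::vector<int> &d) {
  assert(a.size() == b.size() + 1);
  assert(c.size() == b.size() && d.size() == b.size());
  // Partition 1: keeps the loop-carried dependence, so it stays sequential.
  for (std::size_t i = 0; i < b.size(); ++i)
    a[i + 1] = a[i] + b[i];
  // Partition 2: no cross-iteration dependence, a candidate for vectorization.
  for (std::size_t i = 0; i < b.size(); ++i)
    c[i] = d[i] * 2;
}
```

Both functions compute the same results; the distributed form trades one pass over four arrays for two passes over two arrays each, which is where the locality and vectorization trade-offs come from.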

With this change, all vectorizable loads/stores end up in individual
partitions, only to be merged back together. With
--loop-distribute-merge-vectorizable-partitions=false, however, the pass
distributes as much as possible, allowing us to start iterating on the
cost model.

To prevent removeUnusedInsts() from creating undefs outside of the loop,
this patch replaces any uses of seed instructions outside of the loop.
For each value used outside of the loop there is exactly one partition
that uses the defining instruction as a seed, thanks to
findDefsUsedOutsideOfLoop(). This guarantees that all uses outside of
the loop are mapped to the correct partition.
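
A source-level analogy of the live-out handling described above may help; it is a sketch under assumptions, not the pass's actual IR manipulation. The names fusedWithLiveOut and distributedWithLiveOut are hypothetical. The value `last` is defined inside the loop and used after it, so after distribution the use after the loop has to read the copy from the one partition that keeps the defining statement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original loop: `last` is defined in the loop body and used after the loop.
int fusedWithLiveOut(const std::vector<int> &in, std::vector<int> &out) {
  assert(in.size() == out.size());
  int last = 0;
  for (std::size_t i = 0; i < in.size(); ++i) {
    last = in[i] + 1;   // definition that is used outside the loop
    out[i] = in[i] * 3; // unrelated store
  }
  return last; // use outside the loop
}

// Hand-distributed analogy: exactly one partition keeps the definition of
// `last`, and the use after the loops must refer to that partition's value.
int distributedWithLiveOut(const std::vector<int> &in, std::vector<int> &out) {
  assert(in.size() == out.size());
  // Partition 1: seeded by the definition that is live out of the loop.
  int last = 0;
  for (std::size_t i = 0; i < in.size(); ++i)
    last = in[i] + 1;
  // Partition 2: seeded by the store; it has no copy of `last` at all.
  for (std::size_t i = 0; i < in.size(); ++i)
    out[i] = in[i] * 3;
  return last; // mapped to partition 1, the only partition defining it
}
```

If the use after the loop were left pointing at a partition from which the definition had been removed, it would have no meaningful value, which is the situation the paragraph above is guarding against.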

This change, together with
--loop-distribute-merge-vectorizable-partitions=false (and
--enable-loop-distribute), distributes many more loops in the LLVM test
suite, with very mixed performance results.

Follow-up patches will work on a cost model to improve the performance
impact of the pass.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

sanwou01 created this revision. Mar 30 2021, 7:36 AM

sanwou01 requested review of this revision. Mar 30 2021, 7:36 AM

with the eventual aim of enabling loop distribute by default at -O3

This would be great!

Harbormaster completed remote builds in B96327: Diff 334165. Mar 30 2021, 9:51 AM

By just looking at this patch I find it a bit difficult to get an overview of all the moving parts involved. I.e., this probably makes sense:

Loop distribute bails out early if a loop is already vectorizable.

but by not doing this, do we remove opportunities for the vectoriser? So, perhaps the easiest is to get some perf numbers on the table?

Then, we can think about the cost-model too, and see if we can create some ideas about that. This pass is not enabled by default, so if perf numbers are okay and we don't make (downstream) users of this pass unhappy, it looks like a good step forward to me, but some ideas about steps after that would be good.

sanwou01 added a child revision: D100381: [RFC] Improve loop distribute cost model. Apr 13 2021, 6:05 AM

In D99596#2663702, @SjoerdMeijer wrote:

By just looking at this patch I find it a bit difficult to get an overview of all the moving parts involved. I.e., this probably makes sense:

Loop distribute bails out early if a loop is already vectorizable.

but by not doing this, do we remove opportunities for the vectoriser? So, perhaps the easiest is to get some perf numbers on the table?

Then, we can think about the cost-model too, and see if we can create some ideas about that. This pass is not enabled by default, so if perf numbers are okay and we don't make (downstream) users of this pass unhappy, it looks like a good step forward to me, but some ideas about steps after that would be good.

D100381 implements a simple heuristics-based cost model for loop distribute and flips the merge-vectorizable-partitions switch that this patch adds. We can talk about performance there: it doesn't really make sense to do so here, as this patch is more of a preliminary step. Further ideas on the cost model are very welcome, though!

To illustrate the behaviour of this patch a bit more, here is the number of loops distributed in the test suite (including SPEC 2006/2017). The first column is the number of loops distributed *before* this patch, with loop distribute enabled; the second column is the same *after* this patch. The third column also sets --loop-distribute-merge-vectorizable-partitions=false, which corresponds to the maximum number of loops we *could* distribute.

There are a handful of cases where the first and second columns differ, which I wasn't entirely expecting. I'll have a look at what I've missed there.

Tests: 367
Metric: loop-distribute.NumLoopsDistributed

Program                                                                                              old-distribute new-distribute new-distribute-no-merge diff
 test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test                             2.00           2.00         235.00                  11650.0%
 test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test                              2.00           2.00         235.00                  11650.0%
 test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test                                        5.00           5.00         575.00                  11400.0%
 test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test                               30.00          30.00         1523.00                 4976.7%
 test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test                                 2.00           2.00          61.00                  2950.0%
 test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test                                             4.00           4.00          97.00                  2325.0%
 test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test                        17.00          17.00         400.00                  2252.9%
 test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test                       17.00          17.00         400.00                  2252.9%
 test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test                                         3.00           4.00          70.00                  2233.3%
 test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test                             23.00          23.00         534.00                  2221.7%
 test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test                                         NaN             1.00          23.00                  2200.0%
 test-suite :: MultiSource/Applications/oggenc/oggenc.test                                            NaN             1.00          23.00                  2200.0%
 test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test                                    NaN             2.00          42.00                  2000.0%
 test-suite :: External/SPEC/CINT2006/456.hmmer/456.hmmer.test                                         1.00           1.00          13.00                  1200.0%
 test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test                                           1.00           1.00          13.00                  1200.0%
 test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test                                 2.00           2.00          24.00                  1100.0%
 test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test                                2.00           2.00          24.00                  1100.0%
 test-suite :: SingleSource/Benchmarks/Linpack/linpack-pc.test                                         1.00           1.00          10.00                  900.0%
 test-suite :: MicroBenchmarks/LCALS/SubsetCLambdaLoops/lcalsCLambda.test                              3.00           3.00          29.00                  866.7%
 test-suite :: MicroBenchmarks/LCALS/SubsetCRawLoops/lcalsCRaw.test                                    3.00           3.00          29.00                  866.7%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test                        NaN             2.00          14.00                  600.0%
 test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test                              3.00           3.00          20.00                  566.7%
 test-suite :: MicroBenchmarks/LCALS/SubsetARawLoops/lcalsARaw.test                                    3.00           3.00          20.00                  566.7%
 test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test                                   19.00          19.00         121.00                  536.8%
 test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test                                    19.00          19.00         121.00                  536.8%
 test-suite :: MicroBenchmarks/LCALS/SubsetBLambdaLoops/lcalsBLambda.test                              3.00           3.00          19.00                  533.3%
 test-suite :: MicroBenchmarks/LCALS/SubsetBRawLoops/lcalsBRaw.test                                    3.00           3.00          19.00                  533.3%
 test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test                                            2.00           2.00           7.00                  250.0%
 test-suite :: MultiSource/Applications/siod/siod.test                                                 1.00           1.00           3.00                  200.0%
 test-suite :: External/SPEC/CINT2006/429.mcf/429.mcf.test                                             2.00           2.00           5.00                  150.0%
 test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test                       54.00          54.00          88.00                  63.0%
 test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test                        54.00          54.00          88.00                  63.0%
 test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gesummv/gesummv.test           2.00           2.00           3.00                  50.0%
 test-suite :: MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl.test                           NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt.test                           NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/Packing-flt/Packing-flt.test                               NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-dbl.test                       NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test                       NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/Reductions-dbl/Reductions-dbl.test                         NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/Reductions-flt/Reductions-flt.test                         NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl.test                           NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.test       NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.test                   NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-flt/StatementReordering-flt.test       NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/Symbolics-dbl/Symbolics-dbl.test                           NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt.test                           NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des.test                                 NaN            NaN             1.00                   0.0%
 test-suite :: MultiSource/Benchmarks/Trimaran/netbench-url/netbench-url.test                         NaN            NaN             1.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Stanford/Oscar.test                                            NaN            NaN             3.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Stanford/FloatMM.test                                          NaN            NaN             3.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/Packing-dbl/Packing-dbl.test                               NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl.test                   NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt.test                           NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/InductionVariable-flt/InductionVariable-flt.test           NaN            NaN             4.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/GlobalDataFlow-dbl/GlobalDataFlow-dbl.test                 NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/GlobalDataFlow-flt/GlobalDataFlow-flt.test                 NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/IndirectAddressing-dbl/IndirectAddressing-dbl.test         NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/Equivalencing-flt/Equivalencing-flt.test                   NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/IndirectAddressing-flt/IndirectAddressing-flt.test         NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/InductionVariable-dbl/InductionVariable-dbl.test           NaN            NaN             4.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl.test             NaN            NaN             2.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Stanford/Quicksort.test                                        NaN            NaN             3.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt.test             NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl.test                   NaN            NaN             2.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Stanford/RealMM.test                                           NaN            NaN             3.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt.test                   NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/LoopRestructuring-dbl/LoopRestructuring-dbl.test           NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/VersaBench/beamformer/beamformer.test                           NaN            NaN            26.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt.test           NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test                                       NaN            NaN            94.00                   0.0%
 test-suite :: MultiSource/Benchmarks/VersaBench/bmm/bmm.test                                         NaN            NaN             1.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog.test          NaN            NaN             2.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Misc-C++/oopack_v1p8.test                                      NaN            NaN             3.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Misc/flops-2.test                                              NaN            NaN             2.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Misc/flops.test                                                NaN            NaN             2.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Misc/fp-convert.test                                           NaN            NaN             1.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Misc/pi.test                                                   NaN            NaN             1.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/solvers/durbin/durbin.test            NaN            NaN             2.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Misc/revertBits.test                                           NaN            NaN             2.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Misc/salsa20.test                                              NaN            NaN             1.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm/trmm.test                NaN            NaN             1.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Misc/whetstone.test                                            NaN            NaN             4.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/syrk/syrk.test                NaN            NaN             1.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/bicg/bicg.test                 2.00           2.00           2.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/cholesky/cholesky.test        NaN            NaN             1.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/syr2k/syr2k.test              NaN            NaN             2.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/doitgen/doitgen.test          NaN            NaN             1.00                   0.0%
 test-suite :: SingleSource/Benchmarks/McGill/queens.test                                             NaN            NaN             1.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/solvers/gramschmidt/gramschmidt.test  NaN            NaN             2.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.test                NaN            NaN             3.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Polybench/stencils/adi/adi.test                                NaN            NaN             2.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Stanford/Bubblesort.test                                       NaN            NaN             3.00                   0.0%
 test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test                              NaN            NaN            28.00                   0.0%
 test-suite :: MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode.test                       1.00           1.00           1.00                   0.0%
 test-suite :: MultiSource/Benchmarks/nbench/nbench.test                                              NaN            NaN             5.00                   0.0%
 test-suite :: MultiSource/Benchmarks/sim/sim.test                                                    NaN            NaN             6.00                   0.0%
 test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test                                      NaN            NaN           195.00                   0.0%
 test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test                                      NaN            NaN             1.00                   0.0%
 test-suite :: SingleSource/Benchmarks/CoyoteBench/fftbench.test                                      NaN            NaN             1.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Polybench/stencils/jacobi-2d-imper/jacobi-2d-imper.test        NaN            NaN             2.00                   0.0%
 test-suite :: SingleSource/Benchmarks/CoyoteBench/huffbench.test                                     NaN            NaN             2.00                   0.0%
 test-suite :: SingleSource/Benchmarks/CoyoteBench/lpbench.test                                       NaN            NaN             2.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Polybench/stencils/jacobi-1d-imper/jacobi-1d-imper.test        NaN            NaN             2.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Polybench/stencils/fdtd-apml/fdtd-apml.test                    NaN            NaN             4.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test         NaN            NaN             2.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Polybench/stencils/fdtd-2d/fdtd-2d.test                        NaN            NaN             1.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/Equivalencing-dbl/Equivalencing-dbl.test                   NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test                        NaN            NaN            30.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.test         NaN            NaN             2.00                   0.0%
 test-suite :: External/SPEC/CINT2017speed/657.xz_s/657.xz_s.test                                     NaN            NaN             7.00                   0.0%
 test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test                                  NaN            NaN            27.00                   0.0%
 test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test                        NaN            NaN             8.00                   0.0%
 test-suite :: External/SPEC/CINT2017rate/557.xz_r/557.xz_r.test                                      NaN            NaN             7.00                   0.0%
 test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test                                   NaN            NaN             6.00                   0.0%
 test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test                           NaN            NaN            10.00                   0.0%
 test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test                                 NaN            NaN            27.00                   0.0%
 test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test                       NaN            NaN             8.00                   0.0%
 test-suite :: MicroBenchmarks/ImageProcessing/AnisotropicDiffusion/AnisotropicDiffusion.test         NaN            NaN             1.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/ControlLoops-flt/ControlLoops-flt.test                     NaN            NaN             2.00                   0.0%
 test-suite :: MicroBenchmarks/ImageProcessing/Dilate/Dilate.test                                     NaN            NaN             1.00                   0.0%
 test-suite :: MicroBenchmarks/ImageProcessing/Dither/Dither.test                                     NaN            NaN             1.00                   0.0%
 test-suite :: MultiSource/Applications/ALAC/decode/alacconvert-decode.test                           NaN            NaN             4.00                   0.0%
 test-suite :: MultiSource/Applications/ALAC/encode/alacconvert-encode.test                           NaN            NaN             4.00                   0.0%
 test-suite :: MultiSource/Applications/ClamAV/clamscan.test                                          NaN            NaN            50.00                   0.0%
 test-suite :: MultiSource/Applications/JM/lencod/lencod.test                                         NaN            NaN            19.00                   0.0%
 test-suite :: MultiSource/Applications/SPASS/SPASS.test                                               1.00           1.00           1.00                   0.0%
 test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test                            NaN            NaN            10.00                   0.0%
 test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test                                    NaN            NaN             6.00                   0.0%
 test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test                                NaN            NaN            64.00                   0.0%
 test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test                                        NaN            NaN             1.00                   0.0%
 test-suite :: External/SPEC/CFP2006/450.soplex/450.soplex.test                                       NaN            NaN            22.00                   0.0%
 test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test                                       NaN            NaN            36.00                   0.0%
 test-suite :: External/SPEC/CFP2006/470.lbm/470.lbm.test                                             NaN            NaN             2.00                   0.0%
 test-suite :: External/SPEC/CFP2006/482.sphinx3/482.sphinx3.test                                     NaN            NaN            15.00                   0.0%
 test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test                                   NaN            NaN            63.00                   0.0%
 test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test                               NaN            NaN            35.00                   0.0%
 test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test                                     NaN            NaN             2.00                   0.0%
 test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test                                     NaN            NaN            19.00                   0.0%
 test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test                                    NaN            NaN             2.00                   0.0%
 test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test                                    NaN            NaN            19.00                   0.0%
 test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test                                NaN            NaN            80.00                   0.0%
 test-suite :: External/SPEC/CINT2006/401.bzip2/401.bzip2.test                                        NaN            NaN             3.00                   0.0%
 test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test                                        NaN            NaN            11.00                   0.0%
 test-suite :: External/SPEC/CINT2006/458.sjeng/458.sjeng.test                                        NaN            NaN             1.00                   0.0%
 test-suite :: External/SPEC/CINT2006/471.omnetpp/471.omnetpp.test                                    NaN            NaN             3.00                   0.0%
 test-suite :: MultiSource/Applications/d/make_dparser.test                                           NaN            NaN             1.00                   0.0%
 test-suite :: MultiSource/Applications/hbd/hbd.test                                                   1.00           1.00           1.00                   0.0%
 test-suite :: MultiSource/Applications/lua/lua.test                                                  NaN            NaN            12.00                   0.0%
 test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test                                      NaN            NaN             1.00                   0.0%
 test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test                              NaN            NaN             6.00                   0.0%
 test-suite :: MultiSource/Benchmarks/McCat/04-bisect/bisect.test                                     NaN            NaN             4.00                   0.0%
 test-suite :: MultiSource/Benchmarks/McCat/08-main/main.test                                         NaN            NaN             1.00                   0.0%
 test-suite :: External/SPEC/CFP2006/444.namd/444.namd.test                                           NaN            NaN            21.00                   0.0%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.test                  NaN            NaN             3.00                   0.0%
 test-suite :: MultiSource/Benchmarks/MiBench/network-dijkstra/network-dijkstra.test                  NaN            NaN             1.00                   0.0%
 test-suite :: MultiSource/Benchmarks/MiBench/telecomm-FFT/telecomm-fft.test                          NaN            NaN             3.00                   0.0%
 test-suite :: MultiSource/Benchmarks/Prolangs-C/agrep/agrep.test                                     NaN            NaN            12.00                   0.0%
 test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test                                   NaN            NaN            10.00                   0.0%
 test-suite :: MultiSource/Benchmarks/Ptrdist/bc/bc.test                                              NaN            NaN             8.00                   0.0%
 test-suite :: MultiSource/Benchmarks/Rodinia/hotspot/hotspot.test                                    NaN            NaN             1.00                   0.0%
 test-suite :: MultiSource/Benchmarks/SciMark2-C/scimark2.test                                        NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.test                       NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.test                       NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/TSVC/ControlLoops-dbl/ControlLoops-dbl.test                     NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/MallocBench/cfrac/cfrac.test                                    NaN            NaN             6.00                   0.0%
 test-suite :: MultiSource/Benchmarks/FreeBench/pcompress2/pcompress2.test                            NaN            NaN             5.00                   0.0%
 test-suite :: MultiSource/Applications/minisat/minisat.test                                          NaN            NaN             1.00                   0.0%
 test-suite :: MultiSource/Benchmarks/FreeBench/analyzer/analyzer.test                                NaN            NaN             3.00                   0.0%
 test-suite :: MultiSource/Applications/obsequi/Obsequi.test                                          NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Applications/sgefa/sgefa.test                                              NaN            NaN             1.00                   0.0%
 test-suite :: MultiSource/Applications/sqlite3/sqlite3.test                                          NaN            NaN            13.00                   0.0%
 test-suite :: MultiSource/Applications/viterbi/viterbi.test                                          NaN            NaN             1.00                   0.0%
 test-suite :: MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.test                                    NaN            NaN             5.00                   0.0%
 test-suite :: MultiSource/Benchmarks/ASC_Sequoia/CrystalMk/CrystalMk.test                            NaN            NaN             5.00                   0.0%
 test-suite :: MultiSource/Benchmarks/BitBench/five11/five11.test                                     NaN            NaN             1.00                   0.0%
 test-suite :: MultiSource/Benchmarks/Bullet/bullet.test                                              NaN            NaN            32.00                   0.0%
 test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test                              NaN            NaN             8.00                   0.0%
 test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/HACCKernels.test                  NaN            NaN             2.00                   0.0%
 test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test                          NaN            NaN            15.00                   0.0%
 test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test                            NaN            NaN             3.00                   0.0%
 test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test                                  NaN            NaN             9.00                   0.0%
 test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/SimpleMOC/SimpleMOC.test                        NaN            NaN             4.00                   0.0%
 test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test                            NaN            NaN            12.00                   0.0%
 test-suite :: SingleSource/Benchmarks/Stanford/Treesort.test                                         NaN            NaN             2.00                   0.0%

sanwou01 added reviewers: lebedev.ri, nikic, davide.Apr 14 2021, 6:26 AM

Rebased, and addressed a discrepancy in which loops get distributed. The difference hinges on loops that contain backward dependences which the loop vectorizer can handle, but which would frustrate loop distribution. In this case, we don't distribute the loop and leave it to the loop vectorizer.

Now, there are no differences in distributed loops in the test suite and SPEC, before and after the patch, as intended.
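
To make the backward-dependence case concrete, here is a minimal C++ sketch. It is not taken from the patch: the array length, the dependence distance of 4, and the names scalarLoop and chunkedLoop are assumptions chosen for illustration. Each iteration reads a[i] and writes a[i + 4], so the loop carries a dependence, yet processing four iterations at a time (a stand-in for what a vectorizer might do) gives the same result as the scalar loop.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 20;   // hypothetical array length
constexpr std::size_t Dist = 4; // hypothetical dependence distance

// Scalar reference loop: iteration i reads a[i] and writes a[i + Dist].
std::array<int, N> scalarLoop(std::array<int, N> a) {
  for (std::size_t i = 0; i + Dist < N; ++i)
    a[i + Dist] = a[i] + 1;
  return a;
}

// Hand-written stand-in for a vectorized loop with vector factor VF.
// All loads of a chunk happen before any of its stores, which is only
// safe because VF does not exceed the dependence distance.
std::array<int, N> chunkedLoop(std::array<int, N> a) {
  constexpr std::size_t VF = 4;
  static_assert(VF <= Dist, "chunk must not span the dependence distance");
  for (std::size_t i = 0; i + Dist < N; i += VF) {
    const std::size_t Count = std::min(VF, N - Dist - i);
    int Tmp[VF];
    for (std::size_t k = 0; k < Count; ++k)
      Tmp[k] = a[i + k];            // "vector load"
    for (std::size_t k = 0; k < Count; ++k)
      a[i + Dist + k] = Tmp[k] + 1; // "vector store"
  }
  return a;
}
```

Calling both functions on the same input should give identical arrays. Raising VF above Dist would break that equivalence, which is why the static_assert is there.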

sanwou01 retitled this revision from [RFC] [LoopDist] Distribute vectorizable loops to [LoopDist] Distribute vectorizable loops.Apr 20 2021, 9:15 AM

Harbormaster completed remote builds in B99733: Diff 338893.Apr 20 2021, 10:03 AM

nikic resigned from this revision.Jun 9 2021, 1:48 PM

This review seems to be stuck/dead, consider abandoning if no longer relevant.

Herald added a project: Restricted Project. Jan 12 2023, 5:21 PM

Herald added subscribers: pcwang-thead, StephenFan.

Revision Contents

Path	Size
llvm/lib/Transforms/Scalar/LoopDistribute.cpp	81 lines
llvm/test/Transforms/LoopDistribute/bug-uses-outside-loop.ll	66 lines
llvm/test/Transforms/LoopDistribute/diagnostics-with-hotness.ll	2 lines
llvm/test/Transforms/LoopDistribute/diagnostics.ll	2 lines
llvm/test/Transforms/LoopDistribute/vectorizable-dependences.ll	119 lines

Diff 338893

llvm/lib/Transforms/Scalar/LoopDistribute.cpp

//===- LoopDistribute.cpp - Loop Distribution Pass ------------------------===//		//===- LoopDistribute.cpp - Loop Distribution Pass ------------------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
▲ Show 20 Lines • Show All 92 Lines • ▼ Show 20 Lines	cl::desc("Whether to distribute into a loop that may not be "
"if-convertible by the loop vectorizer"),		"if-convertible by the loop vectorizer"),
cl::init(false));		cl::init(false));

static cl::opt<unsigned> DistributeSCEVCheckThreshold(		static cl::opt<unsigned> DistributeSCEVCheckThreshold(
"loop-distribute-scev-check-threshold", cl::init(8), cl::Hidden,		"loop-distribute-scev-check-threshold", cl::init(8), cl::Hidden,
cl::desc("The maximum number of SCEV checks allowed for Loop "		cl::desc("The maximum number of SCEV checks allowed for Loop "
"Distribution"));		"Distribution"));

		static cl::opt<unsigned> DistributeRuntimePointerChecksThreshold(
		"loop-distribute-runtime-check-threshold", cl::init(32), cl::Hidden,
		cl::desc("The maximum number of runtime pointer aliasing checks for Loop "
		"Distribution"));

		static cl::opt<bool> DistributeVectorizableLoops(
		"loop-distribute-vectorizable-loops", cl::init(true), cl::Hidden,
		cl::desc(
		"Consider vectorizable loops for loop distribution."));

		static cl::opt<bool> DistributeMergeVectorizablePartitions(
		"loop-distribute-merge-vectorizable-partitions", cl::init(true), cl::Hidden,
		cl::desc("Merge adjacent partitions that are already vectorizable."));

static cl::opt<unsigned> PragmaDistributeSCEVCheckThreshold(		static cl::opt<unsigned> PragmaDistributeSCEVCheckThreshold(
"loop-distribute-scev-check-threshold-with-pragma", cl::init(128),		"loop-distribute-scev-check-threshold-with-pragma", cl::init(128),
cl::Hidden,		cl::Hidden,
cl::desc(		cl::desc(
"The maximum number of SCEV checks allowed for Loop "		"The maximum number of SCEV checks allowed for Loop "
"Distribution for loop marked with #pragma loop distribute(enable)"));		"Distribution for loop marked with #pragma loop distribute(enable)"));

static cl::opt<bool> EnableLoopDistribute(		static cl::opt<bool> EnableLoopDistribute(
"enable-loop-distribute", cl::Hidden,		"enable-loop-distribute", cl::Hidden,
cl::desc("Enable the new, experimental LoopDistribution Pass"),		cl::desc("Enable the new, experimental LoopDistribution Pass"),
cl::init(false));		cl::init(false));

STATISTIC(NumLoopsDistributed, "Number of loops distributed");		STATISTIC(NumLoopsDistributed, "Number of loops distributed");

namespace {		namespace {

/// Maintains the set of instructions of the loop for a partition before		/// Maintains the set of instructions of the loop for a partition before
/// cloning. After cloning, it hosts the new loop.		/// cloning. After cloning, it hosts the new loop.
class InstPartition {		class InstPartition {
using InstructionSet = SmallPtrSet<Instruction *, 8>;		using InstructionSet = SmallPtrSet<Instruction *, 8>;

public:		public:
InstPartition(Instruction *I, Loop *L, bool DepCycle = false)		InstPartition(Instruction *I, Loop *L, bool DepCycle = false)
: DepCycle(DepCycle), OrigLoop(L) {		: DepCycle(DepCycle), OrigLoop(L) {
		SeedInstructions.insert(I);
Set.insert(I);		Set.insert(I);
}		}

/// Returns whether this partition contains a dependence cycle.		/// Returns whether this partition contains a dependence cycle.
bool hasDepCycle() const { return DepCycle; }		bool hasDepCycle() const { return DepCycle; }

/// Adds an instruction to this partition.		/// Adds an instruction to this partition.
void add(Instruction *I) { Set.insert(I); }		void addSeed(Instruction *I) {
		SeedInstructions.insert(I);
		Set.insert(I);
		}

/// Collection accessors.		/// Collection accessors.
InstructionSet::iterator begin() { return Set.begin(); }		InstructionSet::iterator begin() { return Set.begin(); }
InstructionSet::iterator end() { return Set.end(); }		InstructionSet::iterator end() { return Set.end(); }
InstructionSet::const_iterator begin() const { return Set.begin(); }		InstructionSet::const_iterator begin() const { return Set.begin(); }
InstructionSet::const_iterator end() const { return Set.end(); }		InstructionSet::const_iterator end() const { return Set.end(); }
bool empty() const { return Set.empty(); }		bool empty() const { return Set.empty(); }

/// Moves this partition into \p Other. This partition becomes empty		/// Moves this partition into \p Other. This partition becomes empty
/// after this.		/// after this.
void moveTo(InstPartition &Other) {		void moveTo(InstPartition &Other) {
		Other.SeedInstructions.insert(SeedInstructions.begin(),
		SeedInstructions.end());
		SeedInstructions.clear();
Other.Set.insert(Set.begin(), Set.end());		Other.Set.insert(Set.begin(), Set.end());
Set.clear();		Set.clear();
Other.DepCycle |= DepCycle;		Other.DepCycle |= DepCycle;
}		}

/// Populates the partition with a transitive closure of all the		/// Populates the partition with a transitive closure of all the
/// instructions that the seeded instructions dependent on.		/// instructions that the seeded instructions dependent on.
void populateUsedSet() {		void populateUsedSet() {
▲ Show 20 Lines • Show All 42 Lines • ▼ Show 20 Lines	public:
}		}

/// The VMap that is populated by cloning and then used in		/// The VMap that is populated by cloning and then used in
/// remapinstruction to remap the cloned instructions.		/// remapinstruction to remap the cloned instructions.
ValueToValueMapTy &getVMap() { return VMap; }		ValueToValueMapTy &getVMap() { return VMap; }

/// Remaps the cloned instructions using VMap.		/// Remaps the cloned instructions using VMap.
void remapInstructions() {		void remapInstructions() {
		// Seed instructions might be used outside the loop.
		for (Instruction *I : SeedInstructions) {
		I->replaceUsesWithIf(VMap[I], [&](Use &U) {
		Instruction *Use = cast<Instruction>(U.getUser());
		return !OrigLoop->contains(Use->getParent());
		});
		}

remapInstructionsInBlocks(ClonedLoopBlocks, VMap);		remapInstructionsInBlocks(ClonedLoopBlocks, VMap);
}		}

/// Based on the set of instructions selected for this partition,		/// Based on the set of instructions selected for this partition,
/// removes the unnecessary ones.		/// removes the unnecessary ones.
void removeUnusedInsts() {		void removeUnusedInsts() {
SmallVector<Instruction *, 8> Unused;		SmallVector<Instruction *, 8> Unused;

Show All 30 Lines	void printBlocks() const {
for (auto *BB : getDistributedLoop()->getBlocks())		for (auto *BB : getDistributedLoop()->getBlocks())
dbgs() << *BB;		dbgs() << *BB;
}		}

private:		private:
/// Instructions from OrigLoop selected for this partition.		/// Instructions from OrigLoop selected for this partition.
InstructionSet Set;		InstructionSet Set;

		/// Instructions from OrigLoop used to seed this partition.
		InstructionSet SeedInstructions;

/// Whether this partition contains a dependence cycle.		/// Whether this partition contains a dependence cycle.
bool DepCycle;		bool DepCycle;

/// The original loop.		/// The original loop.
Loop *OrigLoop;		Loop *OrigLoop;

/// The cloned loop. If this partition is mapped to the original loop,		/// The cloned loop. If this partition is mapped to the original loop,
/// this is null.		/// this is null.
Show All 23 Lines	public:

/// Adds \p Inst into the current partition if that is marked to		/// Adds \p Inst into the current partition if that is marked to
/// contain cycles. Otherwise start a new partition for it.		/// contain cycles. Otherwise start a new partition for it.
void addToCyclicPartition(Instruction *Inst) {		void addToCyclicPartition(Instruction *Inst) {
// If the current partition is non-cyclic. Start a new one.		// If the current partition is non-cyclic. Start a new one.
if (PartitionContainer.empty() || !PartitionContainer.back().hasDepCycle())		if (PartitionContainer.empty() || !PartitionContainer.back().hasDepCycle())
PartitionContainer.emplace_back(Inst, L, /*DepCycle=*/true);		PartitionContainer.emplace_back(Inst, L, /*DepCycle=*/true);
else		else
PartitionContainer.back().add(Inst);		PartitionContainer.back().addSeed(Inst);
}		}

/// Adds \p Inst into a partition that is not marked to contain		/// Adds \p Inst into a partition that is not marked to contain
/// dependence cycles.		/// dependence cycles.
///		///
// Initially we isolate memory instructions into as many partitions as		// Initially we isolate memory instructions into as many partitions as
// possible, then later we may merge them back together.		// possible, then later we may merge them back together.
void addToNewNonCyclicPartition(Instruction *Inst) {		void addToNewNonCyclicPartition(Instruction *Inst) {
Show All 27 Lines	mergeAdjacentPartitionsIf([&](const InstPartition *Partition) {
return false;		return false;
}		}
return seenStore;		return seenStore;
});		});
}		}

/// Merges the partitions according to various heuristics.		/// Merges the partitions according to various heuristics.
void mergeBeforePopulating() {		void mergeBeforePopulating() {
		if (DistributeMergeVectorizablePartitions)
mergeAdjacentNonCyclic();		mergeAdjacentNonCyclic();
if (!DistributeNonIfConvertible)		if (!DistributeNonIfConvertible)
mergeNonIfConvertible();		mergeNonIfConvertible();
}		}

/// Merges partitions in order to ensure that no loads are duplicated.		/// Merges partitions in order to ensure that no loads are duplicated.
///		///
/// We can't duplicate loads because that could potentially reorder them.		/// We can't duplicate loads because that could potentially reorder them.
/// LoopAccessAnalysis provides dependency information with the context that		/// LoopAccessAnalysis provides dependency information with the context that
▲ Show 20 Lines • Show All 299 Lines • ▼ Show 20 Lines	for (auto &Dep : Dependences)
LLVM_DEBUG(Dep.print(dbgs(), 2, Instructions));		LLVM_DEBUG(Dep.print(dbgs(), 2, Instructions));
}		}
}		}

private:		private:
AccessesType Accesses;		AccessesType Accesses;
};		};

		static bool hasPossiblyBackwardDependences(
		const SmallVectorImpl<MemoryDepChecker::Dependence> &Dependences) {
		for (auto &Dep : Dependences)
		if (Dep.isPossiblyBackward())
		return true;

		return false;
		}

/// The actual class performing the per-loop work.		/// The actual class performing the per-loop work.
class LoopDistributeForLoop {		class LoopDistributeForLoop {
public:		public:
LoopDistributeForLoop(Loop *L, Function *F, LoopInfo *LI, DominatorTree *DT,		LoopDistributeForLoop(Loop *L, Function *F, LoopInfo *LI, DominatorTree *DT,
ScalarEvolution *SE, OptimizationRemarkEmitter *ORE)		ScalarEvolution *SE, OptimizationRemarkEmitter *ORE)
: L(L), F(F), LI(LI), DT(DT), SE(SE), ORE(ORE) {		: L(L), F(F), LI(LI), DT(DT), SE(SE), ORE(ORE) {
setForced();		setForced();
}		}
Show All 16 Lines	if (!L->isRotatedForm())
return fail("NotBottomTested", "loop is not bottom tested");		return fail("NotBottomTested", "loop is not bottom tested");

BasicBlock *PH = L->getLoopPreheader();		BasicBlock *PH = L->getLoopPreheader();

LAI = &GetLAA(*L);		LAI = &GetLAA(*L);

// Currently, we only distribute to isolate the part of the loop with		// Currently, we only distribute to isolate the part of the loop with
// dependence cycles to enable partial vectorization.		// dependence cycles to enable partial vectorization.
if (LAI->canVectorizeMemory())		if (LAI->canVectorizeMemory() && !DistributeVectorizableLoops)
return fail("MemOpsCanBeVectorized",		return fail("MemOpsCanBeVectorized",
"memory operations are safe for vectorization");		"memory operations are safe for vectorization");

auto *Dependences = LAI->getDepChecker().getDependences();		auto *Dependences = LAI->getDepChecker().getDependences();
if (!Dependences || Dependences->empty())		if (!Dependences)
return fail("NoUnsafeDeps", "no unsafe dependences to isolate");		return fail("NoDeps", "dependency analysis failed");

		LLVM_DEBUG(dbgs() << "NumDependences: " << Dependences->size() << "\n");

		// If there are potentially-backward dependencies (which don't prevent
		// vectorisation), loop distribute would spuriously distribute.
		if (LAI->canVectorizeMemory() &&
		hasPossiblyBackwardDependences(*Dependences)) {
		return fail("VectorizableDependences",
		"dependences that won't block vectorization found");
		}

		// If we can't vectorize and the set of dependencies is empty, then that means
		// that Loop Access Analysis gave up and the results are invalid. Don't try
		// to do loop distribution based off it, or Bad Things happen.
		if (!LAI->canVectorizeMemory() && Dependences->empty()) {
		return fail("NoDeps", "dependency analysis failed");
		}

InstPartitionContainer Partitions(L, LI, DT);		InstPartitionContainer Partitions(L, LI, DT);

// First, go through each memory operation and assign them to consecutive		// First, go through each memory operation and assign them to consecutive
// partitions (the order of partitions follows program order). Put those		// partitions (the order of partitions follows program order). Put those
// with unsafe dependences into "cyclic" partition otherwise put each store		// with unsafe dependences into "cyclic" partition otherwise put each store
// in its own "non-cyclic" partition (we'll merge these later).		// in its own "non-cyclic" partition (we'll merge these later).
//		//
▲ Show 20 Lines • Show All 42 Lines • ▼ Show 20 Lines	bool processLoop(std::function<const LoopAccessInfo &(Loop &)> &GetLAA) {
LLVM_DEBUG(dbgs() << "Seeded partitions:\n" << Partitions);		LLVM_DEBUG(dbgs() << "Seeded partitions:\n" << Partitions);
if (Partitions.getSize() < 2)		if (Partitions.getSize() < 2)
return fail("CantIsolateUnsafeDeps",		return fail("CantIsolateUnsafeDeps",
"cannot isolate unsafe dependencies");		"cannot isolate unsafe dependencies");

// Run the merge heuristics: Merge non-cyclic adjacent partitions since we		// Run the merge heuristics: Merge non-cyclic adjacent partitions since we
// should be able to vectorize these together.		// should be able to vectorize these together.
Partitions.mergeBeforePopulating();		Partitions.mergeBeforePopulating();

LLVM_DEBUG(dbgs() << "\nMerged partitions:\n" << Partitions);		LLVM_DEBUG(dbgs() << "\nMerged partitions:\n" << Partitions);
if (Partitions.getSize() < 2)		if (Partitions.getSize() < 2)
return fail("CantIsolateUnsafeDeps",		return fail("CantIsolateUnsafeDeps",
"cannot isolate unsafe dependencies");		"cannot isolate unsafe dependencies");

// Now, populate the partitions with non-memory operations.		// Now, populate the partitions with non-memory operations.
Partitions.populateUsedSet();		Partitions.populateUsedSet();
LLVM_DEBUG(dbgs() << "\nPopulated partitions:\n" << Partitions);		LLVM_DEBUG(dbgs() << "\nPopulated partitions:\n" << Partitions);
Show All 11 Lines	bool processLoop(std::function<const LoopAccessInfo &(Loop &)> &GetLAA) {
// Don't distribute the loop if we need too many SCEV run-time checks, or		// Don't distribute the loop if we need too many SCEV run-time checks, or
// any if it's illegal.		// any if it's illegal.
const SCEVUnionPredicate &Pred = LAI->getPSE().getUnionPredicate();		const SCEVUnionPredicate &Pred = LAI->getPSE().getUnionPredicate();
if (LAI->hasConvergentOp() && !Pred.isAlwaysTrue()) {		if (LAI->hasConvergentOp() && !Pred.isAlwaysTrue()) {
return fail("RuntimeCheckWithConvergent",		return fail("RuntimeCheckWithConvergent",
"may not insert runtime check with convergent operation");		"may not insert runtime check with convergent operation");
}		}

		LLVM_DEBUG(dbgs() << "LD: SCEV predicate complexity: "
		<< Pred.getComplexity() << "\n");
if (Pred.getComplexity() > (IsForced.getValueOr(false)		if (Pred.getComplexity() > (IsForced.getValueOr(false)
? PragmaDistributeSCEVCheckThreshold		? PragmaDistributeSCEVCheckThreshold
: DistributeSCEVCheckThreshold))		: DistributeSCEVCheckThreshold))
return fail("TooManySCEVRuntimeChecks",		return fail("TooManySCEVRuntimeChecks",
"too many SCEV run-time checks needed.\n");		"too many SCEV run-time checks needed.\n");

if (!IsForced.getValueOr(false) && hasDisableAllTransformsHint(L))		if (!IsForced.getValueOr(false) && hasDisableAllTransformsHint(L))
return fail("HeuristicDisabled", "distribution heuristic disabled");		return fail("HeuristicDisabled", "distribution heuristic disabled");

LLVM_DEBUG(dbgs() << "\nDistributing loop: " << *L << "\n");		LLVM_DEBUG(dbgs() << "\nDistributing loop: " << *L << "\n");
// We're done forming the partitions set up the reverse mapping from		// We're done forming the partitions set up the reverse mapping from
// instructions to partitions.		// instructions to partitions.
Partitions.setupPartitionIdOnInstructions();		Partitions.setupPartitionIdOnInstructions();

// If we need run-time checks, version the loop now.		// If we need run-time checks, version the loop now.
auto PtrToPartition = Partitions.computePartitionSetForPointers(*LAI);		auto PtrToPartition = Partitions.computePartitionSetForPointers(*LAI);
const auto *RtPtrChecking = LAI->getRuntimePointerChecking();		const auto *RtPtrChecking = LAI->getRuntimePointerChecking();
const auto &AllChecks = RtPtrChecking->getChecks();		const auto &AllChecks = RtPtrChecking->getChecks();

auto Checks = includeOnlyCrossPartitionChecks(AllChecks, PtrToPartition,		auto Checks = includeOnlyCrossPartitionChecks(AllChecks, PtrToPartition,
RtPtrChecking);		RtPtrChecking);

		// Runtime pointer checks could be quadratic in the number of pointers.
		if (Checks.size() > DistributeRuntimePointerChecksThreshold) {
		return fail("TooManyRuntimePointerChecks",
		"too many runtime pointer-alias checks needed.\n");
		}

if (LAI->hasConvergentOp() && !Checks.empty()) {		if (LAI->hasConvergentOp() && !Checks.empty()) {
return fail("RuntimeCheckWithConvergent",		return fail("RuntimeCheckWithConvergent",
"may not insert runtime check with convergent operation");		"may not insert runtime check with convergent operation");
}		}

// To keep things simple have an empty preheader before we version or clone		// To keep things simple have an empty preheader before we version or clone
// the loop. (Also split if this has no predecessor, i.e. entry, because we		// the loop. (Also split if this has no predecessor, i.e. entry, because we
// rely on PH having a predecessor.)		// rely on PH having a predecessor.)
▲ Show 20 Lines • Show All 281 Lines • Show Last 20 Lines
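
The early-exit checks that this diff adds to or changes in processLoop() can be summarised with a small stand-alone C++ model. It is only a sketch: LoopFacts, Options and firstFailure are hypothetical names, not LLVM API, and the checks that sit between these in the real pass (partition seeding, merging, the SCEV predicate thresholds, and so on) are left out. The remark names and option defaults are copied from the diff above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical summary of what the pass learns about one loop.
struct LoopFacts {
  bool CanVectorizeMemory;             // LAI->canVectorizeMemory()
  bool HasDependenceInfo;              // getDependences() returned non-null
  std::size_t NumDependences;          // Dependences->size()
  bool AnyPossiblyBackward;            // some Dep.isPossiblyBackward()
  std::size_t NumCrossPartitionChecks; // Checks.size() after filtering
};

// Stand-ins for the cl::opt values added by the patch.
struct Options {
  bool DistributeVectorizableLoops = true;
  unsigned RuntimePointerChecksThreshold = 32;
};

// Returns the remark name of the first failing check, or an empty string
// if none of the modelled checks rejects the loop.
std::string firstFailure(const LoopFacts &F, const Options &O = Options()) {
  if (F.CanVectorizeMemory && !O.DistributeVectorizableLoops)
    return "MemOpsCanBeVectorized";
  if (!F.HasDependenceInfo)
    return "NoDeps";
  // Vectorizable loop with possibly-backward dependences: leave it to the
  // loop vectorizer.
  if (F.CanVectorizeMemory && F.AnyPossiblyBackward)
    return "VectorizableDependences";
  // Not vectorizable and no dependences recorded: the analysis gave up.
  if (!F.CanVectorizeMemory && F.NumDependences == 0)
    return "NoDeps";
  if (F.NumCrossPartitionChecks > O.RuntimePointerChecksThreshold)
    return "TooManyRuntimePointerChecks";
  return "";
}
```

In this model, a vectorizable loop with no recorded dependences passes every check, which is the case the patch newly opens up for distribution.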

llvm/test/Transforms/LoopDistribute/bug-uses-outside-loop.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -loop-distribute -enable-loop-distribute --loop-distribute-merge-vectorizable-partitions=false -verify-loop-info -verify-dom-info -S < %s \
				; RUN: | FileCheck %s

				; for (i = 0; i < n; i ++) {
				; sumA += A[i]
				; =========================
				; sumIdxSq += i * i
				; }


				define i64 @f(i64 %n, i32* %a) {
				; CHECK-LABEL: @f(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br label [[ENTRY_SPLIT_LDIST1:%.*]]
				; CHECK: entry.split.ldist1:
				; CHECK-NEXT: br label [[FOR_BODY_LDIST1:%.*]]
				; CHECK: for.body.ldist1:
				; CHECK-NEXT: [[INDEX_LDIST1:%.*]] = phi i64 [ 0, [[ENTRY_SPLIT_LDIST1]] ], [ [[INDEX_NEXT_LDIST1:%.*]], [[FOR_BODY_LDIST1]] ]
				; CHECK-NEXT: [[SUMA_LDIST1:%.*]] = phi i32 [ 0, [[ENTRY_SPLIT_LDIST1]] ], [ [[SUMA_NEXT_LDIST1:%.*]], [[FOR_BODY_LDIST1]] ]
				; CHECK-NEXT: [[IDXA_LDIST1:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 [[INDEX_LDIST1]]
				; CHECK-NEXT: [[LOADA_LDIST1:%.*]] = load i32, i32* [[IDXA_LDIST1]], align 4
				; CHECK-NEXT: [[SUMA_NEXT_LDIST1]] = add nuw nsw i32 [[LOADA_LDIST1]], [[SUMA_LDIST1]]
				; CHECK-NEXT: [[INDEX_NEXT_LDIST1]] = add nuw nsw i64 [[INDEX_LDIST1]], 1
				; CHECK-NEXT: [[EXITCOND_LDIST1:%.*]] = icmp eq i64 [[INDEX_NEXT_LDIST1]], [[N:%.*]]
				; CHECK-NEXT: br i1 [[EXITCOND_LDIST1]], label [[ENTRY_SPLIT:%.*]], label [[FOR_BODY_LDIST1]]
				; CHECK: entry.split:
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY_SPLIT]] ], [ [[INDEX_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[SUMIDXSQ:%.*]] = phi i64 [ 0, [[ENTRY_SPLIT]] ], [ [[SUMIDXSQ_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[IDXSQ:%.*]] = mul i64 [[INDEX]], [[INDEX]]
				; CHECK-NEXT: [[SUMIDXSQ_NEXT]] = add nuw nsw i64 [[IDXSQ]], [[SUMIDXSQ]]
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw nsw i64 [[INDEX]], 1
				; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N]]
				; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_END:%.*]], label [[FOR_BODY]]
				; CHECK: for.end:
				; CHECK-NEXT: [[ZEXT:%.*]] = zext i32 [[SUMA_NEXT_LDIST1]] to i64
				; CHECK-NEXT: [[RET:%.*]] = add nuw nsw i64 [[ZEXT]], [[SUMIDXSQ_NEXT]]
				; CHECK-NEXT: ret i64 [[RET]]
				;
				entry:
				br label %for.body

				for.body:
				%index = phi i64 [ 0, %entry ], [ %index.next, %for.body ]
				%sumA = phi i32 [ 0, %entry ], [ %sumA.next, %for.body ]
				%sumIdxSq = phi i64 [ 0, %entry ], [ %sumIdxSq.next, %for.body ]

				%idxA = getelementptr inbounds i32, i32* %a, i64 %index
				%loadA = load i32, i32* %idxA, align 4
				%sumA.next = add nuw nsw i32 %loadA, %sumA

				%idxSq = mul i64 %index, %index
				%sumIdxSq.next = add nuw nsw i64 %idxSq, %sumIdxSq

				%index.next = add nuw nsw i64 %index, 1

				%exitcond = icmp eq i64 %index.next, %n
				br i1 %exitcond, label %for.end, label %for.body

				for.end:
				%zext = zext i32 %sumA.next to i64
				%ret = add nuw nsw i64 %zext, %sumIdxSq.next
				ret i64 %ret
				}
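
For readers less used to IR, the transformation checked by this test can be sketched in C++. This is an illustration rather than part of the patch: the function names fused and distributed are hypothetical, and the sketch uses ordinary for loops, whereas the IR loop is bottom-tested and always runs at least once. Both reductions are used after the loop, so each distributed loop has to supply its own live-out value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Original shape: one loop carries both reductions, and both results are
// used after the loop.
std::uint64_t fused(const std::vector<std::uint32_t> &A) {
  std::uint32_t SumA = 0;
  std::uint64_t SumIdxSq = 0;
  for (std::size_t I = 0; I < A.size(); ++I) {
    SumA += A[I];
    SumIdxSq += static_cast<std::uint64_t>(I) * I;
  }
  return static_cast<std::uint64_t>(SumA) + SumIdxSq;
}

// Distributed shape: the first loop (for.body.ldist1 in the test) owns the
// load and the SumA reduction, the second loop (for.body) owns SumIdxSq.
// The code after the loops must read SumA from the first loop.
std::uint64_t distributed(const std::vector<std::uint32_t> &A) {
  std::uint32_t SumA = 0;
  for (std::size_t I = 0; I < A.size(); ++I)
    SumA += A[I];

  std::uint64_t SumIdxSq = 0;
  for (std::size_t I = 0; I < A.size(); ++I)
    SumIdxSq += static_cast<std::uint64_t>(I) * I;

  return static_cast<std::uint64_t>(SumA) + SumIdxSq;
}
```

In the CHECK lines above, the same requirement shows up as the zext in for.end reading SUMA_NEXT_LDIST1, the value produced by the first distributed loop.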

llvm/test/Transforms/LoopDistribute/diagnostics-with-hotness.ll

	Show All 19 Lines
	; 4 A[i] = B[i] * C[i];			; 4 A[i] = B[i] * C[i];
	; 5 }			; 5 }
	; 6 }			; 6 }

	target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-apple-macosx10.11.0"			target triple = "x86_64-apple-macosx10.11.0"

	; HOTNESS: remark: /tmp/t.c:3:3: loop not distributed: use -Rpass-analysis=loop-distribute for more info (hotness: 300)			; HOTNESS: remark: /tmp/t.c:3:3: loop not distributed: use -Rpass-analysis=loop-distribute for more info (hotness: 300)
	; HOTNESS: remark: /tmp/t.c:3:3: loop not distributed: memory operations are safe for vectorization (hotness: 300)
	; NO_HOTNESS: remark: /tmp/t.c:3:3: loop not distributed: use -Rpass-analysis=loop-distribute for more info{{$}}			; NO_HOTNESS: remark: /tmp/t.c:3:3: loop not distributed: use -Rpass-analysis=loop-distribute for more info{{$}}
	; NO_HOTNESS: remark: /tmp/t.c:3:3: loop not distributed: memory operations are safe for vectorization{{$}}

	define void @forced(i8* %A, i8* %B, i8* %C, i32 %N) !dbg !7 !prof !22 {			define void @forced(i8* %A, i8* %B, i8* %C, i32 %N) !dbg !7 !prof !22 {
	entry:			entry:
	%cmp12 = icmp sgt i32 %N, 0, !dbg !9			%cmp12 = icmp sgt i32 %N, 0, !dbg !9
	br i1 %cmp12, label %ph, label %for.cond.cleanup, !dbg !10, !prof !23			br i1 %cmp12, label %ph, label %for.cond.cleanup, !dbg !10, !prof !23

	ph:			ph:
	br label %for.body			br label %for.body
	▲ Show 20 Lines • Show All 45 Lines • Show Last 20 Lines

llvm/test/Transforms/LoopDistribute/diagnostics.ll

Show All 30 Lines
; 17 C[i] = D[i] * E[i];		; 17 C[i] = D[i] * E[i];
; 18 }		; 18 }
; 19 }		; 19 }

target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"		target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.11.0"		target triple = "x86_64-apple-macosx10.11.0"

; MISSED_REMARKS: remark: /tmp/t.c:3:3: loop not distributed: use -Rpass-analysis=loop-distribute for more info		; MISSED_REMARKS: remark: /tmp/t.c:3:3: loop not distributed: use -Rpass-analysis=loop-distribute for more info
; ALWAYS: remark: /tmp/t.c:3:3: loop not distributed: memory operations are safe for vectorization
; ALWAYS: warning: /tmp/t.c:3:3: loop not distributed: failed explicitly specified loop distribution		; ALWAYS: warning: /tmp/t.c:3:3: loop not distributed: failed explicitly specified loop distribution

define void @forced(i8* %A, i8* %B, i8* %C, i32 %N) !dbg !7 {		define void @forced(i8* %A, i8* %B, i8* %C, i32 %N) !dbg !7 {
entry:		entry:
%cmp12 = icmp sgt i32 %N, 0, !dbg !9		%cmp12 = icmp sgt i32 %N, 0, !dbg !9
br i1 %cmp12, label %ph, label %for.cond.cleanup, !dbg !10		br i1 %cmp12, label %ph, label %for.cond.cleanup, !dbg !10

ph:		ph:
Show All 14 Lines	for.body:
br i1 %exitcond, label %for.cond.cleanup, label %for.body, !dbg !10, !llvm.loop !20		br i1 %exitcond, label %for.cond.cleanup, label %for.body, !dbg !10, !llvm.loop !20

for.cond.cleanup:		for.cond.cleanup:
ret void, !dbg !11		ret void, !dbg !11
}		}

; NO_REMARKS-NOT: remark: /tmp/t.c:9:3: loop not distributed: memory operations are safe for vectorization		; NO_REMARKS-NOT: remark: /tmp/t.c:9:3: loop not distributed: memory operations are safe for vectorization
; MISSED_REMARKS: remark: /tmp/t.c:9:3: loop not distributed: use -Rpass-analysis=loop-distribute for more info		; MISSED_REMARKS: remark: /tmp/t.c:9:3: loop not distributed: use -Rpass-analysis=loop-distribute for more info
; ANALYSIS_REMARKS: remark: /tmp/t.c:9:3: loop not distributed: memory operations are safe for vectorization
; ALWAYS-NOT: warning: /tmp/t.c:9:3: loop not distributed: failed explicitly specified loop distribution		; ALWAYS-NOT: warning: /tmp/t.c:9:3: loop not distributed: failed explicitly specified loop distribution

define void @not_forced(i8* %A, i8* %B, i8* %C, i32 %N) !dbg !22 {		define void @not_forced(i8* %A, i8* %B, i8* %C, i32 %N) !dbg !22 {
entry:		entry:
%cmp12 = icmp sgt i32 %N, 0, !dbg !23		%cmp12 = icmp sgt i32 %N, 0, !dbg !23
br i1 %cmp12, label %ph, label %for.cond.cleanup, !dbg !24		br i1 %cmp12, label %ph, label %for.cond.cleanup, !dbg !24

ph:		ph:
▲ Show 20 Lines • Show All 159 Lines • Show Last 20 Lines

llvm/test/Transforms/LoopDistribute/vectorizable-dependences.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -basic-aa -loop-distribute -enable-loop-distribute -verify-loop-info -verify-dom-info -S \
				; RUN: < %s | FileCheck %s

				@A = global [2 x [16 x [16 x i32]]] zeroinitializer
				@B = global [2 x [16 x [16 x i32]]] zeroinitializer
				@C = global [16 x [16 x i32]] zeroinitializer
				@D = global [16 x [16 x i32]] zeroinitializer

				define void @backward_vectorizable(i32 %j) {
				; CHECK-LABEL: @backward_vectorizable(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[IDXPROM1:%.]] = sext i32 [[J:%.]] to i64
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds [2 x [16 x [16 x i32]]], [2 x [16 x [16 x i32]]]* @B, i64 0, i64 0, i64 [[INDVARS_IV]], i64 [[IDXPROM1]]
				; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* [[ARRAYIDX2]], align 4
				; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds [2 x [16 x [16 x i32]]], [2 x [16 x [16 x i32]]]* @A, i64 0, i64 0, i64 [[INDVARS_IV]], i64 [[IDXPROM1]]
				; CHECK-NEXT: store i32 [[TMP0]], i32* [[ARRAYIDX6]], align 4
				; CHECK-NEXT: [[ARRAYIDX10:%.*]] = getelementptr inbounds [2 x [16 x [16 x i32]]], [2 x [16 x [16 x i32]]]* @B, i64 0, i64 1, i64 [[INDVARS_IV]], i64 [[IDXPROM1]]
				; CHECK-NEXT: [[TMP1:%.*]] = load i32, i32* [[ARRAYIDX10]], align 4
				; CHECK-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds [2 x [16 x [16 x i32]]], [2 x [16 x [16 x i32]]]* @A, i64 0, i64 1, i64 [[INDVARS_IV]], i64 [[IDXPROM1]]
				; CHECK-NEXT: store i32 [[TMP1]], i32* [[ARRAYIDX14]], align 4
				; CHECK-NEXT: [[ARRAYIDX18:%.*]] = getelementptr inbounds [16 x [16 x i32]], [16 x [16 x i32]]* @D, i64 0, i64 [[INDVARS_IV]], i64 [[IDXPROM1]]
				; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* [[ARRAYIDX18]], align 4
				; CHECK-NEXT: [[ARRAYIDX22:%.*]] = getelementptr inbounds [16 x [16 x i32]], [16 x [16 x i32]]* @C, i64 0, i64 [[INDVARS_IV]], i64 [[IDXPROM1]]
				; CHECK-NEXT: store i32 [[TMP2]], i32* [[ARRAYIDX22]], align 4
				; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 16
				; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END:%.*]], label [[FOR_BODY]]
				; CHECK: for.end:
				; CHECK-NEXT: ret void
				;
				entry:
				%idxprom1 = sext i32 %j to i64
				br label %for.body

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx2 = getelementptr inbounds [2 x [16 x [16 x i32]]], [2 x [16 x [16 x i32]]]* @B, i64 0, i64 0, i64 %indvars.iv, i64 %idxprom1
				%0 = load i32, i32* %arrayidx2, align 4
				%arrayidx6 = getelementptr inbounds [2 x [16 x [16 x i32]]], [2 x [16 x [16 x i32]]]* @A, i64 0, i64 0, i64 %indvars.iv, i64 %idxprom1
				store i32 %0, i32* %arrayidx6, align 4
				%arrayidx10 = getelementptr inbounds [2 x [16 x [16 x i32]]], [2 x [16 x [16 x i32]]]* @B, i64 0, i64 1, i64 %indvars.iv, i64 %idxprom1
				%1 = load i32, i32* %arrayidx10, align 4
				%arrayidx14 = getelementptr inbounds [2 x [16 x [16 x i32]]], [2 x [16 x [16 x i32]]]* @A, i64 0, i64 1, i64 %indvars.iv, i64 %idxprom1
				store i32 %1, i32* %arrayidx14, align 4
				%arrayidx18 = getelementptr inbounds [16 x [16 x i32]], [16 x [16 x i32]]* @D, i64 0, i64 %indvars.iv, i64 %idxprom1
				%2 = load i32, i32* %arrayidx18, align 4
				%arrayidx22 = getelementptr inbounds [16 x [16 x i32]], [16 x [16 x i32]]* @C, i64 0, i64 %indvars.iv, i64 %idxprom1
				store i32 %2, i32* %arrayidx22, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, 16
				br i1 %exitcond.not, label %for.end, label %for.body

				for.end: ; preds = %for.body
				ret void
				}

				%struct = type { [361 x i32], [361 x i32] }
				define void @backward_vectorizable2(i32 %a, i32 %b, %struct* nocapture %S) {
				; CHECK-LABEL: @backward_vectorizable2(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[SUB:%.*]] = sub i32 1, [[A:%.*]]
				; CHECK-NEXT: [[CMP_NOT12:%.*]] = icmp sgt i32 [[A]], [[B:%.*]]
				; CHECK-NEXT: br i1 [[CMP_NOT12]], label [[FOR_END:%.*]], label [[FOR_BODY_PREHEADER:%.*]]
				; CHECK: for.body.preheader:
				; CHECK-NEXT: [[TMP0:%.*]] = zext i32 [[A]] to i64
				; CHECK-NEXT: [[TMP1:%.*]] = add i32 [[B]], 1
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[TMP0]], [[FOR_BODY_PREHEADER]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[TMP2:%.*]] = trunc i64 [[INDVARS_IV]] to i32
				; CHECK-NEXT: [[ADD:%.*]] = add i32 [[SUB]], [[TMP2]]
				; CHECK-NEXT: [[IDXPROM:%.*]] = sext i32 [[ADD]] to i64
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds [[STRUCT:%.*]], %struct* [[S:%.*]], i64 0, i32 0, i64 [[IDXPROM]]
				; CHECK-NEXT: store i32 2, i32* [[ARRAYIDX]], align 4
				; CHECK-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds [[STRUCT]], %struct* [[S]], i64 0, i32 1, i64 [[IDXPROM]]
				; CHECK-NEXT: store i32 1, i32* [[ARRAYIDX4]], align 4
				; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[LFTR_WIDEIV:%.*]] = trunc i64 [[INDVARS_IV_NEXT]] to i32
				; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i32 [[TMP1]], [[LFTR_WIDEIV]]
				; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END_LOOPEXIT:%.*]], label [[FOR_BODY]]
				; CHECK: for.end.loopexit:
				; CHECK-NEXT: br label [[FOR_END]]
				; CHECK: for.end:
				; CHECK-NEXT: ret void
				;
				entry:
				%sub = sub i32 1, %a
				%cmp.not12 = icmp sgt i32 %a, %b
				br i1 %cmp.not12, label %for.end, label %for.body.preheader

				for.body.preheader: ; preds = %entry
				%0 = zext i32 %a to i64
				%1 = add i32 %b, 1
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%indvars.iv = phi i64 [ %0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%2 = trunc i64 %indvars.iv to i32
				%add = add i32 %sub, %2
				%idxprom = sext i32 %add to i64
				%arrayidx = getelementptr inbounds %struct, %struct* %S, i64 0, i32 0, i64 %idxprom
				store i32 2, i32* %arrayidx, align 4
				%arrayidx4 = getelementptr inbounds %struct, %struct* %S, i64 0, i32 1, i64 %idxprom
				store i32 1, i32* %arrayidx4, align 4
				%indvars.iv.next = add i64 %indvars.iv, 1
				%lftr.wideiv = trunc i64 %indvars.iv.next to i32
				%exitcond.not = icmp eq i32 %1, %lftr.wideiv
				br i1 %exitcond.not, label %for.end.loopexit, label %for.body

				for.end.loopexit: ; preds = %for.body
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				ret void
				}