This is a first stab at an improved cost model for loop distribution,
replacing "always merge adjacent vectorizable partitions" with something
more fine-grained.
Two new heuristics are added. First, any adjacent partitions that have
nearby memory accesses are merged. This helps in cases where we would
otherwise separate accesses to the same buffer. (In particular, this
prevents pathologically bad behaviour on (hand-)unrolled loops.)
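A minimal C++ sketch of the first heuristic, under simplifying assumptions. Each partition is reduced to a list of accesses. Each access is described by a base pointer (identified by name) and a constant byte offset. Two accesses count as "nearby" when they share a base and lie within a hypothetical 64-byte window. `Access`, `Partition`, `accessesAreNearby`, `partitionsHaveNearbyAccesses` and `mergeAdjacentIf` are illustrative names, not the data structures or functions of the actual LoopDistribute pass.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of one memory access: the base pointer it is relative
// to (identified here by name) and a constant byte offset from that base.
struct Access {
  std::string Base;
  int64_t Offset;
};

// Hypothetical model of a partition: only its memory accesses matter here.
struct Partition {
  std::vector<Access> Accesses;
};

// Assumed notion of "nearby": same base, offsets at most MaxDistance bytes
// apart. The 64-byte default is an illustrative choice, not the patch's value.
bool accessesAreNearby(const Access &A, const Access &B,
                       int64_t MaxDistance = 64) {
  assert(MaxDistance >= 0 && "distance window must be non-negative");
  return A.Base == B.Base && std::abs(A.Offset - B.Offset) <= MaxDistance;
}

// True if any access in P is nearby any access in Q.
bool partitionsHaveNearbyAccesses(const Partition &P, const Partition &Q) {
  for (const Access &A : P.Accesses)
    for (const Access &B : Q.Accesses)
      if (accessesAreNearby(A, B))
        return true;
  return false;
}

// Visit partitions in program order. Append each one to the previous
// (possibly already merged) partition if ShouldMerge says so; otherwise
// start a new partition.
std::vector<Partition> mergeAdjacentIf(
    std::vector<Partition> Parts,
    const std::function<bool(const Partition &, const Partition &)>
        &ShouldMerge) {
  std::vector<Partition> Result;
  for (Partition &P : Parts) {
    if (!Result.empty() && ShouldMerge(Result.back(), P)) {
      std::vector<Access> &Dst = Result.back().Accesses;
      Dst.insert(Dst.end(), P.Accesses.begin(), P.Accesses.end());
    } else {
      Result.push_back(std::move(P));
    }
  }
  return Result;
}
```

Consider a hand-unrolled body that accesses `A` at offsets 0 and 8, then `B` at offsets 0 and 4096. `mergeAdjacentIf(Parts, partitionsHaveNearbyAccesses)` folds the two `A` partitions into one, while the two distant `B` partitions stay separate. Each partition is compared against the already-merged predecessor, so a whole chain of nearby accesses collapses into a single partition.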
Second, any partition that is too small is merged with its neighbours.
This should help keep ILP and MLP (instruction- and memory-level
parallelism) high. Currently, any partition without loads or stores is
considered "too small", but I expect this criterion will need more tuning.
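A minimal C++ sketch of the second heuristic, using the current definition of "too small" (no loads or stores). Partitions are reduced to two counters. Which neighbour absorbs a small partition is an assumption here: the predecessor when one exists, otherwise the partition that follows. `Partition`, `isTooSmall`, `mergeInto` and `mergeTooSmall` are hypothetical names, not code from the patch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of a partition, reduced to two counts.
// NumInsts is tracked only to show how partitions combine.
struct Partition {
  unsigned NumInsts;   // all instructions in the partition
  unsigned NumMemOps;  // loads and stores among them
};

// Current definition from the description: no loads/stores means "too small".
bool isTooSmall(const Partition &P) { return P.NumMemOps == 0; }

// Combine two partitions by summing their counts.
Partition mergeInto(Partition Into, const Partition &From) {
  Into.NumInsts += From.NumInsts;
  Into.NumMemOps += From.NumMemOps;
  return Into;
}

// Fold every too-small partition into a neighbour. Assumption: it joins its
// predecessor if it has one. A too-small leading partition instead absorbs
// the partitions after it until the merged result contains a load or store.
std::vector<Partition> mergeTooSmall(const std::vector<Partition> &Parts) {
  std::vector<Partition> Result;
  for (const Partition &P : Parts) {
    // Result.back() can only be too small if it is the leading partition,
    // because later too-small partitions are never pushed on their own.
    if (!Result.empty() && (isTooSmall(P) || isTooSmall(Result.back())))
      Result.back() = mergeInto(Result.back(), P);
    else
      Result.push_back(P);
  }
  assert(Result.size() <= Parts.size() && "merging never adds partitions");
  return Result;
}
```

Take partitions with (instructions, memory ops) = (3, 0), (5, 2), (2, 0), (4, 1). The leading arithmetic-only partition joins its successor, and the third partition joins its predecessor. That leaves two partitions, (10, 2) and (4, 1). Keeping `isTooSmall` as a separate predicate leaves room for the tuning mentioned above.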
This seems to give reasonable results, with some outliers that I still
need to look into. From the test suite:
delta = change in exec time vs no loop distribution, lower is better
(e.g. 0.864 means +86.4% exec time).

delta   #loop-dist  benchmark
 0.864           1  SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gesummv/gesummv.test
 0.265           3  SingleSource/Benchmarks/Stanford/Bubblesort.test
 0.214           2  SingleSource/Benchmarks/Polybench/linear-algebra/solvers/durbin/durbin.test
 0.144           2  SingleSource/Benchmarks/Polybench/linear-algebra/kernels/bicg/bicg.test
 0.108           1  SingleSource/Benchmarks/Misc/fp-convert.test
 0.091           2  SingleSource/Benchmarks/Stanford/Treesort.test
 0.081           1  SingleSource/Benchmarks/CoyoteBench/fftbench.test
 0.08            1  MultiSource/Applications/hbd/hbd.test
 0.062           1  MultiSource/Benchmarks/TSVC/Equivalencing-dbl/Equivalencing-dbl.test
 0.06            2  SingleSource/Benchmarks/Polybench/linear-algebra/kernels/syr2k/syr2k.test
 0.046           3  SingleSource/Benchmarks/Stanford/Quicksort.test
 0.042           1  MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.test
 0.042           1  MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl.test
 0.04            9  MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test
 0.031           4  MultiSource/Benchmarks/MallocBench/espresso/espresso.test
 0.029           1  MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt.test
 0.023           1  MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test
 0.022           1  MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.test
 0.021           1  MultiSource/Benchmarks/VersaBench/bmm/bmm.test
 0.02            1  SingleSource/Benchmarks/McGill/queens.test
 0.019           1  MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.test
 0.018           1  SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.test
 0.017           2  MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test
 0.016           1  MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl.test
 0.016           1  SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm/trmm.test
 0.015           1  MultiSource/Benchmarks/TSVC/ControlLoops-flt/ControlLoops-flt.test
 0.015           1  MultiSource/Benchmarks/TSVC/GlobalDataFlow-dbl/GlobalDataFlow-dbl.test
 0.013           2  MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/HACCKernels.test
 0.011           1  SingleSource/Benchmarks/Polybench/linear-algebra/kernels/doitgen/doitgen.test
 0.011           1  MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.test
 0.007           4  SingleSource/Benchmarks/Polybench/stencils/fdtd-apml/fdtd-apml.test
 0.006           1  SingleSource/Benchmarks/Polybench/linear-algebra/kernels/syrk/syrk.test
 0.006           1  MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.test
 0.005           1  MultiSource/Benchmarks/TSVC/InductionVariable-flt/InductionVariable-flt.test
 0.004           8  MultiSource/Benchmarks/Bullet/bullet.test
 0.004           6  MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test
 0.004           6  MultiSource/Applications/oggenc/oggenc.test
 0.004           2  SingleSource/Benchmarks/Polybench/stencils/adi/adi.test
 0.004           1  MultiSource/Benchmarks/TSVC/Equivalencing-flt/Equivalencing-flt.test
 0.003           3  MultiSource/Benchmarks/DOE-ProxyApps-C/SimpleMOC/SimpleMOC.test
 0.002           1  MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test
 0.002           1  MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.test
 0.002           1  MultiSource/Benchmarks/TSVC/IndirectAddressing-flt/IndirectAddressing-flt.test
 0.001          11  MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test
 0.001           4  MultiSource/Benchmarks/McCat/04-bisect/bisect.test
 0.001           2  SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog.test
 0.001           1  MultiSource/Applications/SPASS/SPASS.test
 0              14  MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test
 0               9  MultiSource/Benchmarks/Prolangs-C/agrep/agrep.test
 0               1  MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl.test
 0               1  MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt.test
-0.001          16  MultiSource/Benchmarks/7zip/7zip-benchmark.test
-0.001           1  MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test
-0.001           1  MultiSource/Benchmarks/TSVC/Reductions-dbl/Reductions-dbl.test
-0.001           1  MultiSource/Benchmarks/TSVC/Reductions-flt/Reductions-flt.test
-0.001           1  MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt.test
-0.001           1  MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt.test
-0.002           6  MultiSource/Applications/JM/lencod/lencod.test
-0.002           1  MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.test
-0.002           1  MultiSource/Applications/viterbi/viterbi.test
-0.004           1  MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt.test
-0.004           1  MultiSource/Benchmarks/TSVC/StatementReordering-flt/StatementReordering-flt.test
-0.005           1  MultiSource/Benchmarks/TSVC/Packing-flt/Packing-flt.test
-0.005           1  MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt.test
-0.005           1  SingleSource/Benchmarks/Linpack/linpack-pc.test
-0.006           3  MultiSource/Benchmarks/ASC_Sequoia/CrystalMk/CrystalMk.test
-0.006           1  MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl.test
-0.007           1  MultiSource/Benchmarks/TSVC/GlobalDataFlow-flt/GlobalDataFlow-flt.test
-0.008           2  MultiSource/Benchmarks/VersaBench/beamformer/beamformer.test
-0.008           1  SingleSource/Benchmarks/Polybench/stencils/fdtd-2d/fdtd-2d.test
-0.009           1  MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl.test
-0.011          16  MultiSource/Applications/JM/ldecod/ldecod.test
-0.011           1  MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-dbl.test
-0.013           6  MultiSource/Benchmarks/sim/sim.test
-0.013           1  MultiSource/Benchmarks/TSVC/Packing-dbl/Packing-dbl.test
-0.014           3  MultiSource/Applications/sqlite3/sqlite3.test
-0.014           2  MultiSource/Applications/ClamAV/clamscan.test
-0.014           1  MultiSource/Benchmarks/TSVC/ControlLoops-dbl/ControlLoops-dbl.test
-0.019           1  MultiSource/Benchmarks/TSVC/IndirectAddressing-dbl/IndirectAddressing-dbl.test
-0.02           78  MultiSource/Benchmarks/mafft/pairlocalalign.test
-0.02            2  SingleSource/Benchmarks/Polybench/linear-algebra/solvers/gramschmidt/gramschmidt.test
-0.024           4  MultiSource/Benchmarks/FreeBench/pcompress2/pcompress2.test
-0.027          14  MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test
-0.027           2  MultiSource/Applications/obsequi/Obsequi.test
-0.029           1  MultiSource/Benchmarks/TSVC/Symbolics-dbl/Symbolics-dbl.test
-0.03            2  SingleSource/Benchmarks/Misc-C++/oopack_v1p8.test
-0.031           1  MultiSource/Benchmarks/TSVC/InductionVariable-dbl/InductionVariable-dbl.test
-0.065           1  MultiSource/Benchmarks/TSVC/LoopRestructuring-dbl/LoopRestructuring-dbl.test
-0.08            1  SingleSource/Benchmarks/Polybench/linear-algebra/kernels/cholesky/cholesky.test
-0.082           2  SingleSource/Benchmarks/Polybench/stencils/jacobi-2d-imper/jacobi-2d-imper.test
-0.125           3  MultiSource/Benchmarks/MiBench/telecomm-FFT/telecomm-fft.test
-0.151           3  MultiSource/Benchmarks/MallocBench/gs/gs.test
-1               2  SingleSource/Benchmarks/Polybench/stencils/jacobi-1d-imper/jacobi-1d-imper.test