This is an archive of the discontinued LLVM Phabricator instance.

[RFC] Improve loop distribute cost model
Needs ReviewPublic

Authored by sanwou01 on Apr 13 2021, 5:56 AM.

Details

Summary

This is a first stab at an improved cost model for loop distribution,
replacing "always merge adjacent vectorizable partitions" with something
more fine-grained.

Two new heuristics are added. First, any adjacent partitions that have
nearby memory accesses are merged. This helps in cases where we would
otherwise separate accesses to the same buffer. (In particular, this
prevents pathologically bad behaviour on (hand-)unrolled loops.)

Second, any partition that is too small is merged with its neighbours.
This should help to keep ILP and MLP high. Currently, any partition
without load/stores is considered "too small", but I expect that this
will need some more tuning.

This seems to give reasonable results with some outliers that I need to
look at more. From the test suite:

                                                                                           delta exec time
benchmark                                                                      #loop-dist  (lower is better)
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gesummv/gesummv.test          1   0.864    (i.e. +86.4% exec time vs no loop distribute)
SingleSource/Benchmarks/Stanford/Bubblesort.test                                       3   0.265
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/durbin/durbin.test            2   0.214
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/bicg/bicg.test                2   0.144
SingleSource/Benchmarks/Misc/fp-convert.test                                           1   0.108
SingleSource/Benchmarks/Stanford/Treesort.test                                         2   0.091
SingleSource/Benchmarks/CoyoteBench/fftbench.test                                      1   0.081
MultiSource/Applications/hbd/hbd.test                                                  1   0.08
MultiSource/Benchmarks/TSVC/Equivalencing-dbl/Equivalencing-dbl.test                   1   0.062
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/syr2k/syr2k.test              2   0.06
SingleSource/Benchmarks/Stanford/Quicksort.test                                        3   0.046
MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.test                  1   0.042
MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl.test             1   0.042
MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test                            9   0.04
MultiSource/Benchmarks/MallocBench/espresso/espresso.test                              4   0.031
MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt.test                           1   0.029
MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test         1   0.023
MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.test         1   0.022
MultiSource/Benchmarks/VersaBench/bmm/bmm.test                                         1   0.021
SingleSource/Benchmarks/McGill/queens.test                                             1   0.02
MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.test                       1   0.019
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.test                1   0.018
MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test                            2   0.017
MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl.test                   1   0.016
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm/trmm.test                1   0.016
MultiSource/Benchmarks/TSVC/ControlLoops-flt/ControlLoops-flt.test                     1   0.015
MultiSource/Benchmarks/TSVC/GlobalDataFlow-dbl/GlobalDataFlow-dbl.test                 1   0.015
MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/HACCKernels.test                  2   0.013
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/doitgen/doitgen.test          1   0.011
MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.test       1   0.011
SingleSource/Benchmarks/Polybench/stencils/fdtd-apml/fdtd-apml.test                    4   0.007
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/syrk/syrk.test                1   0.006
MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.test                   1   0.006
MultiSource/Benchmarks/TSVC/InductionVariable-flt/InductionVariable-flt.test           1   0.005
MultiSource/Benchmarks/Bullet/bullet.test                                              8   0.004
MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test                              6   0.004
MultiSource/Applications/oggenc/oggenc.test                                            6   0.004
SingleSource/Benchmarks/Polybench/stencils/adi/adi.test                                2   0.004
MultiSource/Benchmarks/TSVC/Equivalencing-flt/Equivalencing-flt.test                   1   0.004
MultiSource/Benchmarks/DOE-ProxyApps-C/SimpleMOC/SimpleMOC.test                        3   0.003
MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test                       1   0.002
MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.test                                    1   0.002
MultiSource/Benchmarks/TSVC/IndirectAddressing-flt/IndirectAddressing-flt.test         1   0.002
MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test                                11  0.001
MultiSource/Benchmarks/McCat/04-bisect/bisect.test                                     4   0.001
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog.test          2   0.001
MultiSource/Applications/SPASS/SPASS.test                                              1   0.001
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test                              14  0
MultiSource/Benchmarks/Prolangs-C/agrep/agrep.test                                     9   0
MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl.test                           1   0
MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt.test                           1   0
MultiSource/Benchmarks/7zip/7zip-benchmark.test                                        16  -0.001
MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test                        1   -0.001
MultiSource/Benchmarks/TSVC/Reductions-dbl/Reductions-dbl.test                         1   -0.001
MultiSource/Benchmarks/TSVC/Reductions-flt/Reductions-flt.test                         1   -0.001
MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt.test           1   -0.001
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt.test                   1   -0.001
MultiSource/Applications/JM/lencod/lencod.test                                         6   -0.002
MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.test                       1   -0.002
MultiSource/Applications/viterbi/viterbi.test                                          1   -0.002
MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt.test                           1   -0.004
MultiSource/Benchmarks/TSVC/StatementReordering-flt/StatementReordering-flt.test       1   -0.004
MultiSource/Benchmarks/TSVC/Packing-flt/Packing-flt.test                               1   -0.005
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt.test             1   -0.005
SingleSource/Benchmarks/Linpack/linpack-pc.test                                        1   -0.005
MultiSource/Benchmarks/ASC_Sequoia/CrystalMk/CrystalMk.test                            3   -0.006
MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl.test                   1   -0.006
MultiSource/Benchmarks/TSVC/GlobalDataFlow-flt/GlobalDataFlow-flt.test                 1   -0.007
MultiSource/Benchmarks/VersaBench/beamformer/beamformer.test                           2   -0.008
SingleSource/Benchmarks/Polybench/stencils/fdtd-2d/fdtd-2d.test                        1   -0.008
MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl.test                           1   -0.009
MultiSource/Applications/JM/ldecod/ldecod.test                                         16  -0.011
MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-dbl.test                       1   -0.011
MultiSource/Benchmarks/sim/sim.test                                                    6   -0.013
MultiSource/Benchmarks/TSVC/Packing-dbl/Packing-dbl.test                               1   -0.013
MultiSource/Applications/sqlite3/sqlite3.test                                          3   -0.014
MultiSource/Applications/ClamAV/clamscan.test                                          2   -0.014
MultiSource/Benchmarks/TSVC/ControlLoops-dbl/ControlLoops-dbl.test                     1   -0.014
MultiSource/Benchmarks/TSVC/IndirectAddressing-dbl/IndirectAddressing-dbl.test         1   -0.019
MultiSource/Benchmarks/mafft/pairlocalalign.test                                       78  -0.02
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/gramschmidt/gramschmidt.test  2   -0.02
MultiSource/Benchmarks/FreeBench/pcompress2/pcompress2.test                            4   -0.024
MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test                        14  -0.027
MultiSource/Applications/obsequi/Obsequi.test                                          2   -0.027
MultiSource/Benchmarks/TSVC/Symbolics-dbl/Symbolics-dbl.test                           1   -0.029
SingleSource/Benchmarks/Misc-C++/oopack_v1p8.test                                      2   -0.03
MultiSource/Benchmarks/TSVC/InductionVariable-dbl/InductionVariable-dbl.test           1   -0.031
MultiSource/Benchmarks/TSVC/LoopRestructuring-dbl/LoopRestructuring-dbl.test           1   -0.065
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/cholesky/cholesky.test        1   -0.08
SingleSource/Benchmarks/Polybench/stencils/jacobi-2d-imper/jacobi-2d-imper.test        2   -0.082
MultiSource/Benchmarks/MiBench/telecomm-FFT/telecomm-fft.test                          3   -0.125
MultiSource/Benchmarks/MallocBench/gs/gs.test                                          3   -0.151
SingleSource/Benchmarks/Polybench/stencils/jacobi-1d-imper/jacobi-1d-imper.test        2   -1

Diff Detail

Event Timeline

sanwou01 created this revision.Apr 13 2021, 5:56 AM
sanwou01 requested review of this revision.Apr 13 2021, 5:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 13 2021, 5:56 AM
lebedev.ri edited the summary of this revision. (Show Details)Apr 13 2021, 6:02 AM

That certainly looks encouraging. Just some remarks about the perf numbers first. I found that the llvm test suite can be quite noisy and you certainly need to restrict it to the subset of CTMark for some more meaningful numbers. Just checking, did you do this? Because looking at all tests, I think I see more tests than I would expect with CTMark, but I could be wrong. How about a SPEC run for something that runs a bit longer?

xbolva00 added a comment.EditedApr 13 2021, 6:47 AM

loop dist in hmmer should always work :) and some improvements in TSVC should be expected as well.

https://github.com/UoB-HPC/TSVC_2/blob/master/src/tsvc.c#L1797

Since test suite can be quite noisy, can you check and share for additionally vectorized loops wrt new cost model?

loop dist in hmmer should always work :)

Ha, true! Perhaps check that the expected gain is still there then. ;-)
But yeah, I basically want to say if we could check this with some other non/less noisy suite, so like this suggestion:

and some improvements in TSVC should be expected as well.

https://github.com/UoB-HPC/TSVC_2/blob/master/src/tsvc.c#L1797

Since test suite can be quite noisy, can you check and share for additionally vectorized loops wrt new cost model?

That certainly looks encouraging. Just some remarks about the perf numbers first. I found that the llvm test suite can be quite noisy and you certainly need to restrict it to the subset of CTMark for some more meaningful numbers. Just checking, did you do this? Because looking at all tests, I think I see more tests than I would expect with CTMark, but I could be wrong. How about a SPEC run for something that runs a bit longer?

Yeah, the test suite is fairly noisy, even with -DTEST_SUITE_BENCHMARKING_ONLY=On). CTMark is mostly for compile time, isn't it? I don't think it helps much in terms of noise.

In any case, I've only included benchmarks which have distributed loops, but there are changes (noise) of similar size in other tests too. IIRC SPEC was neutral except for the expected gain on hmmer, but I'll re-run them with the current patch.

loop dist in hmmer should always work :) and some improvements in TSVC should be expected as well.

https://github.com/UoB-HPC/TSVC_2/blob/master/src/tsvc.c#L1797

I don't think we pick that one up (yet), because it's not the inner loop that needs distribution, which is a limitation of the current pass. It looks like there are a few more in TSVC, so I'll have a look at those.

Since test suite can be quite noisy, can you check and share for additionally vectorized loops wrt new cost model?

Good idea, I'll get some numbers for that. Loop distribution increases the total number of loops as well, so you'd naturally expect the number of vectorized loops to go up, so it'd be interesting to compare the change in number of loops to the number of *vectorized* loops.

fhahn added a comment.Apr 14 2021, 6:25 AM

We also have some target-specific heuristics for loop-distribute, which focus on the number of memory streams a CPU can handle IIRC. I never got around posting them upstream so far. Let me go back and look at those heuristics.

We also have some target-specific heuristics for loop-distribute, which focus on the number of memory streams a CPU can handle IIRC. I never got around posting them upstream so far. Let me go back and look at those heuristics.

That would be awesome, thanks!

IIRC SPEC was neutral except for the expected gain on hmmer, but I'll re-run them with the current patch.

SPEC (AArch64, LTO) isn't where I'd like it to be just yet. hmmer's score is up 20% (vs no loop distribute), but the old loop distribute saw +45%, so looks like something isn't quite right. The other surprise is 2006 fprate 433.milc which is down 19%. I'll have a look at both of them.

nikic resigned from this revision.Apr 16 2021, 1:30 PM

Looking at TSVC a bit, @xbolva00 :

  • s221 won't distribute because the second read of a[i] is removed by EarlyCSE, so there is no unique load instruction for a second loop. For this loop I'm not convinced that distribution is likely to help performance; it's a trade-off between (some) vectorization and re-loading both a[i] and d[i].
  • s222 also gets mangled by EarlyCSE, but the result would still be distributable if it weren't for the order of the stores to e[i] and a[i]. This runs into a limitation of LoopAccessAnalysis, which can't reorder instructions. Perhaps it could help to do a bit of scheduling on IR?
  • s2275 as mentioned above, this runs into another LoopAccessAnalysis limitation: it only handles innermost loops. I'm not sure how easy (if at all possible) it would be to lift that restriction.

So, unfortunately, it looks like we can't handle these loops without some pretty fundamental changes to LoopAccessAnalysis. Thoughts?

Thank you for the detailed analysis!

I think it would be good to share this analysis on llvm-dev too and ask community about LoopAccessAnalysis.

cc @fhahn @reames as they work on loop optimizations

fhahn added a comment.Apr 21 2021, 5:32 AM

Looking at TSVC a bit, @xbolva00 :

  • s221 won't distribute because the second read of a[i] is removed by EarlyCSE, so there is no unique load instruction for a second loop. For this loop I'm not convinced that distribution is likely to help performance; it's a trade-off between (some) vectorization and re-loading both a[i] and d[i].
  • s222 also gets mangled by EarlyCSE, but the result would still be distributable if it weren't for the order of the stores to e[i] and a[i]. This runs into a limitation of LoopAccessAnalysis, which can't reorder instructions. Perhaps it could help to do a bit of scheduling on IR?
  • s2275 as mentioned above, this runs into another LoopAccessAnalysis limitation: it only handles innermost loops. I'm not sure how easy (if at all possible) it would be to lift that restriction.

So, unfortunately, it looks like we can't handle these loops without some pretty fundamental changes to LoopAccessAnalysis. Thoughts?

Yes, those are known limitations and I would recommend focusing on showing loop-distribute's value with current LAA.

(snip)

So, unfortunately, it looks like we can't handle these loops without some pretty fundamental changes to LoopAccessAnalysis. Thoughts?

Yes, those are known limitations and I would recommend focusing on showing loop-distribute's value with current LAA.

On a general comment, I highly recommend trying to be incremental and fixing one set of issues at a time. If you can show benefit with current LAA, do that. If you can't and need to change LAA in some way, go do that and show the benefit on it's own. Trying to do too many things at once is recipe for failure.

(p.s. I know very little about the specific code in loop-distribute. This is just a general comment.)

Thanks for the comments. I'm happy to leave LoopAccessAnalysis alone for now and focus on Loop Distribute's cost model. I was hoping that a cost model tweak or two might enable some more loop distribution in TSVC, but it looks like that isn't the case.

Matt added a subscriber: Matt.Apr 22 2021, 6:24 AM

Maybe worth to check gcc’s test cases in gcc.dg/tree-ssa/ldist-*.c ?

lebedev.ri resigned from this revision.Jan 12 2023, 5:23 PM

This review seems to be stuck/dead, consider abandoning if no longer relevant.

Herald added a project: Restricted Project. · View Herald TranscriptJan 12 2023, 5:23 PM
Herald added a subscriber: StephenFan. · View Herald Transcript
Allen added a subscriber: Allen.Aug 9 2023, 4:41 AM