Different targets might handle branch performance differently, so this patch allows for
targets to specify the TailDuplicateSize threshold. Said threshold defines how small a branch
can be and still be duplicated to generate straight-line code instead.
This patch also specifies said override values for the AArch64 subtarget.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I don't think I would mind merging this with D95632 to keep things in one place; don't think separating brings much value in this case.
Also adding some more folks who might care about AArch64 tuning: @fhahn, @samparker , @t.p.northover
We have seen a case for which generating straight-line code for tail blocks is really good for performance. From memory, this was a relatively small function that was called a lot of times, so that this makes a big difference.
But since this is changing a generic AArch64 codegen option/value, I think @NickGuy you need to show that this is good/bad/neutral for some more code other than our one motivation example.
I think a target hook for TailDuplicateSize makes sense though, it is easy to imagine I think that different targets would have different preferences for this.
I don't think I would mind merging this with D95632 to keep things in one place; don't think separating brings much value in this case.
Done
I don't see a problem with only performing this at -O3, and moving the override from TailDuplicator to MachineBlockPlacement doesn't seem to cause any problems (at least, as far as I could tell).
you need to show that this is good/bad/neutral for some more code other than our one motivation example.
Here are the results from the benchmarking performed with this patch; It causes an improvement in a number of cases, however due to the size of said tests, the results are rather susceptible to system noise.
LLVM Test suite, AArch64 exec_time results (cherry-picked for the most interesting/significant tests), lower is better
Benchmark | Duplication threshold = 2 | Duplication threshold = 6 | Difference % |
---|---|---|---|
test-suite :: SingleSource/Benchmarks/Misc/evalloop.test | 1.212 | 0.6376 | -47.39% |
test-suite :: MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl.test | 4.38 | 6.408 | 46.30% |
test-suite :: MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt.test | 6.408 | 4.356 | -32.02% |
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-ary2.test | 0.02 | 0.0136 | -32.00% |
test-suite :: SingleSource/Benchmarks/McGill/misr.test | 0.252 | 0.2968 | 17.78% |
test-suite :: MultiSource/Applications/ALAC/decode/alacconvert-decode.test | 0.016 | 0.0184 | 15.00% |
test-suite :: MultiSource/Benchmarks/Ptrdist/ks/ks.test | 1.7544 | 1.5 | -14.50% |
test-suite :: SingleSource/Benchmarks/Misc/himenobmtxpa.test | 1.536 | 1.3496 | -12.14% |
test-suite :: MultiSource/Applications/kimwitu++/kc.test | 0.048 | 0.0432 | -10.00% |
test-suite :: SingleSource/Benchmarks/Stanford/Towers.test | 0.0184 | 0.0168 | -8.70% |
test-suite :: MultiSource/Benchmarks/Olden/mst/mst.test | 0.1168 | 0.1072 | -8.22% |
test-suite :: MultiSource/Benchmarks/TSVC/Packing-flt/Packing-flt.test | 5.7104 | 6.1456 | 7.62% |
test-suite :: MultiSource/Applications/hexxagon/hexxagon.test | 12.4872 | 11.6384 | -6.80% |
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test | 0.0256 | 0.0272 | 6.25% |
test-suite :: MultiSource/Applications/lambda-0.1.3/lambda.test | 5.7496 | 6.1 | 6.09% |
test-suite :: MultiSource/Benchmarks/Ptrdist/anagram/anagram.test | 1.2448 | 1.32 | 6.04% |
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test | 6.048 | 5.7024 | -5.71% |
test-suite :: MultiSource/Benchmarks/Olden/bisort/bisort.test | 0.6808 | 0.648 | -4.82% |
test-suite :: MultiSource/Applications/ClamAV/clamscan.test | 0.1984 | 0.1896 | -4.44% |
test-suite :: MultiSource/Benchmarks/McCat/03-testtrie/testtrie.test | 0.0192 | 0.02 | 4.17% |
Spec results are a bit more stable, and show a similar pattern (a small delta in both directions, but overall an improvement).
Beyond the benchmarks shown in this table, no other meaningful differences were identified.
Spec2017 results, higher is better
Benchmark | Difference % |
---|---|
505.mcf_r | 0.675% |
523.xalancbmk_r | -0.24% |
557.xz_r | 0.417% |
Thanks for those results.
I think the summary of those results is: it doesn't make things worse, which is also difficult to imagine from this change, and it improves a few cases.
I am ignoring the first 3 rows of the llvm test suite results, that just looks like noise to me. The TSVC ones look like individual test cases from that vectoriser suite, are very small, and is probably just noise. I would be good to confirm that though, just have a brief look if there are any meaningful codegen difference in e.g. TSVC/Searching-dbl/Searching-dbl.test.
SPEC looks good: even tough it was not our motivating example, it does a little bit of good (or doesn't make things worse, or shows surprises).
llvm/include/llvm/CodeGen/TargetInstrInfo.h | ||
---|---|---|
1943 | Bikeshedding names: don't think we need the Override part in the name. | |
llvm/test/CodeGen/AArch64/aarch64-tail-dup-size.ll | ||
5 | Can we add a RUN line with tail-dup-size=4, which should give the same result as the O2 one. |
just have a brief look if there are any meaningful codegen difference in e.g. TSVC/Searching-dbl/Searching-dbl.test.
From what I could tell, I couldn't find any meaningful differences in that test
Like we were talking about offline, there's a chance this isn't as good as the first version that altered the taildup threshold directly - this done earlier might give more scheduling freedom and other optimization opportunities to the folded tails. It's now only done in machine block placement late in the pipeline. If this version is working that's great, but check it's still OK.
llvm/include/llvm/CodeGen/TargetInstrInfo.h | ||
---|---|---|
1943 | +1 to removing override. It's probably worth calling out that this is the threshold used in machineblockplacement now, not the one used in the other taildup passes. It seems to be called TailDupPlacementThreshold there. |
If this version is working that's great, but check it's still OK.
I haven't seen any issues so far, and in some cases I've seen a slight improvement. I've included the new Spec results, this time there are more significant changes across the suite.
Benchmark | TailDuplicator | MachineBlockPlacement |
---|---|---|
500.perlbench_r | -0.048% | -0.155% |
502.gcc_r | 0.036% | 0.234% |
505.mcf_r | 0.675% | 1.349% |
513.xalancbmk_r | -0.240% | -0.051% |
520.omnetpp_r | 0.048% | 0.164% |
525.x264_r | 0.062% | 0.185% |
557.xz_r | 0.417% | 0.296% |
Bikeshedding names: don't think we need the Override part in the name.