This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen][AArch64] Add TargetInstrInfo hook to modify the TailDuplicateSize default threshold
ClosedPublic

Authored by NickGuy on Jan 28 2021, 9:56 AM.

Details

Summary

Different targets might handle branch performance differently, so this patch allows for
targets to specify the TailDuplicateSize threshold. Said threshold defines how small a branch
can be and still be duplicated to generate straight-line code instead.
This patch also specifies said override values for the AArch64 subtarget.

Diff Detail

Event Timeline

NickGuy created this revision.Jan 28 2021, 9:56 AM
NickGuy requested review of this revision.Jan 28 2021, 9:56 AM

I don't think I would mind merging this with D95632 to keep things in one place; don't think separating brings much value in this case.

Also adding some more folks who might care about AArch64 tuning: @fhahn, @samparker , @t.p.northover

We have seen a case for which generating straight-line code for tail blocks is really good for performance. From memory, this was a relatively small function that was called a lot of times, so that this makes a big difference.
But since this is changing a generic AArch64 codegen option/value, I think @NickGuy you need to show that this is good/bad/neutral for some more code other than our one motivation example.

I think a target hook for TailDuplicateSize makes sense though, it is easy to imagine I think that different targets would have different preferences for this.

NickGuy updated this revision to Diff 320791.Feb 2 2021, 8:01 AM

I don't think I would mind merging this with D95632 to keep things in one place; don't think separating brings much value in this case.

Done

This sounds OK to me, but 6 is quite a big jump up from 2. Is is possible to make the normal threshold 2 or 3, and the aggressive threshold 6?
Oh. It looks like there is already an aggressive limit. TailDuplicator is either used by itself or in MachineBlockPlacement. MachineBlockPlacement already sets the limit to 4 when optimizing aggressively. Would it work for your usecase to alter the value there based on the target instead?

I don't see a problem with only performing this at -O3, and moving the override from TailDuplicator to MachineBlockPlacement doesn't seem to cause any problems (at least, as far as I could tell).

you need to show that this is good/bad/neutral for some more code other than our one motivation example.

Here are the results from the benchmarking performed with this patch; It causes an improvement in a number of cases, however due to the size of said tests, the results are rather susceptible to system noise.

LLVM Test suite, AArch64 exec_time results (cherry-picked for the most interesting/significant tests), lower is better

BenchmarkDuplication threshold = 2Duplication threshold = 6Difference %
test-suite :: SingleSource/Benchmarks/Misc/evalloop.test1.2120.6376-47.39%
test-suite :: MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl.test4.386.40846.30%
test-suite :: MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt.test6.4084.356-32.02%
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-ary2.test0.020.0136-32.00%
test-suite :: SingleSource/Benchmarks/McGill/misr.test0.2520.296817.78%
test-suite :: MultiSource/Applications/ALAC/decode/alacconvert-decode.test0.0160.018415.00%
test-suite :: MultiSource/Benchmarks/Ptrdist/ks/ks.test1.75441.5-14.50%
test-suite :: SingleSource/Benchmarks/Misc/himenobmtxpa.test1.5361.3496-12.14%
test-suite :: MultiSource/Applications/kimwitu++/kc.test0.0480.0432-10.00%
test-suite :: SingleSource/Benchmarks/Stanford/Towers.test0.01840.0168-8.70%
test-suite :: MultiSource/Benchmarks/Olden/mst/mst.test0.11680.1072-8.22%
test-suite :: MultiSource/Benchmarks/TSVC/Packing-flt/Packing-flt.test5.71046.14567.62%
test-suite :: MultiSource/Applications/hexxagon/hexxagon.test12.487211.6384-6.80%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test0.02560.02726.25%
test-suite :: MultiSource/Applications/lambda-0.1.3/lambda.test5.74966.16.09%
test-suite :: MultiSource/Benchmarks/Ptrdist/anagram/anagram.test1.24481.326.04%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test6.0485.7024-5.71%
test-suite :: MultiSource/Benchmarks/Olden/bisort/bisort.test0.68080.648-4.82%
test-suite :: MultiSource/Applications/ClamAV/clamscan.test0.19840.1896-4.44%
test-suite :: MultiSource/Benchmarks/McCat/03-testtrie/testtrie.test0.01920.024.17%

Spec results are a bit more stable, and show a similar pattern (a small delta in both directions, but overall an improvement).
Beyond the benchmarks shown in this table, no other meaningful differences were identified.

Spec2017 results, higher is better

BenchmarkDifference %
505.mcf_r0.675%
523.xalancbmk_r-0.24%
557.xz_r0.417%

Thanks for those results.

I think the summary of those results is: it doesn't make things worse, which is also difficult to imagine from this change, and it improves a few cases.
I am ignoring the first 3 rows of the llvm test suite results, that just looks like noise to me. The TSVC ones look like individual test cases from that vectoriser suite, are very small, and is probably just noise. I would be good to confirm that though, just have a brief look if there are any meaningful codegen difference in e.g. TSVC/Searching-dbl/Searching-dbl.test.

SPEC looks good: even tough it was not our motivating example, it does a little bit of good (or doesn't make things worse, or shows surprises).

llvm/include/llvm/CodeGen/TargetInstrInfo.h
1943

Bikeshedding names: don't think we need the Override part in the name.

llvm/test/CodeGen/AArch64/aarch64-tail-dup-size.ll
4

Can we add a RUN line with tail-dup-size=4, which should give the same result as the O2 one.

NickGuy updated this revision to Diff 321099.Feb 3 2021, 7:32 AM
NickGuy marked 2 inline comments as done.
NickGuy retitled this revision from [TailDuplicator] Add TargetInstrInfo hook to modify the TailDuplicateSize default threshold to [CodeGen][AArch64] Add TargetInstrInfo hook to modify the TailDuplicateSize default threshold.
NickGuy edited the summary of this revision. (Show Details)

just have a brief look if there are any meaningful codegen difference in e.g. TSVC/Searching-dbl/Searching-dbl.test.

From what I could tell, I couldn't find any meaningful differences in that test

Like we were talking about offline, there's a chance this isn't as good as the first version that altered the taildup threshold directly - this done earlier might give more scheduling freedom and other optimization opportunities to the folded tails. It's now only done in machine block placement late in the pipeline. If this version is working that's great, but check it's still OK.

llvm/include/llvm/CodeGen/TargetInstrInfo.h
1943

+1 to removing override. It's probably worth calling out that this is the threshold used in machineblockplacement now, not the one used in the other taildup passes. It seems to be called TailDupPlacementThreshold there.

NickGuy updated this revision to Diff 321760.Feb 5 2021, 7:47 AM
NickGuy marked an inline comment as done.

If this version is working that's great, but check it's still OK.

I haven't seen any issues so far, and in some cases I've seen a slight improvement. I've included the new Spec results, this time there are more significant changes across the suite.

BenchmarkTailDuplicatorMachineBlockPlacement
500.perlbench_r-0.048%-0.155%
502.gcc_r0.036%0.234%
505.mcf_r0.675%1.349%
513.xalancbmk_r-0.240%-0.051%
520.omnetpp_r0.048%0.164%
525.x264_r0.062%0.185%
557.xz_r0.417%0.296%
dmgreen accepted this revision.Feb 8 2021, 12:44 AM

OK. Thanks for checking. In that case LGTM.

This revision is now accepted and ready to land.Feb 8 2021, 12:44 AM
This revision was landed with ongoing or failed builds.Feb 8 2021, 5:28 AM
This revision was automatically updated to reflect the committed changes.