A new "cache-aware" strategy for tail duplication.
There are two competing goals in tail duplication: (a) On one hand, we want to
duplicate basic blocks in order to shorten the blocks and reduce the number of
executed instructions; (b) (b) On the other hand, we do not want to duplicate
too many blocks, as it would pollute i-cache and negatively affect performance.
The new strategy finds a tradeoff via a proxy objective, called the cache score,
which is designed to quantify the impact of duplication on i-cache. In the
algorithm, the blocks are duplicated only if that improves the cache score.
As a part of the diff, we have also polished the existing implementation,
adjusted logging, fixed some minor issues etc.
Applying the optimization:
llvm-bolt ... -relocs -split-functions=3 -split-all-cold -icf=1 -lite=1 \ -update-debug-sections=false -split-eh -use-gnu-stack -jump-tables=move \ -reorder-functions=hfsort -reorder-blocks=ext-tsp -tail-duplication=cache
Perf impact is measured on different revisions of the clang binary:
release_7
benchmark1: -0.4006 ± 0.3207 (win) benchmark2: -0.5010 ± 0.2159 (win) benchmark3: 0.0686 ± 0.2403 benchmark4: -0.0713 ± 0.2702
release_10
benchmark1: 0.1915 ± 0.3361 benchmark2: -0.1900 ± 0.1977 benchmark3: -0.0330 ± 0.1959 benchmark4: -0.2428 ± 0.1956 (win)
release_12
benchmark1: 0.3636 ± 0.3040 (regression) benchmark2: -0.3389 ± 0.1707 (win) benchmark3: 0.0583 ± 0.1801 benchmark4: -0.2919 ± 0.1798 (win)
The wins are from the 0.05%-0.1% reduction in the instruction count; branch
misses, i-cache/i-TLB misses also sometimes change in a fairly unpredictable
manner (possibly due to different code alignment and function ordering that
depends on function sizes).
This is also tested on two large real-world services. For the first one, we
observe 0.1%-0.2% perf win (0.1% fewer instructions); for the second one, the
performance is flat.
By reading "jump distance" I would expect to see
as an approximation of the jump distance in a forward jump. E.g.
In the example above, because jmp BlockB is a fall-through, it should calculate 0 as the distance. Using the definition I provided above, it is zero, but not in the source code. In the source code of this diff, JumpDistance would be "src size + dst size". Is this to calculate how much cache space is used by using these two blocks? What's the idea? Can you make more explicit the reasoning in the comments? From reading the comments, I got the impression that a FT should be calculated as zero distance.