This is an archive of the discontinued LLVM Phabricator instance.

[BOLT] Cache-Aware Tail Duplication
ClosedPublic

Authored by spupyrev on Apr 4 2022, 10:28 AM.

Details

Summary

A new "cache-aware" strategy for tail duplication.

There are two competing goals in tail duplication: (a) On one hand, we want to
duplicate basic blocks in order to shorten the blocks and reduce the number of
executed instructions; (b) (b) On the other hand, we do not want to duplicate
too many blocks, as it would pollute i-cache and negatively affect performance.
The new strategy finds a tradeoff via a proxy objective, called the cache score,
which is designed to quantify the impact of duplication on i-cache. In the
algorithm, the blocks are duplicated only if that improves the cache score.

As a part of the diff, we have also polished the existing implementation,
adjusted logging, fixed some minor issues etc.

Applying the optimization:

llvm-bolt ... -relocs -split-functions=3 -split-all-cold -icf=1 -lite=1 \
-update-debug-sections=false -split-eh -use-gnu-stack -jump-tables=move \
-reorder-functions=hfsort -reorder-blocks=ext-tsp -tail-duplication=cache

Perf impact is measured on different revisions of the clang binary:
release_7

benchmark1: -0.4006 ± 0.3207 (win)
benchmark2: -0.5010 ± 0.2159 (win)
benchmark3:  0.0686 ± 0.2403
benchmark4: -0.0713 ± 0.2702

release_10

benchmark1:  0.1915 ± 0.3361
benchmark2: -0.1900 ± 0.1977
benchmark3: -0.0330 ± 0.1959
benchmark4: -0.2428 ± 0.1956 (win)

release_12

benchmark1:  0.3636 ± 0.3040 (regression)
benchmark2: -0.3389 ± 0.1707 (win)
benchmark3:  0.0583 ± 0.1801
benchmark4: -0.2919 ± 0.1798 (win)

The wins are from the 0.05%-0.1% reduction in the instruction count; branch
misses, i-cache/i-TLB misses also sometimes change in a fairly unpredictable
manner (possibly due to different code alignment and function ordering that
depends on function sizes).

This is also tested on two large real-world services. For the first one, we
observe 0.1%-0.2% perf win (0.1% fewer instructions); for the second one, the
performance is flat.

Diff Detail

Event Timeline

spupyrev created this revision.Apr 4 2022, 10:28 AM
Herald added a reviewer: Amir. · View Herald Transcript
Herald added a reviewer: maksfb. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
Herald added a subscriber: ayermolo. · View Herald Transcript
spupyrev requested review of this revision.Apr 4 2022, 10:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 4 2022, 10:28 AM
spupyrev edited the summary of this revision. (Show Details)Apr 4 2022, 10:40 AM
wenlei added a subscriber: wenlei.Apr 4 2022, 10:49 AM

There was a *very* brief discussion about system models on llvm-dev. Did you consider using something similar in bolt resp. llvm?
https://lists.llvm.org/pipermail/llvm-dev/2018-October/127319.html

There was a *very* brief discussion about system models on llvm-dev. Did you consider using something similar in bolt resp. llvm?
https://lists.llvm.org/pipermail/llvm-dev/2018-October/127319.html

Interesting, thanks for the pointer. Is there a discussion somewhere, i can only see a single message with RFC.

There was a *very* brief discussion about system models on llvm-dev. Did you consider using something similar in bolt resp. llvm?
https://lists.llvm.org/pipermail/llvm-dev/2018-October/127319.html

Interesting, thanks for the pointer. Is there a discussion somewhere, i can only see a single message with RFC.

I found a phab diff, but that is all I could find.
https://reviews.llvm.org/D58736

spupyrev updated this revision to Diff 425586.Apr 27 2022, 12:03 PM

rebase + a small adjustment of the default optimization flags

rafauler accepted this revision.May 31 2022, 3:02 PM

LGTM with a few comments below

bolt/lib/Passes/TailDuplication.cpp
393

By reading "jump distance" I would expect to see

JumpDistance = DstAddr - (SrcAddr + SrcSize)

as an approximation of the jump distance in a forward jump. E.g.

BlockA:  <src address>
nop
nop
jmp BlockB 
BlockB:  <src address + src size>     also  <dstaddr>
nop
nop
nop
<dstaddr + dst size>

In the example above, because jmp BlockB is a fall-through, it should calculate 0 as the distance. Using the definition I provided above, it is zero, but not in the source code. In the source code of this diff, JumpDistance would be "src size + dst size". Is this to calculate how much cache space is used by using these two blocks? What's the idea? Can you make more explicit the reasoning in the comments? From reading the comments, I got the impression that a FT should be calculated as zero distance.

402

Why are we always returning 1.0 in case of a forward jump? Is this correct? From what I read in the comments of this function, I would expect to see Prob * Count here.

This revision is now accepted and ready to land.May 31 2022, 3:02 PM
spupyrev added inline comments.May 31 2022, 3:27 PM
bolt/lib/Passes/TailDuplication.cpp
393

Thanks for reading the code so carefully! I think that distance is not the best name here and causes the confusion.

The idea is to quantify the impact of the block placement on the instruction cache. In particular, we want to distinguish between the case when BlockB is "large" and the case when it is "small" (eg a few bytes). It feels that the latter is more i-cache friendly than the former; thus, we'd want to duplicate BlockB only when it is "small". Notice that this effect won't be achieved when we compute the "correct" distance as you describe (which is 0 independently of the size of the block).

The above is of course just an unproved intuition that perform reasonably well in practice. I am happy to experiment with alternatives and/or extensions, if you have any. Also let me know if you see a better name for distance.

402

Your assumption is correct, as well as the implementation. Likely my formatting makes the code harder to read.
Do you think the following equivalent would be more readable?

if (IsForwardJump)
  return Prob * Count;
else
  return opts::TailDuplicationCacheBackwardWeight * Prob * Count;
rafauler added inline comments.May 31 2022, 4:12 PM
bolt/lib/Passes/TailDuplication.cpp
393

I think the current implementation makes sense (to quantify size and not just the jump distance) and I don't have any suggestions on improving that. I would probably try something similar. I was just a bit confused because the comment in line 105 of the header file mention that a fallthough jump will map to a 1.0 score. Maybe update the comment?

For the name, I would probably use "JumpScore", but I don't have any strong opinions on it, so if you want to keep JumpDistance, that's fine too.

402

Ohhh sorry, now I see it. You can keep the current formatting, that's fine. Whichever you prefer.

@spupyrev I don't remember if you have commit access to the LLVM repo. If you don't, let me know, and I'll commit this once you are done with modifications and give me a green signal.

spupyrev updated this revision to Diff 434035.Jun 3 2022, 8:34 AM

updated comments per review

This revision was automatically updated to reflect the committed changes.