This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][CodeGen] Avoid inverting hot branches during relaxation
ClosedPublic

Authored by dhoekwater on Aug 1 2023, 5:21 PM.

Details

Summary

Current behavior for relaxing out-of-range conditional branches
is to invert the conditional and insert a fallthrough unconditional
branch to the original destination. This approach biases the branch
predictor in the wrong direction, which can degrading performance.

Machine function splitting introduces many rarely-taken cross-section
conditional branches, which are improperly relaxed. Avoid inverting
these branches; instead, retarget them to trampolines at the end of the
function. Doing so increases the runtime cost of jumping to cold code
but eliminates the misprediction cost of jumping to hot code.

Diff Detail

Event Timeline

dhoekwater created this revision.Aug 1 2023, 5:21 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 1 2023, 5:21 PM

Add exhaustive tests

Remove debug lines

Rebase onto up-to-date parent

dhoekwater published this revision for review.Aug 3 2023, 11:30 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 3 2023, 11:30 AM
dhoekwater updated this revision to Diff 547421.Aug 4 2023, 5:29 PM

Rebase parent onto main and add clearer comments

removed reviewers because I initially added way too many.

arsenm added inline comments.Aug 10 2023, 2:56 PM
llvm/lib/CodeGen/BranchRelaxation.cpp
184

this just scanned through the function, couldn't you track this on that first loop?

403–427

Split this into a helper function?

arsenm requested changes to this revision.Aug 15 2023, 8:48 AM
This revision now requires changes to proceed.Aug 15 2023, 8:48 AM
dhoekwater marked an inline comment as done.

Move trampoline initialization into an existing loop

dhoekwater marked an inline comment as done.Aug 16 2023, 12:01 PM
dhoekwater added inline comments.
llvm/lib/CodeGen/BranchRelaxation.cpp
403–427

insertUncondBranch, removeBranch, insertBranch, and finalizeBlockChanges are function-local lambdas, so moving this all into a helper function would require moving those to helper functions as well.

I think abstracting away BlockInfo maintenance into helper functions could be a good idea, but doing so in this change would probably be overly ambitious.

dhoekwater edited the summary of this revision. (Show Details)Aug 17 2023, 11:32 AM
dhoekwater marked an inline comment as done.

Remove the closed dependency from the commit message

arsenm accepted this revision.Aug 17 2023, 12:08 PM

LGTM with some nits

llvm/lib/CodeGen/BranchRelaxation.cpp
85

null initialize and remove the one in scanFunction?

184

Need a << '\n'

726

redundant with clearing this at the start of the scan

This revision is now accepted and ready to land.Aug 17 2023, 12:08 PM
dhoekwater marked 3 inline comments as done.

Make the comments more obvious and remove an unnecessary assignment

A bit more info on the motivation for this patch:

Machine function splitting (MFS) places super cold basic blocks in a different ELF section (the .test.split section), so any hot branches targeting the cold basic blocks must be relaxed. The current method of relaxing a conditional branch is to invert the condition, which biases the branch predictor in the opposite direction. Since these branches are almost never taken, the bias has a significant impact on runtime performance.

This patch mitigates the impact of MFS on the branch predictor by avoiding inverting the conditional branch. Instead, if the spot right after the last hot block is in range, we can 1) place a "trampoline," a basic block that only consists of an unconditional to the destination, at this location and 2) retarget the conditional branch to the trampoline. This increases the cost of actually taking the branch to the cold section, but it is almost never taken; doing so significantly improves performance when the branch is not taken.

llvm/lib/CodeGen/BranchRelaxation.cpp
85

We either need the null assignment in scanFunction or runOnMachineFunction since just null initializing it will keep the insertion point from the previous function. I'll keep the one in scanFunction, since it doesn't require that the previous call to runOnMachineFunction executed successfully.

Rebase onto main

Rebase onto main to hopefully resolve some spurious test failures