This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] [BranchRelaxation] Optimize for hot code size in AArch64 branch relaxation
ClosedPublic

Authored by dhoekwater on Jul 31 2023, 5:59 PM.
Tokens
"Like" token, awarded by mingmingl.

Details

Summary

On AArch64, it is safe to let the linker handle relaxation of
unconditional branches; in most cases, the destination is within range,
and the linker doesn't need to do anything. If the linker does insert
fixup code, it clobbers the x16 inter-procedural register, so x16 must
be available across the branch before linking. If x16 isn't available,
but some other register is, we can relax the branch either by spilling
x16 OR using the free register for a manually-inserted indirect branch.

This patch builds on D145211. While that patch is for correctness, this
one is for performance of the common case. As noted in
https://reviews.llvm.org/D145211#4537173, we can trust the linker to
relax cross-section unconditional branches across which x16 is
available.

Programs that use machine function splitting care most about the
performance of hot code at the expense of the performance of cold code,
so we prioritize minimizing hot code size.

Here's a breakdown of the cases:

Hot -> Cold [x16 is free across the branch]
  Do nothing; let the linker relax the branch.

Cold -> Hot [x16 is free across the branch]
  Do nothing; let the linker relax the branch.

Hot -> Cold [x16 used across the branch, but there is a free register]
  Spill x16; let the linker relax the branch.

  Spilling requires fewer instructions than manually inserting an
  indirect branch.

Cold -> Hot [x16 used across the branch, but there is a free register]
  Manually insert an indirect branch.

  Spilling would require adding a restore block in the hot section.

 Hot -> Cold [No free regs]
   Spill x16; let the linker relax the branch.

 Cold -> Hot [No free regs]
   Spill x16 and put the restore block at the end of the hot function; let the linker relax the branch.
   Ex:
     [Hot section]
     func.hot:
       ... hot code...
     func.restore:
       ... restore x16 ...
       B func.hot

     [Cold section]
       func.cold:
       ... spill x16 ...
       B func.restore

   Putting the restore block at the end of the function instead of
   just before the destination increases the cost of executing the
   store, but it avoids putting cold code in the middle of hot code.
   Since the restore is very rarely taken, this is a worthwhile
   tradeoff.

Diff Detail

Event Timeline

dhoekwater created this revision.Jul 31 2023, 5:59 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 31 2023, 6:00 PM
dhoekwater requested review of this revision.Jul 31 2023, 6:00 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 31 2023, 6:00 PM

Here's a breakdown of the cases:

Hot -> Cold [x16 is free across the branch]
  Do nothing; let the linker relax the branch.

Cold -> Hot [x16 is free across the branch]
  Do nothing; let the linker relax the branch.

Hot -> Cold [x16 not free, but others are]
  Spill x16; let the linker relax the branch.

Cold -> Hot [x16 not free, but others are]
  Manually insert an indirect branch.

Hot -> Cold [No free regs]
  Spill x16; let the linker relax the branch.

Cold -> Hot [No free regs]
  Spill x16 and put the restore block at the end of the hot function; let the linker relax the branch.
  Ex:
    [Hot section]
    func.hot:
      ... hot code...
    func.restore:
      ... restore x16 ...
      B func.hot

    [Cold section]
      func.cold:
      ... spill x16 ...
      B func.restore

Rebase onto up-to-date head

dhoekwater updated this revision to Diff 547423.Aug 4 2023, 5:34 PM

removed reviewers because I initially added way too many.

Here's a breakdown of the cases:

Hot -> Cold [x16 is free across the branch]
  Do nothing; let the linker relax the branch.

Cold -> Hot [x16 is free across the branch]
  Do nothing; let the linker relax the branch.

Hot -> Cold [x16 not free, but others are]
  Spill x16; let the linker relax the branch.

Cold -> Hot [x16 not free, but others are]
  Manually insert an indirect branch.

Hot -> Cold [No free regs]
  Spill x16; let the linker relax the branch.

Cold -> Hot [No free regs]
  Spill x16 and put the restore block at the end of the hot function; let the linker relax the branch.
  Ex:
    [Hot section]
    func.hot:
      ... hot code...
    func.restore:
      ... restore x16 ...
      B func.hot

    [Cold section]
      func.cold:
      ... spill x16 ...
      B func.restore

The change overall looks reasonable to me and the breakdown is well written.

To make the motivation slightly more clearer, maybe describe what's the current status without this patch (hot blocks not optimized, etc), and how this patch helps when cold blocks are split by function-splitting.

Here's a breakdown of the cases:

Hot -> Cold [x16 is free across the branch]
  Do nothing; let the linker relax the branch.

Cold -> Hot [x16 is free across the branch]
  Do nothing; let the linker relax the branch.

Hot -> Cold [x16 not free, but others are]
  Spill x16; let the linker relax the branch.

Cold -> Hot [x16 not free, but others are]
  Manually insert an indirect branch.

Hot -> Cold [No free regs]
  Spill x16; let the linker relax the branch.

Cold -> Hot [No free regs]
  Spill x16 and put the restore block at the end of the hot function; let the linker relax the branch.
  Ex:
    [Hot section]
    func.hot:
      ... hot code...
    func.restore:
      ... restore x16 ...
      B func.hot

    [Cold section]
      func.cold:
      ... spill x16 ...
      B func.restore
llvm/test/CodeGen/AArch64/branch-relax-b.ll
134

nit: add a brief comment about which case (in the breakdown list) is being tested for posterity.

If I read correctly, bb.3 is the cold block (that would be split out by MFS), and the restore block .LBB2_4 is the restore block and put at the end of the function. For my information, is x16 spilled as the 4th or 6th case?

llvm/test/CodeGen/AArch64/branch-relax-cross-section.mir
31

nit: maybe add test cases in an NFC (https://llvm.org/docs/Lexicon.html#n) and manually put cold blocks at the end of the IR function. This way, how the IR function looks like after function-splitting is easier to read, and the diff of prioritized hot code is clearer.

70–72

for my information, how does branch-relaxation pass tell cold: is a cold block for test case all_used_cold_to_hot? It seems entry will branch to cold unconditionally but I may miss something.

dhoekwater marked 3 inline comments as done.

Without this patch, out-of-range branches are correctly relaxed, but
they are relaxed suboptimally. Notably, most (~99%) of out-of-range
unconditional branches have X16 available across the branch, where we
can safely defer to the linker rather than inserting 2 extra
instructions. Because machine function splitting (MFS) makes
unconditional branches to cold blocks out of range, enabling MFS would
incur a big code size overhead due to cross-section branch relaxation.
This patch lets us mitigate that overhead in the vast majority of cases.

dhoekwater added inline comments.Aug 16 2023, 2:02 PM
llvm/test/CodeGen/AArch64/branch-relax-b.ll
134

Added a comment explaining the case a bit further. In this test, machine function splitting isn't enabled at all, so the function is made up of entirely hot blocks. For this case, the only thing that matters is that we're branching _from_ the hot section, so we don't need to use function splitting in this test.

llvm/test/CodeGen/AArch64/branch-relax-cross-section.mir
70–72

This part of the test (the LLVM IR) is only present so llc is happy; the brunt of the test details are specified in the MIR sections below. For this specific test case, it is specified in

218    bb.2.iftrue (bbsections Cold):

Remove the closed dependency from the commit message

dhoekwater edited the summary of this revision. (Show Details)Aug 17 2023, 11:34 AM

Remove the closed dependency from the commit message

Accidentally updated the wrong patch

arsenm added inline comments.Aug 17 2023, 11:58 AM
llvm/lib/CodeGen/BranchRelaxation.cpp
597

Broken MBB check, MBB cannot be null

dhoekwater marked an inline comment as done.

Remove a redundant null check

dhoekwater added inline comments.Aug 21 2023, 9:04 AM
llvm/lib/CodeGen/BranchRelaxation.cpp
597

Ah, debugging change, I'm not sure how this made it in. Thanks for the feedback.

mingmingl added inline comments.Aug 22 2023, 3:42 PM
llvm/test/CodeGen/AArch64/branch-relax-cross-section.mir
248–259

I should have missed something here.. Where does bb.4 and bb.5 come from for 'x16_used_cold_to_hot'? Testing the IR (https://gcc.godbolt.org/z/esjchq3PK) gives 3 blocks..

Relatedly, I think submitting test cases as NFC and rebase on top of that should display the diff clearer.

dhoekwater added inline comments.Aug 22 2023, 9:02 PM
llvm/test/CodeGen/AArch64/branch-relax-cross-section.mir
248–259

Where does bb.4 and bb.5 come from for 'x16_used_cold_to_hot'?

bb.3, bb.4, and bb.5 are all machine basic blocks created by the Branch Relaxation pass in the fixupUnconditionalBranch function.

  • bb.3 is a block that contains an unconditional branch that will be safely relaxed by the linker.
  • bb.4 is a block that restores the contents of X16 that have been written to the stack in bb.3.entry and clobbered by the linker. For correctness, X16 must be restored before bb.2.cold executes.
  • bb.5 is a block that contains an indirect branch to bb.1.hot. The indirect branch 1) loads the address of bb.1 into an available register by loading the page address then adding the page offset and 2) branches to the register containing the address.

This test in particular is checking that, when a cold -> hot unconditional branch is relaxed, it eagerly inserts an indirect branch using a free register, if there is one.

dhoekwater added inline comments.Aug 22 2023, 9:04 PM
llvm/test/CodeGen/AArch64/branch-relax-cross-section.mir
248–259

Relatedly, I think submitting test cases as NFC and rebase on top of that should display the diff clearer.

In the NFC, do I add the MIR for the tests but not the CHECKs? Or do I add checks for the wrong behavior?

dhoekwater marked an inline comment as done.Aug 28 2023, 10:23 AM
dhoekwater added inline comments.
llvm/test/CodeGen/AArch64/branch-relax-cross-section.mir
248–259

Tests added in an NFC.

Remove redundant assignments to TrampolineInsertionPoint

thanks! The implementation makes sense to me. I'm not very familiar with this pass though, so maybe we could wait a little bit more for owners.

llvm/lib/CodeGen/BranchRelaxation.cpp
595

nit: move this after the cold->hot handling ( if (MBB->getSectionID() == MBBSectionID::ColdSectionID && DestBB->getSectionID() != MBBSectionID::ColdSectionID)), or maybe update it accordingly.

thanks! The implementation makes sense to me. I'm not very familiar with this pass though, so maybe we could wait a little bit more for owners.

I think MaskRay is the only reviewer who's contributed to both BranchRelaxation and AArch64 code. @MaskRay, could you please take a look?

mingmingl accepted this revision.Sep 5 2023, 3:46 PM

lgtm and approve. Maybe give other reviewers one or two days (after the holiday) before commit this.

This revision is now accepted and ready to land.Sep 5 2023, 3:46 PM
dhoekwater added a comment.EditedSep 5 2023, 6:43 PM

lgtm and approve. Maybe give other reviewers one or two days (after the holiday) before commit this.

Sounds good, I'll plan to commit it close to EOD on Wednesday or early Thursday barring any further feedback.

It seems that this patch should link to D145211. This patch will address this comment https://reviews.llvm.org/D145211#4537173

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
285–286

This sentence sounds like the linker needs to do work, but in the majority of cases I think the linker won't need to insert a range extension thunk.

Perhaps:

// If X16 is unused, we can rely on linker range extension thunk in case NewDestBB cannot be reached by a single B (e.g. cross-section).

287

Can we use another variable so that we can use const Register Reg = AArch64::X16; above and remove Reg = AArch64::X16; on L312?

Here's a breakdown of the cases:
[...]

It'd be great to fold this information to the summary.

Hot -> Cold [x16 not free, but others are]

but there is an unused register

dhoekwater edited the summary of this revision. (Show Details)Sep 6 2023, 9:20 AM
dhoekwater edited the summary of this revision. (Show Details)Sep 6 2023, 11:16 AM
dhoekwater marked 3 inline comments as done.
dhoekwater edited the summary of this revision. (Show Details)

Clarify some comments and add a temporary variable to make indirect branch insertion cleaner

It'd be great to fold this information to the summary.
...
but there is an unused register

Folded into the summary and updated phrasing.

MaskRay accepted this revision.Sep 6 2023, 12:16 PM
MaskRay added inline comments.
llvm/lib/CodeGen/BranchRelaxation.cpp
605

End complete sentences with a period.

608

period

dhoekwater updated this revision to Diff 556073.Sep 6 2023, 1:18 PM

Properly punctuate comments

dhoekwater marked 2 inline comments as done.Sep 6 2023, 1:23 PM
MaskRay accepted this revision.Sep 6 2023, 1:45 PM
This revision was landed with ongoing or failed builds.Sep 6 2023, 1:48 PM
This revision was automatically updated to reflect the committed changes.