
Separate CSRs into a subset handled via explicit copies and a subset handled by prologue/epilogue
ClosedPublic

Authored by manmanren on Dec 8 2015, 11:02 AM.

Details

Summary

The issue
---------
For a calling convention with a large set of CSRs, if shrink wrapping cannot move the whole prologue/epilogue to the cold path, we can explicitly handle a subset of the CSRs via copies, and the register allocator will try to spill and reload this subset on the cold path. Without the proposed patches, all CSRs are saved and restored in the prologue/epilogue.

Specifically, this is motivated by CXX_FAST_TLS calling convention on AArch64. Since the C++-style access function calls tlv_get_addr on Darwin, LR will be clobbered by tlv_get_addr and shrink wrapping will not be able to move the prologue/epilogue.

The patches
-----------
There are two patches proposed for this issue. The target-independent portion is in this review. The AArch64-specific portion will be in a separate review.

What is included in this patch
------------------------------
The access function has a short entry and a short exit; the initialization block is run only the first time. To improve performance, we want a short frame at the entry and exit.
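To make the shape concrete, here is an illustrative hand-written analogue of the access function the compiler emits for an Itanium-style `thread_local` variable (the names are made up; the real function is compiler-generated and calls `tlv_get_addr` on Darwin):

```cpp
#include <string>

// Illustrative shape of a C++ thread_local access function:
// a short hot entry/exit, plus a one-time initialization block
// that is only reached on the first call in each thread.
thread_local bool Initialized = false;
thread_local std::string Value;

std::string *accessFn() {
  if (!Initialized) {   // cold path: one-time initialization block;
    Value = "init";     // the register allocator can sink spills and
    Initialized = true; // reloads of the copy-handled CSRs here
  }
  return &Value;        // hot path: short entry and short exit
}
```

The patches aim to keep the frame on the hot path small, pushing CSR spills/reloads into the cold initialization block.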

We explicitly handle most of the CSRs via copies. Only the CSRs that are not handled via copies will be in CSR_SaveList.
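As a toy illustration of this partition (register names are hypothetical placeholders; the real lists come from the target's calling-convention definitions), the CSRs not kept in CSR_SaveList become the copy-handled subset:

```cpp
#include <set>
#include <string>
#include <vector>

// Toy model: split the full CSR set into the subset still saved by
// the prologue/epilogue (CSR_SaveList) and the subset handled via
// explicit copies. Register names here are purely illustrative.
std::vector<std::string> computeCSRsViaCopy(
    const std::vector<std::string> &AllCSRs,
    const std::set<std::string> &SaveList) {
  std::vector<std::string> ViaCopy;
  for (const auto &R : AllCSRs)
    if (!SaveList.count(R)) // not in CSR_SaveList -> handled via copies
      ViaCopy.push_back(R);
  return ViaCopy;
}
```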

Frame lowering and prologue/epilogue insertion will generate a short frame in the entry and exit according to CSR_SaveList. The majority of the CSRs will be handled by the register allocator, which will try to spill and reload them in the initialization block.

We add CSRsViaCopy, which will be explicitly handled during lowering.

1> We first set FunctionLoweringInfo->SplitCSR if conditions are met (a single
   exit block with a return; the target supports SplitCSR; the calling
   convention is CXX_TLS). We also call TLI->handleSplitCSR(Entry, nullptr)
   to perform initialization.
2> We insert copies from CSRsViaCopy to virtual registers at the beginning of
   the entry block, and copies from virtual registers to CSRsViaCopy at the
   beginning of the exit block. (This is implemented by handleSplitCSR(Entry, Exit).)
3> We also need to make sure the explicit copies will not be eliminated. (This is target-specific.)
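Step 2 can be sketched with a self-contained toy model (strings stand in for MachineInstrs and vreg names are invented; the real implementation builds COPY instructions through the target instruction info):

```cpp
#include <deque>
#include <string>
#include <vector>

// Toy model of step 2: at the start of the entry block, copy each
// split CSR into a fresh virtual register; at the start of the exit
// block, copy the value back into the physical CSR before returning.
using Block = std::deque<std::string>;

void insertSplitCSRCopies(Block &Entry, Block &Exit,
                          const std::vector<std::string> &CSRsViaCopy) {
  for (size_t I = 0; I < CSRsViaCopy.size(); ++I) {
    std::string VReg = "%vreg" + std::to_string(I);
    // Entry: CSR -> vreg, freeing the physical register on the hot path.
    Entry.push_front("COPY " + VReg + " = " + CSRsViaCopy[I]);
    // Exit: vreg -> CSR, restoring the callee-saved value for the caller.
    Exit.push_front("COPY " + CSRsViaCopy[I] + " = " + VReg);
  }
}
```

The register allocator is then free to keep each vreg in a register on the hot path and spill it only where pressure demands (e.g., in the cold initialization block).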

Thanks to Quentin for suggesting this!

Diff Detail

Repository
rL LLVM

Event Timeline

manmanren updated this revision to Diff 42195.Dec 8 2015, 11:02 AM
manmanren retitled this revision from to Separate CSRs into a subset handled via explicit copies and a subset handled by prologue/epilogue.
manmanren updated this object.
manmanren added reviewers: qcolombet, llvm-commits.
hfinkel added a subscriber: hfinkel.Dec 9 2015, 4:31 AM

I'd like to make sure that I understand the situation:

You have functions that you'd like to shrink wrap, but you can't because you have calls to tlv_get_addr that post-dominate the entry, and tlv_get_addr clobbers the LR and FP registers. Normally, this requires LR and FP to be saved to the stack in the prologue, and here, you still do. And you still can't shrink wrap. But, instead of saving the other CSRs to the stack, you copy them to other vregs, the thought being that you'll spill them as needed anyway, and you might as well let the register allocator make that decision rather than blindly spilling them all up front.

Saving CSRs by copying normally does not make sense if you'll immediately spill them, but does make sense if you have high-register pressure only in some cold path. I don't understand what this has to do with special calling conventions for TLS, but it seems like a generally-good thing to do any time you have high register pressure in some (cold) part(s) of a function and low register pressure elsewhere. Do functions with this CXX_FAST_TLS calling convention always exhibit that pattern? Should we endeavor to detect it more generally?

lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
472 ↗(On Diff #42195)

Restricting this to CallingConv::CXX_FAST_TLS does not belong in target-independent code. Please move this into the TLI->supportSplitCSR() implementation.
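The suggested move follows the usual target-hook pattern; a minimal self-contained model of it (the class and enum below only stand in for the real TargetLowering/CallingConv types):

```cpp
// Toy model of moving the calling-convention check out of the
// target-independent selector and into a target hook.
enum class CallConv { C, Fast, CXX_FAST_TLS };

struct TargetLoweringModel {
  // Default: no target supports split-CSR handling.
  virtual bool supportSplitCSR(CallConv CC) const { return false; }
  virtual ~TargetLoweringModel() = default;
};

struct AArch64LoweringModel : TargetLoweringModel {
  // The target, not the generic selector, decides which calling
  // conventions qualify for split-CSR handling.
  bool supportSplitCSR(CallConv CC) const override {
    return CC == CallConv::CXX_FAST_TLS;
  }
};
```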

480 ↗(On Diff #42195)

Am I correct in assuming that the succ_empty(&BB) check is for blocks that end in, at the IR level, an unreachable? If so, I'm not sure this logic is right, because we should be able to have as many unreachables as we'd like, so long as we have only one actual return.

493 ↗(On Diff #42195)

Fundamentally, why are we checking for a single return? Wouldn't you just insert the vreg->csr copies into each return block independently should there be more than one?

I'd like to make sure that I understand the situation:

You have functions that you'd like to shrink wrap, but you can't because you have calls to tlv_get_addr that post-dominate the entry, and tlv_get_addr clobbers the LR and FP registers. Normally, this requires LR and FP to be saved to the stack in the prologue, and here, you still do. And you still can't shrink wrap. But, instead of saving the other CSRs to the stack, you copy them to other vregs, the thought being that you'll spill them as needed anyway, and you might as well let the register allocator make that decision rather than blindly spilling them all up front.

Yes, exactly!

Saving CSRs by copying normally does not make sense if you'll immediately spill them, but does make sense if you have high-register pressure only in some cold path. I don't understand what this has to do with special calling conventions for TLS, but it seems like a generally-good thing to do any time you have high register pressure in some (cold) part(s) of a function and low register pressure elsewhere.

Yes, this can be used in a general way, but the patches currently enable it only for CXX_FAST_TLS. Without these patches, the performance of CXX_FAST_TLS will be worse than a normal calling convention, since we will preserve all the CSRs on the hot path.

Do functions with this CXX_FAST_TLS calling convention always exhibit that pattern?
Yes, as implemented in the clang front end (as specified by the Itanium ABI, we will have a one-time initialization block that can be slow).

Should we endeavor to detect it more generally?
My current plan is to enable it only for CXX_FAST_TLS; when people see other potential performance improvements, it can be made more general with a motivating test case.

Thanks for reviewing!
Manman

lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
472 ↗(On Diff #42195)

Will do

480 ↗(On Diff #42195)

Am I correct in assuming that the succ_empty(&BB) check is for blocks that end in, at the IR level, an unreachable?
Blocks ending with unreachable will have succ_empty(&BB) return true.

This is currently overly conservative, because the current implementation assumes a single return, and the explicit copies will be inserted in the return block. I think we should be able to allow multiple unreachable blocks, without explicit copies for those blocks.
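The relaxed check being discussed here (exactly one real return, any number of unreachable-terminated blocks) can be modeled in a few lines, with toy terminator kinds standing in for the IR terminators:

```cpp
#include <vector>

// Toy terminator kinds standing in for IR terminator instructions.
enum class Term { Return, Unreachable, Branch };

// Accept a function for split-CSR handling iff it has exactly one
// return block. Unreachable-terminated blocks are fine: no restoring
// copies are needed on paths that never return to the caller.
bool eligibleForSplitCSR(const std::vector<Term> &Terminators) {
  int Returns = 0;
  for (Term T : Terminators)
    if (T == Term::Return)
      ++Returns;
  return Returns == 1;
}
```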

493 ↗(On Diff #42195)

Yes, fundamentally we can allow multiple returns, as long as we put explicit copies in each return block. I am trying to simplify the logic here, since in the CXX_FAST_TLS case we have a single return. I am okay with making it more general.

manmanren updated this revision to Diff 42327.Dec 9 2015, 12:28 PM
manmanren marked 3 inline comments as done.

Updated patch to address Hal's comments.

Cheers,
Manman

Sounds good.

Yes, as implemented in clang front end (specified by Itanium ABI, we will have a one-time initialization block that can be slow).

Are the Clang parts upstream yet? (or are you waiting on this part first?)

include/llvm/Target/TargetLowering.h
2270 ↗(On Diff #42195)

We need more comments here explaining what exactly this function is supposed to do, at what stage in the pipeline it is called, etc.

manmanren updated this revision to Diff 42337.Dec 9 2015, 2:09 PM

Add comments for functions added to TargetLowering: supportSplitCSR, initializeSplitCSR and insertCopiesSplitCSR.

Sounds good.

Yes, as implemented in clang front end (specified by Itanium ABI, we will have a one-time initialization block that can be slow).

Are the Clang parts upstream yet? (or are you waiting on this part first?)

Not yet, I am waiting for this part to land first. Without this part, on AArch64, the performance can be worse than not using CXX_FAST_TLS.
The clang part is very straightforward; it uses CXX_FAST_TLS on the Darwin platform.

Cheers,
Manman

hfinkel accepted this revision.Dec 9 2015, 2:36 PM
hfinkel added a reviewer: hfinkel.

A few minor comments, otherwise, LGTM.

include/llvm/Target/TargetLowering.h
2273 ↗(On Diff #42337)

Add:

// This function is called at the beginning of instruction selection.
2281 ↗(On Diff #42337)

Add:

// This function is called at the end of instruction selection.
lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
506 ↗(On Diff #42337)

Can you move this up so that you can use EntryBB in the call to initializeSplitCSR? If so, please do.

This revision is now accepted and ready to land.Dec 9 2015, 2:36 PM
qcolombet edited edge metadata.Dec 10 2015, 9:24 AM

Hi Manman,

Thanks for doing this!

A quick comment, would it be possible to have a command line option to force the use of this special handling as a follow-up patch?
For a while now I’ve wanted to make some measurements to check how interesting this approach would be compared to our current handling of CSRs in reg alloc.

Thanks again!
Q.

Hi Manman,

Thanks for doing this!

A quick comment, would it be possible to have a command line option to force the use of this special handling as a follow-up patch?

It depends on what the command line option will do :]

The implementation currently only works for CXX_FAST_TLS and we can probably generalize it to other calling conventions.
Also as pointed out by Hal in AArch64 patch (http://reviews.llvm.org/D15341), we may need to update that patch to handle unwinding by inserting cfi directives.

For a while now I’ve wanted to make some measurements to check how interesting this approach would be compared to our current handling of CSRs in reg alloc.

Is there any special handling of CSRs in reg alloc? I remember from a while back that we only add a penalty for the first usage of a CSR.

Thank you for reviewing!
Manman

Thanks again!
Q.

This revision was automatically updated to reflect the committed changes.