
Separate CSRs into a subset handled via explicit copies and a subset handled by prologue/epilogue
ClosedPublic

Authored by manmanren on Dec 8 2015, 11:02 AM.

Details

Summary

The issue
---------
For a calling convention with a large set of CSRs, if shrink wrapping cannot move the whole prologue/epilogue to the cold path, we can explicitly handle a subset of the CSRs via copies, and the register allocator will try to spill and reload this subset on the cold path. Without the proposed patches, all CSRs are saved and restored in the prologue/epilogue.

Specifically, this is motivated by CXX_FAST_TLS calling convention on AArch64. Since the C++-style access function calls tlv_get_addr on Darwin, LR will be clobbered by tlv_get_addr and shrink wrapping will not be able to move the prologue/epilogue.

The patches
-----------
There are two patches proposed for this issue. The target-independent portion is in this review. The AArch64-specific portion will be in a separate review.

What is included in this patch
------------------------------
The access function has a short entry and a short exit; the initialization block is run only the first time. To improve performance, we want a short frame at the entry and exit.
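To make the shape concrete, here is an illustrative hand-written analogue of the access function the compiler emits for an Itanium-style `thread_local` variable (the names are made up; the real function is compiler-generated and calls `tlv_get_addr` on Darwin):

```cpp
#include <string>

// Illustrative shape of a C++ thread_local access function:
// a short hot entry/exit, plus a one-time initialization block
// that is only reached on the first call in each thread.
thread_local bool Initialized = false;
thread_local std::string Value;

std::string *accessFn() {
  if (!Initialized) {   // cold path: one-time initialization block;
    Value = "init";     // the register allocator can sink spills and
    Initialized = true; // reloads of the copy-handled CSRs here
  }
  return &Value;        // hot path: short entry and short exit
}
```

The patches aim to keep the frame on the hot path small, pushing CSR spills/reloads into the cold initialization block.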

We explicitly handle most of the CSRs via copies. Only the CSRs that are not handled via copies will be in CSR_SaveList.
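As a toy illustration of this partition (register names are hypothetical placeholders; the real lists come from the target's calling-convention definitions), the CSRs not kept in CSR_SaveList become the copy-handled subset:

```cpp
#include <set>
#include <string>
#include <vector>

// Toy model: split the full CSR set into the subset still saved by
// the prologue/epilogue (CSR_SaveList) and the subset handled via
// explicit copies. Register names here are purely illustrative.
std::vector<std::string> computeCSRsViaCopy(
    const std::vector<std::string> &AllCSRs,
    const std::set<std::string> &SaveList) {
  std::vector<std::string> ViaCopy;
  for (const auto &R : AllCSRs)
    if (!SaveList.count(R)) // not in CSR_SaveList -> handled via copies
      ViaCopy.push_back(R);
  return ViaCopy;
}
```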

Frame lowering and prologue/epilogue insertion will generate a short frame in the entry and exit according to CSR_SaveList. The majority of the CSRs will be handled by the register allocator, which will try to spill and reload them in the initialization block.

We add CSRsViaCopy, which will be explicitly handled during lowering.

1> We first set FunctionLoweringInfo->SplitCSR if conditions are met (a single
   exit block with a return; the target supports SplitCSR; the calling
   convention is CXX_TLS). We also call TLI->handleSplitCSR(Entry, nullptr)
   to perform initialization.
2> We insert copies from CSRsViaCopy to virtual registers at the beginning of
   the entry block, and copies from virtual registers to CSRsViaCopy at the
   beginning of the exit block. (This is implemented by handleSplitCSR(Entry, Exit).)
3> We also need to make sure the explicit copies will not be eliminated. (This is target-specific.)
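Step 2 can be sketched with a self-contained toy model (strings stand in for MachineInstrs and vreg names are invented; the real implementation builds COPY instructions through the target instruction info):

```cpp
#include <deque>
#include <string>
#include <vector>

// Toy model of step 2: at the start of the entry block, copy each
// split CSR into a fresh virtual register; at the start of the exit
// block, copy the value back into the physical CSR before returning.
using Block = std::deque<std::string>;

void insertSplitCSRCopies(Block &Entry, Block &Exit,
                          const std::vector<std::string> &CSRsViaCopy) {
  for (size_t I = 0; I < CSRsViaCopy.size(); ++I) {
    std::string VReg = "%vreg" + std::to_string(I);
    // Entry: CSR -> vreg, freeing the physical register on the hot path.
    Entry.push_front("COPY " + VReg + " = " + CSRsViaCopy[I]);
    // Exit: vreg -> CSR, restoring the callee-saved value for the caller.
    Exit.push_front("COPY " + CSRsViaCopy[I] + " = " + VReg);
  }
}
```

The register allocator is then free to keep each vreg in a register on the hot path and spill it only where pressure demands (e.g., in the cold initialization block).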

Thanks to Quentin for suggesting this!

Diff Detail

Repository
rL LLVM

Event Timeline

manmanren updated this revision to Diff 42195.Dec 8 2015, 11:02 AM
manmanren retitled this revision from to Separate CSRs into a subset handled via explicit copies and a subset handled by prologue/epilogue.
manmanren updated this object.
manmanren added reviewers: qcolombet, llvm-commits.
hfinkel added a subscriber: hfinkel.Dec 9 2015, 4:31 AM

I'd like to make sure that I understand the situation:

You have functions that you'd like to shrink wrap, but you can't because you have calls to tlv_get_addr that post-dominate the entry, and tlv_get_addr clobbers the LR and FP registers. Normally, this requires LR and FP to be saved to the stack in the prologue, and here, you still do. And you still can't shrink wrap. But, instead of saving the other CSRs to the stack, you copy them to other vregs, the thought being that you'll spill them as needed anyway, and you might as well let the register allocator make that decision rather than blindly spilling them all up front.

Saving CSRs by copying normally does not make sense if you'll immediately spill them, but does make sense if you have high-register pressure only in some cold path. I don't understand what this has to do with special calling conventions for TLS, but it seems like a generally-good thing to do any time you have high register pressure in some (cold) part(s) of a function and low register pressure elsewhere. Do functions with this CXX_FAST_TLS calling convention always exhibit that pattern? Should we endeavor to detect it more generally?

lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
472 ↗(On Diff #42195)

Restricting this to CallingConv::CXX_FAST_TLS does not belong in target-independent code. Please move this into the TLI->supportSplitCSR() implementation.
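The suggested move follows the usual target-hook pattern; a minimal self-contained model of it (the class and enum below only stand in for the real TargetLowering/CallingConv types):

```cpp
// Toy model of moving the calling-convention check out of the
// target-independent selector and into a target hook.
enum class CallConv { C, Fast, CXX_FAST_TLS };

struct TargetLoweringModel {
  // Default: no target supports split-CSR handling.
  virtual bool supportSplitCSR(CallConv CC) const { return false; }
  virtual ~TargetLoweringModel() = default;
};

struct AArch64LoweringModel : TargetLoweringModel {
  // The target, not the generic selector, decides which calling
  // conventions qualify for split-CSR handling.
  bool supportSplitCSR(CallConv CC) const override {
    return CC == CallConv::CXX_FAST_TLS;
  }
};
```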

480 ↗(On Diff #42195)

Am I correct in assuming that the succ_empty(&BB) check is for blocks that end in, at the IR level, an unreachable? If so, I'm not sure this logic is right, because we should be able to have as many unreachables as we'd like, so long as we have only one actual return.

493 ↗(On Diff #42195)

Fundamentally, why are we checking for a single return? Wouldn't you just insert the vreg->csr copies into each return block independently should there be more than one?

I'd like to make sure that I understand the situation:

You have functions that you'd like to shrink wrap, but you can't because you have calls to tlv_get_addr that post-dominate the entry, and tlv_get_addr clobbers the LR and FP registers. Normally, this requires LR and FP to be saved to the stack in the prologue, and here, you still do. And you still can't shrink wrap. But, instead of saving the other CSRs to the stack, you copy them to other vregs, the thought being that you'll spill them as needed anyway, and you might as well let the register allocator make that decision rather than blindly spilling them all up front.

Yes, exactly!

Saving CSRs by copying normally does not make sense if you'll immediately spill them, but does make sense if you have high-register pressure only in some cold path. I don't understand what this has to do with special calling conventions for TLS, but it seems like a generally-good thing to do any time you have high register pressure in some (cold) part(s) of a function and low register pressure elsewhere.

Yes, this can be used in a general way, but the patches currently enable it only for CXX_FAST_TLS. Without these patches, the performance of CXX_FAST_TLS will be worse than a normal calling convention, since we will preserve all the CSRs on the hot path.

Do functions with this CXX_FAST_TLS calling convention always exhibit that pattern?
Yes, as implemented in the clang front end (as specified by the Itanium ABI, we will have a one-time initialization block that can be slow).

Should we endeavor to detect it more generally?
My current plan is to enable it only for CXX_FAST_TLS; when people see other potential performance improvements, it can be made more general with a motivating test case.

Thanks for reviewing!
Manman

lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
472 ↗(On Diff #42195)

Will do

480 ↗(On Diff #42195)

Am I correct in assuming that the succ_empty(&BB) check is for blocks that end in, at the IR level, an unreachable?
Blocks ending with unreachable will have succ_empty(&BB) return true.

This is currently overly conservative, because the current implementation assumes a single return, and the explicit copies will be inserted in the return block. I think we should be able to allow multiple unreachable blocks, without explicit copies for those blocks.
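The relaxed check being discussed here (exactly one real return, any number of unreachable-terminated blocks) can be modeled in a few lines, with toy terminator kinds standing in for the IR terminators:

```cpp
#include <vector>

// Toy terminator kinds standing in for IR terminator instructions.
enum class Term { Return, Unreachable, Branch };

// Accept a function for split-CSR handling iff it has exactly one
// return block. Unreachable-terminated blocks are fine: no restoring
// copies are needed on paths that never return to the caller.
bool eligibleForSplitCSR(const std::vector<Term> &Terminators) {
  int Returns = 0;
  for (Term T : Terminators)
    if (T == Term::Return)
      ++Returns;
  return Returns == 1;
}
```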

493 ↗(On Diff #42195)

Yes, fundamentally we can allow multiple returns, as long as we put explicit copies in each return block. I am trying to simplify the logic here, since in the CXX_FAST_TLS case we have a single return. I am okay with making it more general.

manmanren updated this revision to Diff 42327.Dec 9 2015, 12:28 PM
manmanren marked 3 inline comments as done.

Updated patch to address Hal's comments.

Cheers,
Manman

Sounds good.

Yes, as implemented in clang front end (specified by Itanium ABI, we will have a one-time initialization block that can be slow).

Are the Clang parts upstream yet? (or are you waiting on this part first?)

include/llvm/Target/TargetLowering.h
2270 ↗(On Diff #42195)

We need more comments here explaining what exactly this function is supposed to do, at what stage in the pipeline it is called, etc.

manmanren updated this revision to Diff 42337.Dec 9 2015, 2:09 PM

Add comments for functions added to TargetLowering: supportSplitCSR, initializeSplitCSR and insertCopiesSplitCSR.

Sounds good.

Yes, as implemented in clang front end (specified by Itanium ABI, we will have a one-time initialization block that can be slow).

Are the Clang parts upstream yet? (or are you waiting on this part first?)

Not yet, I am waiting for this part to land first. Without this part, on AArch64, the performance can be worse than not using CXX_FAST_TLS.
The clang part is very straightforward; it uses CXX_FAST_TLS on the Darwin platform.

Cheers,
Manman

hfinkel accepted this revision.Dec 9 2015, 2:36 PM
hfinkel added a reviewer: hfinkel.

A few minor comments, otherwise, LGTM.

include/llvm/Target/TargetLowering.h
2273 ↗(On Diff #42337)

Add:

// This function is called at the beginning of instruction selection.
2281 ↗(On Diff #42337)

Add:

// This function is called at the end of instruction selection.
lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
506 ↗(On Diff #42337)

Can you move this up so that you can use EntryBB in the call to initializeSplitCSR? If so, please do.

This revision is now accepted and ready to land.Dec 9 2015, 2:36 PM
qcolombet edited edge metadata.Dec 10 2015, 9:24 AM

Hi Manman,

Thanks for doing this!

A quick comment, would it be possible to have a command line option to force the use of this special handling as a follow-up patch?
For a while now I’ve wanted to make some measurements to check how interesting this approach would be compared to our current handling of CSRs in reg alloc.

Thanks again!
Q.

Hi Manman,

Thanks for doing this!

A quick comment, would it be possible to have a command line option to force the use of this special handling as a follow-up patch?

It depends on what the command line option will do :]

The implementation currently only works for CXX_FAST_TLS and we can probably generalize it to other calling conventions.
Also as pointed out by Hal in AArch64 patch (http://reviews.llvm.org/D15341), we may need to update that patch to handle unwinding by inserting cfi directives.

For a while now I’ve wanted to make some measurements to check how interesting this approach would be compared to our current handling of CSRs in reg alloc.

Is there any special handling of CSRs in reg alloc? I remember from a while back that we only add a penalty for the first usage of a CSR.

Thank you for reviewing!
Manman

Thanks again!
Q.

This revision was automatically updated to reflect the committed changes.