This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Prefer prologues with sp adjustments merged into stp/ldp for WinCFI
ClosedPublic

Authored by mstorsjo on Oct 1 2020, 1:38 PM.

Details

Reviewers

efriedma
rnk
ssijaric
TomTan

Commits

rG7d07405761ae: [AArch64] Prefer prologues with sp adjustments merged into stp/ldp for WinCFI…

Summary

This makes the prologue match the Windows canonical layout, for cases without a frame pointer.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

mstorsjo created this revision. Oct 1 2020, 1:38 PM

Herald added a project: Restricted Project. Oct 1 2020, 1:38 PM

Herald added subscribers: danielkiss, hiraditya, kristof.beyls.

mstorsjo requested review of this revision. Oct 1 2020, 1:38 PM

Harbormaster completed remote builds in B73710: Diff 295665. Oct 1 2020, 1:38 PM

How badly do we really want to match the canonical packed prologue? Is it really worth generating less efficient instructions to reduce the size of the unwind data? (I guess it's not a lot less efficient, but still.)

In D88701#2307353, @efriedma wrote:

> How badly do we really want to match the canonical packed prologue?

Well, the space savings are quite notable, and without this in place, we seldom end up matching the canonical forms allowing use of packed, so I'd hold off pushing D88677 until this one is settled (because there's little point in bending over backwards with the register order if we don't hit the packed forms regularly).

> Is it really worth generating less efficient instructions to reduce the size of the unwind data? (I guess it's not a lot less efficient, but still.)

I guess it's marginally less efficient, but in most cases, the number of instructions produced should at least be the same. (In some of the testcase updates, it may look like we're getting more instructions, but that's in tests with sparse CHECK lines that don't check every instruction with CHECK-NEXT.)

AFAIK in most cases, this patch should amount to changing this:

sub sp, sp, #48
stp x19, x20, [sp, #16]
stp x21, x30, [sp, #32]

Into this:

stp x19, x20, [sp, #-32]!
stp x21, x30, [sp, #16]
sub sp, sp, #16

So the same number of instructions, but sp is updated twice instead of once - that's the only inefficiency I can think of.
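To make the comparison concrete, here is a toy C++ model of the two sequences above. It is only a sketch: PrologueModel and its methods are made up for illustration and are not LLVM code. It tracks sp relative to its value on entry and records where each register is stored, so the two prologues can be compared slot by slot.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical toy model of a prologue. All addresses are relative to the
// value sp had on function entry.
struct PrologueModel {
  int64_t sp = 0;                       // current sp, entry-relative
  int spWrites = 0;                     // instructions that write sp
  std::map<std::string, int64_t> slot;  // register -> entry-relative address

  // sub sp, sp, #imm
  void subSp(int64_t imm) {
    sp -= imm;
    ++spWrites;
  }

  // stp a, b, [sp, #off]
  void stp(const std::string &a, const std::string &b, int64_t off) {
    slot[a] = sp + off;
    slot[b] = sp + off + 8;
  }

  // stp a, b, [sp, #off]!  (pre-indexed: sp is updated, then used as base)
  void stpPre(const std::string &a, const std::string &b, int64_t off) {
    sp += off;
    ++spWrites;
    stp(a, b, 0);
  }
};

// sub sp, sp, #48 ; stp x19, x20, [sp, #16] ; stp x21, x30, [sp, #32]
inline PrologueModel combinedBump() {
  PrologueModel m;
  m.subSp(48);
  m.stp("x19", "x20", 16);
  m.stp("x21", "x30", 32);
  return m;
}

// stp x19, x20, [sp, #-32]! ; stp x21, x30, [sp, #16] ; sub sp, sp, #16
inline PrologueModel predecrement() {
  PrologueModel m;
  m.stpPre("x19", "x20", -32);
  m.stp("x21", "x30", 16);
  m.subSp(16);
  return m;
}

// Both forms end with the same sp and the same save slots; only the number
// of sp updates differs.
inline void checkProloguesMatch() {
  PrologueModel a = combinedBump();
  PrologueModel b = predecrement();
  assert(a.sp == b.sp);
  assert(a.slot == b.slot);
  assert(a.spWrites + 1 == b.spWrites);
}
```

Calling checkProloguesMatch() passes in this model: the frame layout is identical, and the predecrement form writes sp one extra time.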

Yes, the dependency chain of sp is one instruction longer, and there's one extra arithmetic op on some cores. Those double if you count the epilogue that has the same issue.

Not sure how significant that is in practice, but I suspect it's observable in code with a lot of small functions.

Maybe it makes sense to distinguish between -O2 vs. -Os here?

Limit the change to when optimization for size is requested.
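As a rough standalone sketch of the resulting decision (the FrameFacts struct and its field names are hypothetical stand-ins for what the real code queries from MachineFunction and AArch64FunctionInfo; the stack probe, variable-sized object and other checks are left out):

```cpp
#include <cstdint>

// Hypothetical summary of the facts the decision depends on.
struct FrameFacts {
  bool needsWinCFI;           // Windows CFI is emitted for this function
  bool optForSize;            // function is optimized for size
  uint64_t calleeSavedBytes;  // size of the callee-saved register area
  uint64_t localBytes;        // size of the local area
};

// Returns true when one "sub sp" covering both the callee-save area and the
// locals is preferred, false when the bump should stay split so that the
// first stp can carry the predecrement.
inline bool shouldCombineBump(const FrameFacts &f, uint64_t stackBumpBytes) {
  // Nothing to merge if there are no locals.
  if (f.localBytes == 0)
    return false;

  // The new case: WinCFI + optimizing for size + something to save.
  if (f.needsWinCFI && f.optForSize && f.calleeSavedBytes > 0)
    return false;

  // Existing limit: stp/ldp offsets for callee saves must stay below 512.
  if (stackBumpBytes >= 512)
    return false;

  return true;
}
```

With these assumptions, a WinCFI function with 32 bytes of callee saves and 16 bytes of locals keeps the combined bump when not optimizing for size, and switches to the predecrement form only when size optimization is requested.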

LGTM

This revision is now accepted and ready to land. Oct 2 2020, 6:12 PM

Closed by commit rG7d07405761ae: [AArch64] Prefer prologues with sp adjustments merged into stp/ldp for WinCFI… (authored by mstorsjo). Oct 3 2020, 11:38 AM

This revision was automatically updated to reflect the committed changes.

mstorsjo added a commit: rG7d07405761ae: [AArch64] Prefer prologues with sp adjustments merged into stp/ldp for WinCFI….

Revision Contents

Path                                                      Size
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp          24 lines
llvm/test/CodeGen/AArch64/wineh-frame-predecrement.mir    70 lines

Diff 295996

llvm/lib/Target/AArch64/AArch64FrameLowering.cpp

[573 lines not shown] static bool windowsRequiresStackProbe(MachineFunction &MF,
   if (F.hasFnAttribute("stack-probe-size"))
     F.getFnAttribute("stack-probe-size")
         .getValueAsString()
         .getAsInteger(0, StackProbeSize);
   return (StackSizeInBytes >= StackProbeSize) &&
          !F.hasFnAttribute("no-stack-arg-probe");
 }

+static bool needsWinCFI(const MachineFunction &MF) {
+  const Function &F = MF.getFunction();
+  return MF.getTarget().getMCAsmInfo()->usesWindowsCFI() &&
+         F.needsUnwindTableEntry();
+}
+
 bool AArch64FrameLowering::shouldCombineCSRLocalStackBump(
     MachineFunction &MF, uint64_t StackBumpBytes) const {
   AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
   const MachineFrameInfo &MFI = MF.getFrameInfo();
   const AArch64Subtarget &Subtarget = MF.getSubtarget<AArch64Subtarget>();
   const AArch64RegisterInfo *RegInfo = Subtarget.getRegisterInfo();

   if (AFI->getLocalStackSize() == 0)
     return false;

+  // For WinCFI, if optimizing for size, prefer to not combine the stack bump
+  // (to force a stp with predecrement) to match the packed unwind format,
+  // provided that there actually are any callee saved registers to merge the
+  // decrement with.
+  // This is potentially marginally slower, but allows using the packed
+  // unwind format for functions that both have a local area and callee saved
+  // registers. Using the packed unwind format notably reduces the size of
+  // the unwind info.
+  if (needsWinCFI(MF) && AFI->getCalleeSavedStackSize() > 0 &&
+      MF.getFunction().hasOptSize())
+    return false;
+
   // 512 is the maximum immediate for stp/ldp that will be used for
   // callee-save save/restores
   if (StackBumpBytes >= 512 || windowsRequiresStackProbe(MF, StackBumpBytes))
     return false;

   if (MFI.hasVarSizedObjects())
     return false;

[377 lines not shown] static void adaptForLdStOpt(MachineBasicBlock &MBB,
   // add sp, sp, #64
   //
   // and the load-store optimizer can merge the last two instructions into:
   //
   // ldp x26, x25, [sp], #64
   //
 }

-static bool needsWinCFI(const MachineFunction &MF) {
-  const Function &F = MF.getFunction();
-  return MF.getTarget().getMCAsmInfo()->usesWindowsCFI() &&
-         F.needsUnwindTableEntry();
-}
-
 static bool isTargetWindows(const MachineFunction &MF) {
   return MF.getSubtarget<AArch64Subtarget>().isTargetWindows();
 }

 // Convenience function to determine whether I is an SVE callee save.
 static bool IsSVECalleeSave(MachineBasicBlock::iterator I) {
   switch (I->getOpcode()) {
   default:
[2,285 lines not shown]

llvm/test/CodeGen/AArch64/wineh-frame-predecrement.mir

This file was added.

# RUN: llc -o - %s -mtriple=aarch64-windows -start-before=prologepilog \
# RUN: -stop-after=prologepilog | FileCheck %s

# Check that the callee-saved registers are saved starting with a STP
# with predecrement, followed by a separate stack adjustment later,
# if the optsize attribute is set.

# CHECK: early-clobber $sp = frame-setup STPXpre killed $x19, killed $x20, $sp, -2
# CHECK-NEXT: frame-setup SEH_SaveRegP_X 19, 20, -16
# CHECK-NEXT: $sp = frame-setup SUBXri $sp, 16, 0
# CHECK-NEXT: frame-setup SEH_StackAlloc 16
# CHECK-NEXT: frame-setup SEH_PrologEnd

--- |

  define dso_local i32 @func(i32 %a) optsize { ret i32 %a }

...
---
name: func
alignment: 4
exposesReturnsTwice: false
legalized: false
regBankSelected: false
selected: false
failedISel: false
tracksRegLiveness: true
hasWinCFI: false
registers: []
liveins: []
frameInfo:
  isFrameAddressTaken: false
  isReturnAddressTaken: false
  hasStackMap: false
  hasPatchPoint: false
  stackSize: 0
  offsetAdjustment: 0
  maxAlignment: 4
  adjustsStack: false
  hasCalls: false
  stackProtector: ''
  maxCallFrameSize: 0
  cvBytesOfCalleeSavedRegisters: 0
  hasOpaqueSPAdjustment: false
  hasVAStart: false
  hasMustTailInVarArgFunc: false
  localFrameSize: 4
  savePoint: ''
  restorePoint: ''
fixedStack: []
stack:
  - { id: 0, name: '', type: default, offset: 0, size: 4, alignment: 4,
      stack-id: default, callee-saved-register: '', callee-saved-restored: true,
      local-offset: -4, debug-info-variable: '', debug-info-expression: '',
      debug-info-location: '' }
callSites: []
constants: []
machineFunctionInfo: {}
body: |
  bb.0:
    liveins: $x0, $x19, $x20

    renamable $x8 = ADDXri %stack.0, 0, 0
    $x19 = ADDXrr $x0, $x8
    $x20 = ADDXrr $x19, $x0
    $x0 = ADDXrr $x0, killed $x20

    RET_ReallyLR

...
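
The numbers in the CHECK lines can be derived from the frame described in the test. The C++ sketch below is only an illustration: ExpectedPrologue and expectedFor are hypothetical names, and it assumes an even number of 8-byte callee-saved GPRs, a local area rounded up to 16 bytes, and an STPXpre immediate expressed in units of 8 bytes (which is how the -2 in the CHECK line pairs with the -16 byte offset in SEH_SaveRegP_X).

```cpp
#include <cstdint>

// Round n up to the next multiple of 16 (AArch64 sp alignment).
constexpr uint64_t alignTo16(uint64_t n) { return (n + 15) & ~uint64_t(15); }

// Hypothetical summary of the operands the CHECK lines look for.
struct ExpectedPrologue {
  int64_t stpPreImm;      // STPXpre immediate, assumed scaled by 8 bytes
  int64_t sehSaveOffset;  // SEH_SaveRegP_X offset, in bytes
  uint64_t localAlloc;    // SUBXri / SEH_StackAlloc amount, in bytes
};

// Assumes an even number of 64-bit callee-saved GPRs and no frame pointer.
inline ExpectedPrologue expectedFor(uint64_t numCalleeSavedGPRs,
                                    uint64_t localFrameSize) {
  const int64_t csrBytes = static_cast<int64_t>(numCalleeSavedGPRs * 8);
  ExpectedPrologue e;
  e.sehSaveOffset = -csrBytes;      // predecrement by the whole save area
  e.stpPreImm = -csrBytes / 8;      // same amount in 8-byte units
  e.localAlloc = alignTo16(localFrameSize);
  return e;
}
```

For the test above (x19 and x20 saved, localFrameSize of 4), expectedFor(2, 4) yields an STPXpre immediate of -2, a SEH_SaveRegP_X offset of -16, and a local allocation of 16 bytes.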