This is an archive of the discontinued LLVM Phabricator instance.

Differential D149768

[ArgumentPromotion] Bail if any callers are minsize and more instructions are added than removed
Accepted · Public

Authored by aeubanks on May 3 2023, 9:59 AM.

Details

Reviewers

nikic
fhahn

Commits

rG8b8466fd31e5: [ArgumentPromotion] Bail if any callers are minsize

Summary

Argument promotion mostly works on functions with more than one caller (otherwise the function would be inlined or is dead), so there's a good chance that performing this increases code size since we introduce loads at every call site. If any caller is marked minsize, check that the number of loads introduced at callers isn't greater than the number of loads/stores we remove in the callee.
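As a rough illustration of the check described in the summary, the decision can be modelled as a standalone predicate. This is a sketch, not the LLVM API: the name `shouldPromoteForSize` and its parameters are hypothetical, and it assumes every use of the callee is a direct call site that gains one load when the argument is promoted.

```cpp
// Hypothetical standalone model of the size check; not part of LLVM.
//   AnyCallerMinSize: at least one call site sits in a function marked minsize.
//   NumUses:          number of uses (direct call sites) of the callee.
//   LoadStoreCount:   loads/stores of the argument that promotion removes
//                     from the callee.
constexpr bool shouldPromoteForSize(bool AnyCallerMinSize, unsigned NumUses,
                                    unsigned LoadStoreCount) {
  // Without a minsize caller the size check does not apply. Otherwise,
  // promote only if the callee has no more uses than loads/stores removed.
  return !AnyCallerMinSize || NumUses <= LoadStoreCount;
}

// No minsize caller: promote even though two loads replace one.
static_assert(shouldPromoteForSize(false, 2, 1), "no minsize caller");
// One minsize caller, one load removed, two call sites: bail.
static_assert(!shouldPromoteForSize(true, 2, 1), "size would grow");
// One minsize caller, two loads removed, two call sites: still promote.
static_assert(shouldPromoteForSize(true, 2, 2), "size does not grow");
```

Under these assumptions, a callee with one removable load and two call sites is left alone as soon as any caller is minsize, while a callee with two removable loads and two call sites is still promoted.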

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

aeubanks created this revision. · May 3 2023, 9:59 AM

Herald added a project: Restricted Project. · May 3 2023, 9:59 AM

Herald added subscribers: hoy, ormris, StephenFan, hiraditya.

aeubanks requested review of this revision. · May 3 2023, 9:59 AM

Herald added a project: Restricted Project. · May 3 2023, 9:59 AM

Herald added a subscriber: llvm-commits.

aeubanks added a reviewer: fhahn. · May 3 2023, 10:36 AM

LGTM

This revision is now accepted and ready to land. · May 3 2023, 11:04 AM

This revision was landed with ongoing or failed builds. · May 3 2023, 11:29 AM

Closed by commit rG8b8466fd31e5: [ArgumentPromotion] Bail if any callers are minsize (authored by aeubanks).

This revision was automatically updated to reflect the committed changes.

aeubanks added a commit: rG8b8466fd31e5: [ArgumentPromotion] Bail if any callers are minsize.

Harbormaster completed remote builds in B229731: Diff 519140. · May 3 2023, 11:37 AM

aeubanks mentioned this in rG6f29d1adf298: Reland [Pipeline] Don't limit ArgumentPromotion to -O3. · May 3 2023, 1:22 PM

Do you have data showing size improvements on your end? We're observing size regressions at -Oz across the board instead for Meta's Android apps (we build with LTO, but I haven't tested non-LTO builds). It'll take me a while to come up with a standalone example, but some of the largest regressions are in folly::Optional and std::make_shared, which matches ArgumentPromotion's description of helping with templated code.

If you don't have data to the contrary, could we revert this? If you do have cases where this shows overall improvements, what can I provide to help tune this to not be a substantial regression for us?

we were seeing size increases with code like the following

define internal i32 @f1(ptr %p) {
  %i = load i32, ptr %p, align 4
  ret i32 %i
}

define i32 @g1(ptr %p) {
  %i = call i32 @f1(ptr %p)
  %q = call i32 @f1(ptr %p)
  %w = call i32 @f1(ptr %p)
  %e = call i32 @f1(ptr %p)
  %r = call i32 @f1(ptr %p)
  %t = call i32 @f1(ptr %p)
  %y = call i32 @f1(ptr %p)
  ret i32 %i
}

becoming

define internal i32 @f1(i32 %p.0.val) {
  ret i32 %p.0.val
}

define i32 @g1(ptr %p) {
  %p.val6 = load i32, ptr %p, align 4
  %i = call i32 @f1(i32 %p.val6)
  %p.val5 = load i32, ptr %p, align 4
  %1 = call i32 @f1(i32 %p.val5)
  %p.val4 = load i32, ptr %p, align 4
  %2 = call i32 @f1(i32 %p.val4)
  %p.val3 = load i32, ptr %p, align 4
  %3 = call i32 @f1(i32 %p.val3)
  %p.val2 = load i32, ptr %p, align 4
  %4 = call i32 @f1(i32 %p.val2)
  %p.val1 = load i32, ptr %p, align 4
  %5 = call i32 @f1(i32 %p.val1)
  %p.val = load i32, ptr %p, align 4
  %6 = call i32 @f1(i32 %p.val)
  ret i32 %i
}
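Counting IR instructions in the example above makes the growth concrete. The sketch below is only bookkeeping: `Counts` and `total` are hypothetical helper names, and IR instruction count is assumed to be a rough proxy for code size (later passes may change the final numbers).

```cpp
// Hypothetical bookkeeping for the @f1/@g1 example; not part of LLVM.
struct Counts {
  int Callee; // instructions in @f1
  int Caller; // instructions in @g1
};

constexpr int total(Counts C) { return C.Callee + C.Caller; }

// Before promotion: @f1 = load + ret; @g1 = 7 calls + ret.
constexpr Counts Before{2, 8};
// After promotion: @f1 = ret; @g1 = 7 loads + 7 calls + ret.
constexpr Counts After{1, 15};

static_assert(total(Before) == 10, "instruction count before promotion");
static_assert(total(After) == 16, "instruction count after promotion");
// One load leaves the callee, seven loads appear in the caller.
static_assert(total(After) - total(Before) == 6, "net growth");
```

With seven call sites and a single load in the callee, promotion trades one instruction for seven, which is the kind of case the minsize check is meant to reject.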

I did add a TODO here for a potential place for improvement, but even with that it doesn't take into account the fact that we can likely simplify more with the value now in SSA form after mem2reg. but maybe that TODO is worth pursuing anyway

but another thing to consider is that we only recently started running argpromo for -O1/2/s/z. I can see two cases where this patch would regress -Oz code size. one is running an -O3 post-link with an -Oz pre-link, which wouldn't make sense in general. the second is that https://reviews.llvm.org/D148269 actually helped with -Oz code size in your case, then this patch regressed it back to where it was before

otherwise if I'm missing something and you really need to revert this, you can revert https://reviews.llvm.org/D148269 and this patch together

In D149768#4320982, @aeubanks wrote:

but another thing to consider is that we only recently started running argpromo for -O1/2/s/z. I can see two cases where this patch would regress -Oz code size. one is running an -O3 post-link with an -Oz pre-link, which wouldn't make sense in general. the second is that https://reviews.llvm.org/D148269 actually helped with -Oz code size in your case, then this patch regressed it back to where it was before

I was thinking about this too. I verified that this patch is what caused the regression, and then the reland of D148269 (which took place after this patch) had no effect for us.

How does that pipeline change work for FullLTO though? Does the Level == OptimizationLevel::O3 check only apply to the pre-link compilations, or does it apply to the actual FullLTO phase based on the --lto-Ox value as well? We should just be using the default for LTO, which is --lto-O2 IIRC. We do build some TUs with -O3 and some TUs with -Oz that are FullLTO'd together.

I'm trying to think of a way to provide a reduced example demonstrating the size increase, but it's proving to be pretty tricky. This is a pretty large FullLTO build from a mix of -Oz and -O3 resources, and there seem to be interactions with other optimizations (in particular outlining) that are significant, so it'll take me a while to narrow things down.

In D149768#4322662, @smeenai wrote:

I was thinking about this too. I verified that this patch is what caused the regression, and then the reland of D148269 (which took place after this patch) had no effect for us.

How does that pipeline change work for FullLTO though? Does the Level == OptimizationLevel::O3 check only apply to the pre-link compilations, or does it apply to the actual FullLTO phase based on the --lto-Ox value as well? We should just be using the default for LTO, which is --lto-O2 IIRC. We do build some TUs with -O3 and some TUs with -Oz that are FullLTO'd together.

I'm trying to think of a way to provide a reduced example demonstrating the size increase, but it's proving to be pretty tricky. This is a pretty large FullLTO build from a mix of -Oz and -O3 resources, and there seem to be interactions with other optimizations (in particular outlining) that are significant, so it'll take me a while to narrow things down.

Ah you're using FullLTO, which always runs ArgPromo on the merged module for all opt levels, that makes sense then. I'm fine with reverting these two patches while you investigate. I'm not in a rush to get these landed, but I do think that some version of this patch is valuable for code size even on its own.

smeenai mentioned this in rG141be5c062ec: Revert "Reland [Pipeline] Don't limit ArgumentPromotion to -O3". · May 5 2023, 2:27 PM

smeenai added a reverting change: rG0e2b4b2dbac3: Revert "[ArgumentPromotion] Bail if any callers are minsize".

Thanks. Reverted and continuing to look into a repro.

Unfortunately my first attempt with llvm-reduce generated a reduced test case with UB :D I'm trying with a slightly more refined reduction script to try to avoid that. From eyeballing the code it seems like the SROA-like aspect of argpromote is yielding the size wins though.

aeubanks reopened this revision. · May 8 2023, 4:27 PM

This revision is now accepted and ready to land. · May 8 2023, 4:27 PM

check number of instructions added/removed to decide when to bail

could you test the latest patch? the heuristic should be better (perhaps not perfect because it doesn't take into account simplification after sroa'ing, but it's better than before).

update comment

Harbormaster completed remote builds in B230734: Diff 520520. · May 8 2023, 5:25 PM

Thanks for the update! I'm working on testing the size change with the new revision, but our internal infra for that is a little broken right now, so it might take a bit. I'm also still struggling with llvm-reduce's tendency to introduce unreachable in reachable places and use uninitialized memory, unfortunately.

In D149768#4330784, @smeenai wrote:

Thanks for the update! I'm working on testing the size change with the new revision, but our internal infra for that is a little broken right now, so it might take a bit. I'm also still struggling with llvm-reduce's tendency to introduce unreachable in reachable places and use uninitialized memory, unfortunately.

No rush :)

I'm not sure how you're trying to use llvm-reduce; it's not meant for any meaningful semantics-preserving transformations.

In D149768#4330790, @aeubanks wrote:

No rush :)

I'm not sure how you're trying to use llvm-reduce; it's not meant for any meaningful semantics-preserving transformations.

Ah, that'd explain a lot :) I was trying to use it with an interestingness test of "is the output from a toolchain with this change meaningfully larger than the output from a toolchain without", but I guess it's more intended for crash tests.

Okay, the new revision is better, but it's still a 204 KiB overall size increase for the Facebook Android app, which is considered significant. The previous iteration was a 244 KiB size increase, for reference.

I'm gonna try a -print-after-all -print-changed IR dump from the LTO run to see if I can spot any obvious causes. ArgPromo is kinda uniquely tricky to reason about that way because of the changes being scattered across the function itself and its callees though.

@smeenai any chance to take a look?

In D149768#4653908, @aeubanks wrote:

@smeenai any chance to take a look?

Sorry, not yet. I was hoping to spend time on this today, but something urgent came up. Hopefully I'll have more time for this tomorrow.

Okay, I think the heuristic is falling short by not considering that callers of the functions might have had the argument promoted themselves. As in, if I'm understanding this correctly, right now if you have function f with an argument which would have one load eliminated by argpromo but has two callers g and h, you won't promote the argument. However, g and h might have had the argument promoted themselves, so they're not actually paying any size cost for the promotion (their transitive callers might at some point, but it could still be worth it overall). There's also the potential for further simplification if e.g. only one member of a struct is used.

Revision Contents

Path | Size
llvm/lib/Transforms/IPO/ArgumentPromotion.cpp | 23 lines
llvm/test/Transforms/ArgumentPromotion/minsize.ll | 100 lines

Diff 520520

llvm/lib/Transforms/IPO/ArgumentPromotion.cpp

[456 lines not shown; context: return isDereferenceableAndAlignedPointer(CB.getArgOperand(Arg->getArgNo()),]
                                               NeededAlign, Bytes, DL);
   });
 }

 /// Determine that this argument is safe to promote, and find the argument
 /// parts it can be promoted into.
 static bool findArgParts(Argument *Arg, const DataLayout &DL, AAResults &AAR,
                          unsigned MaxElements, bool IsRecursive,
-                         SmallVectorImpl<OffsetAndArgPart> &ArgPartsVec) {
+                         SmallVectorImpl<OffsetAndArgPart> &ArgPartsVec,
+                         int &LoadStoreCount) {
   // Quick exit for unused arguments
   if (Arg->use_empty())
     return true;

   // We can only promote this argument if all the uses are loads at known
   // offsets.
   //
   // Promoting the argument causes it to be loaded in the caller
[132 lines not shown; context: if (auto *GEP = dyn_cast<GetElementPtrInst>(V)) {]
       AppendUses(V);
       continue;
     }

     if (auto *LI = dyn_cast<LoadInst>(V)) {
       if (!*HandleEndUser(LI, LI->getType(), /* GuaranteedToExecute */ false))
         return false;
       Loads.push_back(LI);
+      ++LoadStoreCount;
       continue;
     }

     // Stores are allowed for byval arguments
     auto *SI = dyn_cast<StoreInst>(V);
     if (AreStoresAllowed && SI &&
         U->getOperandNo() == StoreInst::getPointerOperandIndex()) {
       if (!*HandleEndUser(SI, SI->getValueOperand()->getType(),
                           /* GuaranteedToExecute */ false))
         return false;
+      ++LoadStoreCount;
       continue;
       // Only stores TO the argument is allowed, all the other stores are
       // unknown users
     }

     // Unknown user.
     LLVM_DEBUG(dbgs() << "ArgPromotion of " << *Arg << " failed: "
                       << "unknown user " << *V << "\n");
[113 lines not shown; context: static Function *promoteArguments(Function *F, FunctionAnalysisManager &FAM,]
   // First check: see if there are any pointer arguments! If not, quick exit.
   SmallVector<Argument *, 16> PointerArgs;
   for (Argument &I : F->args())
     if (I.getType()->isPointerTy())
       PointerArgs.push_back(&I);
   if (PointerArgs.empty())
     return nullptr;

+  bool MinSize = false;
+
   // Second check: make sure that all callers are direct callers. We can't
   // transform functions that have indirect callers. Also see if the function
   // is self-recursive.
   for (Use &U : F->uses()) {
     CallBase *CB = dyn_cast<CallBase>(U.getUser());
     // Must be a direct call.
     if (CB == nullptr || !CB->isCallee(&U) ||
         CB->getFunctionType() != F->getFunctionType())
       return nullptr;

     // Can't change signature of musttail callee
     if (CB->isMustTailCall())
       return nullptr;

+    // If the caller is marked minsize, this transformation may increase code
+    // size. We assume that there is more than one call to this function since
+    // otherwise this function would be inlined or is dead.
+    // Below we compare the number of loads/stores removed from the function with
+    // the number of introduced loads in callees to see if this is profitable
+    // code-size-wise.
+    if (CB->getFunction()->hasMinSize())
+      MinSize = true;
+
     if (CB->getFunction() == F)
       IsRecursive = true;
   }

   // Can't change signature of musttail caller
   // FIXME: Support promoting whole chain of musttail functions
   for (BasicBlock &BB : *F)
     if (BB.getTerminatingMustTailCall())
[18 lines not shown; context: if (PtrArg->hasStructRetAttr()) {]
         CallBase &CB = cast<CallBase>(*U.getUser());
         CB.removeParamAttr(ArgNo, Attribute::StructRet);
         CB.addParamAttr(ArgNo, Attribute::NoAlias);
       }
     }

     // If we can promote the pointer to its value.
     SmallVector<OffsetAndArgPart, 4> ArgParts;
+    int LoadStoreCount = 0;

-    if (findArgParts(PtrArg, DL, AAR, MaxElements, IsRecursive, ArgParts)) {
+    if (findArgParts(PtrArg, DL, AAR, MaxElements, IsRecursive, ArgParts,
+                     LoadStoreCount)) {
       SmallVector<Type *, 4> Types;
       for (const auto &Pair : ArgParts)
         Types.push_back(Pair.second.Ty);

-      if (areTypesABICompatible(Types, *F, TTI)) {
+      if (areTypesABICompatible(Types, *F, TTI) &&
+          !(MinSize && F->hasNUsesOrMore(LoadStoreCount + 1))) {
         NumArgsAfterPromote += ArgParts.size() - 1;
         ArgsToPromote.insert({PtrArg, std::move(ArgParts)});
       }
     }
   }

   // No promotable pointer arguments.
   if (ArgsToPromote.empty())
[59 lines not shown]

llvm/test/Transforms/ArgumentPromotion/minsize.ll

This file was added.

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 2
; RUN: opt -passes=argpromotion -S < %s | FileCheck %s

; Basic case without minsize, argpromo should happen.
define internal i32 @f1(ptr %p) {
; CHECK-LABEL: define internal i32 @f1
; CHECK-SAME: (i32 [[P_0_VAL:%.*]]) {
; CHECK-NEXT: ret i32 [[P_0_VAL]]
;
  %i = load i32, ptr %p
  ret i32 %i
}

define i32 @g1(ptr %p) {
; CHECK-LABEL: define i32 @g1
; CHECK-SAME: (ptr [[P:%.*]]) {
; CHECK-NEXT: [[P_VAL:%.*]] = load i32, ptr [[P]], align 4
; CHECK-NEXT: [[I:%.*]] = call i32 @f1(i32 [[P_VAL]])
; CHECK-NEXT: ret i32 [[I]]
;
  %i = call i32 @f1(ptr %p)
  ret i32 %i
}

define i32 @g2(ptr %p) {
; CHECK-LABEL: define i32 @g2
; CHECK-SAME: (ptr [[P:%.*]]) {
; CHECK-NEXT: [[P_VAL:%.*]] = load i32, ptr [[P]], align 4
; CHECK-NEXT: [[I:%.*]] = call i32 @f1(i32 [[P_VAL]])
; CHECK-NEXT: ret i32 [[I]]
;
  %i = call i32 @f1(ptr %p)
  ret i32 %i
}

; With a minsize caller, argpromo shouldn't happen because we only eliminate one load but introduce two loads.
define internal i32 @f2(ptr %p) {
; CHECK-LABEL: define internal i32 @f2
; CHECK-SAME: (ptr [[P:%.*]]) {
; CHECK-NEXT: [[I:%.*]] = load i32, ptr [[P]], align 4
; CHECK-NEXT: ret i32 [[I]]
;
  %i = load i32, ptr %p
  ret i32 %i
}

define i32 @h1(ptr %p) minsize {
; CHECK-LABEL: define i32 @h1
; CHECK-SAME: (ptr [[P:%.*]]) #[[ATTR0:[0-9]+]] {
; CHECK-NEXT: [[I:%.*]] = call i32 @f2(ptr [[P]])
; CHECK-NEXT: ret i32 [[I]]
;
  %i = call i32 @f2(ptr %p)
  ret i32 %i
}

define i32 @h2(ptr %p) {
; CHECK-LABEL: define i32 @h2
; CHECK-SAME: (ptr [[P:%.*]]) {
; CHECK-NEXT: [[I:%.*]] = call i32 @f2(ptr [[P]])
; CHECK-NEXT: ret i32 [[I]]
;
  %i = call i32 @f2(ptr %p)
  ret i32 %i
}

; With a minsize caller, argpromo should still happen because we eliminate two loads and introduce two loads.
define internal i32 @f3(ptr %p) {
; CHECK-LABEL: define internal i32 @f3
; CHECK-SAME: (i32 [[P_0_VAL:%.*]]) {
; CHECK-NEXT: [[R:%.*]] = add i32 [[P_0_VAL]], [[P_0_VAL]]
; CHECK-NEXT: ret i32 [[R]]
;
  %i = load i32, ptr %p
  %i2 = load i32, ptr %p
  %r = add i32 %i, %i2
  ret i32 %r
}

define i32 @i1(ptr %p) minsize {
; CHECK-LABEL: define i32 @i1
; CHECK-SAME: (ptr [[P:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[P_VAL:%.*]] = load i32, ptr [[P]], align 4
; CHECK-NEXT: [[I:%.*]] = call i32 @f3(i32 [[P_VAL]])
; CHECK-NEXT: ret i32 [[I]]
;
  %i = call i32 @f3(ptr %p)
  ret i32 %i
}

define i32 @i2(ptr %p) {
; CHECK-LABEL: define i32 @i2
; CHECK-SAME: (ptr [[P:%.*]]) {
; CHECK-NEXT: [[P_VAL:%.*]] = load i32, ptr [[P]], align 4
; CHECK-NEXT: [[I:%.*]] = call i32 @f3(i32 [[P_VAL]])
; CHECK-NEXT: ret i32 [[I]]
;
  %i = call i32 @f3(ptr %p)
  ret i32 %i
}