This is an archive of the discontinued LLVM Phabricator instance.

[ArgumentPromotion] Bail if any callers are minsize and more instructions are added than removed
AcceptedPublic

Authored by aeubanks on May 3 2023, 9:59 AM.

Details

Summary

Argument promotion mostly works on functions with more than one caller (otherwise the function would be inlined or is dead), so there's a good chance that performing this increases code size since we introduce loads at every call site. If any caller is marked minsize, check that the number of loads introduced at callers isn't greater than the number of loads/stores we remove in the callee.

Diff Detail

Event Timeline

aeubanks created this revision.May 3 2023, 9:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2023, 9:59 AM
aeubanks requested review of this revision.May 3 2023, 9:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2023, 9:59 AM
nikic accepted this revision.May 3 2023, 11:04 AM

LGTM

This revision is now accepted and ready to land.May 3 2023, 11:04 AM
This revision was landed with ongoing or failed builds.May 3 2023, 11:29 AM
This revision was automatically updated to reflect the committed changes.
smeenai added a subscriber: smeenai.May 4 2023, 5:52 PM

Do you have data showing size improvements on your end? We're observing size regressions at -Oz across the board instead for Meta's Android apps (we build with LTO, but I haven't tested non-LTO builds). It'll take me a while to come up with a standalone example, but some of the largest regressions are in folly::Optional and std::make_shared, which matches ArgumentPromotion's description of helping with templated code.

If you don't data to the contrary, could we revert this? If you do have cases where this shows overall improvements, what can I provide to help tune this to not be a substantial regression for us?

we were seeing size increases with code like the following

define internal i32 @f1(ptr %p) {
  %i = load i32, ptr %p, align 4
  ret i32 %i
}

define i32 @g1(ptr %p) {
  %i = call i32 @f1(ptr %p)
  %q = call i32 @f1(ptr %p)
  %w = call i32 @f1(ptr %p)
  %e = call i32 @f1(ptr %p)
  %r = call i32 @f1(ptr %p)
  %t = call i32 @f1(ptr %p)
  %y = call i32 @f1(ptr %p)
  ret i32 %i
}

becoming

define internal i32 @f1(i32 %p.0.val) {
  ret i32 %p.0.val
}

define i32 @g1(ptr %p) {
  %p.val6 = load i32, ptr %p, align 4
  %i = call i32 @f1(i32 %p.val6)
  %p.val5 = load i32, ptr %p, align 4
  %1 = call i32 @f1(i32 %p.val5)
  %p.val4 = load i32, ptr %p, align 4
  %2 = call i32 @f1(i32 %p.val4)
  %p.val3 = load i32, ptr %p, align 4
  %3 = call i32 @f1(i32 %p.val3)
  %p.val2 = load i32, ptr %p, align 4
  %4 = call i32 @f1(i32 %p.val2)
  %p.val1 = load i32, ptr %p, align 4
  %5 = call i32 @f1(i32 %p.val1)
  %p.val = load i32, ptr %p, align 4
  %6 = call i32 @f1(i32 %p.val)
  ret i32 %i
}

I did add a TODO here for a potential place for improvement, but even with that it doesn't take into account the fact that we can likely simplify more with the value now in SSA form after mem2reg. but maybe that TODO is worth pursuing anyway

but another thing to consider is that we only recently started running argpromo for -O1/2/s/z. I can see two cases where this patch would regress -Oz code size. one is running an -O3 post-link with an -Oz pre-link, which wouldn't make sense in general. the second is that https://reviews.llvm.org/D148269 actually helped with -Oz code size in your case, then this patch regressed it back to where it was before

otherwise if I'm missing something and you really need to revert this, you can revert https://reviews.llvm.org/D148269 and this patch together

but another thing to consider is that we only recently started running argpromo for -O1/2/s/z. I can see two cases where this patch would regress -Oz code size. one is running an -O3 post-link with an -Oz pre-link, which wouldn't make sense in general. the second is that https://reviews.llvm.org/D148269 actually helped with -Oz code size in your case, then this patch regressed it back to where it was before

I was thinking about this too. I verified that this patch is what caused the regression, and then the reland of D148269 (which took place after this patch) had no effect for us.

How does that pipeline change work for FullLTO though? Does the Level == OptimizationLevel::O3 check only apply to the pre-link compilations, or does it apply to the actual FullLTO phase based on the --lto-Ox value as well? We should just be using the default for LTO, which is --lto-O2 IIRC. We do build some TUs with -O3 and some TUs with -Oz that are FullLTO'd together.

I'm trying to think of a way to provide a reduced example demonstrating the size increase, but it's proving to be pretty tricky. This is a pretty large FullLTO build from a mix of -Oz and -O3 resources, and there seem to be interactions with other optimizations (in particular outlining) that are significant, so it'll take me a while to narrow things down.

but another thing to consider is that we only recently started running argpromo for -O1/2/s/z. I can see two cases where this patch would regress -Oz code size. one is running an -O3 post-link with an -Oz pre-link, which wouldn't make sense in general. the second is that https://reviews.llvm.org/D148269 actually helped with -Oz code size in your case, then this patch regressed it back to where it was before

I was thinking about this too. I verified that this patch is what caused the regression, and then the reland of D148269 (which took place after this patch) had no effect for us.

How does that pipeline change work for FullLTO though? Does the Level == OptimizationLevel::O3 check only apply to the pre-link compilations, or does it apply to the actual FullLTO phase based on the --lto-Ox value as well? We should just be using the default for LTO, which is --lto-O2 IIRC. We do build some TUs with -O3 and some TUs with -Oz that are FullLTO'd together.

I'm trying to think of a way to provide a reduced example demonstrating the size increase, but it's proving to be pretty tricky. This is a pretty large FullLTO build from a mix of -Oz and -O3 resources, and there seem to be interactions with other optimizations (in particular outlining) that are significant, so it'll take me a while to narrow things down.

Ah you're using FullLTO, which always runs ArgPromo on the merged module for all opt levels, that makes sense then. I'm fine with reverting these two patches while you investigate. I'm not in a rush to get these landed, but I do think that some version of this patch is valuable for code size even on its own.

Thanks. Reverted and continuing to look into a repro.

Unfortunately my first attempt with llvm-reduce generated a reducer with UB :D I'm trying with a slightly more refined reduction script to try to avoid that. From eyeballing the code it seems like the SROA-like aspect of argpromote is yielding the size wins though.

aeubanks reopened this revision.May 8 2023, 4:27 PM
This revision is now accepted and ready to land.May 8 2023, 4:27 PM
aeubanks updated this revision to Diff 520510.May 8 2023, 4:27 PM
aeubanks retitled this revision from [ArgumentPromotion] Bail if any callers are minsize to [ArgumentPromotion] Bail if any callers are minsize and more instructions are added than removed.
aeubanks edited the summary of this revision. (Show Details)

check number of instructions added/removed to decide when to bail

could you test the latest patch? the heuristic should be better (perhaps not perfect because it doesn't take into account simplification after sroa'ing, but it's better than before).

aeubanks updated this revision to Diff 520520.May 8 2023, 4:50 PM

update comment

Thanks for the update! I'm working on testing the size change with the new revision, but our internal infra for that is a little broken right now, so it might take a bit. I'm also still struggling with llvm-reduce's tendency to introduce unreachable in reachable places and use uninitialized memory, unfortunately.

Thanks for the update! I'm working on testing the size change with the new revision, but our internal infra for that is a little broken right now, so it might take a bit. I'm also still struggling with llvm-reduce's tendency to introduce unreachable in reachable places and use uninitialized memory, unfortunately.

No rush :)

I'm not sure how you're trying using llvm-reduce, it's not meant for any meaningful semantics-preserving transformations.

Thanks for the update! I'm working on testing the size change with the new revision, but our internal infra for that is a little broken right now, so it might take a bit. I'm also still struggling with llvm-reduce's tendency to introduce unreachable in reachable places and use uninitialized memory, unfortunately.

No rush :)

I'm not sure how you're trying using llvm-reduce, it's not meant for any meaningful semantics-preserving transformations.

Ah, that'd explain a lot :) I was trying to use it with an interestingness test of "is the output from a toolchain with this change meaningfully larger than the output from a toolchain without", but I guess it's more intended for crash tests.

Okay, the new revision is better, but it's still a 204 KiB overall size increase for the Facebook Android app, which is considered significant. The previous iteration was a 244 KiB size increase, for reference.

I'm gonna try a -print-after-all -print-changed IR dump from the LTO run to see if I can spot any obvious causes. ArgPromo is kinda uniquely tricky to reason about that way because of the changes being scattered across the function itself and its callees though.

@smeenai any chance to take a look?

@smeenai any chance to take a look?

Sorry, not yet. I was hoping to spend time on this today, but something urgent came up. Hopefully I'll have more time for this tomorrow.

Okay, I think the heuristic is falling short by not considering that callers of the functions might have had the argument promoted themselves. As in, if I'm understanding this correctly, right now if you have function f with an argument which would have one load eliminated by argpromo but has two callers g and h, you won't promote the argument. However, g and h might have had the argument promoted themselves, so they're not actually paying any size cost for the promotion (their transitive callers might at some point, but it could still be worth it overall). There's also the potential for further simplification if e.g. only one member of a struct is used.