Page MenuHomePhabricator

Please use GitHub pull requests for new patches. Avoid migrating existing patches. Phabricator shutdown timeline

[FuncSpec] Make the Function Specializer part of the IPSCCP pass.
ClosedPublic

Authored by labrinea on May 26 2022, 3:09 AM.

Details

Summary

The aim of this patch is to minimize the compilation time overhead of running Function Specialization. It is about 40% slower to run as a standalone pass (IPSCCP + FuncSpec vs IPSCCP with FuncSpec) according to my measurements. I compiled the llvm testsuite with NewPM-O3 + LTO and measured single threaded [user + system] time of IPSCCP and FuncSpec by passing the '-time-passes' option to lld. Then I compared the two configurations in terms of Instruction Count of the total compilation (not of the individual passes) as in https://llvm-compile-time-tracker.com. Geomean for non-LTO builds is -0.25% and LTO is -0.5% approximately.

You can find more info below:
https://discourse.llvm.org/t/rfc-should-we-enable-function-specialization/61518

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
labrinea added inline comments.Oct 30 2022, 8:52 AM
llvm/lib/Transforms/Scalar/SCCP.cpp
161–172

Update: I tried this. It works for 'some' cases. Instead of replacing values with constants I create mappings from the old to the new value and only after all the solving is done then I replace the uses. The specialization of recursive functions doesn't work because it relies on finding allocas of constant integers. Also the rewriting of callsites doesn't work either if the actual arguments have been constant propagated prior to specialization, but the old value hasn't been replaced yet. In theory I could pass on the mappings from sccp to the specializer but it seems overly complicated to do so.

labrinea updated this revision to Diff 472022.Oct 31 2022, 8:39 AM

Changes from last revision:

  • rebased on top of main; that required adjusting a couple of pass-manager tests as the LoopInfo analysis is now used in the default pipelines by the sccp pass
  • lazily eraseFromParent instructions which have been replaced instead of removeFromParent and deleteValue later as suggested by @chill
  • used markUsersAsChanged instead of visit as suggested by @chill (required exposing it to the public interface)
  • moved/renamed solveWhileResolvedUndefsIn to the solver as it is required from the specializer too (fixes a bug I have added a new testcase for)
  • changed createSpecialization to return a pointer to the cloned function
chill added inline comments.Oct 31 2022, 9:39 AM
llvm/lib/Transforms/IPO/SCCP.cpp
41

This part was added for the FunctionSpecialization, if func spec is disabled maybe not pass along the LoopAnalysis?

chill added inline comments.Nov 2 2022, 4:08 AM
llvm/lib/Transforms/IPO/SCCP.cpp
48

Now that we added LoopAnalysis we may well preserve it too. (I should have included it in the patch which introduced the LoopAnalysis here)

Good point. Running the LoopInfo analysis when we do not specialize adds unnecessary compile time overhead. After some investigation I found that notifying the users after replacing a value (instead of notifying only the old users), as well as not removing replaced instructions, both add significant compile time overhead. I am inclined to revert the latter and rework on the former so that markUsersAsChanged can accept a list of users.

labrinea updated this revision to Diff 472680.Nov 2 2022, 10:44 AM

Changes from last revision:

  • Predicated the LoopInfo analysis on the cmd-line option that enables function specialization. As a result we no longer need to adjust the pass manager unit-tests.
  • Reverted the eraseFromParent to removeFromParent/deleteValue for replaced instructions to save compilation time.
  • Modified markUsersAsChanged to accept a UserList so that we avoid revisiting irrelevant users after replacing a value (saves compilation time).

Note: The instruction count delta for CTMark with NewPM-O3+LTO between baseline (da5ded4fc9d8c8edfd4a79fa0e75c2ac9165fa7b) and this patch (with funcspec disabled) is about 0.05 % geomean.

labrinea marked 7 inline comments as done.Nov 2 2022, 10:48 AM
labrinea added inline comments.
llvm/lib/Transforms/IPO/SCCP.cpp
48

I tried this but the compiler crashes. Probably because SCCP deletes dead basic blocks.

chill added inline comments.Nov 2 2022, 12:26 PM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
246

WorkList is a too generic name for a parameter and it's not a worklist per se anyway.

335

This is a little bit fragile in the sense the caller may forget to clear the list.
It would be nicer if this function itself clears the list first thing when it starts execution.

WorkList could also be made a return, taking advantage of move semantics, which looks nicer on paper, but
may cause a few allocations/deallocations if we iterate.

llvm/lib/Transforms/Scalar/SCCP.cpp
116

Wouldn't it work without the temporary vector?

markUsersAsChanged would go over each user, look at the user's operands (including Old),
and find the New (which is some constant) form the lattice values map.

Thus we would maybe get just:

Solver.markUsersAsChanged(Old);
Old->replaceAllUsesWith(New);
161–172

Instead of replacing values with constants I create mappings from the old to the new value ..

But isn't this what the ValueState already contains?

Also the rewriting of callsites doesn't work either if the actual arguments have been constant propagated prior to specialization, but the old value hasn't been replaced yet.

Well, FunctionSpecializer::rewriteCallSites and everything else should lookup lattice values, not work directly with operands.

But OK, let's not make too many changes at once and revisit it later.

661

I would suggest not creating a vector of all the functions in the module as they could be quite a lot (e.g. in LTO)
and thus trigger several heap allocations for WorkList.

solveWhileResolvedUndefIn is quite small and could be overloaded for a Module * parameter.

I considered making this function a template along the lines of:

template<typename RangeT>
void printNames(RangeT &&R) {
    for (auto &F : R)
      llvm::dbgs() << magic(F)->getName();
}

std::vector<llvm::Function *> v;

llvm::Module *M;

int main() {
    printNames(M->functions());
    printNames(v);
}

but couldn't come up with magic.

As for propagateConstants it could be done with a few overloads as well:

static bool propagateConstants(SCCPSolver &Solver, Function *F,
                               SmallPtrSetImpl<Instruction *> &ToDelete);

static bool propagateConstants(SCCPSolver &Solver,
                               SmallVectorImpl<Function *> &WorkList,
                               SmallPtrSetImpl<Instruction *> &ToDelete) {
  for (Function *F : WorkList)
      propagateConstants(Solve, F, ToDelete);
}

static bool propagateConstants(SCCPSolver &Solver, Module *M,
                               SmallPtrSetImpl<Instruction *> &ToDelete) {
  for (auto &F : Module)
      propagateConstants(Solve, &F, ToDelete);
}
llvm/lib/Transforms/Utils/SCCPSolver.cpp
1577 ↗(On Diff #472022)

All the functions here forward to the Visitor, this one should also just be forwarding.

(Not sure why this proxy class exists at all, but I guess we can address it later).

labrinea updated this revision to Diff 473696.Nov 7 2022, 9:26 AM
labrinea marked an inline comment as done.

This revision is somewhat different from the previous ones because we no longer replace instructions/arguments whilst in the main specialization loop of the SCCP pass. Instead we use the lattice value when rewritting callsites and when promoting constant stack values. Last time I checked this was even more lightweight from the previous revision. Builds successfully the llvm-test-suite with aggressive funcspec options; haven't tried clang bootstrap yet. Also improves one of the unit tests with recursive functions.

labrinea marked 4 inline comments as done.Nov 7 2022, 9:44 AM
labrinea added inline comments.
llvm/include/llvm/Transforms/IPO/FunctionSpecialization.h
70

I've removed an unused typedef from here ;)

llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
207–208

I am not sure whether this is necessary. The unit tests which exercise recursion are passing without it at least.

281

We now call this function once, not for every clone as it used to be.

709

no need to examine lattices of arguments if it's the key of the CallSpecBinding

711–715

There might be more call sites to rewrite than those in the CallSpecBinding that we have already found, therefore we need to repeat this look up of users here, but at least it now happens once for F compared to Clones.size() times which was the case before.

712

the condition was different before, but I think this is correct

716–717

We are modifying the list whist traversing it, so we swap the current element with the last one and reduce the iteration range by one.

llvm/lib/Transforms/Utils/SCCPSolver.cpp
465–483 ↗(On Diff #473696)

I couldn't template these two. The main reason was that one iterates over Function * whereas the other over Function &.

chill added inline comments.Nov 8 2022, 8:02 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
716–725

You can swap the order of the loops and get rid of CallSiteToRewrite.

chill added inline comments.Nov 8 2022, 8:30 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
716–725

Maybe I'm getting a bit ahead of me, but you can change the function to rewrite just a single call site and factor out the iteration over call sites. The benefit is the function becomes more reusable in as you can independently choose the set of call sites it operates upon. (Incidently, I'm planning to use it that way, but it generally a good change, even if what I have in mind turns out non-working).

labrinea added inline comments.Nov 9 2022, 12:52 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
716–725

You mean to traverse F's users here instead? We are iterating over them while modifying them, which is the reason why CallSiteToRewrite existed in the first place I believe. Also the dynamic cast and other checks we do: CS->getCalledFunction() == F && Solver.isBlockExecutable(CS->getParent()) (note: this one is missing from the current revision) don't need to be repeated on every iteration of the outer loop which walks the specializaions.

716–725

If I change it the way you suggest it will regress the current behaviour. Qsort() from SPEC's mcf won't specialize (it's a recursive function) and it's been quite a drive for this work. What if you alter it once you refactor how we rewrite callsites?

chill added inline comments.Nov 9 2022, 2:23 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
716–725

What we have now is:

forall S : Specialisations {
  forall  C : CallSites {
   do_stuff(S, C);
 }
}

What I'm suggesting is to reorder the loops

forall C: call sites {
   forall S: specialisations {
    do_stuff(S, C);
  }
}

That'll avoid the swaps/pop_back. The vector itself stays, good point.

And then I'm suggesting to move the outer loop out of the function:

func foo() {
 ...
  forall C: CallSites {
    rewriteCallSite(C)
  }
...
}

func rewriteCallSite(C) {
    forall S : Specialisations {
     do_stuff(S, C);
   }
}

Both are NFC.

Also the dynamic cast and other checks we do: CS->getCalledFunction() == F && Solver.isBlockExecutable(CS->getParent()) (note: this one is missing from the current revision) don't need to be repeated on every iteration of the outer loop which walks the specializaions.

I don't understand this. We do nothing between loops, so interchanging them will execute exactly the same operations in the loop body.

If you add these checks somewhere, it's another argument to move the iteration over call sites to the outer loop.

chill added inline comments.Nov 9 2022, 3:49 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
707–708

This is better placed outside of rewriteCallSites, perhaps just after the call to rewriteCallSites.

chill added inline comments.Nov 9 2022, 4:24 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
239

I believe the LLVM convention for these kinds of classes/methods is run, e.g. Vectorizer::run(), EarlyCSE::run(), etc.

chill added inline comments.Nov 10 2022, 2:13 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
707–708

Or a better idea: get the initial size of CallSitesToRewrite, decrement that number every time you update a call site. At the end if this number drops to zero mark the function unreachable.

chill added inline comments.Nov 10 2022, 2:26 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
711

Use braces around the for, since there are more than two levels of nesting.

labrinea added inline comments.Nov 10 2022, 5:28 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
707–708

that won't work for dead recursive functions

labrinea updated this revision to Diff 474522.Nov 10 2022, 5:41 AM

Changes from last revision:

  • renamed specialize() to run() as suggested
  • renamed rewriteCallSites to updateCallSites
  • interchanged the loops in updateCallSites as suggested
  • update callsites of specializations before anything else (changes the test identical-specializations.ll)
  • used braces for nested loop
  • clang formated
labrinea marked 5 inline comments as done.Nov 10 2022, 5:44 AM
labrinea updated this revision to Diff 474555.Nov 10 2022, 8:35 AM

Found a bug in the last revision. The swapping idiom violates the expected order of traversing the call sites to update. They are supposed to be sorted by gain.

labrinea updated this revision to Diff 474673.Nov 10 2022, 11:58 PM

Changes from last revison:

  • Used a counter for updated callsites instead of revising them at the end to identify dead functions.
labrinea added inline comments.Nov 14 2022, 1:47 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
708

I'll rename this and add a comment to explain what it is used for.

labrinea updated this revision to Diff 475858.Nov 16 2022, 10:11 AM

Changes from last revision:

  • rebased
  • renamed the variable which determines whether we decrement the counter of left callsites to replace
  • cached the cloned function pointer to the SpecializationInfo to avoid keeping the Clone vector in sync with the Specialization vector when replacing callsites

Also removed the immediate replacement of known callsites. It's better to lookup the lattice value instead. This fixed a compile time regression possibly caused by otherwise dead duplicate specializations that had to be processed by later passes.

chill added a comment.Nov 18 2022, 4:21 AM

Would you, please, mark as done the no longer relevant comments?
I think the only issue left is with the circular dependency between libraries.

labrinea marked 7 inline comments as done and an inline comment as not done.Nov 18 2022, 8:44 AM

Would you, please, mark as done the no longer relevant comments?
I think the only issue left is with the circular dependency between libraries.

Ack

labrinea updated this revision to Diff 477766.Nov 24 2022, 6:17 AM

Removed the cyclic dependency between LLVMipo and LLVMScalarOpts and rebased on top of D138654.

chill added a comment.Nov 29 2022, 8:20 AM

LGTM, but let's give a chance for other people to have a look too. @sinan @fhahn

labrinea updated this revision to Diff 479303.Dec 1 2022, 8:16 AM
labrinea edited the summary of this revision. (Show Details)
  • rebase
  • migrated all funcspec tests to use the -passes= cmdline option
chill accepted this revision.Dec 5 2022, 1:30 AM
This revision is now accepted and ready to land.Dec 5 2022, 1:30 AM
fhahn added inline comments.Dec 5 2022, 2:17 AM
llvm/lib/Transforms/IPO/SCCP.cpp
25

move this to the the loop below, which uses it

llvm/lib/Transforms/Utils/SCCPSolver.cpp
266 ↗(On Diff #479303)

I think DomTreeUpdater provides a constructor that doesn't take a DT which could be used unconditionally instead of having all those if (DTU) checks spread out across various functions.

labrinea updated this revision to Diff 480052.Dec 5 2022, 4:40 AM
labrinea edited the summary of this revision. (Show Details)
  • Moved a flag close to its use.
  • Used a DomTreeUpdater without DT/PDT for cloned functions.
labrinea marked 2 inline comments as done.Dec 5 2022, 4:41 AM
This revision was landed with ongoing or failed builds.Dec 8 2022, 4:23 AM
This revision was automatically updated to reflect the committed changes.