This is an archive of the discontinued LLVM Phabricator instance.

[FuncSpec] Make the Function Specializer part of the IPSCCP pass.
ClosedPublic

Authored by labrinea on May 26 2022, 3:09 AM.

Details

Summary

The aim of this patch is to minimize the compilation time overhead of running Function Specialization. It is about 40% slower to run as a standalone pass (IPSCCP + FuncSpec vs IPSCCP with FuncSpec) according to my measurements. I compiled the llvm testsuite with NewPM-O3 + LTO and measured single threaded [user + system] time of IPSCCP and FuncSpec by passing the '-time-passes' option to lld. Then I compared the two configurations in terms of Instruction Count of the total compilation (not of the individual passes) as in https://llvm-compile-time-tracker.com. Geomean for non-LTO builds is -0.25% and LTO is -0.5% approximately.

You can find more info below:
https://discourse.llvm.org/t/rfc-should-we-enable-function-specialization/61518

Diff Detail

Event Timeline

labrinea created this revision.May 26 2022, 3:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 26 2022, 3:09 AM
labrinea requested review of this revision.May 26 2022, 3:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 26 2022, 3:09 AM
labrinea planned changes to this revision.May 26 2022, 3:11 AM

This is a proof of concept for RFC: Should we enable Function Specialization?, not ready for review.

labrinea updated this revision to Diff 441002.Jun 29 2022, 7:29 AM
labrinea retitled this revision from [WIP][FuncSpec] Make the Function Specializer part of the IPSCCP pass. to [FuncSpec] Make the Function Specializer part of the IPSCCP pass..
labrinea added a reviewer: llvm-commits.

rebased + fixed tests

sinan added a subscriber: sinan.Jul 11 2022, 11:15 PM

tryToReplaceWithConstant method in SCCP does not update the lattice value map at SCCPSolver, and it might lead to a problem that

if

%arg = getelementptr %struct, %struct* @Global, i32 0, i32 3
%tmp0 = call i64 @func2(i64* %arg)

is folded into

%tmp0 = call i64 @func2(i64* getelementptr inbounds %struct, %struct* @Global, i32 0, i32 3)

a new callbase argument appears, but it is not recorded at SCCPSolver, and this leads to problems such as D128822.

Suggestion:
FunctionSpecializer::tryToReplaceWithConstant use the code piece below to update the propagated argument, and maybe we need such a change for tryToReplaceWithConstant as well.

for (auto *I : UseInsts)
  Solver.visit(I);
labrinea edited reviewers, added: efriedma; removed: eli.friedman.Aug 1 2022, 10:39 AM

tryToReplaceWithConstant method in SCCP does not update the lattice value map at SCCPSolver, and it might lead to a problem that

if

%arg = getelementptr %struct, %struct* @Global, i32 0, i32 3
%tmp0 = call i64 @func2(i64* %arg)

is folded into

%tmp0 = call i64 @func2(i64* getelementptr inbounds %struct, %struct* @Global, i32 0, i32 3)

a new callbase argument appears, but it is not recorded at SCCPSolver, and this leads to problems such as D128822.

Suggestion:
FunctionSpecializer::tryToReplaceWithConstant use the code piece below to update the propagated argument, and maybe we need such a change for tryToReplaceWithConstant as well.

for (auto *I : UseInsts)
  Solver.visit(I);

I am doing this in a later patch (see D126456) as I wanted to keep this one as close as possible to the original implementation.

labrinea edited the summary of this revision. (Show Details)Aug 15 2022, 1:44 AM
fhahn added inline comments.Aug 17 2022, 1:51 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
291

can you fix the indentation here in a NFC?

651

IIUC this is done during in between solver runs, right? Is this needed? Isn't it sufficient to continue with the constant value in the value mapping? This would probably remove the need to tell the solver to forget instructions/values.

labrinea added inline comments.Aug 17 2022, 9:22 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
291

How? It seems ok.

651

This is not the only invocation of tryToReplaceWithConstant in FuncSpec. On this instance we try to replace the arguments of cloned functions. There's another invocation inside the functor RunSCCPSolver. On that instance we try to replace the instructions of cloned functions. Both calls occur as many times as FuncSpecializationMaxIters is set to.

Moreover, the SCCP pass itself does the same thing on arguments of tracked functions and on instructions of executable blocks (with tryToReplaceWithConstant and simplifyInstsInBlock accordingly). This happens after the Solver runs and before the Function Specializer is invoked. Therefore, I think we still need to tell the Solver to forget instructions/values if we want to merge the two passes.

chill added a subscriber: chill.Oct 10 2022, 4:14 AM
chill added inline comments.
llvm/lib/Transforms/Scalar/SCCP.cpp
509 ↗(On Diff #441002)

IMHO, the invocation of the FunctionSpecialization pass ought to happen in this place. The general flow would be like:

  1. Initialise solver
  2. Run solver once (Solver.solve() + resolvedUndefsIn loop)
  3. Run function specialisation
  4. Run solver again
  5. Optionally go to 2.
  6. Do replacements (from line 512 on)

At no point before the last step the passes ought to replace or delete anything (well, except called function operand for cloned functions). If an operand/argument is determined to be a constant, it does not need to be replaced right away, because the passes should consult its lattice value.

Yeah, the devil is in the details, but this is the approach to merging the tow passes, as I see it.

labrinea updated this revision to Diff 469041.Oct 19 2022, 2:50 PM
labrinea added a reviewer: momchil.velikov.

I have moved the invocation of the specializer earlier in the ipsccp pass, such that no instructions get deleted until all the solving is done. This essentially makes all of D128822, D128823, D128824, D128825, D126456 and D128827 obsolete. There is one thing I couldn't get working, which is to update the lattice value of the callsites to specialized functions. Unfortunately the semantics of the solver do not allow lattices to move from a generic to a more specific state (i.e. from a wider to a narrower constant range). That said the zapping of returned values won't work on specialized functions, neither we can propagate a constant returned value to the function body where the callsite resides. On another note, the function analysis information which is provided to the pass (predication info, dominator tree) cannot be used on the specialized functions; not sure if that's a problem though.

labrinea added inline comments.Oct 25 2022, 3:01 AM
llvm/lib/Transforms/Scalar/SCCP.cpp
161–172 ↗(On Diff #469041)

Just found that we need to do the same inside replaceSignedInst() too. I will move this code a function.

chill edited reviewers, added: chill; removed: momchil.velikov.Oct 27 2022, 7:26 AM
chill added inline comments.Oct 27 2022, 8:30 AM
llvm/lib/Transforms/Scalar/CMakeLists.txt
97 ↗(On Diff #469041)

Why add IPO here ?

llvm/lib/Transforms/Scalar/SCCP.cpp
161–172 ↗(On Diff #469041)

Would it be possible to call markUsersAsChanged here ?

237 ↗(On Diff #469041)

Is there a specific reason to remove the instruction from the block?
If not, I'd suggest doing deletion in a single place, as opposed to spreading parts of it all over.

254 ↗(On Diff #469041)

Likewise.

labrinea added inline comments.Oct 28 2022, 4:07 AM
llvm/lib/Transforms/Scalar/CMakeLists.txt
97 ↗(On Diff #469041)

I vaguely remember a link time error without this change. See also llvm/utils/gn/secondary/llvm/lib/Transforms/Scalar/BUILD.gn at the bottom of this diff. The IPSCCP pass now depends on the FunctionSpecializer whose cpp file is under the IPO directory.

llvm/lib/Transforms/Scalar/SCCP.cpp
161–172 ↗(On Diff #469041)

I think we can't because if we replace the uses first then the users of the old value will be empty. Can we markUsersAsChanged before we replaceAllUsesWith the new value? Btw markUsersAsChanged is private for the SCCPInstVisitor, but I suppose I could make it public if need be.

237 ↗(On Diff #469041)

I am not entirely sure. I wanted to avoid revisiting this instruction accidentally in either of simplifyInstsInBlock(), solve(), or resolvedUndefsIn(). For simplifyInstsInBlock() I could skip the instruction if it's present in ToDelete. For the others I don't know what the consequences of revisitng would be. I need to run some tests first.

labrinea added inline comments.Oct 28 2022, 6:09 AM
llvm/lib/Transforms/Scalar/SCCP.cpp
161–172 ↗(On Diff #469041)

Actually I could call markUsersAsChanged on the new Instrcution after replacing the uses of the old Instruction with it.

chill added inline comments.Oct 28 2022, 7:33 AM
llvm/lib/Transforms/Scalar/CMakeLists.txt
97 ↗(On Diff #469041)

IPO already depends on Scalar, i.e. in IPO/CMakeLists.txt we have

...

  COMPONENT_NAME
  IPO

  LINK_COMPONENTS
...
  Scalar
...

Looks like a circular dependency.

Perhaps FunctionSpecialization needs to go to Utils (alongside SCCPSolver).
Or runIPSCCP needs to go to IPO/SCCP.cpp.
Or both.

llvm/lib/Transforms/Scalar/SCCP.cpp
161–172 ↗(On Diff #469041)

OK, let's leave it hanging for now, until I can take a look on top of the latest trunk.

Ideally, we are trying to avoid changing code until the Solver is done.
Here we have found that an instruction has constant lattice value - we should not replace the users' operands
right away, but notify the Solver. The Solver in turn would add the instructions that need reexamining to the instructions worklist and update their lattice values the next time we invoke Solvet.solve().

Most likely SCCPSolver::visit should become private, the Solver (and the SCCP algorithm in general) is
driven by its worklists, we should stick to this design: want something done - add it to the worklist.

237 ↗(On Diff #469041)

I can't see why would anything go wrong if the instruction is revisited.
Do we know if the instruction is safe to remove? It could be SDiv/SRem with a zero divisor.

chill added inline comments.Oct 28 2022, 6:26 PM
llvm/lib/Transforms/Scalar/SCCP.cpp
237 ↗(On Diff #469041)

Actually, never mind, we're not replacing the instruction with a constant but with another instruction.

labrinea added inline comments.Oct 30 2022, 8:52 AM
llvm/lib/Transforms/Scalar/SCCP.cpp
161–172 ↗(On Diff #469041)

Update: I tried this. It works for 'some' cases. Instead of replacing values with constants I create mappings from the old to the new value and only after all the solving is done then I replace the uses. The specialization of recursive functions doesn't work because it relies on finding allocas of constant integers. Also the rewriting of callsites doesn't work either if the actual arguments have been constant propagated prior to specialization, but the old value hasn't been replaced yet. In theory I could pass on the mappings from sccp to the specializer but it seems overly complicated to do so.

labrinea updated this revision to Diff 472022.Oct 31 2022, 8:39 AM

Changes from last revision:

  • rebased on top of main; that required adjusting a couple of pass-manager tests as the LoopInfo analysis is now used in the default pipelines by the sccp pass
  • lazily eraseFromParent instructions which have been replaced instead of removeFromParent and deleteValue later as suggested by @chill
  • used markUsersAsChanged instead of visit as suggested by @chill (required exposing it to the public interface)
  • moved/renamed solveWhileResolvedUndefsIn to the solver as it is required from the specializer too (fixes a bug I have added a new testcase for)
  • changed createSpecialization to return a pointer to the cloned function
chill added inline comments.Oct 31 2022, 9:39 AM
llvm/lib/Transforms/IPO/SCCP.cpp
385

This part was added for the FunctionSpecialization, if func spec is disabled maybe not pass along the LoopAnalysis?

chill added inline comments.Nov 2 2022, 4:08 AM
llvm/lib/Transforms/IPO/SCCP.cpp
392

Now that we added LoopAnalysis we may well preserve it too. (I should have included it in the patch which introduced the LoopAnalysis here)

Good point. Running the LoopInfo analysis when we do not specialize adds unnecessary compile time overhead. After some investigation I found that notifying the users after replacing a value (instead of notifying only the old users), as well as not removing replaced instructions, both add significant compile time overhead. I am inclined to revert the latter and rework on the former so that markUsersAsChanged can accept a list of users.

labrinea updated this revision to Diff 472680.Nov 2 2022, 10:44 AM

Changes from last revision:

  • Predicated the LoopInfo analysis on the cmd-line option that enables function specialization. As a result we no longer need to adjust the pass manager unit-tests.
  • Reverted the eraseFromParent to removeFromParent/deleteValue for replaced instructions to save compilation time.
  • Modified markUsersAsChanged to accept a UserList so that we avoid revisiting irrelevant users after replacing a value (saves compilation time).

Note: The instruction count delta for CTMark with NewPM-O3+LTO between baseline (da5ded4fc9d8c8edfd4a79fa0e75c2ac9165fa7b) and this patch (with funcspec disabled) is about 0.05 % geomean.

labrinea marked 7 inline comments as done.Nov 2 2022, 10:48 AM
labrinea added inline comments.
llvm/lib/Transforms/IPO/SCCP.cpp
392

I tried this but the compiler crashes. Probably because SCCP deletes dead basic blocks.

chill added inline comments.Nov 2 2022, 12:26 PM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
278

WorkList is a too generic name for a parameter and it's not a worklist per se anyway.

373

This is a little bit fragile in the sense the caller may forget to clear the list.
It would be nicer if this function itself clears the list first thing when it starts execution.

WorkList could also be made a return, taking advantage of move semantics, which looks nicer on paper, but
may cause a few allocations/deallocations if we iterate.

llvm/lib/Transforms/Scalar/SCCP.cpp
161–172 ↗(On Diff #469041)

Instead of replacing values with constants I create mappings from the old to the new value ..

But isn't this what the ValueState already contains?

Also the rewriting of callsites doesn't work either if the actual arguments have been constant propagated prior to specialization, but the old value hasn't been replaced yet.

Well, FunctionSpecializer::rewriteCallSites and everything else should lookup lattice values, not work directly with operands.

But OK, let's not make too many changes at once and revisit it later.

116 ↗(On Diff #472680)

Wouldn't it work without the temporary vector?

markUsersAsChanged would go over each user, look at the user's operands (including Old),
and find the New (which is some constant) form the lattice values map.

Thus we would maybe get just:

Solver.markUsersAsChanged(Old);
Old->replaceAllUsesWith(New);
635 ↗(On Diff #472022)

I would suggest not creating a vector of all the functions in the module as they could be quite a lot (e.g. in LTO)
and thus trigger several heap allocations for WorkList.

solveWhileResolvedUndefIn is quite small and could be overloaded for a Module * parameter.

I considered making this function a template along the lines of:

template<typename RangeT>
void printNames(RangeT &&R) {
    for (auto &F : R)
      llvm::dbgs() << magic(F)->getName();
}

std::vector<llvm::Function *> v;

llvm::Module *M;

int main() {
    printNames(M->functions());
    printNames(v);
}

but couldn't come up with magic.

As for propagateConstants it could be done with a few overloads as well:

static bool propagateConstants(SCCPSolver &Solver, Function *F,
                               SmallPtrSetImpl<Instruction *> &ToDelete);

static bool propagateConstants(SCCPSolver &Solver,
                               SmallVectorImpl<Function *> &WorkList,
                               SmallPtrSetImpl<Instruction *> &ToDelete) {
  for (Function *F : WorkList)
      propagateConstants(Solve, F, ToDelete);
}

static bool propagateConstants(SCCPSolver &Solver, Module *M,
                               SmallPtrSetImpl<Instruction *> &ToDelete) {
  for (auto &F : Module)
      propagateConstants(Solve, &F, ToDelete);
}
llvm/lib/Transforms/Utils/SCCPSolver.cpp
1860

All the functions here forward to the Visitor, this one should also just be forwarding.

(Not sure why this proxy class exists at all, but I guess we can address it later).

labrinea updated this revision to Diff 473696.Nov 7 2022, 9:26 AM
labrinea marked an inline comment as done.

This revision is somewhat different from the previous ones because we no longer replace instructions/arguments whilst in the main specialization loop of the SCCP pass. Instead we use the lattice value when rewritting callsites and when promoting constant stack values. Last time I checked this was even more lightweight from the previous revision. Builds successfully the llvm-test-suite with aggressive funcspec options; haven't tried clang bootstrap yet. Also improves one of the unit tests with recursive functions.

labrinea marked 4 inline comments as done.Nov 7 2022, 9:44 AM
labrinea added inline comments.
llvm/include/llvm/Transforms/IPO/FunctionSpecialization.h
70

I've removed an unused typedef from here ;)

llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
209–210

I am not sure whether this is necessary. The unit tests which exercise recursion are passing without it at least.

322

We now call this function once, not for every clone as it used to be.

684–688

There might be more call sites to rewrite than those in the CallSpecBinding that we have already found, therefore we need to repeat this look up of users here, but at least it now happens once for F compared to Clones.size() times which was the case before.

691

no need to examine lattices of arguments if it's the key of the CallSpecBinding

698–699

We are modifying the list whist traversing it, so we swap the current element with the last one and reduce the iteration range by one.

701

the condition was different before, but I think this is correct

llvm/lib/Transforms/Utils/SCCPSolver.cpp
708–726

I couldn't template these two. The main reason was that one iterates over Function * whereas the other over Function &.

chill added inline comments.Nov 8 2022, 8:02 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
689–698

You can swap the order of the loops and get rid of CallSiteToRewrite.

chill added inline comments.Nov 8 2022, 8:30 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
689–698

Maybe I'm getting a bit ahead of me, but you can change the function to rewrite just a single call site and factor out the iteration over call sites. The benefit is the function becomes more reusable in as you can independently choose the set of call sites it operates upon. (Incidently, I'm planning to use it that way, but it generally a good change, even if what I have in mind turns out non-working).

labrinea added inline comments.Nov 9 2022, 12:52 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
689–698

You mean to traverse F's users here instead? We are iterating over them while modifying them, which is the reason why CallSiteToRewrite existed in the first place I believe. Also the dynamic cast and other checks we do: CS->getCalledFunction() == F && Solver.isBlockExecutable(CS->getParent()) (note: this one is missing from the current revision) don't need to be repeated on every iteration of the outer loop which walks the specializaions.

689–698

If I change it the way you suggest it will regress the current behaviour. Qsort() from SPEC's mcf won't specialize (it's a recursive function) and it's been quite a drive for this work. What if you alter it once you refactor how we rewrite callsites?

chill added inline comments.Nov 9 2022, 2:23 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
689–698

What we have now is:

forall S : Specialisations {
  forall  C : CallSites {
   do_stuff(S, C);
 }
}

What I'm suggesting is to reorder the loops

forall C: call sites {
   forall S: specialisations {
    do_stuff(S, C);
  }
}

That'll avoid the swaps/pop_back. The vector itself stays, good point.

And then I'm suggesting to move the outer loop out of the function:

func foo() {
 ...
  forall C: CallSites {
    rewriteCallSite(C)
  }
...
}

func rewriteCallSite(C) {
    forall S : Specialisations {
     do_stuff(S, C);
   }
}

Both are NFC.

Also the dynamic cast and other checks we do: CS->getCalledFunction() == F && Solver.isBlockExecutable(CS->getParent()) (note: this one is missing from the current revision) don't need to be repeated on every iteration of the outer loop which walks the specializaions.

I don't understand this. We do nothing between loops, so interchanging them will execute exactly the same operations in the loop body.

If you add these checks somewhere, it's another argument to move the iteration over call sites to the outer loop.

chill added inline comments.Nov 9 2022, 3:49 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
696–701

This is better placed outside of rewriteCallSites, perhaps just after the call to rewriteCallSites.

chill added inline comments.Nov 9 2022, 4:24 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
274

I believe the LLVM convention for these kinds of classes/methods is run, e.g. Vectorizer::run(), EarlyCSE::run(), etc.

chill added inline comments.Nov 10 2022, 2:13 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
696–701

Or a better idea: get the initial size of CallSitesToRewrite, decrement that number every time you update a call site. At the end if this number drops to zero mark the function unreachable.

chill added inline comments.Nov 10 2022, 2:26 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
684

Use braces around the for, since there are more than two levels of nesting.

labrinea added inline comments.Nov 10 2022, 5:28 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
696–701

that won't work for dead recursive functions

labrinea updated this revision to Diff 474522.Nov 10 2022, 5:41 AM

Changes from last revision:

  • renamed specialize() to run() as suggested
  • renamed rewriteCallSites to updateCallSites
  • interchanged the loops in updateCallSites as suggested
  • update callsites of specializations before anything else (changes the test identical-specializations.ll)
  • used braces for nested loop
  • clang formated
labrinea marked 5 inline comments as done.Nov 10 2022, 5:44 AM
labrinea updated this revision to Diff 474555.Nov 10 2022, 8:35 AM

Found a bug in the last revision. The swapping idiom violates the expected order of traversing the call sites to update. They are supposed to be sorted by gain.

labrinea updated this revision to Diff 474673.Nov 10 2022, 11:58 PM

Changes from last revison:

  • Used a counter for updated callsites instead of revising them at the end to identify dead functions.
labrinea added inline comments.Nov 14 2022, 1:47 AM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
673

I'll rename this and add a comment to explain what it is used for.

labrinea updated this revision to Diff 475858.Nov 16 2022, 10:11 AM

Changes from last revision:

  • rebased
  • renamed the variable which determines whether we decrement the counter of left callsites to replace
  • cached the cloned function pointer to the SpecializationInfo to avoid keeping the Clone vector in sync with the Specialization vector when replacing callsites

Also removed the immediate replacement of known callsites. It's better to lookup the lattice value instead. This fixed a compile time regression possibly caused by otherwise dead duplicate specializations that had to be processed by later passes.

chill added a comment.Nov 18 2022, 4:21 AM

Would you, please, mark as done the no longer relevant comments?
I think the only issue left is with the circular dependency between libraries.

labrinea marked 7 inline comments as done and an inline comment as not done.Nov 18 2022, 8:44 AM

Would you, please, mark as done the no longer relevant comments?
I think the only issue left is with the circular dependency between libraries.

Ack

labrinea updated this revision to Diff 477766.Nov 24 2022, 6:17 AM

Removed the cyclic dependency between LLVMipo and LLVMScalarOpts and rebased on top of D138654.

chill added a comment.Nov 29 2022, 8:20 AM

LGTM, but let's give a chance for other people to have a look too. @sinan @fhahn

labrinea updated this revision to Diff 479303.Dec 1 2022, 8:16 AM
labrinea edited the summary of this revision. (Show Details)
  • rebase
  • migrated all funcspec tests to use the -passes= cmdline option
chill accepted this revision.Dec 5 2022, 1:30 AM
This revision is now accepted and ready to land.Dec 5 2022, 1:30 AM
fhahn added inline comments.Dec 5 2022, 2:17 AM
llvm/lib/Transforms/IPO/SCCP.cpp
150

move this to the the loop below, which uses it

llvm/lib/Transforms/Utils/SCCPSolver.cpp
264

I think DomTreeUpdater provides a constructor that doesn't take a DT which could be used unconditionally instead of having all those if (DTU) checks spread out across various functions.

labrinea updated this revision to Diff 480052.Dec 5 2022, 4:40 AM
labrinea edited the summary of this revision. (Show Details)
  • Moved a flag close to its use.
  • Used a DomTreeUpdater without DT/PDT for cloned functions.
labrinea marked 2 inline comments as done.Dec 5 2022, 4:41 AM
This revision was landed with ongoing or failed builds.Dec 8 2022, 4:23 AM
This revision was automatically updated to reflect the committed changes.
llvm/test/Transforms/FunctionSpecialization/function-specialization-noexec.ll