This is an archive of the discontinued LLVM Phabricator instance.

[Outliner] Fix compile-time overhead for candidate choice
ClosedPublic

Authored by paquette on Mar 20 2017, 11:29 AM.

Details

Summary

The candidate collection method in the outliner can cause some dramatic compile-time regressions on large tests.

Currently, it works like this:

  1. Build a suffix tree over the instructions in the program
  2. Query the tree for good candidates
  3. Prune tree
  4. Go to step 2 until no candidates are found.

This hinges on the assumption that pruning the tree based on overlapping candidates will reduce the search space enough that subsequent queries take a negligible amount of time. Unfortunately, that assumption is shaky at best: some programs simply don't have many overlapping candidates, and for very large tests this becomes a big problem.

The new candidate collection method works like this:

  1. Build a suffix tree over the instructions in the program
  2. Iterate over each leaf in the tree.
  3. Visit the parent of each leaf. If the parent represents a beneficial string, save a candidate for it.

There are O(I) leaves, where I is the number of instructions in the program. Since we walk the leaf list exactly once, candidate collection is now guaranteed to be at worst O(I). Overlapping candidates are now handled entirely by the pruneOverlaps method, which previously handled anything the suffix tree pruning didn't. Since that method is on average O(n), this behaves far better.
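To make the shape of the new collection step concrete, here is a minimal standalone sketch of the leaf walk. This is not the actual MachineOutliner code: the Node and Candidate types, the field names, and the benefit check are simplified stand-ins for the pass's SuffixTreeNode/Candidate/OutlinedFunction machinery.

  #include <cstddef>
  #include <vector>

  // Simplified stand-ins for the pass's suffix tree node and candidate types.
  struct Node {
    Node *Parent = nullptr;
    size_t SuffixIdx = 0;       // Where this leaf's suffix starts.
    size_t ConcatLen = 0;       // Length of the string spelled from the root to here.
    size_t OccurrenceCount = 0; // Number of leaves below this internal node.
  };

  struct Candidate {
    size_t StartIdx; // First instruction of this occurrence.
    size_t Len;      // Number of instructions in the repeated sequence.
  };

  // One pass over the leaves: each leaf is one occurrence, and its parent
  // describes the repeated string that occurrence belongs to. Internal nodes
  // with no leaf children are not reached this way (see the discussion below).
  std::vector<Candidate> collectCandidates(const std::vector<Node *> &Leaves) {
    std::vector<Candidate> Candidates;
    for (const Node *Leaf : Leaves) {
      const Node *Parent = Leaf->Parent;
      if (!Parent)
        continue;
      // Stand-in benefit check: at least two occurrences of at least two
      // instructions. The real pass asks the target about call/frame overhead.
      if (Parent->OccurrenceCount < 2 || Parent->ConcatLen < 2)
        continue;
      Candidates.push_back({Leaf->SuffixIdx, Parent->ConcatLen});
    }
    return Candidates; // O(#leaves) = O(I) work in total.
  }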

Improvements

Compile time was measured by compiling the LLVM test suite for AArch64 using LNT. Clang with -Oz and -mno-red-zone was compared against Clang with -Oz, -mno-red-zone, and -mllvm -enable-machine-outliner.

Using the old candidate collection method, the worst compile-time regression was a spectacular 284.07 seconds on MultiSource/Benchmarks/7zip/7zip-benchmark.test, followed by a 257.96 second regression on MultiSource/Benchmarks/Bullet/bullet.test. On average, there was a 9.69 second compile-time regression, with a median compile-time regression of 0.85 seconds.

Using the new candidate collection method, the worst compile-time regression was 0.542 seconds for MultiSource/Benchmarks/7zip/7zip-benchmark.test, followed by 0.496 seconds for MultiSource/Benchmarks/Bullet/bullet.test. On average, there was a 0.01 second compile-time regression, with a median compile-time regression of 0.002 seconds.

Changes to code size

This change doesn't impact the code size results. In fact, by collecting all potential candidates, we can probably make the outliner make better decisions for what to outline in the future.

Diff Detail

Event Timeline

paquette created this revision.Mar 20 2017, 11:29 AM
MatzeB edited edge metadata.Mar 20 2017, 11:55 AM

I am confused:

The candidate collection method in the outliner can cause some dramatic code size regressions on large tests.

... vs ...

This change doesn't impact the code size results. In fact, by collecting all potential candidates, we can probably make the outliner make better decisions for what to outline in the future.

paquette added a comment.EditedMar 20 2017, 11:59 AM

I am confused:

The candidate collection method in the outliner can cause some dramatic code size regressions on large tests.

... vs ...

This change doesn't impact the code size results. In fact, by collecting all potential candidates, we can probably make the outliner make better decisions for what to outline in the future.

***compile time!

Too used to typing code size. This is all about compile time being awful. :)

paquette edited the summary of this revision. (Show Details)Mar 20 2017, 12:28 PM
silvas added a subscriber: silvas.Mar 20 2017, 8:52 PM

Overlapping candidates are now handled entirely by the pruneOverlaps method, which previously handled anything the suffix tree pruning didn't. Since that method is on average O(n), this behaves far better.

Why is it O(n) on average? It is two nested for loops with some bailouts that only affect the asymptotic runtime if candidates overlap with little-omega(1) other candidates. And it needs to be big-omega(sqrt(n)) if the runtime is to be reduced to even O(n^1.5).

Also, as a side note, pruneOverlaps seems to only consider the first occurrences (e.g. it only decreases occurrence counts by 1), but two candidates may overlap in more than one place. This probably explains something I was seeing: if you print out the total benefit of each candidate as it is outlined, the total benefit that the outliner thinks it is getting from the outlining operations it does can be more than the total string length, which reflects incorrect cost modeling.

Using the new candidate collection method, the worst compile-time regression was 0.542 seconds for MultiSource/Benchmarks/7zip/7zip-benchmark.test, followed by 0.496 seconds for MultiSource/Benchmarks/Bullet/bullet.test. On average, there was a 0.01 second compile-time regression, with a median compile-time regression of 0.002 seconds.

This is a big improvement over the previous numbers, but it would be good to phrase it as relative numbers. E.g. does the outliner after this patch ever take more than 10% of compile time? That would still be unfortunate.

Also, you should try this with FullLTO. That will smoke out compile time problems better. If you save off the combined bitcode files for each exe you care about, then you can just run llc on the bitcode directly which should let you iterate faster.

lib/CodeGen/MachineOutliner.cpp
558

Why not just phrase this routine as a single tree walk visiting the internal nodes, instead of walking the leaf vector and looking at parents?
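(For illustration only, a minimal sketch of the alternative being suggested: a single depth-first walk over internal nodes. The Node type and the visitInternalNodes helper are made up for this example and are not the patch's code.)

  #include <cstddef>
  #include <vector>

  struct Node {
    std::vector<Node *> Children; // Empty for leaves.
    size_t ConcatLen = 0;
    size_t OccurrenceCount = 0;
    bool isLeaf() const { return Children.empty(); }
  };

  // Visit every internal node directly with one depth-first walk, calling
  // Visit on each one, instead of reaching internal nodes via the leaf list.
  // The real code would skip the root and apply the benefit check inside Visit.
  template <typename Visitor>
  void visitInternalNodes(Node *N, Visitor Visit) {
    if (!N || N->isLeaf())
      return;
    Visit(N);
    for (Node *Child : N->Children)
      visitInternalNodes(Child, Visit);
  }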

580

This FIXME doesn't make sense. Every internal node is the parent (not just ancestor) of a leaf. In fact you are relying on this so that looking at parents of the leaves considers all internal nodes.

620

Should be *Parent, right? Leaves will always have 1 occurrence.

silvas added inline comments.Mar 21 2017, 2:57 AM
lib/CodeGen/MachineOutliner.cpp
580

Or the code is failing to consider all internal nodes with this patch, which seems like a bummer. Do you have data on how many internal nodes aren't parents of leaves (and have found the missed opportunities to be negligible)?

Why is it O(n) on average? It is two nested for loops with some bailouts that only affect the asymptotic runtime if candidates overlap with little-omega(1) other candidates. And it needs to be big-omega(sqrt(n)) if the runtime is to be reduced to even O(n^1.5).

Compiling for AArch64, for the SingleSource and MultiSource tests in the test suite the average length of a candidate is 10 MachineInstrs. The maximum length of a candidate is 108, which appears in a test with 11251 MachineInstrs. So far, it appears to be linear time on average, because for programs with a lot of candidates, the max length of a candidate is usually dominated by the number of candidates. I should really prove the bound though. It's a little tricky because it's O(m * n) where m is the maximum length of a candidate.
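To put rough numbers on that O(m * n) bound (back-of-the-envelope only; the candidate count below is just an upper-bound guess derived from the averages above, not a measurement):

  m ≈ 108 (longest candidate seen), and with ~10 instructions per candidate a
  test with 11251 MachineInstrs has at most roughly n ≈ 11251 / 10 ≈ 1125
  candidates, so

    m * n ≈ 108 * 1125 ≈ 121,500 comparisons

  versus n^2 ≈ 1125^2 ≈ 1.27 million for a plain nested loop with no bailouts.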

Hand waving/brain dump: The tricky case is where m = NumInstructions/k for k > 2 and n ~ O(NumInstructions). As k gets larger, m gets smaller. When m = 2, we only have to compare against one other candidate for each candidate, so we're good.

Previously, since we did some overlap pruning in the tree, we knew that as k gets smaller, the possible number of candidates in the program got smaller. Now that I think about it, that's not really true anymore. I guess the point here would be to show that we do the same number of overlap checks as in the tree here. But then we aren't really talking about pruneOverlaps, so ¯\_(ツ)_/¯

In the meantime, I can make the outliner bail out entirely in the event that things spin out of control. From there, I could start working on maybe

  • Putting the candidates in an interval tree (a rough sketch of one simple take on this follows the list) or...
  • Adding some stronger checks in the same vein as the max length check to reliably bound the search space now that we don't have the tree pruning
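(For illustration, a minimal sketch of the interval idea from the first bullet, done with a plain sort-and-sweep rather than an actual interval tree, under the assumption that candidates are reduced to [start, end] intervals over instruction indices. The Interval type and the keepNonOverlapping helper are hypothetical, and the real pruneOverlaps also has to respect the benefit-ordered greedy choice, which this ignores.)

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  struct Interval {
    size_t Start; // First instruction index covered by the candidate.
    size_t End;   // Last instruction index covered by the candidate.
  };

  // Sort by start point, then sweep once, remembering the end of the last
  // kept candidate; anything starting at or before that end overlaps it and
  // is dropped. O(n log n) regardless of how long the candidates are.
  std::vector<Interval> keepNonOverlapping(std::vector<Interval> Candidates) {
    std::sort(Candidates.begin(), Candidates.end(),
              [](const Interval &A, const Interval &B) {
                return A.Start < B.Start;
              });
    std::vector<Interval> Kept;
    size_t LastEnd = 0;
    bool HaveLast = false;
    for (const Interval &C : Candidates) {
      if (HaveLast && C.Start <= LastEnd)
        continue; // Overlaps a candidate we already kept.
      Kept.push_back(C);
      LastEnd = C.End;
      HaveLast = true;
    }
    return Kept;
  }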

Also, as a side note, pruneOverlaps seems to only consider the first occurrences (e.g. it only decreases occurrence counts by 1), but two candidates may overlap in more than one place. This probably explains something I was seeing: if you print out the total benefit of each candidate as it is outlined, the total benefit that the outliner thinks it is getting from the outlining operations it does can be more than the total string length, which reflects incorrect cost modeling.

I don't quite understand what you mean here. There are multiple occurrences of one candidate that can overlap with other candidates in multiple places, yes, but the decrement is done on their respective OutlinedFunctions. Also, the benefit used after this function is stored in the OutlinedFunction. Candidate benefit is only used for the greedy choice to prevent candidates from "cancelling each other out". Is that what you're getting at? I didn't really explain this well in the code on a second glance.
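A minimal sketch of the relationship being described, i.e. per-occurrence Candidates pointing at a shared OutlinedFunction whose count is what actually gets decremented. The types and the benefit formula here are simplified stand-ins, not the pass's actual definitions.

  #include <cstddef>
  #include <vector>

  // One occurrence of a repeated sequence.
  struct Candidate {
    size_t StartIdx;
    size_t Len;
    size_t FunctionIdx; // Which OutlinedFunction this occurrence belongs to.
  };

  // The equivalence class of all occurrences of one sequence.
  struct OutlinedFunction {
    unsigned OccurrenceCount = 0;
    // Benefit recomputed from OccurrenceCount; this is the number the pass
    // trusts after pruning, not the per-Candidate benefit used for the
    // greedy ordering.
    unsigned benefit(unsigned SeqLen, unsigned CallOverhead) const {
      unsigned Saved = OccurrenceCount * SeqLen;               // Instructions removed.
      unsigned Cost = SeqLen + OccurrenceCount * CallOverhead; // Outlined body + calls.
      return Saved > Cost ? Saved - Cost : 0;
    }
  };

  // When pruneOverlaps discards an occurrence, the decrement lands on the
  // class, so the class's benefit shrinks for every later decision.
  void discardOccurrence(const Candidate &C, std::vector<OutlinedFunction> &Fns) {
    OutlinedFunction &F = Fns[C.FunctionIdx];
    if (F.OccurrenceCount > 0)
      --F.OccurrenceCount;
  }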

paquette added inline comments.Mar 22 2017, 12:17 PM
lib/CodeGen/MachineOutliner.cpp
580

Yeah I was wrong here. I got too excited about the idea of getting rid of a bunch of nodes.

620

BenefitFn looks at the parent of the leaf, so it's fine. It should really be refactored to take in the parent though, since that's what the traversal is concerned with.

MatzeB accepted this revision.Mar 22 2017, 4:34 PM
  • This looks good to me.
  • +1 to the benefit function getting the parent passed in.
  • I'd replace the FIXMEs with TODOs, as they aren't really bugs but possible improvements (as far as I could understand)
  • Superficially the suffix tree operations look okay to me, but I leave the judgement to the experts (=you).
lib/CodeGen/MachineOutliner.cpp
611

Maybe use a reference as Parent cannot be nullptr.

This revision is now accepted and ready to land.Mar 22 2017, 4:34 PM

Why is it O(n) on average? It is two nested for loops with some bailouts that only affect the asymptotic runtime if candidates overlap with little-omega(1) other candidates. And it needs to be big-omega(sqrt(n)) if the runtime is to be reduced to even O(n^1.5).

Compiling for AArch64, for the SingleSource and MultiSource tests in the test suite the average length of a candidate is 10 MachineInstrs. The maximum length of a candidate is 108, which appears in a test with 11251 MachineInstrs. So far, it appears to be linear time on average, because for programs with a lot of candidates, the max length of a candidate is usually dominated by the number of candidates. I should really prove the bound though. It's a little tricky because it's O(m * n) where m is the maximum length of a candidate.

Hand waving/brain dump: The tricky case is where m = NumInstructions/k for k > 2 and n ~ O(NumInstructions). As k gets larger, m gets smaller. When m = 2, we only have to compare against one other candidate for each candidate, so we're good.

Previously, since we did some overlap pruning in the tree, we knew that as k gets smaller, the possible number of candidates in the program got smaller. Now that I think about it, that's not really true anymore. I guess the point here would be to show that we do the same number of overlap checks as in the tree here. But then we aren't really talking about pruneOverlaps, so ¯\_(ツ)_/¯

Oh, I think I was getting really confused about whether Candidate represents a single occurrence of a string or the equivalence class for that string itself. E.g. there is a comment "an occurrence of this candidate", but "Candidate" is the "occurrence" and "OutlinedFunction" is the "candidate". Maybe call them "Occurrence" and "CandidateClass" or something?

Also, I think I assumed this was trying to emulate the previous serialized loop of greedy choice, then prune/recalculate benefit, then greedy choice again, and so on (or maybe I was misunderstanding that too... why would pruneOverlaps have been needed in that case?).

Anyway, this seems fine, sorry for the confusion. I see that on the version you committed, you added the comment that it is O(MaxCandidateLen * CandidateList.size()). It might be good to put some concrete numbers to give readers a feel. E.g. MaxCandidateLen in practice is basically a constant, and CandidateList.size() is typically some fraction of the number of instructions.

In the meantime, I can make the outliner bail out entirely in the event that things spin out of control. From there, I could start working on maybe

  • Putting stuff in an interval tree or...
  • Adding some stronger checks in the same vein as the max length check to reliably bound the search space now that we don't have the tree pruning

Now that I understand it better, I don't think this really needs any improvement. There seem to be only a couple ways of improving the code size benefit of the outliner:

  1. Make more instructions viable for outlining (and you've been doing plenty of work on this, e.g. stack fixups and tail calls)
  2. Make the outlining decision process get a more optimal result (this is a combinatorial optimization problem)
  3. Heuristics for making outlining results across TUs converge more (I don't think there's much to do for this, TBH)

I think the original hope for the suffix tree was to make 2 a lot better, but on practical inputs even simple approaches like finding parents of leaves + pruneOverlaps already get pretty close to optimal (for example, it seems that just looking at parents of leaves doesn't hurt the result too much).

Do you have any other plans here to exploit the suffix tree? After this patch, I think all we use the suffix tree for is finding repeated substrings. We might want to replace that with a single sort + linear scan, i.e. initialize a vector of pointers into each instruction, then sort them by the suffix starting at that instruction.
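A minimal sketch of that sort-plus-scan idea, operating on a flat sequence of instruction hashes rather than real MachineInstrs (the findRepeats helper and the MinLen cutoff are made up for this example, and tie handling / grouping of more than two occurrences is glossed over):

  #include <algorithm>
  #include <cstddef>
  #include <utility>
  #include <vector>

  // Sort all suffix start positions lexicographically by their suffix, then
  // scan adjacent pairs: the longest common prefix of two neighbouring
  // suffixes is a repeated substring.
  std::vector<std::pair<size_t, size_t>> // (start, length) of each repeat
  findRepeats(const std::vector<unsigned> &Seq, size_t MinLen) {
    std::vector<size_t> Suffixes(Seq.size());
    for (size_t I = 0; I < Seq.size(); ++I)
      Suffixes[I] = I;
    std::sort(Suffixes.begin(), Suffixes.end(), [&](size_t A, size_t B) {
      return std::lexicographical_compare(Seq.begin() + A, Seq.end(),
                                          Seq.begin() + B, Seq.end());
    });
    std::vector<std::pair<size_t, size_t>> Repeats;
    for (size_t I = 1; I < Suffixes.size(); ++I) {
      size_t A = Suffixes[I - 1], B = Suffixes[I], L = 0;
      while (A + L < Seq.size() && B + L < Seq.size() && Seq[A + L] == Seq[B + L])
        ++L; // Common prefix length of the two adjacent suffixes.
      if (L >= MinLen)
        Repeats.push_back({B, L});
    }
    return Repeats;
  }

The naive suffix comparator makes the sort worse than O(n log n) in the worst case, so this only illustrates the idea; it is not meant as a drop-in replacement.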

You might want to do an experiment where you make "is instruction safe to outline" always return true and see how much more code size improvement is achieved. That will give a bound on how much you can expect to gain by improving 1., which seems useful to know.

Also, as a side note, pruneOverlaps seems to only consider the first occurrences (e.g. it only decreases occurrence counts by 1), but two candidates may overlap in more than one place. This probably explains something I was seeing: if you print out the total benefit of each candidate as it is outlined, the total benefit that the outliner thinks it is getting from the outlining operations it does can be more than the total string length, which reflects incorrect cost modeling.

I don't quite understand what you mean here. There are multiple occurrences of one candidate that can overlap with other candidates in multiple places, yes, but the decrement is done on their respective OutlinedFunctions. Also, the benefit used after this function is stored in the OutlinedFunction. Candidate benefit is only used for the greedy choice to prevent candidates from "cancelling each other out". Is that what you're getting at? I didn't really explain this well in the code on a second glance.

I think I was really confused here. Sorry for the noise.

MatzeB closed this revision.Jul 19 2017, 4:13 PM

This seems to have landed a while ago in r298648.