This is an archive of the discontinued LLVM Phabricator instance.

Parallelize ICF to make LLD's ICF really fast.
ClosedPublic

Authored by ruiu on Nov 29 2016, 8:12 PM.

Details

Summary

ICF is short for Identical Code Folding. It is a size optimization that identifies two or more functions that happen to have the same contents and merges them. It usually reduces output size by a few percent.

ICF is slow because it is a computationally intensive process. I tried to parallelize it before but failed because I couldn't make a parallelized version produce consistent outputs. Although it didn't create broken executables, every invocation of the linker generated slightly different output, and I couldn't figure out why.

I think I now understand what was going on, and I also came up with a simple algorithm to fix it. Hence this patch.

The result is very exciting. Chromium, for example, has 780,662 input sections, of which 20,774 are reducible by ICF. LLD previously took 7.980 seconds for ICF. Now it finishes in 1.065 seconds.

As a result, LLD can now link a Chromium binary (output size 1.59 GB) in 10.28 seconds on my machine with ICF enabled. Compared to gold, which takes 40.94 seconds to do the same thing, this is an amazing number.

From here, I'll describe what we are doing for ICF, what the previous problem was, and what I did in this patch.

In ICF, two sections are considered identical if they have the same section flags, section data, and relocations. Relocations are tricky, because two relocations are considered the same if they have the same relocation type and values, and if they point to the same section _in terms of ICF_.

Here is an example. If foo and bar defined below are compiled to the same machine instructions, ICF can (and should) merge the two, although their relocations point to each other.

void foo() { bar(); }
void bar() { foo(); }

This is not an easy problem to solve.

What we are doing in LLD is some sort of coloring algorithm. We repeatedly color non-identical sections with different colors, and sections that still share a color when the algorithm terminates are considered identical. Here are the details (a rough code sketch follows the list):

  1. First, we color all sections using hash values computed from their section types, section contents, and numbers of relocations. At this point, relocation targets are not taken into account; we just give sections that apparently differ different colors.
  2. Next, for each color C, we visit the sections having color C to see if their relocations are the same. Relocations are considered equal if their targets have the same color. We then recolor sections that have different relocation targets with new colors.
  3. If we recolor some sections in step 2, relocations that were previously pointing to same-colored targets may now be pointing to differently-colored ones. Therefore, we repeat step 2 until convergence is reached.
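
To make the steps above concrete, here is a minimal sketch of the refinement loop (not LLD's actual code, which lives in ELF/ICF.cpp); Section, relocTargetsMatch, and refineColors are hypothetical names, and a real implementation compares actual section contents rather than trusting the step-1 hash alone.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Section {
  uint32_t Color = 0;                  // equivalence class, seeded from the step-1 hash
  std::vector<Section *> RelocTargets; // sections this section's relocations point at
};

// Step 2: two same-colored sections stay together only if their relocation
// targets have pairwise-equal colors (type/addend checks omitted here).
static bool relocTargetsMatch(const Section *A, const Section *B) {
  return A->RelocTargets.size() == B->RelocTargets.size() &&
         std::equal(A->RelocTargets.begin(), A->RelocTargets.end(),
                    B->RelocTargets.begin(),
                    [](Section *X, Section *Y) { return X->Color == Y->Color; });
}

// Step 3: keep splitting color classes until no section changes color.
// NextColor is assumed to start above every color handed out in step 1.
static void refineColors(std::vector<Section *> &Sections, uint32_t NextColor) {
  for (bool Changed = true; Changed;) {
    Changed = false;
    // Keep same-colored sections adjacent so each class is a contiguous range.
    std::sort(Sections.begin(), Sections.end(),
              [](Section *A, Section *B) { return A->Color < B->Color; });
    for (size_t Begin = 0; Begin != Sections.size();) {
      size_t End = Begin + 1;
      while (End != Sections.size() &&
             Sections[End]->Color == Sections[Begin]->Color)
        ++End;
      // Move sections whose relocation targets differ from the class head into
      // a fresh color; differences among them are resolved in later passes.
      bool Split = false;
      for (size_t I = Begin + 1; I != End; ++I)
        if (!relocTargetsMatch(Sections[Begin], Sections[I])) {
          Sections[I]->Color = NextColor;
          Split = Changed = true;
        }
      NextColor += Split;
      Begin = End;
    }
  }
}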

Step 2 is a heavy operation. For Chromium, the first iteration of step 2 takes 2.882 seconds, the second takes 1.038 seconds, and in total it needs 23 iterations.

Parallelizing step 1 is easy because we can color each section independently. This patch does that.

Parallelizing step 2 is tricky. We could work on each color independently, but we cannot recolor sections in place, because that would break the invariant that two possibly-identical sections must have the same color at any moment.

Consider sections S1, S2, S3 and S4 in the same color C, where S1 and S2 are identical, S3 and S4 are identical, but S2 and S4 are not. Thread A is about to recolor S1 and S2 as C'. After thread A recolors S1 as C', but before it recolors S2, another thread B might observe S1 and S2. Thread B would then conclude that S1 and S2 are different, and it would wrongly split the sections it is working on into smaller groups. Over-splitting doesn't produce broken results, but it loses a chance to merge some identical sections. That was the cause of the nondeterminism.

To fix the problem, I made sections carry two colors, namely a current color and a next color. At the beginning of each iteration, both colors are the same. Each thread reads from the current color and writes to the next color. In this way, threads never read partial results. After each iteration, we flip current and next.
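
Here is a minimal sketch of this double-buffering idea, assuming a pass counter whose parity selects the current slot (the helper names below are illustrative, not the patch's actual code):

#include <cstdint>
#include <vector>

struct Section {
  uint32_t GroupId[2] = {0, 0};        // current and next color
  std::vector<Section *> RelocTargets;
};

// Cnt counts completed passes; its parity selects the "current" slot.
inline uint32_t currentColor(const Section *S, int Cnt) {
  return S->GroupId[Cnt % 2];
}
inline void setNextColor(Section *S, int Cnt, uint32_t Color) {
  S->GroupId[(Cnt + 1) % 2] = Color;
}

// Before a pass, copy "current" into "next" so sections that are not recolored
// keep their color after the flip.
inline void seedNextColors(std::vector<Section *> &Sections, int Cnt) {
  for (Section *S : Sections)
    S->GroupId[(Cnt + 1) % 2] = S->GroupId[Cnt % 2];
}

// Worker threads only call currentColor() to read and setNextColor() to write,
// so no thread can observe a half-updated color class. Once every thread has
// finished the pass, incrementing Cnt flips current and next.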

This is a very simple solution and is implemented in less than 50 lines of code.

I tested this patch with Chromium and confirmed that the parallelized ICF produces output identical to the non-parallelized one.

Event Timeline

ruiu updated this revision to Diff 79699.Nov 29 2016, 8:12 PM
ruiu retitled this revision from to Parallelize ICF to make LLD's ICF really fast..
ruiu updated this object.
ruiu added reviewers: silvas, rafael.
ruiu added a subscriber: llvm-commits.
emaste added a subscriber: emaste.Nov 29 2016, 8:19 PM
ruiu updated this revision to Diff 79705.Nov 29 2016, 10:34 PM
  • Fixed a correctness issue: when we flip GroupId[0] and GroupId[1], we need to copy the new colors from the old space to the new space.
  • Handle !Config->Thread case.
ruiu updated this object.Nov 29 2016, 10:35 PM
silvas edited edge metadata.Nov 29 2016, 11:36 PM

Nice! The idea of storing current and next colors solves the nondeterminism in a very simple way!

One concern: this increases sizeof(InputSection) :(
With -ffunction-sections -fdata-sections we might have a very large number of them, so reducing memory footprint is important. I'm afraid that this might slow down regular links.
After this patch, we will be paying one pointer's worth of size for DependentSection and two for GroupId. That is three pointers of memory overhead, which is quite nontrivial.
It doesn't have to be done in this patch, but I think we can adjust the memory allocation to optionally allocate this data "off the tail" of the InputSection, so that we don't pay the memory overhead if these advanced features aren't being used.
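
As a rough illustration of the idea (the names below are made up, and LLVM's TrailingObjects helper would be a more principled way to do this), something like:

#include <cstddef>
#include <cstdint>
#include <new>

struct IcfData { uint32_t GroupId[2]; };

struct SectionBase {
  uint64_t Flags = 0; // stand-in for the regular members

  // Reserve room for IcfData right after the object only when ICF is on.
  static SectionBase *create(bool NeedIcf) {
    size_t Size = sizeof(SectionBase) + (NeedIcf ? sizeof(IcfData) : 0);
    void *Mem = ::operator new(Size);
    auto *Sec = new (Mem) SectionBase();
    if (NeedIcf)
      new (Sec + 1) IcfData{}; // fine while alignof(SectionBase) >= alignof(IcfData)
    return Sec;
  }

  // Only valid if the object was created with NeedIcf = true.
  IcfData *icfData() { return reinterpret_cast<IcfData *>(this + 1); }
};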

Is the large number of iterations (23) due to the slowness of propagating identicalness across references (currently one level of references per iteration), or due to ICF<ELFT>::segregate only being able to split into two at each iteration? (Or a combination of both?) Here are two ideas for reducing the number of iterations:

  1. Do some sort of topological sorting (even approximate) and then do partial iterations which sort only part of the array (more generally, avoid revisiting sections that are unlikely to change this iteration). This can speed up convergence since we avoid wasting work on nodes that won't change.

One interesting observation is that if the array is topologically sorted (ignoring cycles), then I believe a serial visitation with relaxation at each step (which cannot be parallelized deterministically) would be guaranteed to resolve in a single iteration. The savings from reducing iterations might pay off.
Note that --gc-sections already has to compute some of this, so this topological ordering information might not be so expensive.

  2. Make the "equal" comparison actually be a "less". That will allow ICF<ELFT>::segregate to sort instead of partition, which allows it to generate multiple ranges at a time.
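
As a rough sketch of what a "less"-based segregate could look like (the fields and the lessBy comparator below are made up; a real ordering would have to cover everything the equality check covers, and NextColor is assumed to start above every color already in use):

#include <algorithm>
#include <cstdint>
#include <vector>

struct Section {
  uint32_t Color = 0;
  uint64_t ContentHash = 0;           // stand-in for flags/data/reloc count
  std::vector<uint32_t> TargetColors; // current colors of relocation targets
};

// Strict weak order: any two sections that ICF must keep apart compare unequal.
static bool lessBy(const Section *A, const Section *B) {
  if (A->ContentHash != B->ContentHash)
    return A->ContentHash < B->ContentHash;
  return A->TargetColors < B->TargetColors; // lexicographic
}

// Split one color class into one new class per run of equal sections, using a
// single sort instead of repeated two-way partitions.
static void segregateBySort(std::vector<Section *> &Class, uint32_t &NextColor) {
  std::sort(Class.begin(), Class.end(), lessBy);
  for (size_t I = 1; I < Class.size(); ++I)
    Class[I]->Color = lessBy(Class[I - 1], Class[I]) ? ++NextColor
                                                     : Class[I - 1]->Color;
}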

What we are doing in LLD is some sort of coloring algorithm

Believe it or not, once I started learning about GVN I learned that this algorithm is actually a textbook example of an "optimistic" GVN algorithm. So it is actually a well-studied kind of algorithm.

ELF/InputSection.h
292

Would

struct {
  uint64_t Current;
  uint64_t Next;
} GroupId;

be a bit better?

ruiu updated this revision to Diff 79777.Nov 30 2016, 10:09 AM
ruiu edited edge metadata.
  • Change the type of GroupId from uint64_t to uint32_t to keep the original object size.
ruiu added a comment.Nov 30 2016, 10:32 AM

Is the large number of iterations (23) due to the slowness of propagating identicalness across references (currently one level of references per iteration), or due to ICF<ELFT>::segregate only being able to split into two at each iteration? (Or a combination of both?) Here are two ideas for reducing the number of iterations:

Very good question. A single-threaded iteration with a single GroupId for each InputSection took 11 iterations, so this patch slows down single-threaded execution. The main loop gets faster as we have fewer input sections to mutate, but even an empty iteration still takes ~100ms. I made a change to the code to optimize for that case. (I could optimize it even more if I added more code for the single-threaded case, but I think in practice this should be enough.)
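
For reference, one plausible shape of such a single-thread special case (not necessarily what this patch does) is to collapse the two slots when threading is off:

#include <cstdint>

struct Section { uint32_t GroupId[2] = {0, 0}; };

// With threads, read slot Cnt % 2 and write slot (Cnt + 1) % 2; without
// threads it is safe to write back into the slot being read, so recolorings
// become visible within the same pass and convergence needs fewer iterations.
inline uint32_t &colorSlotToWrite(Section &S, int Cnt, bool Threads) {
  return S.GroupId[Threads ? (Cnt + 1) % 2 : Cnt % 2];
}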

Do some sort of topological sorting (even approximate) and then do partial iterations which sort only part of the array (more generally, avoid revisiting sections that are unlikely to change this iteration). This can speed up convergence since we avoid wasting work on nodes that won't change.

Interesting idea. I could imagine that we might be able to compute strongly connected components and toposort them to do some sort of optimization. I wouldn't do that in this patch, but that's probably worth trying.

make the "equal" comparison actually be a "less". That will allow ICF<ELFT>::segregate to sort instead of partition, which allows it to generate multiple ranges at a time.

This is a very interesting idea too. I'll try to do that after submitting this patch.

ELF/InputSection.h
292

I'm using GroupId[0] and GroupId[1] alternately, so I can't make that change.

ruiu updated this revision to Diff 79780.Nov 30 2016, 10:32 AM
  • Optimize for single-thread case.
silvas accepted this revision.Dec 1 2016, 12:16 AM
silvas edited edge metadata.

This LGTM. Thanks for looking so closely at this! It's a very nice speedup!

Do some sort of topological sorting (even approximate) and then do partial iterations which sort only part of the array (more generally, avoid revisiting sections that are unlikely to change this iteration). This can speed up convergence since we avoid wasting work on nodes that won't change.

Interesting idea. I could imagine that we might be able to compute strongly connected components and toposort them to do some sort of optimization. I wouldn't do that in this patch, but that's probably worth trying.

We may not even need SCCs, as cycles are relatively rare. Just ordering by a simple DFS/BFS may give most of the benefit. So we may be able to get this "for free" from the work we do for --gc-sections.

(As far as parallelizing graph traversals goes, parallel graph algorithms have actually been studied quite a bit, although often in a distributed setting.
Some that I remember seeing are:
https://github.com/MicrosoftResearch/Naiad/blob/release_0.5/Examples/DifferentialDataflow/ConnectedComponents.cs
https://github.com/MicrosoftResearch/Naiad/blob/release_0.5/Examples/DifferentialDataflow/StronglyConnectedComponents.cs
https://github.com/frankmcsherry/timely-dataflow/blob/master/examples/bfs.rs

Ideas from there might be useful inspiration for any parallel graph processing we have to do.
)

This revision is now accepted and ready to land.Dec 1 2016, 12:16 AM
This revision was automatically updated to reflect the committed changes.