This is an archive of the discontinued LLVM Phabricator instance.

Do not use a hash table to uniquify mergeable strings.
ClosedPublic

Authored by ruiu on Dec 3 2018, 2:01 PM.

Details

Summary

Previously, we have a hash table containing strings and their offsets to
manage mergeable strings. Technically we can live without that because we
can do binary search on a vector of mergeable strings to find a mergeable
strings. The table was there to speed up offset -> string piece lookup.

We recently observed that lld tend to consume more memory than gold when
linking executables with debug info, and we found that a few percent of
memory is consumed by the hash table. I wondered if we can save memory
here, so I run a few benchmarks with and without the hash table. Here is
the result.

Speed (measured by `perf stat -r10`)

Program         w/patch         w/o patch       Slowdown
chrome          2.004718511     1.988568454     0.81%
clang           0.518536707     0.503528155     2.98%
clang-fsds      0.566316070     0.551120769     2.75%
clang-gdb-index 5.091346130     5.052952745     0.75%
gold            0.325922095     0.318968615     2.17%
gold-fsds       0.353570003     0.342080243     3.35%
linux-kernel    0.872819415     0.866891291     0.68%
llvm-as         0.057709114     0.054467349     5.95%
llvm-as-fsds    0.054338842     0.053751006     1.09%
mozilla         3.730120566     3.836634236     -2.77%
scylla          1.011386393     1.031144224     -1.91%

Maximum RSS (measured by `time -v`)

Program         w/patch         w/o patch       Memory saving
chrome          1163088         1163572         0.04%
clang           369364          373760          1.17%
clang-fsds      392484          396712          1.06%
clang-gdb-index 10391636        10391680        0.00%
gold            227508          229108          0.69%
gold-fsds       237308          238960          0.69%
linux-kernel    143028          146932          2.65%
llvm-as         46792           46976           0.39%
llvm-as-fsds    47112           47336           0.47%
mozilla         4631424         4833940         4.18%
scylla          2712036         2793468         2.91%

It looks like the slowdown is not negligible, but lld got slower only when
the program being linked is small. For large programs, the regression
seems small, or it even got faster. Given that, I don't think having the
hash table is still a good tradeoff; we should drop the hash table to save
memory.

Event Timeline

ruiu created this revision.Dec 3 2018, 2:01 PM
smeenai added a subscriber: smeenai.Dec 3 2018, 5:00 PM
grimar added a comment.EditedDec 4 2018, 4:33 AM

Program w/patch w/o patch Memory saving
chrome 1163088 1163572 0.04%

I guess that was not the debug chrome perhaps.
That might explain why it is 0.81% slower but saves an only minor amount of memory.

I am not sure what I feel about this patch. It can make the link faster in some cases,
but for regular use (linking not huge apps, like clang) it always seems to be up to a few percents slower.

Saving 4.18%, or 200Mb for Mozilla (4631424->4833940) is good, but is that so significant and worth doing?
200Mb is only 200Mb, while -2.77% of link time looks much more impressive to me and
unfortunately, the link time for most of the apps tested is slower with this patch.

Given that this patch is about a finding the balance between speed and memory,
I would like to hear from other people on this.

ruiu added a comment.Dec 4 2018, 9:10 AM

I think you should focus on large programs. We don't really care about the marginal improvements or regressions for small programs because linking small program is very fast anyway, and we don't care about 10ms improvements or regressions. For large programs, it seems reasonably positive. Also this patch only deletes code.

grimar added a comment.Dec 4 2018, 9:32 AM

and we don't care about 10ms improvements or regressions. For large programs, it seems reasonably positive.

But chrome is 0.81% slower. That is my concern too.

Let me benchmark this on my side too please (1-2 days), I'll bring my results here.

ruiu added a comment.Dec 4 2018, 9:49 AM

As one of the people who cares about chromium most in the lld community, I can say that 16 milliseconds regression of linking chromium without debug info is totally negligible.

ruiu added a comment.Dec 4 2018, 10:04 AM

What we should and do actually care about are programs whose size is multi-gibibyte that take 30 seconds to a few minutes to link. Unfortunately I cannot share the programs with you, but we observed both speedup and memory usage reduction with this patch.

grimar accepted this revision.Dec 4 2018, 10:36 AM

What we should and do actually care about are programs whose size is multi-gibibyte that take 30 seconds to a few minutes to link. Unfortunately I cannot share the programs with you, but we observed both speedup and memory usage reduction with this patch.

Well, that sounds like an argument I can't resist. I do not have such apps underhand for testing,
so if you observe the stable improvements you're happy with, then LGTM.

This revision is now accepted and ready to land.Dec 4 2018, 10:36 AM
ruiu added a comment.Dec 4 2018, 10:39 AM

I will collect more data points even though we cannot share programs with you to be fair, so please wait for a while.

MaskRay accepted this revision.EditedDec 4 2018, 4:23 PM
MaskRay added a subscriber: MaskRay.

I didn't notice this patch when I sent out D55248 (I just observed a locally-invented upper_bound without a good justification :) ).

I agree that the memory-consuming llvm::DenseMap<uint32_t, uint32_t> OffsetMap; is not necessary. (just for the record, gold Object_merge_map::get_output_offset uses a similar upper_bound without a hash table).

I think the hash table has some improvement when there are not too many pieces. But when there are many (some .debug_str), the cache miss for each lookup isn't totally negligible. It can be the reason that pure binary search may be faster.

If we do want to optimize llvm::upper_bound, however, I think a one-entry cache (size_t LastIdx; uint64_t LastOffset;) may be more efficient, I've checked locally that many uses follow the pattern of calling getSectionPiece with increasing Offset.

The one-entry cache I mentioned is like the following (I don't say I suggest doing that as it is so complicated with probably so little value):

size_t I;
if (Offset < LastOffset) {
  I = std::upper_bound(Pieces.begin(), Pieces.begin() + LastIdx, Offset,
                       [](uint64_t Offset, SectionPiece P) {
                         return Offset < P.InputOff;
                       }) -
      Pieces.begin();
} else {
  I = LastIdx;
  if (I < Pieces.size() && Pieces[I].InputOff <= Offset)
    ++I;
  if (I < Pieces.size() && Pieces[I].InputOff <= Offset)
    I = std::upper_bound(Pieces.begin() + I + 1, Pieces.end(), Offset,
                         [](uint64_t Offset, SectionPiece P) {
                           return Offset < P.InputOff;
                         }) -
        Pieces.begin();
}

LastOffset = Offset;
LastIdx = I;
return &Pieces[I - 1];

Actually upper_bound is not the best choice here, a slow starting step sizes are better: 1,2,4,8,16,(slow start process) 8,4,2,1 (regular binary search steps)

ruiu added a comment.Dec 5 2018, 11:03 AM

I ran lld with a few more large programs, and here is the result.

Size   Before  After  
6.1GB  20.16s  19.15s (-4.98%)
3.8GB  56.27s  53.38s (-5.14%)
2.5GB  32.65s  32.56s (+0.28%)

Since large programs take time to link, and in order to reduce noise I need to run it many times, it is not easy to run lld on many large programs, but I believe you can still see a pattern. At least I think I can say that removing the hash table is not bad.

This revision was automatically updated to reflect the committed changes.

I ran lld with a few more large programs, and here is the result.

Size   Before  After  
6.1GB  20.16s  19.15s (-4.98%)
3.8GB  56.27s  53.38s (-5.14%)
2.5GB  32.65s  32.56s (+0.28%)

Since large programs take time to link, and in order to reduce noise I need to run it many times, it is not easy to run lld on many large programs, but I believe you can still see a pattern. At least I think I can say that removing the hash table is not bad.

Looks good, thanks for the numbers.

I think you should focus on large programs. We don't really care about the marginal improvements or regressions for small programs because linking small program is very fast anyway, and we don't care about 10ms improvements or regressions. For large programs, it seems reasonably positive. Also this patch only deletes code.

FWIW linking times on smaller (than the very large) programs like Clang are still significant to many folks (including LLVM developers themselves) - for example when running "ninja check-all" many LLVM tools (~30? I forget roughly how many) have to be linked - and saving time on each (or the longest one) means getting out of that very serial step in the check run (massively parallel compiles, serial links, then massively parallel test runs).

I think you should focus on large programs. We don't really care about the marginal improvements or regressions for small programs because linking small program is very fast anyway, and we don't care about 10ms improvements or regressions. For large programs, it seems reasonably positive. Also this patch only deletes code.

FWIW linking times on smaller (than the very large) programs like Clang are still significant to many folks (including LLVM developers themselves) - for example when running "ninja check-all" many LLVM tools (~30? I forget roughly how many) have to be linked - and saving time on each (or the longest one) means getting out of that very serial step in the check run (massively parallel compiles, serial links, then massively parallel test runs).

I second the point, but see my comment above. If we want to optimize further, I think we should try some one-entry cache and leverage the getVA() calling pattern, instead of using a hash table (which has been deleted by this patch). The hash table may actually do worse with some SHF_MERGE|SHF_STRINGS.

I think you should focus on large programs. We don't really care about the marginal improvements or regressions for small programs because linking small program is very fast anyway, and we don't care about 10ms improvements or regressions. For large programs, it seems reasonably positive. Also this patch only deletes code.

FWIW linking times on smaller (than the very large) programs like Clang are still significant to many folks (including LLVM developers themselves) - for example when running "ninja check-all" many LLVM tools (~30? I forget roughly how many) have to be linked - and saving time on each (or the longest one) means getting out of that very serial step in the check run (massively parallel compiles, serial links, then massively parallel test runs).

I second the point, but see my comment above. If we want to optimize further, I think we should try some one-entry cache and leverage the getVA() calling pattern, instead of using a hash table (which has been deleted by this patch). The hash table may actually do worse with some SHF_MERGE|SHF_STRINGS.

Yep yep - neat idea/something someone can experiment with at some point if they're feeling like it :)