This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
lld/trunk/ELF/
-
trunk/
-
ELF/
-
InputSection.h
-
InputSection.cpp

Differential D55234

Do not use a hash table to uniquify mergeable strings.
ClosedPublic

Authored by ruiu on Dec 3 2018, 2:01 PM.

Download Raw Diff

Details

Reviewers

grimar
• espindola
MaskRay

Commits

rGc9c34bdc1af8: Do not use a hash table to uniquify mergeable strings.
rLLD348401: Do not use a hash table to uniquify mergeable strings.
rL348401: Do not use a hash table to uniquify mergeable strings.

Summary

Previously, we have a hash table containing strings and their offsets to
manage mergeable strings. Technically we can live without that because we
can do binary search on a vector of mergeable strings to find a mergeable
strings. The table was there to speed up offset -> string piece lookup.

We recently observed that lld tend to consume more memory than gold when
linking executables with debug info, and we found that a few percent of
memory is consumed by the hash table. I wondered if we can save memory
here, so I run a few benchmarks with and without the hash table. Here is
the result.

Speed (measured by `perf stat -r10`)

Program         w/patch         w/o patch       Slowdown
chrome          2.004718511     1.988568454     0.81%
clang           0.518536707     0.503528155     2.98%
clang-fsds      0.566316070     0.551120769     2.75%
clang-gdb-index 5.091346130     5.052952745     0.75%
gold            0.325922095     0.318968615     2.17%
gold-fsds       0.353570003     0.342080243     3.35%
linux-kernel    0.872819415     0.866891291     0.68%
llvm-as         0.057709114     0.054467349     5.95%
llvm-as-fsds    0.054338842     0.053751006     1.09%
mozilla         3.730120566     3.836634236     -2.77%
scylla          1.011386393     1.031144224     -1.91%

Maximum RSS (measured by `time -v`)

Program         w/patch         w/o patch       Memory saving
chrome          1163088         1163572         0.04%
clang           369364          373760          1.17%
clang-fsds      392484          396712          1.06%
clang-gdb-index 10391636        10391680        0.00%
gold            227508          229108          0.69%
gold-fsds       237308          238960          0.69%
linux-kernel    143028          146932          2.65%
llvm-as         46792           46976           0.39%
llvm-as-fsds    47112           47336           0.47%
mozilla         4631424         4833940         4.18%
scylla          2712036         2793468         2.91%

It looks like the slowdown is not negligible, but lld got slower only when
the program being linked is small. For large programs, the regression
seems small, or it even got faster. Given that, I don't think having the
hash table is still a good tradeoff; we should drop the hash table to save
memory.

Diff Detail

Repository: rL LLVM

Event Timeline

ruiu created this revision.Dec 3 2018, 2:01 PM

Herald added a reviewer: • espindola. · View Herald TranscriptDec 3 2018, 2:01 PM

Herald added subscribers: arichardson, aprantl, emaste. · View Herald Transcript

Harbormaster completed remote builds in B25623: Diff 176474.Dec 3 2018, 2:01 PM

smeenai added a subscriber: smeenai.Dec 3 2018, 5:00 PM

Program w/patch w/o patch Memory saving
chrome 1163088 1163572 0.04%

I guess that was not the debug chrome perhaps.
That might explain why it is 0.81% slower but saves an only minor amount of memory.

I am not sure what I feel about this patch. It can make the link faster in some cases,
but for regular use (linking not huge apps, like clang) it always seems to be up to a few percents slower.

Saving 4.18%, or 200Mb for Mozilla (4631424->4833940) is good, but is that so significant and worth doing?
200Mb is only 200Mb, while -2.77% of link time looks much more impressive to me and
unfortunately, the link time for most of the apps tested is slower with this patch.

Given that this patch is about a finding the balance between speed and memory,
I would like to hear from other people on this.

grimar added subscribers: ikudrin, evgeny777, psmith.Dec 4 2018, 4:34 AM

I think you should focus on large programs. We don't really care about the marginal improvements or regressions for small programs because linking small program is very fast anyway, and we don't care about 10ms improvements or regressions. For large programs, it seems reasonably positive. Also this patch only deletes code.

In D55234#1318754, @ruiu wrote:

and we don't care about 10ms improvements or regressions. For large programs, it seems reasonably positive.

But chrome is 0.81% slower. That is my concern too.

Let me benchmark this on my side too please (1-2 days), I'll bring my results here.

As one of the people who cares about chromium most in the lld community, I can say that 16 milliseconds regression of linking chromium without debug info is totally negligible.

What we should and do actually care about are programs whose size is multi-gibibyte that take 30 seconds to a few minutes to link. Unfortunately I cannot share the programs with you, but we observed both speedup and memory usage reduction with this patch.

In D55234#1318900, @ruiu wrote:

What we should and do actually care about are programs whose size is multi-gibibyte that take 30 seconds to a few minutes to link. Unfortunately I cannot share the programs with you, but we observed both speedup and memory usage reduction with this patch.

Well, that sounds like an argument I can't resist. I do not have such apps underhand for testing,
so if you observe the stable improvements you're happy with, then LGTM.

This revision is now accepted and ready to land.Dec 4 2018, 10:36 AM

I will collect more data points even though we cannot share programs with you to be fair, so please wait for a while.

I didn't notice this patch when I sent out D55248 (I just observed a locally-invented upper_bound without a good justification :) ).

I agree that the memory-consuming llvm::DenseMap<uint32_t, uint32_t> OffsetMap; is not necessary. (just for the record, gold Object_merge_map::get_output_offset uses a similar upper_bound without a hash table).

I think the hash table has some improvement when there are not too many pieces. But when there are many (some .debug_str), the cache miss for each lookup isn't totally negligible. It can be the reason that pure binary search may be faster.

If we do want to optimize llvm::upper_bound, however, I think a one-entry cache (size_t LastIdx; uint64_t LastOffset;) may be more efficient, I've checked locally that many uses follow the pattern of calling getSectionPiece with increasing Offset.

The one-entry cache I mentioned is like the following (I don't say I suggest doing that as it is so complicated with probably so little value):

size_t I;
if (Offset < LastOffset) {
  I = std::upper_bound(Pieces.begin(), Pieces.begin() + LastIdx, Offset,
                       [](uint64_t Offset, SectionPiece P) {
                         return Offset < P.InputOff;
                       }) -
      Pieces.begin();
} else {
  I = LastIdx;
  if (I < Pieces.size() && Pieces[I].InputOff <= Offset)
    ++I;
  if (I < Pieces.size() && Pieces[I].InputOff <= Offset)
    I = std::upper_bound(Pieces.begin() + I + 1, Pieces.end(), Offset,
                         [](uint64_t Offset, SectionPiece P) {
                           return Offset < P.InputOff;
                         }) -
        Pieces.begin();
}

LastOffset = Offset;
LastIdx = I;
return &Pieces[I - 1];

Actually upper_bound is not the best choice here, a slow starting step sizes are better: 1,2,4,8,16,(slow start process) 8,4,2,1 (regular binary search steps)

I ran lld with a few more large programs, and here is the result.

Size   Before  After  
6.1GB  20.16s  19.15s (-4.98%)
3.8GB  56.27s  53.38s (-5.14%)
2.5GB  32.65s  32.56s (+0.28%)

Since large programs take time to link, and in order to reduce noise I need to run it many times, it is not easy to run lld on many large programs, but I believe you can still see a pattern. At least I think I can say that removing the hash table is not bad.

Closed by commit rL348401: Do not use a hash table to uniquify mergeable strings. (authored by ruiu). · Explain WhyDec 5 2018, 11:16 AM

This revision was automatically updated to reflect the committed changes.

In D55234#1320483, @ruiu wrote:
I ran lld with a few more large programs, and here is the result.
Size   Before  After  
6.1GB  20.16s  19.15s (-4.98%)
3.8GB  56.27s  53.38s (-5.14%)
2.5GB  32.65s  32.56s (+0.28%)
Since large programs take time to link, and in order to reduce noise I need to run it many times, it is not easy to run lld on many large programs, but I believe you can still see a pattern. At least I think I can say that removing the hash table is not bad.

Looks good, thanks for the numbers.

In D55234#1318754, @ruiu wrote:

I think you should focus on large programs. We don't really care about the marginal improvements or regressions for small programs because linking small program is very fast anyway, and we don't care about 10ms improvements or regressions. For large programs, it seems reasonably positive. Also this patch only deletes code.

FWIW linking times on smaller (than the very large) programs like Clang are still significant to many folks (including LLVM developers themselves) - for example when running "ninja check-all" many LLVM tools (~30? I forget roughly how many) have to be linked - and saving time on each (or the longest one) means getting out of that very serial step in the check run (massively parallel compiles, serial links, then massively parallel test runs).

In D55234#1325804, @dblaikie wrote:

In D55234#1318754, @ruiu wrote:

I think you should focus on large programs. We don't really care about the marginal improvements or regressions for small programs because linking small program is very fast anyway, and we don't care about 10ms improvements or regressions. For large programs, it seems reasonably positive. Also this patch only deletes code.

FWIW linking times on smaller (than the very large) programs like Clang are still significant to many folks (including LLVM developers themselves) - for example when running "ninja check-all" many LLVM tools (~30? I forget roughly how many) have to be linked - and saving time on each (or the longest one) means getting out of that very serial step in the check run (massively parallel compiles, serial links, then massively parallel test runs).

I second the point, but see my comment above. If we want to optimize further, I think we should try some one-entry cache and leverage the getVA() calling pattern, instead of using a hash table (which has been deleted by this patch). The hash table may actually do worse with some SHF_MERGE|SHF_STRINGS.

In D55234#1325837, @MaskRay wrote:

In D55234#1325804, @dblaikie wrote:

In D55234#1318754, @ruiu wrote:

I think you should focus on large programs. We don't really care about the marginal improvements or regressions for small programs because linking small program is very fast anyway, and we don't care about 10ms improvements or regressions. For large programs, it seems reasonably positive. Also this patch only deletes code.

FWIW linking times on smaller (than the very large) programs like Clang are still significant to many folks (including LLVM developers themselves) - for example when running "ninja check-all" many LLVM tools (~30? I forget roughly how many) have to be linked - and saving time on each (or the longest one) means getting out of that very serial step in the check run (massively parallel compiles, serial links, then massively parallel test runs).

I second the point, but see my comment above. If we want to optimize further, I think we should try some one-entry cache and leverage the getVA() calling pattern, instead of using a hash table (which has been deleted by this patch). The hash table may actually do worse with some SHF_MERGE|SHF_STRINGS.

Yep yep - neat idea/something someone can experiment with at some point if they're feeling like it :)

Revision Contents

Path

Size

lld/

trunk/

ELF/

InputSection.h

1 line

InputSection.cpp

9 lines

Diff 176860

lld/trunk/ELF/InputSection.h

Show First 20 Lines • Show All 247 Lines • ▼ Show 20 Lines	public:

// Translate an offset in the input section to an offset in the parent		// Translate an offset in the input section to an offset in the parent
// MergeSyntheticSection.		// MergeSyntheticSection.
uint64_t getParentOffset(uint64_t Offset) const;		uint64_t getParentOffset(uint64_t Offset) const;

// Splittable sections are handled as a sequence of data		// Splittable sections are handled as a sequence of data
// rather than a single large blob of data.		// rather than a single large blob of data.
std::vector<SectionPiece> Pieces;		std::vector<SectionPiece> Pieces;
llvm::DenseMap<uint32_t, uint32_t> OffsetMap;

// Returns I'th piece's data. This function is very hot when		// Returns I'th piece's data. This function is very hot when
// string merging is enabled, so we want to inline.		// string merging is enabled, so we want to inline.
LLVM_ATTRIBUTE_ALWAYS_INLINE		LLVM_ATTRIBUTE_ALWAYS_INLINE
llvm::CachedHashStringRef getData(size_t I) const {		llvm::CachedHashStringRef getData(size_t I) const {
size_t Begin = Pieces[I].InputOff;		size_t Begin = Pieces[I].InputOff;
size_t End =		size_t End =
(Pieces.size() - 1 == I) ? data().size() : Pieces[I + 1].InputOff;		(Pieces.size() - 1 == I) ? data().size() : Pieces[I + 1].InputOff;
▲ Show 20 Lines • Show All 106 Lines • Show Last 20 Lines

lld/trunk/ELF/InputSection.cpp

	Show First 20 Lines • Show All 1,200 Lines • ▼ Show 20 Lines
	// thread-safe (i.e. no memory allocation from the pools).			// thread-safe (i.e. no memory allocation from the pools).
	void MergeInputSection::splitIntoPieces() {			void MergeInputSection::splitIntoPieces() {
	assert(Pieces.empty());			assert(Pieces.empty());

	if (Flags & SHF_STRINGS)			if (Flags & SHF_STRINGS)
	splitStrings(data(), Entsize);			splitStrings(data(), Entsize);
	else			else
	splitNonStrings(data(), Entsize);			splitNonStrings(data(), Entsize);

	OffsetMap.reserve(Pieces.size());
	for (size_t I = 0, E = Pieces.size(); I != E; ++I)
	OffsetMap[Pieces[I].InputOff] = I;
	}			}

	SectionPiece *MergeInputSection::getSectionPiece(uint64_t Offset) {			SectionPiece *MergeInputSection::getSectionPiece(uint64_t Offset) {
	if (this->data().size() <= Offset)			if (this->data().size() <= Offset)
	fatal(toString(this) + ": offset is outside the section");			fatal(toString(this) + ": offset is outside the section");

	// Find a piece starting at a given offset.
	auto It = OffsetMap.find(Offset);
	if (It != OffsetMap.end())
	return &Pieces[It->second];

	// If Offset is not at beginning of a section piece, it is not in the map.			// If Offset is not at beginning of a section piece, it is not in the map.
	// In that case we need to do a binary search of the original section piece vector.			// In that case we need to do a binary search of the original section piece vector.
	auto It2 =			auto It2 =
	llvm::upper_bound(Pieces, Offset, [](uint64_t Offset, SectionPiece P) {			llvm::upper_bound(Pieces, Offset, [](uint64_t Offset, SectionPiece P) {
	return Offset < P.InputOff;			return Offset < P.InputOff;
	});			});
	return &It2[-1];			return &It2[-1];
	}			}
	▲ Show 20 Lines • Show All 54 Lines • Show Last 20 Lines