This is an archive of the discontinued LLVM Phabricator instance.


Differential D46145

Use a buffer when allocating relocations
Needs Review · Public

Authored by espindola on Apr 26 2018, 2:39 PM.

Details

Reviewers

ruiu
grimar

Summary

This reduces peak allocations from 564.2 MB to 545.7 MB when linking Chrome.

Event Timeline

espindola created this revision. Apr 26 2018, 2:39 PM

Herald added subscribers: arichardson, emaste. Apr 26 2018, 2:39 PM

espindola edited the summary of this revision. Apr 26 2018, 2:42 PM

This seems a bit tricky. Doesn't shrink_to_fit work?

In D46145#1080123, @ruiu wrote:

This seems a bit tricky. Doesn't shrink_to_fit work?

At least on Linux with libstdc++, it doesn't seem to move the data to a smaller buffer. I still get a peak of 564.2 MB with it.
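
As background for this exchange: `std::vector::shrink_to_fit` is only a non-binding request to reduce capacity, so a standard library may leave the allocation untouched. The following is a minimal sketch for checking what a given implementation does; the helper name `capacityAfterShrink` is made up for illustration and is not part of lld.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: reserve room for `Reserved` ints, use only `Used` of
// them, ask the vector to shrink, and report the capacity that remains.
std::size_t capacityAfterShrink(std::size_t Reserved, std::size_t Used) {
  std::vector<int> V;
  V.reserve(Reserved);
  for (std::size_t I = 0; I < Used; ++I)
    V.push_back(static_cast<int>(I));
  assert(V.size() == Used);
  V.shrink_to_fit(); // Non-binding request: capacity may or may not drop.
  return V.capacity();
}
```

If `capacityAfterShrink(1000, 10)` still returns a value near 1000, the request was not honored by that library and build configuration, which would be consistent with the unchanged peak reported above.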

Does your code make this part of the code faster? It reduces memory allocation but does an extra memcpy, so I'm wondering.

In D46145#1080172, @ruiu wrote:

Does your code make this part of the code faster? It reduces memory allocation but does an extra memcpy, so I'm wondering.

On my machine the difference is in the noise. Note that it is not a given that this has more copying. When the vector is resized there is a copy, and we should expect fewer resizes with the single buffer.
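
The point about fewer resizes can be illustrated with a small sketch. It counts capacity changes as a rough proxy for reallocate-and-copy events, comparing a fresh vector per section against one buffer that is cleared and reused. All names here are hypothetical and the element type is a plain `int`, not lld's `Relocation`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Append `Count` ints to `V` and return how many times its capacity changed,
// which approximates how often the vector reallocated and copied its data.
std::size_t appendCountingGrowth(std::vector<int> &V, std::size_t Count) {
  std::size_t Growths = 0;
  for (std::size_t I = 0; I < Count; ++I) {
    std::size_t Before = V.capacity();
    V.push_back(static_cast<int>(I));
    if (V.capacity() != Before)
      ++Growths;
  }
  return Growths;
}

// Each "section" fills a brand-new vector, so each one regrows from zero.
std::size_t growthsWithFreshVectors(std::size_t Sections,
                                    std::size_t PerSection) {
  std::size_t Total = 0;
  for (std::size_t S = 0; S < Sections; ++S) {
    std::vector<int> Fresh;
    Total += appendCountingGrowth(Fresh, PerSection);
  }
  return Total;
}

// All "sections" fill one buffer. clear() keeps the capacity, so later
// sections reuse the storage that earlier sections already grew.
std::size_t growthsWithSharedBuffer(std::size_t Sections,
                                    std::size_t PerSection) {
  std::vector<int> Buffer;
  std::size_t Total = 0;
  for (std::size_t S = 0; S < Sections; ++S) {
    Total += appendCountingGrowth(Buffer, PerSection);
    // The single copy per section into a vector sized for its contents.
    std::vector<int> Final(Buffer.begin(), Buffer.end());
    assert(Final.size() == PerSection);
    Buffer.clear();
  }
  return Total;
}
```

With equally sized sections, the shared buffer grows only while the first section is filled, so the extra copy per section is traded against the repeated regrowth of the per-section vectors.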

I confirm the peak allocation difference. I think the implementation can be a bit simpler, though; my suggestion is below.

ELF/Writer.cpp
875	I think it can be a bit simpler (RAM profiling shows that the version below consumes the same amount of memory):

std::vector<Relocation> Buffer;
// <new comment>
auto WithBuffer = [&](InputSectionBase &Sec) {
  Fn(Sec);
  Buffer = std::move(Sec.Relocations);
  Sec.Relocations = Buffer;
};

ELF/Writer.cpp
875	That would reduce peak allocation, but each call to Fn is still allocating a new buffer, no? But it does suggest a way to simplify it a bit.
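
To make the move-then-copy suggestion and the objection to it concrete, here is a self-contained sketch. `Relocation`, `Section`, and `scanSection` are simplified stand-ins invented for this example, not lld's types, and `scanSection` only imitates an `Fn` that reserves more than it uses.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Relocation { int Offset; };                        // stand-in type
struct Section { std::vector<Relocation> Relocations; };  // stand-in type

// Stand-in for Fn: over-reserves, then records only a few relocations.
// Because the section's vector is small on entry, this allocates each time.
void scanSection(Section &Sec, std::size_t Reserved, std::size_t Added) {
  Sec.Relocations.reserve(Reserved);
  for (std::size_t I = 0; I < Added; ++I)
    Sec.Relocations.push_back({static_cast<int>(I)});
}

// The suggested pattern: move the oversized vector into Buffer, then copy it
// back. Common implementations size the copy's storage to the contents, but
// the standard does not promise an exact fit.
std::size_t moveThenCopy(Section &Sec, std::vector<Relocation> &Buffer,
                         std::size_t Reserved, std::size_t Added) {
  scanSection(Sec, Reserved, Added);
  Buffer = std::move(Sec.Relocations); // Buffer takes over the big storage.
  Sec.Relocations = Buffer;            // Copy into newly allocated storage.
  assert(Sec.Relocations.size() == Added);
  return Sec.Relocations.capacity();
}
```

Only one oversized allocation is alive at a time, which is why the peak drops, but every section still pays for a fresh oversized allocation inside the scan.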

grimar added inline comments. Apr 27 2018, 8:06 AM

ELF/Writer.cpp
875	I think CPU time for my way was about the same as for the original code. Your way should be slightly faster, though: I observed a minor speedup between the original and your version. Original CPU time total (3 runs): 2632, 2637, 2622. Patch: 2604, 2610, 2592. So it is about 0.5%, I think. The results seem more or less stable. I do not have precise numbers for my version at hand (I can retest it if you think it is worth it, but they were roughly the same as the original code's numbers). I think simplicity of implementation probably matters more here (and I hope we will be able to use `shrink_to_fit` one day).

espindola added inline comments. Apr 27 2018, 8:09 AM

ELF/Writer.cpp
875	Testing the suggestion, the peak memory is indeed the same, but the total memory allocated is actually higher:

master 765.92 total, 564.2 peak
patch 748.16 total, 545.7 peak
patch2 865.48 total, 545.7 peak

I wonder if you can make a guess on how many relocations will be inserted into the vector. We know the number of relocations for each input section, so calling reserve() might work.

In D46145#1081117, @ruiu wrote:

I wonder if you can make a guess on how many relocations will be inserted into the vector. We know the number of relocations for each input section, so calling reserve() might work.

I think reserve() is exactly what causes high peaks now. We already call it. See:
https://github.com/llvm-mirror/lld/blob/master/ELF/Relocations.cpp#L1020

Then maybe we should remove it?

In D46145#1081118, @grimar wrote:

In D46145#1081117, @ruiu wrote:

I wonder if you can make a guess on how many relocations will be inserted into the vector. We know the number of relocations for each input section, so calling reserve() might work.

I think reserve() is exactly what causes high peaks now. We already call it. See:
https://github.com/llvm-mirror/lld/blob/master/ELF/Relocations.cpp#L1020

Yes, currently using reserve is critical for performance, but wastes memory.

Use end() when inserting. It doesn't make a difference here, but it is the canonical way of concatenating vectors.

grimar added inline comments. Apr 27 2018, 8:31 AM

ELF/Writer.cpp
875	I did not see that (my profiler seems to show only peak memory). Then do we really need to care much about total memory allocated if performance and peak memory consumption are about the same?

In D46145#1081129, @espindola wrote:

Use end() when inserting. It doesn't make a difference here, but it is the canonical way of concatenating vectors.

Wrong review page :)

In D46145#1081120, @ruiu wrote:

Then maybe we should remove it?

Just removing it would be disastrous for performance. It was a 6% improvement when it was added in r319976.

It might be possible to remove it after this patch.

Correct patch.

I tried this patch and found that, at least for Chrome, we waste memory by calling reserve() on ".data.rel.ro" sections. Such sections in Chrome don't add any items to Sec.Relocations while their Rels.size() is relatively large, so the reserved memory is totally wasted. If you don't call reserve on such sections, you can save almost the same amount of memory as you did in this patch. Can you take a look?
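
A small sketch of the waste described here, under the assumption that capacity is reserved from the input relocation count while far fewer entries are actually stored. `Relocation` and `wastedSlots` are illustrative names, not lld code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Relocation { int Offset; }; // stand-in type

// Reserve one slot per input relocation (`NumRels`), store only `Kept`
// entries, and return how many reserved slots are never used.
std::size_t wastedSlots(std::size_t NumRels, std::size_t Kept) {
  assert(Kept <= NumRels);
  std::vector<Relocation> Relocations;
  Relocations.reserve(NumRels);
  for (std::size_t I = 0; I < Kept; ++I)
    Relocations.push_back({static_cast<int>(I)});
  return Relocations.capacity() - Relocations.size();
}
```

For a section that stores nothing, `wastedSlots(N, 0)` is at least `N`, and that memory stays allocated for as long as the vector lives.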

In D46145#1081212, @ruiu wrote:

I tried this patch and found that, at least for Chrome, we waste memory by calling reserve() on ".data.rel.ro" sections. Such sections in Chrome don't add any items to Sec.Relocations while their Rels.size() is relatively large, so the reserved memory is totally wasted. If you don't call reserve on such sections, you can save almost the same amount of memory as you did in this patch. Can you take a look?

The issue is that Chrome is linked with -pie. The section in question has a lot of

.quad foo
.quad bar
...

They become dynamic relocations instead of static ones.

I don't see how to predict that in general. I will try removing the reserve in combination with this patch.

Delete call to reserve.

The results are

master 765.5 total, 563.7 peak
with reserve 747.76 total, 545.3 peak
without reserve 747.63 total, 545.3 peak

A few more ideas about this one.

ELF/Writer.cpp
861–874	A less tricky way could probably be to pass a `std::vector<Relocation>` Buffer to `Fn`, so that it could populate it. That would require adding this argument to a few functions, though it seems it would also remove an argument from some of them. For example, `handleMipsTlsRelocation` uses `InputSectionBase &C` to access `C.Relocations`; it could take the buffer instead of the section. The code would then be something like this:

auto WithBuffer = [&](InputSectionBase &Sec) {
  Fn(Sec, Buffer);
  Sec.Relocations.insert(Sec.Relocations.end(), Buffer.begin(), Buffer.end());
  Buffer.clear();
};
866	I would probably add that we do this to decrease total memory consumption. Perhaps "the exact internal buffer size" would make the comment more understandable.

espindola added inline comments. Apr 27 2018, 1:44 PM

ELF/Writer.cpp
861–874	This would require passing a buffer to a lot of places. For example, RelocationBaseSection::addReloc would need it. As written the code with swap is a bit more complicated, but it is completely local.
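
For readers who want to try the swap-based approach outside lld, here is a self-contained sketch of the same idea. `Relocation`, `Section`, and `forEachSectionWithBuffer` are simplified, hypothetical stand-ins, and the callback plays the role of `Fn`, which appends to the section's own vector without knowing about the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

struct Relocation { int Offset; };                        // stand-in type
struct Section { std::vector<Relocation> Relocations; };  // stand-in type

// Run Fn on every section while lending it one shared, reusable buffer.
void forEachSectionWithBuffer(std::vector<Section> &Sections,
                              const std::function<void(Section &)> &Fn) {
  std::vector<Relocation> Buffer;
  for (Section &Sec : Sections) {
    // Lend the buffer's storage to the section, so that Fn grows (or
    // reserves into) the shared allocation instead of a per-section one.
    std::swap(Sec.Relocations, Buffer);
    Fn(Sec);
    // Swap back: the section regains its original vector and Buffer now
    // holds whatever Fn appended.
    std::swap(Sec.Relocations, Buffer);
    // Append the new entries. The section's vector is typically allocated
    // to fit, although the standard does not promise an exact capacity.
    Sec.Relocations.insert(Sec.Relocations.end(), Buffer.begin(),
                           Buffer.end());
    Buffer.clear(); // Keeps the capacity for the next section.
  }
}
```

The swap keeps the change local to the loop: the callback's signature and everything it calls stay as they are, which is the trade-off described in the comment above.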

Revision Contents

Path: ELF/Writer.cpp
Size: 17 lines

Diff 144350

ELF/Writer.cpp

[852 lines not shown; enclosing function: template <class ELFT> void Writer<ELFT>::addRelIpltSymbols() {]
       addOptionalRegular(S, InX::RelaIplt, 0, STV_HIDDEN, STB_WEAK);

   S = Config->IsRela ? "__rela_iplt_end" : "__rel_iplt_end";
   ElfSym::RelaIpltEnd =
       addOptionalRegular(S, InX::RelaIplt, 0, STV_HIDDEN, STB_WEAK);
 }

 template <class ELFT>
 void Writer<ELFT>::forEachRelSec(std::function<void(InputSectionBase &)> Fn) {
+  std::vector<Relocation> Buffer;
+
+  // We allocate a lot of relocations. By using a single buffer for
+  // all sections we ensure that the final vector in each section has
+  // the exact size.
+  auto WithBuffer = [&](InputSectionBase &Sec) {
+    swap(Sec.Relocations, Buffer);
+    Fn(Sec);
+    swap(Sec.Relocations, Buffer);
+    Sec.Relocations.insert(Sec.Relocations.end(), Buffer.begin(), Buffer.end());
+    Buffer.clear();
+  };
+
   // Scan all relocations. Each relocation goes through a series
   // of tests to determine if it needs special treatment, such as
   // creating GOT, PLT, copy relocations, etc.
   // Note that relocations for non-alloc sections are directly
   // processed by InputSection::relocateNonAlloc.
   for (InputSectionBase *IS : InputSections)
     if (IS->Live && isa<InputSection>(IS) && (IS->Flags & SHF_ALLOC))
-      Fn(*IS);
+      WithBuffer(*IS);
   for (EhInputSection *ES : InX::EhFrame->Sections)
-    Fn(*ES);
+    WithBuffer(*ES);
 }

 // This function generates assignments for predefined symbols (e.g. _end or
 // _etext) and inserts them into the commands sequence to be processed at the
 // appropriate time. This ensures that the value is going to be correct by the
 // time any references to these symbols are processed and is equivalent to
 // defining these symbols explicitly in the linker script.
 template <class ELFT> void Writer<ELFT>::setReservedSymbolSections() {
[1,462 lines not shown]