This is an archive of the discontinued LLVM Phabricator instance.

Use a buffer when allocating relocations
Needs ReviewPublic

Authored by espindola on Apr 26 2018, 2:39 PM.

Details

Reviewers
ruiu
grimar
Summary

This reduces peak allocations from 564.2 to 545.7 MB when linking chrome.

Diff Detail

Event Timeline

espindola created this revision.Apr 26 2018, 2:39 PM
espindola edited the summary of this revision. (Show Details)Apr 26 2018, 2:42 PM
ruiu added a comment.Apr 26 2018, 2:54 PM

This seems a bit tricky. Doesn't shrink_to_fit work?

This seems a bit tricky. Doesn't shrink_to_fit work?

At least on linux with libstdc++ it doesn't seem to move the data to a smaller buffer. I still get a peak of 564.2 with it.

ruiu added a comment.Apr 26 2018, 3:20 PM

Does your code make this part of code faster? It reduces memory allocation but it does extra memcpy, so I'm wondering.

Does your code make this part of code faster? It reduces memory allocation but it does extra memcpy, so I'm wondering.

On my machine the difference is in the noise. Note that it is not a given that this has more copying. When the vector is resized there is a copy, and we should expect fewer resizes with the single buffer.

I confirm the peak allocation difference. I think that implementation can be a bit simpler though, my suggestion is below.

ELF/Writer.cpp
875

It can be a bit simpler I think:
(RAM profiling shows that the version below consume the same amount of memory).

std::vector<Relocation> Buffer;

// <new comment>
auto WithBuffer = [&](InputSectionBase &Sec) {
  Fn(Sec);
  Buffer = std::move(Sec.Relocations);
  Sec.Relocations = Buffer;
};
ELF/Writer.cpp
875

That would reduce peak allocation, but each call to Fn is still allocating a new buffer, no?

But does suggest a way to simplify it a bit.

grimar added inline comments.Apr 27 2018, 8:06 AM
ELF/Writer.cpp
875

CPU time was about the same as the original code for my way I think.
Your way should be slightly faster though:

I observed a minor speed up between original and your version. It was:
Original CPU time total (3 runs): 2632, 2637, 2622.
Patch: 2604, 2610, 2592.

So it is about 0.5% I think. Results seem was more or less stable. I do not have precise numbers for my version underhand (I can retest it if you think it worth, but it was something about original code numbers). I think the simplicity of implementation makes more sense here probably (and I hope we will be able to use shrink_to_fit one day).

espindola added inline comments.Apr 27 2018, 8:09 AM
ELF/Writer.cpp
875

Testing the suggestion the peak memory is indeed the same, but the total memory allocated is actually higher:

master 765.92 564.2
patch 748.16, 545.7
patch2 865.48, 545.7

ruiu added a comment.Apr 27 2018, 8:22 AM

I wonder if you can make a guess on how many relocations will be inserted to the vector. We know the number of relocations for each input section, so calling reserve() might work.

I wonder if you can make a guess on how many relocations will be inserted to the vector. We know the number of relocations for each input section, so calling reserve() might work.

I think reserve() is exactly what causes high peaks now. We already call it. See:
https://github.com/llvm-mirror/lld/blob/master/ELF/Relocations.cpp#L1020

ruiu added a comment.Apr 27 2018, 8:27 AM

Then maybe we should remove it?

I wonder if you can make a guess on how many relocations will be inserted to the vector. We know the number of relocations for each input section, so calling reserve() might work.

I think reserve() is exactly what causes high peaks now. We already call it. See:
https://github.com/llvm-mirror/lld/blob/master/ELF/Relocations.cpp#L1020

Yes, currently using reserve is critical for performance, but wastes memory.

Use end() when inserting. Doesn't make a difference in here, but is the canonical way of concatenating vectors.

grimar added inline comments.Apr 27 2018, 8:31 AM
ELF/Writer.cpp
875

Did not see that (my profiler seem shows only peak memory).
Then, do we really need to care too much about total memory allocated if performance and peak memory consumption are about the same?

Use end() when inserting. Doesn't make a difference in here, but is the canonical way of concatenating vectors.

Wrong review page :)

Then maybe we should remove it?

Just removing it would be disastrous for performance. It was a 6% improvement when it was added in r319976.

It might be possible to remove it after this patch.

correct patch .

ruiu added a comment.Apr 27 2018, 9:11 AM

I tried this patch and found that at least for Chrome, we wasted memory by calling reserve() on ".data.rel.ro" sections. Such section in Chrome doesn't add any item to Sec.Relocations while their Rels.size() is relatively large, so reserved memory is totally wasted. If you don't call reserve on such section, you can save almost the same memory as you did in this patch. Can you take a look?

I tried this patch and found that at least for Chrome, we wasted memory by calling reserve() on ".data.rel.ro" sections. Such section in Chrome doesn't add any item to Sec.Relocations while their Rels.size() is relatively large, so reserved memory is totally wasted. If you don't call reserve on such section, you can save almost the same memory as you did in this patch. Can you take a look?

The issue is that chrome is linked with -pie. The section in question has a lot of

.quad foo
.quad bar
...

They become dynamic relocation instead of static ones.

I don't see how to predict that in general. I will try removing the reserve in combination with this patch.

Delete call to reserve.

The results are

master 765.5 total, 563.7 peak
with reserve 747.76 total, 545.3 peak
without reserve 747.63 total, 545.3 peak

Few more ideas about that one.

ELF/Writer.cpp
861–874

Less tricky way probably could be to pass std::vector<Relocation> Buffer to Fn, so it could populate it.
That would need to add this argument to few functions. Though also seem would remove argument from some of them,
for example handleMipsTlsRelocation uses InputSectionBase &C to access to C.Relocations. It could take the buffer instead
of the section.

And code would be something like next then:

auto WithBuffer = [&](InputSectionBase &Sec) {
  Fn(Sec, Buffer);
  Sec.Relocations.insert(Sec.Relocations.end(), Buffer.begin(), Buffer.end());
  Buffer.clear();
};
866

I would probably add that we do that to decrease total memory consumption.
Maybe "the exact internal buffer size" would make comment more understandable.

espindola added inline comments.Apr 27 2018, 1:44 PM
ELF/Writer.cpp
861–874

This would require passing a buffer to a lot of places. For example, RelocationBaseSection::addReloc would need it.

As written the code with swap is a bit more complicated, but it is completely local.