This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
lib/DebugInfo/DWARF/
-
DebugInfo/
-
DWARF/
1/1
DWARFDebugLine.cpp

Differential D27960

Use exact vector capacities to store DWARF line tables
AcceptedPublic

Authored by sque on Dec 19 2016, 5:03 PM.

Download Raw Diff

Details

Reviewers

echristo

Summary

A large binary's DWARF info can contain line tables with thousands of entries. Since dynamically resized vectors don't have exact capacities, there could be a large amount of wasted space. This patch changes the DWARF reader to store the line table rows in vectors with exact capacities.

Diff Detail

Build Status

Buildable 2296
Build 2296: arc lint + arc unit

Event Timeline

sque updated this revision to Diff 82041.Dec 19 2016, 5:03 PM

sque retitled this revision from to Use exact vector capacities to store DWARF line tables.

sque updated this object.

sque added a reviewer: echristo.

sque added a subscriber: llvm-commits.

Herald added a subscriber: mehdi_amini. · View Herald TranscriptDec 19 2016, 5:03 PM

mehdi_amini asked: "Isn't there a way to reserve() appropriately initially?"

Not trivially. Look at the while loop that precedes the new code. It looks through the line table but doesn't know beforehand how many Rows there are.

One possible optimization is to first have a dry-run that counts the number of Rows, reserves memory for them, and then runs through again to create the Rows.

Is there a runtime cost to doing this?

lib/DebugInfo/DWARF/DWARFDebugLine.cpp
525	memory

Fixed typo: s/memroy/memory

There is a runtime cost. It would make more sense to do a dry-run as I suggested in the comment above. What do you think of that idea?

For a tool like llvm-dsymutil, that reads hundreds of object files on after another and immediately discards the line table after it's finished processing one, this seems like it would always be slower, right?

What's a use-case were we would benefit from this optimization? Did you benchmark it?

In D27960#627263, @sque wrote:

mehdi_amini asked: "Isn't there a way to reserve() appropriately initially?"

Not trivially. Look at the while loop that precedes the new code. It looks through the line table but doesn't know beforehand how many Rows there are.

I saw that, but I was more wondering if there isn't another way in the Dwarf to get the information (I don't know if this number could be encoded somehow separately).

One possible optimization is to first have a dry-run that counts the number of Rows, reserves memory for them, and then runs through again to create the Rows.

Right, but it depends how fast is it to perform this dry-run, especially compared to cost of a little bit of wasted memory.
I mean the maximum waste here is always < 1 page of memory right?

I mean the maximum waste here is always < 1 page of memory right?

To clarify: there is virtual memory overhead but this is free on 64 bits systems so I don't expect it to matter.

You'd be surprised ... the max RSS dropped from 18 MB to 5 MB when I added this change plus https://reviews.llvm.org/D27961, even though the expected reduction is only about 50%

I would have to wonder whether parsing the entire line table twice (once to count rows, once to populate the list) would cost more than the resize at the end.

In D27960#627361, @sque wrote:

You'd be surprised ... the max RSS dropped from 18 MB to 5 MB when I added this change plus https://reviews.llvm.org/D27961, even though the expected reduction is only about 50%

I'd like to see such a reduction *with this patch alone*, and have an explanation for it, before considering moving forward in any way.

In D27960#627364, @mehdi_amini wrote:

I'd like to see such a reduction *with this patch alone*, and have an explanation for it, before considering moving forward in any way.

Here are some new numbers. First value is max RSS in kB, and second value is duration in seconds. I ran each four times. The first run took a little longer, presumably because of the time it took to load the executables for the first time.

A: Without this patch

6319548 21.26
6193428 17.04
6455776 17.92
6403748 17.28

B1: With this change (shrink_to_fit)

5475640 15.26
5722124 15.13
5735732 14.69
5605456 14.06

B2: With reserving new vector instead of shrink_to_fit

3138480 5.79
3136468 5.06
3136444 5.12
3138528 5.19

I counted the nominal number of kB saved by reducing vector capacities: 812400. This is about the difference in max RSS between A and B1.

But B2 had a much larger reduction in both max RSS and runtime duration even though it accomplishes the same thing. B2 does the following instead of calling shrink_to_fit (see https://reviews.llvm.org/D27883):

std::vector<DWARFDebugLine::Row> compactRows;
compactRows.reserve(Rows.size());
std::copy(Rows.begin(), Rows.end(), compactRows.begin());
Rows = std::move(compactRows);

Can you provide the instructions (and inputs if needed) to reproduce?

Also, have you tried the approach you mentioned of having dry-run first to count the number of row? (Just curious)

Thanks.

Also, do you have an explanation for these numbers?

In D27960#628119, @mehdi_amini wrote:

Can you provide the instructions (and inputs if needed) to reproduce?

I'm using non-public code, but if you have some binary that can read in debug info using DWARFContext, it would suffice. As for the input binary, I used a large 4 GB executable, but I see a similar difference between B1 and B2 with a 200 MB binary.

Also, have you tried the approach you mentioned of having dry-run first to count the number of row? (Just curious)

No, but I can try it later.

Also, do you have an explanation for these numbers?

Not off the top of my head. I'll have to run a few other things.

In D27960#628350, @sque wrote:

In D27960#628119, @mehdi_amini wrote:

Can you provide the instructions (and inputs if needed) to reproduce?

I'm using non-public code, but if you have some binary that can read in debug info using DWARFContext, it would suffice. As for the input binary, I used a large 4 GB executable, but I see a similar difference between B1 and B2 with a 200 MB binary.

I decided to try this with llvm-symbolizer, built from the LLVM source repo. Max RSS in kB:
A: 343280
B1: 318584
B2: 318508

Nominal capacity reduction is 32705 kB, which is about the difference between A and B1/B2. There is no significant difference between B1 and B2.

I suspect the discrepancy we saw earlier between B1 and B2 was due to a quirk of the allocator we used. But what exactly that quirk is, I don't know.

I can try the dry run tomorrow.

Dry run max RSS (kB) and duration (s):
314544 5.64
314564 5.36
314544 5.71

shrink_to_fit() max RSS (kB) and duration (s):
318796 5.89
318816 5.57
318796 4.54

With dry run it uses about 1.5% less memory than calling shrink_to_fit(). But the runtime durations are about the same.

I can believe that the dry-run could save a little bit of RSS, since it won't stress as much the memory allocator (but RSS is not necessarily the critical metric either here).
All the remaining differences are in the noise? (Not sure if I got all the numbers right)

Her'es another 5 runs of each. None of these is the initial run after building llvm-symbolizer, which seems to take slightly longer because of loading the 69 MB binary for the first time.

With shrink_to_fit:
318800 5.64
318808 5.35
318796 5.74
318812 5.43
318816 5.76

With dry run:
314600 5.49
314592 4.85
314544 5.68
314600 5.14
314544 5.05

I think the difference in duration is just noise. But changing to a dry-run model seems like a lot of code churn for saving 4.2 MB of max RSS on a 200 MB input file.

And shrink_to_fit() compare to status-quo?

Here are numbers from status quo, skipping initial run:
343304 5.43
342304 5.50
342288 4.51
344284 4.48
344844 4.76
347296 4.93
345820 5.52
344532 4.72
343596 4.70

Some of these are under 5 seconds. But I ran the shrink_to_fit version another 5 more times and 2 of those times were about 4.75 seconds.

There's plenty of overlap between the two, in the 4.75-5.5 second range. But the current version still looks a little faster.

This seems like a decent idea.

Mehdi: Adrian: Any objections still?

Simon: I'm going to accept, but let's not commit until they're happy.

Thanks!

-eric

This revision is now accepted and ready to land.Jan 3 2017, 7:03 PM

In D27960#634855, @echristo wrote:

This seems like a decent idea.

Mehdi: Adrian: Any objections still?

Yes, I'd like either the ability to reproduce, or a rational explanation.

If you *need* a specific proprietary code for reproducing anything and it does not show up on for example the clang build, that'd raise more concerns.

Also the recent measurements looks like this patch was a little slower.

Another thing you could provide outside of a reproducer, is the output of a run with tcmalloc and the environment variable MALLOCSTATS

In D27960#634907, @mehdi_amini wrote:

If you *need* a specific proprietary code for reproducing anything and it does not show up on for example the clang build, that'd raise more concerns.

Definitely shouldn't need that. It may need a large enough set of debug info to matter, but showing scaling issues with that sort of thing should be fine and possible. Otherwise there's no compelling argument to put this patch in :)

I like the idea of a tcmalloc stats run as well.

I will defer to Mehdi here.

aprantl edited edge metadata.Jan 4 2017, 1:26 PM

aprantl added a subscriber: friss.

FYI, I just benchmarked "dsymutil llc" (all targets, both release+asserts+debuginfo) and comparing the best out of three runs each, the difference between before and after the patch is in the noise. This should not be too surprising since dsymutil is mostly I/O-bound.

Revision Contents

Path

Size

lib/

DebugInfo/

DWARF/

DWARFDebugLine.cpp

5 lines

Diff 82048

lib/DebugInfo/DWARF/DWARFDebugLine.cpp

Show First 20 Lines • Show All 514 Lines • ▼ Show 20 Lines	if (!Sequences.empty()) {
// Note: actually, instruction address ranges of sequences should not		// Note: actually, instruction address ranges of sequences should not
// overlap (in shared objects and executables). If they do, the address		// overlap (in shared objects and executables). If they do, the address
// lookup would still work, though, but result would be ambiguous.		// lookup would still work, though, but result would be ambiguous.
// We don't report warning in this case. For example,		// We don't report warning in this case. For example,
// sometimes .so compiled from multiple object files contains a few		// sometimes .so compiled from multiple object files contains a few
// rudimentary sequences for address ranges [0x0, 0xsomething).		// rudimentary sequences for address ranges [0x0, 0xsomething).
}		}

		// There could be hundreds or thousands of Rows in the line table. The vector
		// will likely have an internal reserved memory block that is much bigger
		// than necessary. Reduce the reserved capacity to conserve memory.
		aprantlUnsubmitted Done Reply Inline Actions memory aprantl: memory
		Rows.shrink_to_fit();

return end_offset;		return end_offset;
}		}

uint32_t		uint32_t
DWARFDebugLine::LineTable::findRowInSeq(const DWARFDebugLine::Sequence &seq,		DWARFDebugLine::LineTable::findRowInSeq(const DWARFDebugLine::Sequence &seq,
uint64_t address) const {		uint64_t address) const {
if (!seq.containsPC(address))		if (!seq.containsPC(address))
return UnknownRowIndex;		return UnknownRowIndex;
▲ Show 20 Lines • Show All 153 Lines • Show Last 20 Lines