A large binary's DWARF info can contain line tables with thousands of entries. Since dynamically resized vectors don't have exact capacities, there could be a large amount of wasted space. This patch changes the DWARF reader to store the line table rows in vectors with exact capacities.
Details
Diff Detail
- Build Status
Buildable 2296 Build 2296: arc lint + arc unit
Event Timeline
mehdi_amini asked: "Isn't there a way to reserve() appropriately initially?"
Not trivially. Look at the while loop that precedes the new code. It looks through the line table but doesn't know beforehand how many Rows there are.
One possible optimization is to first have a dry-run that counts the number of Rows, reserves memory for them, and then runs through again to create the Rows.
Is there a runtime cost to doing this?
lib/DebugInfo/DWARF/DWARFDebugLine.cpp | ||
---|---|---|
525 | memory |
There is a runtime cost. It would make more sense to do a dry-run as I suggested in the comment above. What do you think of that idea?
For a tool like llvm-dsymutil, that reads hundreds of object files on after another and immediately discards the line table after it's finished processing one, this seems like it would always be slower, right?
What's a use-case were we would benefit from this optimization? Did you benchmark it?
I saw that, but I was more wondering if there isn't another way in the Dwarf to get the information (I don't know if this number could be encoded somehow separately).
One possible optimization is to first have a dry-run that counts the number of Rows, reserves memory for them, and then runs through again to create the Rows.
Right, but it depends how fast is it to perform this dry-run, especially compared to cost of a little bit of wasted memory.
I mean the maximum waste here is always < 1 page of memory right?
I mean the maximum waste here is always < 1 page of memory right?
To clarify: there is virtual memory overhead but this is free on 64 bits systems so I don't expect it to matter.
You'd be surprised ... the max RSS dropped from 18 MB to 5 MB when I added this change plus https://reviews.llvm.org/D27961, even though the expected reduction is only about 50%
I would have to wonder whether parsing the entire line table twice (once to count rows, once to populate the list) would cost more than the resize at the end.
I'd like to see such a reduction *with this patch alone*, and have an explanation for it, before considering moving forward in any way.
Here are some new numbers. First value is max RSS in kB, and second value is duration in seconds. I ran each four times. The first run took a little longer, presumably because of the time it took to load the executables for the first time.
A: Without this patch
6319548 21.26
6193428 17.04
6455776 17.92
6403748 17.28
B1: With this change (shrink_to_fit)
5475640 15.26
5722124 15.13
5735732 14.69
5605456 14.06
B2: With reserving new vector instead of shrink_to_fit
3138480 5.79
3136468 5.06
3136444 5.12
3138528 5.19
I counted the nominal number of kB saved by reducing vector capacities: 812400. This is about the difference in max RSS between A and B1.
But B2 had a much larger reduction in both max RSS and runtime duration even though it accomplishes the same thing. B2 does the following instead of calling shrink_to_fit (see https://reviews.llvm.org/D27883):
std::vector<DWARFDebugLine::Row> compactRows; compactRows.reserve(Rows.size()); std::copy(Rows.begin(), Rows.end(), compactRows.begin()); Rows = std::move(compactRows);
Can you provide the instructions (and inputs if needed) to reproduce?
Also, have you tried the approach you mentioned of having dry-run first to count the number of row? (Just curious)
Thanks.
I'm using non-public code, but if you have some binary that can read in debug info using DWARFContext, it would suffice. As for the input binary, I used a large 4 GB executable, but I see a similar difference between B1 and B2 with a 200 MB binary.
Also, have you tried the approach you mentioned of having dry-run first to count the number of row? (Just curious)
No, but I can try it later.
Also, do you have an explanation for these numbers?
Not off the top of my head. I'll have to run a few other things.
I decided to try this with llvm-symbolizer, built from the LLVM source repo. Max RSS in kB:
A: 343280
B1: 318584
B2: 318508
Nominal capacity reduction is 32705 kB, which is about the difference between A and B1/B2. There is no significant difference between B1 and B2.
I suspect the discrepancy we saw earlier between B1 and B2 was due to a quirk of the allocator we used. But what exactly that quirk is, I don't know.
I can try the dry run tomorrow.
Dry run max RSS (kB) and duration (s):
314544 5.64
314564 5.36
314544 5.71
shrink_to_fit() max RSS (kB) and duration (s):
318796 5.89
318816 5.57
318796 4.54
With dry run it uses about 1.5% less memory than calling shrink_to_fit(). But the runtime durations are about the same.
I can believe that the dry-run could save a little bit of RSS, since it won't stress as much the memory allocator (but RSS is not necessarily the critical metric either here).
All the remaining differences are in the noise? (Not sure if I got all the numbers right)
Her'es another 5 runs of each. None of these is the initial run after building llvm-symbolizer, which seems to take slightly longer because of loading the 69 MB binary for the first time.
With shrink_to_fit:
318800 5.64
318808 5.35
318796 5.74
318812 5.43
318816 5.76
With dry run:
314600 5.49
314592 4.85
314544 5.68
314600 5.14
314544 5.05
I think the difference in duration is just noise. But changing to a dry-run model seems like a lot of code churn for saving 4.2 MB of max RSS on a 200 MB input file.
Here are numbers from status quo, skipping initial run:
343304 5.43
342304 5.50
342288 4.51
344284 4.48
344844 4.76
347296 4.93
345820 5.52
344532 4.72
343596 4.70
Some of these are under 5 seconds. But I ran the shrink_to_fit version another 5 more times and 2 of those times were about 4.75 seconds.
There's plenty of overlap between the two, in the 4.75-5.5 second range. But the current version still looks a little faster.
This seems like a decent idea.
Mehdi: Adrian: Any objections still?
Simon: I'm going to accept, but let's not commit until they're happy.
Thanks!
-eric
If you *need* a specific proprietary code for reproducing anything and it does not show up on for example the clang build, that'd raise more concerns.
Also the recent measurements looks like this patch was a little slower.
Another thing you could provide outside of a reproducer, is the output of a run with tcmalloc and the environment variable MALLOCSTATS
Definitely shouldn't need that. It may need a large enough set of debug info to matter, but showing scaling issues with that sort of thing should be fine and possible. Otherwise there's no compelling argument to put this patch in :)
I like the idea of a tcmalloc stats run as well.
FYI, I just benchmarked "dsymutil llc" (all targets, both release+asserts+debuginfo) and comparing the best out of three runs each, the difference between before and after the patch is in the noise. This should not be too surprising since dsymutil is mostly I/O-bound.
memory