This is an archive of the discontinued LLVM Phabricator instance.

Rewrite the cached map used for locating the most precise DIE among inlined subroutines for a given address.
ClosedPublic

Authored by chandlerc on Dec 7 2017, 3:24 PM.

Details

Summary

This is essentially the hot path of llvm-symbolizer when extracting
inlined frames during symbolization. Previously, we would read every
subprogram and every inlined subroutine, building a std::map across the
entire PC space to the best DIE, and then do only a handful of queries
as we symbolized a backtrace. A huge fraction of the time was spent
building the map itself.

This patch changes it two a two-level system. First, we just build a map
from PC-interval to DWARF subprograms. These are required to be disjoint
and so constructing this is pretty easy. Second, we build a map *just*
for the inlined subroutines within the subprogram containing the query
address. This allows us to look at far fewer DIEs and build a *much*
smaller set of cached maps in the llvm-symbolizer case where only a few
address get symbolized during the entire run.

It also builds both interval maps in a very different way. It constructs
a single flat vector of pairs that maps from offset -> index. The
indices point into collections of DIE objects, but can also be
"tombstones" (-1) to mark gaps. In the case of subprograms, this mostly
just simplifies the data structure a bit. For inlined subroutines,
because we carefully split them as we build the map, we end up in many
cases having no holes and not having to store both start and stop
offsets.

Finally, the PC ranges for the inlined subroutines are compressed into
32-bits by making them relative to the base PC of the outer subprogram.
This means that if you have a single function body with over 2gb of
executable code in it, we will stop mapping address past the first 2gb
of that function into inlined subroutines and just give you the
subprogram. This doesn't seem like a problem. ;]

All of this combines to make llvm-symbolizer *well* over 2x faster for
symbolizing backtraces out of LLVM's unittests. Death-test heavy unit
tests are running >2x faster. I'm still going to look at completely
disabling symbolization there, but figured while I had a good benchmark
we should make symbolization a bit better.

Sadly, the logic to build the flat interval map for the inlined
subroutines is fairly complex. I'm not super happy about this and
welcome any simplifying suggestions.

Also, the names of various components here seem a bit confusing and/or
redundant. I've tried a bunch of options and this is the least bad one
I've found but I'd love better naming patterns to use.

And last but not least, some aspects of the algorithm for this changed
several times while I was working on this. I may have some stale
comments or failing to comment things that really should be; don't
hesitate to let me know about these or to just ignore them and I'll do
a thorough once over tomorrow.

Huge thanks to Dave Blaikie who helped walk me through what the various
things I needed to do in DWARF to make this work.

Event Timeline

chandlerc created this revision.Dec 7 2017, 3:24 PM
dblaikie edited edge metadata.Dec 7 2017, 3:57 PM

Code mostly looks plausible. Haven't quite understood the "ParentIntervals*" variables/processing.

llvm/lib/DebugInfo/DWARF/DWARFUnit.cpp
358–359
for(DWARFDie Child : Die.children())
371–382

This does slightly change the claim in the patch description about only 2GB functions being a problem. Since this at least in theory supports fragmented functions, even a small function with, say split into a hot and cold section, could end up with the hot part 2GB of address away from the cold part perhaps more readily/likely than a 2GB function existing.

But still probably nothing to worry about.

391–395

subprograms that are children of other subprograms might be arbitrarily nested - not only direct children (looks like this code only handles direct children?)

(though there's no /good/ reason they would be nested inside an inlined subroutine)

eg: you could have code like:

void f1() {
  if (int x = ...) {
    void f2() { ... }
    ... 
  }
}

in which case, you could have DWARF like:

DW_TAG_subprogram
  DW_AT_name "f1"
  DW_TAG_lexical_block
    DW_TAG_subprogram
      DW_AT_name "f2"

which, yes, does mean you have to walk all the DIEs to find all the subprograms... not sure how much that defeats the goals here (I guess not having to build up so much in the way of data structures while doing so is still advantageous).

464

What's the "first" in this case - is it deterministically the most nested? Least nested? Unspecified?

We'd need to make sure it's the most nested.

478

Nit: Usually I'm of the opinion if a lambda doesn't leak out of scope (via a std::function or similar), just use [&] rather than an explicit list. Same as a loop/conditional/etc doesn't need documentation about which variables are used in the scope. One more thing to have to touch if the code changes, especially since there's a -Wunused-lambda-capture that comes up a bit on the bots regularly.

But up to you.

481–482
for (DWARFDie Child : Die)
489

Spurious semi

514–516

This is just for bad input, I take it? (where the address range of an inlined subroutine is outside that of the enclosing subprogram)

542–543

What's the behavior if they do overlap? While the DWARF spec & a reasonable person would say they can't, physically the data could contain such ranges. Will this crash? Have unreliable behavior? What's reasonable on invalid input here?

(it's probably fine, just checking it's articulated)

568

Nit: Usually I see "i != e" in C++ code. Is < preferred?

625–627

Nit: {} on a single line scope.

aprantl added inline comments.Dec 7 2017, 3:59 PM
llvm/include/llvm/DebugInfo/DWARF/DWARFUnit.h
457

I think it would be nice to copy the high level overview of the two-level approach outlined in the phabricator review into a comment somewhere here.

llvm/lib/DebugInfo/DWARF/DWARFUnit.cpp
568

LLVM style wants this to be I and E.

709

That --J looks scary since we are using J later in the same expression.
Perhaps factor this into a separate statement?

chandlerc updated this revision to Diff 126113.Dec 8 2017, 3:16 AM
chandlerc marked 7 inline comments as done.

Update based on code review comments.

Thanks for all the feedback!

llvm/include/llvm/DebugInfo/DWARF/DWARFUnit.h
457

Yeah, added comments to both of these routines in the source file (since they're implementation functions) and also added the two-level descriptive comments to the implementation of getSubroutineForAddress as that seems to be the most cohesive place to show the pattern of access.

llvm/lib/DebugInfo/DWARF/DWARFUnit.cpp
358–359

This is what I get for cargo culting. =]

371–382

Yeah, I've switched to at least use uint32_t so we have 4gb. I'll document it as well.

391–395

Ow, OK.

This does make things slower, but we still get to avoid decoding every inlined subroutine. We just walk the graph w/o reading the address ranges. So my patch still helps.

This change costs us about 10% total. =[

464

We aggressively split as we insert so that we should end up with essentially a single (most nested) offset which has a valid DIE index.

514–516

Yes, I just wanted to bound how bad that got below when I re-base the offsets.

542–543

It should produces some unpredictable result, but never crash. Regardless of whether this holds, we can still sort and unique them. It's just that the resulting thing may be.... surprising in the results it gives. But it still ends up sorted, so all our upper_bound queries should still work, etc etc. I don't think we assert on anything other than "it isn't empty" and "it is sorted" which should hold regardless.

568

If these were iterators, I would use I and E. We are increasingly commonly using i for an int for loop variable.

I used e but am not super happy about it. More typically we have something like Size which makes i < Size much nicer (IMO). Here, the end point isn't the size and I didn't come up with a good name for it. But the != I think largely comes from iterators and unsigned integers.... < seems much more clear for "normal" signed integers.

709

Yeah, never was super happy w/ this. I'm just exploding it and dealing w/ repeated code.

JDevlieghere edited edge metadata.Dec 8 2017, 6:57 AM

Thanks for working on this Chandler! It took me a while to fully grasp why you needed the parent interval in the second layer but it makes sense to me. I was considering a slightly different approach approach where you would do a depth first traversal and add the address ranges as you encounter them, but that would presumably just complicate getting the LowPC mapped to the most nested (most precise) inlined subroutine.

llvm/include/llvm/DebugInfo/DWARF/DWARFUnit.h
216

s/indexinto/index into/

llvm/lib/DebugInfo/DWARF/DWARFUnit.cpp
378

I'm curious if there is any particular reason you prefer ::push_back({}) over ::emplace_back?

521

Maybe extract (uint64_t)std::numeric_limits<uint32_t>::max() to make this a little more readable?

Not relevant here (as it actually improves readability) but I wonder what the consensus is on C-style casts in LLVM? I don't think the style guide actually mentions it.

chandlerc marked an inline comment as done.Dec 8 2017, 11:53 AM
chandlerc added inline comments.
llvm/lib/DebugInfo/DWARF/DWARFUnit.cpp
378

Unless the code will only compile with emplace_back, I strongly prefer push_back.

  • I find the code substanitally easier to read and failures easier to debug. Forwarding is a horrifically complicated thing in C++ and it tends to not be worth the cost it imposes.
  • We generally insist upon cheap-to-move objects making push_back's "extra" move not an important consideration.
JDevlieghere added inline comments.Dec 11 2017, 3:38 AM
llvm/lib/DebugInfo/DWARF/DWARFUnit.cpp
378

Got it, thanks!

dblaikie accepted this revision.Dec 14 2017, 11:58 AM

I'm pretty OK with this - haven't thought /deeeply/ about the algorithm from the code (more from the discussions we've had), mostly glossed over, looked at the code style, etc, seems plausible.

Wonder if it's worth doing some kind of stress test for this? (take an optimized clang, symbolize all the addresses, compare before/after this change - that would've likely caught the "there can be non-subroutine scopes that need to be recursed through" case I brought up in the first pass of review & maybe some others I've not spotted?)

llvm/lib/DebugInfo/DWARF/DWARFUnit.cpp
496

I know the naming here is subtle/annoying, but maybe "inlined subroutines" rather than "nested subroutines" would make it more clear what we're searching for?

521–527

Worth pulling out a named constant for std::numeric_limits<uint32_t>::max() ? it's a long expression used 4 times in these long/wrapped lines.

571–573

top level const on locals is a bit uncommon (don't have to change it, just mentioning it in case there's some special motivation, or in case the choice is worth revisiting)

This revision is now accepted and ready to land.Dec 14 2017, 11:58 AM
chandlerc marked 3 inline comments as done.Dec 21 2017, 10:04 PM

Thanks all! Going to land this after a touch more testing so that my test runs start being faster. =D

llvm/lib/DebugInfo/DWARF/DWARFUnit.cpp
521–527

Not sure that any name I come up with is better than std::numeric_limits<uint32_t>::max(), but sure. Combined with the suggestion above.

571–573

I added these to catch bugs where I was mutating something that wasn't a reference and I meant to be mutating something that was a reference. ::shrug::

I generally try to be pragmatic about this rather than dogmatic and make things const when useful to avoid mistakes.

This revision was automatically updated to reflect the committed changes.
chandlerc marked an inline comment as done.

Sorry, missed this somehow... I didn't really look at the second layer, but I have a suggestion for the initial tree walk.

llvm/trunk/lib/DebugInfo/DWARF/DWARFUnit.cpp
380 ↗(On Diff #127984)

Would the overall algorithm be faster if it adds the child to the worklist only if the child itself is a subprogram, or has children? Currently this loop adds uninteresting leaf DIEs to the worklist, only to discover they are uninteresting on a later iteration. I'd think keeping the size of the worklist down could be beneficial.

There are DIEs that can have children but do not represent scopes (array_type and enum_type come to mind) or otherwise cannot have subprogram children, and it would be possible to come up with a list of those. But checking for a long list of tag types might get too expensive. With my above suggestion, you still (for example) add enum_type to the worklist, but not the individual enumerator DIEs, and that gets you the bulk of the performance benefit.