This is an archive of the discontinued LLVM Phabricator instance.

[opt] Replace the recursive walk for GC with a worklist algorithm.
ClosedPublic

Authored by chandlerc on Jun 26 2015, 9:02 PM.

Details

Summary

This flattens the entire liveness walk from a recursive mark approach to
a worklist approach. It also sinks the worklist management completely
out of the SectionChunk and into the Writer by exposing the ability to
iterato over children of a chunk and over the symbol bodies of relocated
symbols. I'm not 100% happy with the API names, so suggestions welcome
there.

This allows us to use a single worklist for the entire recursive walk
and would also be a natural place to take advantage of parallelism at
some future point.

With this, we completely inline away the GC walk into the
Writer::markLive function and it makes it very easy to profile what is
slow. Currently, time is being wasted checking whether a Chunk isa
SectionChunk (it essentially always is), finding (or skipping)
a replacement for a symbol, and chasing pointers between symbols and
their chunks. There are a bunch of things we can do to fix this, and its
easier to do them after this change IMO.

This change alone saves 1-2% of the time for my self-link of lld.exe
(which I'm running and benchmarking on Linux ironically).

Perhaps more notably, we'll no longer blow out the stack for large
links. =]

Just as an FYI, at this point, I/O is starting to really dominate the
profile. Well over 10% of the time appears to be inside the kernel doing
page table silliness. I think a decent chunk of this can be nuked as
well, but it's a little odd as cross-linking in this way isn't really
the primary goal here.

Depends on D10789

Diff Detail

Repository
rL LLVM

Event Timeline

chandlerc updated this revision to Diff 28624.Jun 26 2015, 9:02 PM
chandlerc retitled this revision from to [opt] Replace the recursive walk for GC with a worklist algorithm..
chandlerc updated this object.
chandlerc edited the test plan for this revision. (Show Details)
chandlerc added a reviewer: ruiu.
chandlerc added a subscriber: Unknown Object (MLST).
majnemer added inline comments.
COFF/Writer.cpp
118 ↗(On Diff #28624)

Realistically, won't we always blow out this SmallVector?

ruiu added inline comments.Jun 27 2015, 12:05 PM
COFF/Chunks.h
13 ↗(On Diff #28624)

This makes this header and InputFiles.h mutually dependent. Please update InputFiles.h to remove inclusion of this file and add a forward declaration for the class Chunk.

141 ↗(On Diff #28624)

symbol_iterator -> SymbolIterator?

151–156 ↗(On Diff #28624)

Move this at beginning of the class definition. Also write operator*() in one line if it fits 80 cols.

COFF/Writer.cpp
118 ↗(On Diff #28624)

Yeah, this can be much longer than 16. Is SmallVector faster than std::vector for non-small vectors? If not, can we use std::vector instead?

135 ↗(On Diff #28624)

You don't need to call markLive() when you add a new chunk to the Worklist. Instead you can call that function here.

146–147 ↗(On Diff #28624)

Associative sections are always SectionChunk. You want to change the type to eliminate this dyn_cast.

silvas added a subscriber: silvas.Jun 29 2015, 11:54 AM
silvas added inline comments.
COFF/Writer.cpp
118 ↗(On Diff #28624)

Yes. For large POD arrays it can use realloc tricks (e.g. swapping page tables instead of actually copying).

chandlerc updated this revision to Diff 28699.Jun 29 2015, 12:00 PM

Update fixing several reviewer comments (and rebasing on Rui's commits).

chandlerc added inline comments.Jun 29 2015, 12:01 PM
COFF/Chunks.h
13 ↗(On Diff #28624)

Yea, the prior change fixed that.

141 ↗(On Diff #28624)

This is a common convention in LLVM to name iterators based on the STL naming conventions. I'd really like to avoid deviating from that here.

151–156 ↗(On Diff #28624)

Moved, but whether this fits on one line is up to clang-format. =] It formatted it this way, and I don't want to change it.

COFF/Writer.cpp
118 ↗(On Diff #28624)

SmallVector is much faster than std::vector in my experience.

Also, we can avoid a bunch of early allocations by starting this fairly high. I've raised the number on the stack to 256 so that we get the early churn out of the way without malloc. I could be convinced it should be closer to 4k if you'd like to wait until we're around a page size before we start hitting the system allocator.

135 ↗(On Diff #28624)

If we fail to call markLive here, then we will add the same SectionChunk to the worklist *many* times. We're essentually using the bool 'Live' state to ensure we only add a chunk to the worklist once.

146–147 ↗(On Diff #28624)

Done (reflecting your change of the underlying container type).

ruiu accepted this revision.Jun 29 2015, 1:34 PM
ruiu edited edge metadata.

LGTM. I'm planning to make a change to the resolver which would conflict with this change, so I'd like to see this being landed before starting creating my patch.

COFF/Symbols.h
130 ↗(On Diff #28699)

I still want you to rename this class for consistency.

COFF/Writer.cpp
135 ↗(On Diff #28699)

So you can write SC->markLive() here only once, no? Then you can remove markLive() from all the other places in this function.

This revision is now accepted and ready to land.Jun 29 2015, 1:34 PM
chandlerc updated this revision to Diff 28705.Jun 29 2015, 1:42 PM
chandlerc edited edge metadata.

Create the correct patch for this, sorry for the prior patch being garbage.

I'll confirm with you that the worklist logic below is OK, and then land. Thanks!

COFF/Symbols.h
130 ↗(On Diff #28699)

Sorry, this should not have been part of this patch. It'll be in D10792, and I'll update that review thread shortly.

COFF/Writer.cpp
135 ↗(On Diff #28699)

No, that is a different algorithm.

While we look for the symbols referenced by this section, we may see a symbol more than once. In that case, before we put it into the worklist, we first test that it is not-yet-live, mark it live, and then add it to the worklist. If I make the change you are suggesting, then we would potentially add the same not-yet-live symbol to the worklist many times.

This would either require adding a set to the worklist, or make the worklist grow much larger and require an early exit from the processing loop. Neither seem good. Given that we have a bit to track the liveness embedded in the chunk, it seems much better to use that to add them to the worklist exactly once.

This revision was automatically updated to reflect the committed changes.