Page MenuHomePhabricator

Add ThinLtoJIT example
ClosedPublic

Authored by sgraenitz on Jan 9 2020, 3:39 PM.

Details

Summary

Prototype of a JIT compiler that utilizes ThinLTO summaries to compile modules ahead of time. This is an implementation of the concept I presented in my "ThinLTO Summaries in JIT Compilation" talk at the 2018 Developers' Meeting: http://llvm.org/devmtg/2018-10/talk-abstracts.html#lt8

Upfront the JIT first populates the *combined ThinLTO module index*, which provides fast access to the global call-graph and module paths by function. Next, it loads the main function's module and compiles it. All functions in the module will be emitted with two prolog instructions that *fire a discovery flag* once execution reaches them. In parallel, the *discovery thread* is busy-watching the existing flags. Once it detects one has fired, it uses the module index to find all functions that are reachable from it within a given number of calls and submits their defining modules to the compilation pipeline.

While execution continues, more flags are fired and further modules added. Ideally the JIT can be tuned in a way, so that the code on the execution path can always be compiled ahead of time. In cases where it doesn't work, the JIT has a *definition generator* in place that loads modules if missing functions are reached.

The example that was discussed in the presentation has a good size for debugging. Here's how to build it:

# Build LLVM
$ cd build-release
$ cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS=clang -DLLVM_TARGETS_TO_BUILD=host /path/to/llvm-project/llvm
$ ninja clang llvm-dis lli SpeculativeJIT ThinLtoJIT

# Get the example code
$ git clone https://github.com/weliveindetail/tinyexpr

# Unrepresentative minimal runtime comparison
$ /path/to/llvm-project/llvm/examples/ThinLtoJIT/bench
Usage: bench <path to llvm binaries> <path to c-sources> <main source file> [<override sysroot>]

$ /path/to/llvm-project/llvm/examples/ThinLtoJIT/bench $(pwd)/bin $(pwd)/tinyexpr example.c

Note that on macOS you may want to pass a sysroot override like /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk.

I would like to gather some feedback. Do you think it's fine to have this as another JIT example? I am happy to maintain it and to continue working on it.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

DataLayout can be owned by MangleWrapper

sgraenitz marked an inline comment as done.Jan 12 2020, 9:45 AM

Hi Stefan. Thanks for posting this! I’m out on vacation this week, but looking forward to reviewing when I get back to the office next week.

Hi Lang, that would be great! Looking forward to your comments.
It does pass the 403.gcc test case now, but can't beat lli orc-lazy per-module yet in terms of real-time performance.
There is one particular thing that I have to work on and would like to hear your recommendations. Please see the inline comment.

llvm/examples/ThinLtoJIT/ThinLtoJIT.cpp
101

I am struggling to find a solution for this: The discovery thread is about to add the required module, but it's waiting for the session lock in JITDylib::define(). As the lock is recursive, I can load it again and add it from here, but that's expensive and I cannot prevent the other one from getting into defineImpl() anymore! This causes a lot of duplicate symbol errors. They are not fatal, but add too much extra overhead.

I am currently working on parallel module parsing which makes this much more likely to happen. The only way I see so far is to change the design and avoid concurrent AddModule() calls. Would be possible, but I'd lose some speculation performance.

Happy about any suggestions. Thanks!

Make LookaheadLevels and LookupOnAdd configurable

sgraenitz marked an inline comment as done.Jan 14 2020, 8:29 AM
sgraenitz added inline comments.
llvm/examples/ThinLtoJIT/ThinLtoJIT.cpp
101

Ok, I should probably not add modules in parallel, but instead issue a lookup for one of their symbols from the DiscoveryThread. The lookup will end up in the definition generator and I can do the AddModule there. What do you think?

I have a first patch for it here, but it has some open todos:
https://reviews.llvm.org/differential/diff/237989/#change-u8RDZezBvAb6

sgraenitz updated this revision to Diff 238368.Jan 15 2020, 2:17 PM

Avoid concurrent addModule() calls, improve ranking algorithm in discovery, add multi-threaded bitcode parsing

sgraenitz updated this revision to Diff 238766.Jan 17 2020, 6:44 AM

Pardon, apparently the last update was missing the actual code changes. Trying again.

Respect environment variables CFLAGS and EXEC_ARGS in the bench script

Submit actual module keys and sort getAllModulePaths() by ID

Fix GUID when instrumenting functions with local linkage

Harbormaster completed remote builds in B44367: Diff 239005.
Harbormaster completed remote builds in B44368: Diff 239006.

Don't dispatch materialization for trivial modules to the thread pool

Sort PathRank candidates by minimal distance

Harbormaster completed remote builds in B44370: Diff 239008.
sgraenitz marked 2 inline comments as done.Sun, Jan 19, 12:12 PM

If you want to build this patch, please apply D72301 before. I will try and make it work without it soon.

llvm/examples/ThinLtoJIT/ThinLtoJIT.cpp
259

@lhames: It looks like the session lock is a major bottleneck and not dispatching "trivial" modules to threads seems to help reducing the pressure. Do you think that's reasonable?

Are MaterializationUnit types meant to be distinguishable by type? Would it benefit from isa<>/cast<> support?

llvm/examples/ThinLtoJIT/ThinLtoModuleIndex.cpp
55

@tejohnson In which cases does the summary list for a global value have more than 1 entry? Is it reasonable to always refer to the first one (while this is a work in progress)? I am currently limited to sources in C and never see it happen.

I need to check this out and play around with it a bit to wrap my head around it, especially the performance tuning issues. This is very exciting though!

Heads up: I probably broke this with ce2207abaf9, but you just need to switch to using std::make_unique to construct the IRCompiler (see e.g. fixes in 98e55477558).

llvm/examples/ThinLtoJIT/ThinLtoJIT.cpp
259

Yes! I ran in to the same problem when developing our demo for the 2018 developer’s meeting.

I posted https://reviews.llvm.org/D39111 as a solution to distinguishing MUs by type, but haven’t had time to land it yet. Maybe it’s time to start making some noise on that review again.

lhames added inline comments.Tue, Jan 21, 10:41 PM
llvm/examples/ThinLtoJIT/ThinLtoJIT.cpp
259

Side note: The bottle neck here may not be the session lock. I thought that was the culprit for our 2018 dev-meeting demo too, but it was really the thread pool: trivial materializes couldn’t get a thread to run on at all, rather than running but blocking on the session lock. Could that be what’s happening to you too?

Either way: I think I should add some logging to runSessionLocked to track how long threads spend holding the session lock, and how long they spend blocked waiting on it.

tejohnson added inline comments.Wed, Jan 22, 6:50 AM
llvm/examples/ThinLtoJIT/ThinLtoModuleIndex.cpp
55

In C you can presumably still get duplicates on symbols declared with attribute((weak)), a GNU extension supported by gcc and clang.

Additionally, you could also get duplicates when there are same-named local symbols in same-named source files in different subdirectories and the user has not compiled with enough distinguishing source path. E.g. assume you have local symbol foo defined in path1/bar.c and path2/bar.c, and the user has done the following:

$ cd path1
$ clang -c -flto=thin bar.c
$ cd ../path2
$ clang -c -flto=thin bar.c

When you link those two bar.o together it will have two matching GUID since the globalized name for locals used to compute the GUID has the relative path (encoded in the module) prepended, and in this case that is both "." essentially. It happens, and we support this typically by looking for one in the same module as the caller:
http://llvm-cs.pcc.me.uk/lib/Transforms/IPO/FunctionImport.cpp#216

or in the current module when doing some handling in the ThinLTO backend, e.g.:
http://llvm-cs.pcc.me.uk/lib/Transforms/Utils/FunctionImportUtils.cpp#62

lhames accepted this revision.Fri, Jan 24, 3:45 PM

LGTM! I really like this. I'm all for adding it as another example if you're happy to maintain it.

The code could probably use some extra comments (including pointers to your talk), but I think that would be better dealt with in-tree.

This revision is now accepted and ready to land.Fri, Jan 24, 3:45 PM
sgraenitz updated this revision to Diff 240443.Sun, Jan 26, 9:08 AM

Adjust for emulated-TLS changes in ce2207abaf9a

sgraenitz updated this revision to Diff 240444.Sun, Jan 26, 9:11 AM

Drop requirement to apply D72301 before this patch

Harbormaster completed remote builds in B44954: Diff 240444.
sgraenitz marked 4 inline comments as done.Sun, Jan 26, 11:35 AM

LGTM! I really like this. I'm all for adding it as another example if you're happy to maintain it.

The code could probably use some extra comments (including pointers to your talk), but I think that would be better dealt with in-tree.

Thanks! I have a few small patches (polishing and fixing some todos) that I'd like to add next week, and a larger one (use call-through stubs for lazy module loading) that I will probably add in a separate commit. And sure, happy to add more comments step by step as the overall design consolidates.

I need to check this out and play around with it a bit to wrap my head around it, especially the performance tuning issues. This is very exciting though!

With the two last updates the code should build just fine from TOT. The tinyexpr example is a good starting point. It's small and I used it in my presentation. I split it up into multiple files in order to illustrate the principle with per-module lazy compilation (the original version has only two translation units): https://github.com/weliveindetail/tinyexpr

llvm/examples/ThinLtoJIT/ThinLtoJIT.cpp
259

I posted https://reviews.llvm.org/D39111 as a solution to distinguishing MUs by type

I see your point, that it would be convenient for clients to distinguish out-of-tree MU types using the same mechanism. For the moment, however, I could imagine the regular LLVM style isa for built-in MU types to be just fine. In case more and more clients utilize MUs as extension points to ORC, this would serve as a good argument for your open-class-hierarchy proposal (in addition to streamlining the llvm::ErrorInfo mechanism). I will put the former on my todo list and come back to you with my findings in a separate review.

trivial materializes couldn’t get a thread to run on at all, rather than running but blocking on the session lock. Could that be what’s happening to you too?

Possible. I will check that in more detail when going ahead with performance tuning. Thanks for the heads-up!

llvm/examples/ThinLtoJIT/ThinLtoModuleIndex.cpp
55

Ok, thanks for the explanation! I should check that all modules have unique paths upfront when building the index. Then I could turn the debug warning here into an assertion, saying that attribute((weak)) is not yet supported. I think that's fine for a first version and I can improve that once the overall design consolidates. Thanks for the pointers on how this is handled in other cases!

sgraenitz updated this revision to Diff 241678.Fri, Jan 31, 3:51 AM
sgraenitz marked 2 inline comments as done.

bench script: link and disassemble global ThinLTO index file and respect LDFLAGS in static compilation

sgraenitz updated this revision to Diff 241679.Fri, Jan 31, 3:52 AM

Accept ThinLTO global index as input file

sgraenitz updated this revision to Diff 241680.Fri, Jan 31, 3:53 AM

Prefer std::atomic fetch_add over compare-exchange in reserveDiscoveryFlags()

Harbormaster completed remote builds in B45433: Diff 241680.
sgraenitz updated this revision to Diff 241681.Fri, Jan 31, 3:54 AM

Don't risk "post-mortem" access violations. Keep DiscoveryThread attached and join it on shutdown.

sgraenitz updated this revision to Diff 241682.Fri, Jan 31, 3:55 AM

Drop DiscoveryThread module requests via the definition generator. Instead trigger asynchronous module adding from there.

sgraenitz updated this revision to Diff 241684.Fri, Jan 31, 3:56 AM

Let scheduleModuleParsing() process a range of paths with a single lock

sgraenitz updated this revision to Diff 241685.Fri, Jan 31, 3:57 AM

The blocking single-module parsing function parseModuleFromFile() should use the regular primitives

Harbormaster completed remote builds in B45436: Diff 241684.
Harbormaster completed remote builds in B45437: Diff 241685.
sgraenitz updated this revision to Diff 241804.Fri, Jan 31, 2:10 PM
sgraenitz marked an inline comment as done.

Assume single entries in summary lists after checking module paths to be unique

sgraenitz updated this revision to Diff 241806.Fri, Jan 31, 2:15 PM

Split off ThinLtoInstrumentationLayer::dump() from its dtor

Harbormaster completed remote builds in B45481: Diff 241806.
sgraenitz updated this revision to Diff 241810.Fri, Jan 31, 2:19 PM

Move ExplicitMemoryBarrier enum to ThinLtoJIT

sgraenitz updated this revision to Diff 241887.Sat, Feb 1, 9:54 AM

Add -print-stats command line flag

sgraenitz updated this revision to Diff 241889.Sat, Feb 1, 10:41 AM

Refine ThinLtoModuleIndex::selectNextPaths()

sgraenitz updated this revision to Diff 241891.Sat, Feb 1, 10:54 AM

Polish initialization, cleanup, formatting

This revision was automatically updated to reflect the committed changes.