This is an archive of the discontinued LLVM Phabricator instance.

Add ThinLtoJIT example
ClosedPublic

Authored by sgraenitz on Jan 9 2020, 3:39 PM.

Details

Summary

Prototype of a JIT compiler that utilizes ThinLTO summaries to compile modules ahead of time. This is an implementation of the concept I presented in my "ThinLTO Summaries in JIT Compilation" talk at the 2018 Developers' Meeting: http://llvm.org/devmtg/2018-10/talk-abstracts.html#lt8

Upfront the JIT first populates the *combined ThinLTO module index*, which provides fast access to the global call-graph and module paths by function. Next, it loads the main function's module and compiles it. All functions in the module will be emitted with two prolog instructions that *fire a discovery flag* once execution reaches them. In parallel, the *discovery thread* is busy-watching the existing flags. Once it detects one has fired, it uses the module index to find all functions that are reachable from it within a given number of calls and submits their defining modules to the compilation pipeline.

While execution continues, more flags are fired and further modules added. Ideally the JIT can be tuned in a way, so that the code on the execution path can always be compiled ahead of time. In cases where it doesn't work, the JIT has a *definition generator* in place that loads modules if missing functions are reached.

The example that was discussed in the presentation has a good size for debugging. Here's how to build it:

# Build LLVM
$ cd build-release
$ cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS=clang -DLLVM_TARGETS_TO_BUILD=host /path/to/llvm-project/llvm
$ ninja clang llvm-dis lli SpeculativeJIT ThinLtoJIT

# Get the example code
$ git clone https://github.com/weliveindetail/tinyexpr

# Unrepresentative minimal runtime comparison
$ /path/to/llvm-project/llvm/examples/ThinLtoJIT/bench
Usage: bench <path to llvm binaries> <path to c-sources> <main source file> [<override sysroot>]

$ /path/to/llvm-project/llvm/examples/ThinLtoJIT/bench $(pwd)/bin $(pwd)/tinyexpr example.c

Note that on macOS you may want to pass a sysroot override like /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk.

I would like to gather some feedback. Do you think it's fine to have this as another JIT example? I am happy to maintain it and to continue working on it.

Diff Detail

Event Timeline

sgraenitz created this revision.Jan 9 2020, 3:39 PM
Herald added a project: Restricted Project. · View Herald Transcript
sgraenitz updated this revision to Diff 237222.Jan 9 2020, 4:08 PM

Polishing

Sample output:

> ./ThinLtoJIT -debug -debug-only="thinltojit" -compile-threads=2 example.bc tinyexpr.bc compile.bc free.bc eval.bc
1 new flags raised
Nudged 1 new functions into discovery
Instrumented 1 new functions in module example.bc
Generator: Added 1 new modules
Nudged 1 new functions into discovery
DiscoveryThread: 3 new modules (1 known modules)
1 new flags raised
DiscoveryThread: 0 new modules (3 known modules)
1 new flags raised
DiscoveryThread: 0 new modules (4 known modules)
Instrumented 3 new functions in module tinyexpr.bc
1 new flags raised
DiscoveryThread: 0 new modules (3 known modules)
Instrumented 23 new functions in module compile.bc
Generator: Added 0 new modules
11 new flags raised
DiscoveryThread: 0 new modules (3 known modules)
1 new flags raised
DiscoveryThread: 0 new modules (0 known modules)
Instrumented 1 new functions in module eval.bc
1 new flags raised
DiscoveryThread: 0 new modules (1 known modules)
Instrumented 2 new functions in module free.bc
Generator: Added 0 new modules
2 new flags raised
DiscoveryThread: 0 new modules (1 known modules)
2 new flags raised
DiscoveryThread: 0 new modules (0 known modules)
The expression:
	sqrt(5^2+7^2+11^2+(8-2)^2)
evaluates to:
	15.198684
Discovery flags stats
Alloc:  4096
Used:     32
Fired:    21

FlagOwnersMap has 11 remaining entries.

Hi Stefan. Thanks for posting this! I’m out on vacation this week, but looking forward to reviewing when I get back to the office next week.

sgraenitz updated this revision to Diff 237455.Jan 10 2020, 4:12 PM

Pull in new modules also for global variables

sgraenitz updated this revision to Diff 237456.Jan 10 2020, 4:19 PM

Workaround: parsed module requested but waiting in JITDylib::define()

sgraenitz updated this revision to Diff 237457.Jan 10 2020, 4:21 PM

Make DiscoveryFlagsPerBatch and MemFence settings configurable

sgraenitz updated this revision to Diff 237458.Jan 10 2020, 4:22 PM

Polishing, comments, cosmetics

sgraenitz updated this revision to Diff 237459.Jan 10 2020, 4:23 PM

Make it configurable whether the symbol generator can nugde things into discovery

sgraenitz updated this revision to Diff 237460.Jan 10 2020, 4:25 PM

Fix: Protect the set of parsed module IDs from concurrent access

Harbormaster completed remote builds in B43738: Diff 237457.
Harbormaster completed remote builds in B43739: Diff 237458.
Harbormaster completed remote builds in B43740: Diff 237459.
Harbormaster completed remote builds in B43741: Diff 237460.
sgraenitz updated this revision to Diff 237542.Jan 12 2020, 9:15 AM

Use runAsMain() utility function from ExecutionUtils

sgraenitz updated this revision to Diff 237544.Jan 12 2020, 9:18 AM

DataLayout can be owned by MangleWrapper

sgraenitz marked an inline comment as done.Jan 12 2020, 9:45 AM

Hi Stefan. Thanks for posting this! I’m out on vacation this week, but looking forward to reviewing when I get back to the office next week.

Hi Lang, that would be great! Looking forward to your comments.
It does pass the 403.gcc test case now, but can't beat lli orc-lazy per-module yet in terms of real-time performance.
There is one particular thing that I have to work on and would like to hear your recommendations. Please see the inline comment.

llvm/examples/ThinLtoJIT/ThinLtoJIT.cpp
101

I am struggling to find a solution for this: The discovery thread is about to add the required module, but it's waiting for the session lock in JITDylib::define(). As the lock is recursive, I can load it again and add it from here, but that's expensive and I cannot prevent the other one from getting into defineImpl() anymore! This causes a lot of duplicate symbol errors. They are not fatal, but add too much extra overhead.

I am currently working on parallel module parsing which makes this much more likely to happen. The only way I see so far is to change the design and avoid concurrent AddModule() calls. Would be possible, but I'd lose some speculation performance.

Happy about any suggestions. Thanks!

Make LookaheadLevels and LookupOnAdd configurable

sgraenitz marked an inline comment as done.Jan 14 2020, 8:29 AM
sgraenitz added inline comments.
llvm/examples/ThinLtoJIT/ThinLtoJIT.cpp
101

Ok, I should probably not add modules in parallel, but instead issue a lookup for one of their symbols from the DiscoveryThread. The lookup will end up in the definition generator and I can do the AddModule there. What do you think?

I have a first patch for it here, but it has some open todos:
https://reviews.llvm.org/differential/diff/237989/#change-u8RDZezBvAb6

sgraenitz updated this revision to Diff 238368.Jan 15 2020, 2:17 PM

Avoid concurrent addModule() calls, improve ranking algorithm in discovery, add multi-threaded bitcode parsing

sgraenitz updated this revision to Diff 238766.Jan 17 2020, 6:44 AM

Pardon, apparently the last update was missing the actual code changes. Trying again.

Respect environment variables CFLAGS and EXEC_ARGS in the bench script

Submit actual module keys and sort getAllModulePaths() by ID

Fix GUID when instrumenting functions with local linkage

Harbormaster completed remote builds in B44367: Diff 239005.
Harbormaster completed remote builds in B44368: Diff 239006.

Don't dispatch materialization for trivial modules to the thread pool

Sort PathRank candidates by minimal distance

Harbormaster completed remote builds in B44370: Diff 239008.
sgraenitz marked 2 inline comments as done.Jan 19 2020, 12:12 PM

If you want to build this patch, please apply D72301 before. I will try and make it work without it soon.

llvm/examples/ThinLtoJIT/ThinLtoJIT.cpp
259

@lhames: It looks like the session lock is a major bottleneck and not dispatching "trivial" modules to threads seems to help reducing the pressure. Do you think that's reasonable?

Are MaterializationUnit types meant to be distinguishable by type? Would it benefit from isa<>/cast<> support?

llvm/examples/ThinLtoJIT/ThinLtoModuleIndex.cpp
55

@tejohnson In which cases does the summary list for a global value have more than 1 entry? Is it reasonable to always refer to the first one (while this is a work in progress)? I am currently limited to sources in C and never see it happen.

I need to check this out and play around with it a bit to wrap my head around it, especially the performance tuning issues. This is very exciting though!

Heads up: I probably broke this with ce2207abaf9, but you just need to switch to using std::make_unique to construct the IRCompiler (see e.g. fixes in 98e55477558).

llvm/examples/ThinLtoJIT/ThinLtoJIT.cpp
259

Yes! I ran in to the same problem when developing our demo for the 2018 developer’s meeting.

I posted https://reviews.llvm.org/D39111 as a solution to distinguishing MUs by type, but haven’t had time to land it yet. Maybe it’s time to start making some noise on that review again.

lhames added inline comments.Jan 21 2020, 10:41 PM
llvm/examples/ThinLtoJIT/ThinLtoJIT.cpp
259

Side note: The bottle neck here may not be the session lock. I thought that was the culprit for our 2018 dev-meeting demo too, but it was really the thread pool: trivial materializes couldn’t get a thread to run on at all, rather than running but blocking on the session lock. Could that be what’s happening to you too?

Either way: I think I should add some logging to runSessionLocked to track how long threads spend holding the session lock, and how long they spend blocked waiting on it.

tejohnson added inline comments.Jan 22 2020, 6:50 AM
llvm/examples/ThinLtoJIT/ThinLtoModuleIndex.cpp
55

In C you can presumably still get duplicates on symbols declared with attribute((weak)), a GNU extension supported by gcc and clang.

Additionally, you could also get duplicates when there are same-named local symbols in same-named source files in different subdirectories and the user has not compiled with enough distinguishing source path. E.g. assume you have local symbol foo defined in path1/bar.c and path2/bar.c, and the user has done the following:

$ cd path1
$ clang -c -flto=thin bar.c
$ cd ../path2
$ clang -c -flto=thin bar.c

When you link those two bar.o together it will have two matching GUID since the globalized name for locals used to compute the GUID has the relative path (encoded in the module) prepended, and in this case that is both "." essentially. It happens, and we support this typically by looking for one in the same module as the caller:
http://llvm-cs.pcc.me.uk/lib/Transforms/IPO/FunctionImport.cpp#216

or in the current module when doing some handling in the ThinLTO backend, e.g.:
http://llvm-cs.pcc.me.uk/lib/Transforms/Utils/FunctionImportUtils.cpp#62

lhames accepted this revision.Jan 24 2020, 3:45 PM

LGTM! I really like this. I'm all for adding it as another example if you're happy to maintain it.

The code could probably use some extra comments (including pointers to your talk), but I think that would be better dealt with in-tree.

This revision is now accepted and ready to land.Jan 24 2020, 3:45 PM
sgraenitz updated this revision to Diff 240443.Jan 26 2020, 9:08 AM

Adjust for emulated-TLS changes in ce2207abaf9a

sgraenitz updated this revision to Diff 240444.Jan 26 2020, 9:11 AM

Drop requirement to apply D72301 before this patch

Harbormaster completed remote builds in B44954: Diff 240444.
sgraenitz marked 4 inline comments as done.Jan 26 2020, 11:35 AM

LGTM! I really like this. I'm all for adding it as another example if you're happy to maintain it.

The code could probably use some extra comments (including pointers to your talk), but I think that would be better dealt with in-tree.

Thanks! I have a few small patches (polishing and fixing some todos) that I'd like to add next week, and a larger one (use call-through stubs for lazy module loading) that I will probably add in a separate commit. And sure, happy to add more comments step by step as the overall design consolidates.

I need to check this out and play around with it a bit to wrap my head around it, especially the performance tuning issues. This is very exciting though!

With the two last updates the code should build just fine from TOT. The tinyexpr example is a good starting point. It's small and I used it in my presentation. I split it up into multiple files in order to illustrate the principle with per-module lazy compilation (the original version has only two translation units): https://github.com/weliveindetail/tinyexpr

llvm/examples/ThinLtoJIT/ThinLtoJIT.cpp
259

I posted https://reviews.llvm.org/D39111 as a solution to distinguishing MUs by type

I see your point, that it would be convenient for clients to distinguish out-of-tree MU types using the same mechanism. For the moment, however, I could imagine the regular LLVM style isa for built-in MU types to be just fine. In case more and more clients utilize MUs as extension points to ORC, this would serve as a good argument for your open-class-hierarchy proposal (in addition to streamlining the llvm::ErrorInfo mechanism). I will put the former on my todo list and come back to you with my findings in a separate review.

trivial materializes couldn’t get a thread to run on at all, rather than running but blocking on the session lock. Could that be what’s happening to you too?

Possible. I will check that in more detail when going ahead with performance tuning. Thanks for the heads-up!

llvm/examples/ThinLtoJIT/ThinLtoModuleIndex.cpp
55

Ok, thanks for the explanation! I should check that all modules have unique paths upfront when building the index. Then I could turn the debug warning here into an assertion, saying that attribute((weak)) is not yet supported. I think that's fine for a first version and I can improve that once the overall design consolidates. Thanks for the pointers on how this is handled in other cases!

sgraenitz updated this revision to Diff 241678.Jan 31 2020, 3:51 AM
sgraenitz marked 2 inline comments as done.

bench script: link and disassemble global ThinLTO index file and respect LDFLAGS in static compilation

sgraenitz updated this revision to Diff 241679.Jan 31 2020, 3:52 AM

Accept ThinLTO global index as input file

sgraenitz updated this revision to Diff 241680.Jan 31 2020, 3:53 AM

Prefer std::atomic fetch_add over compare-exchange in reserveDiscoveryFlags()

Harbormaster completed remote builds in B45433: Diff 241680.
sgraenitz updated this revision to Diff 241681.Jan 31 2020, 3:54 AM

Don't risk "post-mortem" access violations. Keep DiscoveryThread attached and join it on shutdown.

sgraenitz updated this revision to Diff 241682.Jan 31 2020, 3:55 AM

Drop DiscoveryThread module requests via the definition generator. Instead trigger asynchronous module adding from there.

sgraenitz updated this revision to Diff 241684.Jan 31 2020, 3:56 AM

Let scheduleModuleParsing() process a range of paths with a single lock

sgraenitz updated this revision to Diff 241685.Jan 31 2020, 3:57 AM

The blocking single-module parsing function parseModuleFromFile() should use the regular primitives

Harbormaster completed remote builds in B45436: Diff 241684.
Harbormaster completed remote builds in B45437: Diff 241685.
sgraenitz updated this revision to Diff 241804.Jan 31 2020, 2:10 PM
sgraenitz marked an inline comment as done.

Assume single entries in summary lists after checking module paths to be unique

sgraenitz updated this revision to Diff 241806.Jan 31 2020, 2:15 PM

Split off ThinLtoInstrumentationLayer::dump() from its dtor

Harbormaster completed remote builds in B45481: Diff 241806.
sgraenitz updated this revision to Diff 241810.Jan 31 2020, 2:19 PM

Move ExplicitMemoryBarrier enum to ThinLtoJIT

sgraenitz updated this revision to Diff 241887.Feb 1 2020, 9:54 AM

Add -print-stats command line flag

sgraenitz updated this revision to Diff 241889.Feb 1 2020, 10:41 AM

Refine ThinLtoModuleIndex::selectNextPaths()

sgraenitz updated this revision to Diff 241891.Feb 1 2020, 10:54 AM

Polish initialization, cleanup, formatting

This revision was automatically updated to reflect the committed changes.