This is an archive of the discontinued LLVM Phabricator instance.

[ORC] Lock SearchOrder independently from SessionLock to avoid deadlock when resolving relocations.
AbandonedPublic

Authored by sgraenitz on Aug 22 2018, 2:32 PM.

Details

Reviewers
lhames
Summary

Is it necessary to lock the entire session in order to resolve relocations or would it be sufficient to:
(1) Lock the collection of related dylibs.
(2) Visit each dylib individually with a lock for only this one instance.

The change here adds SearchOrderMutex and visitInSearchOrder() to achieve (1).
It does not implement (2), but instead just assumes that the individual dylibs remain unchanged during visitation. I started drafting (2), but it quickly became big and I stopped to avoid wasting time here.

One more note on the current approach:
IIUC ExecutionSessionBase::lookup() either resolves all symbols in all dylibs or returns an error. Wouldn't this mean that a single broken dylib could cause everything to fail?
Is there a fundamental reason for this?
Otherwise, for (2) it would be easy to return both, errors and symbols for each dylib, and degrade errors to warnings if all requested symbols could be resolved.
With your ErrorList implementation we would be perfectly prepared for it.

Diff Detail

Event Timeline

sgraenitz created this revision.Aug 22 2018, 2:32 PM

Is it necessary to lock the entire session in order to resolve relocations or would it be sufficient to:
(1) Lock the collection of related dylibs.
(2) Visit each dylib individually with a lock for only this one instance.

I have tried to keep the locking model as simple as I can. The locking model for a non-blocking lookup is:

(1) lookup creates a Query object.
(2) The session is locked.
(3) All dylibs are scanned in lookup order to find the symbols requested.
(4) For any symbol not fully materialized already, add the Query to the list of queries waiting on that symbol.
(4) For any symbol that has not started materializing, add its materializer to the list of materializers to run once the session is unlocked.
(5) Unlock the session.
(6) Run the materializers.

Steps (3), (4), and (5) are all simple data structure manipulations, so should be fairly quick. Unless we discover performance issues with this scheme I think we should stick with this simple model, as it is relatively easy to describe and reason about.

The change here adds SearchOrderMutex and visitInSearchOrder() to achieve (1).
It does not implement (2), but instead just assumes that the individual dylibs remain unchanged during visitation. I started drafting (2), but it quickly became big and I stopped to avoid wasting time here.

The title mentioned a deadlock. Did you encounter one, or was this a pre-emptive attempt to reduce lock contention?

One more note on the current approach:
IIUC ExecutionSessionBase::lookup() either resolves all symbols in all dylibs or returns an error. Wouldn't this mean that a single broken dylib could cause everything to fail?
Is there a fundamental reason for this?

Just simplicity of the model. I expect that clients will only look up one symbol, the entry point, and call that, so they would not benefit from this. The JIT linker uses multi-symbol lookups when fixing up modules (e.g. if a module references external symbols { "foo", "bar", "baz" }, the JIT linker will issue a query for those three symbols together). Because the JIT linker operates on whole modules either the whole lookup succeeds (in which case linking the module succeeds), or if any symbol resolution fails then linking the whole module fails.

Otherwise, for (2) it would be easy to return both, errors and symbols for each dylib, and degrade errors to warnings if all requested symbols could be resolved.
With your ErrorList implementation we would be perfectly prepared for it.

I have actually been talking to people about removing ErrorList. :)

There is already a SymbolsNotFound error in ORC, which lookup will return if not all symbols can be resolved, and this returns the full set of symbols that could not be resolved. I think custom compound errors of this kind are easier for clients to deal with.

The error approach I have taken so far is that potentially fatal operations should be as transactional as possible. So, for instance, if you issue a query for { "foo", "bar", "baz" } and it turns out that "bar" and "baz" can not be resolved then lookup will return SymbolsNotFound({"bar", "baz"}), and the JIT state will be as if the query had not been executed (i.e. foo will remain uncompiled). That lends itself to a simple recovery scheme: fix environment + retry. Partial successes, if allowed, seem more difficult to recover from: You would need to figure out the difference between what you originally asked for and what you actually got, and then only retry the part that failed. It is early days though, so we may change the scheme as we get more experience with it.

sgraenitz abandoned this revision.Aug 24 2018, 4:15 AM

The title mentioned a deadlock. Did you encounter one, or was this a pre-emptive attempt to reduce lock contention?

Yes first I did, but using the multithreaded version of the SimpleCompiler fixed that. Lock contention was just a side-effect I thought about.

The error approach I have taken so far is that potentially fatal operations should be as transactional as possible.

Very reasonable.

I expect that clients will only look up one symbol, the entry point, and call that, so they would not benefit from this.

Ok it's probably the main use case, but I think the ThinLTOLayer would already do parallel lookup. Will investigate next week.

I have actually been talking to people about removing ErrorList. :)

Nooooooo. Will send you a separate mail for this..