This is an archive of the discontinued LLVM Phabricator instance.

Parse compile commands lazily in InterpolatingCompilationDatabase
ClosedPublic

Authored by ilya-biryukov on Aug 27 2018, 9:39 AM.

Details

Summary

This greatly reduces the time to read 'compile_commands.json'.
For Chromium on my machine it's now 0.7 seconds vs 30 seconds before the
change.

Diff Detail

Event Timeline

ilya-biryukov created this revision.Aug 27 2018, 9:39 AM
ilya-biryukov added inline comments.Aug 27 2018, 9:45 AM
lib/Tooling/InterpolatingCompilationDatabase.cpp
440

We can't look at 'Type' at this point anymore, because it needs parsing of TranserableCommands. Not sure what's the best way to replace it. @sammccall, any ideas?

jfb requested changes to this revision.Aug 27 2018, 9:45 AM
jfb added inline comments.
lib/Tooling/InterpolatingCompilationDatabase.cpp
206

It's not clear to me that this entire function is safe to call from multiple threads at the same time. Even if it's safe now, I'm willing to bet it won't always be that way. getTraits should therefore use a magic static or call_once and avoid the headache entirely.

263

The comment should say why accesses need to be atomic. Or better yet, this should only be usable "the right way".

This revision now requires changes to proceed.Aug 27 2018, 9:45 AM
  • Add a comment
  • Use std::call_once to compute the lazy value
ilya-biryukov marked 2 inline comments as done.
  • Remove computeTraits, put the body inside a lambda
ilya-biryukov added inline comments.Aug 27 2018, 10:07 AM
lib/Tooling/InterpolatingCompilationDatabase.cpp
206

Thanks, call_once is both simpler and more reliable.

263

Clarified how it's used.

  • Remove #include <atomic>, it is not used anymore
jfb added inline comments.Aug 27 2018, 10:26 AM
lib/Tooling/InterpolatingCompilationDatabase.cpp
125

This comment about move isn't really saying anything. Also, it's valid but unspecified (in the case of STL things). I'd drop it.

129

The once_flag should just be a static, don't allocate it.

ilya-biryukov added inline comments.Aug 28 2018, 1:14 AM
lib/Tooling/InterpolatingCompilationDatabase.cpp
125

We specifically assert that object cannot be called after move() (check the unique_ptr that stores our once_flag). It's definitely undefined behavior to call any methods, because they will immediately dereference a null pointer (the aforementioned unique_ptr).

Happy to drop the comment, though, we do have asserts for that.

129

Sorry, I don't seem to follow. We need one once_flag per TransferableCommand

Thanks for finding this problem, this fix *mostly* looks good (though I think we can probably drop memoization).

I'm a bit uncomfortable about the places where we need the type, because this is the only thing forcing us to parse before we've picked a command to transfer, and the number of commands we need to parse is data-dependent and hard to reason about.

Let me think about this a little - I suspect slightly more invasive changes (change the concept of type, tweak the heuristics, or do a "lightweight parse" to get the type) might make this cleaner and performance more predictable.

lib/Tooling/InterpolatingCompilationDatabase.cpp
134

I think you're overthinking things with the memoization here (of course I say this as the person who underthought it!)

AIUI, the problem is that *eagerly* parsing all the compile commands takes 3x as long as reading them, which hurts startup time with big compile_commands.json.

But I think we can afford to just parse them when transferTo is called, without memoization. (Remember we only hit this code path when querying a file *not in the CDB*, so it should never get called in a tight loop).

The benefit of slightly reducing the latency of`getCompileCommand` for unknown files when we happen to pick the same template file for the second time... it's unclear to me, and the code would be easier to follow without it.

160

Is this so important to dynamically check? Most types don't.

172

Traits is a bit vague, and a bit template-nightmare-y!
maybe ParsedCommand?

384

I think this is going to force parsing of all candidates that get any points at all, with a flat directory structure this could be quite a lot :-(

384

ah, now I see that the memoization also allows us to pretend that this is an eagerly computed value, without considering exactly when it's computed.

I'm not sure I like this if we do consider it performance sensitive - it obfuscates exactly which set of commands we parse, it would be nice if we were upfront about this.

440

So filtering out this has a couple of effects

  • it's a performance optimization (don't bother indexing filenames for useless files). We don't need this
  • it prevents a TY_INVALID command being chosen for transfer. I'm not sure whether this would occur often enough to be a problem in practice.
sammccall added inline comments.Aug 28 2018, 1:39 AM
lib/Tooling/InterpolatingCompilationDatabase.cpp
125

I think the idea is "this comment just reiterates standard C++ semantics".

(FWIW I find the asserts hurt readability - it's unusual to try to detect this condition, and TraitsComputed is easily mistaken for a boolean)

I'm a bit uncomfortable about the places where we need the type, because this is the only thing forcing us to parse before we've picked a command to transfer, and the number of commands we need to parse is data-dependent and hard to reason about.

Let me think about this a little - I suspect slightly more invasive changes (change the concept of type, tweak the heuristics, or do a "lightweight parse" to get the type) might make this cleaner and performance more predictable.

Having studied the code a bit - we use the parsed type to evaluate candidates, but we're comparing against the type extracted from the query filename (we have nothing else!).
So we should be fine just to compare extensions instead here. So either TransferableCommand would have an Extension field or a TypeFromFilename field - but we should be careful not to conflate the type inferred from the filename (fine for *selecting* a command) with the one parsed from the command (needed to *transfer* the command).

Then we have no need to parse for selection, and getCompileCommands() only needs to parse a single command from the underlying CDB, which should yield very predictable/consistent performance.
(It also couples nicely with the idea of only grabbing the full command from the underlying CDB lazily - we'll only do that once, too).

lib/Tooling/InterpolatingCompilationDatabase.cpp
481–483

as you pointed out offline, we actually only need the filename for indexing, so we could ask the underlying DB for the filenames and get their commands on demand.

(we need to verify that the values returned don't differ from the filenames stored in CompileCommand, e.g. by filename normalization, in a way that matters)

  • Remove mutexes, recompute every time instead
  • Delay creation of TransferableCommand to avoid calling getAllCommands() on JSONCompilationDatabase
ilya-biryukov marked 4 inline comments as done.Aug 28 2018, 3:19 AM
ilya-biryukov added inline comments.
lib/Tooling/InterpolatingCompilationDatabase.cpp
134

Totally agree, memoization does not buy us much here.

ilya-biryukov edited the summary of this revision. (Show Details)Aug 28 2018, 3:21 AM
sammccall accepted this revision.Aug 28 2018, 6:08 AM

Awesome :-)

lib/Tooling/InterpolatingCompilationDatabase.cpp
232

restore this comment?

266

This summary is a bit unclear to me. (too many clauses, maybe too abstract).
And the high level heuristics are hidden a bit below the implementation ideas.

Maybe

Given a filename, FileProximityIndex picks the best matching file from the underlying DB. This is the proxy file whose CompileCommand will be reused.

The heuristics incorporate file name, extension, and directory structure.

Strategy:
...
266

nit: I'd prefer FileIndex or FilenameIndex here - "proximity" emphasizes directory structure over stem/extension, which are pretty important!

384

hmm, I would have thought we'd store the values of guessType() when building the index. I guess it doesn't matter, it just seems surprising to see this call here.

384

you're calling foldType(guessType(...)) on the query, do you need to fold here too?

464

this guessType/foldType call should be folded into Index.chooseProxy now I think - Index explicitly knows that the language it deals with must be derived from the filename.

ilya-biryukov marked 6 inline comments as done.
  • Update the comments
  • Rename the new class to FileIndex
  • Restore an accidentally lost comment
  • Store file types in a parallel array instead of recomputing on each call
  • Use foldType(guessType()) when obtaining lang type from filename
  • Reformat the code
  • Minor spelling fix
sammccall accepted this revision.Aug 28 2018, 7:10 AM
sammccall added inline comments.
lib/Tooling/InterpolatingCompilationDatabase.cpp
274

(this comment is covered in the summary now)

  • Lowercase everything stored in the index.
  • Handle TransferableCommands with TY_INVALID type (never transfer -x flag for those)
  • Add a test with invalid extensions, seen a crash while experimenting
  • Update the test wrt to the new behavior.
  • Sort Paths, they are different from OriginalPaths, i.e. lowercased.
sammccall accepted this revision.Aug 28 2018, 7:55 AM
sammccall added inline comments.
lib/Tooling/InterpolatingCompilationDatabase.cpp
126–127

(or just "never TY_INVALID" which would fit on prev line :-)

127–128

it's always set here, drop the condition.

443

nit: no llvm:: :-)

ilya-biryukov marked 3 inline comments as done.
  • Cleanups
This revision was not accepted when it landed; it landed in state Needs Review.Aug 28 2018, 9:16 AM
This revision was automatically updated to reflect the committed changes.