This greatly reduces the time to read 'compile_commands.json'.
For Chromium on my machine it's now 0.7 seconds vs 30 seconds before the
change.
Details
Diff Detail
- Repository
- rC Clang
- Build Status
Buildable 21968 Build 21968: arc lint + arc unit
Event Timeline
lib/Tooling/InterpolatingCompilationDatabase.cpp | ||
---|---|---|
440 | We can't look at 'Type' at this point anymore, because it needs parsing of TranserableCommands. Not sure what's the best way to replace it. @sammccall, any ideas? |
lib/Tooling/InterpolatingCompilationDatabase.cpp | ||
---|---|---|
206 | It's not clear to me that this entire function is safe to call from multiple threads at the same time. Even if it's safe now, I'm willing to bet it won't always be that way. getTraits should therefore use a magic static or call_once and avoid the headache entirely. | |
263 | The comment should say why accesses need to be atomic. Or better yet, this should only be usable "the right way". |
lib/Tooling/InterpolatingCompilationDatabase.cpp | ||
---|---|---|
125 | We specifically assert that object cannot be called after move() (check the unique_ptr that stores our once_flag). It's definitely undefined behavior to call any methods, because they will immediately dereference a null pointer (the aforementioned unique_ptr). Happy to drop the comment, though, we do have asserts for that. | |
129 | Sorry, I don't seem to follow. We need one once_flag per TransferableCommand |
Thanks for finding this problem, this fix *mostly* looks good (though I think we can probably drop memoization).
I'm a bit uncomfortable about the places where we need the type, because this is the only thing forcing us to parse before we've picked a command to transfer, and the number of commands we need to parse is data-dependent and hard to reason about.
Let me think about this a little - I suspect slightly more invasive changes (change the concept of type, tweak the heuristics, or do a "lightweight parse" to get the type) might make this cleaner and performance more predictable.
lib/Tooling/InterpolatingCompilationDatabase.cpp | ||
---|---|---|
134 | I think you're overthinking things with the memoization here (of course I say this as the person who underthought it!) AIUI, the problem is that *eagerly* parsing all the compile commands takes 3x as long as reading them, which hurts startup time with big compile_commands.json. But I think we can afford to just parse them when transferTo is called, without memoization. (Remember we only hit this code path when querying a file *not in the CDB*, so it should never get called in a tight loop). The benefit of slightly reducing the latency of`getCompileCommand` for unknown files when we happen to pick the same template file for the second time... it's unclear to me, and the code would be easier to follow without it. | |
160 | Is this so important to dynamically check? Most types don't. | |
172 | Traits is a bit vague, and a bit template-nightmare-y! | |
384 | I think this is going to force parsing of all candidates that get any points at all, with a flat directory structure this could be quite a lot :-( | |
384 | ah, now I see that the memoization also allows us to pretend that this is an eagerly computed value, without considering exactly when it's computed. I'm not sure I like this if we do consider it performance sensitive - it obfuscates exactly which set of commands we parse, it would be nice if we were upfront about this. | |
440 | So filtering out this has a couple of effects
|
lib/Tooling/InterpolatingCompilationDatabase.cpp | ||
---|---|---|
125 | I think the idea is "this comment just reiterates standard C++ semantics". (FWIW I find the asserts hurt readability - it's unusual to try to detect this condition, and TraitsComputed is easily mistaken for a boolean) |
Having studied the code a bit - we use the parsed type to evaluate candidates, but we're comparing against the type extracted from the query filename (we have nothing else!).
So we should be fine just to compare extensions instead here. So either TransferableCommand would have an Extension field or a TypeFromFilename field - but we should be careful not to conflate the type inferred from the filename (fine for *selecting* a command) with the one parsed from the command (needed to *transfer* the command).
Then we have no need to parse for selection, and getCompileCommands() only needs to parse a single command from the underlying CDB, which should yield very predictable/consistent performance.
(It also couples nicely with the idea of only grabbing the full command from the underlying CDB lazily - we'll only do that once, too).
lib/Tooling/InterpolatingCompilationDatabase.cpp | ||
---|---|---|
481–483 | as you pointed out offline, we actually only need the filename for indexing, so we could ask the underlying DB for the filenames and get their commands on demand. (we need to verify that the values returned don't differ from the filenames stored in CompileCommand, e.g. by filename normalization, in a way that matters) |
- Remove mutexes, recompute every time instead
- Delay creation of TransferableCommand to avoid calling getAllCommands() on JSONCompilationDatabase
lib/Tooling/InterpolatingCompilationDatabase.cpp | ||
---|---|---|
134 | Totally agree, memoization does not buy us much here. |
Awesome :-)
lib/Tooling/InterpolatingCompilationDatabase.cpp | ||
---|---|---|
232 | restore this comment? | |
266 | This summary is a bit unclear to me. (too many clauses, maybe too abstract). Maybe Given a filename, FileProximityIndex picks the best matching file from the underlying DB. This is the proxy file whose CompileCommand will be reused. The heuristics incorporate file name, extension, and directory structure. Strategy: ... | |
266 | nit: I'd prefer FileIndex or FilenameIndex here - "proximity" emphasizes directory structure over stem/extension, which are pretty important! | |
384 | hmm, I would have thought we'd store the values of guessType() when building the index. I guess it doesn't matter, it just seems surprising to see this call here. | |
384 | you're calling foldType(guessType(...)) on the query, do you need to fold here too? | |
464 | this guessType/foldType call should be folded into Index.chooseProxy now I think - Index explicitly knows that the language it deals with must be derived from the filename. |
- Update the comments
- Rename the new class to FileIndex
- Restore an accidentally lost comment
- Store file types in a parallel array instead of recomputing on each call
- Use foldType(guessType()) when obtaining lang type from filename
lib/Tooling/InterpolatingCompilationDatabase.cpp | ||
---|---|---|
274 | (this comment is covered in the summary now) |
- Handle TransferableCommands with TY_INVALID type (never transfer -x flag for those)
- Add a test with invalid extensions, seen a crash while experimenting
- Update the test wrt to the new behavior.
This comment about move isn't really saying anything. Also, it's valid but unspecified (in the case of STL things). I'd drop it.