This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Drop usage of PreambleStatCache in scanPreamble
ClosedPublic

Authored by kadircet on Jun 12 2020, 1:30 AM.

Details

Summary

It was passed to buildCompilerInvocation to speed up stats, but
PreambleStatCache doesn't contain results for the stats performed while
building the compiler invocation, so the cache always missed and the
optimization was unnecessary.

Furthermore, buildCompilerInvocation in scanPreamble doesn't need to
find the gcc installation, include paths and such, as it is only trying
to lex directives. Hence we pass an empty FS to get rid of any
redundant IO.
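
To make the mechanics concrete, here is a minimal sketch of the idea, assuming clangd's ParseInputs carries the FS consumed by buildCompilerInvocation (as in Compiler.h at the time); this is an illustration, not the exact patch:

  #include "llvm/ADT/IntrusiveRefCntPtr.h"
  #include "llvm/Support/VirtualFileSystem.h"

  // Directive scanning only lexes the preamble region, so the driver's
  // probing for GCC installations, sysroots and built-in include paths is
  // irrelevant here. An empty in-memory FS makes every probe fail fast
  // instead of touching the disk. buildInvocationForScan is a hypothetical
  // wrapper around clangd's existing buildCompilerInvocation helper.
  std::unique_ptr<clang::CompilerInvocation>
  buildInvocationForScan(ParseInputs Inputs, clang::DiagnosticConsumer &D) {
    Inputs.FS = new llvm::vfs::InMemoryFileSystem(); // nothing to stat or open
    return buildCompilerInvocation(Inputs, D);
  }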

Diff Detail

Event Timeline

kadircet created this revision. Jun 12 2020, 1:30 AM
Herald added a project: Restricted Project. Jun 12 2020, 1:30 AM

As discussed offline, the cache is always missing today, but we do have reason to believe we're doing a fair amount of IO in buildCompilerInvocation and it should be very cacheable.
So dropping the cache here might be the wrong direction vs fixing it.
I don't know:

  • for sure how much IO is there or what it is
  • whether it's mostly stats and so could be handled by this mechanism
  • exactly what scope we should be filling the cache at (it's really tempting to do it on startup, which is a bigger change)

If the plan is to make the statcache wrap providers instead of/as well as FSes, do we actually need this change? (Then we can avoid having to decide right now)

clang-tools-extra/clangd/Preamble.cpp
215

This is confusing - this class is essentially unused (move to next patch?) and erasing the class with a lambda seems unnecessarily obscure.

248

(This change isn't related to the description, which threw me for a loop - may want to defer it)

Looked at this in more detail. Today we call buildCompilerInvocation in four different places:

  • Once in TUScheduler whenever we receive an update; later on it propagates into buildPreamble and buildAST.
  • Once during signatureHelp/codeCompletion. In theory that one can benefit from the stats we've performed during the update above.
  • Two times during preamblePatching, as we create two different compiler instances. It seems like we don't care about those stats here at all, as they are only used to figure out resource/stdlib directories and other built-in include paths, and preamble patching doesn't care about those.
  • Once for read operations, if the AST was evicted from the cache; again, this can benefit from the IO done while building the invocation during the last update request.

As for the amount of IO done during buildCompilerInvocation, I've run a simple test and here is a summary:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 80.71    0.002330           5       394       374 openat
 10.43    0.000301           4        71        39 stat
  1.49    0.000043           3        12           write
  1.42    0.000041           5         8           lstat
  1.28    0.000037           6         6           getdents64
  1.07    0.000031           1        21           fstat
  1.00    0.000029           1        20           close
  0.87    0.000025           4         6           readlink
  0.55    0.000016           2         6           pread64
  0.38    0.000011          11         1           getcwd
  0.31    0.000009           4         2           ioctl
  0.28    0.000008           8         1         1 lseek
  0.21    0.000006           0        50           mmap
  0.00    0.000000           0        10           read
  0.00    0.000000           0        35        34 access
  0.00    0.000000           0         1           execve
------ ----------- ----------- --------- --------- ----------------
100.00    0.002887                   644       448 total

Looks like caching only stats, especially without caching failures, is not going to be a big win.
Most of the IO is done during directory listing (openat). Even though caching "success" for directory listings might be hard, we can cache only the failures, to mark missing files.
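
A rough sketch of what such failure caching could look like, built on llvm::vfs::ProxyFileSystem (the class name and structure here are illustrative, not an existing clangd type):

  #include "llvm/ADT/SmallString.h"
  #include "llvm/ADT/StringMap.h"
  #include "llvm/Support/VirtualFileSystem.h"
  #include <mutex>
  #include <system_error>

  // Wraps a real FS and replays cached failures for directory enumeration,
  // so the driver's repeated probes of nonexistent GCC install dirs skip the
  // syscall. Only failures are cached: an existing directory may gain or
  // lose entries, but a missing one is assumed to stay missing for the
  // lifetime of the cache. A real implementation would canonicalize paths.
  class DirFailureCachingFS : public llvm::vfs::ProxyFileSystem {
  public:
    using ProxyFileSystem::ProxyFileSystem;

    llvm::vfs::directory_iterator dir_begin(const llvm::Twine &Dir,
                                            std::error_code &EC) override {
      llvm::SmallString<128> Path;
      Dir.toVector(Path);
      {
        std::lock_guard<std::mutex> Lock(Mu);
        auto It = FailedDirs.find(Path);
        if (It != FailedDirs.end()) {
          EC = It->second; // replay the cached failure
          return llvm::vfs::directory_iterator();
        }
      }
      auto Result = ProxyFileSystem::dir_begin(Dir, EC);
      if (EC) {
        std::lock_guard<std::mutex> Lock(Mu);
        FailedDirs.try_emplace(Path, EC);
      }
      return Result;
    }

  private:
    std::mutex Mu;
    llvm::StringMap<std::error_code> FailedDirs;
  };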

So ISTM:

  • buildCompilerInvocation usage inside scanPreamble doesn't need access to any files, so I suggest we just pass an empty FS.
  • we need a different cache for buildCompilerInvocation, one that caches dir_begin() failures.
  • IO done by buildCompilerInvocation depends on compiler args like sysroot, stdlib (and likely many more), hence I believe they should be cached inside TUScheduler and invalidated whenever compile commands change (maybe even more aggressively: a new one on each update, since the underlying FS can change too).

I am planning to just drop FS usage inside scanPreamble for good, and let the new caching mechanism emerge when we need it more :D WDYT?

kadircet added a comment. Edited Jun 15 2020, 3:48 AM

Moreover, the most time-consuming bit (though this might be an artifact of ltrace not counting IO wait times, as openat above was only 5 usecs/call while a tolower call is 56 usecs/call :D) actually seems to be the string comparisons/manipulations done while parsing the command-line args:

% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 48.73    2.244680          56     39747 tolower
 21.88    1.008019          54     18522 strlen
  7.18    0.330929          54      6055 _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE9_M_appendEPKcm
  5.45    0.251138          54      4605 _ZdlPv
  4.69    0.216144          54      3961 bcmp
  4.61    0.212397          54      3890 memcpy
  1.44    0.066164          54      1204 _Znwm

So caching the result of command-line parsing might actually be better than caching IO. This also seems quite easy, as that part of the code is already isolated into a clang::driver::Compilation.
We can cache the result of command-line parsing and invalidate it whenever the compile command changes (this is safe, as the parse is a pure function; well, maybe minus env variables, haven't checked in detail), and later on use this for creating a compiler invocation.
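
An illustrative sketch of that memoization, keyed on the exact argv (ParsedCommandLine, parseCommandLine and CommandLineCache are hypothetical stand-ins, not existing clangd APIs):

  #include "llvm/ADT/StringMap.h"
  #include <memory>
  #include <mutex>
  #include <string>
  #include <vector>

  // Stand-in for whatever parsed driver output would be cached (e.g. the
  // cc1 arg list feeding CompilerInvocation).
  struct ParsedCommandLine;
  std::shared_ptr<const ParsedCommandLine>
  parseCommandLine(const std::vector<std::string> &Argv); // the expensive part

  class CommandLineCache {
  public:
    std::shared_ptr<const ParsedCommandLine>
    getOrParse(const std::vector<std::string> &Argv) {
      // '\0' can't appear inside an argument, so this concatenation is an
      // unambiguous key for the whole command line.
      std::string Key;
      for (const std::string &Arg : Argv) {
        Key += Arg;
        Key.push_back('\0');
      }
      std::lock_guard<std::mutex> Lock(Mu);
      auto &Slot = Cache[Key];
      if (!Slot)
        Slot = parseCommandLine(Argv); // tolower/strlen-heavy work runs once
      return Slot;
    }

  private:
    std::mutex Mu;
    llvm::StringMap<std::shared_ptr<const ParsedCommandLine>> Cache;
  };

Invalidation then amounts to clearing (or keying past) entries whenever the compile command changes, matching the pure-function assumption above.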

Thanks for all this investigation!

80.71 0.002330 5 394 374 openat

I'm curious what the 400 attempts and 20 successes are (I've seen this before but don't remember now). Probably not worth digging into though unless you happen to have the strace logs.

buildCompilerInvocation usage inside scanPreamble doesn't need any access to any files, so I suggest we just pass empty FS

I guess this makes sense. My only worry is the driver getting into a different state if probing or cwd or something fails. But this really shouldn't affect preamble scanning. If it's safe, this seems worth doing just to have more isolation.

we need a different cache for buildCompilerInvocation, one that caches dir_begin() failures

Yeah this is complicated - worthwhile if the IO is actually adding ~20ms. Easiest way to tell if tracing tools aren't helping might be to use an empty FS and ignore all the resulting problems - timing for buildCompilerInvocation should be correct.
If needed, maybe the record/replay FSes used for lldb reproducers are usable? Nice to avoid that complexity if possible though.

48.73 2.244680 56 39747 tolower

How many per call to buildCompilerInvocation? Maybe arg parsing is doing something dumb...

kadircet updated this revision to Diff 271083. Jun 16 2020, 7:08 AM
kadircet marked 2 inline comments as done.
  • Drop VFS usage in scanPreamble completely to avoid IO.

Thanks for all this investigation!

80.71 0.002330 5 394 374 openat

I'm curious what the 400 attempts and 20 successes are (I've seen this before but don't remember now). Probably not worth digging into though unless you happen to have the strace logs.

This is mostly gcc installation scanning for libc and such, biggest call site in https://github.com/llvm/llvm-project/blob/master/clang/lib/Driver/ToolChains/Gnu.cpp#L2408, which is called multiple times from https://github.com/llvm/llvm-project/blob/master/clang/lib/Driver/ToolChains/Gnu.cpp#L1907.

buildCompilerInvocation usage inside scanPreamble doesn't need any access to any files, so I suggest we just pass empty FS

I guess this makes sense. My only worry is the driver getting into a different state if probing or cwd or something fails. But this really shouldn't affect preamble scanning. If it's safe, this seems worth doing just to have more isolation.

Changing this patch to do that instead.

we need a different cache for buildCompilerInvocation, one that caches dir_begin() failures

Yeah this is complicated - worthwhile if the IO is actually adding ~20ms. Easiest way to tell if tracing tools aren't helping might be to use an empty FS and ignore all the resulting problems - timing for buildCompilerInvocation should be correct.
If needed, maybe the record/replay FSes used for lldb reproducers are usable? Nice to avoid that complexity if possible though.

Benchmarked with an empty in-memory FS and the real filesystem, using fallback commands (clang a.cc). There seems to be about a 6x speedup when buildCompilerInvocation is run without IO:
the empty FS takes about 0.17 ms on average, whereas the real file system takes about 1.01 ms.
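
For reference, roughly how such per-call averages can be taken (a generic sketch; the callable would wrap buildCompilerInvocation with either the real FS or an empty in-memory one plugged into the inputs):

  #include <chrono>

  // Average wall time of a callable over many iterations; averaging smooths
  // out cold-cache noise in the first few runs.
  template <typename Fn> double averageMillis(Fn &&F, int Iters = 1000) {
    auto Start = std::chrono::steady_clock::now();
    for (int I = 0; I < Iters; ++I)
      F();
    std::chrono::duration<double, std::milli> Elapsed =
        std::chrono::steady_clock::now() - Start;
    return Elapsed.count() / Iters;
  }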

Changed the compile command to something google3-sized (~400 args):
the empty FS takes about 1.4 ms, whereas the real IO takes about 1.8 ms.

So shaving off some IO might help a lot for trivial command lines, but for complicated commands we need to improve command line parsing or start caching the result.

48.73 2.244680 56 39747 tolower

How many per call to buildCompilerInvocation? Maybe arg parsing is doing something dumb...

this is only a single call to buildCompilerInvocation :(

Just for fun, top 5 library calls in a complicated command line case (~400 args):

% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 50.61    3.815022          54     69820 tolower
 25.35    1.911093          55     34416 strlen
  9.35    0.704789          53     13067 bcmp
  5.09    0.383536          56      6773 _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE9_M_appendEPKcm
  3.69    0.278408          56      4935 _ZdlPv

so calls to tolower/strlen seem to be scaling sub-linearly (the previous command line had only 2 args, so that's about a 200x increase in args, whereas the call counts seem to have only doubled).

kadircet edited the summary of this revision. Jun 16 2020, 7:09 AM
sammccall accepted this revision. Jun 17 2020, 8:09 AM
This revision is now accepted and ready to land. Jun 17 2020, 8:09 AM
This revision was automatically updated to reflect the committed changes.