Page MenuHomePhabricator

Prime module caches outside of the module list lock for better parallelism
ClosedPublic

Authored by scott.smith on Apr 27 2017, 10:04 AM.

Details

Reviewers
labath
jingham
clayborg
Group Reviewers
Restricted Project
Summary

Loading a shared library can require a large amount of work; rather than do that serially for each library, provide a mechanism to do some amount of priming before hand.

Diff Detail

Repository
rL LLVM

Event Timeline

scott.smith created this revision.Apr 27 2017, 10:04 AM

Here's the controversial patch. It has been brought up that the intent is to not load symbols before they are needed, but at least in my experience this patch has no effect on performance when running:
lldb -b -o run /path/to/myprogram
where myprogram has main(){return 0;} and links in a whole lot of libraries. My platform is Ubuntu 14.04; are there other systems where symbol processing is actually delayed? Maybe on those systems PrimeCaches() can be a no-op to not impact performance?

This patch requires D32597 in order to actually improve performance, but does not require it for correctness.

jingham edited edge metadata.

Adding Greg as a Reviewer.

clayborg edited edge metadata.Apr 27 2017, 11:22 AM

Making an empty main program and saying I see no difference is not enough testing to enable this. I also don't see the benefit of this path. When LLDB is used for symbolication, it might never actually load any debug info or symbols from modules that are added to a target. Also if there is any DWARF that doesn't have the Apple accelerator tables then the DWARF manually indexes itself by spawning threads that will index a compile unit per thread using a thread pool.. So now if we load all symbols at once, using NUM_CORES threads, and then each DWARF file spins off NUM_CORES threads to index the DWARF we can probably bring the machine to the brink? If this feature does go it, we should add a setting that is disabled by default, but if enabled, uses this new functionality. That way we can get it in, ask people to opt into it and later possibly change the default if all goes well. That is my two cents. I still vote for do as little as possible and only do it when needed. But I can see the desire to do things as you coded them. So I think it should be an option that is off by default, but users can set it in their ~/.lldbinit file.

jingham requested changes to this revision.EditedApr 27 2017, 11:28 AM

Instead of having the cache priming be determined by platform or something like that, it would be better to add a setting (on the target level seems right) that controls this. I think you are right that in most common usage, there's going to be an unrestricted query that will end up looking through all the symbols, and in that case, having ingested them in parallel would be a win. So the correct default would be to do this eagerly.

The control is more for people who are writing special purpose tools than for ordinary users. For instance, when Xcode runs its "Test" action, by default it uses lldb to run the test binaries. That great because it means that if a test crashes, you'll be right in the debugger and can analyze the failure. But most of the time the tests never actually stop in the debugger. So reading in a whole raft of symbol information in that use case is pretty pointless. They would be a good candidate to use this setting.

Note, this will make the lazy loading symbol path the less common one, so we'll have to be careful to add some tests to make sure that mode still works.

This revision now requires changes to proceed.Apr 27 2017, 11:28 AM

Making an empty main program and saying I see no difference is not enough testing to enable this.

It's not quite an empty main program; it links in 40+ shared libraries with 2M+ symbols. The point of having main() just return 0 is so I could compare the performance of "b main; run" vs "run" without taking program execution into account, but just the amount of work lldb does.

But as it turns out, I still managed to screw that up. I'm now seeing a difference:

WITHOUT CHANGE:
lldb -b -o 'b main' -o 'run' xyz:
real 0m15.248s
user 0m24.632s
sys 0m3.839s
lldb -b -o 'run' xyz:
real 0m11.540s
user 0m9.692s
sys 0m0.892s

WITH CHANGE:
real 0m9.747s
user 0m29.507s
sys 0m10.531s

(timings are reasonably representative, but they do change a bit run to run)

I also don't see the benefit of this path.

It improves lldb startup time significantly. Our use cases seem centered around using symbols, so for us, delaying loading of symbols does not help, it just moves when we wait to another point.

When LLDB is used for symbolication, it might never actually load any debug info or symbols from modules that are added to a target. Also if there is any DWARF that doesn't have the Apple accelerator tables then the DWARF manually indexes itself by spawning threads that will index a compile unit per thread using a thread pool.. So now if we load all symbols at once, using NUM_CORES threads, and then each DWARF file spins off NUM_CORES threads to index the DWARF we can probably bring the machine to the brink?

No, because it uses a global TaskPool, so regardless of how many users there are, there will only be NUM_CORES threads. Note my comments in D32597 about changing TaskPool so it will be file-local instead of global, so TaskPool workers can spawn other TaskPool work.

Instead of having the cache priming be determined by platform or something like that, it would be better to add a setting (on the target level seems right) that controls this. I think you are right that in most common usage, there's going to be an unrestricted query that will end up looking through all the symbols, and in that case, having ingested them in parallel would be a win. So the correct default would be to do this eagerly.

Define "a setting." A method on class Target? A parameter when constructing class Target? Or something via the command line, or lldbinit?

jingham added a comment.EditedApr 27 2017, 11:57 AM

Ah, sorry. I was talking about an lldb command line setting, which you access through the "setting set/get" command. Those are actually implemented by adding a property on the class in question. Look in Target.cpp for the TargetProperties class. You'll just want to add another property there.

scott.smith edited edge metadata.

Added setting. Verified that:
settings set target.cache-priming off
works as expected.

jingham requested changes to this revision.Apr 27 2017, 12:37 PM

This is picky but can you call it "symbol-cache-priming". The help text explains it, but this is a very specific cache so we should scope it in case we have some other one later on.

And we really need to test that both modes continue to work. If we're always fetching symbols up-front by default, then it will be easy for triggering lazy loading to break. One way to do that would to add another mode of the testsuite (like the dwarf vrs. dSYM vrs. dso vrs. gmodules.) But at some point continually multiplying runs of the whole testsuite is going to make it so time consuming people will stop running all the options.

In this case, it's probably sufficient to add a test that runs to a shared-library scoped breakpoint, and performs some operations that require symbols from other shared-libraries and makes sure they succeed. Then do that with the setting on and off.

This revision now requires changes to proceed.Apr 27 2017, 12:37 PM

Would "preload-symbols" be a bit more clear and concise?

Yes, I like that better too.

scott.smith edited edge metadata.
  1. Rename to preload-symbols / PreloadSymbols()
  2. Modify an existing test to run with and without symbol preloading.

Fix default param to setup_common so that we actually test both paths.

That test is good. That tests the lazy lookup of the dwarf type indices. The other thing this changes is symbol reading. Can you also add a similar test that relies on finding a symbol in a shared library we might not have read in? It probably fine to add it to this test case... It's unlikely we'll break lazy symbol reading and not also break lazy Dwarf Index building, but you never know. Other than that, this looks fine to me.

Add test to print expression (calls func, hence resolves symbols). Also better parameterize the common test to reduce code duplication.

jingham accepted this revision.Apr 27 2017, 2:54 PM

Looks good. Thanks for working on this!

This revision is now accepted and ready to land.Apr 27 2017, 2:54 PM
clayborg accepted this revision.Apr 27 2017, 3:38 PM

Looks good.

Can someone commit this for me? Thanks!

jingham closed this revision.Apr 27 2017, 6:04 PM

Sure. Closed with r301609.