This is an archive of the discontinued LLVM Phabricator instance.

[lldb/ObjectFileMachO] Fetch shared cache images from our own shared cache
ClosedPublic

Authored by friss on Jul 1 2020, 10:15 PM.

Details

Summary

On macOS 11, the libraries that have been integrated in the system
shared cache are not present on the filesystem anymore. LLDB was
using those files to get access to the symbols of those libraries.
LLDB can get the images from the target process memory though.

This has 2 consequences:

  • LLDB cannot load the images before the process starts, reporting an error if someone tries to break on a system symbol.
  • Loading the symbols by downloading the data from the inferior is super slow. It takes tens of seconds at the start of the debug session to populate the Module list.

To fix this, we can use the library images LLDB has in its own
mapping of the shared cache. To do this patch extends ModuleSpec
to be able to store a DataBuffer for the Module that the MacOS
platform will provide by querying a new HostInfo utility which
describes the contents of the shared cache.

This patch fixes a number of test failures on macOS 11 due to the
first problem described above.

Diff Detail

Event Timeline

friss created this revision.Jul 1 2020, 10:15 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJul 1 2020, 10:15 PM
friss added a comment.Jul 1 2020, 10:18 PM

@labath I tagged you mostly for the generic parts of the patch. I don't suppose you care a lot about ObjectFileMachO.cpp

labath added a comment.Jul 2 2020, 8:09 AM

I think this is a very interesting feature (lldb being able to load modules from memory; the mac shared cache thingy is interesting too, but in a different way). We have a feature request from people who are downloading modules from a network (from a proprietary symbol server, etc.) and would like to pass them to lldb without having to serialize them to disk. This would be a step towards making that happen. It could also be useful for our own unit tests which now have to do a similar thing.

However, I think this could use some clean up. There's a lot of juggling of data/file/object offsets going on, and it all seems inconsistent and avoidable to me. Please see inline comments for details.

The patch is also quite light on testing. If done right, I believe this should make it possible to yaml2obj a file into memory in a unit test and then create a Module from that. That would enable us to test the Module(Spec) changes in isolation, and move them to a separate patch.

I don't suppose you care a lot about ObjectFileMachO.cpp

I care about ObjectFileMachO to the extent that I need to occasionally touch it when working on generic interfaces. And I gotta say that changing anything in there is pretty hard right now... :/

lldb/include/lldb/Host/macosx/HostInfoMacOSX.h
21–42

The way we've done this elsewhere is to add the interface to the base class with a default stubbed-out implementation. That way, you don't have to put #ifdef __APPLE__ into all of the code which tries to use this. HostInfo::GetXcodeSDKPath is the latest example of that.

37

I think this could just be an ArrayRef<uint8_t>, or a void* or something, and then you could create an appropriately sized DataBufferHostMemory (or whatever ends up called) when working with a specific module.

lldb/include/lldb/Utility/DataBuffer.h
82 ↗(On Diff #275004)

All of our other DataBuffers also point to host memory (how could they not do that?). I guess what really makes this one special is that it does not own the storage that it points to...

lldb/source/Core/Module.cpp
154–159 ↗(On Diff #275004)

I think this overloads the meaning of module_spec.GetObjectOffset() in a fairly confusing way. Normally, I believe ModuleSpec::GetObject{Name,Offset,Size} is used to refer to the name/offse/size of a object file located in an archive (.a). However, here it seems you are reusing it for something else. It seems unfortunate that the meaning of this field should change depending on the "data" field being present.

What exactly is the purpose of this field? Could we avoid this by just creating an appropriately-sized DataBuffer ?

1265–1272 ↗(On Diff #275004)

With an appropriately sized data_sp, I'm hoping most if this could go away...

lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
934–947

Wouldn't this be better handled by adjusting the file offsets during Section creation?

lldb/source/Plugins/Platform/MacOSX/PlatformDarwin.cpp
42–44

Leftovers from an earlier implementation?

lldb/source/Symbol/ObjectFile.cpp
217 ↗(On Diff #275004)

this data/file_offset business would be nice to get rid of too...

lldb/unittests/Host/HostInfoTest.cpp
74 ↗(On Diff #275004)

EXPECT_LT

llvm/include/llvm/BinaryFormat/MachO.h
86 ↗(On Diff #275004)

Maybe just commit this, and all the 0x80000000u replacements straight away.

friss marked 4 inline comments as done.Jul 2 2020, 9:17 AM

I think this is a very interesting feature (lldb being able to load modules from memory; the mac shared cache thingy is interesting too, but in a different way). We have a feature request from people who are downloading modules from a network (from a proprietary symbol server, etc.) and would like to pass them to lldb without having to serialize them to disk. This would be a step towards making that happen. It could also be useful for our own unit tests which now have to do a similar thing.

However, I think this could use some clean up. There's a lot of juggling of data/file/object offsets going on, and it all seems inconsistent and avoidable to me. Please see inline comments for details.

I'll see what can be done. My main goal while working on this was to avoid changing the semantics outside of the shared cache usecase. I understand fairly well the codepath that I added and then just moved some other bits around to keep the existing semantics for the rest. Happy to rework this.

The patch is also quite light on testing. If done right, I believe this should make it possible to yaml2obj a file into memory in a unit test and then create a Module from that. That would enable us to test the Module(Spec) changes in isolation, and move them to a separate patch.

I agree about the light testing. FWIW, this got significant living-on testing on Apple platforms, but some more targeted testing would be great.

I don't suppose you care a lot about ObjectFileMachO.cpp

I care about ObjectFileMachO to the extent that I need to occasionally touch it when working on generic interfaces. And I gotta say that changing anything in there is pretty hard right now... :/

Tell me about it...

I'll do a couple experiments in response to your comments. Thanks for the review!

lldb/include/lldb/Host/macosx/HostInfoMacOSX.h
21–42

Yeah. When I wrote this patch, GetXcodeSDKPath didn't exist and I did it this way to avoid putting a very non-generic API at the top-level. As we're fine with this, it's easy to change.

lldb/include/lldb/Utility/DataBuffer.h
82 ↗(On Diff #275004)

Good point. What about DataBufferUnowned ? (and adding a comment to explain what it's used for)

lldb/source/Core/Module.cpp
154–159 ↗(On Diff #275004)

So the shared cache is some kind of a container in some ways similar to a .a, file, in some ways very different. The main difference is that the contained dylibs are not each in their own subpart of the file. For example, the __LINKEDIT segment is shared by all of them. The load commands that describe the images are relative to the shared cache itself, not to one specific image.

So I reused the object_offset field in its "offset from the beginning of a container sense".

I think I tried having the SharedCacheInfo return only the subpart of the cache that contains the image, and ran into issues with this. But I don't remember exactly what, and I have a much better understanding now, so I should try it again.

lldb/source/Plugins/Platform/MacOSX/PlatformDarwin.cpp
42–44

yep, good catch

I think this is a very interesting feature (lldb being able to load modules from memory; the mac shared cache thingy is interesting too, but in a different way). We have a feature request from people who are downloading modules from a network (from a proprietary symbol server, etc.) and would like to pass them to lldb without having to serialize them to disk. This would be a step towards making that happen. It could also be useful for our own unit tests which now have to do a similar thing.

However, I think this could use some clean up. There's a lot of juggling of data/file/object offsets going on, and it all seems inconsistent and avoidable to me. Please see inline comments for details.

I'll see what can be done. My main goal while working on this was to avoid changing the semantics outside of the shared cache usecase. I understand fairly well the codepath that I added and then just moved some other bits around to keep the existing semantics for the rest. Happy to rework this.

So, if an object file needs to access some data which is outside of "its" image then my idea about using sliced data buffers will probably not work. It that case, using the "object offset" field to communicate the location of the "object" might not be a bad idea (it's still different than the use in .a files, but maybe we can stretch that definition). The part that bugs me then is having this functionality key off of the "data" field being set. Ideally, these would be two orthogonal features:

  • the "data" would control whether you read the file system to obtain the object contents
  • the "object offset" would tell you where to locate the desired object inside these "jumbo" objects

I think that might be doable by adding one more argument to the ObjectFile::GetModuleSpecifications interface, which says "this buffer contains a lot of stuff, but the object I am really interested in is at <offset>". Then ObjectFileMachO could do what it needs to read the object from that offset, but it would still have access to the entire buffer to read the __LINKEDIT thingy.

Or.... we could try to do something completely different, and avoid touching the common interfaces altogether. The way I see it, this shared cache thing is unlikely to be reusable for anything else, and the reason we're adding these interfaces is so that we can communicate information from DynamicLoaderDarwin to ObjectFileMachO through Target::GetOrCreateModule. But that is silly, because DynamicLoaderDarwin already knows that the request is going to go to ObjectFileMachO (and indeed things would break if it doesn't). If there was a way for these two to communicate directly, then all of this would be unneeded, because DynamicLoaderDarwin could use a ObjectFileMachO interface, which would be specially crafted for this purpose.

It also seems to me we are only really interested in the "Get" part of Target::GetOrCreateModule -- i.e., ensuring we don't create multiple module instances for the same object. The rest of the function deals with various ways to try to find a module, but that's the thing we actually want to avoid, as we already know the data buffer that contains it. And all of the checks about matching uuids/architecures/etc. don't seem useful either as DynamicLoaderDarwin is repeating most of these anyway.

We actually already have an interface that almost does this: Module::CreateModuleFromObjectFile. It allows one to create a Module while directly specifying the ObjectFile class to use, and passing arbitrary constructor arguments. The part it does *not* do is register this module into the global module cache. That's because so far, this function is used to create "weird" modules that we wouldn't want to cache anyway. But, what if we created a version of this function which does that? Could DynamicLoaderDarwin then instead of target.GetOrCreateModule(shared_cache_spec) do something like:

module_sp = target.GetImages().FindFirstModule(shared_cache_spec);
// If target already contains the module, we're done.
if (!module_sp) {
  // This checks the global module cache for a matching module. If it finds one, it returns it. Otherwise, it creates a module through CreateModuleFromObjectFile, and registers it into the cache.
  module_sp = Module::GetOrCreateModuleFromObjectFile<ObjectFileMachO>(shared_cache_spec, whatever_args_we_need_to_get_macho_to_load_properly);
  target.GetImages().Append(module_sp, /*notify=*/false);
}

I think an interface like this could be useful in the future as an escape hatch, because there are a lot of situations where one already knows he is dealing with a specific kind of object files, but is not able to take advantage of that. For example, DynamicLoaderPOSIXDYLD "knows" it's going to load an ELF file, that knowledge does not help it in any way. So far, we haven't needed to do anything super weird there, but that class already does contain a bunch of fallbacks/workarounds for bugs in various dynamic loaders and other platform features (e.g. the VDSO pseudo-module). I can certainly see how being able to create an module more directly could be helpful at times.

The patch is also quite light on testing. If done right, I believe this should make it possible to yaml2obj a file into memory in a unit test and then create a Module from that. That would enable us to test the Module(Spec) changes in isolation, and move them to a separate patch.

I agree about the light testing. FWIW, this got significant living-on testing on Apple platforms, but some more targeted testing would be great.

Yeah, I'm sure it works now. I'm more worried about it continuing to work in face of other random changes. :)

lldb/include/lldb/Utility/DataBuffer.h
82 ↗(On Diff #275004)

Sounds good. I don't think that the comment really needs to mention the shared cache. I can see this being useful in other circumstances too...

lldb/source/Core/Module.cpp
154–159 ↗(On Diff #275004)

Hmm... interesting. I'm going to touch on this more in the main comment.

friss added a comment.Jul 9 2020, 3:00 PM

I think this is a very interesting feature (lldb being able to load modules from memory; the mac shared cache thingy is interesting too, but in a different way). We have a feature request from people who are downloading modules from a network (from a proprietary symbol server, etc.) and would like to pass them to lldb without having to serialize them to disk. This would be a step towards making that happen. It could also be useful for our own unit tests which now have to do a similar thing.

However, I think this could use some clean up. There's a lot of juggling of data/file/object offsets going on, and it all seems inconsistent and avoidable to me. Please see inline comments for details.

I'll see what can be done. My main goal while working on this was to avoid changing the semantics outside of the shared cache usecase. I understand fairly well the codepath that I added and then just moved some other bits around to keep the existing semantics for the rest. Happy to rework this.

So, if an object file needs to access some data which is outside of "its" image then my idea about using sliced data buffers will probably not work. It that case, using the "object offset" field to communicate the location of the "object" might not be a bad idea (it's still different than the use in .a files, but maybe we can stretch that definition). The part that bugs me then is having this functionality key off of the "data" field being set. Ideally, these would be two orthogonal features:

  • the "data" would control whether you read the file system to obtain the object contents
  • the "object offset" would tell you where to locate the desired object inside these "jumbo" objects

I think this will work. And I can hide the ugliness inside ObjectFileMachO. A shared cache image only ever needs to access data after its start, so I can model images a stretching from their starting point to the end of the shared cache. I remember doing it this way first, and the reason I changed my mind was because of some checks in the ObjectFileMachO::CreateSections which fired because the load commands were relative to the full shared cache instead of just the image. This can be dealt with locally in this function (the rest of the code has to deal with it anyway, because once an ObjectFile plugin claims an input, the data gets clamped to not have a starting offset anymore

Take a look at https://reviews.llvm.org/D83512, it implements the generic part and it required basically no work to support ELF in-memory files.

friss updated this revision to Diff 278331.Jul 15 2020, 3:40 PM
  • Rebase on top of D83512
  • Change the ObjectFileMachO pieces to rewrite offsets to look like a standard Mach-o image instead of adding a bunch of conditionals to handle the new cases.
jasonmolenda accepted this revision.Jul 15 2020, 6:01 PM

The rewrite of the ObjectFileMachO parts is very nice. LGTM.

lldb/source/Host/macosx/objcxx/HostInfoMacOSX.mm
463

not important but #include "Utility/UuidCompatibility.h" would get you this.

lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
2346

this is a bool, maybe !is_shared_cache_image would be clearer? The original code was comparing a bitfield to 0 so it made a little more sense.

This revision is now accepted and ready to land.Jul 15 2020, 6:01 PM

I like how this has turned out. Some small requests inline.

lldb/include/lldb/Host/HostInfoBase.h
107

Some doxygen here? "Try to find a module with the given name in the address space of the current process?"

lldb/source/Host/macosx/objcxx/HostInfoMacOSX.mm
494–495

The class is already in an implementation file (you could empasize that by making that an anonymous namespace: http://llvm.org/docs/CodingStandards.html#anonymous-namespaces). I don't think you need to go through all that trouble to avoid someone instantiating it...

lldb/unittests/Host/HostInfoTest.cpp
81–117 ↗(On Diff #278331)

This is mostly about checking the ObjectFileMachO functionality, is it not? Could you make that a ObjectFileMachO unittest?

I'm mainly trying to avoid pulling in lots of libraries into the host unittest binary. The unittest binary is a good way to ensure that that the host module does not grow external dependencies (as the unittest wouldn't link), but this would pull in pretty much everything into that binary.

friss updated this revision to Diff 278495.Jul 16 2020, 8:23 AM
friss marked an inline comment as done.

Address review feedback

labath accepted this revision.Jul 16 2020, 8:27 AM
labath added inline comments.
lldb/source/Host/macosx/objcxx/HostInfoMacOSX.mm
474–481

extern "C" in an anonymous namespace looks weird, even if it does work. Best move this part out...

friss added a comment.Jul 16 2020, 8:33 AM

I'll commit after I've re-built top of tree and fully retested

This revision was automatically updated to reflect the committed changes.