
[lldb] Assert file cache and live memory are equal on debug builds
ClosedPublic

Authored by augusto2112 on Jul 22 2021, 12:33 PM.

Diff Detail

Event Timeline

augusto2112 requested review of this revision. Jul 22 2021, 12:33 PM
augusto2112 created this revision.
Herald added a project: Restricted Project. Jul 22 2021, 12:33 PM

Hm, there's no phab action which says "reject approach" :) The perf impact of forcing all reads to come from live memory would be dramatically bad for a remote system debug session, so this can't be done. If there is a specific caller that must use live memory even for a section marked "read-only", it needs to pass force_live_memory=true in its call to ReadMemory. This change would probably make app launch for iOS apps etc. take 30-60 seconds longer than it does right now.

I should add — I also don’t think this will fix the bug you’re working on.

I do like the debug-lldb check that the file cache read matches the live memory part. :)

Hi Jason,

I could place this check on the caller, but I think here is the right place for it, maybe I can convince you.

the perf impact of forcing all reads to be out of memory would be dramatically bad for a remote system debug session, this can't be done.

On macOS, this would only affect images that have the LC_SEGMENT_SPLIT_INFO load command. Note that this does not include images that were extracted from the shared cache, as __LINKEDIT is already stripped in that case, so I don't think the performance impact would be that big. For other object files, I could change the default behavior of IsSafeToReadFromFileCache to what we had prior, but, similar to the problem we had on mach-O, there might be situations where the memory diverges that we don't know about.

My reasoning for placing this change here is that I think Target::ReadMemory should be as safe as possible when using the filecache (within reason: if a process changes read-only data mid-execution, then that's its responsibility, and it can force a live memory read). That way more callers can benefit from this optimization without having to thoroughly investigate whether it would be safe for them. For example, right now any read from a section that contains LC_SEGMENT_SPLIT_INFO could return wrong data, and we would have to reason at every call site about whether the default behavior of Target::ReadMemory is a problem or not. Also, many of those call sites are generic, and having to check whether this is a problem for a specific object file everywhere would not be very good.

If you're not convinced, I can see a couple of alternatives:

  • Instead of passing a boolean determining if we prefer the filecache or memory, pass an enum instead with more options, something like:
    • Default to memory (fallback to filecache).
    • Default to filecache (fallback to memory).
    • Use filecache if deemed safe.
    • Use filecache if read-only.
    • Etc.
  • Create another version of "safe" read memory that does what this change does.
  • Centralize the logic for figuring out whether a filecache read is safe, so we can reuse it from different call sites.
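The enum alternative could be sketched like this (a hypothetical illustration; MemoryReadPolicy and ShouldTryFileCacheFirst are invented names for this discussion, not lldb API):

```cpp
#include <cstddef>

// Hypothetical read-source policy, replacing the current bool flag.
// None of these names exist in lldb; they only illustrate the proposal.
enum class MemoryReadPolicy {
  PreferLiveMemory,    // default to memory, fall back to filecache
  PreferFileCache,     // default to filecache, fall back to memory
  FileCacheIfSafe,     // use filecache only when deemed safe
  FileCacheIfReadOnly, // use filecache only for read-only sections
};

// Sketch of how a read path might dispatch on the policy.
bool ShouldTryFileCacheFirst(MemoryReadPolicy policy, bool is_read_only,
                             bool is_safe) {
  switch (policy) {
  case MemoryReadPolicy::PreferLiveMemory:
    return false;
  case MemoryReadPolicy::PreferFileCache:
    return true;
  case MemoryReadPolicy::FileCacheIfSafe:
    return is_safe;
  case MemoryReadPolicy::FileCacheIfReadOnly:
    return is_read_only;
  }
  return false;
}
```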

I should add — I also don’t think this will fix the bug you’re working on.

At the bare minimum it solves rdar://76973844, which I verified. I couldn't verify my bug with a real watch yet but as it stands we're definitely reading wrong memory sometimes.

I do like the debug-lldb check that the file cache read matches the live memory part. :)

At least that we can keep then :) Do you think it should be a crash, or maybe a log? Wasn't sure what was best.

vsk added a subscriber: vsk. Jul 23 2021, 2:45 PM

Hey Augusto, thanks for tackling this, I'm just now slowly paging things in.

Is this a correct statement of the problem: LLDB is failing to disable its file cache optimization when reading writable segments (say, __DATA) from a MachO sourced from the shared cache?

If that's right, then I wonder whether you considered "simply" doing a bounds check on the address? The shared region should be mapped at a fixed virtual range in the debuggee process, and we can determine that range using dyld APIs.

lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
7040 ↗(On Diff #360929)

Is LC_SEGMENT_SPLIT_INFO present if and only if the MachO is from the shared cache?

Sorry for the delay in replying, I needed to think & look at this a bit.

Hi Jason,

I could place this check on the caller, but I think here is the right place for it, maybe I can convince you.

the perf impact of forcing all reads to be out of memory would be dramatically bad for a remote system debug session, this can't be done.

On macOS, this would only affect images that have the LC_SEGMENT_SPLIT_INFO load command.

I tried expanding some shared caches and played around a bit to make sure I understood what we're dealing with. You know all of this, but just to make sure everyone is following along, we have three possible states:

  1. Binary on disk, before it gets included in a shared cache. Will have LC_SEGMENT_SPLIT_INFO.
  2. Binary embedded in the shared cache binary blob (either on disk or in memory).
  3. Binary on disk, extracted from the shared cache. LC_SEGMENT_SPLIT_INFO is absent, cannot be reconstructed from the shared cache binary blob.

With simulator process debugging, the debugger can find the pre-shared-cache distinct binaries on the local system, with the LC_SEGMENT_SPLIT_INFOs.

With iOS debugging, the debugger will have post-shared-cache distinct binaries on the local system, without LC_SEGMENT_SPLIT_INFOs.

I'm not sure what watch debugging looks like -- instead of copying the shared cache off of a watch and expanding it, Xcode downloads a disk image from Apple with the shared cache contents. I don't know if that disk image has pre-shared-cache or post-shared-cache binaries; it's likely that pre-shared-cache binaries are included in the disk image.

In your bug, you've got a table which has relative offsets in it. The offsets are to things in the binary image. They may cross segment boundaries. The binary-on-disk will have TEXT, DATA, and LINKEDIT next to one another. When it's incorporated into the shared cache, the segments will be separated - and these offsets will be updated in the process of creating the shared cache. (are the offsets in the pre-shared-cache binary even valid? Not important) You have a bug where you're reading these offsets and getting them from the pre-shared-cache on-disk-binary, which are incorrect, and want to force reading them from the shared cache memory.

(post-shared-cache on-disk-binaries may have the same offsets as they were in the original shared cache -- but it's actually going to be a MORE subtle bug in that case, because people plug in different iPhones and Xcode only extracts the shared cache for one of them if they're all running the same build. But different model iPhones will have different shared caches, so using the offsets from the post-shared-cache on-disk-binary-image from one device for another device will be wrong.)

The simulator version of this bug will be fixed, with less perf hit but still probably too much, by ignoring the pre-shared-cache on-disk-binary -- we'll have to read all of the symbol names out of memory via debugserver. If this was over a USB cable, you'd be looking at a hit on the order of a minute; for a local debug session, it's probably closer to 5-6 seconds. I'm just guessing, I haven't measured it. But it's a lot of packet traffic.

You'll still have the bug in the scenario outlined above, where I have two different iOS devices running the same iOS build. Xcode will expand the shared cache for the first one I plug in, and lldb will use the same binary images (without LC_SEGMENT_SPLIT_INFOs) for both debug sessions. And watch debugging is a question mark -- I suspect those will have LC_SEGMENT_SPLIT_INFOs, and so app launch perf would take a gigantic hit to read symbol names all out of memory at debug time.

I think we need to concentrate on special casing the code that reads these offsets, to force it to use live memory every time. I know in theory this problem could happen anywhere -- we could have these cross-segments offsets anywhere in the binary -- but in practice, we know it currently happens in one spot that the debugger looks at.

I admit I'd be a lot happier if we could always use the file cache for read-only Sections and know that the data in the file matches the data in the memory. But if we only have one type of data where this isn't safe (so far?! :), I'm not willing to toss the optimization. At the very least, we'd need to find ways to retain the same perf characteristics as we have right now, and I think that might be tricky.

I do like the debug-lldb check that the file cache read matches the live memory part. :)

At least that we can keep then :) Do you think it should be a crash, or maybe a log? Wasn't sure what was best.

I think an assert is good, there's no scenario where this should get hit -- if it does, then we really have a bug in using the filecache data and we need to halt immediately in CI or on an lldb developer's desktop. I'm a little less sure about turning this on for DEBUG builds all the time, because it does change lldb's perf behavior a lot, specifically adding lots of packet traffic, but I think we should give it a try and see if it turns out to be too much of a change.

Hi Jason,

At least that we can keep then :) Do you think it should be a crash, or maybe a log? Wasn't sure what was best.

I think an assert is good, there's no scenario where this should get hit -- if it does, then we really have a bug in using the filecache data and we need to halt immediately in CI or on an lldb developer's desktop. I'm a little less sure about turning this on for DEBUG builds all the time, because it does change lldb's perf behavior a lot, specifically adding lots of packet traffic, but I think we should give it a try and see if it turns out to be too much of a change.

If the perf behavior of lldb changes too much, we could put the "also read from live memory & memcmp against file cache when we use filecache" behind a setting, and set it for testsuite runs, so we get this used in CI and desktop testruns. The extra packet traffic in those cases is going to be less important. Maybe we start with the easy "always enable on DEBUG builds" and fall back to a setting when it turns out to be too much, unless other people have an opinion.
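The verification being discussed boils down to something like the following standalone sketch (simplified; ReadWithVerification and the stub readers are illustrative stand-ins for the file-cache and live-memory read paths in Target::ReadMemory, not the actual lldb code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <memory>

// Stand-in for a read path; in lldb these would be the filecache read
// and a live read through the process plugin.
using ReadFn = size_t (*)(uint64_t addr, void *buf, size_t len);

// Sketch: after a successful filecache read, optionally re-read the
// same range from live memory and assert the bytes match.
size_t ReadWithVerification(ReadFn read_file_cache, ReadFn read_live,
                            uint64_t addr, void *dst, size_t dst_len,
                            bool verify) {
  size_t bytes = read_file_cache(addr, dst, dst_len);
  if (verify && bytes == dst_len) {
    auto live_buf = std::make_unique<uint8_t[]>(dst_len);
    size_t live_bytes = read_live(addr, live_buf.get(), dst_len);
    assert(live_bytes == dst_len && "Live memory read failed!");
    assert(memcmp(live_buf.get(), dst, dst_len) == 0 &&
           "File cache and live memory diverge!");
  }
  return bytes;
}

// Toy backing store and reader used only to exercise the sketch.
static uint8_t g_memory[16] = {1, 2, 3, 4, 5, 6, 7, 8};

static size_t StubRead(uint64_t addr, void *buf, size_t len) {
  memcpy(buf, g_memory + addr, len);
  return len;
}
```

A setting-controlled version would simply gate `verify` on the property value instead of NDEBUG.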

Hey Augusto, thanks for tackling this, I'm just now slowly paging things in.

Is this a correct statement of the problem: LLDB is failing to disable its file cache optimization when reading writable segments (say, __DATA) from a MachO sourced from the shared cache?

If that's right, then I wonder whether you considered "simply" doing a bounds check on the address? The shared region should be mapped at a fixed virtual range in the debuggee process, and we can determine that range using dyld APIs.

Hey Vedant! Not exactly. Jason's description is spot on.

@jasonmolenda thanks for putting so much thought into this!

(post-shared-cache on-disk-binaries may have the same offsets as it was in the original shared cache -- but it's actually going to be a MORE subtle bug in that case because people plug in different iPhones and Xcode only extracts the shared cache for one of them if they're all running the same build. But different model iphones will have different shared caches, so using the offsets from the post-shared-cache on-disk-binary-image from one device for another device will be wrong.)

I finally got a watch to test with, and found that the bug still happens on dylibs extracted from the shared cache, and this might be the problem.

I'm updating the patch and keeping only the assert, and think of another way of solving the problem.

lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
7040 ↗(On Diff #360929)

No, it means that a dylib is eligible for being incorporated in the shared cache later on.

Keep only filecache and live memory equality assertion

augusto2112 retitled this revision from [lldb] Improve checking of file cache read eligibility for mach-O to [lldb] Assert file cache and live memory are equal on debug builds. Jul 27 2021, 5:29 AM
augusto2112 edited the summary of this revision. (Show Details)

Looks good! Maybe we should add a setting like target.file-cache-memory-reads-verify (a boolean) to enable/disable this check instead of keying off NDEBUG. We can turn it on in the testsuite, and we can optionally have people with release installs of lldb enable it if we ever suspect that it might be an issue. The settings are added with an entry in source/Target/TargetProperties.td and a getter/setter in TargetProperties.
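Such a property entry might look roughly like this in source/Target/TargetProperties.td (a hypothetical sketch: the key name and description are guesses at what the setting could look like, not the landed code):

```tablegen
// Hypothetical sketch of the proposed setting; exact name and wording
// are illustrative only.
def FileCacheMemoryReadsVerify
    : Property<"file-cache-memory-reads-verify", "Boolean">,
      DefaultFalse,
      Desc<"Verify file cache reads against live memory reads and "
           "assert if the contents diverge.">;
```

The matching getter/setter pair in TargetProperties would then expose the value to Target::ReadMemory.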

Looks good! Maybe we should add a setting like target.file-cache-memory-reads-verify (a boolean) to enable/disable this check instead of keying off NDEBUG. We can turn it on in the testsuite, and we can optionally have people with release installs of lldb enable it if we ever suspect that it might be an issue. The settings are added with an entry in source/Target/TargetProperties.td and a getter/setter in TargetProperties.

If we do that we should make it an lldbassert so it prints something in release builds.

Add setting to enable assert

Should this be enabled in the test suite? (packages/Python/lldbsuite/test/lldbtest.py:783, test/Shell/lit-lldb-init.in)

lldb/source/Target/Target.cpp
1773

Can we use a unique_ptr here (std::make_unique<uint8_t[]>)?
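The suggestion amounts to replacing a manually-managed heap buffer with std::make_unique<uint8_t[]>, roughly as in this sketch (CopyThroughScratchBuffer is an invented toy function, not the Target.cpp code):

```cpp
#include <cstdint>
#include <cstring>
#include <memory>

// Before (sketch): a raw allocation that must be freed on every path.
//   uint8_t *live_buf = new uint8_t[dst_len];
//   ...
//   delete[] live_buf;
//
// After (sketch): the array is released automatically when live_buf
// goes out of scope, on every return path, including early returns.
size_t CopyThroughScratchBuffer(const uint8_t *src, uint8_t *dst,
                                size_t dst_len) {
  auto live_buf = std::make_unique<uint8_t[]>(dst_len);
  memcpy(live_buf.get(), src, dst_len);
  memcpy(dst, live_buf.get(), dst_len);
  return dst_len; // live_buf freed here; no explicit delete[] needed
}
```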

Run file-cache verification on tests, change raw buffer to unique pointer

jasonmolenda accepted this revision. Jul 28 2021, 1:53 PM

Looks great! Thanks for doing this.

This revision is now accepted and ready to land. Jul 28 2021, 1:53 PM
JDevlieghere accepted this revision. Jul 28 2021, 8:05 PM

Seems like this is still missing from lit-lldb-init.in, which means the shell tests (and by extension the Swift REPL tests downstream) will run with the setting off. Other than that this LGTM.

Add test to test/Shell/lit-lldb-init.in

jankratochvil added a subscriber: jankratochvil. Edited Jul 29 2021, 7:54 AM

It broke some Linux buildbots, and it also broke the testsuite on my Fedora machine:
functionalities/gdb_remote_client/TestTargetXMLArch.py

Some buildbots do not use -DLLVM_ENABLE_ASSERTIONS=ON which may be the reason why they PASS.

python3.9: /home/jkratoch/redhat/llvm-monorepo/lldb/source/Target/Target.cpp:1781: size_t lldb_private::Target::ReadMemory(const lldb_private::Address &, void *, size_t, lldb_private::Status &, bool, lldb::addr_t *): Assertion `memcmp(live_buf.get(), dst, dst_len) == 0 && "File cache and live memory diverge!"' failed.
(lldb) p dst_len
(size_t) $1 = 1
(lldb) p/x *live_buf.get()
(unsigned char) $2 = 0x00
(lldb) p/x *(uint8_t *)dst
(uint8_t) $3 = 0xc3

  * frame #4: 0x00007fffe119d36a _lldb.so`lldb_private::Target::ReadMemory(this=0x00005555559585e0, addr=0x00005555559ff9a8, dst=0x0000555555922560, dst_len=1, error=0x00007fffffff9088, force_live_memory=false, load_addr_ptr=0x0000000000000000) at Target.cpp:1780:15
    frame #5: 0x00007fffe17d0932 _lldb.so`UnwindAssembly_x86::GetNonCallSiteUnwindPlanFromAssembly(this=0x00005555559117b0, func=0x00005555559ff9a8, thread=0x00005555559eb7a0, unwind_plan=0x00005555559224d0) at UnwindAssembly-x86.cpp:56:31
    frame #6: 0x00007fffe10d42df _lldb.so`lldb_private::FuncUnwinders::GetAssemblyUnwindPlan(this=0x00005555559ff9a0, target=0x00005555559585e0, thread=0x00005555559eb7a0) at FuncUnwinders.cpp:341:32