This is an archive of the discontinued LLVM Phabricator instance.

[lldb][Target] Flush the scratch TypeSystem when process gets deleted
ClosedPublic

Authored by Michael137 on Nov 25 2022, 8:34 AM.

Details

Summary

Summary

This patch addresses #59128, where LLDB would crash when evaluating
a type that has been imported into the same instance of a target's scratch AST
across process re-runs. The proposed solution is to clear the scratch AST (and associated
persistent variables, ClangASTImporter, etc.) whenever a module that
could've owned one of the stale TypeSystems gets unloaded/destroyed.

Details:

  1. The first time we evaluate the expression we import the decl for Foo into the Targets scratch AST context (lives in m_scratch_type_system_map). During this process we also create a ClangASTImporter that lives in the ClangPersistentVariables::m_ast_importer_sp. This importer has decl tracking structures which reference the source AST that the decl got imported from. This importer also gets re-used for all calls to DeportType (which we use to copy the final decl into the Targets scratch AST).
  2. Rebuilding the executable triggers a tear-down of the Module that was backing the ASTContext that we originally got the Foo decl from (which lived in the Module::m_type_system_map). However, the Target’s scratch AST lives on.
  3. Re-running the same expression will now create a new ASTImporterDelegate where the destination TranslationUnitDecl is the same as the one from step (1).
  4. When importing the new Foo decl we first try to find it in the destination DeclContext, which happens to be the scratch destination TranslationUnitDecl. The Foo decl exists in this context since we copied it into the scratch AST in the first run. The ASTImporter then queries LLDB for the origin of that decl. Using the same persistent variable ClangASTImporter we claim the decl has an origin in the AST context that got torn down with the Module. This faulty origin leads to a use-after-free.

Testing

  • Added API test

Diff Detail

Event Timeline

Michael137 created this revision.Nov 25 2022, 8:34 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 25 2022, 8:34 AM
Michael137 requested review of this revision.Nov 25 2022, 8:34 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 25 2022, 8:34 AM
  • Update comment
kastiglione added inline comments.
lldb/source/Target/Target.cpp
208

Do we have some place in the life-cycle where we can perform this only if the target has changed? Ideally this would happen when the binary has a different timestamp, or for mach-o a different UUID.

aprantl added inline comments.Nov 29 2022, 1:56 PM
lldb/source/Target/Target.cpp
207

Not opposed to this, but this is leaking an implementation detail of TypeSystemClang into Target. Just out of curiosity — would if be feasible to use weak_ptr in DeclOrigin or is that defined inside of Clang and can't be changed?

Michael137 added inline comments.Nov 30 2022, 4:03 AM
lldb/source/Target/Target.cpp
207

I agree with your point about leaking TypeSystem implementation details into Process lifecycle. I briefly considered converting DeclOrigins to hold weak pointers but unfortunately the pointers are handed out by clang::Decl. Though thinking about it, clang::ASTContexts are refcounted (via llvm::RefCountedBase). So perhaps we could keep them alive that way. Though I’m concerned that we then commit to supporting multiple definitions of a type in the scratch AST and never really properly clearing the DeclOrigin structures. But I’ll play around with the idea

208

There is DidUnloadModules which gets notified when an lldb_private::Module gets unloaded (e.g., on rebuilt). But this includes JITted modules (e.g., when running AppleObjectiveCRuntimeV2 utility functions) in which case we wouldn’t want to flush the type systems. Also the Clang REPL (and I assume the Swift REPL) rely on the type system being still present after we unloaded the module associated with the evaluated expression. All this is to say, I found it to be quite fiddly to determine when to flush the persistent variables from within that notification. Really we would like a notification to the Target which says “we restarted the debugee AND some module got rebuilt”. In that case it’s not safe to keep the persistent variables around. I’ll double check if there isn’t something like that around.

Michael137 added inline comments.Nov 30 2022, 8:05 AM
lldb/source/Target/Target.cpp
207

Hmm this becomes a bit awkward because TypeSystemClang owns the backing ASTContext uniquely: std::unique_ptr<clang::ASTContext> m_ast_up.

So we can't bump the refcount unless we somehow prevent the m_ast_up deleter from kicking in when the backing lldb_private::Module gets destroyed.

Stepping back a little, essentially what we want is to:

  1. Either, accept the fact that the OriginMap can contain stale pointers and provide a mechanism to detect this (and clear them out of the map)
  2. OR, never allow a situation where we have stale pointers in this map.
Michael137 added inline comments.Nov 30 2022, 8:30 AM
lldb/source/Target/Target.cpp
208

I suppose we could do something like:

void Target::ModulesDidUnload(ModuleList &module_list, bool delete_locations) {
    ...
    if each module in module_list is Type::eTypeExecutable or Type::eTypeObjectFile {
        m_scratch_type_system_map.Clear();
    }
aprantl added inline comments.Nov 30 2022, 9:30 AM
lldb/source/Target/Target.cpp
208

That looks like a reasonable middle ground?

  • Clear TypeSystem in notification
Michael137 edited the summary of this revision. (Show Details)Nov 30 2022, 12:59 PM
Michael137 added inline comments.
lldb/source/Target/Target.cpp
208

Updated diff with this alternative

kastiglione added inline comments.Nov 30 2022, 1:39 PM
lldb/source/Core/ModuleList.cpp
1080 ↗(On Diff #479052)

I wonder why ForEach doesn't deal out a Module &? I would think a ModuleList should not allow for null Module pointers.

lldb/source/Target/Target.cpp
1687

How come this is AllOf and not a AnyOf?

1705–1711

nit: Why such a narrow line wrapping width?

I think I'm fine with this variant modulo outstanding comments!

lldb/source/Core/ModuleList.cpp
1080 ↗(On Diff #479052)

I think this is historic. +1 for taking a Module & (unless we for some reason need a shared_ptr in the lambda).

lldb/source/Target/Target.cpp
1705

Can you make it clear that this comment is talking about TypeSystemClang specifically?

Michael137 added inline comments.Dec 1 2022, 12:54 AM
lldb/source/Core/ModuleList.cpp
1080 ↗(On Diff #479052)

I was wondering that myself. I took a stab at changing it to pass a reference but turns out there's several places where we end up storing the shared_ptr. If we changed this to a reference we'd have to call shared_from_this for those cases. Which is a bit iffy considering calling it on references that are not shared_ptr owned would throw (which can't happen currently but seems similarly awkward to the current API).

A compromise could be to document that these pointers are guaranteed to be non-null and add an assert into ForEach. AFAICT all call-sites assume the pointer is non-null anyway

Michael137 added inline comments.Dec 1 2022, 1:09 AM
lldb/source/Target/Target.cpp
1687

Here I'm checking whether all unloaded modules are regular executables or object files. Otherwise we wouldn't want to clear the TypeSystems. E.g., for JITted modules this would not be desirable. I'll add a comment about this

Michael137 added inline comments.Dec 1 2022, 1:18 AM
lldb/source/Target/Target.cpp
1687

Though perhaps AnyOf would be the safer thing to do. Not sure when we'd have a mixture of object-files in here. But I suppose if we unloaded any that could've been a source AST then we're in unsafe territory.

Michael137 marked 2 inline comments as done.Dec 1 2022, 1:18 AM
Michael137 updated this revision to Diff 479190.Dec 1 2022, 1:20 AM
  • Reflow comment
  • Remove redundant null-check
Michael137 marked an inline comment as done.
Michael137 marked 2 inline comments as done.
Michael137 marked an inline comment as done.
Michael137 edited the summary of this revision. (Show Details)Dec 1 2022, 8:09 AM
Michael137 edited the summary of this revision. (Show Details)
aprantl accepted this revision.Dec 1 2022, 12:18 PM

LGTM with a slightly more relaxed test

lldb/source/Target/Target.cpp
1707

nit: no {} in LLVM style

lldb/test/API/functionalities/rerun_and_expr/TestRerunAndExpr.py
47

This is really great — I didn't know we already had this!

51

This may be checking too much detail. I don't know how frequently the Clang AST changes, but if there is any information we don't really need I would try to omit it so we don't have to update this test every couple of months.

This revision is now accepted and ready to land.Dec 1 2022, 12:18 PM
Michael137 marked 3 inline comments as done.Dec 1 2022, 2:50 PM
Michael137 updated this revision to Diff 479446.Dec 1 2022, 2:51 PM
  • Add dylib tests

Broke windows and linux build bots. Checking....

Reverted for now while figuring out how to best deal with the tests on Linux

labath added a subscriber: labath.Dec 2 2022, 6:46 AM

A simple USE_LIBDL := 1 could fix the linux test.

As for windows, we have some dlopen "portability wrappers" in packages/Python/lldbsuite/test/make/dylib.h, but windows has the additional problem (well, maybe it's kind of a feature) of not being able to modify a binary while it's open). It will probably be necessary to kill the process and remove the module from its module list (if that doesn't defeat the purpose of the test). Force unloading the module (SBDebugger::MemoryPressureDetected) might be necessary as well.

(or skip the test on Windows)

Michael137 reopened this revision.Dec 3 2022, 7:59 AM

Linux issue was just about dlopening the correct library name. Fixing that makes the test pass on my Ubuntu machine.

Skipping for now on Windows because I don't have access to a machine at the moment

This revision is now accepted and ready to land.Dec 3 2022, 7:59 AM
Michael137 updated this revision to Diff 479841.Dec 3 2022, 7:59 AM
  • Fix Linux tests: pass solib into source
  • Only flush modified files (including shared libraries)
  • Skip on Windows for now
  • Relaunch process using SBAPI
Michael137 updated this revision to Diff 479844.Dec 3 2022, 8:29 AM
  • Rebase
  • Reword commit
aprantl accepted this revision.Dec 5 2022, 8:51 AM

Argh broke linux again.

This time it's a lack of dlclose to make LLDB destroy the module the way I want it to.

About to push a fix...

Don't have a Windows setup to confidently test this on right now but will see if I can get one going this week. Skipped it for now. But the problem is really what Pavel was alluding to. Rebuilding the binary doesn't trigger the module unload in the same way it does on Linux and Darwin. But should be fixable once I have the setup

I just found out that this test (the non-shared-library version) fails on linux if the executable is built with -no-pie (which is the default if the CLANG_DEFAULT_PIE_ON_LINUX option is not set, which happened to be the case for my build). I think the important fact here is that a PIE ELF executable gets its e_type field set to ET_DYN, which causes lldb to identify it as a shared library. However, it is not clear to me why would that matter in this case...

I just found out that this test (the non-shared-library version) fails on linux if the executable is built with -no-pie (which is the default if the CLANG_DEFAULT_PIE_ON_LINUX option is not set, which happened to be the case for my build). I think the important fact here is that a PIE ELF executable gets its e_type field set to ET_DYN, which causes lldb to identify it as a shared library. However, it is not clear to me why would that matter in this case...

Thanks for reporting, will check

Michael137 added a comment.EditedDec 12 2022, 7:54 AM

I just found out that this test (the non-shared-library version) fails on linux if the executable is built with -no-pie (which is the default if the CLANG_DEFAULT_PIE_ON_LINUX option is not set, which happened to be the case for my build). I think the important fact here is that a PIE ELF executable gets its e_type field set to ET_DYN, which causes lldb to identify it as a shared library. However, it is not clear to me why would that matter in this case...

From what I can tell the main difference with no-PIE executable is that Target::ModulesDidUnload doesn't get called when relaunching the process. Whereas in the PIE case we do:

PlatformPOSIX::DebugProcess
-> Process::Launch
  -> DynamicLoaderPOSIXDYLD::DidLaunch
    -> DynamicLoader::GetTargetExecutable
      -> Target::SetExecutableModule
        -> Target::ClearModules
          -> Target::ModulesDidUnload
Michael137 added a comment.EditedDec 13 2022, 8:41 AM

I just found out that this test (the non-shared-library version) fails on linux if the executable is built with -no-pie (which is the default if the CLANG_DEFAULT_PIE_ON_LINUX option is not set, which happened to be the case for my build). I think the important fact here is that a PIE ELF executable gets its e_type field set to ET_DYN, which causes lldb to identify it as a shared library. However, it is not clear to me why would that matter in this case...

From what I can tell the main difference with no-PIE executable is that Target::ModulesDidUnload doesn't get called when relaunching the process. Whereas in the PIE case we do:

PlatformPOSIX::DebugProcess
-> Process::Launch
  -> DynamicLoaderPOSIXDYLD::DidLaunch
    -> DynamicLoader::GetTargetExecutable
      -> Target::SetExecutableModule
        -> Target::ClearModules
          -> Target::ModulesDidUnload

Ah I may have spotted the issue (but yet to confirm).

Currently, when relaunching an executable, the DynamicLoader plugin does this:

ModuleSP DynamicLoader::GetTargetExecutable() {
        ...
        if (executable.get() != target.GetExecutableModulePointer()) {        
          // Don't load dependent images since we are in dyld where we will   
          // know and find out about all images that are loaded
          target.SetExecutableModule(executable, eLoadDependentsNo);

In target.GetExecutableModulePointer we return the pointer to the first module of type eTypeObjectFile, which with no-pie will not be a.out, so then we just return the first module in m_images. In the no-pie case we don't take this branch and maybe this is why. Will check shortly