This is an archive of the discontinued LLVM Phabricator instance.

[RFC] Implement out-of-process allocator enumeration for macOS
AbandonedPublic

Authored by delcypher on Aug 6 2018, 6:10 AM.

Details

Summary

Add support to allocators to enumerate allocations out-of-process.

This work is part of a multi patch series to implement a malloc zone
enumerator on Darwin. On Darwin zone enumerators work out-of-process. A
memory analysis tool wishing to inspect a malloc zone of a target
process, pauses the target process, dlopen()'s the target process's
allocator library (in our case asan/lsan/tsan) into itself and then
calls the malloc zone enumeration function (from the dlopened library)
to inspect the target process on the analysis process's behalf.

In order to implement this a first step is to allow the sanitizer
allocators to be enumerated out-of-process. This is what this
patch focuses on. Other necessary parts will be implemented in
later patches.

Earlier attempts at this patch suffered from code duplication where the
in-process and out-of-process code paths were independent which would
have lead to code maintenance problems in the future. This patch
unifies both the in-process and out-of-process code paths into a single
implementation using an ObjectView abstraction.

The ObjectView abstraction is an interface that abstracts away (most
of) the differences between in-process and out-of-process enumeration in
a type safe and efficient manner.

The LocalObjectView implementation of this abstraction is for
in-process enumeration and tries to be efficient by not storing
unnecessary data and by not performing any allocations or copy
operations.

The RemoteObjectView implementation of this abstraction is for
out-of-process enumeration. It automatically handles copying data from
another process and allocating/freeing storage for that copy.
Under the hood this implementation uses a new VMReadContext class to do
the actual copying and then uses a storage class to manage
the storage.

VMReadContext is a low-level abstraction for reading the memory of
another process. It is not intended to be used directly.

Existing allocator methods (such as ForEachChunk) simply create a
LocalObjectView (which is essentially acts just like a this pointer)
and call a static method with a OOP suffix that does the actual work.
The static methods with the OOP suffix are templated over a
ObjectView type which can either be LocalObjectView or some
instantiation of RemoteObjectView and also take a pointer to an
instance of that type. The choice to make these OOP methods being
static is deliberate because it means there is no this pointer and
so it is harder to accidenly call normal instance methods which could be
unsafe.

The Sanitizer allocator unit tests have been updated to do enumeration
using the RemoteObjectView interface in addition to the
LocalObjectView interface so that we can check that

  • The code correctly compiles when instantiated with RemoteObjectView (In the sanitizer libraries it currently is never instantiated with this type).
  • That the copying and allocation steps of RemoteObjectView work correctly.

rdar://problem/45284065

Event Timeline

delcypher created this revision.Aug 6 2018, 6:10 AM
Herald added subscribers: Restricted Project, mgorny. ยท View Herald TranscriptAug 6 2018, 6:10 AM
kubamracek added a project: Restricted Project.
kcc edited reviewers, added: morehouse; removed: alekseyshl.Aug 16 2018, 12:54 PM

[sorry for delay, I've been OOO]
Ouch. That's really a lot of code in the core parts of *san, which adds quite a bit of maintenance tax.
Two questions before we start reviewing the code:

  • is it possible to move more of the logic into Mac-specific files?
  • can other platforms benefit from this code in some way? How?
lib/sanitizer_common/sanitizer_process_vm_reader.h
27 โ†—(On Diff #159284)

is it possible to avoid #ifdefs in common code?

@kcc

In D50330#1203010, @kcc wrote:

[sorry for delay, I've been OOO]
Ouch. That's really a lot of code in the core parts of *san, which adds quite a bit of maintenance tax.
Two questions before we start reviewing the code:

  • is it possible to move more of the logic into Mac-specific files?

Yes, but that is some what in conflict with the idea the goal to have this code tested on all platforms. We would like to do this ideally because
we want tests to fail when someone changes the allocators in a way that would break out-of-process enumeration.

  • can other platforms benefit from this code in some way? How?

There would be several bits to this.

  • Other platforms would need to implement ProcessVMReaderContext so that it is able to read the memory of another ASanified process. In the code I note some ways this could be done on Linux.
  • To be able to do the actually enumeration the external process would need to know where the allocator lives in the target process. We would need to add mechanism for this for other platforms. On Darwin a mechanism already exists for this.

A more important question though is, what use cases on other Platforms are there? On macOS our use case is a lot of our pre-existing memory analysis
tools assume they can pause a running application and then enumerate the paused application's allocation. I'm not sure what use cases exist on other
platforms of enumerating allocations out of process.

Something worth noting is that there are multiple steps involved in implementing out-of-process allocation enumeration and this RFC is only the first step.
It didn't make sense to me to try to upstream the whole thing because that would make the patch too large and hard to review.

On to your more general concern about this approach adding too much code to core parts of ASan leading to more maintenance.

I've been thinking about this and there is a way to unify the ForEachChunk(...) and ForEachChunkOutOfProcess(...) methods so there is only a single implementation but it would
require a complete redesign of ProcessVMReaderContext. A way to unify the separate implementations is to template them on a type T, where T
implements the interface of ProcessVMReaderContext. When doing in-process enumeration the type T would basically be doing no-ops. When doing out-of-process
enumeration T would actually be the ProcessVMReaderContext type. There are some downsides to this approach.

  • With the in-process and out-of-process variants merged into one, anyone wishing to change this methods need to be very careful. They have to exclusively use the interface of T to dereference pointers. The code is also now harder to read because its written from the perspective of doing out-of-process enumeration, with the in-process case being handled in a very non-transparent way.
  • The design of ProcessVMReaderContext would need to be changed significantly. Currently ProcessVMReaderContext does not own or manage the storage allocated for data copied from a target process. Right now the storage is handled externally by having stack allocations and internal_memcpy(...)-ing data into those locations. This design isn't compatible if we have unified ForEachChunk implementations because we want the in-process enumeration to have zero added cost (we don't want additional stack allocations and internal_memcpy(...) calls). To do this we would need ProcessVMReaderContext to actually own the storage allocated for out-of-process enumeration. Arguably this a better design anyway.

A problem with the suggestion above though is I'm not 100% sure if we can unify everything. For example later patches add an out-of-process version of AsanChunk and
I'm not sure if we can have a single implementation of that class that supports both in-process and out-of-process enumeration.

So the question is. What is more important, keeping the in-process and out-of-process functions separate for readability or unifying for maintainability?

lib/sanitizer_common/sanitizer_process_vm_reader.h
27 โ†—(On Diff #159284)

Sure. I guess lib/sanitizer_common/sanitizer_internal_defs.h would be a sensible place to put the typedef?

kcc added a comment.Aug 20 2018, 6:12 PM

I've looked at the code once more and I am really afraid of this extra complexity in the core of asan that is never needed outside of OSX/iOS.
It would be much better to hide this into Mac-specific code that would use allocator's public APIs.
It's fine to add some public APIs to the allocator for that purpose, as long as they are testable.
Is that possible?

Also, just for my education, why would someone need to use heap, vmmap, or leaks on an asan-ified process?
For leaks, we have lsan (already works on Mac IIRC).
For heap, asan introduces a significant distortion so that the heap profiles won't be similar to the ones in production builds.
For vmmap, part of that can be achieved by asan's builtin heap profiler (linux-only today).

You certainly know more about Mac than I do, I just want to make sure you are solving a problem that needs to be solved.

In D50330#1206984, @kcc wrote:

Also, just for my education, why would someone need to use heap, vmmap, or leaks on an asan-ified process?

This is all about preserving existing workflows and features that iOS/macOS developers rely on. I know that ASan has alternatives for some of these tools, but we can't just teach all existing developers to use a completely different toolset when using ASan. One example: The 'heap' tool and Xcode's Memory Graph Debugger are rich tools for analysing your heap and whether you're running with ASan or not doesn't matter from the user's perspective -- they expect these tools to work. I'm not worried about things like fragmentation, which as you said can vary greatly, but most information isn't going to be different under ASan: per-size allocation histogram, pointer reference cycles, number of allocations of certain size, per object class breakdown of allocations, for strings the tools can preview content, etc. All of this is already built into 'heap', 'leaks' and the Memory Graph Debugger and these things can't realistically be re-implemented into ASan/LSan. The idea of adding compatibility into the sanitizer allocator sounds much simpler in comparison.

In D50330#1206984, @kcc wrote:

I've looked at the code once more and I am really afraid of this extra complexity in the core of asan that is never needed outside of OSX/iOS.
It would be much better to hide this into Mac-specific code that would use allocator's public APIs.

Just to make sure we're on the same page: You're only talking about the three added ForEachChunkOutOfProcess functions, right? The rest of the patch is almost entirely support/testing code that isn't adding any complexity to the core.

In D50330#1206984, @kcc wrote:

I've looked at the code once more and I am really afraid of this extra complexity in the core of asan that is never needed outside of OSX/iOS.
It would be much better to hide this into Mac-specific code that would use allocator's public APIs.
It's fine to add some public APIs to the allocator for that purpose, as long as they are testable.
Is that possible?

Could you elaborate on what you are envisioning? The only methods I've added to the allocators in this patch are the ForEachChunkOutOfProcess(...)
methods (later patches do add a few out-of-process variants of existing methods). At present the ForEachChunkOutOfProcess(...) methods have to
be in the allocator classes
because they access internal implementation details of the allocators which are private. Adding public methods to let external code (that arguably belongs in the allocator) get at these internals
seems like a horrible layering violation. I'm not convinced doing that would have any positive impact because now:

  • There will be a public API that has to be maintained for just a single client in addition to having to maintain the allocator internals. Previously we only needed to worry about maintaining the allocator internals. Creating more maintenance burden than what is in this patch seems like the wrong way to go.
  • It's even easier to break out-of-process enumeration because the code for it would live far away the relevant allocators.

What about the other idea I suggested where the ForEachChunk methods are unified so there is only one implementation for both in-process and out-of-process?
For in-process we'd use a zero-cost abstraction so that for in-process enumeration we don't do any of the expensive operations required for out-of-process enumeration.
That would seem to me to fix your valid concerns over having to maintain two different ForEachChunk implementations. I've not tried implementing this yet because I don't
want to spend time implementing something that won't be accepted upstream. However if this an approach you might consider acceptable then I can try to implement it.

Also, just for my education, why would someone need to use heap, vmmap, or leaks on an asan-ified process?
For leaks, we have lsan (already works on Mac IIRC).
For heap, asan introduces a significant distortion so that the heap profiles won't be similar to the ones in production builds.
For vmmap, part of that can be achieved by asan's builtin heap profiler (linux-only today).

You certainly know more about Mac than I do, I just want to make sure you are solving a problem that needs to be solved.

@kubamracek Has already listed the main reasons why we want to do this. I'd like to add this will also help LSan support on macOS. Right now LSan "kind of" works on macOS. It has false positives due it not understanding various internal implementation details of various macOS specific libraries like GCD and the Swift runtime. The leaks tool on the other hand does not have these false positives because it is aware of these internal implementation details. Improvements to LSan on macOS will be greatly helped by comparing LSan reports against reports produced by the leaks tool. Being able to run the leaks tool on ASanified binary would make this process easier because programs don't need to be recompiled with ASan disabled to do leak checking with the leaks tool.

kcc added a comment.Aug 23 2018, 4:49 PM

Just to make sure we're on the same page: You're only talking about the three added ForEachChunkOutOfProcess functions, right? The rest of the patch is almost entirely support/testing code that isn't adding any complexity to the core.

Correct. Looking at them again...

kcc added a comment.Aug 23 2018, 4:50 PM

and of course, also, the ProcessVMReaderContext

In D50330#1211776, @kcc wrote:

and of course, also, the ProcessVMReaderContext

If the abstraction of ProcessVMReaderContext is bothering you, we should be able to move most of it into Mac-specific source files and just have ForEachChunkOutOfProcess take more callback pointers instead.

But moving ForEachChunkOutOfProcess into Mac-specific files doesn't sound easy to do: ForEachChunkOutOfProcess needs access to internal structures of the allocator.

kcc added a comment.Aug 27 2018, 1:30 PM

ForEachChunkOutOfProcess needs access to internal structures of the allocator.

That's exactly why I proposed to consider implementing a public API instead.
It might be hard, but I'd like to know why before we proceed with the current change.

delcypher updated this revision to Diff 167989.Oct 2 2018, 11:23 AM
  • Implement SizeClassAllocator32 out-of-process enumeration
  • Split implementation of out-of-process functions into their own header files. These can be included by the platforms that actually need this functionality.
  • Remove ProcessVMReaderContext::ReadErrorTy and use bool instead.
  • Move declaration of the ProcessHandle into sanitizer_internal_defs.h.
vitalybuka added inline comments.Oct 2 2018, 11:37 AM
lib/sanitizer_common/sanitizer_allocator_bytemap.h
21

I'd avoid this typedefs use in implementation only
for sizeof please use actual variables, e.g. sizeof(*this)
other few places just decltype(*this)

In D50330#1214896, @kcc wrote:

ForEachChunkOutOfProcess needs access to internal structures of the allocator.

That's exactly why I proposed to consider implementing a public API instead.
It might be hard, but I'd like to know why before we proceed with the current change.

Sorry for taking so long to revisit this. I hope you'll find this version of the patch more palatable.

I tried looking at the idea of a "public API" and I see no reasonable way to do this. The out-of-process code has to know lots of internal details about the implementations of the allocators that I see no practical way to abstract. In particular the fact we are performing out-of-process reads couples us even very tightly to an allocator implementation because we have to be very careful to only call APIs on an allocator that do not dereference pointers. In several cases I have had to implement out-of-process versions of existing APIs because the existing ones aren't designed to work with the allocator being in a different process. To me these are very strong indicators that it does not make sense to pull the out-of-process enumeration code out of allocators.

Previously you raised concerns about having extra complexity inside the ASan allocators that no other platforms are using. The new version of the patch aims at a compromise that partially addresses this.

In this patch all of the out-of-process functions are declared in the allocator classes but they are defined in other header files. The idea here is that only platforms (i.e. Darwin) that actually need the out-of-process functionality should include the *_oop.h header files. All other platforms simply don't include those header files which means the out-of-process code never ends up inside the sanitizer libraries. In this patch only the unit tests consume the *_oop.h header files so we can ensure that compilation doesn't get broken. So in this patch the new out-of-process code doesn't ever appear in the sanitizer libraries. The intention in later patches is that the Darwin ASan library will also consume the *_oop.h header files but no other platform will.

This partially fixes your concerns because all non-Darwin sanitizer libraries will not contain out-of-process enumerator code. What this does not fix is the maintenance burden of maintaining two separate allocator enumerator implementations. This is unavoidable unless we combine them into an abstracted version of ForEachChunk(...) (and friends) that describes both in and out-of-process implementations.

Please let me know what you think.

delcypher added inline comments.Oct 4 2018, 5:47 AM
lib/sanitizer_common/sanitizer_allocator_bytemap.h
21

I can't do as you suggest because ReadOutOfProcess is a static method so there is no this pointer. However we can make this typedef private if you prefer. I was copying an existing pattern I saw in the size class allocators.

delcypher edited the summary of this revision. (Show Details)

New type safe and unified implementation based on discussions with @kcc.

@kcc Here is a new version of the patch that tries to address your concern of having two separate in-process and out-of-process implementations.

Just to note at our discussions at LLVMDev you wanted the template type (in this case what I call ObjectView) to an allocator parameter. I have not done this because I discovered this is impossible (when ObjectView is templated on the type of the object it is representing) because you end up with mutually recursive types. I.e.

ObjectView<Allocator<Params< ObjectView<Allocator<Params< ... > > > > > >

Instead the methods that need to be used for out-of-process enumeration are templated. This seems like a better choice anyway because it means

  • There are not different allocator data types for in and out-of-process enumeration. This means it is impossible to have a data-layout divergence between the two types because there is only one.
  • It explicitly documents which methods are involved in out-of-process enumeration because they are templated on ObjectView.

I should also note this is not necessarily the final design. I have not finished reworking my internal patch to re-use this new ObjectView interface everywhere yet so there might need to be some design tweaks to accommodate this. However I really wanted to get early feedback on this new design because it has large implications on my work.

delcypher edited the summary of this revision. (Show Details)Oct 30 2018, 11:21 AM

Add missing ALWAYS_INLINE attributes.

kcc added a comment.Oct 30 2018, 3:15 PM

Sorry, this isn't any better than the previous version.
Please understand my position: this is the core code for all the existing sanitizers, scudo, and the new hwasan,
we are planing to have this code in production (in fact, we already have it in production in many places).
There is no way we can maintain code with so much extra stuff in it. The change has to be less intrusive.

Just to note at our discussions at LLVMDev you wanted the template type (in this case what I call ObjectView) to an allocator parameter. I have not done this because I discovered this is impossible (when ObjectView is templated on the type of the object it is representing) because you end up with mutually recursive types. I.e.
ObjectView<Allocator<Params< ObjectView<Allocator<Params< ... > > > > > >

I don't understand this. Let's take some specific class, e.g. SizeClassAllocator64,
and see why ObjectView can't be made part of "Params".

Is it possible to make this change in several iterations where you start from introducing one more template parameter (ObjectView?) that does very little,
then we can iterate extending it.

In D50330#1281420, @kcc wrote:

Sorry, this isn't any better than the previous version.

I don't agree with that but perhaps I misunderstood precisely what you disliked about the previous version of the patch. My understanding was that you didn't like the fact there were duplicate implementation for enumerating allocations, one for in-process (what exists today), and one for out-of-process (what I'm trying to add).
This patch removes the duplicate implementations by unifying the implementations and templating them on a ObjectView type which abstracts the out-of-process operations (allocating memory, copying memory from another process into that allocated memory, deallocating the memory).

Please understand my position: this is the core code for all the existing sanitizers, scudo, and the new hwasan,
we are planing to have this code in production (in fact, we already have it in production in many places).
There is no way we can maintain code with so much extra stuff in it. The change has to be less intrusive.

Okay, I understand your position but I can't easily fix this patch unless you give me more explicit guidance on which parts you do not like. If you point to a particular thing (e.g. The design pattern that is being consistently followed) you don't like I can explain the reasoning behind it and also discuss alternatives.
Enumerating the allocators in an out-of-process mode is complicated and I think it is inevitable that the change will be intrusive.

There are certain places where I have made changes for safety reasons, but they could be dropped. For example the SpaceBeg() instance method has been modified because it is in the call graph of instance methods that are called from ForEachChunk(...). The rule I have been following is that instance methods should not be called directly
because doing so may be unsafe because they might deference memory that does not exist in the out-of-process scenario. A method like SpaceBeg() is actually safe to call directly on an allocator copied for another process but there is nothing enforcing that. Changing SpaceBeg() (in this case it's actually renamed to SpaceBegOOP() and a new method named SpaceBeg() just calls it with the appropriate arguments) to be templated on ObjectView and be a static method means that:

  • There is no this pointer with prevents anyone calling instance methods directly.
  • The fact that the method is templated on ObjectView documents the fact that this method is in the call graph for out-of-process enumeration. This forces the programmer to actually think about what they are doing because the method is used in the in-process and out-of-process code paths.

But as I mentioned earlier SpaceBeg() is currently safe to call directly on an Allocator so I could drop the changes (and to similar methods that are safe to call directly) if you don't like them. I'm not a fan of this because I prefer safety over simplicity.

Just to note at our discussions at LLVMDev you wanted the template type (in this case what I call ObjectView) to an allocator parameter. I have not done this because I discovered this is impossible (when ObjectView is templated on the type of the object it is representing) because you end up with mutually recursive types. I.e.
ObjectView<Allocator<Params< ObjectView<Allocator<Params< ... > > > > > >

I don't understand this. Let's take some specific class, e.g. SizeClassAllocator64,
and see why ObjectView can't be made part of "Params".

Sorry I was incorrect when I said this was "impossible". When I looked this initially I couldn't see a way to do because ObjectView is templated on the type it represents. If you have Params contain a fully instantiated typedef (e.g. ObjectView<SizeClassAllocator<...>>) you end up with
recursive types. To illustrate this consider this

#include <stdlib.h>

template <typename ObjectTy> class LocalObjectView {
private:
  uintptr_t local_address;

public:
  LocalObjectView(ObjectTy *addr)
      : local_address(reinterpret_cast<uintptr_t>(addr)) {}
  ObjectTy *GetLocalAddress() const {
    return reinterpret_cast<ObjectTy *>(local_address);
  }
  size_t GetSize() const { return sizeof(ObjectTy); }
};

template <template ObjectViewT>
class AP64 {
public:
  static const int kSomeConstant = 0;
  typedef ObjectViewT ObjectView;
};

template <typename Params> class SizeClassAllocator {
public:
  typedef SizeClassAllocator<Params> ThisT;
  static const auto kValue = Params::kSomeConstant;
  int foo = 0;

  static void DoSomething(Params::ObjectView* view) {
    view->GetLocalAddress()->foo = 1;
  }
};

int main() {
  // Can't declare the type because of recursive type
  // ??? = SizeClassAllocator<AP64<LocalObjectView<???>>>
  SizeClassAllocator<AP64<LocalObjectView<???>>> allocator;
}

Notice how it's impossible to actually instantiate the SizeClassAllocator in this example. Because of this I went with the approach of templating the methods I needed to change.

I've given this some thought and I've realised that there is a way to make this work by making Params::ObjectView a dependent type as shown below.

#include <stdlib.h>

template <typename ObjectTy> class LocalObjectView {
private:
  uintptr_t local_address;

public:
  LocalObjectView(ObjectTy *addr)
      : local_address(reinterpret_cast<uintptr_t>(addr)) {}
  ObjectTy *GetLocalAddress() const {
    return reinterpret_cast<ObjectTy *>(local_address);
  }
  size_t GetSize() const { return sizeof(ObjectTy); }
};

template <template <typename> class ObjectViewT> class AP64 {
public:
  static const int kSomeConstant = 0;
  template <typename ObjectTy> using ObjectView = ObjectViewT<ObjectTy>;
};

template <typename Params> class SizeClassAllocator {
public:
  typedef SizeClassAllocator<Params> ThisT;
  static const auto kValue = Params::kSomeConstant;
  int foo = 0;

  static void DoSomething(typename Params::template ObjectView<ThisT> *view) {
    view->GetLocalAddress()->foo = 1;
  }
};

int main() {
  SizeClassAllocator<AP64<LocalObjectView>> allocator;
  LocalObjectView<decltype(allocator)> view(&allocator);
  decltype(allocator)::DoSomething(&view);
}

So actually I could make ObjectView be part of the allocator parameters instead of what I've done in this patch.
A nice aspect of this is it means all the static_asserts can be removed (I think). A downside however is that it makes declaring allocator types a bit more cumbersome because
this type will have to appear in three places in the allocator (in the combined, primary, and secondary). Does this seem like the right trade-off?

Is it possible to make this change in several iterations where you start from introducing one more template parameter (ObjectView?) that does very little,
then we can iterate extending it.

I can certainly break this patch up into smaller pieces but I haven't done so because some parts don't make any sense in isolation. If you'd like me to submit in smaller pieces I can, but basically it would be this patch but with ObjectView as an allocator parameter. Is that something you'd accepted or are there more fundamental changes that you want to make?
I'm very wary of putting in the work to breaking up this patch if it's definitely going to be rejected again.

@kcc Just to give you an update on this. @kubamracek and I have been discussing this patch and we've decided that we will produce a much smaller version of this patch that focuses on making changes to just a single allocator and in a simpler way. @kubamracek 's view is that this will be much easier to review because this will illustrate the core design change we'd like to make with much less code that should be easier to critique.

delcypher abandoned this revision.Nov 1 2018, 6:19 AM

Abandoning this change in favour of incremental patches to achieve the same functionality.