This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] Escape symbols when creating std::initializer_list.
ClosedPublic

Authored by NoQ on Jul 10 2017, 11:48 AM.
Tokens
"100" token, awarded by grham389.

Details

Summary

I've seen a few false positives that appear because we construct C++11 std::initializer_list objects with brace initializers, and such construction is not properly modeled. For instance, if a new object is constructed on the heap only to be put into a brace-initialized STL container, the object is reported to be leaked.

Approach (0): This can be trivially fixed by this patch, which causes pointers passed into initializer list expressions to immediately escape.

This fix is overly conservative though. So i did a bit of investigation as to how model std::initializer_list better.

According to the standard, std::initializer_list<T> is an object that has methods begin(), end(), and size(), where begin() returns a pointer to continous array of size() objects of type T, and end() is equal to begin() plus size(). The standard does hint that it should be possible to implement std::initializer_list<T> as a pair of pointers, or as a pointer and a size integer, however specific fields that the object would contain are an implementation detail.

Ideally, we should be able to model the initializer list's methods precisely. Or, at least, it should be possible to explain to the analyzer that the list somehow "takes hold" of the values put into it. Initializer lists can also be copied, which is a separate story that i'm not trying to address here.

The obvious approach to modeling std::initializer_list in a checker would be to construct a SymbolMetadata for the memory region of the initializer list object, which would be of type T* and represent begin(), so we'd trivially model begin() as a function that returns this symbol. The array pointed to by that symbol would be bindLoc()ed to contain the list's contents (probably as a CompoundVal to produce less bindings in the store). Extent of this array would represent size() and would be equal to the length of the list as written.

So this sounds good, however apparently it does nothing to address our false positives: when the list escapes, our RegionStoreManager is not magically guessing that the metadata symbol attached to it, together with its contents, should also escape. In fact, it's impossible to trigger a pointer escape from within the checker.

Approach (1): If only we enabled ProgramState::bindLoc(..., notifyChanges=true) to cause pointer escapes (not only region changes) (which sounds like the right thing to do anyway) such checker would be able to solve the false positives by triggering escapes when binding list elements to the list. However, it'd be as conservative as the current patch's solution. Ideally, we do not want escapes to happen so early. Instead, we'd prefer them to be delayed until the list itself escapes.

So i believe that escaping metadata symbols whenever their base regions escape would be the right thing to do. Currently we didn't think about that because we had neither pointer-type metadatas nor non-pointer escapes.

Approach (2): We could teach the Store to scan itself for bindings to metadata-symbolic-based regions during scanReachableSymbols() whenever a region turns out to be reachable. This requires no work on checker side, but it sounds performance-heavy.

Approach (3): We could let checkers maintain the set of active metadata symbols in the program state (ideally somewhere in the Store, which sounds weird but causes the smallest amount of layering violations), so that the core knew what to escape. This puts a stress on the checkers, but with a smart data map it wouldn't be a problem.

Approach (4): We could allow checkers to trigger pointer escapes in arbitrary moments. If we allow doing this within checkPointerEscape callback itself, we would be able to express facts like "when this region escapes, that metadata symbol attached to it should also escape". This sounds like an ultimate freedom, with maximum stress on the checkers - still not too much stress when we have smart data maps.

Does anybody like/hate any of these solutions or have anything better in mind? I'm personally liking the approach (2) - it should be possible to avoid performance overhead, and clarity seems nice.

Diff Detail

Event Timeline

NoQ created this revision.Jul 10 2017, 11:48 AM
alexander-shaposhnikov added inline comments.
lib/StaticAnalyzer/Core/ExprEngine.cpp
810

explicit ?

NoQ updated this revision to Diff 105897.Jul 10 2017, 11:54 AM
NoQ edited the summary of this revision. (Show Details)

This other diff implements approach (1) through a draft of a checker (that doesn't model much yet). I had to additionally make sure we already have a region to construct metadata for, in spirit of D27202.

For easier comparison, i thought it'd be better to stick it to the same differential revision.

NoQ added inline comments.Jul 10 2017, 11:56 AM
lib/StaticAnalyzer/Core/ExprEngine.cpp
810

This code was moved from below, so i didn't bother changing it. Good point though.

NoQ added a comment.Jul 10 2017, 11:59 AM

Or maybe it wasn't a good idea to make two diffs in one place. Dunno.

test/Analysis/temporaries-callback-order.cpp
28

Apparently, this was caused by the check if the states are equal within processPointerEscapedOnBind, so it was a dry run for checkers.

xazax.hun edited edge metadata.Jul 11 2017, 3:29 AM

At this point, I am a bit wondering about two questions.

  • When should something belong to a checker and when should something belong to the engine? Sometimes we model library aspects in the engine and model language constructs in checkers.
  • What is the checker programming model that we are aiming for? Maximum freedom or more easy checker development?

I think if we aim for maximum freedom, we do not need to worry about the potential stress on checkers, and we can introduce abstractions to mitigate that later on.
If we want to simplify the API, then maybe it makes more sense to move language construct modeling to the engine when the checker API is not sufficient instead of complicating the API.

Right now I have no preference or objections between the alternatives but there are some random thoughts:

  • Maybe it would be great to have a guideline how to evolve the analyzer and follow it, so it can help us to decide in similar situations
  • I do care about performance in this case. The reason is that we have a limited performance budget. And I think we should not expect most of the checker writers to add modeling of language constructs. So, in my opinion, it is ok to have less nice/more verbose API for language modeling if we can have better performance this way, since it only needs to be done once, and is done by the framework developers.
NoQ added a comment.Jul 11 2017, 8:32 AM

These are some great questions, i guess it'd be better to discuss them more openly. As a quick dump of my current mood:

  • To me it seems obvious that we need to aim for a checker API that is both simple and powerful. This can probably by keeping the API as powerful as necessary while providing a layer of simple ready-made solutions on top of it. Probably a few reusable components for assembling checkers. And this layer should ideally be pleasant enough to work with, so that people would prefer to extend it when something is lacking, instead of falling back to the complex omnipotent API. I'm thinking of AST matchers vs. AST visitors as a roughly similar situation: matchers are not omnipotent, but they're so nice.
  • Separation between core and checkers is usually quite strange. Once we have shared state traits, i generally wouldn't mind having region store or range constraint manager as checkers (though it's probably not worth it to transform them - just a mood). The main thing to avoid here would be the situation when the checker overwrites stuff written by the core because it thinks it has a better idea what's going on, so the core should provide a good default behavior.
  • Yeah, i totally care about performance as well, and if i try to implement approach, i'd make sure it's good.
NoQ added a comment.Jul 18 2017, 4:06 AM

Approach (2): We could teach the Store to scan itself for bindings to metadata-symbolic-based regions during scanReachableSymbols() whenever a region turns out to be reachable. This requires no work on checker side, but it sounds performance-heavy.

Nope, this approach is wrong. Metadata symbols may become out-of-date: when the object changes, metadata symbols attached to it aren't changing (because symbols simply don't change). The same metadata may have different symbols to denote its value in different moments of time, but at most one of them represents the actual metadata value. So we'd be escaping more stuff than necessary.

If only we had "ghost fields" (http://lists.llvm.org/pipermail/cfe-dev/2016-May/049000.html), it would have been much easier, because the ghost field would only contain the actual metadata, and the Store would always know about it. This example adds to my belief that ghost fields are exactly what we need for most C++ checkers.

dcoughlin edited edge metadata.Jul 18 2017, 10:23 PM

In this case, I would be fine with some sort of AbstractStorageMemoryRegion that meant "here is a memory region and somewhere reachable from here exists another region of type T". Or even multiple regions with different identifiers. This wouldn't specify how the memory is reachable, but it would allow for transfer functions to get at those regions and it would allow for invalidation.

For std::initializer_list this reachable region would the region for the backing array and the transfer functions for begin() and end() yield the beginning and end element regions for it.

In my view this differs from ghost variables in that (1) this storage does actually exist (it is just a library implementation detail where that storage lives) and (2) it is perfectly valid for a pointer into that storage to be returned and for another part of the program to read or write from that storage. (Well, in this case just read since it is allowed to be read-only memory).

What I'm not OK with is modeling abstract analysis state (for example, the count of a NSMutableArray or the typestate of a file handle) as a value stored in some ginned up region in the store. This takes an easy problem that the analyzer does well at (modeling typestate) and turns it into a hard one that the analyzer is bad at (reasoning about the contents of the heap).

I think the key criterion here is: "is the region accessible from outside the library". That is, does the library expose the region as a pointer that can be read to or written from in the client program? If so, then it makes sense for this to be in the store: we are modeling reachable storage as storage. But if we're just modeling arbitrary analysis facts that need to be invalidated when a pointer escapes then we shouldn't try to gin up storage for them just to get invalidation for free.

NoQ added a comment.Jul 19 2017, 5:56 AM

In this case, I would be fine with some sort of AbstractStorageMemoryRegion that meant "here is a memory region and somewhere reachable from here exists another region of type T". Or even multiple regions with different identifiers. This wouldn't specify how the memory is reachable, but it would allow for transfer functions to get at those regions and it would allow for invalidation.

Yeah, this is what we can easily implement now as a symbolic-region-based-on-a-metadata-symbol (though we can make a new region class for that if we eg. want it typed). The problem is that the relation between such storage region and its parent object region is essentially immaterial, similarly to the relation between SymbolRegionValue and its parent region. Region contents are mutable: today the abstract storage is reachable from its parent object, tomorrow it's not, and maybe something else becomes reachable, something that isn't even abstract. So the parent region for the abstract storage is most of the time at best a "nice to know" thing - we cannot rely on it to do any actual work. We'd anyway need to rely on the checker to do the job.

For std::initializer_list this reachable region would the region for the backing array and the transfer functions for begin() and end() yield the beginning and end element regions for it.

So maybe in fact for std::initializer_list it may work fine because you cannot change the data after the object is constructed - so this region's contents are essentially immutable. For the future, i feel as if it is a dead end.

I'd like to consider another funny example. Suppose we're trying to model std::unique_ptr. Consider:

void bar(const std::unique_ptr<int> &x);

void foo(std::unique_ptr<int> &x) {
  int *a = x.get();   // (a, 0, direct): &AbstractStorageRegion
  *a = 1;             // (AbstractStorageRegion, 0, direct): 1 S32b
  int *b = new int;
  *b = 2;             // (SymRegion{conj_$0<int *>}, 0 ,direct): 2 S32b
  x.reset(b);         // Checker map: x -> SymRegion{conj_$0<int *>}
  bar(x);             // 'a' doesn't escape (the pointer was unique), 'b' does.
  clang_analyzer_eval(*a == 1); // Making this true is up to the checker.
  clang_analyzer_eval(*b == 2); // Making this unknown is up to the checker.
}

The checker doesn't totally need to ensure that *a == 1 passes - even though the pointer was unique, it could theoretically have .get()-ed above and the code could of course break the uniqueness invariant (though we'd probably want it). The checker can say that "even if *a did escape, it was not because it was stuffed directly into bar()".

The checker's direct responsibility, however, is to solve the *b == 2 thing (which is in fact the problem we're dealing with in this patch - escaping the storage region of the object).

So we're talking about one more operation over the program state (scanning reachable symbols and regions) that cannot work without checker support.

We can probably add a new callback "checkReachableSymbols" to solve this. This is in fact also related to the dead symbols problem (we're scanning for live symbols in the store and in the checkers separately, but we need to do so simultaneously with a single worklist). Hmm, in fact this sounds like a good idea; we can replace checkLiveSymbols with checkReachableSymbols.

Or we could just have ghost member variables, and no checker support required at all. For ghost member variables, the relation with their parent region (which would be their superregion) is actually useful, the mutability of their contents is expressed naturally, and the store automagically sees reachable symbols, live symbols, escapes, invalidations, whatever.

In my view this differs from ghost variables in that (1) this storage does actually exist (it is just a library implementation detail where that storage lives) and (2) it is perfectly valid for a pointer into that storage to be returned and for another part of the program to read or write from that storage. (Well, in this case just read since it is allowed to be read-only memory).

What I'm not OK with is modeling abstract analysis state (for example, the count of a NSMutableArray or the typestate of a file handle) as a value stored in some ginned up region in the store.This takes an easy problem that the analyzer does well at (modeling typestate) and turns it into a hard one that the analyzer is bad at (reasoning about the contents of the heap).

Yeah, i tend to agree on that. For simple typestates, this is probably an overkill, so let's definitely put aside the idea of "ghost symbolic regions" that i had earlier.

But, to summarize a bit, in our current case, however, the typestate we're looking for is the contents of the heap. And when we try to model such typestates (complex in this specific manner, i.e. heap-like) in any checker, we have a choice between re-doing this modeling in every such checker (which is something analyzer is indeed good at, but at a price of making checkers heavy) or instead relying on the Store to do exactly what it's designed to do.

I think the key criterion here is: "is the region accessible from outside the library". That is, does the library expose the region as a pointer that can be read to or written from in the client program? If so, then it makes sense for this to be in the store: we are modeling reachable storage as storage. But if we're just modeling arbitrary analysis facts that need to be invalidated when a pointer escapes then we shouldn't try to gin up storage for them just to get invalidation for free.

As a metaphor, i'd probably compare it to body farms - the difference between ghost member variables and metadata symbols seems to me like the difference between body farms and evalCall. Both are nice to have, and body farms are very pleasant to work with, even if not omnipotent. I think it's fine for a FunctionDecl's body in a body farm to have a local variable, even if such variable doesn't actually exist, even if it cannot be seen from outside the function call. I'm not seeing immediate practical difference between "it does actually exist" and "it doesn't actually exist, just a handy abstraction". Similarly, i think it's fine if we have a CXXRecordDecl with implementation-defined contents, and try to farm up a member variable as a handy abstraction (we don't even need to know its name or offset, only that it's there somewhere).

NoQ marked an inline comment as done.Aug 15 2017, 6:27 AM

We've discussed it in person with Devin, and he provided more points to think about:

  • If the initializer list consists of non-POD data, constructors of list's objects need to take the sub-region of the list's region as this-region In the current (v2) version of this patch, these objects are constructed elsewhere and then trivial-copied into the list's metadata pointer region, which may be incorrect. This is our overall problem with C++ constructors, which manifests in this case as well. Additionally, objects would need to be constructed in the analyzer's core, which would not be able to predict that it needs to take a checker-specific region as this-region, which makes it harder, though it might be mitigated by sharing the checker state traits.
  • Because "ghost variables" are not material to the user, we need to somehow make super sure that they don't make it into the diagnostic messages.

So, because this needs further digging into overall C++ support and rises too many questions, i'm delaying a better approach to this problem and will fall back to the original trivial patch.

NoQ updated this revision to Diff 111156.Aug 15 2017, 6:28 AM

Actually update the diff.

dcoughlin added inline comments.Aug 17 2017, 2:35 PM
lib/StaticAnalyzer/Core/ExprEngine.cpp
1083

What is special about ObjCBoxedExpr here? Naively I would have expected that we'd want to keep the old behavior for ObjCArrayLiteral and ObjCDictionary as well.

Other than my question above, this looks good to me.

NoQ added inline comments.Aug 29 2017, 8:49 AM
lib/StaticAnalyzer/Core/ExprEngine.cpp
1083

Yeah, i got it all wrong initially.

If a char * is boxed into NSString, it doesn't escape, even though Val would be the string pointer - which is what i wanted to express here; we already had a test for that.

Similarly, if, say, a C struct is boxed, the pointer to the struct doesn't escape. However, contents of the struct do escape! Conveniently, Val here would be the (lazy) compound value, similarly to std::initializer_list, so we don't need to explicitly avoid escaping the struct pointer itself.

If a C integer is boxed, nothing escapes, of course, until we implement the mythical non-pointer escape (eg. file descriptor integers in stream checker should escape when they get boxed).

For ObjC{Array,Dictionary}Expressions, contents of the array/dictionary do actually escape. However, because only NSObjects can be within such boxed literals, and RetainCountChecker ignores escapes, it is largely irrelevant how we behave in this case - i've even no idea how to test that, so i guess it could be fixed later.

Reference: https://clang.llvm.org/docs/ObjectiveCLiterals.html

NoQ updated this revision to Diff 113101.Aug 29 2017, 8:54 AM

Fix a bit of ObjCBoxedExpr behavior.

NoQ marked an inline comment as done.Aug 29 2017, 9:07 AM
rsmith added a subscriber: rsmith.Aug 29 2017, 5:28 PM

The CXXStdInitializerListExpr node has pretty simple evaluation semantics: it takes a glvalue array expression, and constructs a std::initializer_list<T> from it as if by filling in the two members with a pointer to the array and either the size of the array or a pointer to its end. Can we not model those semantics directly here?

NoQ added a comment.Aug 30 2017, 3:21 AM

The CXXStdInitializerListExpr node has pretty simple evaluation semantics: it takes a glvalue array expression, and constructs a std::initializer_list<T> from it as if by filling in the two members with a pointer to the array and either the size of the array or a pointer to its end. Can we not model those semantics directly here?

Yeah, it's definitely an easy case to perform complete modeling. However, in the analyzer we still lack some infrastructure for C++ library modeling, which was discussed quite a lot above, and i feel we've run out of debating time and still need to fix the false positive. It's important to make our checker APIs more powerful before we jump into modeling myriads of STL classes. The main problem is how to represent the private fields in the std::initializer_list<T> object in our symbolic memory model. Here's a quick summary of the above.

We already have the object's fields in the AST, with its AST record layout, however field names and layouts are implementation-dependent, and it is unsafe to try to understand how the specific standard library implementation's layout works and what specific private fields we see in the AST mean. Instead it seems safer to attach metadata to the object, which would contain values of these private fields - this is a common operation for the analyzer (eg. attach string length metadata to a null-terminated C-string).

However, because such metadata should be handled by an analyzer's checker (we cannot possibly model all APIs in the core, moving it into checkers is much more scalable and natural), and therefore such metadata would be private to the checker in one way or another, it makes checkers fully responsible for modeling such metadata. In particular, the very feature lack of which that was causing the false positive (expressing the fact that if the list is passed into an external function (eg. with no visible body available for analysis), objects stored in it are also "escaping" into a function and can be, for example, deallocated here, so other checkers need to be notified to stop tracking them) cannot currently be implemented on the checker side. If we let checkers express this, it'd require an additional checker callback boilerplate that would need to be repeated in every C++ checker, and we already have a lot of boilerplate. Additionally, when constructing the initializer_list's elements, we already need to have the storage for them, which is hard to handle by the core when such storage is checker-specific.

Another solution would be to let the analyzer core help checkers with the simple boilerplate, which is a long-standing project that is to be took up. The only downside of this approach is that it might make checker APIs less omnipotent, because they can no longer maintain their metadata in arbitrary manners, but only in manners that are understandable by the core.

An alternative solution described here was to let the checkers produce "imaginary" fields within our symbolic memory model, with unknown offsets within the object (similar to objective-c instance variables) that would have benefits of no longer being implementation-defined. The difference with maintaining metadata is that most of the boilerplate would be handled by the memory model (RegionStore) rather than the checker, and the core would generally know that the pointers to the array are located within the object. However, this also raises multiple concerns, because such imaginary "ghost" fields may conflict with the real fields in weird manners, they may accidentally leak to user's diagnostics, and what not. And we still have the problem with checker-specific storage.

So these are the excuses :)

NoQ updated this revision to Diff 116512.Sep 25 2017, 4:19 AM

Fix some comments in tests.

Devin: ok to commit? I hope i didn't miss anything in my reasoning about boxed expressions.

Artem: Sorry it too me so long to review this! For CXXStdInitializerListExpr this looks great. However, I don't think we want ObjCBoxedExprs to escape their arguments. It looks to me like these expressions copy their argument values and don't hold on to them.

lib/StaticAnalyzer/Core/ExprEngine.cpp
1085

Note that we do have other ObjC checkers that rely on escaping of ObjC objects, such as the ObjCLoopChecker and ObjCDeallocChecker. I think having the TODO is great, but I'd like you to remove the the bit about "escapes of ObjC objects aren't so important".

1087

In general, I don't think we want things passed to ObjCBoxedExpr to escape. The boxed values are copied and encoded when they are boxed, so the boxed expression doesn't take ownership of them.

1090

Is this 'if' needed?

test/Analysis/objc-boxing.m
66 ↗(On Diff #116512)

In this case the duped string is not owned by val. NSValue doesn't take ownership of the string, so this *will* leak and we should warn about it.

NoQ added inline comments.Oct 2 2017, 8:02 AM
lib/StaticAnalyzer/Core/ExprEngine.cpp
1085

Hmm, should have double-checked. Will fix.

test/Analysis/objc-boxing.m
66 ↗(On Diff #116512)

I mean, the pointer to the raw string is stored inside the NSValue, and can be used or freed from there. The caller can free this string by looking into the val, even though val itself won't release the pointer (i guess i messed up the comment again). From MallocChecker's perspective, this is an escape and no-warning. If we free the string in this function, it'd most likely cause use-after-free in the caller.

I tested that the string is indeed not strduped during boxing:

$ cat test.m

#import <Foundation/Foundation.h>

typedef struct __attribute__((objc_boxable)) {
  const char *str;
} BoxableStruct;

int main() {
  BoxableStruct bs;
  bs.str = strdup("dynamic string");
  NSLog(@"%p\n", bs.str);

  NSValue *val = @(bs);

  BoxableStruct bs2;
  [val getValue:&bs2];
  NSLog(@"%p\n", bs2.str);

  return 0;
}

$ clang test.m -framework Foundation
$ ./a.out

2017-10-02 17:56:00.004 a.out[17933:1083757] 0x7ffd23407380
2017-10-02 17:56:00.004 a.out[17933:1083757] 0x7ffd23407380

So it's possible to retrieve the exact same pointer from the boxed value. So if val is returned to the caller, like in the test, it shouldn't be freed.

If the NSValue itself dies and never escapes, then of course it's a leak, but in order to see that we'd need to model contents of NSValue.

dcoughlin accepted this revision.Oct 2 2017, 10:08 AM
dcoughlin added inline comments.
test/Analysis/objc-boxing.m
66 ↗(On Diff #116512)

You're absolutely right about this. The documentation for NSValue even states: "Important: The NSValue object doesn’t copy the contents of the string, but the pointer itself. If you create an NSValue object with an allocated data item, don’t free the data’s memory while the NSValue object exists."

Sorry! My mistake.

This revision is now accepted and ready to land.Oct 2 2017, 10:08 AM
In D35216#856415, @NoQ wrote:

We already have the object's fields in the AST, with its AST record layout, however field names and layouts are implementation-dependent, and it is unsafe to try to understand how the specific standard library implementation's layout works and what specific private fields we see in the AST mean.

This is precisely how the rest of the compiler handles CXXStdInitializerListExpr, is how it's intended to be used, and is really the only way it can work if you want to evaluate the members of initializer_list<T> rather than hardcoding their meaning. Consumers of CXXStdInitializerListExpr are expected to handle two cases: 1) the class has no base classes and two fields of type const T* and size_t (begin and size), and 2) the class has no base classes and two fields, both of type const T* (begin and end). (Unfortunately, we reject all other cases during CodeGen rather than in Sema, so you may still see them for now.)

NoQ marked 3 inline comments as done.Oct 4 2017, 10:56 AM

This is precisely how the rest of the compiler handles CXXStdInitializerListExpr

Wow. Cool. I'd see what I can do. Yeah, it seems that this is a great case for us to pattern-match the implementations as well (the problems are still there for other STL stuff).

Do you think this patch should be blocked in favor of complete modeling?
Or i can land that and follow up with complete modeling later; it'd be a larger, but not much to discuss indeed.
Or i can try always "inlining" initializer_list constructor and methods during analysis, if all methods are available in the header in all implementations.

NoQ updated this revision to Diff 117699.Oct 4 2017, 10:58 AM

Escape into array and dictionary literals, add relevant tests. Fix the null statement check.

NoQ added inline comments.Oct 4 2017, 10:59 AM
test/Analysis/objc-for.m
328 ↗(On Diff #117699)

I guess these old tests should be replaced with warnIfReached().

NoQ added a comment.EditedOct 4 2017, 11:00 AM

(whoops accidentally pushed the same reply twice, never mind)

lib/StaticAnalyzer/Core/ExprEngine.cpp
1090

Not sure, wanted to be defensive. It seems that CXXStdInitializerList always contains a single child (InitListExpr) so it's irrelevant if the list is empty, while boxed expressions contain no children when empty, and hanging commas are ignored. Replaced with an assertion just in case.

In D35216#888506, @NoQ wrote:

Do you think this patch should be blocked in favor of complete modeling?

Please, let's get this landed. We can add more precise modeling when the need arises.

This revision was automatically updated to reflect the committed changes.