This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] Fix bugreporter::getDerefExpr() again.
ClosedPublic

Authored by NoQ on Aug 22 2017, 1:59 PM.

Details

Summary

This patch continues work that was started in D32291.

Our bugreporter::getDerefExpr() API tries to find out what has been dereferenced. For example, if we have an lvalue expression x->y.z which causes a null dereference when dereferenced, the function returns lvalue x->y - the object from which the null pointer must have been loaded. Similarly, unwrapping lvalue x->y would result in x.

I believe i found a more correct way to implement it, namely to see where lvalue-to-rvalue casts are located in the expression. In our example, x->y is surrounded by an lvalue-to-rvalue cast, which indicates that we should not unwrap the expression further. And it is irrelevant whether the member expression is a dot or an arrow, or whether C++ this-> or ObjC self-> is written explicitly or assumed implicitly, or whether the expression or a sub-expression is a pointer or a reference (we used to look at these).

This patch refactors getDerefExpr() with this design in mind. Now the function must be much easier to understand, and also behave correctly.

Unwrapping of binary operators that caused the dereference (eg. *x = 2 -> *x) was removed from getDerefExpr() because it contradicts its purpose and seems to have never actually been used (we should be receiving *x in this function instead in all cases).

Current implementation has the benefit of not crashing on the newly added test case. The crash was caused by the fact that the old getDerefExpr() was thinking that self was dereferenced, even though in fact it wasn't.

I should probably have a look at what else might have changed and add more test cases, because the old code was quite strange.

Diff Detail

Repository
rL LLVM

Event Timeline

NoQ created this revision.Aug 22 2017, 1:59 PM
a.sidorin edited edge metadata.Aug 23 2017, 1:28 AM

Hi Artem,
I have a question after quick look. The original code considered ParenExprs but I cannot find nothing paren-related in the patch. Is case (x->y).z handled as expected?

NoQ added a comment.Aug 23 2017, 1:29 AM

Yeah, line 86.

Sorry, missed that.

xazax.hun accepted this revision.Aug 28 2017, 3:46 AM

Generally, it looks good to me. Though it looks like some of the cases covered in the code do not have corresponding tests (e.g.: the parenexprs).
I think this approach is good in a sense there should be such a cast at the interesting places. But I wonder if there could be some cases where we have such cast earlier than expected.

This revision is now accepted and ready to land.Aug 28 2017, 3:46 AM
NoQ added a comment.Aug 28 2017, 4:03 AM

Thank you for the review!

Though it looks like some of the cases covered in the code do not have corresponding tests (e.g.: the parenexprs).

These are covered by tests in inline-defensive-checks.c:150,156,169,179 (old code had IgnoreParenCasts). This function is actually used a lot (even more so since D32291), and seems fairly well-tested, so i felt relatively safe when refactoring it.

But I wonder if there could be some cases where we have such cast earlier than expected.

I'd love to know :o

In D37023#853941, @NoQ wrote:

Thank you for the review!

Though it looks like some of the cases covered in the code do not have corresponding tests (e.g.: the parenexprs).

These are covered by tests in inline-defensive-checks.c:150,156,169,179 (old code had IgnoreParenCasts). This function is actually used a lot (even more so since D32291), and seems fairly well-tested, so i felt relatively safe when refactoring it.

Oh, indeed, thanks :)

But I wonder if there could be some cases where we have such cast earlier than expected.

I'd love to know :o

I started to check some of the cases I was thinking about, but all of them looked fine :)

dcoughlin edited edge metadata.Aug 29 2017, 5:49 PM

This seems like it is on the right track, but I think the emphasis here on lvalue-to-rvalue casts is a bit misplaced. The logic for "what is the pointer being dereferenced" and "what is the lvalue that holds the pointer being dereferenced" is mixed together in a way that I find confusing.

I think an easier to understand approach is to first find the rvalue of the pointer being dereferenced. Then, second, pattern match on that to find the underlying lvalue the pointer was loaded from (when possible).

But first a couple of points just to make sure we're on the same page (with apologies for the wall of text!). Suppose we have:

struct Y {
   ...
   int z;
};

struct X {
  ...
  Y y;
};

and a local 'x' of type X *.

For an access of the form int l = x->y.z the pointer being dereferenced is the rvalue x. The dereference ultimately results in a load from memory, but the address loaded from may be different from the the pointer being dereferenced. Here, for example, the address loaded from is (the rvalue of) x + "offset of field y into X" + "offset of field z into Y".

Fundamentally, given an expression representing the lvalue that will be loaded from, the way to find the rvalue of the pointer being dereferenced is to strip off the parts of the expression representing addition of offsets and any identity-preserving casts until you get to the expression that represents the rvalue of the base address. (This is why stripping off unary * makes sense -- in an lvalue context it represents adding an offset of 0. This is also why stripping off lvalue-to-rvalue casts doesn't make sense -- they do not preserve identity nor add an offset)

For int l = x->y.z, the expression for the pointer being dereferenced is the lvalue-to-rvalue cast that loads the value stored in local variable x from the lvalue that represents the address of the local variable x. But note that the pointer being dereferenced won't always be an lvalue-to-rvalue cast. For example in int l = returnsPointerToAnX()->y.z the expression for the pointer being dereferenced is the call expression returnsPointerToAnX(). There is no lvalue in sight.

Now, the existing behavior of getDerefExpr() requires that it return the lvalue representing the location containing the dereferenced pointer when possible. This is required by trackNullOrUndefValue() (sigh). So I suggest, for now, first finding the rvalue for the dereferenced value and then adding a fixup at the end of getDerefExpr() that looks to see whether the rvalue for the dereferenced pointer is an lvalue-to-rvalue cast. If so, the fixup will return the sub expression representing the lvalue.

lib/StaticAnalyzer/Core/BugReporterVisitors.cpp
46 ↗(On Diff #112218)

An interesting aspect of this function is that sometimes it returns a sub-expression that represents the pointer rvalue and sometimes it returns a sub-expression that represents the lvalue whose location contains the pointer which is being dereferenced.

I believe the reason we need the lvalue is that trackNullOrUndef() tracks lvalues better than rvalues. (That function loads from the lvalue to get the symbol representing the dereferenced pointer value.)

This behavior is pretty confusing, so we should document it in the comment.

48 ↗(On Diff #112218)

This comment is not right. For x->y.z = 2 the answer should be x and not x->y.

alexfh added a subscriber: alexfh.Sep 18 2017, 7:04 AM

Any news here? I'm wondering mainly because this patch is supposed to fix https://bugs.llvm.org/show_bug.cgi?id=34373.

NoQ updated this revision to Diff 116542.Sep 25 2017, 6:24 AM

@dcoughlin: You're right, my reasoning and understanding was not correct, and your explanation is much more clear. My code still makes sense to me though, so i updated the comments to match. And moved the unusual logic for the lvalue-to-rvalue cast unwrap to the bottom of the function.

Essentially, getDerefExpr() tries its best to explain where the null pointer comes from, on the syntactic level, by unpacking expressions that wrap the expression from which the pointer "originates", where "originates" is understood in a vague manner defined only by what the callers of this utility function expect to see. This leaves us with a few kinds of expressions that we need to unwrap, and we shouldn't touch other expressions, leaving pattern-matching over them to the caller.

There are other facilities that work on non-syntactic level, like your example where we might jump from function expression to return statement within the callee if the callee was inlined, or how getNilReceiver() function unwraps ObjCMessageExpr to the self pointer *iff* it self was nil during symbolic execution (which implies that the whole message expression is nil).

So that's right - lvalue-to-rvalue casts are not "the whole point", but merely an example of an expression kind that we do not have a right to unwrap. In fact, the callers still expect us to unwrap it once, but immediately stop afterwards. This inconsistency should probably be fixed, but getDerefExpr is used by checkers, and current code in checkers seems clean and concise enough.

I had a look at other cast kinds in OperationKinds.def, and all of them seemed as if they're worth unwrapping, apart from, of course, CK_LValueToRValue.

@xazax.hun: So, i guess, if the cast is earlier than we'd expect, then we'd just fail to unwrap something. So we won't find where the pointer comes from, and give up prematurely on our pattern-matching, while looking at roughly the same expression/value, plus-minus offsets. It sounds better than the case when the cast is later than we'd expect, where we may accidentally unwrap wrong stuff and start tracking a completely unrelated value, so i hope it shouldn't be a problem. Thanks for sharing the concern - the AST is huge and very few people possess really deep understanding of it, so any curious findings about it are really nice to share.

dcoughlin accepted this revision.Sep 26 2017, 5:02 PM

LGTM! Thanks Artem.

This revision was automatically updated to reflect the committed changes.

Thank you, Artem!