[analyzer] Improve the modeling of `memset()`.
ClosedPublic

Authored by MTC on Mar 27 2018, 7:36 AM.

Details

Summary

This patch originates from https://reviews.llvm.org/D31868. There are two key points in this patch:

  • Add OverwriteRegion(), this method used to model memset() or something like that.
  • Improve the modeling of memset.

For OverwriteRegion(), is basically invalidate region and bind default. But I think this
method requires more in-depth thinking and more extensive testing.

For evalMemset(), this patch only considers the case where the buffer's offset is zero. And
if the whole region is memseted, bind a default value. According to the value for overwriting,
decide how to update the string length.

For void *memset(void *dest, int ch, size_t count):

  • 1). offset is 0, ch is '\0' and count < dest-buffer's length. Invalidate the buffer and set the string length to 0.
  • 2). offset is 0, ch is '\0' and count == dest-buffer's length. Bind \0 to the buffer with default binding and set the string length to 0.
  • 3). offset is 0, ch is not '\0' and count < dest-buffer's length. Invalidate the buffer and set the string length >= count.
  • 4). offset is 0, ch is not '\0' and count == dest-buffer's length. Bind ch to the buffer and set the string length >= count.

In addition, memset can bind anything to the region, so getBindingForDerivedDefaultValue()'s logic needs some adjustment. The solution in this patch is certainly not correct.

I have tested this patch on sqlite, but there's no difference int the warnings. This patch is very primitive and requires a lot of advice so that it can be perfected.

Thanks in advance for the review!

Diff Detail

Repository
rL LLVM
MTC created this revision.Mar 27 2018, 7:36 AM
MTC edited the summary of this revision. (Show Details)Mar 27 2018, 7:48 AM
NoQ added a comment.Mar 28 2018, 4:00 PM

Why do you need separate code for null and non-null character? The function's semantics doesn't seem to care.

I'd rather consider the case of non-concrete character separately. Because wiping a region with a symbol is not something we currently support; a symbolic default binding over a region means a different thing and it'd be equivalent to invalidation, so for non-concrete character we don't have a better choice than to invalidate. For concrete non-zero character, on the contrary, a default binding would work just fine.

Could you explain why didn't a straightforward bindLoc over a base region wasn't doing the thing you wanted? I.e., why is new Store API function necessary?

NoQ added a comment.Mar 28 2018, 4:02 PM

In addition, memset can bind anything to the region, so getBindingForDerivedDefaultValue()'s logic needs some adjustment. The solution in this patch is certainly not correct.

Yeah, i guess you meant here the thing that i was saying about concrete bindings. If we only limit ourselves to concrete bindings, it should work well.

MTC added a comment.Mar 30 2018, 4:32 AM

Thanks for your review, NoQ!

Why do you need separate code for null and non-null character? The function's semantics doesn't seem to care.

Thank you for your advice, I will remove the duplicate code!

I'd rather consider the case of non-concrete character separately. Because wiping a region with a symbol is not something we currently support; a symbolic default binding over a region means a different thing and it'd be equivalent to invalidation, so for non-concrete character we don't have a better choice than to invalidate. For concrete non-zero character, on the contrary, a default binding would work just fine.

Thank you for your reminding, I overlooked this point. However for non-concrete character, the symbol value, if we just invalidate the region, the constraint information of the non-concrete character will be lost. Do we need to consider this?

Could you explain why didn't a straightforward bindLoc over a base region wasn't doing the thing you wanted? I.e., why is new Store API function necessary?

I am also very resistant to adding new APIs. One is that I am not very familiar with RegionStore. One is that adding a new API is always the last option.

  • The semantics of memset() need default binding, and only bindDefault() can guarantee that. However bindDefault() is just used to initialize the region and can only be called once on the same region.
  • Only when the region is TypedValueRegion and the value type is ArrayType or StructType, bindLoc() can add a default binding. So if we want to continue using default binding, bindLoc() is not the correct choice.

And if I use bindLoc() instead of overwriteRegion(), there will be some crashes occured because bindLoc() broke some assumptions, e.g. bindArray() believes that the value for binding should be CompoundVal, LazyCompoundVal or UnknownVal.

MTC updated this revision to Diff 140405.Mar 30 2018, 4:37 AM

According to @NoQ's suggestion, remove the duplicated code.

MTC updated this revision to Diff 140629.Apr 2 2018, 8:44 AM

Thank you for your reminding, I overlooked this point. However for non-concrete character, the symbol value, if we just invalidate the region, the constraint information of the non-concrete character will be lost. Do we need to consider this?

Sorry for my hasty question, analyzer does not support to bind a symbol with a default binding.

The update of this diff is as follows.

  • If the char value is not concrete, just invalidate the region.
  • Add a test about symbolic char value.
  • A little code refactoring.
MTC added a comment.Apr 2 2018, 8:45 AM

Kindly ping!

Hi Henry. I had a quick look at the patch, here are some remarks.

lib/StaticAnalyzer/Checkers/CStringChecker.cpp
2092 ↗(On Diff #140629)

"So we call"?

2100 ↗(On Diff #140629)

I think we can use the argument type here (which is int).
One more question. Could we use SVal::isZeroConstant() here instead? Do we need the path-sensitivity for the value or we can just check if the value is a constant zero?

2103 ↗(On Diff #140629)

"argument"

2127 ↗(On Diff #140629)

Could you please factor this chunk to a function? It makes the method too big.

lib/StaticAnalyzer/Core/ProgramState.cpp
148 ↗(On Diff #140629)

Do clients of overwriteRegion really need to call checkers?

lib/StaticAnalyzer/Core/RegionStore.cpp
446 ↗(On Diff #140629)

LLVM naming conventions say that function names should start with small letter.

1718 ↗(On Diff #140629)

Could you please explain what is the reason of this change? Do we get new value kinds?

lib/StaticAnalyzer/Core/Store.cpp
72 ↗(On Diff #140629)

Could we make this method pure virtual?

MTC added a comment.May 2 2018, 5:58 AM

Sorry for the long delay, I have just finished my holiday.

Thanks a lot for the review, I will fix the typo in next update.

lib/StaticAnalyzer/Checkers/CStringChecker.cpp
2100 ↗(On Diff #140629)

Yea, it's should be int.

SVal::isZeroConstant() is enough here, thanks!

One question that has nothing to do with the patch, I would like to know if there is a standard to determine whether there is a need for state splitting.

2127 ↗(On Diff #140629)

will do.

lib/StaticAnalyzer/Core/ProgramState.cpp
148 ↗(On Diff #140629)

It is possible that my understanding of the engine is not enough, I think we need to call processRangeChange() for memory change. memset() will overwrite the memory region, so I think there is a need to call this API.

What do you think?

lib/StaticAnalyzer/Core/RegionStore.cpp
1718 ↗(On Diff #140629)

memset() can bind anything to the region with default binding, if there is no such change, it will trigger llvm_unreachable().

I don't think much about it, just put return val; here. Any suggestions?

lib/StaticAnalyzer/Core/Store.cpp
72 ↗(On Diff #140629)

To be honest, I implemented this function according to BindDefault(). And I also think pure virtual is better, thanks.

MTC updated this revision to Diff 145019.May 3 2018, 7:41 AM
  • fix typos
  • code refactoring, add auxiliary method memsetAux()
  • according to a.sidorin's suggestions, remove the useless state splitting.
  • make StoreManager::overwriteRegion() pure virtual
MTC marked 5 inline comments as done.May 3 2018, 7:42 AM
NoQ added a comment.May 4 2018, 4:52 PM

Hmm, ok, it seems that i've just changed the API in D46368, and i should have thought about this use case. Well, at least i have some background understanding of these problems now. Sorry for not keeping eye on this problem.

In D44934#1051002, @NoQ wrote:

Why do you need separate code for null and non-null character? The function's semantics doesn't seem to care.

I guess i can answer myself here:

int32_t x;
memset(&x, 1, sizeof(int32_t));
clang_analyzer_eval(x == 0x1010101); // should be TRUE

I really doubt that we support this case.

So, yeah, zero character is indeed special.

And since zero character is special, i guess we could use the new bindDefaultZero() API, and perform a simple invalidation in the non-zero character case. @MTC, would this be enough for your use case? Or is it really important for you to support the non-zero character case? Cause my example is fairly artificial, and probably i'm worrying too much about it. If nobody really codes that way, i'm fine with supporting the non-zero character case via a simple default binding. In this case we should merge bindDefaultZero() with your overwriteRegion() (i'd prefer your function name, but please keep the empty class check).

lib/StaticAnalyzer/Core/ProgramState.cpp
148 ↗(On Diff #140629)

Well, we need to make sure we don't notify checkers recursively. In bindLoc() we have the notifyChanges parameter, i guess we need to have something similar here as well, if we really need to notify checkers.

Technically, yeah, we need to notify checkers. Like, if you memset() a string to '\0', string checker needs to know that its length has also become 0. Wait, we already are in the string checker, and we're already doing that. I guess it's not that important at the moment, so i'd rather delay it until we need it, and see if we have shared checker states available earlier and if they help.

Also getOwningEngine() is never null.

118 ↗(On Diff #145019)

Note: notifyChanges declared here :)

MTC added a comment.EditedMay 4 2018, 9:30 PM
In D44934#1088771, @NoQ wrote:

Hmm, ok, it seems that i've just changed the API in D46368, and i should have thought about this use case. Well, at least i have some background understanding of these problems now. Sorry for not keeping eye on this problem.

In D44934#1051002, @NoQ wrote:

Why do you need separate code for null and non-null character? The function's semantics doesn't seem to care.

I guess i can answer myself here:

int32_t x;
memset(&x, 1, sizeof(int32_t));
clang_analyzer_eval(x == 0x1010101); // should be TRUE

I really doubt that we support this case.

So, yeah, zero character is indeed special.

Thank you, Artem! I did not consider this common situation. This patch does not really support this situation, in this patch the value of x will be 1, it's not correct!

And since zero character is special, i guess we could use the new bindDefaultZero() API, and perform a simple invalidation in the non-zero character case.

Agree with you. The core problem here is that there is no perfect way to bind default for non-zero value, not the string length stuff. So invalidate the destination buffer but still set string length for non-zero character is OK. Right?

@MTC, would this be enough for your use case? Or is it really important for you to support the non-zero character case? Cause my example is fairly artificial, and probably i'm worrying too much about it. If nobody really codes that way, i'm fine with supporting the non-zero character case via a simple default binding. In this case we should merge bindDefaultZero() with your overwriteRegion() (i'd prefer your function name, but please keep the empty class check).

Based on my limited programming experience, I think memset with non-zero character is common. However given that we can't handle non-zero character binding problems very well, we should now support only zero character. After all, IMHO, correctness of semantic is also important for the analyzer.

I will update this patch to only handle zero character case. In the future, as I am more familiar with the RegionStore, maybe I can solve the problem of non-zero character binding.

MTC updated this revision to Diff 145361.May 5 2018, 3:55 AM
  • Since there is no perfect way to handle the default binding of non-zero character, remove the default binding of non-zero character. Use bindDefaulrZero() instead of overwriteRegion() to bind the zero character.
  • Reuse assume() instead of isZeroConstant() to determine whether it is zero character. The purpose of this is to be able to set the string length when dealing with non-zero symbol character.
NoQ added a comment.May 14 2018, 3:29 PM

Thank you! Looks good overall.

lib/StaticAnalyzer/Checkers/CStringChecker.cpp
1013–1014 ↗(On Diff #145361)

Please say something here (or above) about why do we want our offset to be 0:

We're about to model memset by producing a "default binding" in the Store. Our current implementation - RegionStore - doesn't support default bindings that don't cover the whole base region.

1037 ↗(On Diff #145361)

I think we should use StateNonNullChar (that's computed below) instead of CharVal.isZeroConstant() because the Environment doesn't provide a guarantee that all symbols within it are collapsed to constants where applicable.

1055 ↗(On Diff #145361)

I think this should use IntTy here. Because that's the type of the memset's argument, and that's what assumeZero() expects.

test/Analysis/string.c
1412 ↗(On Diff #145361)

Typo: should :)

MTC added inline comments.May 15 2018, 6:51 AM
lib/StaticAnalyzer/Checkers/CStringChecker.cpp
1037 ↗(On Diff #145361)

Yea, thanks!

1055 ↗(On Diff #145361)

I confirmed again that memset() will convert the value to unsigned char first, see http://en.cppreference.com/w/c/string/byte/memset.

In the next update, I will evalCast(value, UnsignedCharTy, IntTy) first, therefore, I will continue to use UnsignedCharTy.

test/Analysis/string.c
1412 ↗(On Diff #145361)

thanks, will do!

MTC updated this revision to Diff 146816.EditedMay 15 2018, 7:07 AM
  • According to NoQ's suggestion, use assumeZero() instead of isZeroConstant() to determine whether the value is 0.
  • Add test memset26_upper_UCHAR_MAX() and memset27_symbol()
  • Since memset( void *dest, int ch, size_t count) will converts the value ch to unsigned char, we call evalCast() accordingly.
MTC marked 2 inline comments as done.May 15 2018, 8:30 AM
NoQ accepted this revision.May 15 2018, 3:07 PM

Thanks! This looks great now.

lib/StaticAnalyzer/Checkers/CStringChecker.cpp
1055 ↗(On Diff #145361)

Aha, yup, it does convert to unsigned char, but assumeZero() doesn't. The new code looks correct.

This revision is now accepted and ready to land.May 15 2018, 3:07 PM
Closed by commit rL332463: [analyzer] Improve the modeling of memset(). (authored by henrywong, committed by ). · Explain WhyMay 16 2018, 5:41 AM
This revision was automatically updated to reflect the committed changes.