This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
clang/
-
lib/StaticAnalyzer/Checkers/
-
StaticAnalyzer/
-
Checkers/
2
ArrayBoundCheckerV2.cpp
-
test/Analysis/
-
Analysis/
-
out-of-bounds-false-positive.c

Differential D86874

[analyzer] Fix ArrayBoundCheckerV2 false positive regarding size_t indexer
Needs ReviewPublic

Authored by steakhal on Aug 31 2020, 5:18 AM.

Download Raw Diff

Details

Reviewers

NoQ
vsavchenko
Szelethus
martong
balazske
baloghadamsoftware

Summary

Fixes the bug.

typedef unsigned long long size_t;
const char a[] = "aabbcc";
char f1(size_t len) {
  return a[len+1]; // false-positive warning: Out of bound memory access (access exceeds upper limit of memory block)
}

Previousl, the analyzer did these things:

Calculates the RawOffset of the access, resulting in: BaseRegion: a and ByteOffset: len + 1
Checks the lower bound via transforming (simplifying) the question len + 1 < 0 into len < -1. However, the analyzer can not prove nor disprove this, so it assumes that it holds. This assumption will constrain the len to be UINT_MAX since the < operator will promote the -1 into UINT_MAX.
Checks the upper bound via transforming (simplifying) the question len + 1 >= 7 into len < 6. Since the analyzer perfectly constrained len at the previous step, we wrongly diagnose an out-of-bound error.

Proposed solution:
Skip the lower bound check if the simplified root expression (in the current example len) is unsigned and the simplified index holds a negative value.
We know for sure that no out-of-bound error can happen in this part, since how could an unsigned symbolic value be less than a negative constant?
Note that we don't deal with wrapping here.

I hope that this fix gets this checker closer to the stage when it can leave alpha.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

steakhal created this revision.Aug 31 2020, 5:18 AM

Herald added a project: Restricted Project. · View Herald TranscriptAug 31 2020, 5:18 AM

Herald added subscribers: cfe-commits, ASDenysPetrov, Charusso and 9 others. · View Herald Transcript

steakhal requested review of this revision.Aug 31 2020, 5:18 AM

steakhal added a parent revision: D86873: [analyzer][NFC] Refactor ArrayBoundCheckerV2.

Harbormaster completed remote builds in B70087: Diff 288926.Aug 31 2020, 5:59 AM

Note that we don't deal with wrapping here.

Wrapping? Please elaborate.

clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
226	I really feel that this check would have a better place in the implementation of `eval`. This seems really counter-intuitive to do this stuff at the Checker's level. Is there any reason why we can't do this in `eval`? `evalBinOpNN` could return with Unknown, and the state should remain unchanged. Am I missing something?

baloghadamsoftware added inline comments.Sep 2 2020, 8:24 AM

clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
226	I agree here. Actually, the constraint manager should handle this.

len + 1 < 0

What are types of each term in this expression?

Like, if everything here was promoted to size_t, as it should be according to the language rules, then this comparison is trivially false and our constraint manager is definitely smart enough to figure this out.

into len < -1

And this is an incorrect transformation assuming everything is of type size_t. The original statement was trivially false whereas this statement is true for almost every value - except, well, SIZE_MAX. If it's the checker doing this, it's doing something wrong.

In D86874#inline-803844, @martong wrote:

I really feel that this check would have a better place in the implementation of eval. This seems really counter-intuitive to do this stuff at the Checker's level. Is there any reason why we can't do this in eval?
evalBinOpNN could return with Unknown, and the state should remain unchanged. Am I missing something?

Ah, sort of.
To make everything clear, I have to provide examples, building-blocks, reasoning and stuff, so please don't be mad if it's too long.
I hope you have a warm cup of tee to read through this - seriously - you will need that!

Diving in

It is really important to make a clear distinction between evaluating an expression according to the semantics of the meta-language or of the object-language, because we might get different answers for the same expression.

In fact, evalBinOpNN evaluates expressions according to the semantics of the object-language.

In our context, meta-language is mathematics, and the object-language is the semantics of the abstract machine of the c/c++ language.
In mathematics, there is no such thing as overflow or value ranges, unlike in C++.

Let's see an example:
Assuming that x is in range [1,UINT_MAX].
x + 1 will be in range [2,ULL_MAX+1] in the meta-language.
x + 1 will be in range [0,0]u[2,UINT_MAX] or in [2,UINT_MAX+1] weather the type of x is capable representing the (+1) value and the overflow is well-defined or not.

Another valuable comment is that non-contiguous ranges (range sets) are not really useful.
Knowing that x is in range [0,0]u[2,UINT_MAX] doesn't help much if you want to prove that:
x < 5 holds for all possible interpretations of x.
We can not disproof that either.

So, overflow/underflow can really screw the value ranges, preventing us from evaluating expressions over relational operators.
Which is technically accurate, but not satisfiable in all cases - like in the checker ArrayBoundCheckerV2.

Before describing why is it so problematic in this checker, I should give an overview, how this checker works.

Overview of the logic of the ArrayBoundCheckerV2

The checker checks Location accesses (aka. pointer dereferences).
We construct the RegionRawOffsetV2 object of the access (Which consists of the base region, and the symbolic byte offset expression of the access).

Then we check, whether we access an element preceding the first valid index of the base memory region.
In other words, we check if the byte offset symbolic expression is less then 0:

If YES: Report that we access an out-of-bounds element.
If NO: Infer the range constraints on the symbol and add the constraint to make this array access valid.
If MAYBE: Infer and constrain, just as you would do in the previous case.

Then we check, whether we access an element exceeding the last valid index of the memory region.
In other words, we check if the byte offset symbolic expression is greater then or equal to the extent of the region:

If YES: Report the bug...
If NO: Infer & constrain...
If MAYBE: Infer & constrain...

However, in the checker, we don't check byte offset < 0 directly.
We simplify the left part of the byte offset < 0 inequality by substracting/dividing both sides with the constant C, if the byte offset is a SymintExpr of SymExpr OP C over the plus or multiplication operator (OP).
We do this simplification recursively.
In the end, we will have a symbolic root (call it RootSym) expression and a C' constant over the right-hand side of the original relational operator.
So, after the simplification process we get the RootSym < C' inequality, which we check.

This simplification is the infer operation in the pseudo-code.
And the constrain step is where we assume that the negation of RootSym < C' holds.

SPOILER: This simplification process is only valid using the semantics of the meta-language.

ElementRegions

Pointer values, which describe a particular element of a memory region, can get quite complex.
Even more complex, when we reinterpret cast a pointer to a different type.
In such cases, we might wrap the SubRegion symbol within an ElementRegion with the given target type and offset 0.
Eg:

void foo(int x) {
  int buf[10];
  char *p = (char*)(buf + 2) + 3;
                          ^-- 2*sizeof(int) in bytes which is 8.
  *p = 42; // Not UB.
  int *q = (int*)p; // unaligned pointer!
  *q = 5; // UB.

  char *r = (char*)(buf + x) + 3;
  *r = 66; // Might be safe, if x has a value to make it so.
}

RegionRawOffsetV2

In the previous example p would have the &Element{buf,11 S64b,char} SymbolRegionVal (SVal instance) associated with.
And q: &Element{Element{buf,11 S64b,char},0 S64b,int}.

However, the RegionRawOffsetV2 of both p and q would be the same: {BaseRegion: buf, ByteOffset: 11}
Therefore,RegionRawOffsetV2 gives us a unified view of the memory we are dealing with.
It is also useful dealing with pointer aliasing, just in the example with p and q.

In other cases, where we index with a symbol, the ByteOffset might contain a complex symbolic expression, instead of a ConcreteInt.
Eg: the RegionRawOffsetV2 of buf[x+1] will be {BaseRegion: buf, ByteOffset: x+1}.
It's important to note that now the symbolic expression represented by the ByteOffset, is an expression in the object-language, where we have to deal with overflows, underflows, implicit conversions, promotions another nasty stuff according to the abstract machine.

Note: I don't know what RegionRawOffset is, or what that does.

ElementRegions, again

ElementRegions are might be nested as in this example:

void foo(int x) {
  int buf[10][3];
  int *p = &buf[1][2];
  clang_analyzer_dump(p); // &Element{Element{buf,1 S64b,int [3]},2 S64b,int}
  clang_analyzer_DumpRegionRawOffsetV2(p); // raw_offset_v2{buf,20 S64b}

  int *q = &buf[x+1][6];
  clang_analyzer_dump(q); // &Element{Element{buf,(reg_$0<int x>) + 1,int [3]},6 S64b,int}
  clang_analyzer_DumpRegionRawOffsetV2(q); // raw_offset_v2{buf,(((reg_$0<int x>) + 1) * 12) + 24}
}

In the example, the subscript expressions are expressions in the object-language.
So, when we touch (evaluate, simplify, reorder, etc.) the (reg_$0<int x>) + 1 we should thouch them just as the abstract machine would.

Calculation of the RegionRawOffsetV2

Let's see how does these subscript expressions used during the calculation of the RegionRawOffsetV2:
For multi-dimension arrays we fold the index expressions of the nested ElementRegions.
We are doing that by simply calculating the byte-offset of the nested region, and adding the current level's index*sizeof(Type).
So, the ByteOffset that we build up will contain mostly multiplications by a constant OR additions of symbolic expressions (like the x+1 in the example).
During this folding operation we use the semantics of the object-language in the code.

The resulting `RegionRawOffsetV2` for `p`, `q` will be:
p: {BaseRegion: `buf`, ByteOffset: `20 S64b`}
q: {BaseRegion: `buf`, ByteOffset: `(((reg_$0<int x>) + 1) * 12) + 24`}
                                      ^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^
                                                   |           |
`@a` This is an //object-language// expression. --/            |
                                                              /
`@b` The rest should be //meta-language// expression. -------/

SPOILER: This distinction is really important here.

So, we made an expression, we should not evaluate in either of the worlds.
We should not evaluate it using the semantics of the object-language since only the @a is of that world.
We should model overflow in @a if that expression is of unsigned type (OR signed but -fwrapv...) etc. according to the abstract machine.
BUT modeling the possibility of an overflow of the rest of the expression (@b) would be a flaw.

This reasoning is the same for the other way around.

In general, such mixed expressions make no sense at all.
Trying to evaluate such would lead to esoteric bugs just like the mentioned bug in the summary.

Simplify step, again

Simplification of the `(((reg_$0<int x>) + 1) * 12) + 24` < `0` inequality...
                                              ^^^^^^^^^^
                                                 `@b`
Result:  `reg_$0<int x> < -3 S64b`
          ^^^^^^^^^^^^^   ^^^^^^^
`RootSym` --/                |
                            /
`C'` ----------------------/

#1: This step supposed to fold ONLY the meta-language expression parts (@b) of the previously aquired ByteOffset.
#2: This step requires the expression under simplified to be of meta-language to be able to reorder the constants. (aka. to fold into the right hand side's constant).
#3: This also requires the right-hand side to be of the meta-language.

Cause of the bug

We don't respect the #1 and the #3 requirements of the simplification step.

We treat the complete expression under simplification as an expression of the meta-language.
I'm not changing this, but I probably should though.

Ideally, after simplification we should have this inequality: x+1 < -2
That way we would not fold any subexpression of the object-language, so the #1 requirement would be preserved.
The right-hand side is of an expression of the meta-language.
It makes sense, to evaluate the operator< as the semantics of the meta-language.
But the left-hand side is an expression of the object-language.
We need some sort of logical conversion here.

If x+1 is unsigned then there is no way that expression could result a negative value, so it will be always less then -2.
Therefore, we proved that no out-of-bounds buffer access can happen before the first valid index (0).

Otherwise, we evaluate the inequality according to the object-language.
We call the evalBinOpNN to accomplish this query.
That function, however, should respect the -fwrapv, overflow/underflow, and other nasty things of C++ but I don't care how that is implemented.
The checker's job is done at this point.

Note that, if the inequality is of the form x - C < C' after the ideological-simplification step, we have to deal with underflow similarly.
In this patch, I haven't done it though. Since currently, we just fold C into C'...

If the LHS is signed, ehm, I would have to think it through - but I'm too tired of writing this...

The proposed fix

I know that this doesn't fix the root of the problem, but we have to start somewhere :D

Check if the resulting folded constant (C') is negative or not.
If it is negative and the type of the root expression (RootSym) has unsigned type then I state that this inequality doesn't hold for any (unsigned) interpretation of RootSym.
So just move on and check the upper-bound...

I'm not sure if this fixes all the false-positives, but certainly removes some of them.
I'm not an expert in abstract interpretation and in formal verification.
I hope you enjoyed reading this, now you should take a rest too!

Hi Balázs,

Since reviews.llvm.org is offline, I am sending my comments below, inline.
Thanks for your huge effort in explaining all this!

Overall, I have a feeling that this approach targets only one specific
case, which is fine. But I believe we should think about all the other
possible cases, so we could get rid of other false positives too:

In case of multidimensional arrays, there may be a symbolic value in any

dimension.

What if there are more symbolic values in the dimensions.

Cheers,
Gábor

In D86874#2255990, @martong wrote:

Hi Balázs,

Since reviews.llvm.org is offline, I am sending my comments below, inline.
Thanks for your huge effort in explaining all this!

Overall, I have a feeling that this approach targets only one specific
case, which is fine. But I believe we should think about all the other
possible cases, so we could get rid of other false positives too:

In case of multidimensional arrays, there may be a symbolic value in any

dimension.

Yes, obviously - but it's not a problem. See my next comment.

What if there are more symbolic values in the dimensions.

It stops the simplification process on the very first SymExpr which is not a SymIntExpr. This simplification is done on a best effort basis only.

Another interesting fact is that we don't generate nested ElementRegions too frequently, so don't have to deal with "What if there are more symbolic values in the dimensions." :D
The last two lines of the following example are particularly interesting, I'm curious why we do that.

Let's see some examples:

void foo(int x, int y) {
  int buf[10][3];
  clang_analyzer_dump(&buf[1][2]);   // &Element{Element{buf,1 S64b,int [3]},2 S64b,int}
  clang_analyzer_dump(&buf[1][y]);   // Unknown
  clang_analyzer_dump(&buf[x][2]);   // &Element{Element{buf,reg_$1<int x>,int [3]},2 S64b,int}
  clang_analyzer_dump(&buf[x][y]);   // Unknown
  clang_analyzer_dump(&buf[1][y+1]); // Unknown
  clang_analyzer_dump(&buf[x][y+1]); // Unknown
  clang_analyzer_dump(&buf[x+1][2]); // &Element{Element{buf,(reg_$1<int x>) + 1,int [3]},2 S64b,int}
  clang_analyzer_dump(&buf[1+x][2]); // &Element{Element{buf,(reg_$1<int x>) + 1,int [3]},2 S64b,int}
  clang_analyzer_dump(&buf[x+1][y+2]); // Unknown

  // Another surprise is that if we assign the pointer value to a variable, we get different results...
  int *p = &buf[1][x+1];
  clang_analyzer_dump(p);            // &SymRegion{conj_$2{int *, LC1, S1740, #1}}
  clang_analyzer_dump(&buf[1][x+1]); // Unknown
}

I am coping my comments that I've sent in mail, just in case.

However, in the checker, we don't check byte offset < 0 directly.
We simplify the left part of the byte offset < 0 inequality by substracting/dividing both sides with the constant C, if the byte offset is a SymintExpr of SymExpr OP C over the plus or multiplication operator (OP).
We do this simplification recursively.
In the end, we will have a symbolic root (call it RootSym) expression and a C' constant over the right-hand side of the original relational operator.
So, after the simplification process we get the RootSym < C' inequality, which we check.

Just for the record, this is in getSimplifiedOffsets.

Note: I don't know what RegionRawOffset is, or what that does.

Seems like that is just a pair of a MemRegion plus a concrete int Offset. And this is the return type for ElementRegion::getAsArrayOffset() where we handle only nonloc::ConcreteInts. So, this is similar to RegionRawOffsetV2, but without the possibility of the symbolic index value.

Calculation of the RegionRawOffsetV2

Let's see how does these subscript expressions used during the calculation of the RegionRawOffsetV2:
For multi-dimension arrays we fold the index expressions of the nested ElementRegions.
We are doing that by simply calculating the byte-offset of the nested region, and adding the current level's index*sizeof(Type).
So, the ByteOffset that we build up will contain mostly multiplications by a constant OR additions of symbolic expressions (like the x+1 in the example).

We have these lines in the case of handling the ElementRegion:

if (!index.getAs<NonLoc>())
  return RegionRawOffsetV2();

So, if the index is symbolic we give up, don't we? So, how could this work with x+1? What am I missing here?

The resulting `RegionRawOffsetV2` for `p`, `q` will be:
p: {BaseRegion: `buf`, ByteOffset: `20 S64b`}
q: {BaseRegion: `buf`, ByteOffset: `(((reg_$0<int x>) + 1) * 12) + 24`}
                                      ^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^
                                                   |           |
`@a` This is an //object-language// expression. --/            |
                                                              /
`@b` The rest should be //meta-language// expression. -------/
SPOILER: This distinction is really important here.

So, we made an expression, we should not evaluate in either of the worlds.
We should not evaluate it using the semantics of the object-language since only the @a is of that world.
We should model overflow in @a if that expression is of unsigned type (OR signed but -fwrapv...) etc. according to the abstract machine.
BUT modeling the possibility of an overflow of the rest of the expression (@b) would be a flaw.

Why? I'd expect that the value of the whole expression @a@b could overflow.

Simplify step, again
Simplification of the `(((reg_$0<int x>) + 1) * 12) + 24` < `0` inequality...
                                              ^^^^^^^^^^
                                                 `@b`
Result:  `reg_$0<int x> < -3 S64b`
          ^^^^^^^^^^^^^   ^^^^^^^
`RootSym` --/                |
                            /
`C'` ----------------------/
#1: This step supposed to fold ONLY the meta-language expression parts (@b) of the previously aquired ByteOffset.
#2: This step requires the expression under simplified to be of meta-language to be able to reorder the constants. (aka. to fold into the right hand side's constant).
#3: This also requires the right-hand side to be of the meta-language.

Do I understand this correctly, that none of the binary operations can have a symbolic RHS, because that would mean we have a VLA?

We treat the complete expression under simplification as an expression of the meta-language.
I'm not changing this, but I probably should though.

Again, I don't understand why you are sure that the value of whole expression cannot overflow.

Ideally, after simplification we should have this inequality: x+1 < -2
That way we would not fold any subexpression of the object-language, so the #1 requirement would be preserved.
The right-hand side is of an expression of the meta-language.
It makes sense, to evaluate the operator< as the semantics of the meta-language.
But the left-hand side is an expression of the object-language.
We need some sort of logical conversion here.

What if the second index is also symbolic? E.g buf[x+1][y+1]?
This would result in (((reg_$0<int x>) + 1) * 12) + (reg_$1<int y>) + 1) < 0 . And after simplification, the RHS cannot be interpreted as meta.

Check if the resulting folded constant (C') is negative or not.

What happens if the symbolic index is not the most inner index? E.g. buf[6][x+1]. Then C is no longer constant, and no longer meta ...

About the whole raw offset and the related warning. There is a fundamental question: Should we warn at &a[0][10] ?

void foo() {
  int a[10][10];
  int *p0 = &a[9][9];   // OK
  int *p1 = &a[10][10]; // Out-of-bounds
  static_assert(&a[0][10] == &a[1][0]);
  int *p2 = &a[0][10];  // Syntactically (or technically) out-of-bounds, We should warn here !(?)
  int *p3 = &a[1][0];   // Neither syntactically nor semantically out-of-bounds, but it aliases with p2 and p2 is flawed
}

I can only imagine how long it took for you to write all this, because reading it wasn't that short either. Thank you so much! It really gave me a perspective on how you see this problem, as well as what is actually happening (and should happen) in the checker.

In D86874#2254638, @steakhal wrote:

void foo(int x) {
  int buf[10];

  char *r = (char*)(buf + x) + 3;
  *r = 66; // Might be safe, if x has a value to make it so.
}

I suppose you mean that it might be unsafe on the grounds of out-of-bounds addressing, but alignment-wise its okay?

In D86874#2256249, @martong wrote:
The resulting `RegionRawOffsetV2` for `p`, `q` will be:
p: {BaseRegion: `buf`, ByteOffset: `20 S64b`}
q: {BaseRegion: `buf`, ByteOffset: `(((reg_$0<int x>) + 1) * 12) + 24`}
                                      ^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^
                                                   |           |
`@a` This is an //object-language// expression. --/            |
                                                              /
`@b` The rest should be //meta-language// expression. -------/
SPOILER: This distinction is really important here.

So, we made an expression, we should not evaluate in either of the worlds.
We should not evaluate it using the semantics of the object-language since only the @a is of that world.
We should model overflow in @a if that expression is of unsigned type (OR signed but -fwrapv...) etc. according to the abstract machine.
BUT modeling the possibility of an overflow of the rest of the expression (@b) would be a flaw.
Why? I'd expect that the value of the whole expression @a@b could overflow.

That wasn't all too clear to me either. It seems like there is extra work to do if we do the calculation according to C++ rules, but not the other way around, so we could just regard everything as if it was under the object-language ruleset.

In D86874#2256249, @martong wrote:
Calculation of the RegionRawOffsetV2

Let's see how does these subscript expressions used during the calculation of the RegionRawOffsetV2:
For multi-dimension arrays we fold the index expressions of the nested ElementRegions.
We are doing that by simply calculating the byte-offset of the nested region, and adding the current level's index*sizeof(Type).
So, the ByteOffset that we build up will contain mostly multiplications by a constant OR additions of symbolic expressions (like the x+1 in the example).

We have these lines in the case of handling the ElementRegion:
if (!index.getAs<NonLoc>())
  return RegionRawOffsetV2();
So, if the index is symbolic we give up, don't we? So, how could this work with x+1? What am I missing here?

In my example, I'm speaking about the case when it's nonloc::SymbolVal wrapping the expression x+1. So not every NonLoc is ConcreteInt.

The resulting `RegionRawOffsetV2` for `p`, `q` will be:
p: {BaseRegion: `buf`, ByteOffset: `20 S64b`}
q: {BaseRegion: `buf`, ByteOffset: `(((reg_$0<int x>) + 1) * 12) + 24`}
                                      ^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^
                                                   |           |
`@a` This is an //object-language// expression. --/            |
                                                              /
`@b` The rest should be //meta-language// expression. -------/
SPOILER: This distinction is really important here.

So, we made an expression, we should not evaluate in either of the worlds.
We should not evaluate it using the semantics of the object-language since only the @a is of that world.
We should model overflow in @a if that expression is of unsigned type (OR signed but -fwrapv...) etc. according to the abstract machine.
BUT modeling the possibility of an overflow of the rest of the expression (@b) would be a flaw.
Why? I'd expect that the value of the whole expression @a@b could overflow.

Hm, you might be right about that.
This simplification (over the plus operator) is only valid if no overflow can happen.
We could add the necessary constraints during this constant folding operation such as:
Simplifying the expression x+1 < 0 into x<-1 assuming that x <= numeric_limit_max(typeof(x)) - 1
We could repeat this, while we have a state where all such assumptions are true.
I'm looking forward to implementing it.

Simplify step, again
Simplification of the `(((reg_$0<int x>) + 1) * 12) + 24` < `0` inequality...
                                              ^^^^^^^^^^
                                                 `@b`
Result:  `reg_$0<int x> < -3 S64b`
          ^^^^^^^^^^^^^   ^^^^^^^
`RootSym` --/                |
                            /
`C'` ----------------------/
#1: This step supposed to fold ONLY the meta-language expression parts (@b) of the previously aquired ByteOffset.
#2: This step requires the expression under simplified to be of meta-language to be able to reorder the constants. (aka. to fold into the right hand side's constant).
#3: This also requires the right-hand side to be of the meta-language.
Do I understand this correctly, that none of the binary operations can have a symbolic RHS, because that would mean we have a VLA?

I think yes, you are right.
However, as I tried to build an example demonstrating this, it turned out that it does not handle VLAs quite well.

void foo(int x, int y) {
  int buf[x][y]; // VLA
  clang_analyzer_dump(&buf[1][2]);
  // &Element{Element{buf,1 S64b,int [y]},2 S64b,int}

  clang_analyzer_DumpRegionRawOffsetV2AndSimplify(&buf[1][2]);
  // RawOffsetV2: {BaseRegion: buf, ByteOffset: 8 S64b}
  // simplification result: {RootSymbol: 8 S64b, FoldedConstant: 0 S64b)
}

The RawOffsetV2 value is wrong. It supposed to have the ByteOffset: (1 * sizeof(int[y])) + 8.

We treat the complete expression under simplification as an expression of the meta-language.
I'm not changing this, but I probably should though.

Again, I don't understand why you are sure that the value of whole expression cannot overflow.

Ideally, after simplification we should have this inequality: x+1 < -2
That way we would not fold any subexpression of the object-language, so the #1 requirement would be preserved.
The right-hand side is of an expression of the meta-language.
It makes sense, to evaluate the operator< as the semantics of the meta-language.
But the left-hand side is an expression of the object-language.
We need some sort of logical conversion here.

What if the second index is also symbolic? E.g buf[x+1][y+1]?
This would result in (((reg_$0<int x>) + 1) * 12) + (reg_$1<int y>) + 1) < 0 . And after simplification, the RHS cannot be interpreted as meta.

As the analyzer does not symbolicate such indexing, the pointer value will be Unknown, so the simplification process does not apply.
At best we can hope to achieve something like this:

BaseRegion: buf, ByteOffset: (reg_$0<int x>) + (reg_$1<int y>) + 16
After simplification: (reg_$0<int x>) + (reg_$1<int y>)   <  -16

But to get this, we need some sort of a canonical form of expressions having multiple symbolic values.
For now, it seems reasonable to me to squash to the left all symbolic values.

Check if the resulting folded constant (C') is negative or not.

What happens if the symbolic index is not the most inner index? E.g. buf[6][x+1]. Then C is no longer constant, and no longer meta ...

In this particular case, I would expect the BytesOffset: (((reg_$0<int x>) + 1) * 4) + 72, where 72 would be the C constant.
This example is interesting from a different standpoint. The canonical form of expressions, again.
I think for our purposes ((reg_$0<int x>) * 4) + 76 form would be a better choice since we need a single multiplication and a single addition.
To get this, we need to transform (Symbol + C) * C' => (Symbol * C') + (C * C').

In D86874#2259105, @Szelethus wrote:

I can only imagine how long it took for you to write all this, because reading it wasn't that short either. Thank you so much! It really gave me a perspective on how you see this problem, as well as what is actually happening (and should happen) in the checker.

Thank you very much. I tied to make it as short as possible while keeping all the necessary details to make my reasoning perfectly clean. Now I think I was on a bad track, but @martong helped me to pinpoint some issues.

In D86874#2254638, @steakhal wrote:
void foo(int x) {
  int buf[10];

  char *r = (char*)(buf + x) + 3;
  *r = 66; // Might be safe, if x has a value to make it so.
}
I suppose you mean that it might be unsafe on the grounds of out-of-bounds addressing, but alignment-wise its okay?

You can safely access any valid memory region via char, unsigned char and std::byte pointers. And as integers are using all bits in their representation, I suspect that this code snippet is safe - given that the pointer points to a valid region.
(aka. no out-of-bound access happens). Alignment does not play any role here.

In D86874#2256249, @martong wrote:
The resulting `RegionRawOffsetV2` for `p`, `q` will be:
p: {BaseRegion: `buf`, ByteOffset: `20 S64b`}
q: {BaseRegion: `buf`, ByteOffset: `(((reg_$0<int x>) + 1) * 12) + 24`}
                                      ^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^
                                                   |           |
`@a` This is an //object-language// expression. --/            |
                                                              /
`@b` The rest should be //meta-language// expression. -------/
SPOILER: This distinction is really important here.

So, we made an expression, we should not evaluate in either of the worlds.
We should not evaluate it using the semantics of the object-language since only the @a is of that world.
We should model overflow in @a if that expression is of unsigned type (OR signed but -fwrapv...) etc. according to the abstract machine.
BUT modeling the possibility of an overflow of the rest of the expression (@b) would be a flaw.
Why? I'd expect that the value of the whole expression @a@b could overflow.
That wasn't all too clear to me either. It seems like there is extra work to do if we do the calculation according to C++ rules, but not the other way around, so we could just regard everything as if it was under the object-language ruleset.

Let's forget it. My statements in this regard are going in a bad direction. Sorry for the dead-end, I thought I was on the right track.

steakhal mentioned this in D88359: [analyzer][RFC] Complete rewrite of ArrayBoundCheckerV2.Sep 26 2020, 7:39 AM

Szelethus mentioned this in D72705: [analyzer] Added new checker 'alpha.unix.ErrorReturn'..Sep 29 2020, 9:59 AM

steakhal mentioned this in D148355: [analyzer] Fix comparison logic in ArrayBoundCheckerV2.Apr 18 2023, 7:17 AM

Revision Contents

Path

Size

clang/

lib/

StaticAnalyzer/

Checkers/

ArrayBoundCheckerV2.cpp

6 lines

test/

Analysis/

out-of-bounds-false-positive.c

7 lines

Diff 288926

clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp

Show First 20 Lines • Show All 216 Lines • ▼ Show 20 Lines	if (isa<UnknownSpaceRegion>(RawOffset.getRegion()->getMemorySpace()))
return State;		return State;

ConcreteInt Zero = SVB.makeZeroArrayIndex().castAs<ConcreteInt>();		ConcreteInt Zero = SVB.makeZeroArrayIndex().castAs<ConcreteInt>();
SVal RootNonLoc;		SVal RootNonLoc;
SVal ConstantFoldedRHS;		SVal ConstantFoldedRHS;
std::tie(RootNonLoc, ConstantFoldedRHS) =		std::tie(RootNonLoc, ConstantFoldedRHS) =
simplify(SVB, RawOffset.getByteOffset(), Zero);		simplify(SVB, RawOffset.getByteOffset(), Zero);

		// No unsigned symbolic value can be less then a negative constant.
		if (const auto SymbolicRoot = RootNonLoc.getAs<SymbolVal>())
		martongUnsubmitted Not Done Reply Inline Actions I really feel that this check would have a better place in the implementation of `eval`. This seems really counter-intuitive to do this stuff at the Checker's level. Is there any reason why we can't do this in `eval`? `evalBinOpNN` could return with Unknown, and the state should remain unchanged. Am I missing something? martong: I really feel that this check would have a better place in the implementation of `eval`. This…
		baloghadamsoftwareUnsubmitted Not Done Reply Inline Actions I agree here. Actually, the constraint manager should handle this. baloghadamsoftware: I agree here. Actually, the //constraint manager// should handle this.
		if (SymbolicRoot->getSymbol()->getType()->isUnsignedIntegerType() &&
		ConstantFoldedRHS.castAs<ConcreteInt>().getValue().isNegative())
		return State;

NonLoc LowerBoundCheck =		NonLoc LowerBoundCheck =
SVB.evalBinOpNN(State, BO_LT, RootNonLoc.castAs<NonLoc>(),		SVB.evalBinOpNN(State, BO_LT, RootNonLoc.castAs<NonLoc>(),
ConstantFoldedRHS.castAs<ConcreteInt>(),		ConstantFoldedRHS.castAs<ConcreteInt>(),
SVB.getConditionType())		SVB.getConditionType())
.castAs<NonLoc>();		.castAs<NonLoc>();

ProgramStateRef PrecedesLowerBound, WithinLowerBound;		ProgramStateRef PrecedesLowerBound, WithinLowerBound;
std::tie(PrecedesLowerBound, WithinLowerBound) =		std::tie(PrecedesLowerBound, WithinLowerBound) =
▲ Show 20 Lines • Show All 147 Lines • Show Last 20 Lines

clang/test/Analysis/out-of-bounds-false-positive.c

	// RUN: %clang_analyze_cc1 -analyzer-checker=core,alpha.security.ArrayBoundV2,debug.ExprInspection \			// RUN: %clang_analyze_cc1 -analyzer-checker=core,alpha.security.ArrayBoundV2,debug.ExprInspection \
	// RUN: -analyzer-config eagerly-assume=false -verify %s			// RUN: -analyzer-config eagerly-assume=false -verify %s

	void clang_analyzer_eval(int);			void clang_analyzer_eval(int);
	void clang_analyzer_printState();			void clang_analyzer_printState();

	typedef unsigned long long size_t;			typedef unsigned long long size_t;
	const char a[] = "abcd"; // extent: 5 bytes			const char a[] = "abcd"; // extent: 5 bytes

	void symbolic_size_t_and_int0(size_t len) {			void symbolic_size_t_and_int0(size_t len) {
	// FIXME: Should not warn for this.			(void)a[len + 1]; // no-warning
	(void)a[len + 1]; // expected-warning {{Out of bound memory access}}
	// We infered that the 'len' must be in a specific range to make the previous indexing valid.			// We infered that the 'len' must be in a specific range to make the previous indexing valid.
	// len: [0,3]			// len: [0,3]
	clang_analyzer_eval(len <= 3); // expected - warning {{TRUE}}			clang_analyzer_eval(len <= 3); // expected-warning {{TRUE}}
	clang_analyzer_eval(len <= 2); // expected - warning {{UNKNOWN}}			clang_analyzer_eval(len <= 2); // expected-warning {{UNKNOWN}}
	}			}

	void symbolic_size_t_and_int1(size_t len) {			void symbolic_size_t_and_int1(size_t len) {
	(void)a[len]; // no-warning			(void)a[len]; // no-warning
	// len: [0,4]			// len: [0,4]
	clang_analyzer_eval(len <= 4); // expected-warning {{TRUE}}			clang_analyzer_eval(len <= 4); // expected-warning {{TRUE}}
	clang_analyzer_eval(len <= 3); // expected-warning {{UNKNOWN}}			clang_analyzer_eval(len <= 3); // expected-warning {{UNKNOWN}}
	}			}
	▲ Show 20 Lines • Show All 77 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] Fix ArrayBoundCheckerV2 false positive regarding size_t indexerNeeds ReviewPublic

Details

Diff Detail

Event Timeline

Diving in

Overview of the logic of the ArrayBoundCheckerV2

ElementRegions

RegionRawOffsetV2

ElementRegions, again

Calculation of the RegionRawOffsetV2

Simplify step, again

Cause of the bug

The proposed fix

Calculation of the RegionRawOffsetV2

Simplify step, again

Calculation of the RegionRawOffsetV2

Simplify step, again

Revision Contents

Diff 288926

clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp

clang/test/Analysis/out-of-bounds-false-positive.c

[analyzer] Fix ArrayBoundCheckerV2 false positive regarding size_t indexer
Needs ReviewPublic