This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] Implement cast for ranges of symbolic integers
Needs ReviewPublic

Authored by ASDenysPetrov on May 25 2021, 9:19 AM.

Details

Summary

Support integral cast for ranges of symbolic integers. Previously we only support integral cast for concrete integers.
Reason about the ranges of SymbolCast expressions. Apply truncations, promotions and conversions to get a correct range set using nested types of a SymbolCast.

Fixes: https://github.com/llvm/llvm-project/issues/50380

The solution

Create a map which contains a bitwidth as a key and a range set as a data. Call it CastMap.
CastMap = Map<uint32_t, RangeSet>

NOTE: LLVM-IR has the ability to represent integers with a bitwidth from 1 all the way to 16'777'215. See _ExtInt Feature for details.
NOTE: We don't care about certain signedness of RangeSet stored in CastMap. But the signedness of all stored RangeSet in the map shall be the same.

Create a map which contains a symbol as a key and CastMap as a data. Call it SymCastMap.
SymCastMap = Map<SymbolRef, CastMap>

Store and update SymCastMap for every SymbolCast and every SymExpr which represents an integer.
Use a root symbol of SymbolCast as a key of the map. E.g. for (int16)(uint8)(int32 x) root symbol is (int32 x).
For SymExpr use the symbol itself as a key of the map.

Getting a constraint
  1. Get a key symbol from SymbolCast/SymExpr.
  2. Get a CastMap of constraints from SymCastMap using a key symbol.
  3. Find the smallest type of the given cast symbolic expression.
  4. Find a RangeSet in the CastMap for equal or the first bigger than the bitwidth of the smallest type.
  5. If no RangeSet was found, create a new full RangeSet for the smallest type.
  6. Sequentially cast the RangeSet across the chain of types starting from the most inner one.
Pseudocode
GivenSymbol = (int16)(uint8)(int32 x)

RootSymbol = GetRoot(GivenSymbol) // (int32 x)
CastMap = GetCastMap(RootSymbol) // CastMap for (int32 x)
MinType = FindMinType(GivenSymbol) // uint8
MinBitwidth = BitwidthOf(MinType) // uint8

RangeSet = FindRange(CastMap , MinBitwidth)  // range for bitwidth of 8
if(!RangeSet)
  RangeSet = FindNextRange(CastMap, MinBitwidth) // range for bitwidth of 9+
if(!RangeSet)
  RangeSet = CreateRangeForType(MinType) // full range for uint8

CastChain = GetCastChain(GivenSymbol) // int32 -> uint8 -> int16
ResultRangeSet = RangeThroughCastChain(RangeSet, CastChain)

return ResultRangeSet
Setting a constraint
  1. Get a key symbol from SymbolCast/SymExpr.
  2. Get a map of constraints from SymCastMap using a key symbol.
  3. Find the smallest type of the given cast symbolic expression.
  4. Find and update all RangeSet's in the CastMap for bitwidths which are equal or lower than the bitwidth of the smallest type.
  5. If there is no constraint for the bitwidth of the smallest type in the map, add a new entry with the given RangeSet.
Pseudocode
GivenRangeSet = [N, M]
GivenSymbol = (int16)(uint8)(int32 x)

RootSymbol = GetRoot(GivenSymbol) // (int32 x)
CastMap = GetCastMap(RootSymbol) // CastMap for (int32 x)
MinType = FindMinType(GivenSymbol) // uint8
MinBitwidth = BitwidthOf(MinType) // uint8

Bitwidth = MinBitwidth
while (Bitwidth > 0) // update all constraints which bitwidth is equal or lower then the minimal one
  RangeSet = FindRange(CastMap, Bitwidth)  // range for bitwidth of 8 and lower
  UpdateRange(CastMap, Bitwidth, RangeSet ∩ GivenRangeSet) //  intersect ranges and store the result back to the map
  Bitwidth--

if(!RangeExistsInMap(CastMap, MinBitwidth))
  AddRange(CastMap, MinBitwidth, GivenRangeSet)  // store the given range to the map

See tests in symbol-integral-cast.cpp for examples.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ASDenysPetrov requested review of this revision.May 25 2021, 9:19 AM
NoQ added a comment.May 27 2021, 10:44 AM

It sounds like you indeed solved a lot of problems that prevented us from enabling SymbolCast. But this still requires massive testing, a lot more than a typical constraint solver patch; extraordinary claims require extraordinary evidence. If it works out though, it might be the best thing to happen to the static analyzer in years.

With SymbolCasts in place our equations become much more complicated and therefore the constraint solver becomes much more likely to produce false positives in cases where it previously erred on the side of false negatives.

Another thing to test is our ability to explain bug paths. People are often careless about integral types and it may lead to bugs which your patch helps uncover. But it is worthless to uncover these bugs if the user can't understand them. I'm thinking of scenarios like this:

01  void foo(long x) {
02    if (x == 0)
03      return;
04
05    bar(x, nullptr);
06  }
07
08  void bar(int y, int *p) {
09    if (y == 0)
10      *p = 1; // warning: null dereference
11  }

The user will discard this as false positive because "I checked for zero in foo(), it obviously can't be zero in bar() in this context". There needs to be a note that explains the implicit truncation of an interesting* symbol on line 5. Maybe even mention truncation on line 9 as well (not sure how to word that).

I also wonder if a lot of such reports will also be false positives simply because the presumption of potential overflow is baseless. On paper it looks like "if the user didn't want to pass large values, they'd just use int instead of long". But in practice there may be other reasons to use a larger integer type, such as API requirements (eg., how isascii() accepts an int but only uses 266 values). There's also the usual problem of overflow being impossible specifically on the current path; in this case we have to make sure that an appropriate assert() would actually suppress the warning (i.e., the constraint solver would be able to correctly solve the assert condition as well).

__
*In this case it's interesting as a control flow dependency of the bug location; it sounds like without @Szelethus's control flow dependency tracking this advancement would have been virtually impossible.

@NoQ
This solution only intends to make correct calculations whenever cast occures. We can mark this as alpha or add an argument flag to turn cast reasoning on/off, or we can even disable any part of this patch with argument settings.

But this still requires massive testing, a lot more than a typical constraint solver patch; extraordinary claims require extraordinary evidence.

What kind of tests do you think we need?

If it works out though, it might be the best thing to happen to the static analyzer in years.

Thank you. I appreciate your possitive evaluation.

With SymbolCasts in place our equations become much more complicated and therefore the constraint solver becomes much more likely to produce false positives in cases where it previously erred on the side of false negatives.

Another thing to test is our ability to explain bug paths.

My proposition is to design and describe a paper of:

  • what cases shall be considered as erroneous and be reported;
  • what cases shall be ignored or considered as exceptions (i.e. static_cast);
  • what wordings shall we use in reports;
  • how paths of those reports shall look like;
  • your options;

But in practice there may be other reasons to use a larger integer type, such as API requirements (eg., how isascii() accepts an int but only uses 266 values).

IMO this is great when we tell user that he/she should make sure of the value bounds before passing the arguments to such APIs.

There's also the usual problem of overflow being impossible specifically on the current path; in this case we have to make sure that an appropriate assert() would actually suppress the warning (i.e., the constraint solver would be able to correctly solve the assert condition as well).

For example, this test easily passes.

void test(int x) {
  assert(0 < x && x < 42);
  char c = x;
  clang_analyzer_eval(c <= 0); // expected-warning {{FALSE}}
  clang_analyzer_eval(c >= 42); // expected-warning {{FALSE}}
}

Or you meant some other cases?

__
*In this case it's interesting as a control flow dependency of the bug location; it sounds like without @Szelethus's control flow dependency tracking this advancement would have been virtually impossible.

Can this tracking mechanism be adjusted then?

In the end, should we go further with this patch and make more adjustments in CSA or reject it in view of your concerns?

NoQ added a comment.Jun 3 2021, 11:11 PM

@NoQ
This solution only intends to make correct calculations whenever cast occures. We can mark this as alpha or add an argument flag to turn cast reasoning on/off, or we can even disable any part of this patch with argument settings.

That would be awesome. I think we should land this patch under an -analyzer-config flag. This will allow us to experiment with viability of enabling cast symbols at any time as we improve the constraint solver. Additionally, presence of cast symbols is extremely valuable for z3 runs in which our concerns for increased complexity are eliminated.

But this still requires massive testing, a lot more than a typical constraint solver patch; extraordinary claims require extraordinary evidence.

What kind of tests do you think we need?

Test on a lot of real-world code. Like, seriously, *A LOT* of real-world code. Say, 20x the amount of code we have in docker and csa-testbench, from a large variety of projects. Investigate newly appeared reports carefully to understand the impact. I'll be happy to help with this at some point.

With SymbolCasts in place our equations become much more complicated and therefore the constraint solver becomes much more likely to produce false positives in cases where it previously erred on the side of false negatives.

Another thing to test is our ability to explain bug paths.

My proposition is to design and describe a paper of:

  • what cases shall be considered as erroneous and be reported;
  • what cases shall be ignored or considered as exceptions (i.e. static_cast);
  • what wordings shall we use in reports;
  • how paths of those reports shall look like;
  • your options;

I think we should start from examples. While testing on real-world code, find poorly explained reports and see what piece of information is missing and preventing the user from understanding the warning. Then add that piece of information.

But in practice there may be other reasons to use a larger integer type, such as API requirements (eg., how isascii() accepts an int but only uses 266 values).

IMO this is great when we tell user that he/she should make sure of the value bounds before passing the arguments to such APIs.

I'm thinking about warnings inside the implementations of such APIs.

There's also the usual problem of overflow being impossible specifically on the current path; in this case we have to make sure that an appropriate assert() would actually suppress the warning (i.e., the constraint solver would be able to correctly solve the assert condition as well).

For example, this test easily passes.

void test(int x) {
  assert(0 < x && x < 42);
  char c = x;
  clang_analyzer_eval(c <= 0); // expected-warning {{FALSE}}
  clang_analyzer_eval(c >= 42); // expected-warning {{FALSE}}
}

Or you meant some other cases?

I meant real-world examples. We should see if it works on real-world code where constraints are significantly more complex.

There is so much fancy stuff going on upstream. Awesome to see.
I'm trying to catch up ASAP, I'm finally done with my master's thesis.

Additionally, presence of cast symbols is extremely valuable for z3 runs in which our concerns for increased complexity are eliminated.

You are probably referring to D85528. (Without that patch, Z3 refutation crashes all over the place due to the not modeled widening/narrowing casts.)

But this still requires massive testing, a lot more than a typical constraint solver patch; extraordinary claims require extraordinary evidence.

What kind of tests do you think we need?

Test on a lot of real-world code. Like, seriously, *A LOT* of real-world code. Say, 20x the amount of code we have in docker and csa-testbench, from a large variety of projects. Investigate newly appeared reports carefully to understand the impact. I'll be happy to help with this at some point.

The CSA-testbench is capable of using the Conan package manager.
By cloning the https://github.com/conan-io/conan-center-index you can get a bunch of Conan package recipes, with tests actually using the given library.
Running their tests would ensure that header-only libraries get analyzed as well as normal libraries. However, frequently used packages would be analyzed over and over again, similarly to how headers suffer from this.

We planned to make use of this in the future to make CTU analysis and Z3 refutation more and more robust.
As soon as we have the infrastructure and scale, we plan to enable a small set of contributors of initiating such measurements, but don't expect it in the close future.

Added a boolean option handle-integral-cast-for-ranges under -analyzer-config flag. Disabled the feature by default.

@NoQ, @steakhal
How do you think whether it's neccesory to add any changes in SMTConstraintManager in scope of this patch?

What about this patch?

vsavchenko requested changes to this revision.Jun 16 2021, 5:34 AM

Hey, great work! I think that casts are extremely important, but it looks like you mixed so many things into this patch. Let's make one step at a time a split it into (at least) a couple of patches.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
757–767

That's definitely regresses the interface, so NominalTypeList should be definitely reworked.

1015–1042

This looks like a very static data structure to me, I don't see any reasons why the user should be able to create multiple copies of it.
If it becomes a static data-structure, there will be no need in passing it around.

1067–1103

I think all this extra logic about how we infer ranges for casts is interesting, but should be a separate patch.
For now, you can simply put return Visit(Sym->getOperand());.

First, it will unblock you from depending on that RangeFactory feature.
And also have quite a few questions about this particular implementation, so it will stagger this patch as well.

2464–2523

I need more explanation why we have this function and why we call it where we call it. Additionally, it again looks like it belongs in a separate patch.

This revision now requires changes to proceed.Jun 16 2021, 5:34 AM

Hey, great work! I think that casts are extremely important, but it looks like you mixed so many things into this patch. Let's make one step at a time a split it into (at least) a couple of patches.

Thanks for the tips. I'll adress them in the next update. Actually, I thought about splitting before the first upload and splitted it into D103094 and the current one. This particular patch provides full mechanism implementing feasibility of the test cases. Honestly, I don't know what part could be cut to keep this mechanism holistic and self-sufficient. But I'll see what i can do.

Hey, great work! I think that casts are extremely important, but it looks like you mixed so many things into this patch. Let's make one step at a time a split it into (at least) a couple of patches.

Honestly, I don't know what part could be cut to keep this mechanism holistic and self-sufficient. But I'll see what i can do.

I know: solver part = separate patch.
As I said, introduce a very minimal support in solver (aka VisitSymbolCast in Inferrer) and that's it. All other algorithms, like looking for constraints for the same expression, but casted to larger type, logically belong in a separate where you actually start producing symbolic casts.

vsavchenko added inline comments.Jun 17 2021, 1:23 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
2506–2520

What's the point of this when you do reverse operation in Inferrer?

As far as I understood, in VisitSymbolCast, you iterate over larger types and see if the same symbol was casted to any of those, and if yes you truncate the result and use that range.
Here, when we are about to set the constraint for a casted symbol, you iterate over smaller types, truncate this range for a smaller type, construct a cast to that smaller type, and add constraint for that symbol as well.

So, if this is correct, these two pieces of code DO THE SAME WORK and ONLY ONE should remain.

Splitted this revision and moved SValBuilder related changes to separate patch D105340. Added detailed comments. Made NominalTypeList static and as a result removed the forwarding across the functions. Spread handleSymbolCast logic to three methods: modifySymbolAndConstraints, updateExistingConstraints, getProperSymbolAndConstraint.

ASDenysPetrov edited the summary of this revision. (Show Details)Jul 6 2021, 3:48 AM

This is a very complicated patch, I think we'll have to iterate on it quite a lot.
Additionally, we have to be sure that this doesn't crash our performance.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1015

Comments on:

  • why do we need it?
  • why does it have four types?
  • why do we not care about signed/unsigned types?
2581–2583

OK, but I still don't understand one thing.
Here you go over all "smaller" types and artificially create constraints for them, and at the same time in VisitSymbolCast you do the opposite operation? Why? Shouldn't the map have constraints for smaller types already because of this action? Why do we need to do both?

2585–2586

This looks like a pattern and we should probably make into a method of SymbolCast

I found some issues. Working on improvement.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
2581–2583

I've been preparing an answer for you, but suddenly you inspired me on some impovements. Thanks.

2585–2586

I did it :) but refused. It will just turn into:

if (isa<SymbolCast>(Sym))
  Sym = cast<SymbolCast>(Sym)->getRootOperand();

It looks pretty the same and brings no benefit IMO, does it?
Every time I used getRootOperand I also needed some additional traverse through the types te get some another information, so I couldn't avoid the while loop there. So I decided not to introduce a new method in SymbolCast.

vsavchenko added inline comments.Jul 6 2021, 9:07 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
2585–2586

Aha, I see your point. I guess we can take it into SymExpr and call it not getRootOperand, which won't tell much to a person reading the name, but something like ignoreCasts. It will fit well with Expr::IgnoreCasts, Expr::IgnoreParens, etc.

ASDenysPetrov added inline comments.Jul 6 2021, 10:11 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
2585–2586

Nice idea! True, getRootOperand would only tell enough to user in scope of SymbolCast. I'll try to implement this in the next update.

Added SymExpr::ignoreCasts method. Added descriptive comments.

Added more descriptive comments. Fixed RangeConstraintManager::updateExistingConstraints function.

Can you please explain why you do the same thing in two different ways?

clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
293–296 ↗(On Diff #357463)
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
2581–2583

I've fixed RangeConstraintManager::updateExistingConstraints. There was a mistake when I update smaller types from the root symbol, but correct symbol is the given symbol which is before calling ignoreCast().
May be now it would be more clear for you.

That's not the question I'm asking. Why do you need to set constraints for other symbolic expressions, when SymbolicInferrer can look them up on its own? Which cases will fail if we remove that part altogether?

That's not the question I'm asking. Why do you need to set constraints for other symbolic expressions, when SymbolicInferrer can look them up on its own? Which cases will fail if we remove that part altogether?

I see. Here is what fails in case if we don't update other constraints:

void test(int x) {
  if ((char)x > -10 && (char)x < 10) {
    if ((short)x == 8) {
      // If you remove updateExistingConstraints,
      // then `c` won't be 8. It would be [-10, 10] instead.
      char c = x;
      if (c != 8)
        clang_analyzer_warnIfReached(); // should no-warning, but fail
    }
  }
}

That's not the question I'm asking. Why do you need to set constraints for other symbolic expressions, when SymbolicInferrer can look them up on its own? Which cases will fail if we remove that part altogether?

I see. Here is what fails in case if we don't update other constraints:

void test(int x) {
  if ((char)x > -10 && (char)x < 10) {
    if ((short)x == 8) {
      // If you remove updateExistingConstraints,
      // then `c` won't be 8. It would be [-10, 10] instead.
      char c = x;
      if (c != 8)
        clang_analyzer_warnIfReached(); // should no-warning, but fail
    }
  }
}

OK, it's something! Good!
I still want to hear a good explanation why is it done this way. Here c is mapped to (char)x, and we have [-10, 10] directly associated with it, but we also have (short)x associated with [8, 8]. Why can't VisitSymbolCast look up constraints for (short)x it already looks up for constraints for different casts already.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1063

Why do you use VisitSymExpr here? You want to interrupt all `Visits or... I'm not sure I fully understand.

1075

Can we get a test for that?

1076

Same goes here.

1101

Why do you get associated constraint directly without consulting with what SymbolRangeInferrer can tell you about it?

ASDenysPetrov added a comment.EditedJul 9 2021, 7:55 AM

Generally, with this patch we kinda have several constraints for each cast of a single symbol. And we shall care for all of that constraints and timely update them (if possible).
For instance, we have int x and meet casts of this symbol in code:

int x;
(char)x; // we can reason about the 1st byte
(short)x; // we can reason about the 2 lowest bytes
(ushort)x; // we can reason about the 2 lowest bytes (in this case we may not store for unsigned separately, as we already stored 2 bytes for signed)

That's like we have a knowledge of a lower part of the integer. And every time we have a new constraints, for example, for (short)x; (aka 2 bytes) then we have to update all the constraints that have two bytes or lower ((char)xin this case) as well to make them consistent.

@vsavchenko

I still want to hear a good explanation why is it done this way. Here c is mapped to (char)x, and we have [-10, 10] directly associated with it, but we also have (short)x associated with [8, 8]. Why can't VisitSymbolCast look up constraints for (short)x it already looks up for constraints for different casts already.

Hm, you've confused me. I'll make some debugging and report.

Generally, with this patch we kinda have several constraints for each cast of a single symbol. And we shall care for all of that constraints and timely update them (if possible).
For instance, we have int x and meet casts of this symbol in code:

int x;
(char)x; // we can reason about the 1st byte
(short)x; // we can reason about the 2 lowest bytes
(ushort)x; // we can reason about the 2 lowest bytes (in this case we may not store for unsigned separately, as we already stored 2 bytes for signed)

That's like we have a knowledge of a lower part of the integer. And every time we have a new constraints, for example, for (short)x; (aka 2 bytes) then we have to update all the constraints that have two bytes or lower ((char)xin this case) as well to make them consistent.

What we do in Inferrer is that we try to look at many sources of information and intersect their ranges. And I repeat my question again in a bit different form, why can't it look up constraints for (char)x and for (short)x and intersect them?
You should admit you never really address this question. Why can't VisitSymolCast do everything?

@vsavchenko

I still want to hear a good explanation why is it done this way. Here c is mapped to (char)x, and we have [-10, 10] directly associated with it, but we also have (short)x associated with [8, 8]. Why can't VisitSymbolCast look up constraints for (short)x it already looks up for constraints for different casts already.

Hm, you've confused me. I'll make some debugging and report.

It should not be about debugging, it's your code! Why did you write it this way!?

@vsavchenko

Why did you write it this way!?

I want the map contains only valid constraints at any time, so we can easely get them without traversing with all variants intersecting with each other. I'm gonna move updateExistingConstraints logic to VisitSymbolCast. I think your suggestion can even improve the feature and cover some more cases. I'll add more tests in the next update. Thanks!

ASDenysPetrov added inline comments.Jul 9 2021, 10:50 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1101

What do you mean? I didn't get. Could you give an example?

@vsavchenko

Why did you write it this way!?

I want the map contains only valid constraints at any time, so we can easely get them without traversing with all variants intersecting with each other. I'm gonna move updateExistingConstraints logic to VisitSymbolCast. I think your suggestion can even improve the feature and cover some more cases. I'll add more tests in the next update. Thanks!

[-10, 10] is also valid, right? You can't keep things at their best all the time. And if you want all constraints directly in the map then what's all this logic in VisitSymbolCast? That's why I keep asking why do you need both parts of this solution and didn't get any answer so far.
I'm hands down for the incremental approach and adding small-to-medium size improvements on top of each other. That makes my life as a reviewer easier :) That's said, I don't want to commit to a big solution, where the author doesn't want to explain why there are two parts of the solution instead of one.

I want you to tell me why the code that's in VisitSymbolCast does what it does. And the same about updateExistingConstraints. Also I want to hear a solid reason why it's split this way and why we need both of them.

You should understand that I'm not peaking on you personally. The review process takes a lot of my time too. I want to make it easier for both of us. When the reviewer understand what you are going for, it is much easier for them to help you in refining your solution. This patch is very big, but the summary doesn't cover the main part: the approach. And you leave me here dragging it out of you.

ASDenysPetrov edited the summary of this revision. (Show Details)Jul 12 2021, 9:47 AM
ASDenysPetrov added a comment.EditedJul 12 2021, 9:50 AM

@vsavchenko I've updated the summary. I hope, I addressed your question. Thanks.

ASDenysPetrov edited the summary of this revision. (Show Details)Jul 12 2021, 10:16 AM
ASDenysPetrov added inline comments.Jul 13 2021, 5:52 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1063

Here we want to delegate the reasoning to another handler as we don't support non-integral cast yet.

1075

I'll add some.

ASDenysPetrov added inline comments.Jul 13 2021, 6:23 AM
clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
293–296 ↗(On Diff #357463)

Do you think the recursive call is better than the loop? But, I guess, I see your point. You option could be safer if we had another implementation of the virtual method. Or you think such alike cast symbol is possible in the future? Well, for now ignoreCasts doesn't make sense to any other Expr successors.

I'll allocate some time to get into your summary, but for now here are my concerns about SymbolRangeInferrer and VisitSymbolCast.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1063

You are not delegating it here. Visit includes a runtime dispatch that calls a correct VisitTYPE method. Here you call VisitSymExpr directly, which is one of the VisitTYPE methods. No dispatch will happen, and since we use VisitSymExpr as the last resort (it's the most base class, if we got there, we don't actually support the expression), you interrupt the Visit and go directly to "the last resort".

See the problem now?

1101

getConstraint returns whatever constraint we have stored directly in the constraint map. That's the main source of information for ranges, but not the only one.

Here is the of things that you skip, when you do getConstraint here:

  • we can understand that something is equality/disequality check and find the corresponding info in Equivalence Classes data structure
  • we can see that the expression has the form A - B and we can find constraint for B - A
  • we can see that the expression is comparison A op B and check what other comparison info we have on A and B (your own change)
  • we can see that the expression is of form A op B and check if we know something about A and B, and produce a reasonable constraint out of this information

In order to use the right information, you should use infer that will actually do all other things as well. That's how SymbolRangeInferrer is designed, to be recursive.

Speaking of recursiveness. All these loops and manually checking for types of the cast's operand is against this pattern. Recursive visitors should call Visit for children nodes (like RecursiveASTVisitor). In other words, if f(x) is a visit function, it should be defined like this:

f(x) = g(f(x->operand_1), f(x->operand_2), ... , f(x->operand_N))

or if we talk about your case specifically:

f(x: SymbolCast) = h(f(x->Operand))

and the h function should transform the range set returned by f(x->Operand) into a range set appropriate for x.

NOTE: h can also intersect different ranges
vsavchenko added inline comments.Jul 13 2021, 6:58 AM
clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
293–296 ↗(On Diff #357463)

Oh, wait, why is it even virtual? I don't think that it should be virtual.
Are similar functions in Expr virtual?
And I think that this implementation should live in SymExpr directly. Then it would look like:

if (const SymbolCast *ThisAsCast = dyn_cast<SymbolCast>(this)) {
  return ThisAsCast->ignoreCasts();
}
return this;
ASDenysPetrov added inline comments.Jul 13 2021, 7:37 AM
clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
293–296 ↗(On Diff #357463)

Yes, SymExpr is an abstract class. And because of limitations and dependency of inheritance we are not able to know the implementaion of SymbolCast. Unfortunately, this is not a CRTP.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1063

OK. I reject this idea before. If we call Visit inside VisitSymbolCast, we will go into recursive loop, because it will return us back to VisitSymbolCast as we have passed Sym as is. (This is theoretically, I didn't check in practice.) Or I'm missing smth?
I choosed VisitSymExpr here because all kinds of SymbolCast were previously handled here. So I decided to pass all unsupproted forms of casts there.

1101

Thank you for useful notes! I'll take them into account.

vsavchenko added inline comments.Jul 13 2021, 8:01 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1063

Did I suggest to Visit(Sym)? Of course it is going to end up in a loop!
Why isn't it Visit(Sym->getOperand()) here? Before we started producing casts, casts were transparent. This logic would fit perfectly with that.

1076

And here, since we couldn't really reason about it, we usually return infer(T).

OK, thanks for putting a summary. I now got a good idea why you need both.
At the same time, take a look at D105692. I'm about to land it and I think it's going to be useful for you.

ASDenysPetrov added inline comments.Jul 13 2021, 11:32 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1063

were transparent.

Not exactly. There are still some cases when symbols are not equal to there roots(aka Operands). Such cases were handled by VisitSymExpr which uses infer(Sym->getType()); instead of getOperand`. So this needs a sort of think twice. Also I see a problem with EquivalenceClass'es. Consider next:

int x, y;
if(x == y)
  if ((char)x == 2)
    if(y == 259)
      // Here we shall update `(char)x` and find this branch infeasible.

Also such cases like:

if(x == (short)y)
  // What we should do(store) with(in) `EquivalenceClass`es.

Currently, I have an obscure vision of the solution.

// 1. `VisitSymbolCast`.
// Get a range for main `reg_$0<int x>` - [-2147483648, 2147483647]
// Cast main range to `short` - [-2147483648, 2147483647] -> [-32768, 32767].
// Now we get a valid range for further bifurcation - [-32768, 32767].

That's a great example, thanks for putting it together. I can see your point now!

Please, rebase your change and make use of ConstraintAssignor, and rework VisitSymbolCast.

clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
293–296 ↗(On Diff #357463)

Re-read my comment, please.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1063

There are still some cases when symbols are not equal to there roots(aka Operands)

Right now we don't have casts, this is what we do currently. However faulty it is, it is the existing solution and we should respect that.

Also I see a problem with EquivalenceClass'es.

Because of the current situation with casts (or more precisely with their lack), EquivalenceClasses do not get merged for symbols with different types. It is as simple as that.
You can find similar tests in equality_tracking.c.

ASDenysPetrov added inline comments.Jul 14 2021, 10:26 AM
clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
293–296 ↗(On Diff #357463)

Oh, wait, why is it even virtual?

ignoreCasts is a virtual function because I haven't found any other way to implement it better.

I don't think that it should be virtual.

Unfortunately, this is not a CRTP to avoid dynamic linking.

Are similar functions in Expr virtual?

SymExpr is an abstract class. I'm not sure about similarity but SymExpr has such virtual methods:

  • computeComplexity
  • getType
  • getOriginRegion

And I think that this implementation should live in SymExpr directly.

It's impossible due to SymExpr implementation design. SymExpr knows nothing about implementation details of SymbolCast to invoke ignoreCasts().

vsavchenko added inline comments.Jul 14 2021, 10:38 AM
clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
293–296 ↗(On Diff #357463)

a) Expr is also an abstract class
b) I put the implementation right there in the comment above. I don't see any reasons not to use it.
c) I don't buy it about "impossible" and "implementation design" because you can always declare function in one place and define it in the other.

ASDenysPetrov updated this revision to Diff 358685.EditedJul 14 2021, 11:58 AM

Made ignoreCast non-virtual.
P.S. IMO, this change is not something that can be taken as a pattern, though.

clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
293–296 ↗(On Diff #357463)

I think I achieved of what you've been concerned.

Made ignoreCast non-virtual.
P.S. IMO, this change is not something that can be taken as a pattern, though.

It is already a pattern in other type hierarchies.
Virtual functions are only good, when they can have multiple implementations. ignoreCasts by its name can have only one implementation and couldn't be virtual. That's it! It is more useable now, and less confusing for its users. The fact that its definition lives in some other cpp file doesn't change it.

clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
293–296 ↗(On Diff #357463)

This function should be removed then.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
2539–2541

No, please, remove duplication by putting it inside of the constraint assignor. It is designed specifically so we don't duplicate code around assumeSymXX functions.

@vsavchenko

It is already a pattern in other type hierarchies.

I just rarely met them. And it was hard to me reading the code searching for implementation all over the places.

clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
293–296 ↗(On Diff #357463)

NP.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
2539–2541

+1. That's what I've recently thought about. :)

ASDenysPetrov edited the summary of this revision. (Show Details)

Improved ignoreCasts implementation. Adapted to ConstraintAssignor.

vsavchenko added inline comments.Jul 15 2021, 9:15 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1644–1648

That's not using ConstraintAssignor, you simply put your implementation in here. That won't do!

ASDenysPetrov added inline comments.Jul 15 2021, 9:31 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1644–1648

OK, please tell me how to use it correctly in my case.

vsavchenko added inline comments.Jul 15 2021, 9:36 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1644–1648

Can you read the comments first and then ask me if you have any specific questions?

Adapted solution to ConstraintAssignor API. Added tests.

ASDenysPetrov added inline comments.Jul 16 2021, 6:38 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1644–1648

I think I did it. Could you please review the changes?

ASDenysPetrov retitled this revision from [analyzer] Implement cast for ranges of symbolic integers. to [analyzer] Implement cast for ranges of symbolic integers.Nov 17 2021, 1:59 AM

Ping. If there is somebody interested in this? :)

Herald added a project: Restricted Project. · View Herald TranscriptMar 23 2022, 10:13 AM

First of all, thanks Denys for working on this, nice work! Here are my concerns and remarks.

I think this fixed set of types in NominalTypeList is too rigid. I don't like the fact that we have to iterate over all four types whenever we set a new constraint or when we try to infer. Also, I am thinking about downstream hardware architectures, where there might be integers with different bit-widths (@vabridgers). Also, at some point people will pursue us to support integers with arbitrary bitwidth (see _ExtInt)

Thus, I am proposing an alternative approach. We should have a SymExpr -> Set of SymExpr mapping in the State that represents the relation of symbols that are connected via some cast operations (see REGISTER_MAP_WITH_PROGRAMSTATE). Let's call this mapping as CastMap. The key should be the root symbol, i.e the symbol that is being declared first before all cast operations.

E.g. Let's have

int16 a = 128;

then we have a constraint [128,128] stored for $a. Then

if ((int8)a < 0)

creates a new symbol $a2 (SymbolCast) that has a new constraint [-128,-128] assigned to it. And we also keep track in the State, that $a and $a2 refers the same root symbol (a). We now have in the CastMap $a -> [$a2].

Now, let's say we have

if ((_ExtInt(7))a > 64)

then we can dig up the existing contraints from CastMap to check for the State's validity and we can update all the constraints of $a and $a2 as needed. Also, CastMap is updated: $a -> [$a2, $a3].

clang/lib/StaticAnalyzer/Checkers/ExprInspectionChecker.cpp
421–426

Does it really matter? I mean, why do we need this change?

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1842

Could you please rebase? The "simplification" code part had been merged already to llvm/main and it is not part of this change.

1857–1881

I think this hunk should remain in assignSymExprToConst. Why did you move it?

1910–1920

I think, we should definitely store the constraints as they appear in the analyzed code. This way we mix the infer logic into the constraint setting, which is bad.
I mean, we should simply store the constraint directly for the symbol as it is. And then only in VisitSymbolCast should we infer the proper value from the stored constraint (if we can).

(Of course, if we have related symbols (casts of the original symbol) then their constraints must be updated.)

1911
1914
1915
1916
1952
1976–1980

Instead of a noop we should be more conservative in this case. We should invalidate (remove) the constraints of all the symbols that have more bits than the currently set symbol. However, we might be clever in cases of special values (e.g 0 or in case of the true rangeSet {[MIN, -1], [1, MAX]}).

ASDenysPetrov marked 10 inline comments as done.EditedApr 22 2022, 12:40 PM

Thank you for the review @martong! Your work is not less hard than mine. I'll rework and update the revision ASAP.

clang/lib/StaticAnalyzer/Checkers/ExprInspectionChecker.cpp
421–426

I investigated. This changes is not obligatory now. I'll remove it.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1857–1881

I'll remove. It's unrelated one.

1910–1920

I see what you mean. I thought about this. Here what I've faced with.

  1. Let's say you meet (wchar_t)x > 0, which you store like a pair {(wchar_t)x, [1,32767]}.
  2. Then you meet (short)x < 0, where you have to answer whether it's true or false.
  3. So would be your next step? Brute forcing all the constraint pairs to find any x-related symbol? Obviously, O(n) complexity for every single integral symbol is inefficient.

What I propose is to "canonize" arbitrary types to a general form where this form could be a part of key along with x and we could get a constraint with a classic map complexity. So that:

  1. You meet (wchar_t)x > 0, which you convert wchar_t to int16 and store like a pair {(int16)x, [1,32767]}.
  2. Then you meet (short)x < 0, where you convert short to int16 and get a constraint.

That's why I've introduced NominalTypeList.
But now I admited your concern about arbitrary size of integers and will redesign my solution.

1976–1980

No, it's incorrect. Consider next:

int x;
if(x > 1000000 || x < 100000) 
  return;
// x (100'000, 1000'000) 
if((int8)x != 42) 
  return;
// x (100'000, 1000'000) && (int8)x (42, 42)

We can't just remove or invalidate x (100'000, 1000'000) because this range will still stay true.
Strictly speaking x range should be updated with values 100394, 102442, 108586, ...,, 960554 and any other value within the range which has its lowest byte equals to 42.
We can't just update the RangeSet with such a big amount of values due to performance issues. So we just assume it as less accurate.

martong added inline comments.Apr 25 2022, 3:07 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1910–1920

So would be your next step? Brute forcing all the constraint pairs to find any x-related symbol? Obviously, O(n) complexity for every single integral symbol is inefficient.

I don't think we need a brute force search among the constraints if we have the additional CastMap (that I have previously proposed).
So, the next step would be: Lookup the root symbol of (short)x that is (wchar_t)x (supposing we have seen a declaration wchar_t x; first during the symbolic execution, but having a different root might work out as well).
Then from the CastMap we can query in O(1) the set of the related cast symbols of the root symbol. From this set it takes to query the constraints for each of the members from the existing equivalneClass->constraint mapping. That's O(1) times the number of the members of the cast set (which is assumed to be very few in the usual case).

1976–1980

Okay, this makes perfect sense, thanks for the example!

ASDenysPetrov updated this revision to Diff 425248.EditedApr 26 2022, 9:36 AM
ASDenysPetrov marked 4 inline comments as done.
ASDenysPetrov edited the summary of this revision. (Show Details)

@martong thank you for the idea. I've tried to implement it. Could you look at the patch once again, please? I've also described a new solution in the Summary.

ASDenysPetrov marked 2 inline comments as done.Apr 26 2022, 9:37 AM

Thanks Denys for the update! This is getting really good.

I have some concerns though about the CastMap = Map<uint32_t, RangeSet>. I think we should have CastMap = Map<uint32_t, EquivalenceClass> instead, and we could get the RangeSet from the existing ConstraintRange mapping. By storing directly the RangeSet, the State might get out-of-sync when we introduce a constraint to another member in an equivalence class. (Besides that, our mapping of constraints is happening always by using the EquivalenceClasses as keys.)
I think this could solve the problematic code you posted earlier

int x, y;
if(x == y)
  if ((char)x == 2)
    if(y == 259)
      // Here we shall update `(char)x` and find this branch infeasible.

Here we have EqClass1: [x, y] , EqClass2: [(char)x] and they are not the same class, thus when you iterate over the CastMap, you can get the updated RangeSet for both classes, and the infeasibility can be discovered.

About this:

if(x == (short)y)
  // What we should do(store) with(in) `EquivalenceClass`es.

In this case, we have one EqClass with two members, the SymbolRef x and the SymbolCast (short)y. They both must have the same RangeSet associated to them. And this is already implemented. By referreing to the EqClass in the CastMap, we simply can reuse this information.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1123–1124

By using llvm::Expected<T> we would be more aligned with the llvm error handling practices. Besides, the bool in the tuple and the Success variable in the function below would not be needed.

NoQ added inline comments.Apr 28 2022, 1:54 PM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
707–708

These maps will need to be cleaned up when symbols become dead (as in RangeConstraintManager::removeDeadBindings()).

Giving it some more thought, the SymCastMap = Map<SymbolRef, CastMap> should be keyed as well with an equivalence class : SymCastMap = Map<EquivalenceClass, CastMap>. This is the only way to use the equivalence info correctly when we process the casts.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
707–708

Yes, the same way as we clean up e.g. the DisequalityMap.

martong added inline comments.Apr 29 2022, 3:04 AM
clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1183–1185

There might be a problem here because the iteration of the map is non-deterministic. We should probably have a copy that is sorted, or the container should be sorted (sorted immutable list maybe?).

Your tests below passed probably because the cast chains are too small. Could you please have a test, where the chain is really long (20 maybe) and shuffled.
(My thanks for @steakhal for this additional comment.)

1906–1909

Same here.

clang/test/Analysis/symbol-integral-cast.cpp
32
ASDenysPetrov added a comment.EditedMay 13 2022, 8:38 AM

Ping

Thank you, folk, for taking you time. I'll surely make corresponding changes according to your suggestions and notify you then. Sorry, @martong, for the late response. I'm pretty loaded recent times.

ASDenysPetrov edited the summary of this revision. (Show Details)May 19 2022, 10:26 AM

Denys, I've created a very simple patch that makes the SValBuilder to be able to look up and use a constraint for an operand of a SymbolCast. That change passes 2 of your test cases, thus I made that a parent patch.

clang/test/Analysis/symbol-integral-cast.cpp
13–37

These two tests are redundant because they are handled by the Parent patch I've just created. https://reviews.llvm.org/D126481

ASDenysPetrov marked 4 inline comments as done.EditedJun 27 2022, 11:15 AM

@martong Just FYI. I've been working on reworking this solution to using EquivalenceClasses for several weeks. It turned out that this is an extremely hard task to acomplish. There're a lot of cast cases like: (int8)x==y, (uint16)a==(int64)b, (uint8)y == b, Merging and inferring all of this without going beyond the complexity O(n) is really tricky.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1183–1185

I've checked. ImmutableSet gave me a sorted order. But I agree that it could be just a coincidence. I'll try to add more tests.

@martong Just FYI. I've been working on reworking this solution to using EquivalenceClasses for several weeks. It turned out that this is an extremely hard task to acomplish. There're a lot of cast cases like: (int8)x==y, (uint16)a==(int64)b, (uint8)y == b, Merging and inferring all of this without going beyond the complexity O(n) is really tricky.

Please elaborate. I don't see how is it different than merging and inferring without the handling of casts. My understanding is that, we have more symbols (additional SymbolCasts) indeed. But, the merging algorithm should not change at all, that should be agnostic to the actual symbol kind (whether that is a SymbolCast or a SymbolData or a SymSymEpxr).
The infer algorithm might be different though, but there I think the original algorithm you delved for SymExprs should work as well for EquivalenceClasses.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1183–1185

Yes, your are right, and I was wrong. ImmutableSet is based on an AVL tree, which is a balanced binary tree and the begin() gives such an iterator that conducts an inorder - thus sorted - traversal. And the key is an integer. (I don't know now how, but we were mislead, @steakhal too. I guess, we thought that the key of CastMap is a pointer.)

1906–1909

Please disregard the comment above.

ASDenysPetrov marked 10 inline comments as done.EditedJul 11 2022, 9:06 AM

@martong Thank you for your patience. I started moving to a bit another direction that we can improving it iteratively.
Just spoiling, in my latest solution a single symbol will be associated to many classes. Here are also tricky cases:


Consider equalities

a32 == (int8)b32
b32 == c32

Class-Symbol map is:

class1 { a32 , (int8)b32 }
class2 { b32 , c32 }

Symbol-Class map is:

a32 { 32 : class1 }
b32 {  8 : class1, 32 : class2  }
c32 { 32 : class2 }

If we know:

(int8)c32 == -1

then what is:

(int8)a32 - ?

Should we traverse like a -> 32 -> class1 -> (int8)b32 -> b32 -> class2 -> c32 -> (int8)c32 ?


The x8 == y32 we can treat as a range of int8 ( {-128, 127} or {0, 255} ).


For (int8)x32 == (int16)x32 we can eliminate one of the symbols in the class a s a redundant one.


If x32 == 0 then we can simplify next classes (int16)x32 == y and (int8)x32 == z merging them into a single class {x32, y, z}.

I believe there are more cases.

Thanks Denys for your continued work on this. These are very good questions that must be answered, we need exactly such thinking to implement this properly. I believe we can advance gradually.

@martong Thank you for your patience. I started moving to a bit another direction that we can improving it iteratively.
Just spoiling, in my latest solution a single symbol will be associated to many classes. Here are also tricky cases:


Consider equalities

a32 == (int8)b32
b32 == c32

Class-Symbol map is:

class1 { a32 , (int8)b32 }
class2 { b32 , c32 }

Symbol-Class map is:

a32 { 32 : class1 }
b32 {  8 : class1, 32 : class2  }
c32 { 32 : class2 }

If we know:

(int8)c32 == -1

then what is:

(int8)a32 - ?

Should we traverse like a -> 32 -> class1 -> (int8)b32 -> b32 -> class2 -> c32 -> (int8)c32 ?

I think, we should have only a -> 32 -> class1 -> (int8)b32. The (int8)b32 -> b32 step would be incorrect according to the modulo logic.
With other words, we should check the equivalence class of the root symbol (and no other eq classes should be considered), in this case this is only class1. (More precisely we should check the SymCastMap of class1.)


The x8 == y32 we can treat as a range of int8 ( {-128, 127} or {0, 255} ).

I am not sure what you mean here.
You mean, when we bifurcate on x8 == y32 ? Then we have two branches, the false case x8 == y32: [0, 0] and x8 == y32: [[INT_MIN, -1], [1,INT_MAX]]


For (int8)x32 == (int16)x32 we can eliminate one of the symbols in the class a s a redundant one.

Yes, but this is more like an optimization step. I'd handle this with low priority and with a FIXME comment.


If x32 == 0 then we can simplify next classes (int16)x32 == y and (int8)x32 == z merging them into a single class {x32, y, z}.

Good point. This is an optimization for precision. I'd like to have this, but in a subsequent patch. Let's try to have the absolute simplest working version in this patch.
Also, this can extend to any concrete value that is meaningful in the smaller types. E.g. any single value of x32 in [0,127] could simplify (int8)x32.

I believe there are more cases.

Yes. Consider liveness for example. We should remove the class from SymCastMap if the class itself becomes dead. This should be part of this patch.

Completely reworked solution.