[analyzer] Implement cast for ranges of symbolic integersNeeds ReviewPublicActions

Authored by ASDenysPetrov on May 25 2021, 9:19 AM.

Details

Reviewers
 NoQ vsavchenko steakhal martong dcoughlin baloghadamsoftware
Summary

Support integral cast for ranges of symbolic integers. Previously we only support integral cast for concrete integers.
Reason about the ranges of SymbolCast expressions. Apply truncations, promotions and conversions to get a correct range set using nested types of a SymbolCast.

The solution

Create a map which contains a bitwidth as a key and a range set as a data. Call it CastMap.
CastMap = Map<uint32_t, RangeSet>

NOTE: LLVM-IR has the ability to represent integers with a bitwidth from 1 all the way to 16'777'215. See _ExtInt Feature for details.
NOTE: We don't care about certain signedness of RangeSet stored in CastMap. But the signedness of all stored RangeSet in the map shall be the same.

Create a map which contains a symbol as a key and CastMap as a data. Call it SymCastMap.
SymCastMap = Map<SymbolRef, CastMap>

Store and update SymCastMap for every SymbolCast and every SymExpr which represents an integer.
Use a root symbol of SymbolCast as a key of the map. E.g. for (int16)(uint8)(int32 x) root symbol is (int32 x).
For SymExpr use the symbol itself as a key of the map.

Getting a constraint
1. Get a key symbol from SymbolCast/SymExpr.
2. Get a CastMap of constraints from SymCastMap using a key symbol.
3. Find the smallest type of the given cast symbolic expression.
4. Find a RangeSet in the CastMap for equal or the first bigger than the bitwidth of the smallest type.
5. If no RangeSet was found, create a new full RangeSet for the smallest type.
6. Sequentially cast the RangeSet across the chain of types starting from the most inner one.
Pseudocode
```GivenSymbol = (int16)(uint8)(int32 x)

RootSymbol = GetRoot(GivenSymbol) // (int32 x)
CastMap = GetCastMap(RootSymbol) // CastMap for (int32 x)
MinType = FindMinType(GivenSymbol) // uint8
MinBitwidth = BitwidthOf(MinType) // uint8

RangeSet = FindRange(CastMap , MinBitwidth)  // range for bitwidth of 8
if(!RangeSet)
RangeSet = FindNextRange(CastMap, MinBitwidth) // range for bitwidth of 9+
if(!RangeSet)
RangeSet = CreateRangeForType(MinType) // full range for uint8

CastChain = GetCastChain(GivenSymbol) // int32 -> uint8 -> int16
ResultRangeSet = RangeThroughCastChain(RangeSet, CastChain)

return ResultRangeSet```
Setting a constraint
1. Get a key symbol from SymbolCast/SymExpr.
2. Get a map of constraints from SymCastMap using a key symbol.
3. Find the smallest type of the given cast symbolic expression.
4. Find and update all RangeSet's in the CastMap for bitwidths which are equal or lower than the bitwidth of the smallest type.
5. If there is no constraint for the bitwidth of the smallest type in the map, add a new entry with the given RangeSet.
Pseudocode
```GivenRangeSet = [N, M]
GivenSymbol = (int16)(uint8)(int32 x)

RootSymbol = GetRoot(GivenSymbol) // (int32 x)
CastMap = GetCastMap(RootSymbol) // CastMap for (int32 x)
MinType = FindMinType(GivenSymbol) // uint8
MinBitwidth = BitwidthOf(MinType) // uint8

Bitwidth = MinBitwidth
while (Bitwidth > 0) // update all constraints which bitwidth is equal or lower then the minimal one
RangeSet = FindRange(CastMap, Bitwidth)  // range for bitwidth of 8 and lower
UpdateRange(CastMap, Bitwidth, RangeSet ∩ GivenRangeSet) //  intersect ranges and store the result back to the map
Bitwidth--

if(!RangeExistsInMap(CastMap, MinBitwidth))
AddRange(CastMap, MinBitwidth, GivenRangeSet)  // store the given range to the map```

See tests in symbol-integral-cast.cpp for examples.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

OK, thanks for putting a summary. I now got a good idea why you need both.
At the same time, take a look at D105692. I'm about to land it and I think it's going to be useful for you.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1222

were transparent.

Not exactly. There are still some cases when symbols are not equal to there roots(aka Operands). Such cases were handled by VisitSymExpr which uses infer(Sym->getType()); instead of getOperand`. So this needs a sort of think twice. Also I see a problem with EquivalenceClass'es. Consider next:

```int x, y;
if(x == y)
if ((char)x == 2)
if(y == 259)
// Here we shall update `(char)x` and find this branch infeasible.```

Also such cases like:

```if(x == (short)y)
// What we should do(store) with(in) `EquivalenceClass`es.```

Currently, I have an obscure vision of the solution.

```// 1. `VisitSymbolCast`.
// Get a range for main `reg_\$0<int x>` - [-2147483648, 2147483647]
// Cast main range to `short` - [-2147483648, 2147483647] -> [-32768, 32767].
// Now we get a valid range for further bifurcation - [-32768, 32767].```

That's a great example, thanks for putting it together. I can see your point now!

Please, rebase your change and make use of ConstraintAssignor, and rework VisitSymbolCast.

clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
293–296 ↗(On Diff #357463)

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1222

There are still some cases when symbols are not equal to there roots(aka Operands)

Right now we don't have casts, this is what we do currently. However faulty it is, it is the existing solution and we should respect that.

Also I see a problem with EquivalenceClass'es.

Because of the current situation with casts (or more precisely with their lack), EquivalenceClasses do not get merged for symbols with different types. It is as simple as that.
You can find similar tests in equality_tracking.c.

clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
293–296 ↗(On Diff #357463)

Oh, wait, why is it even virtual?

ignoreCasts is a virtual function because I haven't found any other way to implement it better.

I don't think that it should be virtual.

Unfortunately, this is not a CRTP to avoid dynamic linking.

Are similar functions in Expr virtual?

SymExpr is an abstract class. I'm not sure about similarity but SymExpr has such virtual methods:

• computeComplexity
• getType
• getOriginRegion

And I think that this implementation should live in SymExpr directly.

It's impossible due to SymExpr implementation design. SymExpr knows nothing about implementation details of SymbolCast to invoke ignoreCasts().

clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
293–296 ↗(On Diff #357463)

a) Expr is also an abstract class
b) I put the implementation right there in the comment above. I don't see any reasons not to use it.
c) I don't buy it about "impossible" and "implementation design" because you can always declare function in one place and define it in the other.

ASDenysPetrov updated this revision to Diff 358685.EditedJul 14 2021, 11:58 AM

P.S. IMO, this change is not something that can be taken as a pattern, though.

clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
293–296 ↗(On Diff #357463)

I think I achieved of what you've been concerned.

P.S. IMO, this change is not something that can be taken as a pattern, though.

It is already a pattern in other type hierarchies.
Virtual functions are only good, when they can have multiple implementations. ignoreCasts by its name can have only one implementation and couldn't be virtual. That's it! It is more useable now, and less confusing for its users. The fact that its definition lives in some other cpp file doesn't change it.

clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
293–296 ↗(On Diff #357463)

This function should be removed then.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
3140–3142

No, please, remove duplication by putting it inside of the constraint assignor. It is designed specifically so we don't duplicate code around assumeSymXX functions.

@vsavchenko

It is already a pattern in other type hierarchies.

I just rarely met them. And it was hard to me reading the code searching for implementation all over the places.

clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
293–296 ↗(On Diff #357463)

NP.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
3140–3142

+1. That's what I've recently thought about. :)

ASDenysPetrov edited the summary of this revision. (Show Details)

Improved ignoreCasts implementation. Adapted to ConstraintAssignor.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
2097–2101

That's not using ConstraintAssignor, you simply put your implementation in here. That won't do!

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
2097–2101

OK, please tell me how to use it correctly in my case.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
2097–2101

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
2097–2101

I think I did it. Could you please review the changes?

ASDenysPetrov retitled this revision from [analyzer] Implement cast for ranges of symbolic integers. to [analyzer] Implement cast for ranges of symbolic integers.Nov 17 2021, 1:59 AM

Ping. If there is somebody interested in this? :)

Herald added a project: Restricted Project. Mar 23 2022, 10:13 AM

First of all, thanks Denys for working on this, nice work! Here are my concerns and remarks.

I think this fixed set of types in NominalTypeList is too rigid. I don't like the fact that we have to iterate over all four types whenever we set a new constraint or when we try to infer. Also, I am thinking about downstream hardware architectures, where there might be integers with different bit-widths (@vabridgers). Also, at some point people will pursue us to support integers with arbitrary bitwidth (see _ExtInt)

Thus, I am proposing an alternative approach. We should have a SymExpr -> Set of SymExpr mapping in the State that represents the relation of symbols that are connected via some cast operations (see REGISTER_MAP_WITH_PROGRAMSTATE). Let's call this mapping as CastMap. The key should be the root symbol, i.e the symbol that is being declared first before all cast operations.

E.g. Let's have

`int16 a = 128;`

then we have a constraint [128,128] stored for \$a. Then

`if ((int8)a < 0)`

creates a new symbol \$a2 (SymbolCast) that has a new constraint [-128,-128] assigned to it. And we also keep track in the State, that \$a and \$a2 refers the same root symbol (a). We now have in the CastMap \$a -> [\$a2].

Now, let's say we have

`if ((_ExtInt(7))a > 64)`

then we can dig up the existing contraints from CastMap to check for the State's validity and we can update all the constraints of \$a and \$a2 as needed. Also, CastMap is updated: \$a -> [\$a2, \$a3].

clang/lib/StaticAnalyzer/Checkers/ExprInspectionChecker.cpp
421–426 ↗(On Diff #388552)

Does it really matter? I mean, why do we need this change?

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
2182

Could you please rebase? The "simplification" code part had been merged already to llvm/main and it is not part of this change.

2197–2221

I think this hunk should remain in assignSymExprToConst. Why did you move it?

2250–2260

I think, we should definitely store the constraints as they appear in the analyzed code. This way we mix the infer logic into the constraint setting, which is bad.
I mean, we should simply store the constraint directly for the symbol as it is. And then only in VisitSymbolCast should we infer the proper value from the stored constraint (if we can).

(Of course, if we have related symbols (casts of the original symbol) then their constraints must be updated.)

2251
2254
2255
2256
2292
2316–2320

Instead of a noop we should be more conservative in this case. We should invalidate (remove) the constraints of all the symbols that have more bits than the currently set symbol. However, we might be clever in cases of special values (e.g 0 or in case of the true rangeSet {[MIN, -1], [1, MAX]}).

ASDenysPetrov marked 10 inline comments as done.EditedApr 22 2022, 12:40 PM

Thank you for the review @martong! Your work is not less hard than mine. I'll rework and update the revision ASAP.

clang/lib/StaticAnalyzer/Checkers/ExprInspectionChecker.cpp
421–426 ↗(On Diff #388552)

I investigated. This changes is not obligatory now. I'll remove it.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
2197–2221

I'll remove. It's unrelated one.

2250–2260

1. Let's say you meet (wchar_t)x > 0, which you store like a pair {(wchar_t)x, [1,32767]}.
2. Then you meet (short)x < 0, where you have to answer whether it's true or false.
3. So would be your next step? Brute forcing all the constraint pairs to find any x-related symbol? Obviously, O(n) complexity for every single integral symbol is inefficient.

What I propose is to "canonize" arbitrary types to a general form where this form could be a part of key along with x and we could get a constraint with a classic map complexity. So that:

1. You meet (wchar_t)x > 0, which you convert wchar_t to int16 and store like a pair {(int16)x, [1,32767]}.
2. Then you meet (short)x < 0, where you convert short to int16 and get a constraint.

That's why I've introduced NominalTypeList.

2316–2320

No, it's incorrect. Consider next:

```int x;
if(x > 1000000 || x < 100000)
return;
// x (100'000, 1000'000)
if((int8)x != 42)
return;
// x (100'000, 1000'000) && (int8)x (42, 42)```

We can't just remove or invalidate x (100'000, 1000'000) because this range will still stay true.
Strictly speaking x range should be updated with values 100394, 102442, 108586, ...,, 960554 and any other value within the range which has its lowest byte equals to 42.
We can't just update the RangeSet with such a big amount of values due to performance issues. So we just assume it as less accurate.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
2250–2260

So would be your next step? Brute forcing all the constraint pairs to find any x-related symbol? Obviously, O(n) complexity for every single integral symbol is inefficient.

I don't think we need a brute force search among the constraints if we have the additional CastMap (that I have previously proposed).
So, the next step would be: Lookup the root symbol of (short)x that is (wchar_t)x (supposing we have seen a declaration wchar_t x; first during the symbolic execution, but having a different root might work out as well).
Then from the CastMap we can query in O(1) the set of the related cast symbols of the root symbol. From this set it takes to query the constraints for each of the members from the existing equivalneClass->constraint mapping. That's O(1) times the number of the members of the cast set (which is assumed to be very few in the usual case).

2316–2320

Okay, this makes perfect sense, thanks for the example!

ASDenysPetrov updated this revision to Diff 425248.EditedApr 26 2022, 9:36 AM
ASDenysPetrov marked 4 inline comments as done.
ASDenysPetrov edited the summary of this revision. (Show Details)

@martong thank you for the idea. I've tried to implement it. Could you look at the patch once again, please? I've also described a new solution in the Summary.

ASDenysPetrov marked 2 inline comments as done.Apr 26 2022, 9:37 AM

Thanks Denys for the update! This is getting really good.

I have some concerns though about the CastMap = Map<uint32_t, RangeSet>. I think we should have CastMap = Map<uint32_t, EquivalenceClass> instead, and we could get the RangeSet from the existing ConstraintRange mapping. By storing directly the RangeSet, the State might get out-of-sync when we introduce a constraint to another member in an equivalence class. (Besides that, our mapping of constraints is happening always by using the EquivalenceClasses as keys.)
I think this could solve the problematic code you posted earlier

```int x, y;
if(x == y)
if ((char)x == 2)
if(y == 259)
// Here we shall update `(char)x` and find this branch infeasible.```

Here we have EqClass1: [x, y] , EqClass2: [(char)x] and they are not the same class, thus when you iterate over the CastMap, you can get the updated RangeSet for both classes, and the infeasibility can be discovered.

```if(x == (short)y)
// What we should do(store) with(in) `EquivalenceClass`es.```

In this case, we have one EqClass with two members, the SymbolRef x and the SymbolCast (short)y. They both must have the same RangeSet associated to them. And this is already implemented. By referreing to the EqClass in the CastMap, we simply can reuse this information.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1229–1230

By using llvm::Expected<T> we would be more aligned with the llvm error handling practices. Besides, the bool in the tuple and the Success variable in the function below would not be needed.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
889–890

These maps will need to be cleaned up when symbols become dead (as in RangeConstraintManager::removeDeadBindings()).

Giving it some more thought, the SymCastMap = Map<SymbolRef, CastMap> should be keyed as well with an equivalence class : SymCastMap = Map<EquivalenceClass, CastMap>. This is the only way to use the equivalence info correctly when we process the casts.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
889–890

Yes, the same way as we clean up e.g. the DisequalityMap.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1289–1291

There might be a problem here because the iteration of the map is non-deterministic. We should probably have a copy that is sorted, or the container should be sorted (sorted immutable list maybe?).

Your tests below passed probably because the cast chains are too small. Could you please have a test, where the chain is really long (20 maybe) and shuffled.
(My thanks for @steakhal for this additional comment.)

2300–2303

Same here.

clang/test/Analysis/symbol-integral-cast.cpp
31
ASDenysPetrov added a comment.EditedMay 13 2022, 8:38 AM

Ping

Thank you, folk, for taking you time. I'll surely make corresponding changes according to your suggestions and notify you then. Sorry, @martong, for the late response. I'm pretty loaded recent times.

ASDenysPetrov edited the summary of this revision. (Show Details)May 19 2022, 10:26 AM

Denys, I've created a very simple patch that makes the SValBuilder to be able to look up and use a constraint for an operand of a SymbolCast. That change passes 2 of your test cases, thus I made that a parent patch.

clang/test/Analysis/symbol-integral-cast.cpp
12–36

These two tests are redundant because they are handled by the Parent patch I've just created. https://reviews.llvm.org/D126481

ASDenysPetrov marked 4 inline comments as done.EditedJun 27 2022, 11:15 AM

@martong Just FYI. I've been working on reworking this solution to using EquivalenceClasses for several weeks. It turned out that this is an extremely hard task to acomplish. There're a lot of cast cases like: (int8)x==y, (uint16)a==(int64)b, (uint8)y == b, Merging and inferring all of this without going beyond the complexity O(n) is really tricky.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1289–1291

I've checked. ImmutableSet gave me a sorted order. But I agree that it could be just a coincidence. I'll try to add more tests.

@martong Just FYI. I've been working on reworking this solution to using EquivalenceClasses for several weeks. It turned out that this is an extremely hard task to acomplish. There're a lot of cast cases like: (int8)x==y, (uint16)a==(int64)b, (uint8)y == b, Merging and inferring all of this without going beyond the complexity O(n) is really tricky.

Please elaborate. I don't see how is it different than merging and inferring without the handling of casts. My understanding is that, we have more symbols (additional SymbolCasts) indeed. But, the merging algorithm should not change at all, that should be agnostic to the actual symbol kind (whether that is a SymbolCast or a SymbolData or a SymSymEpxr).
The infer algorithm might be different though, but there I think the original algorithm you delved for SymExprs should work as well for EquivalenceClasses.

clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp
1289–1291

Yes, your are right, and I was wrong. ImmutableSet is based on an AVL tree, which is a balanced binary tree and the begin() gives such an iterator that conducts an inorder - thus sorted - traversal. And the key is an integer. (I don't know now how, but we were mislead, @steakhal too. I guess, we thought that the key of CastMap is a pointer.)

2300–2303

ASDenysPetrov marked 10 inline comments as done.EditedJul 11 2022, 9:06 AM

@martong Thank you for your patience. I started moving to a bit another direction that we can improving it iteratively.
Just spoiling, in my latest solution a single symbol will be associated to many classes. Here are also tricky cases:

Consider equalities

```a32 == (int8)b32
b32 == c32```

Class-Symbol map is:

```class1 { a32 , (int8)b32 }
class2 { b32 , c32 }```

Symbol-Class map is:

```a32 { 32 : class1 }
b32 {  8 : class1, 32 : class2  }
c32 { 32 : class2 }```

If we know:

`(int8)c32 == -1`

then what is:

`(int8)a32 - ?`

Should we traverse like a -> 32 -> class1 -> (int8)b32 -> b32 -> class2 -> c32 -> (int8)c32 ?

The x8 == y32 we can treat as a range of int8 ( {-128, 127} or {0, 255} ).

For (int8)x32 == (int16)x32 we can eliminate one of the symbols in the class a s a redundant one.

If x32 == 0 then we can simplify next classes (int16)x32 == y and (int8)x32 == z merging them into a single class {x32, y, z}.

I believe there are more cases.

Thanks Denys for your continued work on this. These are very good questions that must be answered, we need exactly such thinking to implement this properly. I believe we can advance gradually.

@martong Thank you for your patience. I started moving to a bit another direction that we can improving it iteratively.
Just spoiling, in my latest solution a single symbol will be associated to many classes. Here are also tricky cases:

Consider equalities

```a32 == (int8)b32
b32 == c32```

Class-Symbol map is:

```class1 { a32 , (int8)b32 }
class2 { b32 , c32 }```

Symbol-Class map is:

```a32 { 32 : class1 }
b32 {  8 : class1, 32 : class2  }
c32 { 32 : class2 }```

If we know:

`(int8)c32 == -1`

then what is:

`(int8)a32 - ?`

Should we traverse like a -> 32 -> class1 -> (int8)b32 -> b32 -> class2 -> c32 -> (int8)c32 ?

I think, we should have only a -> 32 -> class1 -> (int8)b32. The (int8)b32 -> b32 step would be incorrect according to the modulo logic.
With other words, we should check the equivalence class of the root symbol (and no other eq classes should be considered), in this case this is only class1. (More precisely we should check the SymCastMap of class1.)

The x8 == y32 we can treat as a range of int8 ( {-128, 127} or {0, 255} ).

I am not sure what you mean here.
You mean, when we bifurcate on x8 == y32 ? Then we have two branches, the false case x8 == y32: [0, 0] and x8 == y32: [[INT_MIN, -1], [1,INT_MAX]]

For (int8)x32 == (int16)x32 we can eliminate one of the symbols in the class a s a redundant one.

Yes, but this is more like an optimization step. I'd handle this with low priority and with a FIXME comment.

If x32 == 0 then we can simplify next classes (int16)x32 == y and (int8)x32 == z merging them into a single class {x32, y, z}.

Good point. This is an optimization for precision. I'd like to have this, but in a subsequent patch. Let's try to have the absolute simplest working version in this patch.
Also, this can extend to any concrete value that is meaningful in the smaller types. E.g. any single value of x32 in [0,127] could simplify (int8)x32.

I believe there are more cases.

Yes. Consider liveness for example. We should remove the class from SymCastMap if the class itself becomes dead. This should be part of this patch.