This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] Handle false positives caused by funny addressing
Abandoned · Public

Authored by isuckatcs on Jul 25 2022, 8:31 AM.

Details

Summary

According to a FIXME in RegionStore.cpp, the addressing below is referred to as
"funny addressing".

int x = ...;
int *p = &x;
char *q = (char*) p;
char c = *q;  // returns the first byte of 'x'.

This addressing is currently a source of false positives. Consider the following
snippet:

struct S
{
    uint8_t x = 0;
    uint8_t arr[7];
};

int main()
{
    S P[1];

    memset(P->arr, 0, sizeof(P->arr));
    const uint8_t *p = (const uint8_t *)(P);
    auto v = p[2]; // <-- uninitialized assign reported!

    return 0;
}

In this example p[2] points to the same place as P[0].arr[1], so its value is 0;
however, the analyzer cannot read it properly and reads it as Undefined instead.
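For completeness, the layout arithmetic behind this claim can be checked at compile time (assuming the usual ABI, with no padding between x and arr):

#include <cstddef>
#include <cstdint>

struct S
{
    uint8_t x = 0;
    uint8_t arr[7];
};

// With no padding, arr starts at byte 1 of S, so the byte at offset 2
// from the start of P is P[0].arr[1].
static_assert(offsetof(S, arr) == 1, "arr starts right after x");
static_assert(sizeof(S) == 8, "one byte for x plus seven for arr");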

This patch attempts to detect some cases of overlapping memory access and fall back
to Unknown values in such cases, even though this loses information, in order to reduce
the number of false positives.

I also wonder if we want to create a checker to handle this instead.

Diff Detail

Event Timeline

isuckatcs created this revision. Jul 25 2022, 8:31 AM
Herald added a project: Restricted Project. Jul 25 2022, 8:31 AM
isuckatcs requested review of this revision. Jul 25 2022, 8:31 AM
isuckatcs edited the summary of this revision.
isuckatcs updated this revision to Diff 447386. Jul 25 2022, 9:57 AM

Added support for record types

Added more detail to one of the testcases.

clang/test/Analysis/array-punned-region.c:23

This is also a false positive; see https://godbolt.org/z/Go778dhza.

This is indeed a problem. Strict aliasing violations are not well handled.
The 'assignment' should have never happened in the first place.

That being said, I believe we should focus on fixing the root cause, instead of suppressing the symptom.
In our case, the two pointers should refer to the same memory region, thus referring to the same bindings in the store.
@ASDenysPetrov had an early prototype (D114718) for detecting strict-aliasing issues, which seemed pretty good.

Strangely enough, &q[2] == &s.arr[1] is true in the following example, yet we cannot recover the bindings:
https://godbolt.org/z/KqWK81P8K
This is the bug you are talking about, I think.
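(A self-contained sketch of that behavior, in case the link rots; this is not the exact godbolt snippet, and clang_analyzer_eval comes from debug.ExprInspection:)

#include <cstdint>
#include <cstring>

void clang_analyzer_eval(bool);

struct S
{
    uint8_t x = 0;
    uint8_t arr[7];
};

void repro()
{
    S s;
    memset(s.arr, 0, sizeof(s.arr));
    const uint8_t *q = (const uint8_t *)&s;
    clang_analyzer_eval(&q[2] == &s.arr[1]); // evaluates to TRUE
    auto v = q[2]; // yet this read is not resolved to s.arr[1]'s binding
}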

That being said, I believe we should focus on fixing the root cause, instead of suppressing the symptom.

I agree with fixing the root cause, though I think that means coming up with a new memory model, and a new store implementation.
The problem here is that we can't map one memory region to another. Let's stick to the original example with some modification:

struct S
{
    uint8_t x = 0;
    uint8_t arr[7];
};

int main()
{
    S P[1];

    memset(P->arr, 0, sizeof(P->arr));
    
    const uint8_t *p = (const uint8_t *)(P);

    auto x = p[0];
    auto y = p[1];
    auto z = p[2];

    return 0;
}

After memset the store looks like this:

{ "cluster": "P", "items": [
  { "kind": "Default", "offset": 8, "value": "conj_$1{uint8_t, LC1, S889556, #1}" },
  { "kind": "Direct", "offset": 0, "value": "0 U8b" }
]}

And the variables are:

{ "cluster": "x", "items": [
  { "kind": "Direct", "offset": 0, "value": "0 U8b" }
]},
{ "cluster": "y", "items": [
  { "kind": "Direct", "offset": 0, "value": "conj_$1{uint8_t, LC1, S889556, #1}" }
]},
{ "cluster": "z", "items": [
  { "kind": "Direct", "offset": 0, "value": "Undefined" }
]}

In the case of p we basically treat the memory as if it were raw memory. We can't tell whether p[2] is part of the array region or not, so we
also can't tell whether its value is the Default binding or Undefined. Since we don't know where the end of the region is, there's no way
to solve this problem at the moment. For example, how can we reach the conclusion that p[7] is still initialized, but p[8] is not?

If we read the values using array access, we still face the problem above, but at least we can tell the analyzer to use the Default binding,
instead of Undefined.

auto y = P[0].arr[0];
auto z = P[0].arr[1];

So the resulting values are:

{ "cluster": "y", "items": [
  { "kind": "Direct", "offset": 0, "value": "derived_$2{conj_$1{uint8_t, LC1, S889556, #1},Element{Element{P,0 S64b,struct S}.arr,0 S64b,unsigned char}}" }
]},
{ "cluster": "z", "items": [
  { "kind": "Direct", "offset": 0, "value": "derived_$3{conj_$1{uint8_t, LC1, S889556, #1},Element{Element{P,0 S64b,struct S}.arr,1 S64b,unsigned char}}" }
]}

(I think these values aren't necessarily correct either.)
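In case someone wants to reproduce these dumps: calling clang_analyzer_printState() from debug.ExprInspection at the point of interest prints the program state, including the Store. A sketch (the invocation in the comment is just one way to run it):

// Run the analyzer with something along the lines of:
//   clang --analyze -Xclang -analyzer-checker=debug.ExprInspection test.cpp

#include <cstdint>
#include <cstring>

void clang_analyzer_printState(); // provided by debug.ExprInspection

struct S
{
    uint8_t x = 0;
    uint8_t arr[7];
};

int main()
{
    S P[1];
    memset(P->arr, 0, sizeof(P->arr));

    const uint8_t *p = (const uint8_t *)(P);
    auto x = p[0];
    auto y = p[1];
    auto z = p[2];

    clang_analyzer_printState(); // dumps the Store, including the "P" cluster
    return 0;
}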

@ASDenysPetrov had an early prototype (D114718) for detecting strict-aliasing issues, which seemed pretty good.

I can think of something similar instead of this patch. It would detect if a C-style cast changes the type, and set the result to Unknown in such cases.
Though that would lead to more information loss than we have now. For example, p would be Unknown and we couldn't read p[0].
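Very roughly, I imagine something along these lines (only a sketch of the idea, not the actual patch; checker registration boilerplate omitted):

#include "clang/AST/Expr.h"
#include "clang/StaticAnalyzer/Core/Checker.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/CheckerContext.h"

using namespace clang;
using namespace ento;

class PunnedCastChecker : public Checker<check::PostStmt<CStyleCastExpr>> {
public:
  void checkPostStmt(const CStyleCastExpr *CE, CheckerContext &C) const {
    QualType SrcTy = CE->getSubExpr()->getType();
    QualType DstTy = CE->getType();
    // Only care about pointer-to-pointer casts that change the pointee type.
    if (!SrcTy->isPointerType() || !DstTy->isPointerType())
      return;
    if (C.getASTContext().hasSameUnqualifiedType(SrcTy->getPointeeType(),
                                                 DstTy->getPointeeType()))
      return;
    // Conservatively forget the cast result; this is exactly the information
    // loss mentioned above: p itself becomes Unknown, so p[0] is lost too.
    ProgramStateRef State = C.getState();
    State = State->BindExpr(CE, C.getLocationContext(), UnknownVal());
    C.addTransition(State);
  }
};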

That being said, I believe we should focus on fixing the root cause, instead of suppressing the symptom.

I agree with fixing the root cause, though I think that means coming up with a new memory model, and a new store implementation.
The problem here is that we can't map one memory region to another. Let's stick to the original example with some modification:

struct S
{
    uint8_t x = 0;
    uint8_t arr[7];
};

int main()
{
    S P[1];

    memset(P->arr, 0, sizeof(P->arr));
    
    const uint8_t *p = (const uint8_t *)(P);

    auto x = p[0];
    auto y = p[1];
    auto z = p[2];

    return 0;
}

After memset the store looks like this:

{ "cluster": "P", "items": [
  { "kind": "Default", "offset": 8, "value": "conj_$1{uint8_t, LC1, S889556, #1}" },
  { "kind": "Direct", "offset": 0, "value": "0 U8b" }
]}

And the variables are:

{ "cluster": "x", "items": [
  { "kind": "Direct", "offset": 0, "value": "0 U8b" }
]},
{ "cluster": "y", "items": [
  { "kind": "Direct", "offset": 0, "value": "conj_$1{uint8_t, LC1, S889556, #1}" }
]},
{ "cluster": "z", "items": [
  { "kind": "Direct", "offset": 0, "value": "Undefined" }
]}

x is correct, y is not. y should have been a derived symbol of conj1. That being said, the type of conj1 is uint8_t, which is also bad; it should have been uint8_t[7], I think.
Consequently, z should have been a derived symbol as well, just like y, but with a different offset.

In the case of p we basically treat the memory as if it were raw memory. We can't tell whether p[2] is part of the array region or not, so we
also can't tell whether its value is the Default binding or Undefined. Since we don't know where the end of the region is, there's no way
to solve this problem at the moment.

We do track the origin of the memory regions, just as you do; hence there are no limitations there.
p is &Element{P,0 S64b,unsigned char}, which basically encodes that we are looking at the memregion of P, but as if it had type uint8_t.
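(This representation, including the p + 7 case discussed below, can be inspected with clang_analyzer_dump() from debug.ExprInspection; the dump text in the comments is approximate and version-dependent.)

#include <cstdint>

// Overload declared for the pointer type we want to dump;
// provided by the debug.ExprInspection checker.
void clang_analyzer_dump(const uint8_t *);

struct S
{
    uint8_t x = 0;
    uint8_t arr[7];
};

void dump_pointers()
{
    S P[1];
    const uint8_t *p = (const uint8_t *)(P);
    clang_analyzer_dump(p);     // &Element{P,0 S64b,unsigned char}
    clang_analyzer_dump(p + 7); // &Element{P,7 S64b,unsigned char}
}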

For example, how can we reach the conclusion that p[7] is still initialized, but p[8] is not?

p + 7 is represented as &Element{P,7 S64b,unsigned char}, thus p[7] would access the byte at offset 7 of the memregion P, which aliases P[0].arr[6].
At this point we should look at the store and look for bindings.

{ "cluster": "P", "pointer": "0x557b6037eb40", "items": [
  { "kind": "Default", "offset": 8, "value": "conj_$0{unsigned char, LC1, S1266, #1}" },
  { "kind": "Direct", "offset": 0, "value": "0 U8b" }
]}

Given that we associate the right type with the default binding there, we could infer that p[7] is affected by that default binding and p[8] is not.
I agree that this approach might not look the best, since it needs to inspect the value associated with the default binding to determine the length of the covered bytes.
I don't expect this to be much of an issue on the implementation side, as such conjured symbols regularly appear due to invalidation, where we have the region that gets invalidated (hence its size).

Alternatively, we could embed this length metadata into the default binding itself, without any indirection.
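To make that inference concrete, here is a rough, analyzer-independent sketch of the coverage check; the numbers mirror the dump above (offset 8 being, I believe, a bit offset, and the memset covering the 7 bytes of arr). None of this is actual RegionStore code:

#include <cassert>
#include <cstdint>

// What we would need to know about a default binding to answer the
// "is p[N] covered?" question: where it starts within the cluster
// and how many bytes it spans.
struct DefaultBindingInfo
{
    uint64_t OffsetBits;  // 8 in the dump above: the binding starts at byte 1 of P
    uint64_t ExtentBytes; // 7: sizeof(P->arr), the region the memset invalidated
};

// True if the raw byte read at 'ByteOffset' within the cluster falls inside
// the range covered by the default binding, i.e. should not be Undefined.
bool coveredByDefaultBinding(const DefaultBindingInfo &B, uint64_t ByteOffset)
{
    uint64_t Begin = B.OffsetBits / 8;
    return ByteOffset >= Begin && ByteOffset < Begin + B.ExtentBytes;
}

int main()
{
    DefaultBindingInfo B{8, 7};             // the memset(P->arr, ...) binding
    assert(coveredByDefaultBinding(B, 7));  // p[7] aliases P[0].arr[6]
    assert(!coveredByDefaultBinding(B, 8)); // p[8] is past the covered bytes
    return 0;
}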

If we read the values using array access, we still face the problem above, but at least we can tell the analyzer to use the Default binding,
instead of Undefined.

auto y = P[0].arr[0];
auto z = P[0].arr[1];

So the resulting values are:

{ "cluster": "y", "items": [
  { "kind": "Direct", "offset": 0, "value": "derived_$2{conj_$1{uint8_t, LC1, S889556, #1},Element{Element{P,0 S64b,struct S}.arr,0 S64b,unsigned char}}" }
]},
{ "cluster": "z", "items": [
  { "kind": "Direct", "offset": 0, "value": "derived_$3{conj_$1{uint8_t, LC1, S889556, #1},Element{Element{P,0 S64b,struct S}.arr,1 S64b,unsigned char}}" }
]}

(I think these values aren't necessarily correct either.)

They seem to be correct at first glance. At least they have the derived symbols I was wishing for above.

@ASDenysPetrov had an early prototype (D114718) for detecting strict-aliasing issues, which seemed pretty good.

I can think of something similar instead of this patch. It would detect if a C-style cast changes the type, and set the result to Unknown in such cases.
Though that would lead to more information loss than we have now. For example, p would be Unknown and we couldn't read p[0].

We should avoid introducing Unknown as much as we can.

I am inclined to produce Unknown values or even sink nodes when the strict aliasing rules are violated. This is perfectly aligned with how we handle other kinds of undefined behavior. Why should we complicate our memory model to be able to handle already flawed C++ code? I think we should push https://reviews.llvm.org/D114718 as much as possible.
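For what it's worth, the sink option would only change the tail of the earlier checker sketch; isStrictAliasingViolation below is a made-up placeholder for whatever detection logic D114718 ends up providing:

#include "clang/AST/Expr.h"
#include "clang/StaticAnalyzer/Core/Checker.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/CheckerContext.h"

using namespace clang;
using namespace ento;

class StrictAliasingSinkChecker : public Checker<check::PostStmt<CStyleCastExpr>> {
  // Hypothetical helper; the real detection would come from D114718.
  bool isStrictAliasingViolation(const CStyleCastExpr *CE,
                                 CheckerContext &C) const;

public:
  void checkPostStmt(const CStyleCastExpr *CE, CheckerContext &C) const {
    if (isStrictAliasingViolation(CE, C))
      C.generateSink(C.getState(), C.getPredecessor()); // stop exploring this path
  }
};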

isuckatcs abandoned this revision. Jul 27 2022, 12:56 PM

Abandoning the revision, because D114718 is already addressing the same problem and more.