Page MenuHomePhabricator

[analyzer] Retrieve a character from StringLiteral as an initializer for constant arrays.
ClosedPublic

Authored by ASDenysPetrov on Aug 3 2021, 4:07 AM.

Details

Summary

Assuming that values of constant arrays never change, we can retrieve values for specific position(index) right from the initializer, if presented. Retrieve a character code by index from StringLiteral which is an initializer of constant arrays in global scope.

This patch has a known issue of getting access to characters past the end of the literal. The declaration, in which the literal is used, is an implicit cast of kind array-to-pointer. The offset should be in literal length's bounds. This should be distinguished from the states in the Standard C++20 (dcl.init.string) 9.4.2.3.
Example:

const char arr[42] = "123";
char c = arr[41]; // OK
const char * const str = "123";
char c = str[41]; // NOK

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ASDenysPetrov edited the summary of this revision. (Show Details)

Adjusted the patch according to changes in the parent revision D104285.

Looks great.

clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1640–1644

This seems like a huge hack. Do we really need this? I think we should account for this case at the initialization of the mentioned array...
Leave it as-is right now, but eh, ugly.

1645

You could use the uint64_t type here, and spare the subsequent explicit cast. This operation would be safe since Idx must be non-negative here.

1807–1808

So, your patch is NFC except for this part, which applies the very same logic for global initializers.
Am'I right?

clang/test/Analysis/initialization.cpp
138

Ah, it's somewhat confusing.
At first, when I looked at it, I assumed that this array has 6 elements as its name suggests.
But it has actually 5 elements.

162–163

You could inline the -42 without changing any expected behavior.
It would make the test terser IMO. The same applies to the other case.

166–172

I'm not sure if I follow. The TODO implies to me that this case is about multidimensional arrays, but it's actually not.
glob_arr6 is of type const char[5]
Could you clarify this?
BTW, at first glance, the gist of this case is the same as the glob_invalid_index7.
Why does this behave differently? I'm puzzled.

@ASDenysPetrov This looks promising! Please address the nits which @steakhal found, than I think it is good to land.

clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1645

+1 for using uint64_t if possible

ASDenysPetrov added inline comments.Oct 4 2021, 6:38 AM
clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1645

SL->getCodeUnit returns uint32_t, so I decided to keep the result value consistent if any changes required in the future.

ASDenysPetrov added inline comments.Oct 4 2021, 6:47 AM
clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1645

Oh. discard my previous reply. I mistakenly considered a wrong line.
The correct reply:
As you can see, preceding condition (Idx < 0) guarantee Idx to be non-negative. But if you suggest this to be explicitly visible, then OK, I don't mind. I've just kept it int64_t for consistency with return value as well.

ASDenysPetrov added inline comments.Oct 4 2021, 7:19 AM
clang/test/Analysis/initialization.cpp
138

6 in glob_arr6 is like a serial number. Means that there are glob_arr5 and glob_arr4 and so for above.

162–163

glob_arr6[-42] this provokes AST parser to emit a warning before reaching to CSA checks. Separate variable allows to avoid AST parser checks to let CSA be engaged.

166–172

Correct. Thanks! The consequence of copy-paste. I'll fix.

Updated according to suggestions.

I am close to accepting this. However, D111542 should be its parent patch. I think similar issues could arise here as well, so, redecl chain tests would be really beneficial.

I am close to accepting this. However, D111542 should be its parent patch. I think similar issues could arise here as well, so, redecl chain tests would be really beneficial.

I apologize for the delay, I'm still working on changes. I'll update it soon.

Why does glob_invalid_index7() and glob_invalid_index8() differ in behavior?
I would expect that the analyzer produces the same Loc symbolic value for both cases thus, the array access should result in the same behavior regardless if glob_arr6 is used, or acquired a pointer and using that in a subsequent operation.
Could you elaborate on this?

clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1646

Instead of this comment, you can put an assert(Idx.isStrictlyPositive()) here.

Rebased. Improved behavior to make glob_invalid_index8 case passed and some other cases. Added more tests. Added tests for universal characters.

Why does glob_invalid_index7() and glob_invalid_index8() differ in behavior?
I would expect that the analyzer produces the same Loc symbolic value for both cases thus, the array access should result in the same behavior regardless if glob_arr6 is used, or acquired a pointer and using that in a subsequent operation.
Could you elaborate on this?

You're right about the same Loc. There were just limitations of the previous implementation. I've fixed it.

clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1646

I reworked this case and we don't need it anymore.

@martong

I am close to accepting this. However, D111542 should be its parent patch. I think similar issues could arise here as well, so, redecl chain tests would be really beneficial.

I think I'm done. You can proceed the review.

martong added inline comments.Oct 21 2021, 9:30 AM
clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1711–1716

I am wondering why this hunk is needed? getSValFromInitListExpr is handling this case at L1725, isn't it?

ASDenysPetrov added inline comments.Oct 21 2021, 5:23 PM
clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1711–1716

L1710 handles const char x[] = "abc";
L1725 handles const char x[] = {"abc"};
I think I should give an example in comments.

martong added inline comments.Oct 22 2021, 12:25 AM
clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1711–1716

Okay, thanks for the explanation.

I think I should give an example in comments.

Yes, that's a good idea!

Fixed test unpassing in array-struct-region.c. Added examples in comments according to suggestions.

ASDenysPetrov edited the summary of this revision. (Show Details)Oct 22 2021, 8:31 AM
martong added inline comments.Oct 25 2021, 2:57 AM
clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1758–1761

Thanks for updating the patch! However, this FIXME makes me worried. Do you think you could pass the Decl itself to getSValFromInitListExpr in order to be able to check whether the type is a pointer or an array?

ASDenysPetrov added inline comments.Oct 25 2021, 6:57 AM
clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1758–1761

This worried me as well. Currently I can't find a way to get the Decl for SL.

@martong (inline)

clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1758–1761

We can load this as is for now with mentioning the known issue.

martong added inline comments.Oct 27 2021, 6:27 AM
clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1758–1761

This might cause some itchy false positives. Perhaps, we could address this in a follow-up patch and then commit them together?

It looks good to me. I don't mind that FIXME. That being said, I'm not even sure it's a FIXME.
Check my comment inline about this.

clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1758–1761

Currently I can't find a way to get the Decl for SL.

Why do you need a Decl? The SVal's gotta be an Element{"123",41 S64b,char} for the example in the comment (*).

(*) With a minor modification: https://godbolt.org/z/7zhGMnf7P

template <class T> void clang_analyzer_dump(T);
const char * const str = "123";
const char * str2 = "123";
void top() {
  clang_analyzer_dump(&str[41]);  // &Element{"123",41 S64b,char}
  clang_analyzer_dump(&str2[41]); // &Element{SymRegion{reg_$0<const char * str2>},41 S64b,char}
}
1803–1805

I think it would be better to have the check before we calculate the Offset. At least that way I find it easier to follow.

I'm gonna add docs to getSValFromStringLiteral and update the patch. And of course will fix your nits.

clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1758–1761

Because only Decl can tell us whether it is a const char str[42] or const char * const str. We can't say anything just looking at SVal.

1758–1761

This might cause some itchy false positives. Perhaps, we could address this in a follow-up patch and then commit them together?

This will produce not more FP as before, even less. I didn't change the behavior specifically here. I just found the issue point to it.

1803–1805

Yes, it's just my omission.

Added doc comment to getSValFromStringLiteral function.
Fixed minor code defects.

ASDenysPetrov marked an inline comment as done.Oct 27 2021, 4:20 PM

I'm still worried about regressions.
Please split the patch into two by separating the tests into an NFC patch, on which you would apply the behavioral change.
That way it would be clear what and why changed. It would also help us to see what previously had defects you plan to preserve and annotate with FIXMEs.

clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1758–1761

I probably still miss the point. https://godbolt.org/z/EMhbq3745
Since the case you mention is actually represented by a NonParamVarRegion which holds a pointer to its decl.

const char str3[43] = "123";
void top() {
  clang_analyzer_dump(&str3[41]); // &Element{str3,41 S64b,char}
  //                    NonParamVarRegion  ---^^^^ (it can print the name of the *decl*)
}

What I wanted to highlight is, that it's a non-issue. In your example you had a non-const global variable, thus we could not infer any meaningful initial value for it, and that is actually the expected behavior. As soon as I marked the str pointer const (along with its pointee), suddenly the analyzer can infer its initial value.

martong added inline comments.Oct 28 2021, 1:40 AM
clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1758–1761

But @steakhal, we have this case in the test file with a FIXME that is not aligned with your observation.

char const *const glob_ptr8 = "123";
void glob_ptr_index4() {
  clang_analyzer_eval(glob_ptr8[0] == '1');  // expected-warning{{TRUE}}
  clang_analyzer_eval(glob_ptr8[1] == '2');  // expected-warning{{TRUE}}
  clang_analyzer_eval(glob_ptr8[2] == '3');  // expected-warning{{TRUE}}
  clang_analyzer_eval(glob_ptr8[3] == '\0'); // expected-warning{{TRUE}}
  // FIXME: Should be UNDEFINED.
  // We should take into account a declaration in which the literal is used.
  clang_analyzer_eval(glob_ptr8[4] == '\0'); // expected-warning{{TRUE}}
}
1758–1761

This might cause some itchy false positives. Perhaps, we could address this in a follow-up patch and then commit them together?

This will produce not more FP as before, even less. I didn't change the behavior specifically here. I just found the issue point to it.

Would it be a FP introduced by this patch? Or we would report the "bug" in the baseline too?

char const *const glob_ptr8 = "123";
int glob_ptr_index4(int x) {
  return x/glob_ptr8[4]; // Div by zero 
}

Actually, we should report "accessing garbage or undefined" instead of div zero.

steakhal added inline comments.Oct 28 2021, 1:43 AM
clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1758–1761

Actually, separating the tests into a parent patch would highlight these differences.

ASDenysPetrov added inline comments.Oct 28 2021, 4:04 AM
clang/lib/StaticAnalyzer/Core/RegionStore.cpp
1758–1761

In your example you had a non-const global variable

Yes, because my example is about offsets and const-ness has no meaning in it. It was omitted for lower verbosity.

What I wanted to highlight is, that it's a non-issue.

@martong is right. Please, pay attention to the NOTE. It says that there is no problem with the array declration, but it is with the pointer one. We can't get a Decl from StringRegion: https://godbolt.org/z/hhvasM8dh

const char * const str = "123";
void top() {
  clang_analyzer_dump(&str[41]);  // &Element{"123",41 S64b,char}
                                  // StringRegion^
}

Would it be a FP introduced by this patch? Or we would report the "bug" in the baseline too?
Actually, we should report "accessing garbage or undefined" instead of div zero.

Yes, It should warn "accessing garbage or undefined". The problem exists right now in the baseline: https://godbolt.org/z/6Exejd149
As I said I didn't change the behavior here. The root of the problem is the same as I described above. FP will still be presented after the patch. I just found and pointed to the particular bug.

Actually, separating the tests into a parent patch would highlight these differences.

I don't see strong reasons for splitting the patch after you get the point. The solution is quite coherent, so I prefer not to do so.

Lastly, I update the comment in the NOTE to make it clearer.

steakhal accepted this revision.Oct 28 2021, 5:20 AM

There is no need to do anything with this. You can commit as-is.
I've done it myself, and it seems like your patch only turns TRUE for the cases where it was UNKNOWN or FALSE previously.
So these test cases were changed by your test, and the rest of them were preserved in your patch:
glob_array_index5(), glob_ptr_index3(), glob_array_index6(), glob_ptr_index5(), glob_ptr_index6(), glob_ptr_index7()

Next time I would rather have two patches, one of which adds tests to document the existing behavior and a follow-up changing the behavior.

This revision is now accepted and ready to land.Oct 28 2021, 5:20 AM
ASDenysPetrov edited the summary of this revision. (Show Details)

Updated comments briefly explaining the root of the existing issue.

Thanks, @steakhal and @martong. I appreciate your efforts!

Next time I would rather have two patches, one of which adds tests to document the existing behavior and a follow-up changing the behavior.

It makes sense. I will take this into account with the following changes.

martong accepted this revision.Oct 29 2021, 5:31 AM

Nice work! Thanks!