This is an archive of the discontinued LLVM Phabricator instance.

[analyzer][RFC] Simplify MetadataSymbol representation and resolve a CStringChecker FIXME
Needs ReviewPublic

Authored by steakhal on Aug 24 2020, 4:15 AM.

Details

Summary

The SymbolMetadata handing was slightly flawed.
Given the following example:

void strcat_models_dst_length_even_if_src_dies(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
}

A metadata symbol was assigned to the cstring length map, representing the new length of the dst:
'dst -> meta{src} + 4' where meta{src} represents the cstring length of the region of src.
Unfortunately, this information was immediately removed from the model, since the src become dead after the evaluation of calling strcat.
This results in the removal of all symbolic expressions (eg metadata symbols) that refers to the dead src memory region.

As of now, MetadataSymbols are used only by the CStringChecker, so this fix is tightly coupled to fixing the resulting issues of that checker.
There was already a FIXME in the test suite for this - which is now resolved.


There was a lengthy discussion on the mailing list how to resolve this issue.
In that mailing, @NoQ proposed these steps, which I have implemented in this patch:

  1. Remove statement, location context, block count from SymbolMetadata's identity. Let it solely depend on the region.
  2. Get rid of the metadata-in-use set. From now on SymbolMetadata, like SymbolRegionValue, is live whenever its region is live.
  3. When strlen(R) is used for the first time on a region R, produce $meta<size_t R> as the answer, but *do not store* it in the string length map. We don't need to change the state because the state didn't in fact change. On subsequent strlen(R) calls we'll simply return the same symbol (because the map will remain empty), until the string is changed.
  4. If the string is mutated, record its length as usual. But if the string is invalidated (or the new length is completely unknown), do not remove the length from the state; instead, set it to SymbolConjured. Only remove it from the map when the region dies.
  5. Keep string length symbols alive as long as they're in the map.

Diff Detail

Event Timeline

steakhal created this revision.Aug 24 2020, 4:15 AM
steakhal requested review of this revision.Aug 24 2020, 4:15 AM
martong added inline comments.Aug 24 2020, 6:31 AM
clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
198

I think the comments should be updated as well: remove "in use" and refer to the aliveness of the memregion.

clang/lib/StaticAnalyzer/Checkers/CStringChecker.cpp
2417

Seems like checkDeadSymbols could have a generic implementation. Perhaps in the following form:

class CStringChecker : public Checker<
                                         check::DeadSymbols<CStringLength>

But this should be done in a separate patch.

Maybe it would make sense to have a generic default check::LiveSymbols<GDM0, GDM1, ...> implementation as well. In that implementation we could iterate over the GDM entries and could mark the dependent (sub)symbols live.

(Note, this is a reminder from our verbal discussion with @steakhal.)

clang/test/Analysis/string.c
1544

It would make sense to split this into two. Only one of them should be in the FIXMEs section with the {{UNKNOWN}}. TRUE expectations could be moved out from the FIXMEs section.

steakhal added inline comments.Aug 24 2020, 6:33 AM
clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
441

Why do we even need the tag?
Why is it defaulted to nullptr if it asserts later that the tag must not be null.
I'm confused, somebody help me out :D

516–522

Is it true for modeling checkers too?

524–532

I think it should be removed, but for now I left it to draw attention in the review process.

steakhal added inline comments.Aug 24 2020, 6:39 AM
clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
198

Definitely, thanks.

clang/test/Analysis/string.c
1544

This patch does not interfere with the test case below.

However, it would be nice to say that calling strlen on non-null-terminated strings is UB.
Thus expecting the clang_analyzer_eval(strlen(str) >= 5); to be TRUE does not make sense much IMO.

NoQ added inline comments.Aug 24 2020, 11:23 PM
clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
441

The tag is necessary so that different checkers could attach different metadata to the same region (possibly even of the same type). A null value indeed makes no sense here.

516–522

This comment is much older than the concept of a "modeling checker" and it has never been true. Checkers need to mark things live. It's part of life! (no pun intended) Like, i agree that non-modeling checkers probably don't need to use this method, but that wasn't what the author was trying to say :) And also there's nothing special about SymbolMetadata; any data that the checker helps keep track of may need to be marked live by the checker, so markInUse() is insufficient.

clang/lib/StaticAnalyzer/Checkers/CStringChecker.cpp
2417

Nope, there can't be a generic implementation. It depends a lot on the data structure tracked by the checker (the checker's technically allowed to track "a map from regions to maps from lists of expressions to sets of SVals", good luck coming up with a generic checkLiveSymbols() for that) and it most likely isn't uniquely determined by the data structure either.

But what we can do is provide half-baked implementations for the common situations such as "a map from regions to symbols" (eg. this checker, smart pointer checker) or "a map from symbols to an enum of states the object associated with the symbol is in" (malloc checker, stream checker). And even better, we could provide half-baked implementations of entire state traits, not just checkDeadSymbols
/checkLiveSymbols().

2429

Ok, this doesn't look correct (looks like it never was). It's liveness of the region that's important to us, not liveness of the symbol. Because it's the liveness of the region that causes the program to be able to access the map entry.

steakhal marked an inline comment as done.Aug 26 2020, 1:28 AM
steakhal added inline comments.
clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
441

Thanks.

516–522

Ok, I will rephrase the comment expressing that this should be used only in modeling checkers.

clang/lib/StaticAnalyzer/Checkers/CStringChecker.cpp
2417

I think we can postprone the experiments with the generic checkDeadSymbols/checkLiveSymbols implementations.
This patch does not aim to solve that.

2429

Let's say we have this:

void foo() {
  char *p = malloc(12);
  // strlen(p); // no-leak if strlen called, leak warning otherwise...
} // expected-warning {{leak}}

If we were marking the region live here, we would miss the leak warning. That warning is only triggered if the region of the p becomes dead. Which will never become dead if we have a cstring length symbol associated to that region.
I came to this conclusion after implementing your suggested edit above (checking regions instead of symbols).

But if the string is invalidated (or the new length is completely unknown), do not remove the length from the state; instead, set it to SymbolConjured.

Where exactly do you implement the above?

When strlen(R) is used for the first time on a region R, produce $meta<size_t R> as the answer, but *do not store* it in the string length map.

And this?

clang/lib/StaticAnalyzer/Checkers/CStringChecker.cpp
2379

It's just a tiny nit. But, perhaps we could come up with a more meaningful name instead of Entry and Entries. Maybe Length ?

2384–2385

Umm, these two lines are quite disturbing after each other. Seems like first we remove MR then we add MR again, so, this must not be a noop, right? But then what is happening here exactly? Some comments around here in the code would help.

2429

Is it possible to have two string length symbols that are associated to the same region? Could that cause any problem?
E.g. what will happen below? Will we remove MR twice?

char *p = "asdf"
char *q = p + 1;
strlen(p); strlen(q);
martong added inline comments.Sep 8 2020, 5:39 AM
clang/lib/StaticAnalyzer/Checkers/CStringChecker.cpp
2384–2385

Ugh, giving it more thought, we just reset the Value associated to MR. Still, would be nice to comment this there.

But if the string is invalidated (or the new length is completely unknown), do not remove the length from the state; instead, set it to SymbolConjured.

Where exactly do you implement the above?

When strlen(R) is used for the first time on a region R, produce $meta<size_t R> as the answer, but *do not store* it in the string length map.

And this?

Hm yes. I missed them. I will extend the current implementation, rethinking the invalidation cases, when to store and what etc.
It will bloat this patch, but indeed - necessary. Thank you!

However, as I planned to clean the CStringChecker a little bit up, it seems reasonable to me to land those patches before I start working on this to save me from some rebasing :|

What do you think about these patches:

  • D84316 [analyzer][NFC] Split CStringChecker to modeling and reporting
  • D84979 [analyzer][NFC] Refine CStringLength modeling API

If you like that direction, I can rebase this on top of that. That would help me a lot, to know when & how a given metadata symbol would be used, and check if we indeed handle the invalidation, etc. in the right way.

clang/lib/StaticAnalyzer/Checkers/CStringChecker.cpp
2379

We will see. Thanks.

2384–2385

Exactly. I will do that, ok.

NoQ added inline comments.Sep 8 2020, 1:04 PM
clang/lib/StaticAnalyzer/Checkers/CStringChecker.cpp
2429

Which will never become dead if we have a cstring length symbol associated to that region.

That's not how it's supposed to work. Symbols don't keep their associated regions alive, only regions keep symbols associated with them alive.

Don't manipulate region liveness manually at all here. Regardless of strlen(), the heap region dies with p which is the last variable to refer to it. Rely on it to garbage-collect the symbol for strlen(p) but don't actively mutate it.

NoQ added inline comments.Sep 8 2020, 1:06 PM
clang/lib/StaticAnalyzer/Checkers/CStringChecker.cpp
2429

E.g. what will happen below?

These are different regions. Your p is Element{0, "asdf", char}, whereas your q is Element{1, "asdf", char}.