This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] CERTStrChecker: Model gets()
AbandonedPublic

Authored by Charusso on Nov 4 2019, 11:10 AM.

Details

Reviewers
NoQ
Summary

This checker warn on gets() based on the following rules:
https://wiki.sei.cmu.edu/confluence/display/c/STR31-C.+Guarantee+that+storage+for+strings+has+sufficient+space+for+character+data+and+the+null+terminator

It also tries to rewrite the bad code with the help of FixItHints.

Diff Detail

Event Timeline

Charusso created this revision.Nov 4 2019, 11:10 AM
NoQ added inline comments.Nov 4 2019, 11:43 AM
clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
125

Again, you will have to highlight the allocation site with a note. Therefore you will have to write a bug visitor that traverses the size expression at some point (or, equivalently, a note tag when the size expression is evaluated). Therefore you don't need to store the expression in the program state.

clang/test/Analysis/cert/str31-alloc.cpp
43

The fix is not correct. It should be sizeof(buf3) - 1, otherwise you still have a buffer overflow.

NoQ added inline comments.Nov 4 2019, 11:53 AM
clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
185

Also, which is probably more important, you will never be able to provide a fixit for the malloced memory case, because there may be multiple execution paths that reach the current point with different size expressions (in fact, not necessarily all of them are malloced).

Eg.:

char *x = 0;
char y[10];

if (coin()) {
  x = malloc(20);
} else {
  x = y;
}

gets(x);

If you suggest replacing gets(x) with gets_s(x, 20), you'll still have a buffer overflow on the else-branch on which x points to an array of 10 bytes.

Charusso added inline comments.Nov 4 2019, 12:32 PM
clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
125

Yes, you have pointed out the necessary visitor, but it needs more thinking.

I have a memory region which could be any kind of "memory block region" therefore I have no idea where is the size expression. We are supporting ~20 different allocations, which is nothing compared to the wild with the not so uncommon 5+ parameter allocators. Therefore I still do not want to reverse engineer a small MallocChecker + ExprEngine + BuiltinFunctionChecker inside my checker. They provide the necessary DynamicSizeInfo easily, which could be used in at least 4 checkers at the moment (which I have pointed out earlier in D68725).

If I have the size expression in the dynamic size map, and I can clearly point out the destination buffer, it is a lot more simplified to traverse the graph where the buffer and its size comes from.

185

This checker going to evolve a lot, and all of the checked function calls have issues like that. I do not even think what else issues they have. I would like to cover the false alarm suppression when we are about to alarm. Is it would be okay? I really would like to see alarms first.

For example, I have seen stuff in the wild so that I can state out 8-param allocators and we need to rely on the checkers provide information about allocation.

clang/test/Analysis/cert/str31-alloc.cpp
43

Good catch, thanks! I was really into the pretty-printing, we should not fix-it.

Charusso added inline comments.Nov 4 2019, 2:21 PM
clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
125

Well, you really do not want to store SizeExpr of malloc(SizeExpr) and you are right I will have to traverse from it to see whether the SizeExpr is ambiguous or not, where it comes from.

I want to rely on the trackExpressionValue() as the SizeExpr is available by getDynamicSizeExpr(), so it is one or two lines of code.

Would you create your own switch-case to see where is the size expression goes in the allocation and use trackExpressionValue() on it? So that you do not store information in the global state which results in better run-time / less memory.

At first I really wanted to model malloc() and realloc() and stuff, then I realized the MallocChecker provides every information I need. Would it be a better idea to create my own tiny MallocChecker inside my checker which does nothing but marks the size expression being interesting with NoteTags?

Also I am thinking of a switch-case on the DefinedOrUnknownSVal Size which somewhere has an expression inside it which I could trackExpressionValue() on.

Basically we are missing the rules what to use and I have picked the easiest solution. Could you share please which would be the right direction for such a simple task?

NoQ added inline comments.Nov 4 2019, 2:34 PM
clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
125

I want to rely on the trackExpressionValue() as the SizeExpr is available by getDynamicSizeExpr(), so it is one or two lines of code.

This won't work. trackExpressionValue() can only track an active expression (that has, or at least should have, a value in the bug-node's environment). You'll have to make a visitor or a note tag.

You can either make your own visitor (which will detect the node in which the extent information becomes present), or convert MallocChecker to use note tags and then inter-operate with those tags (though the interestingness map - "i mark the symbol as interesting so i'm interested in highlighting the allocation site" - or a similar mechanism). The second approach is more work because no such interoperation has ever been implemented yet, but it should be pretty rewarding for the future.

NoQ added inline comments.Nov 4 2019, 2:41 PM
clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
185

*summons @Szelethus*

Apart from the obviously syntactic cases, you might actually be able to implement fixits for the situation when the reaching-definitions analysis displays exactly one definition for x, which additionally coincides with the allocation site. If that definition is a simple assignment, you'll be able to re-run the reaching definitions analysis for the RHS of that assignment. If that definition comes from a function call, you might be able to re-run the reaching definitions analysis on the return statement(s) of that function (note that this function must have been inlined during path-sensitive analysis, otherwise no definition in it would coincide with the allocation site). And so on.

This problem sheds some light on how much do we want to make the reaching definitions analysis inter-procedural. My current guess is that we probably don't need to; we'd rather have this guided by re-running the reaching-definitions analysis based on the path-sensitive report data, than have the reaching-definitions analysis be inter-procedural on our own.

Charusso added a comment.EditedNov 4 2019, 3:10 PM
This comment has been deleted.
clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
125

This won't work. trackExpressionValue() can only track an active expression (that has, or at least should have, a value in the bug-node's environment). You'll have to make a visitor or a note tag.

So because most likely after the malloc() the size symbol dies, the trackExpressionValue() cannot track dead symbols? Because we could make the size dying base on the buffer, we have some dependency logic for that. It also represents the truth, the size is part of that memory block's region. After that we could track the expression of the size?

NoQ added inline comments.Nov 4 2019, 3:15 PM
clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
125

So because most likely after the malloc() the size symbol dies...?

After the malloc() is consumed, the size expression dies and gets cleaned up from the Environment. The symbol will only die if the value wasn't put into the Store in the process of modeling the statement that consumes the malloc() expression (such as an assignment). But trackExpressionValue() can only track live (active) expressions.

Charusso updated this revision to Diff 227867.Nov 5 2019, 7:04 AM
Charusso marked 6 inline comments as done.
  • Use existing visitors.
  • Make the MallocBugVisitor available for every checker.
  • Fix duplication of fix-its on the warning path piece when we emit a note as well.
clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
125

I see. Now I have tried out what we have. The trackExpressionValue() has a lookup to see where is the expression available:

/// Find the ExplodedNode where the lvalue (the value of 'Ex')                   
/// was computed.                                                                
static const ExplodedNode* findNodeForExpression(const ExplodedNode *N,          
                                                 const Expr *Inner) {            
  while (N) {                                                                    
    if (N->getStmtForDiagnostics() == Inner)                                     
      return N;                                                                  
    N = N->getFirstPred();                                                       
  }                                                                              
  return N;                                                                      
}

from that point the expression was alive, and tracking is fine.


The InnerPointerChecker has introduced a place: AllocationState.h to communicate with the MallocBugVisitor. I believe this is the simplest way to communicate.

185

That is a cool idea! I hope @Szelethus has time for his project.

Charusso updated this revision to Diff 227888.Nov 5 2019, 8:34 AM
Charusso marked 2 inline comments as done.
Charusso retitled this revision from [analyzer][WIP] CERTStrChecker: Model gets() to [analyzer] CERTStrChecker: Model gets().
  • Do not try to fix-it an array with offsets.
Charusso updated this revision to Diff 227892.Nov 5 2019, 8:46 AM
  • Support alloca().

Hmm, so this checker is rather a collection of CERT rule checkers, right? Shouldn't the checker name contain the actual rule name (STR31-C)? User interfacewise, I would much prefer smaller, leaner checkers than a big one with a lot of options, which are barely supported for end-users. I would expect a cert package to contain subpackages like cert.str, and checker names cert.str.31StringSize, or similar. Also, shouldn't we move related checkers from security.insecureAPI to cert? Or just mention the rule name in the description, and continue not having a cert package?

clang/include/clang/StaticAnalyzer/Checkers/Checkers.td
70

Hmm. We have a variety of checkers that check for a CERT rule. Maybe we should put the rest here as well, if would better follow the clang-tidy interface.

I'll ask around in the office.

clang/lib/StaticAnalyzer/Checkers/AllocationState.h
29

I would prefer if this header file didn't exist, or was thought out better, because its messy that we hide MallocChecker, but expose its guts like this.

The change is fine, this is just a critique of the checker.

clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
185

This sounds very cool! Once we're at the bug report construction phase, we can make reaching definitions analysis "interprocedural enough" for most cases, I believe.

Hmm, so this checker is rather a collection of CERT rule checkers, right? Shouldn't the checker name contain the actual rule name (STR31-C)? User interfacewise, I would much prefer smaller, leaner checkers than a big one with a lot of options, which are barely supported for end-users. I would expect a cert package to contain subpackages like cert.str, and checker names cert.str.31StringSize, or similar.

It is the STR rules of CERT, nothing else. Most of the rules are tied together, and that is why the checker needs to be designed as one checker at first. I am not sure which part of the STR I will cover, so may when the checker evolves and some functions does not need the same helper methods we need to create new checkers. STR31 and STR32 are my projects which is like one single project. Also I did not except the users to specify the rule number, but this checker could be something like cert.str.Termination. There is two floating-point CERT checkers inside the insecureAPI that is why I have introduced the cert package, which will have three members, one is that new checker. I think a new package is only necessary if it contains at least two checkers.

Also, shouldn't we move related checkers from security.insecureAPI to cert? Or just mention the rule name in the description, and continue not having a cert package?

We should not, they does not fit into CERT rules, but it has two CERT floating-point checkers. The cert package should be well described with CERT rules. I want to move the CERT checkers from it into that cert package, and leave the rest.

Hmm, so this checker is rather a collection of CERT rule checkers, right? Shouldn't the checker name contain the actual rule name (STR31-C)? User interfacewise, I would much prefer smaller, leaner checkers than a big one with a lot of options, which are barely supported for end-users. I would expect a cert package to contain subpackages like cert.str, and checker names cert.str.31StringSize, or similar.

It is the STR rules of CERT, nothing else. Most of the rules are tied together, and that is why the checker needs to be designed as one checker at first. I am not sure which part of the STR I will cover, so may when the checker evolves and some functions does not need the same helper methods we need to create new checkers. STR31 and STR32 are my projects which is like one single project. Also I did not except the users to specify the rule number, but this checker could be something like cert.str.Termination. There is two floating-point CERT checkers inside the insecureAPI that is why I have introduced the cert package, which will have three members, one is that new checker. I think a new package is only necessary if it contains at least two checkers.

Implementationwise that sounds wonderful, these rules are fairly similar to have a single checker responsible for them. The user interface however (enabling different CERT rules) they should probably be split up into separate checkers per rule, rather than options, wouldn't you agree? @NoQ? Also, cert.str.Termination sounds like a wonderful name, I don't insist on the rule number being present in it, but at the very least, it should be in the description.

Also, shouldn't we move related checkers from security.insecureAPI to cert? Or just mention the rule name in the description, and continue not having a cert package?

We should not, they does not fit into CERT rules, but it has two CERT floating-point checkers. The cert package should be well described with CERT rules. I want to move the CERT checkers from it into that cert package, and leave the rest.

I meant related ones only, though I didn't go through the checkers individually, it might just be the case that insecureApi doesn't implement any specific CERT rules :)

Hmm, so this checker is rather a collection of CERT rule checkers, right? Shouldn't the checker name contain the actual rule name (STR31-C)? User interfacewise, I would much

Implementationwise that sounds wonderful, these rules are fairly similar to have a single checker responsible for them. The user interface however (enabling different CERT rules) they should probably be split up into separate checkers per rule, rather than options, wouldn't you agree? @NoQ?

I'm not @NoQ, but I do agree that there should be a separate check per rule in terms of the UI presented to the user. The name should follow the rule ID like they do in clang-tidy, for some consistency there.

Also, cert.str.Termination sounds like a wonderful name, I don't insist on the rule number being present in it, but at the very least, it should be in the description.

I think that the rule number should be in the name. I'd probably go with cert.STR31-C or cert.str31-c (so it's clear which CERT standard the rule came from).

I'm not @NoQ, but I do agree that there should be a separate check per rule in terms of the UI presented to the user. The name should follow the rule ID like they do in clang-tidy, for some consistency there.
I think that the rule number should be in the name. I'd probably go with cert.STR31-C or cert.str31-c (so it's clear which CERT standard the rule came from).

We warmly welcome not (@NoQ)s! I think Artem really wanted to start this direction to make the two tool work together, but I have seen his project is unbelievably difficult so that it is a little-bit far away, sadly. Even we are far away to have multiple CERT rules in this package, if the Tidy users like the code-names, I cannot say no to start the collaboration with Tidy. I would pick cert.str.31-c, as @Szelethus pointed out we use lower-case words for package names and then we can run every cert.str checker at once.

Thanks for the ideas @Szelethus, @aaron.ballman!

I'm not @NoQ, but I do agree that there should be a separate check per rule in terms of the UI presented to the user. The name should follow the rule ID like they do in clang-tidy, for some consistency there.
I think that the rule number should be in the name. I'd probably go with cert.STR31-C or cert.str31-c (so it's clear which CERT standard the rule came from).

We warmly welcome not (@NoQ)s! I think Artem really wanted to start this direction to make the two tool work together, but I have seen his project is unbelievably difficult so that it is a little-bit far away, sadly. Even we are far away to have multiple CERT rules in this package, if the Tidy users like the code-names, I cannot say no to start the collaboration with Tidy. I would pick cert.str.31-c, as @Szelethus pointed out we use lower-case words for package names and then we can run every cert.str checker at once.

Would it make sense to use cert.str.31.c to remove the random dash? Would this also help the user to do something like cert.str.*.cpp? if they want just the CERT C++ STR rules checked? Or can they do that already even with the -?

Would it make sense to use cert.str.31.c to remove the random dash? Would this also help the user to do something like cert.str.*.cpp? if they want just the CERT C++ STR rules checked? Or can they do that already even with the -?

Well, we could introduce package cert.str.c and cert.str.cpp and then the rule-number follows: cert.str.c.31 where the 31 is the name of the checker in this case, which sounds very strange. @Szelethus is the code owner of our frontend so I would wait how he imagine the layout. As I know to enable every C checker of the package cert.str we need to create a c package because we do not have such a logic to put * in the package name before the checker's name and the package c clarify what the user wants to do. Now I have checked your cert.str.cpp page [1] and I think the cert.str.cpp invocation could invoke the cert.str.c as a dependency, because the c rules apply to cpp as you have written.

On the other hand we are trying to avoid larger scope changes than the necessary because we do not know when cert.str.c would contain at least two checkers. That is why I was so minimal and only introduced the package cert because we already have two FLP checkers inside the insecureAPI base-checker.

[1] https://wiki.sei.cmu.edu/confluence/pages/viewpage.action?pageId=88046330

Would it make sense to use cert.str.31.c to remove the random dash? Would this also help the user to do something like cert.str.*.cpp? if they want just the CERT C++ STR rules checked? Or can they do that already even with the -?

Well, we could introduce package cert.str.c and cert.str.cpp and then the rule-number follows: cert.str.c.31 where the 31 is the name of the checker in this case, which sounds very strange. @Szelethus is the code owner of our frontend so I would wait how he imagine the layout.

I wouldn't want to go with that approach because it confuses the names from the coding standard it's meant to check. I think a good policy is to try to keep the check names similar to the coding standard names whenever possible (regardless of the coding standard).

As I know to enable every C checker of the package cert.str we need to create a c package because we do not have such a logic to put * in the package name before the checker's name and the package c clarify what the user wants to do. Now I have checked your cert.str.cpp page [1] and I think the cert.str.cpp invocation could invoke the cert.str.c as a dependency, because the c rules apply to cpp as you have written.

Yes, the C++ rules incorporate some of the C rules, but not all of them, which kind of complicates things. The STR section happens to take everything from the C STR section.

On the other hand we are trying to avoid larger scope changes than the necessary because we do not know when cert.str.c would contain at least two checkers. That is why I was so minimal and only introduced the package cert because we already have two FLP checkers inside the insecureAPI base-checker.

Understandable.

[1] https://wiki.sei.cmu.edu/confluence/pages/viewpage.action?pageId=88046330

Charusso updated this revision to Diff 228675.Nov 11 2019, 5:14 AM
  • The packaging have not been addressed yet.
  • Inject the "zombie" size expression to the new function call (fgets) if none of the size expression's regions have been modified.

The idea is that: When we set up a variable size = 13; it modifies the region, but the size expression is not stored yet, so we do not invalidate anything. We store the malloc(size + 1)'s size, after that the dead-symbol-purging kick in and it either invalidate the region or makes it keep alive.

  • If the region of size is alive after the purge point we cannot inject the "zombie" size + 1 as an expression, we need to obtain its concrete value: 14. (When the redefinition happen I wanted to create a NoteTag, but I have not seen a simple way to do so.)
  • If the region of size has been purged out, it is safe to copy-and-paste the "zombie" size + 1 as an expression.
NoQ added inline comments.Nov 13 2019, 11:13 AM
clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
201–207

All right, so basically what you're saying is that literally every invocation of gets() deserves a warning. This means that for all practical purposes your checker is an AST-based checker, just implemented with path-sensitive callbacks. A path-sensitive checker emits warnings based on multiple events that happen sequentially along the path (use-after-free: "memory deallocated - that same memory used", division by zero: "value constrained to zero - something is being divided by that same value", etc.) but your checker emits the warning by looking at only one statement: "gets() is invoked".

Do i understand correctly that your plan is to use path-sensitive analysis for fixits only? But you can't emit fixits for any truly path-sensitive warning anyway. Fixits must work correctly on all execution paths, so you cannot emit a correct fixit by looking at only one execution path. In order to emit fixits, you need to either resort to a pure AST check anyway ("this expression refers to an array of fixed size"), or maybe implement auxiliary data-flow analysis for a certain must-problem (eg., "the buffer argument may have exactly one possible value across all paths that reach gets()"). But in both cases the path-sensitive engine does literally nothing to help you; all the data that you'll need for your fixit will be available from the AST.

NoQ added inline comments.Nov 13 2019, 11:27 AM
clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
201–207

Like, i think this was an interesting investigation and i was genuinely curious about how this turns out to be, but for now it seems that the problem you're trying to solve cannot be solved this way. Path sensitive analysis is fundamentally applicable to only 50% of the problems (to "may-problems" but not to "must-problems"), and the problem you're trying to solve is in the latter category. I believe you'll have to fall back to the relatively boring task of adding fixits to security.insecureAPI.gets; but then, again, if you manage to employ use-def chains for this problem, that might be quite an inspiring start.

Charusso added inline comments.Nov 13 2019, 11:43 AM
clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
201–207

Most of the time the given allocation to hold the arbitrary string happens in a local scope. After that we see fscanf(dst), gets(dst), memcpy(dst), strncpy(dst), stuff... which pushes new data into that memory block, and then the cool developers write that down: dst[42] = '\0' which means all the reports should be thrown away in a path-sensitive manner on dst. Reallocations, re-bindings, non-AST stuff could handled very easily with the non-AST checker, like that one.

Sometimes we are work with destination-array like memcpy(Foo[Bar.Baz]->Qux, ...) which could not really handled with just a simple AST-based checker. I could not say at the moment we could handle it with symbols, but we have a much larger scope of information by symbols.

Most of the time because of the Analyzer is much smarter than the Tidy we could emit fix-its with the help of flow-sensitiveness very easily. I would create huge white-lists what we want to fix-it, and what we could not, but at some point if we model the symbols better, we can.

Other than that easy false-positive suppression and tiny flow-sensitive rebinding stuff, we could be sure what is going on by each string-manipulation. The gets() is a toy example where at most a grep -rn 'gets(' could do better analysis than us.

The real world looks like that:

1	encryptedpasswordlen = ((strlen(passwd) + RADIUS_VECTOR_LENGTH - 1)
 	/ RADIUS_VECTOR_LENGTH) * RADIUS_VECTOR_LENGTH;
2	cryptvector = palloc(strlen(secret) + RADIUS_VECTOR_LENGTH);
3	memcpy(cryptvector, secret, strlen(secret));
...
4	for (i = 0; i < encryptedpasswordlen; i += RADIUS_VECTOR_LENGTH) {
5	  memcpy(cryptvector + strlen(secret), md5trailer, RADIUS_VECTOR_LENGTH);
...
  • from postgresql/src/backend/libpq/auth.c

At 3 we would emit a warning, because the null-termination left by the wrong size of the string, but at 5 we see that, it left, because at that offset the string continues, and dunno, on 6 when we model every flow-sensitive information, the string left non-terminated.

Of course each of that stuff is local and AST-based (with huge overhead of rebindings and impossible false-positive suppression), but when you have two of it, that is when the fun begins.

Charusso added inline comments.Nov 13 2019, 12:21 PM
clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
201–207

if you manage to employ use-def chains for this problem, that might be quite an inspiring start.

We have regions so we do not need to rely on such chains in the AST-world, if I get your idea right by the "Use-define chain" wiki [1]. Btw. it is not that difficult problem in the AST-world, you need to create a recursive AST-matcher on the DeclRefExpr with std::function.

Basically, I want to implement all the STR rules in a logical order, here is one of the examples from STR32-C [2] which is my last planned project at the moment:

void lessen_memory_usage(void) {
  wchar_t *temp;
  size_t temp_size;
 
  /* ... */
 
  if (cur_msg != NULL) {
    temp_size = cur_msg_size / 2 + 1;
    temp = realloc(cur_msg, temp_size * sizeof(wchar_t));
    /* temp &and cur_msg may no longer be null-terminated */
    if (temp == NULL) {
      /* Handle error */
    }
 
    cur_msg = temp;
    cur_msg_size = temp_size;
    cur_msg_len = wcslen(cur_msg);
  }
}

They really want to represent the wild, and please think of that problem in terms of an AST-checker versus in terms of getAsRegion and getDynamicElementCount to compare the size of the allocated memory block and inject that: cur_msg[temp_size - 1] = L'\0'; because the array would overflow. How cool is the Analyzer and how smart to do so.

It would took at most 10 minutes to implement if the evalBinOp would work or the main 10 years old implementation of obtaining the element-count would work. I am on the way to fixing the latter, but it will be more path-sensitive info, than you could imagine, like reusing the "zombie" size-expression turned out to be a hard problem. And it will be a lot easier to solve such problems, I believe.

The local scope is the key, and that checker at the moment only tries to rewrite destination-arrays which are local. I think we could see if we emit multiple reports on a given call so due to ambiguity we would drop such fix-its. With an AST checker I could not imagine how difficult it would be.

It is rather a research at the moment, because I have encountered dozens of silly stuff, beginning with the getExtent(), so I cannot say this direction is the 100% future, but I have picked the Analyzer over Tidy, for that reason, it is smarter.

[1] https://en.wikipedia.org/wiki/Use-define_chain
[2] https://wiki.sei.cmu.edu/confluence/display/c/STR32-C.+Do+not+pass+a+non-null-terminated+character+sequence+to+a+library+function+that+expects+a+string

NoQ added inline comments.Nov 13 2019, 3:27 PM
clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
201–207

It would took at most 10 minutes to implement if the evalBinOp would work or the main 10 years old implementation of obtaining the element-count would work.

You have multiple different checkers that you want to implement and for each of them there are two parts of the problem:
(1) Emit the warning,
(2) Emit a fixit for that warning.

Path-sensitive analysis is perfect for (1), at least for some checkers (not for the one in this patch).

But if you try to rely on path-sensitive analysis for (2), your result will simply be incorrect for the reason stated above: there may be other execution paths that you haven't taken into account. And when you say

I would create huge white-lists what we want to fix-it

, this literally means repeating a lot of the work you did in D45050, as you have to write down matchers (or a CFG-based data flow analysis) for the cases that you really can fix. At this point you will not only be aware of the allocation site from which you can extract the size-expression, but also of all other possible allocation sites.

So i honestly believe that you should drop the idea of using path sensitive analysis to help you with fixits, and instead focus on the two more-or-less-independent tasks of (a) developing path-sensitive checkers for the CERT problems you're interested in and (b) developing fixits for such checkers in a syntactic manner without relying on the information obtained via the path-sensitive analysis.

Charusso marked 14 inline comments as done.Nov 13 2019, 4:37 PM

Given that we are having two different projects at first let us create the path-sensitive error-catching + false positive suppression + design of the CERT rules, and when we are fine, we get back to the impossible-to-solve problem: to adjust fix-its (path-sensitively).

I have not seen I am creating two different projects, because in my mind the Analyzer would be useless if it would be an expensive grep, so umm, sorry.

clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
201–207

It would took at most 10 minutes to implement if the evalBinOp would work or the main 10 years old implementation of obtaining the element-count would work.

You have multiple different checkers that you want to implement and for each of them there are two parts of the problem:
(1) Emit the warning,
(2) Emit a fixit for that warning.

This string-handling misuse of the language C consists of thousands of entries across random open source projects, that is why I would like to suggest fix-its. With the help of path-sensitive stuff I would like prove the fix-its are being fine.

Path-sensitive analysis is perfect for (1), at least for some checkers (not for the one in this patch).

But if you try to rely on path-sensitive analysis for (2), your result will simply be incorrect for the reason stated above: there may be other execution paths that you haven't taken into account.

I know that the path-sensitive analysis means that: "There is a path". But I also know that there is one single function body where the (allocation, string manipulation, return data) triplet takes place. Here we should take every of the execution paths because we are in a local scope. We could be 100% sure when the obtained size-expression's region is not safe to reuse. I do not think about function-boundaries because the peoples write code like that triplet. Sometimes the size-expression is coming from a function-call, but that should be fine to obtain.

And when you say

I would create huge white-lists what we want to fix-it

, this literally means repeating a lot of the work you did in D45050, as you have to write down matchers (or a CFG-based data flow analysis) for the cases that you really can fix. At this point you will not only be aware of the allocation site from which you can extract the size-expression, but also of all other possible allocation sites.

There is one allocation site 99,99% of the time, where the 0,01% is your counter example in this review. Of course there must be such case in the wild, somewhere, deep in the woods, and of course the most error-prone case is the 0,01%, but the other 99,99% has the same issue, and we could provide fix-its easily if we do not focus mostly on the 0,01%. I want to make my stuff non-alpha, even none of the non-alpha checkers or not the Tidy can provide 100% accuracy, that stuff with fix-its should be 100% accurate. So in case of the 0,01% we detect it and we do not fix-it, that is it. And we are hoping someone creates summary-based analysis, so that we can rewrite the 0,01% as well.

So i honestly believe that you should drop the idea of using path sensitive analysis to help you with fixits, and instead focus on the two more-or-less-independent tasks of (a) developing path-sensitive checkers for the CERT problems you're interested in and (b) developing fixits for such checkers in a syntactic manner without relying on the information obtained via the path-sensitive analysis.

Plot twist: nor the AST-checkers and nor the path-sensitive-checkers could solve this issue. You cannot state out that my approach is bad, so I should change it. I cannot state out my approach is good, so we should go for it.

I believe in that when I dropped my Tidy-career then I have picked the right tool which I can improve to create 100% accuracy and model the impossible-to-model stuffs, like that STR rules. Of course, in a path-sensitive manner, to drop every heuristic, to model everything which necessary with offsets and allocations, and for false positive suppression. The key here, to modify the Analyzer from that statement "There is a path" to "There is only a path", according to the use-case of the string manipulation being a single path.

NoQ added a subscriber: gribozavr.Nov 13 2019, 6:33 PM
NoQ added inline comments.
clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
201–207

There is one allocation site 99,99% of the time, where the 0,01% is your counter example in this review.

It might be 0,01% for this checker (and if so, a simple AST-based solution will work equally well), but for other checkers there will be a lot more problems: like, how do you null-terminate a string if the size of the string depends on the execution path?

The key here, to modify the Analyzer from that statement "There is a path" to "There is only a path", according to the use-case of the string manipulation being a single path.

This is a possible approach to one of our open projects, "Implement a dataflow flamework", which would allow us to solve must-problems as well as may-problems. Like, it sounds easy: if we simply were able to figure out whether the analyzer has explored all paths or not in the current analysis, we will be able to solve may-problems in all cases in which the analyzer has actually explored all paths, by post-processing the ExplodedGraph. I've been advocating for this approach for some time in the past, but this approach is clearly a dead end because in most of the interesting cases (eg., in presence of any sort of loops) it'll reply "i don't know" ("we were clearly unable to explore all paths").

The more principled solution is to make an actual data flow framework in the usual sense (i.e., an API that would allow us to write "must-problem" checks in the same way as we currently write path-sensitive checks but with the extra join operation). I heard that @gribozavr has some plans in this area.

Charusso marked an inline comment as done.Nov 13 2019, 7:46 PM
Charusso added inline comments.
clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp
201–207

how do you null-terminate a string if the size of the string depends on the execution path?

Here is your example a little-bit modified:

char *x = 0;
char y[10];

if (coin()) {
  x = malloc(20);
} else {
  x = y;
}

char src[] = "Foo";
strcpy(x, src);

Because the source string is only available after the allocation we cannot make sure the allocation could hold the entire source object which appropriate allocation would be: malloc(strlen(src) + 1) or something similar. Also the free() is another hard problem immediately.

It is that 0,01% case which you do not encounter in the wild. The usual approach: having a memory block - fill it - return it. But the Static Analyzer made for that purpose, so it can detect such path-sensitive information and prevent to emit a fix-it.


"Implement a dataflow flamework"

That is a cool idea, but for the first time I want to think about local scopes. Within a single function body, that is the scope I can safely measure, I believe. If it turns out to be a good idea, we could improve on it step-by-step. I really wanted to measure my stuff, but every time when I have finished I had to rewrite my stuff near from scratch, so I cannot say this checker is somewhat special to rewrite half of the Analyzer. May it will be.


I have learnt nothing from Tidy, but that: prove that the fix-it is valid, if it is, emit, otherwise throw away. The Tidy cannot measure such information, and that is where the flow-sensitive stuff kicks in.

If our engine would be perfect, with dataflow-ctu-summary-dunnowhat analysis collaborating together may we would consider to rewrite such crazy difficult to rewrite code. I think we are that far away of that level, we have to avoid to think of that kind of fix-its. But the 99,99% totally worth and easy to measure with the help of dynamic size information, like I have implemented mostly of the STR31-C and STR32-C examples based on that idea with fix-its.

I cannot emphasize enough well how simple is the way the string is manipulated and how easy it should be modeled with memory block regions or the dynamic "used size" of the memory block, and stuff like that. It is just a sequence of data-flow, from one given allocation to one given return, with the craziest offsets on the destination-array and with the craziest custom allocators.

If you would like to cover more, we are not interested in multiple-buffers, but rather one single buffer with a custom allocator where the size expression should be easy to obtain. I have taken that heuristic and it worked out quite well. Like the Git's allocator injects the null terminator, they say, in a comment, safely - when they pass a wrong size to it: strlen(src). So it would find possible false positives, but the commenting is not really the way we null-terminate.


This is one of the most needed checkers, so I understand why you want to make it as precise as possible with the largest set of possibilities, but incremental development, eh? Peoples are lame and cannot count so they write memcpy(dest, "Foobar", 6), where the next statements are telling us if the programmer was lame, or not (and on the previous line the memcpy() has an appropriate size expression). Please think about that kind of vulnerability when there is a Windows update every week. That was my base case when I started this project 2 years ago.

Charusso abandoned this revision.Nov 18 2019, 3:23 PM

This patch moved to D70411.