This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] MemoryBlockRegion: Generalize AllocaRegion
AbandonedPublic

Authored by Charusso on Oct 9 2019, 1:20 PM.

Details

Reviewers
NoQ
xazax.hun
Summary

This patch generalizes the AllocaRegion to store metadata about the
expression of the allocation of a memory block. This information could be
used to apply fix-its to the size expression of the allocation when the
buffer would overflow.

Diff Detail

Event Timeline

Charusso created this revision.Oct 9 2019, 1:20 PM
NoQ added a comment.Oct 25 2019, 7:18 PM

This patch generalizes the AllocaRegion to store metadata about the expression of the allocation of a memory block.

Mmm, but you can easily obtain that expression from the existing AllocaRegion. I.e., you can obtain the expression on which the memory was allocated, which is an invocation of alloca(). You can always inspect this CallExpr to extract the size argument or the callee declaration, there's no need to store them separately.

Generally, there's no need to add more information to MemRegion and SymExpr objects unless you want to discriminate between different regions/symbols that are currently treated as equal regions/symbols (cf. D21978). If the information is not part of region's/symbol's identity, it shouldn't be their field. You can attach this information to them in a state trait if necessary.

Extent is definitely an example of such information. It's currently implemented via range constraints over SymbolExtent, but i'd much rather have extents implemented as a REGISTER_MAP_WITH_PROGRAMSTATE(Extent, const MemRegion *, SVal) and scrap the virtual methods.

This information could be used to apply fix-its to the size expression of the allocation when the buffer would overflow.

Even if AllocaRegion didn't store the expression, any path-sensitive checker for which such region is "interesting" would have to implement a bug visitor to display the allocation site. Such visitor automatically knows the location of the alloca() expression and can emit the necessary fixits.

clang/test/Analysis/pr22954.c
628

See the other part of the FIXME: But not on the previous line! The symbol is not yet dead on line 628; i specifically left this comment around because this test often gets "fixed" incorrectly.

Thanks for the notes!

In D68725#1722136, @NoQ wrote:

Generally, there's no need to add more information to MemRegion and SymExpr objects [...]

Well, that was my first idea, then I saw we allow helper-methods inside regions. getExtent() has been deprecated on my side.

Even if AllocaRegion didn't store the expression, any path-sensitive checker for which such region is "interesting" would have to implement a bug visitor to display the allocation site. Such visitor automatically knows the location of the alloca() expression and can emit the necessary fixits.

I would create two type of MemoryBlockRegions, one for expressions: malloc(), one for declarations: std::string str. The rational behind the latter to make CStringChecker as simply StringChecker, with modelling the concept: str.size() is the dynamic size behind VarRegion(str). My idea here to make it similar to getDynamicTypeInfo so I can obtain the allocated memory block's capacity, element count, used size easily in every checker. This information could simplify MallocChecker a lot, because the allocated size of memory could serve the allocation state. The "used size" is heuristically modeled in CStringChecker, so it could be simplified, and my new checker would be interested in the element-count only. Also I am thinking of the InnerBufferChecker could connect to a MemoryBlockRegion as all the views abstracts a block of memory.

This new model is basically a new middle-layer in the hierarchy, for example: std::string str -> VarRegion(str) has a memory block behind the scenes, which belongs to the generic heap. This means, I cannot introduce a new type of MemSpaceRegion, as a memory block is a subset of memory space. Both the Decl and Expr based regions produce dynamically changing stuff, that is why I would use the concept of AllocaRegion with an Expr or a Decl of the allocation as a known identifier.

I cannot write that down: str.size() == getDynamicSize(str) or char *cstr; strlen(cstr) == getDynamicSize(cstr), where the VarRegion being a pointer to the data has a size of a pointer, but I am interested in the block of memory it points to. So, even I could remove the entire AllocaRegion, I think it is great to distinguish between a pointer and a block of memory. Also this entire dynamic-size business would be based on MemoryBlockRegions, which I believe a good way to isolate a new concept with creating new types of regions, and not to stress the current model.

I have mixed feelings how would we model a string, and connect it with the current model. I think that would be neat to express like that:
malloc(str.size() + 1) == MemoryBlockExprRegion(getDynamicSize(MemoryBlockDeclRegion(VarRegion(str))) + 1), I believe.

Here is an example why I am interested in strings. Here is a function from LevelDB which does not inject the null-terminator:
(https://github.com/google/leveldb/blob/master/db/c.cc)

static char* CopyString(const std::string& str) {
  char* result = reinterpret_cast<char*>(malloc(sizeof(char) * str.size()));
  memcpy(result, str.data(), sizeof(char) * str.size());
  return result;
}

The proper way to copy a string is:

static char* CopyString(const std::string& str) {
  char* result = reinterpret_cast<char*>(malloc(str.size() + 1));
  strcpy(result, str.data());
  return result;
}

Also, this would be the first step to handle everything in the string.c test file, which is not the goal. The goal is the size of a memory block.

clang/test/Analysis/pr22954.c
628

Well, this is not on the previous line, but I will mark this as unresolved then. In my mind with that new concept we see that we have an eternal block of memory, which we needs to explicitly free().

NoQ added a comment.Oct 29 2019, 4:56 PM
In D68725#1722136, @NoQ wrote:

Even if AllocaRegion didn't store the expression, any path-sensitive checker for which such region is "interesting" would have to implement a bug visitor to display the allocation site. Such visitor automatically knows the location of the alloca() expression and can emit the necessary fixits.

The rational behind the latter to make CStringChecker as simply StringChecker, with modelling the concept: str.size() is the dynamic size behind VarRegion(str).

We already have SymbolMetadata for that purpose. If you want, you can model the inner buffer of a std::string as a SymbolMetadata of type char * associated with the region of the std::string; in this case the string's capacity is the extend of the symbolic region associated with SymbolMetadata and the string's length is CStringChecker's existing SymbolMetadata of type size_t associated with the same symbolic region.

I don't particularly love SymbolMetadata because i believe it should be associated with symbols rather than with regions-and-points (eg., introduce a symbolic identity for the std::string that gets preserved on copies and associate SymbolMetadata with that identity).

But i believe that we already have enough concepts and tools here, there's no need to introduce more of them.

Charusso abandoned this revision.Oct 29 2019, 5:01 PM
In D68725#1726332, @NoQ wrote:

But i believe that we already have enough concepts and tools here, there's no need to introduce more of them.

In D68725#1726332, @NoQ wrote:

I don't particularly love SymbolMetadata because i believe it should be associated with symbols [...]

That is why I secretly wanted to get rid of that concept, but I am fine with using a generic size map.