This is an archive of the discontinued LLVM Phabricator instance.

[analysis][analyzer] Introduce the skeleton of a reaching definitions calculator
Needs Review · Public

Authored by Szelethus on Mar 17 2020, 8:13 AM.

Details

Summary

The following revision adds the basic infrastructure for a reaching definitions algorithm for C++.

Short description & motivation

This is a dataflow algorithm designed to find the set of definitions (for example, assignments) of variables in a given CFGBlock. To demonstrate, consider the following code snippet:

int flag;

void foo();

void f() {
  int *x = nullptr;
  flag = 1;

  foo();
  if (flag)
    x = new int;

  foo();
  if (flag)
    *x = 5;
}

The CFG would look like this:


                   -> [B3] ->    -> [B1] ->
                  /          \  /          \
[B5 (ENTRY)] -> [B4] ------> [B2] ---> [B0 (EXIT)]

Should foo() change flag's value to false on the first call and to true on the second, a null pointer dereference would occur. Using a reaching definitions calculator, we can determine that the set of ingoing definitions to B1 is {(flag, [B2]), (x, [B3]), (x, [B4])}. This set hints that x has a reaching definition that would not have caused a null pointer dereference.

The algorithm

A reaching definition for a given instruction is an earlier instruction whose target variable can reach (be assigned to) the given one without an intervening assignment. The similarly named reaching definitions analysis is a data-flow analysis that statically determines which definitions may reach a given point in the code [1].

The sets of ingoing and outgoing reaching definitions are calculated for each basic block with set operations on GEN and KILL sets. This could be made more fine-grained, so that the RD sets would be a property of a statement rather than of an entire CFGBlock, but I didn't want to delay publishing this patch any further, and I believe such a change wouldn't hurt the infrastructure much.
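For reference, below is a minimal, self-contained sketch of the textbook GEN/KILL worklist iteration from [1]. The Block and Definition types and the solveReachingDefinitions name are illustrative placeholders, not the classes introduced by this patch:

#include <queue>
#include <set>
#include <vector>

using Definition = int;              // stand-in for a (Variable, CFG element) pair
using DefSet = std::set<Definition>;

struct Block {
  DefSet Gen, Kill;                  // computed once, before the iteration
  std::vector<Block *> Preds, Succs;
  DefSet In, Out;                    // refined until a fixed point is reached
};

void solveReachingDefinitions(const std::vector<Block *> &Blocks) {
  std::queue<Block *> Worklist;
  for (Block *B : Blocks)
    Worklist.push(B);

  while (!Worklist.empty()) {
    Block *B = Worklist.front();
    Worklist.pop();

    // IN[B] = union of OUT[P] over all predecessors P of B.
    B->In.clear();
    for (const Block *P : B->Preds)
      B->In.insert(P->Out.begin(), P->Out.end());

    // OUT[B] = GEN[B] united with (IN[B] minus KILL[B]).
    DefSet Out = B->Gen;
    for (Definition D : B->In)
      if (!B->Kill.count(D))
        Out.insert(D);

    // Only the successors of a changed block can be affected, so re-enqueue
    // them instead of sweeping over every block again.
    if (Out != B->Out) {
      B->Out = std::move(Out);
      for (Block *S : B->Succs)
        Worklist.push(S);
    }
  }
}

Re-enqueueing only the successors of a block whose OUT set changed reaches the same fixed point as repeatedly re-scanning all blocks, just with less redundant work.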

Implementation

As the formal definition suggests, the algorithm was conceived for instructions, which is why even terms such as "variable" or "definition" can be hard to pin down for C++. For this reason, the patch spends a lot of LOC on documentation and reasoning. While the algorithm itself is simple (and is implemented quite literally from [1]), the problematic part is the generation of the GEN and KILL sets. I tried to introduce an infrastructure that can tolerate a lot of the things I have inevitably forgotten (or left for followup patches) with the use of easy-to-add AST matchers.

Immediate questions to address

We already have 2 dataflow algorithms implemented in the Analysis library: UninitializedObject and LiveVariables. Reaching definitions doesn't sound all that dissimilar. Why are we adding hundreds of LOC here? Can't we reuse some of it?

UninitializedObject and LiveVariables are practically the same algorithm with minimal differences. Despite this, they take up ~750 and ~1050 LOC respectively. Both of those algorithms could be expressed with the use of GEN and KILL sets, yet neither of them is, and they duplicate a lot of logic. It isn't terribly obvious, however, how their logic could be (or really, should be) merged.

Shockingly, we have no GEN and KILL set implementations in Clang. I think that is the main addition of this patch, even if it unfortunately duplicates some logic. The two previously mentioned analyses could, however, serve as an inspiration.

UninitializedObject and LiveVariables use ASTVisitors rather than ASTMatchers. The latter is more expensive. Why did we go with matchers?

Matchers are more expressive. For instance, how would you find s.a.x.z with visitors? It's doable, but it requires keeping track of a lot of state and would make the implementation ugly. I don't have complete confidence in my decision here, so I welcome alternative suggestions or counterarguments.
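For illustration, here is a rough sketch of how an assignment whose left-hand side is a member chain rooted in a variable, such as s.a.x.z = 0, could be found with matchers. This is not the matcher used in the patch, and the binding names ("def", "lhs", "base") are made up:

#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "clang/ASTMatchers/ASTMatchers.h"

using namespace clang;
using namespace clang::ast_matchers;

// Matches `s.a.x.z = 0` and binds the assignment, the outermost MemberExpr,
// and the VarDecl at the root of the member chain.
static const auto MemberChainAssignment =
    binaryOperator(
        hasOperatorName("="),
        hasLHS(memberExpr(hasDescendant(
                   declRefExpr(to(varDecl().bind("base")))))
                   .bind("lhs")))
        .bind("def");

struct DefinitionCollector : MatchFinder::MatchCallback {
  void run(const MatchFinder::MatchResult &Result) override {
    // Walking the nested MemberExprs from "lhs" down to "base" yields the
    // field chain (a, x, z) of the written variable s.
    const auto *Base = Result.Nodes.getNodeAs<VarDecl>("base");
    const auto *LHS = Result.Nodes.getNodeAs<MemberExpr>("lhs");
    (void)Base;
    (void)LHS;
  }
};

With a visitor, recognizing the assignment and then manually walking and recording each nested MemberExpr is exactly the kind of state-keeping mentioned above.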

What are the main things to get done?

In order:

  • Finalize the infrastructure for GEN set generation.
  • Make it possible to calculate RD sets for statements, not only blocks.
  • Improve the interface of ReachingDefinitionsCalculator. What would we like to query? Most probably the set of ingoing RDs for a variable at a given statement.
  • Performance: use immutable data structures and a better CFG traversal strategy.

Further reading

My GSoC project: https://szelethus.github.io/gsoc2019/ (this has a lot of pointers to related discussions in the 'Reaching Definitions Analysis' section)

[cfe-dev] Dataflow analyses in Clang -- why do we not have GEN and KILL sets? http://lists.llvm.org/pipermail/cfe-dev/2020-March/064893.html

[analyzer][WIP] Implement a primitive reaching definitions analysis D64991

References

My main source was Wikipedia:
[1] https://en.wikipedia.org/wiki/Reaching_definition

I read the following articles, but they didn't give me the information I needed:

Tonella, Paolo, et al. "Variable precision reaching definitions analysis for software maintenance." Proceedings. First Euromicro Conference on Software Maintenance and Reengineering. IEEE, 1997.

Collard, Jean-François, and Jens Knoop. "A comparative study of reaching-definitions analyses." (1998).


Event Timeline

Szelethus created this revision. Mar 17 2020, 8:13 AM
whisperity added inline comments. Mar 17 2020, 10:30 AM
clang/include/clang/Analysis/Analyses/ReachingDefinitions.h
79

Shouldn't this be called DefinitionLess if this is the "natural" comparator for Definition? Also, is this the convention in LLVM, instead of providing an explicit specialisation for std::less?

131

I do not feel this is visually separated enough; the fact that there is a free-floating comment (seemingly, and bogusly, not attached to any declaration) breaks the reading of the code.

How about this:

class X {
  // Fields that should not change after construction
  private:
    Foo bar;
    Bla blah;

  // Fields that change at every step
  private:
    Blah bla;

  public:
    Yadda blebba;
};
140

You only seem to be using this variable in one place. Are you sure it is worth saving as a field? Also, is it valid in the first place for this->Context not to be the same as this->D->getASTContext()?

176–179

Is this the right approach? A public method makes the user itch to call it. If GenSetMatcherCallback is a superclass of every possible implementation, I think adding that class as a friend here works, and you could make the methods private, or protected.

236–242

The immutability might not matter(?), but the performance aspects of the red-black tree behind the STL map could. K is a pointer, so why not use DenseMap? How large do you expect these containers to be when they peak?

clang/lib/Analysis/ReachingDefinitions.cpp
96

I know this is the first patch, but it might be worth mentioning here that this thing does not match user-defined operators.

162

You mean explicit destructor calls here, right?

275

Instead of simply blockID, could you harmonise this output with the CFG dump and say Bx instead of just X?

clang/test/Analysis/dump-definitions.cpp
17–18

What is an element? How do they get their numbers? What does the 3 mean here? I get that basic block 1 (the body of the function) writes ptr... but I don't understand this further from looking at the expected output.

Added some more reviewers who might be interested.

I think it is crucial to make the intentions clear: how do you define definition and variable?
Other than assignments we might include pre/postfix increment/decrement, non-const by-ref function arguments and so on as definitions.
Also, it would be great to have some proposal upfront how indirect writes should behave.

Basically, this algorithm is not very useful on its own, but it can be used as a building block for other kinds of analyses. So it would make sense to look for potential users and use cases and see whether they have the same requirements, or whether you need to introduce configurations.
Trying to rewrite some existing algorithms in terms of these sets and seeing if the tests pass might be a good experiment, if it is not too much work. (In case we can validate the performance, we could even replace the original implementations if this makes things easier.)

I think having a reaching definition set per basic block makes perfect sense, as it should be relatively easy to calculate the reaching definitions of an instruction given the set and the basic block. So maybe it is more efficient to calculate the reaching definition sets on demand for an instruction rather than computing them for every single instruction.

Regarding ASTMatchers, I think they might be great for prototyping for now, but once you want to make this more robust, covering every corner case in the matcher expression will be at least as hard as writing a visitor, if not harder. But we will see :)

clang/include/clang/Analysis/Analyses/ReachingDefinitions.h
40

I wonder if Variable will be the right notion long term. Do we want to be able to represent heap locations or do we exclude them on purpose? Reasoning about the heap is quite challenging so both answers might be reasonable. But in case we try to tackle the more general problem, we basically need a more general term like MemoryLocation.

60

Is the inheritance justified here? Is a definition a variable? Maybe having a variable member would better express the relationship.

98

As far as I understand, this is the analysis state for you, while GenSet is used for the transfer functions (I might be mistaken). I think it might be better to organize the code the following way:

/// Analysis state
everything state related

/// Transfer functions
everything transfer function related

/// Iteration/Traversal
the main loop of the algorithm
222

Hmm, I am not really familiar with the specifics, and maybe this is optimized away, but I have always wondered: if we only need the address of a static variable, why don't we choose the smallest type, like a char?

clang/lib/Analysis/ReachingDefinitions.cpp
41

In the future this will be more complicated.

For example, if I assign to a struct, all of its members need to be killed (see the snippet below). As a result, you will not only need to represent memory regions but also the relationships between them. I wonder if the analyzer's MemRegion classes are any good or if they are too specific to the analyzer.
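As a tiny, hypothetical illustration of the point above (not from the patch's tests):

struct S { int a, b; };

void f(S s) {
  s.a = 1;  // defines s.a
  s = S{};  // defines s, and must also kill the earlier definition of s.a
}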

137

Note that memberExpr is not supported at this point; it is not derived from declRefExpr.

Szelethus marked 9 inline comments as done. Mar 18 2020, 8:54 AM

I think it is crucial to make the intentions clear: how do you define definition and variable?
Other than assignments we might include pre/postfix increment/decrement, non-const by-ref function arguments and so on as definitions.
Also, it would be great to have some proposal upfront how indirect writes should behave.

You're totally right. Part of what I wanted to achieve is to make the implementation flexible for things I may have forgotten, or will need to add for future standards/languages. As you can see from D64991, I did make a lot more progress than what is shown here, so I'll explain some of my long-term plans based on the experience I gathered:

I think it is crucial to make the intentions clear: how do you define [...] variable?

Variable could be practically anything that can be written, but due to the nature of what we can work with, we have to make severe restrictions.

  • We don't really have a good points-to-analysis. This discards pointees of any kind pretty much immediately. I don't have long term plans to add support for them.
  • Arrays are strongly related, and while there are studies on indexed variables (the last paper in my summary talks about such an approach), I think C++ complicates things to an uncomfortable level. However, I am confident that adding support for the cases where the index is known at compile time (possibly with the help of the RD algorithm itself) should be doable in the future by modifying the Variable class.

So this leaves the most obvious: VarDecls, and for record objects a (VarDecl, FieldChain) pair (for s.x.a, this would be (s, [x,a])). Setting the VarDecl part of the pair to nullptr could refer to the implicit this during the analysis of a C++ method.

The plan was to represent each field with a separate Variable object, so for this code

struct A {
  struct B {
    int x, y;
  };
  B b;
};

void foo() { A a; }

The following list of Variable objects would be created:

a
a.b
a.b.x
a.b.y

The reason behind this seemingly wasteful storage is that the eventual set operations would be difficult, if not impossible, to implement if a single Variable object were to hold the information about its fields. Mind that each of them could have a totally different definition associated with it. I hope that by employing immutable data structures these wouldn't be terribly expensive memory-wise.
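A rough sketch of the (VarDecl, FieldChain) representation described above; the names here are illustrative and not necessarily the ones used in the patch:

#include "clang/AST/Decl.h"
#include "llvm/ADT/SmallVector.h"

struct Variable {
  // Null would denote the implicit 'this' of a C++ method.
  const clang::VarDecl *Base = nullptr;
  // For `a` this is empty, for `a.b` it is {b}, for `a.b.x` it is {b, x}.
  llvm::SmallVector<const clang::FieldDecl *, 4> FieldChain;
};

Every prefix of a field chain gets its own Variable object, as in the list above, so that each of them can carry its own set of definitions.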

I think it is crucial to make the intentions clear: how do you define definition[...]?
Other than assignments we might include pre/postfix increment/decrement, non-const by-ref function arguments and so on as definitions.

A definition is a statement, more specifically a CFGStmt (an element of a CFGBlock), that either writes, or could potentially write, a variable. The proposed definition-finding stage should be very flexible for future additions.

Also, it would be great to have some proposal upfront how indirect writes should behave.

I wrote a mail about this during my GSoC project: http://lists.llvm.org/pipermail/cfe-dev/2019-July/062975.html.

Basically, this algorithm is not very useful on its own, but it can be used as a building block for other kinds of analyses. So it would make sense to look for potential users and use cases and see whether they have the same requirements, or whether you need to introduce configurations.

The Static Analyzer would be the number one user of the RD algorithm, as described in the summary, so I guess you were referring to the users of GEN/KILL sets? What do you mean by configurations?

Trying to rewrite some existing algorithms in terms of these sets and seeing if the tests pass might be a good experiment, if it is not too much work. (In case we can validate the performance, we could even replace the original implementations if this makes things easier.)

I'm not too sure how much work it would take (I suspect an unreasonable quantity, but I might be wrong), so I'll take a look. This is a great idea to gain a lot more knowledge about this topic, even if it eventually fails.

I think having a reaching definition set per basic block makes perfect sense, as it should be relatively easy to calculate the reaching definitions of an instruction given the set and the basic block. So maybe it is more efficient to calculate the reaching definition sets on demand for an instruction rather than computing them for every single instruction.

Regarding ASTMatchers, I think they might be great for prototyping for now, but once you want to make this more robust, covering every corner case in the matcher expression will be at least as hard as writing a visitor, if not harder. But we will see :)

Thank you for the detailed response! I won't update this patch for a while to leave time for others to respond, but I will try to work on the other algorithms a bit.

clang/include/clang/Analysis/Analyses/ReachingDefinitions.h
40

I don't have long term plans to reason about pointees in general. Heap in particular is probably off limits for the foreseeable future.

60

Yup, you're right. This was a quick hack while developing.

176–179

Moving this to GenSetMatcherCallback would indeed be a great idea :)

236–242

Measurements on real-life codebases can never come soon enough, but I fear it'll be a while before I get them.

clang/lib/Analysis/ReachingDefinitions.cpp
41

As described in my comment, record variables would have numerous Variable objects, so this function wouldn't get much more complicated in the future (as seen in D64991).

137

Indeed, that is one beast of a matcher :) You can take a sneak peek at D64991.

162

Nope, implicit. I wouldn't worry much about explicit calls; I would handle them the same as I would any CallExpr. Implicit destructor calls could modify the global state, and they are not visibly present in the code.

275

What do you mean?

clang/test/Analysis/dump-definitions.cpp
17–18

A CFGBlock contains all the statements that are always executed sequentially; these are its elements. The elements are enumerated according to execution order. The 3 here means that the definition is the 3rd element in the 1st CFGBlock.

Variable could be practically anything that can be written, but due to the nature of what we can work with, we have to make severe restrictions.

  • We don't really have a good points-to-analysis. This discards pointees of any kind pretty much immediately. I don't have long term plans to add support for them.

I am a bit concerned about this. So if we have something like *x = 2, this definition will not appear in any of the reaching definition sets? An alternative would be to include it in all the sets that are allowed by the strict-aliasing rules. The reason I am a bit concerned is that I think we should be clear about what the user can expect from this algorithm. Will we calculate a superset of reaching definitions (over-approximating)? Will we calculate a subset (under-approximating)? Different users might have different requirements. Usually, dataflow analyses tend to over-approximate, but if we omit some elements from the reaching definitions, we will both over- and under-approximate at the same time. While this might make sense in certain use cases, it might be quite surprising for potential users of the algorithm. This is why I think it is important to set a clear high-level goal. One such goal could be that we always want to over-approximate, modulo bugs.

  • Arrays are strongly related, and while there are studies on indexed variables (the last paper in my summary talks about such an approach), I think C++ complicates things to an uncomfortable level. However, I am confident that adding support for the cases where the index is known at compile time (possibly with the help of the RD algorithm itself) should be doable in the future by modifying the Variable class.

I think initially handling arrays as if they contained only one element should be fine.

So this leaves the most obvious: VarDecls, and for record objects a (VarDecl, FieldChain) pair (for s.x.a, this would be (s, [x,a])). Setting the VarDecl part of the pair to nullptr could refer to the implicit this during the analysis of a C++ method.

Note that this is very similar to what the lifetime analysis is using. See the LifetimeContractVariable in https://reviews.llvm.org/D72810. But we have a more refined memory location representation that can also include the temporaries we use during the analysis.

The plan was to represent each field with a separate Variable object, so for this code

struct A {
  struct B {
    int x, y;
  };
  B b;
};

void foo() { A a; }

The following list of Variable objects would be created:

a
a.b
a.b.x
a.b.y

The reason behind this seemingly wasteful storage is that the eventual set operations would be difficult, if not impossible, to implement if a single Variable object were to hold the information about its fields. Mind that each of them could have a totally different definition associated with it. I hope that by employing immutable data structures these wouldn't be terribly expensive memory-wise.

I am not sure what you mean here. Basically, I think the question is how hierarchical you want your representation to be. I.e., do you want to have pointers between sub/super objects, or do you prefer doing lookups when you want to invalidate subobjects?

Also, it would be great to have some proposal upfront how indirect writes should behave.

I wrote a mail about this during my GSoC project: http://lists.llvm.org/pipermail/cfe-dev/2019-July/062975.html.

I think saying that *x = 2 will just be ignored is not sufficient. See my argument above about over- and under-approximation. I think it would be great to know what the consequences of this decision are. E.g. you could check how existing algorithms like uninitialized variables behave. For example, one possible mitigation of the problem with pointers would be to consider &var a definition.
My point is, we have several ways to deal with pointers without a precise points-to analysis, and ignoring them is just one of the options. Having more options considered, with pros and cons enumerated, would greatly increase our confidence in your chosen solution.
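To make the suggestion concrete, a small, hypothetical example of what treating &var (and non-const by-ref arguments) as conservative definitions would mean:

void byRef(int &r);

void example() {
  int x = 0;    // definition of x
  int *p = &x;  // taking the address: conservatively treated as a
                // (potential) definition of x, so the later indirect
                // write is over- rather than under-approximated
  *p = 2;       // indirect write, invisible to the analysis on its own
  byRef(x);     // non-const by-ref call: another potential definition of x
}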

The Static Analyzer would be the number one user of the RD algorithm, as described in the summary, so I guess you were referring to the users of GEN/KILL sets? What do you mean by configurations?

I mean both. As you mentioned GEN/KILL could be reused for other analyses. But reaching definitions can be a building block for other high-level checks like finding raw pointers that are owners. Consider the following example:

void foo(int *p) {
  ...
  delete p;
}

How do you know that the pointer p is an owner of the pointee? You know because it is deleted. So if you want to classify some raw pointers as owners, one possible way to do it is to check the reaching definitions of delete statements. Clang-tidy could try to convert some of those pointers to unique_ptrs. This is how this algorithm can be more generally useful as a building block for other problems unrelated to the static analyzer.

martong added inline comments. Apr 30 2020, 7:50 AM
clang/lib/Analysis/ReachingDefinitions.cpp
269

I understand that this is the worklist algorithm uplifted from Wikipedia. But how do we transmogrify the original algorithm [1] into this one? What is particularly interesting to me is that we continue with the successors from here instead of examining all the blocks over again.

[1] Dragon book, 2007

whisperity resigned from this revision. Jun 3 2021, 6:49 AM
whisperity added a subscriber: whisperity.

What about this patch? I'm removing my reviewer bit just so it doesn't appear in my list anymore, but if there are any updates, I'll keep myself as a subscriber. 🙂