The following revision adds the basic infrastructure for a reaching definitions algorithm for C++.
Short description & motivation
This is a dataflow algorithm designed to find a set of definitions (for example assignments) to variables in a given CFGBlock. To demonstrate, in the following code snippet:
int flag; void foo(); void f() { int *x = nullptr; flag = 1; foo(); if (flag) x = new int; foo(); if (flag) *x = 5; }
The CFG would look like this:
-> [B3] -> -> [B1] -> / \ / \ [B5 (ENTRY)] -> [B4] ------> [B2] ---> [B0 (EXIT)]
Should foo() change flag's value on the first function call to false, then true on the second, a null pointer dereference error would occur. Using a reaching definitions calculator, we can retrieve that the set of ingoing definitions to B1 is {(flag, [B2]), (x, [B3]), (x, [B4])}. The set hints that x has a reaching definitions that would not have caused a nullpointer dereference.
The algorithm
A reaching definition for a given instruction is an earlier instruction whose target variable can reach (be assigned to) the given one without an intervening assignment. The similarly named reaching definitions is a data-flow analysis which statically determines which definitions may reach a given point in the code [1].
The set of ingoing and outgoing reaching definitions are calculated for each basic block with set operations on GEN and KILL sets. This could be further fine-grained so the RD sets would be a property of a statement, rather then an entire CFGBlock, but I didn't want to delay the publishing of this patch any further, and I believe that such a change wouldn't hurt the infrastructure much.
Implementation
As the formal definition would suggest, this was conceived for instructions, which is why even such terms as "variable" or "definition" can be hard to define for C++. For this reason, the patch spares a lot of LOC for documentation and reasoning. While the algorithm itself is simple (and is implemented quite literally from [1]), the problematic part of this is the generation of of GEN and KILL sets. I tried to introduce an infrastructure that can tolerate a lot of things I have inevitable forgotten (or left for followup patches) with the use of easy to add AST matchers.
Immediate questions to address
We already have 2 dataflow algorithms implemented in the Analysis library: UninitializedObject and LiveVariables. Reaching definitions doesn't sound all too dissimilar. Why are we adding hundreds of LOC here? Can't we reuse some of it?
UninitializedObject and LiveVariables are practically the same algorithm with minimal differences. Despite this, they take up ~750 and ~1050 LOC respectively. Both of those algorithms can be expressed with the use of GEN and KILL sets, yet none of them are, and they duplicate a lot of logic. It isn't terribly obvious however how their logic could be (or really, should be) merged.
Shockingly, we have no GEN and KILL set implementations in Clang. I think that is the main addition of this patch, even if it unfortunately duplicates some logic. The previously mentioned two analyses however could serve as an inspiration.
UninitializedObject and LiveVariables uses ASTVisitors rather than ASTMatchers. The latter is more expensive. Why did we go with them?
Matchers are more expressive. For instance, how would you find s.a.x.z with visitors? Its doable, but requires to keep track of a lot of state and would make the implementation ugly. I don't have complete confidence in my decision here, so I welcome alternative suggestions or counterarguments.
What are the main things to get done?
In order:
- Finalize the infrastructure for GEN set generation.
- Make it possible to calculate RD sets for statements, not only blocks.
- Improve the interface of ReachingDefinitionsCalculator. What we like to query? Most probably the set of ingoing RDs for a variable at a given statement.
- Performance: use immutable data structures and a better CFG traversal strategy.
Further reading
My GSoC project: https://szelethus.github.io/gsoc2019/ (this has a lot of pointers to related discussions in the 'Reaching Definitions Analysis' section)
[cfe-dev] Dataflow analyses in Clang -- why do we not have GEN and KILL sets? http://lists.llvm.org/pipermail/cfe-dev/2020-March/064893.html
[analyzer][WIP] Implement a primitive reaching definitions analysis D64991
References
My main source was wikipedia:
[1] https://en.wikipedia.org/wiki/Reaching_definition
I read the following articles, but they didn't give me the information I needed:
Tonella, Paolo, et al. "Variable precision reaching definitions analysis for software maintenance." Proceedings. First Euromicro Conference on Software Maintenance and Reengineering. IEEE, 1997.
Collard, Jean-François, and Jens Knoop. "A comparative study of reaching-definitions analyses." (1998).
I wonder if Variable will be the right notion long term. Do we want to be able to represent heap locations or do we exclude them on purpose? Reasoning about the heap is quite challenging so both answers might be reasonable. But in case we try to tackle the more general problem, we basically need a more general term like MemoryLocation.