diff --git a/clang/docs/DataFlowAnalysisIntro.md b/clang/docs/DataFlowAnalysisIntro.md new file mode 100644 --- /dev/null +++ b/clang/docs/DataFlowAnalysisIntro.md @@ -0,0 +1,1000 @@ +# Data flow analysis: an informal introduction + +## Abstract + +This document introduces data flow analysis in an informal way. The goal is to +give the reader an intuitive understanding of how it works, and show how it +applies to a range of refactoring and bug finding problems. + +Data flow analysis is a well-established technique; it is described in many +papers, books, and videos. If you would like a more formal, or a more thorough +explanation of the concepts mentioned in this document, please refer to the +following resources: + +* [The Lattice article in Wikipedia](https://en.wikipedia.org/wiki/Lattice_\(order\)). +* Videos on the PacketPrep YouTube channel that introduce lattices and the + necessary background information: + [#20](https://www.youtube.com/watch?v=73j_FXBXGm8), + [#21](https://www.youtube.com/watch?v=b5sDjo9tfE8), + [#22](https://www.youtube.com/watch?v=saOG7Uooeho), + [#23](https://www.youtube.com/watch?v=3EAYX-wZH0g), + [#24](https://www.youtube.com/watch?v=KRkHwQtW6Cc), + [#25](https://www.youtube.com/watch?v=7Gwzsc4rAgw). +* [Introduction to Dataflow Analysis](https://www.youtube.com/watch?v=OROXJ9-wUQE) +* [Introduction to abstract interpretation](http://www.cs.tau.ac.il/~msagiv/courses/asv/absint-1.pdf). +* [Introduction to symbolic execution](https://www.cs.umd.edu/~mwh/se-tutorial/symbolic-exec.pdf). +* [Static Program Analysis by Anders Møller and Michael I. Schwartzbach](https://cs.au.dk/~amoeller/spa/). +* [EXE: automatically generating inputs of death](https://css.csail.mit.edu/6.858/2020/readings/exe.pdf) + (a paper that successfully applies symbolic execution to real-world + software). + +## Data flow analysis + +### The purpose of data flow analysis + +Data flow analysis is a static analysis technique that proves facts about a +program or its fragment. It can make conclusions about all paths through the +program, while taking control flow into account and scaling to large programs. +The basic idea is propagating facts about the program through the edges of the +control flow graph (CFG) until a fixpoint is reached. + +### Sample problem and an ad-hoc solution + +We would like to explain data flow analysis while discussing an example. Let's +imagine that we want to track possible values of an integer variable in our +program. Here is how a human could annotate the code: + +```c++ +void Example(int n) { + int x = 0; + // x is {0} + if (n > 0) { + x = 5; + // x is {5} + } else { + x = 42; + // x is {42} + } + // x is {5; 42} + print(x); +} +``` + +We use sets of integers to represent possible values of `x`. Local variables +have unambiguous values between statements, so we annotate program points +between statements with sets of possible values. + +Here is how we arrived at these annotations. Assigning a constant to `x` allows +us to make a conclusion that `x` can only have one value. When control flow from +the "then" and "else" branches joins, `x` can have either value. + +Abstract algebra provides a nice formalism that models this kind of structure, +namely, a lattice. A join-semilattice is a partially ordered set, in which every +two elements have a least upper bound (called a *join*). + +``` +join(a, b) ⩾ a and join(a, b) ⩾ b and join(x, x) = x +``` + +For this problem we will use the lattice of subsets of integers, with set +inclusion relation as ordering and set union as a join. + +Lattices are often represented visually as Hasse diagrams. Here is a Hasse +diagram for our lattice that tracks subsets of integers: + +![Hasse diagram for a lattice of integer sets](DataFlowAnalysisIntroImages/IntegerSetsInfiniteLattice.svg) + +Computing the join in the lattice corresponds to finding the lowest common +ancestor (LCA) between two nodes in its Hasse diagram. There is a vast amount of +literature on efficiently implementing LCA queries for a DAG, however Efficient +Implementation of Lattice Operations (1989) +([CiteSeerX](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.4911), +[doi](https://doi.org/10.1145%2F59287.59293)) describes a scheme that +particularly well-suited for programmatic implementation. + +### Too much information and "top" values + +Let's try to find the possible sets of values of `x` in a function that modifies +`x` in a loop: + +```c++ +void ExampleOfInfiniteSets() { + int x = 0; // x is {0} + while (condition()) { + x += 1; // x is {0; 1; 2; …} + } + print(x); // x is {0; 1; 2; …} +} +``` + +We have an issue: `x` can have any value greater than zero; that's an infinite +set of values, if the program operated on mathematical integers. In C++ `int` is +limited by `INT_MAX` so technically we have a set `{0; 1; …; INT_MAX}` which is +still really big. + +To make our analysis practical to compute, we have to limit the amount of +information that we track. In this case, we can, for example, arbitrarily limit +the size of sets to 3 elements. If at a certain program point `x` has more than +3 possible values, we stop tracking specific values at that program point. +Instead, we denote possible values of `x` with the symbol `⊤` (pronounced "top" +according to a convention in abstract algebra). + +```c++ +void ExampleOfTopWithALoop() { + int x = 0; // x is {0} + while (condition()) { + x += 1; // x is ⊤ + } + print(x); // x is ⊤ +} +``` + +The statement "at this program point, `x`'s possible values are `⊤`" is +understood as "at this program point `x` can have any value because we have too +much information, or the information is conflicting". + +Note that we can get more than 3 possible values even without a loop: + +```c++ +void ExampleOfTopWithoutLoops(int n) { + int x = 0; // x is {0} + switch(n) { + case 0: x = 1; break; // x is {1} + case 1: x = 9; break; // x is {9} + case 2: x = 7; break; // x is {7} + default: x = 3; break; // x is {3} + } + // x is ⊤ +} +``` + +### Uninitialized variables and "bottom" values + +When `x` is declared but not initialized, it has no possible values. We +represent this fact symbolically as `⊥` (pronounced "bottom"). + +```c++ +void ExampleOfBottom() { + int x; // x is ⊥ + x = 42; // x is {42} + print(x); +} +``` + +Note that using values read from uninitialized variables is undefined behaviour +in C++. Generally, compilers and static analysis tools can assume undefined +behavior does not happen. We must model uninitialized variables only when we are +implementing a checker that specifically is trying to find uninitialized reads. +In this example we show how to model uninitialized variables only to demonstrate +the concept of "bottom", and how it applies to possible value analysis. We +describe an analysis that finds uninitialized reads in a section below. + +### A practical lattice that tracks sets of concrete values + +Taking into account all corner cases covered above, we can put together a +lattice that we can use in practice to track possible values of integer +variables. This lattice represents sets of integers with 1, 2, or 3 elements, as +well as top and bottom. Here is a Hasse diagram for it: + +![Hasse diagram for a lattice of integer sets](DataFlowAnalysisIntroImages/IntegerSetsFiniteLattice.svg) + +### Formalization + +Let's consider a slightly more complex example, and think about how we can +compute the sets of possible values algorithmically. + +```c++ +void Example(int n) { + int x; // x is ⊥ + if (n > 0) { + if (n == 42) { + x = 44; // x is {44} + } else { + x = 5; // x is {5} + } + print(x); // x is {44; 5} + } else { + x = n; // x is ⊤ + } + print(x); // x is ⊤ +} +``` + +As humans, we understand the control flow from the program text. We used our +understanding of control flow to find program points where two flows join. +Formally, control flow is represented by a CFG (control flow graph): + +![CFG for the code above](DataFlowAnalysisIntroImages/CFGExample.svg) + +We can compute sets of possible values by propagating them through the CFG of +the function: + +* When `x` is declared but not initialized, its possible values are `{}`. The + empty set plays the role of `⊥` in this lattice. + +* When `x` is assigned a concrete value, its possible set of values contains + just that specific value. + +* When `x` is assigned some unknown value, it can have any value. We represent + this fact as `⊤`. + +* When two control flow paths join, we compute the set union of incoming + values (limiting the number of elements to 3, representig larger sets as + `⊤`). + +The sets of possible values are influenced by: + +* Statements, for example, assignments. + +* Joins in control flow, for example, ones that appear at the end of "if" + statements. + +**Effects of statements** are modeled by what is formally known as a transfer +function. A transfer function takes two arguments: the statement, and the state +of `x` at the previous program point. It produces the state of `x` at the next +program point. For example, the transfer function for assignment ignores the +state at the previous program point: + +```c++ +// GIVEN: x is {42; 44} +x = 0; +// CONCLUSION: x is {0} +``` + +The transfer function for `+` performs arithmetic on every set member: + +```c++ +// GIVEN: x is {42, 44} +x = x + 100; +// CONCLUSION: x is {142, 144} +``` + +**Effects of control flow** are modeled by joining the knowledge from all +possible previous program points. + +```c++ +if (...) { + ... + // GIVEN: x is {42} +} else { + ... + // GIVEN: x is {44} +} +// CONCLUSION: x is {42; 44} +``` + +```c++ +// GIVEN: x is {42} +while (...) { + ... + // GIVEN: x is {44} +} +// CONCLUSION: {42; 44} +``` + +The predicate that we marked "given" is usually called a precondition, and the +conclusion is called a postcondition. + +In terms of the CFG, we join the information from all predecessor basic blocks. + +![Modeling the effects of a CFG basic block](DataFlowAnalysisIntroImages/CFGJoinRule.svg) + +Putting it all together, to model the effects of a basic block we compute: + +``` +out = transfer(basic_block, join(in_1, in_2, ..., in_n)) +``` + +(Note that there are other ways to write this equation that produce higher +precision analysis results. The trick is to keep exploring the execution paths +separately and delay joining until later. Hoowever, we won't discuss those +variations here.) + +To make a conclusion about all paths through the program, we repeat this +computation on all basic blocks until we reach a fixpoint. In other words, we +keep propagating information through the CFG until the computed sets of values +stop changing. + +If the lattice has a finite height and transfer functions are monotonic the +algorithm is guaranteed to terminate. Each iteration of the algorithm can +change computed values only to larger values from the lattice. In the worst +case, all computed values become `⊤`, which is not very useful, but at least the +analysis terminates at that point, because it can't change any of the values. + +Fixpoint iteration can be optimised by only reprocessing basic blocks which had +one of their inputs changed on the previous iteration. This is typically +implemented using a worklist queue. With this optimisation the time complexity +becomes `O(m * |L|)`, where `m` is the number of basic blocks in the CFG and +`|L|` is the size of lattice used by the analysis. + +## Symbolic execution: a very short informal introduction + +### Symbolic values + +In the previous example where we tried to figure out what values a variable can +have, the analysis had to be seeded with a concrete value. What if there are no +assignments of concrete values in the program? We can still deduce some +interesting information by representing unknown input values symbolically, and +computing results as symbolic expressions: + +```c++ +void PrintAbs(int x) { + int result; + if (x >= 0) { + result = x; // result is {x} + } else { + result = -x; // result is {-x} + } + print(result); // result is {x; -x} +} +``` + +We can't say what specific value gets printed, but we know that it is either `x` +or `-x`. + +Dataflow analysis is an istance of abstract interpretation, and does not dictate +how exactly the lattice and transfer functions should be designed, beyond the +necessary conditions for the analysis to converge. Nevertheless, we can use +symbolic execution ideas to guide our design of the lattice and transfer +functions: lattice values can be symbolic expressions, and transfer functions +can construct more complex symbolic expressions from symbolic expressions that +represent arguments. See [this StackOverflow +discussion](https://cstheory.stackexchange.com/questions/19708/symbolic-execution-is-a-case-of-abstract-interpretation) +for a further comparison of abstract interpretation and symbolic execution. + +### Flow condition + +A human can say about the previous example that the function returns `x` when +`x >= 0`, and `-x` when `x < 0`. We can make this conclusion programmatically by +tracking a flow condition. A flow condition is a predicate written in terms of +the program state that is true at a specific program point regardless of the +execution path that led to this statement. For example, the flow condition for +the program point right before evaluating `result = x` is `x >= 0`. + +If we enhance the lattice to be a set of pairs of values and predicates, the +dataflow analysis computes the following values: + +```c++ +void PrintAbs(int x) { + int result; + if (x >= 0) { + // Flow condition: x >= 0. + result = x; // result is {x if x >= 0} + } else { + // Flow condition: x < 0. + result = -x; // result is {-x if x < 0} + } + print(result); // result is {x if x >= 0; -x if x < 0} +} +``` + +Of course, in a program with loops, symbolic expressions for flow conditions can +grow unbounded. A practical static analysis system must control this growth to +keep the symbolic representations manageable and ensure that the data flow +analysis terminates. For example, it can use a constraint solver to prune +impossible flow conditions, and/or it can abstract them, losing precision, after +their symbolic representations grow beyond some threshold. This is similar to +how we had to limit the sizes of computed sets of possible values to 3 elements. + +### Symbolic pointers + +This approach proves to be particularly useful for modeling pointer values, +since we don't care about specific addresses but just want to give a unique +identifier to a memory location. + +```c++ +void ExampleOfSymbolicPointers(bool b) { + int x = 0; // x is {0} + int* ptr = &x; // x is {0} ptr is {&x} + if (b) { + *ptr = 42; // x is {42} ptr is {&x} + } + print(x); // x is {0; 42} ptr is {&x} +} +``` + +## Example: finding output parameters + +Let's explore how data flow analysis can help with a problem that is hard to +solve with other tools in Clang. + +### Problem description + +Output parameters are function parameters of pointer or reference type whose +pointee is completely overwritten by the function, and not read before it is +overwritten. They are common in pre-C++11 code due to the absence of move +semantics. In modern C++ output parameters are non-idiomatic, and return values +are used instead. + +Imagine that we would like to refactor output parameters to return values to +modernize old code. The first step is to identify refactoring candidates through +static analysis. + +For example, in the following code snippet the pointer `c` is an output +parameter: + +```c++ +struct Customer { + int account_id; + std::string name; +} + +void GetCustomer(Customer *c) { + c->account_id = ...; + if (...) { + c->name = ...; + } else { + c->name = ...; + } +} +``` + +We would like to refactor this code into: + +```c++ +Customer GetCustomer() { + Customer c; + c.account_id = ...; + if (...) { + c.name = ...; + } else { + c.name = ...; + } + return c; +} +``` + +However, in the function below the parameter `c` is not an output parameter +because its field `name` is not overwritten on every path through the function. + +```c++ +void GetCustomer(Customer *c) { + c->account_id = ...; + if (...) { + c->name = ...; + } +} +``` + +The code also cannot read the value of the parameter before overwriting it: + +```c++ +void GetCustomer(Customer *c) { + use(c->account_id); + c->name = ...; + c->account_id = ...; +} +``` + +Functions that escape the pointer also block the refactoring: + +```c++ +Customer* kGlobalCustomer; + +void GetCustomer(Customer *c) { + c->name = ...; + c->account_id = ...; + kGlobalCustomer = c; +} +``` + +To identify a candidate function for refactoring, we need to do the following: + +* Find a function with a non-const pointer or reference parameter. + +* Find the definition of that function. + +* Prove that the function completely overwrites the pointee on all paths + before returning. + +* Prove that the function reads the pointee only after overwriting it. + +* Prove that the function does not persist the pointer in a data structure + that is live after the function returns. + +There are also requirements that all usage sites of the candidate function must +satisfy, for example, that function arguments do not alias, that users are not +taking the address of the function, and so on. Let's consider verifying usage +site conditions to be a separate static analysis problem. + +### Lattice design + +To analyze the function body we can use a lattice which consists of normal +states and failure states. A normal state describes program points where we are +sure that no behaviors that block the refactoring have occurred. Normal states +keep track of all parameter's member fields that are known to be overwritten on +every path from function entry to the corresponding program point. Failure +states accumulate observed violations (unsafe reads and pointer escapes) that +block the refactoring. + +In the partial order of the lattice failure states compare greater than normal +states, which guarantees that they "win" when joined with normal states. Order +between failure states is determined by inclusion relation on the set of +accumulated violations (lattice's `⩽` is `⊆` on the set of violations). Order +between normal states is determined by reversed inclusion relation on the set of +overwritten parameter's member fields (lattice's `⩽` is `⊇` on the set of +overwritten fields). + +![Lattice for data flow analysis that identifies output parameters](DataFlowAnalysisIntroImages/OutputParameterIdentificationLattice.svg) + +To determine whether a statement reads or writes a field we can implement +symbolic evaluation of `DeclRefExpr`s, `LValueToRValue` casts, pointer +dereference operator and `MemberExpr`s. + +### Using data flow results to identify output parameters + +Let's take a look at how we use data flow analysis to identify an output +parameter. The refactoring can be safely done when the data flow algorithm +computes a normal state with all of the fields proven to be overwritten in the +exit basic block of the function. + +```c++ +struct Customer { + int account_id; + std::string name; +}; + +void GetCustomer(Customer* c) { + // Overwritten: {} + c->account_id = ...; // Overwritten: {c->account_id} + if (...) { + c->name = ...; // Overwritten: {c->account_id, c->name} + } else { + c->name = ...; // Overwritten: {c->account_id, c->name} + } + // Overwritten: {c->account_id, c->name} +} +``` + +When the data flow algorithm computes a normal state, but not all fields are +proven to be overwritten we can't perform the refactoring. + +```c++ +void target(bool b, Customer* c) { + // Overwritten: {} + if (b) { + c->account_id = 42; // Overwritten: {c->account_id} + } else { + c->name = "Konrad"; // Overwritten: {c->name} + } + // Overwritten: {} +} +``` + +Similarly, when the data flow algorithm computes a failure state, we also can't +perform the refactoring. + +```c++ +Customer* kGlobalCustomer; + +void GetCustomer(Customer* c) { + // Overwritten: {} + c->account_id = ...; // Overwritten: {c->account_id} + if (...) { + print(c->name); // Unsafe read + } else { + kGlobalCustomer = c; // Pointer escape + } + // Unsafe read, Pointer escape +} +``` + +## Example: finding dead stores + +Let's say we want to find redundant stores, because they indicate potential +bugs. + +```c++ +x = GetX(); +x = GetY(); +``` + +The first store to `x` is never read, probably there is a bug. + +The implementation of dead store analysis is very similar to output parameter +analysis: we need to track stores and loads, and find stores that were never +read. + +[Liveness analysis](https://en.wikipedia.org/wiki/Live_variable_analysis) is a +generalization of this idea, which is often used to answer many related +questions, for example: + +* finding dead stores, +* finding uninitialized variables, +* finding a good point to deallocate memory, +* finding out if it would be safe to move an object. + +## Example: definitive initialization + +Definitive initialization proves that variables are known to be initialized when +read. If we find a variable which is read when not initialized then we generate +a warning. + +```c++ +void Init() { + int x; // x is uninitialized + if (cond()) { + x = 10; // x is initialized + } else { + x = 20; // x is initialized + } + print(x); // x is initialized +} +``` + +```c++ +void Uninit() { + int x; // x is uninitialized + if (cond()) { + x = 10; // x is initialized + } + print(x); // x is maybe uninitialized, x is being read, report a bug. +} +``` + +For this purpose we can use lattice in a form of a mapping from variable +declarations to initialization states; each initialization state is represented +by the followingn lattice: + +![Lattice for definitive initialization analysis](DataFlowAnalysisIntroImages/DefinitiveInitializationLattice.svg) + +A lattice element could also capture the source locations of the branches that +lead us to the corresponding program point. Diagnostics would use this +information to show a sample buggy code path to the user. + +## Example: refactoring raw pointers to `unique_ptr` + +Modern idiomatic C++ uses smart pointers to express memory ownership, however in +pre-C++11 code one can often find raw pointers that own heap memory blocks. + +Imagine that we would like to refactor raw pointers that own memory to +`unique_ptr`. There are multiple ways to design a data flow analysis for this +problem; let's look at one way to do it. + +For example, we would like to refactor the following code that uses raw +pointers: + +```c++ +void UniqueOwnership1() { + int *pi = new int; + if (...) { + Borrow(pi); + delete pi; + } else { + TakeOwnership(pi); + } +} +``` + +into code that uses `unique_ptr`: + +```c++ +void UniqueOwnership1() { + auto pi = std::make_unique(); + if (...) { + Borrow(pi.get()); + } else { + TakeOwnership(pi.release()); + } +} +``` + +This problem can be solved with a lattice in form of map from value declarations +to pointer states: + +![Lattice that identifies candidates for unique_ptr refactoring](DataFlowAnalysisIntroImages/UniquePtrLattice.svg) + +We can perform the refactoring if at the exit of a function `pi` is +`Compatible`. + +```c++ +void UniqueOwnership1() { + int *pi; // pi is Compatible + pi = new int; // pi is Defined + if (...) { + Borrow(pi); // pi is Defined + delete pi; // pi is Compatible + } else { + TakeOwnership(pi); // pi is Compatible + } + // pi is Compatible +} +``` + +Let's look at an example where the raw pointer owns two different memory blocks: + +```c++ +void UniqueOwnership2() { + int *pi = new int; // pi is Defined + Borrow(pi); + delete pi; // pi is Compatible + if (smth) { + pi = new int; // pi is Defined + Borrow(pi); + delete pi; // pi is Compatible + } + // pi is Compatible +} +``` + +It can be refactored to use `unique_ptr` like this: + +```c++ +void UniqueOwnership2() { + auto pi = make_unique(); + Borrow(pi); + if (smth) { + pi = make_unique(); + Borrow(pi); + } +} +``` + +In the following example, the raw pointer is used to access the heap object +after the ownership has been transferred. + +```c++ +void UniqueOwnership3() { + int *pi = new int; // pi is Defined + if (...) { + Borrow(pi); + delete pi; // pi is Compatible + } else { + vector> v = {std::unique_ptr(pi)}; // pi is Compatible + print(*pi); + use(v); + } + // pi is Compatible +} +``` + +We can refactor this code to use `unique_ptr`, however we would have to +introduce a non-owning pointer variable, since we can't use the moved-from +`unique_ptr` to access the object: + +```c++ +void UniqueOwnership3() { + std::unique_ptr pi = std::make_unique(); + if (...) { + Borrow(pi); + } else { + int *pi_non_owning = pi.get(); + vector> v = {std::move(pi)}; + print(*pi_non_owning); + use(v); + } +} +``` + +If the original code didn't call `delete` at the very end of the function, then +our refactoring may change the point at which we run the destructor and release +memory. Specifically, if there is some user code after `delete`, then extending +the lifetime of the object until the end of the function may hold locks for +longer than necessary, introduce memory overhead etc. + +One solution is to always replace `delete` with a call to `reset()`, and then +perform another analysis that removes unnecessary `reset()` calls. + +```c++ +void AddedMemoryOverhead() { + HugeObject *ho = new HugeObject(); + use(ho); + delete ho; // Release the large amount of memory quickly. + LongRunningFunction(); +} +``` + +This analysis will refuse to refactor code that mixes borrowed pointer values +and unique ownership. In the following code, `GetPtr()` returns a borrowed +pointer, which is assigned to `pi`. Then, `pi` is used to hold a uniquely-owned +pointer. We don't distinguish between these two assignments, and we want each +assignment to be paired with a corresponding sink; otherwise, we transition the +pointer to a `Conflicting` state, like in this example. + +```c++ +void ConflictingOwnership() { + int *pi; // pi is Compatible + pi = GetPtr(); // pi is Defined + Borrow(pi); // pi is Defined + + pi = new int; // pi is Conflicting + Borrow(pi); + delete pi; + // pi is Conflicting +} +``` + +We could still handle this case by finding a maximal range in the code where +`pi` could be in the Compatible state, and only refactoring that part. + +```c++ +void ConflictingOwnership() { + int *pi; + pi = GetPtr(); + Borrow(pi); + + std::unique_ptr pi_unique = std::make_unique(); + Borrow(pi_unique.get()); +} +``` + +## Example: finding redundant branch conditions + +In the code below `b1` should not be checked in both the outer and inner "if" +statements. It is likely there is a bug in this code. + +```c++ +int F(bool b1, bool b2) { + if (b1) { + f(); + if (b1 && b2) { // Check `b1` again -- unnecessary! + g(); + } + } +} +``` + +A checker that finds this pattern syntactically is already implemented in +ClangTidy using AST matchers (`bugprone-redundant-branch-condition`). + +To implement it using the data flow analysis framework, we can produce a warning +if any part of the branch condition is implied by the flow condition. + +```c++ +int F(bool b1, bool b2) { + // Flow condition: true. + if (b1) { + // Flow condition: b1. + f(); + if (b1 && b2) { // `b1` is implied by the flow condition. + g(); + } + } +} +``` + +One way to check this implication is to use a SAT solver. Without a SAT solver, +we could keep the flow condition in the CNF form and then it would be easy to +check the implication. + +## Example: finding unchecked `std::optional` unwraps + +Calling `optional::value()` is only valid if `optional::has_value()` is true. We +want to show that when `x.value()` is executed, the flow condition implies +`x.has_value()`. + +In the example below `x.value()` is accessed safely because it is guarded by the +`x.has_value()` check. + +```c++ +void Example(std::optional &x) { + if (x.has_value()) { + use(x.value()); + } +} +``` + +While entering the if branch we deduce that `x.has_value()` is implied by the +flow condition. + +```c++ +void Example(std::optional x) { + // Flow condition: true. + if (x.has_value()) { + // Flow condition: x.has_value() == true. + use(x.value()); + } + // Flow condition: true. +} +``` + +We also need to prove that `x` is not modified between check and value access. +The modification of `x` may be very subtle: + +```c++ +void F(std::optional &x); + +void Example(std::optional &x) { + if (x.has_value()) { + // Flow condition: x.has_value() == true. + unknown_function(x); // may change x. + // Flow condition: true. + use(x.value()); + } +} +``` + +## Example: finding dead code behind A/B experiment flags + +Finding dead code is a classic application of data flow analysis. + +Unused flags for A/B experiment hide dead code. However, this flavor of dead +code is invisible to the compiler because the flag can be turned on at any +moment. + +We could make a tool that deletes experiment flags. The user tells us which flag +they want to delete, and we assume that the it's value is a given constant. + +For example, the user could use the tool to remove `example_flag` from this +code: + +```c++ +DEFINE_FLAG(std::string, example_flag, "", "A sample flag."); + +void Example() { + bool x = GetFlag(FLAGS_example_flag).empty(); + f(); + if (x) { + g(); + } else { + h(); + } +} +``` + +The tool would simplify the code to: + +```c++ +void Example() { + f(); + g(); +} +``` + +We can solve this problem with a classic constant propagation lattice combined +with symbolic evaluation. + +## Example: finding inefficient usages of associative containers + +Real-world code often accidentally performs repeated lookups in associative +containers: + +```c++ +map xs; +xs[42]->name = "..."; +xs[42]->title = "..."; +``` + +To find the above inefficiency we can use the available expressions analysis to +understand that `m[42]` is evaluated twice. + +```c++ +map xs; +Employee &e = xs[42]; +e->name = "..."; +e->title = "..."; +``` + +We can also track the `m.contains()` check in the flow condition to find +redundant checks, like in the example below. + +```c++ +std::map xs; +if (!xs.contains(42)) { + xs.insert({42, someEmployee}); +} +``` + +## Example: refactoring types that implicitly convert to each other + +Refactoring one strong type to another is difficult, but the compiler can help: +once you refactor one reference to the type, the compiler will flag other places +where this information flows with type mismatch errors. Unfortunately this +strategy does not work when you are refactoring types that implicitly convert to +each other, for example, replacing `int32_t` with `int64_t`. + +Imagine that we want to change user IDs from 32 to 64-bit integers. In other +words, we need to find all integers tainted with user IDs. We can use data flow +analysis to implement taint analysis. + +```c++ +void UseUser(int32_t user_id) { + int32_t id = user_id; + // Variable `id` is tainted with a user ID. + ... +} +``` + +Taint analysis is very well suited to this problem because the program rarely +branches on user IDs, and almost certainly does not perform any computation +(like arithmetic). diff --git a/clang/docs/DataFlowAnalysisIntroImages/CFGExample.svg b/clang/docs/DataFlowAnalysisIntroImages/CFGExample.svg new file mode 100644 --- /dev/null +++ b/clang/docs/DataFlowAnalysisIntroImages/CFGExample.svg @@ -0,0 +1,520 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Entry + + // Pre: x is ⊥int x;if (n > 0)// Post: x is ⊥ + + // Pre: x is ⊥x = n;// Post: x is ⊤ + + // Pre: x is ⊥if (n == 42)// Post: x is ⊥ + + // Pre: x is ⊥x = 5;// Post: x is {5} + + // Pre: x is ⊥x = 44;// Post: x is {44} + + // Pre: x is {5; 44}print(x);// Post: x is {5; 44} + + // Pre: x is ⊤print(x);// Post: x is ⊤ + + + + + + + + + + + Exit + + True + True + False + False + + diff --git a/clang/docs/DataFlowAnalysisIntroImages/CFGJoinRule.svg b/clang/docs/DataFlowAnalysisIntroImages/CFGJoinRule.svg new file mode 100644 --- /dev/null +++ b/clang/docs/DataFlowAnalysisIntroImages/CFGJoinRule.svg @@ -0,0 +1,222 @@ + + + + + + + + + + + + (Given) + (Conclusion) + in1 + in2 + inn + join(in1, in2, …, inn) + out = transfer(basic_block, join(in1, in2, …, inn)) + + Basic block + + + + + + + diff --git a/clang/docs/DataFlowAnalysisIntroImages/DefinitiveInitializationLattice.svg b/clang/docs/DataFlowAnalysisIntroImages/DefinitiveInitializationLattice.svg new file mode 100644 --- /dev/null +++ b/clang/docs/DataFlowAnalysisIntroImages/DefinitiveInitializationLattice.svg @@ -0,0 +1,114 @@ + + + + + + + + + + + + Maybe uninitialized + Initialized + Uninitialized + + + + + + + diff --git a/clang/docs/DataFlowAnalysisIntroImages/IntegerSetsFiniteLattice.svg b/clang/docs/DataFlowAnalysisIntroImages/IntegerSetsFiniteLattice.svg new file mode 100644 --- /dev/null +++ b/clang/docs/DataFlowAnalysisIntroImages/IntegerSetsFiniteLattice.svg @@ -0,0 +1,403 @@ + + + + + ⊥ = {} + + + {−9} + {−5} + + {−3} + {−2} + {−1} + {0} + {1} + {2} + {3} + + + + + + + + + + + + + {−9, −5} + {−3, −1} + + {1, 2} + + + + + + + + + {1, 2, 3} + + + + ⊤ = ℤ + + + + + + + diff --git a/clang/docs/DataFlowAnalysisIntroImages/IntegerSetsInfiniteLattice.svg b/clang/docs/DataFlowAnalysisIntroImages/IntegerSetsInfiniteLattice.svg new file mode 100644 --- /dev/null +++ b/clang/docs/DataFlowAnalysisIntroImages/IntegerSetsInfiniteLattice.svg @@ -0,0 +1,403 @@ + + + + + ⊥ = {} + + + {−9} + {−5} + + {−3} + {−2} + {−1} + {0} + {1} + {2} + {3} + + + + + + + + + + + + + {−9, −5} + {−3, −1} + + {1, 2} + + + + + + + + + {1, 2, 3} + + + + (goes up infinitely) + + + + + + + diff --git a/clang/docs/DataFlowAnalysisIntroImages/OutputParameterIdentificationLattice.svg b/clang/docs/DataFlowAnalysisIntroImages/OutputParameterIdentificationLattice.svg new file mode 100644 --- /dev/null +++ b/clang/docs/DataFlowAnalysisIntroImages/OutputParameterIdentificationLattice.svg @@ -0,0 +1,340 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Overwritten: {p->x, p->y, p->z} + Overwritten: {p->x, p->z} + Overwritten: {p->x, p->z} + Overwritten: {p->q} + Overwritten: {} + + + + + Normal states + + + Failure states + {Unsafe read at line 3} + {Pointer escape at line 5} + {Unsafe read at line 7} + {Unsafe read at line 3, Pointer escape at line 5} + + + {Unsafe read at line 3, Pointer escape at line 5, Unsafe read at line 7} + + + + + diff --git a/clang/docs/DataFlowAnalysisIntroImages/UniquePtrLattice.svg b/clang/docs/DataFlowAnalysisIntroImages/UniquePtrLattice.svg new file mode 100644 --- /dev/null +++ b/clang/docs/DataFlowAnalysisIntroImages/UniquePtrLattice.svg @@ -0,0 +1,114 @@ + + + + + + + + + + + + Conflicting + Defined + Compatible + + + + + + + diff --git a/clang/docs/conf.py b/clang/docs/conf.py --- a/clang/docs/conf.py +++ b/clang/docs/conf.py @@ -33,7 +33,24 @@ templates_path = ['_templates'] # The suffix of source filenames. -source_suffix = '.rst' +source_suffix = { + '.rst': 'restructuredtext', +} + +try: + import recommonmark +except ImportError: + # manpages do not use any .md sources + if not tags.has('builder-man'): + raise +else: + import sphinx + if sphinx.version_info >= (3, 0): + # This requires 0.5 or later. + extensions.append('recommonmark') + else: + source_parsers = {'.md': 'recommonmark.parser.CommonMarkParser'} + source_suffix['.md'] = 'markdown' # The encoding of source files. #source_encoding = 'utf-8-sig' diff --git a/clang/docs/index.rst b/clang/docs/index.rst --- a/clang/docs/index.rst +++ b/clang/docs/index.rst @@ -25,6 +25,7 @@ CrossCompilation ClangStaticAnalyzer ThreadSafetyAnalysis + DataFlowAnalysisIntro AddressSanitizer ThreadSanitizer MemorySanitizer