This patch finally makes CFLAnders properly build function summaries.
The approach I took was more complicated than I expected. Initially, I thought that the information we need to build the summary is all included in the ReachabilitySet, and I could just iterate over it once to populate the entire function summary. However, it turns out that we might have cases like this:
define i32* @f(i32*** %arg1) { %d = load i32**, i32*** arg1 %d2 = load i32*, i32** %d ret i32* %d2 }
Here node (%d2, 0) does not value-alias with any other nodes in ReachabilitySet [1][2]. All we can deduce from ReachSet is that (%d, 1) is assigned to (%d2, 0), and (%arg, 1) is assigned to (%d, 0). The value %d here acts as an "intermediate value" that connects assignments from/to its different DerefLevel. Such a connection is not something we could obvious obtain from ReachabilitySet, which only holds assignments that are on the same level. If we don't check for this case, we will incorrectly conclude that the return value of @f has no relation with the parameter %arg1 of @f.
There could be many ways to resolve the above issue. The proposed solution in this patch is to build a read&write set for each value V in the function, holding param/return values that are read into or write from V, respectively. If both the read set and the write set of V are non-empty, then we know that V has been used as an "intermediate value", and we should add an cross-level assignment summary edge from everything in the write set to everything in the read set. An alternative design is to bake those read/write sets into ReachabilitySet itself so we don't have to build additional structures at summary construction time. But I decided to go with the current approach since it is relatively cleaner and easy to think about.
[1] Here I use the pair notation (x, y) to denote an InstantiatedValue with value x and dereference-level y.
[2] In theory, (%d2, 0) should alias (%arg1, 2), but our CFL matching algorithm doesn't add the node (%arg1, 2) into ReachabilitySet. The algorithm doesn't do it because (%arg1, 2) never gets explicitly referenced in the function body. If we were to add (%arg1, 2), we need to also consider questions like "why don't we add (%arg1, 3)", "at which DerefLevel should we stop adding nodes", and things may quickly get unnecessarily complicated that way.
Is there a reason these aren't static?