Page MenuHomePhabricator

[analyzer][WIP] Implement a primitive reaching definitions analysis
Needs ReviewPublic

Authored by Szelethus on Jul 19 2019, 6:30 AM.

Details

Summary

When I say WIP, I really mean it. This has no comments, no nothing, but follows this algorithm pretty literally.

This analysis isn't conservative enough yet, I really should include function calls with non-const references into the gen set, but hey, at this point I'm optimistic that it can be done.

Diff Detail

Event Timeline

Szelethus created this revision.Jul 19 2019, 6:30 AM
Szelethus updated this revision to Diff 211119.EditedJul 22 2019, 9:04 AM

I plan to split this patch up eventually, but here's where I'm standing right now:

  • Correct the algorithm that by accident did this: GEN[B] = GEN[B] union (IN[B] - KILL[B]) instead of this: OUT[B] = GEN[B] union (IN[B] - KILL[B])
  • Realize that llvm::SmallSet uses both operator== and operator<, and since for a reaching definition set I'd like to sort by VarDecl, but for the kill set that AND the CFGElementRef, give up on it and use std::set instead with two different comparators
  • Collect all non-local variables and regard all function calls as definitions to them
  • Collect parameter declarations as well, add them to GEN[entry block]
  • Plenty more test cases

Still left to do:

  • Document, cite the algorithm
  • If a pointee is written, invalidate all variables with the same type:
int i;
int *ptr;
// ...
*ptr = 0; // might point to i, regard it as i's definition
  • Tests for unreachable successors
  • Evaluate on a large codebase
  • Come up with something smart to handle FieldDecls (or just ignore them :^) )
NoQ added a comment.Jul 22 2019, 1:16 PM

CFGElementRef

Wait, is it a thing already?? Did i miss anything???

This analysis isn't conservative enough yet, I really should include function calls with non-const references into the gen set, but hey, at this point I'm optimistic that it can be done.

"Conservative" is not a property of the algorithm; it depends on how is the algorithm used. For some purposes you'll want to add an extra definition when you're not sure if it'll be a definition, for other purposes you'll want to only have definitions in which you are 100% certain in the set.

clang/lib/Analysis/ReachingDefinitions.cpp
69

I don't think you really need this. Every assignment would anyway appear in the CFG as a separate element. And also forEachDescendant() may accidentally scan items in different CFG blocks (i.e., if the whole statement is a ?: or a logical op).

In D64991#1596292, @NoQ wrote:

CFGElementRef

Wait, is it a thing already?? Did i miss anything???

Oh, yea, I have it locally, still need to iron out a couple things on that before daring to upload it :) I figured it wouldn't cause much confusion so I shared this one kinda hanging in the air.

This analysis isn't conservative enough yet, I really should include function calls with non-const references into the gen set, but hey, at this point I'm optimistic that it can be done.

"Conservative" is not a property of the algorithm; it depends on how is the algorithm used. For some purposes you'll want to add an extra definition when you're not sure if it'll be a definition, for other purposes you'll want to only have definitions in which you are 100% certain in the set.

Aye I'll try to specify on that. Thanks!

Szelethus added a subscriber: steakhal.

@steakhal you some great experience with (strict) aliasing, could you chip in on this maybe? :)

Szelethus updated this revision to Diff 211457.Jul 24 2019, 3:33 AM

Rebase on top of D65196.

Szelethus updated this revision to Diff 211509.Jul 24 2019, 7:58 AM

Use the correct sorting for the GEN, KILL, IN and OUT sets.

xazax.hun added inline comments.Jul 29 2019, 9:17 PM
clang/include/clang/Analysis/Analyses/ReachingDefinitions.h
31

In the future we might also want to reason about FieldDecls.

51

Is it safe to omit the Kind from the comparisons?

clang/lib/Analysis/ReachingDefinitions.cpp
62

Will this work if the LHS is not directly a declrefexpr?

E.g.:

(a, b) = 5;

Or more fun cases:

(cond ? a : b) = 5;
103

What about the non-const arguments of calls, and indirect writes through pointers? Even if you plan not to handle some of these, I prefer to have a TODO for these the appropriate places as soon as possible.

105

There are non-assignment operators too, like operator++. You might want to consider them as definitions too.

Hmm, we could make a redundant assignments checker: if a variable has multiple reaching definitions, but those all assign the same value, emit a warning. We could even use fixits with that.

void t(int a) {
  if (coin())
    a = 2; // note: reaching def
  else
    a = 2; // note: reaching def
  use(a); // warn: a is always 2 here
}
NoQ added a comment.Sep 9 2019, 5:24 PM

Hmm, we could make a redundant assignments checker: if a variable has multiple reaching definitions, but those all assign the same value, emit a warning. We could even use fixits with that.

void t(int a) {
  if (coin())
    a = 2; // note: reaching def
  else
    a = 2; // note: reaching def
  use(a); // warn: a is always 2 here
}

Sounds like a useful compiler warning to me.

Also what specific fixit do you have in mind and why do you think it'll be easily obtainable from the results of the reaching definition analysis?

The following words are echoing in my ears as I'm knowingly going completely against them:

In D45532#1083670, @NoQ wrote:

I'm asking for this because it's, like, actually difficult and painful to understand formal information in large chunks. One-solution-at-a-time would have been so much easier. It is really a bad idea to start by presenting a working solution; the great idea is to present a small non-working solution with a reasonable idea behind it, and then extend it "in place" as much as it seems necessary. It is very comfortable when parts of the patch with different "status" (main ideas, corner-case fixes, heuristics, refactoring) stay in separate patches.

Essentially every part of the solution that you implement, if you think it might deserve a separate discussion, also deserves a separate patch. It is reasonable to split the work into logical chunks before you start it. It's very unlikely that you have a single idea that takes 500 non-trivial lines of code to express. Splitting up ideas so that they were easy to grasp is an invaluable effort.

I'll keep this revision as more of a bag of ideas, and create a series of patches when I got the core of this figured out. This algorithm is very hard to implement, and its easier for me to not bother with splitting up my entire work when I'm still shooting at a moving target. Several of my ideas required me to make huge refactorings, and as the code grows, my overall desire for the finished algorithm changes as well.

With that said.

I made the following changes:

  • Added support for record, but not much in terms of C++ classes.
    • Fields of a record are identified with a (variable, chain of fields) pair, e.g. Var.a.x.b would be represented with (Var, FieldDecls of [a, x, b]).
    • A Definition of a record object doesn't contain Definitions of its fields -- instead, whenever a Definition of a record object is added to a GEN set, a helper function creates adds Definitions for its fields. This is convenient, else we'd need to flatten Definitions later on. For the example mentioned earlier, we really need an entry separately for Var, Var.a, Var.a.x, Var.a.x.b in order for the analyzer to conveniently ask for reaching definitions later down the line.
  • Added/corrected plenty of test cases.
  • Haven't really touched variable invalidation.
  • Decorated the code with a ton of TODOs, but not much documentation just yet.

There are a couple things I'm worrying about, especially those that @xazax.hun mentioned in the inlines. I'm still confident that actual writes to a variable may only be assignments and variable definitions, but they can be esoteric enough, e.g. (a, b) = 5, retrieveWhateverObject() = {9, 5, "remember the vasa"}. For now, performance doesn't really concern me, but eventually I'll need to move the matchers to a builder class. Invalidation propagation (as discussed on mailing list) hide in them a variety of tricky edge cases. All of these combined create a desire for a thought-out, easily extensible way to create GEN sets.

So, here are my ideas.

  • Overall code structure
    • Create a builder object, stored in the calculator itself, which would store the expensive-to-create matchers.
    • Make assignment an event to which a variety of rules may be assigned to, so special cases like (a, b) = 5, retrieveWhateverObject() = {9, 5, "remember the vasa"}, assignment to this, and whatever pitfalls Objective-.* languages have may be more easily handled.
    • GEN sets store the outgoing, or in other words, last definition to each variable, but this information is not enough. The way I described how I plan to make this algorithm semi-intraprocedural with the use of visitors in a nutshell works by the analyzer trying to prove that an invalidation really does write a variable, or can be discarded. However, in the same block, for the same variable, multiple invalidations of the same variable may occur. So, instead of storing the last definition, we should store all invalidations up to the first write. Later, the analyzer could remove these definitions directly, and ask the calculator to regenerate the KILL sets and recalculate the reaching definitions sets.
  • Invalidation
    • Research type based alias analysis, until then, ignore aliases, and just invalidate objects of the same type.
    • The example below highlights how the function call on line 2 is a definition of i, because ptr might point to it, but may not be a definition of j. Buuuuuuuuuuut, with the use of goto, it is possible. This implies that every variable in a non-nested scope has to be invalidated if a pointer with an aliasing type escapes. This may be super tricky to implement precisely, so I'll probably just take all variables, not only non-nested ones into account.
1 void f(int i, int *ptr) {
2   foo(ptr);
3   int j;
4 }
  • Invalidation has to be very customizable. The static analyzer doesn't generate code, so some relaxation of the invalidation ruleset maybe desirable, such as assuming that offsetof isn't used to change the rest of the record if a field of it escapes.
Szelethus updated this revision to Diff 222514.Sep 30 2019, 4:23 PM

I'm starting to be really happy with the current direction! I think I'll start splitting this up soon, I'm confident that the current interface (after some polishing) is general enough to develop incrementally.

  • Introduce Variable, a base class of Definition, which is a (VarDecl, FieldChain) pair.
  • Introduce GenSetBuilder, a class that owns all MatchFinders and associated callbacks used for finding potential writes and invalidations to Variables, use exclusively it to build GEN sets.
    • Use a very flexible interface for creating and registering matchers for either
      • Creating the set of Variables accessible to a function
      • Finding statements that may write any of those
      • Teasing an Expr apart to find Variables in it
    • In an init() pass, collect all variables used, or potentially invalidated in the function. This allows for invalidations and writes to be done in a single pass on a CFGStmt.
    • Added a very primitive solution to invalidate variables passed to functions as a non-const reference.