This is an archive of the discontinued LLVM Phabricator instance.

[mlir] Add a generic data-flow analysis framework
ClosedPublic

Authored by Mogball on May 31 2022, 6:53 PM.

Details

Summary

This patch introduces a generic data-flow analysis framework to MLIR. The framework implements a fixed-point iteration algorithm and a dependency graph between lattice states and analysis. Lattice states and points are fully extensible to support highly-customizable analyses.

Diff Detail

Event Timeline

Mogball created this revision.May 31 2022, 6:53 PM
Mogball requested review of this revision.May 31 2022, 6:53 PM
Mogball planned changes to this revision.May 31 2022, 6:54 PM
mehdi_amini added inline comments.Jun 1 2022, 8:41 AM
mlir/include/mlir/Analysis/DataFlowFramework.h
34

Is "lattice point" a well known term in the literature? Otherwise it seems to me that "program point" may be more descriptive here

169

#if LLVM_ENABLE_ABI_BREAKING_CHECKS

183

Why not PointT *point ?

mlir/lib/Analysis/DataFlowFramework.cpp
37

Nit: Spurious braces multiple times, here and elsewhere.

Mogball updated this revision to Diff 433491.Jun 1 2022, 12:16 PM

rename LatticePoint to ProgramPoint and LatticeState to AnalysisState
to avoid confusion.

also fix formatting issues

Mogball updated this revision to Diff 433494.Jun 1 2022, 12:33 PM
Mogball marked 4 inline comments as done.

Mehdi's review comments

Mogball planned changes to this revision.Jun 1 2022, 12:34 PM
phisiart requested changes to this revision.Jun 2 2022, 4:51 PM
phisiart added a subscriber: phisiart.
phisiart added inline comments.Jun 2 2022, 4:52 PM
mlir/include/mlir/Analysis/DataFlowFramework.h
68

When reading the code, I find it confusing that sometimes it's in ProgramPoint::states and sometimes it's in DataFlowSolver::analysisStates.

Can we unify to DataFlowSolver::analysisStates?

81

I checked your other code, and BaseT seems to always be ProgramPoint. Is it possible for BaseT to be something else? If not, can we let ProgramPointBase always inherit ProgramPoint?

144

I think the naming is a bit confusing here: Point and ProgramPoint have pretty much the same meaning.

Maybe ProgramPoint and AbstractCustomProgramPoint?

264–269

Can we make the ownership clearer:

DenseMap<std::pair<Point, TypeID>, std::unique_ptr<AnalysisState>> analysisStates;

I understand that you might not want ProgramPoint::states to own these AnalysisStates, so like what the other comment on ProgramPoint::states suggests, I would recommend unifying to DataFlowSolver::AnalysisState.

Mogball added inline comments.Jun 2 2022, 8:52 PM
mlir/include/mlir/Analysis/DataFlowFramework.h
68

I did this for performance reasons. Let me check again what the difference is.

Mogball added inline comments.Jun 3 2022, 12:17 PM
mlir/include/mlir/Analysis/DataFlowFramework.h
68

I'll revert this back to what it was before (as you suggested). I made some recent changes to the sparse analysis that makes this optimization redundant. I checked that, as things are now, it's no longer needed.

81

This is a convention for storage uniquer types, but I'll make the change as I can't imagine scenarios in which the base type would be anything else.

Mogball updated this revision to Diff 434216.Jun 3 2022, 5:32 PM
Mogball marked 4 inline comments as done.

review comments.

Also adds an onUpdate function to analysis states

Now we need an ODM :)

I think River's up next :)

But sure!

mlir/include/mlir/Analysis/DataFlowFramework.h
144

I've named them ProgramPoint and GenericProgramPoint.

I think River's up next :)

But sure!

(No need to block on me, I can't make next week, and I'll be OOO for a few weeks after)

Took a quick scan, will look more in detail tomorrow. Really appreciate the detailed documentation.

mlir/include/mlir/Analysis/DataFlowFramework.h
10–13
51

Can this constructor be protected? (You can't create an instance of this class anyways).

130

Drop the llvm:: here, same applies to various other things in this file (take a look at what's provided by LLVM.h)

197–201

The use of draw here is kind of confusing, is this adding a dependency?

243–244

Can we make this protected?

286

Is the use of child necessary here? I would assume that we can have some analyses that are "parents" in some situations, and "children" in another.

292–293
295
303

Are these defined in this commit? I can't find what these are supposed to be referencing.

315

Could we just call this initialize?

323

? To avoid "update .. update"

mlir/lib/Analysis/DataFlowFramework.cpp
14

Why? Please prefer using LLVM_DEBUG directly instead.

20

Please prefer using namespace mlir; instead.

122–127
145

Can you just inline these into the header?

Mogball added inline comments.Jun 6 2022, 8:35 AM
mlir/lib/Analysis/DataFlowFramework.cpp
14

It was a little annoying wrapping it with LLVM_ENABLE_ABI_BREAKING_CHECKS everywhere...

Mogball updated this revision to Diff 434505.Jun 6 2022, 9:13 AM
Mogball marked 15 inline comments as done.

review comments

mlir/include/mlir/Analysis/DataFlowFramework.h
10–13

https://jakubmarian.com/comma-before-that-and-which/

Odd to be digging into a grammar debate here... but I've seen this mistake a few times in the code base's comments :P

303

soon^TM

rriddle added inline comments.Jun 6 2022, 11:10 PM
mlir/include/mlir/Analysis/DataFlowFramework.h
10–13

Not sure that applies here given that you could wrap that portion within parentheses and it would make more sense. The sentence as written has a run-on feel of "and .. and" which apply to different parts of the sentence. Can you reword the sentence then? Right now it's confusing to read.

rriddle added inline comments.Jun 6 2022, 11:11 PM
mlir/lib/Analysis/DataFlowFramework.cpp
14

Not sure what you mean, this is a .cpp you don't need to use LLVM_ENABLE_ABI_BREAKING_CHECKS anywhere...

rriddle added inline comments.Jun 6 2022, 11:15 PM
mlir/lib/Analysis/DataFlowFramework.cpp
14

At least, I'm not sure what situations we have that set NDEBUG but not LLVM_ENABLE_ABI_BREAKING_CHECKS.

Mogball marked 3 inline comments as done.Jun 7 2022, 10:09 AM
Mogball added inline comments.
mlir/lib/Analysis/DataFlowFramework.cpp
14

Internally at google, LLVM_ENABLE_ABI_BREAKING_CHECKS is always turned off, even when NDEBUG is not set (the two debugName fields are not compiled when LLVM_ENABLE_ABI_BREAKING_CHECKS is not set).

Alternatively, I can make the debugName fields always compile.

Mogball updated this revision to Diff 434865.Jun 7 2022, 10:19 AM
Mogball marked an inline comment as done.

clang-format

Sorry for the delay, I'll leave some more comments tomorrow.

mlir/include/mlir/Analysis/DataFlowFramework.h
119

Throughout the class this is called key so maybe getKey()?

122

"key and contents" doesn't seem to match the code.

Mogball added inline comments.Jun 8 2022, 8:59 AM
mlir/include/mlir/Analysis/DataFlowFramework.h
122

It's a content key. KeyTy is a required typedef of StorageUniquer::BaseStorage. I'll rename some of the stuff to make this more obvious.

Mogball updated this revision to Diff 435376.Jun 8 2022, 4:03 PM

change "key" to "value" in GenericProgramPointBase

Mogball marked 2 inline comments as done.Jun 8 2022, 4:03 PM
phisiart added inline comments.Jun 9 2022, 12:58 AM
mlir/include/mlir/Analysis/DataFlowFramework.h
293

I've been thinking about how to get rid of the need for propagateIfChanged(). One way to do it is to reintroduce the separation of "LatticeValue" and "LatticeElement", and here let's call them "AnalysisState" and "AnalysisStateElement" (we should try to find a better name than "Element"...).

// User should override this.
class AnalysisState {
public:
  virtual ~AnalysisState();
  virtual bool isUninitialized() const = 0;
  virtual ChangeResult defaultInitialize() = 0;
  virtual void print(raw_ostream &os) const = 0;

protected:
  AnalysisState(ProgramPoint point) : point(point) {}
  StringRef debugName;
};

// User should not override this.
// AnalysisStateElement (instead of AnalysisState) is what's stored in DataFlowSolver::analysisStates.
template <typename State>
class AnalysisStateElement {
 public:
  // Read-only.
  const State &get() const {
    return state;
  }

  // User must return a ChangeResult, and the API does the
  // "propagateIfChanged".
  //
  // Usage:
  // update([&](State *state, Set<WorkItem> *added_dependees) {
  //   ...
  // });
  void update(function_ref<ChangeResult(State *, Set<WorkItem> *)> update_func) {
    SetVector<WorkItem> added_dependees;
    auto changed = update_func(&state, &added_dependees);
    dependees.insert(added_dependees.begin(), added_dependees.end());

    if (changed) {
      for (auto &dependee : dependees) {
        solver.enqueue(dependee);
      }
    }
  }

 private:
  ProgramPoint point;
  State state;
  SetVector<WorkItem> dependees;
  DataFlowSolver &solver;
};

This would address the following issues at the same time:

  1. Currently AnalysisState is a somewhat awkward abstraction: it contains logic on dependency management. I think that AnalysisState should be a standalone class that only represents the stored state itself (i.e. the lattice value); all dependency management logic is considered part of the analysis and should appear in DataFlowAnalysis which the user overrides.
  1. By introducing AnalysisStateElement we force all state updates to go through a single entry point (update() in the example code I wrote). This way we don't need to expose propagateIfChanged and the user will never forget to call it.
  1. From a reader's perspective, currently AnalysisState is heavily coupled with DataFlowSolver, which is a big class. In particular, the reader must understand what they are expected to do with DataFlowSolver in onUpdate. By creating a separate AnalysisStateElement class, AnalysisState becomes more readable.

AnalysisStateElement is not to be overridden by the user. In particular, this does mean that we can't have the custom onUpdate() anymore. In my opinion, such logic belongs to visit(): we shouldn't spread analysis logic in both DataFlowAnalysis::visit() and onUpdate().

Mogball added inline comments.Jun 9 2022, 12:02 PM
mlir/include/mlir/Analysis/DataFlowFramework.h
293

(Note: "dependee" is a misnomer, it should be "dependent"...)

I do appreciate that the separation of analysis state and element makes the API clearer, and I am aware that AnalysisState, DataFlowSolver, and DataFlowAnalysis are all tightly coupled, but there is a major problem with this approach.

update_func is not going to know what the dependents are. The dependents could be analyses and states of which the writer of update_func is not aware. This will make any analyses written in this style not be naturally composable. For a sparse analysis, this is straightforward:

getAnalysisElement<ConstantValue>(value).update(
  [&](ConstantValue *state, SetVector<WorkItem &> dependents) {
    do_update(state);
    for (Operation *user : value.getUsers()) dependents.insert({this, user});
});

But what happens if I write an AnalysisB that wants to depend on this state? The above analysis will need to know about that. This is why dependency management is the responsibility of the solver (and the states combined). Individual analyses can ask the solver, "hey, when this state gets update, invoke me again" or, "hey, when you get updated, invoke me on the users of the value".

There are also performance implications. The dependents can be stored on the solver, but I moved them to the analysis state class to skip a map lookup when propagateIfChanged is called (which also improves cache locality). And onUpdate exists so that states can implement dependency logic outside of the big hash map, e.g. by using use-def chains. This results in significant performance boosts to constant propagation so that it matches the performance (within 5%, despite the analysis being split in two) of the current implementation. Separating the state itself and the dependency logic into separate classes would mean that both would have to be overwritten...

I would agree that the API surface isn't perfect. addDependency is a function in DataFlowSolver but it just accesses private members of` AnalysisState`. That's a minor change and I should probably do that.

(The reason I didn't have a backreference to DataFlowSolver in each AnalysisState instead of making the classes friends with each other is because I didn't want to increase the size of AnalysisState, but I haven't checked the memory usage on large programs yet so this point is a little moot).

Mogball marked an inline comment as done.Jun 9 2022, 12:45 PM
rriddle accepted this revision.Jun 12 2022, 11:52 PM

I think this is in a good enough state that we can land and iterate on in-tree.

mlir/lib/Analysis/DataFlowFramework.cpp
14

Setting aside how horribly broken of a behavior that is (that isn't your fault), I support your original patch of having a wrapper DEBUG macro. Having to worry about differences between NDEBUG/LLVM_ENABLE_ABI_BREAKING_CHECKS is really hacky....

54

nit: Can you put comment blocks between the different class implementations? Makes the file a little easier to read.

Mogball updated this revision to Diff 436579.Jun 13 2022, 2:58 PM
Mogball marked 2 inline comments as done.

review comments

phisiart accepted this revision.Jun 13 2022, 3:03 PM
This revision is now accepted and ready to land.Jun 13 2022, 3:03 PM
Mogball updated this revision to Diff 436655.Jun 13 2022, 10:29 PM

fix windows

Mogball updated this revision to Diff 436665.Jun 13 2022, 11:42 PM

make AnalysisState and DataFlowAnalysis constructors public to avoid confusion
with mscv inheritance rules

This revision was landed with ongoing or failed builds.Jun 14 2022, 9:54 AM
This revision was automatically updated to reflect the committed changes.