This is an archive of the discontinued LLVM Phabricator instance.

Implement a cost model to drive the lowering of scf.parallel.
Needs ReviewPublic

Authored by bakhtiyarneyman on Nov 23 2021, 7:43 PM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

Depends On D110680

Diff Detail

Event Timeline

bakhtiyarneyman requested review of this revision.Nov 23 2021, 7:43 PM

This cost model component is pretty interesting! It looks like something that really can be a generic pass that would drive a traversal and query opinterfaces?

mlir/lib/Dialect/Async/Transforms/CostModel.cpp
1

(license header)

89–91
95

Seems like all these kind of "estimateCostXXX" are the kind of things that should be behind an op interface instead, shouldn't it?

mlir/lib/Dialect/Async/Transforms/CostModel.h
1

You're missing the license header, and the guard does not match the usual format I believe.

6

Please avoid using namespace directive in a header.

14

Can you make it a class and increase the doc?
Right now the description says " Approximate cost of executing an op on a modern CPU" but the first member is a Builder which is confusing to me.

Edit: reading the implementation, I get a better feel for what is going on, but the comment stands :)

nicolasvasilache requested changes to this revision.Nov 24 2021, 3:07 AM

This seems important and cross-cutting enough to warrant an RFC on discourse, I am surprised @mehdi_amini has not asked for this already :)
Putting a blocker until a discussion has happened.

This revision now requires changes to proceed.Nov 24 2021, 3:07 AM
bakhtiyarneyman marked 4 inline comments as done.

Address comments.

mlir/lib/Dialect/Async/Transforms/CostModel.cpp
95

Yes, the code is written with that potential development in mind. But it is not clear that something as intrusive is necessary now. The concept needs to be validated first. The biggest question is whether the estimation precision of what we have now is sufficient. If it turns to be insufficient, we might need to do better analysis. For instance, it's conceivable that the whole approach would need to be scrapped, it might be imperative to do the memory component of the analysis earlier and/or incrementally, before the tiling and fusion have been lowered to ops which make it difficult to guess the cache performance. Furthermore, it might be beneficial to make a first-class cost-model which would help drive optimizations in other parts of the pipeline, most importantly the codegen.

Furthermore, I suspect that once we are trying to use op interfaces, the cost model needs to be reimagined as this first-class entity, which might mean that we would need to design it in a way, that is useful for other architectures, e.g. GPUs. That in itself will take a quarter. So my preference is to do small and local to the AsyncParallelFor pass.

mlir/lib/Dialect/Async/Transforms/CostModel.h
1

It seems to be matching the PassDetail.h from the same directory, no?

It would be great to have the cost model to be a part of core MLIR, defined with op interface, but for it to be usable for all the potential targets it will require a lot of discussions (e.g. how cost model can capture GPU memory coalescing, should it be the concern of the cost model at all, or it should just track compute?). Let's start an RFC to figure out how it should look like. In the meantime we can hide the current crude implementation behind the AsyncBlockSizeComputationFunction API, and keep implementation inside the Tensorflow compiler. We can tune it for the compilation pipeline we have right now, without need to worry about the general case.

mlir/include/mlir/Dialect/Async/Passes.h
25

How about giving more control to the caller (similar to the TileSizeComputationFunction in linalg).

Something like this:

struct AsyncBlockSize {
  Value targetBlockSize; // must be a Value of index type
   ...                   // some other values?
}

using AsyncBlockSizeComputationFunction =
    std::function<AsyncBlockSize(OpBuilder &, scf::ParallelOp parallelLoop)>;

createAsyncParallelForPass(bool asyncDispatch, int32_t numWorkerThreads, 
                                                AsyncBlockSizeComputationFunction cost);

Constant int32_t targetBlockSize will be just a trivial cost compute function. Also Async dialect can provide the default cost function based on the ops properties.

mlir/include/mlir/Dialect/Async/Passes.td
34

and this will become a property of the caller pipeline