This is an archive of the discontinued LLVM Phabricator instance.

Make AsyncParallelForRewrite parameterizable with a cost model which drives deciding the parallelization granularity.
ClosedPublic

Authored by bakhtiyarneyman on Dec 8 2021, 8:54 PM.

Diff Detail

Event Timeline

bakhtiyarneyman created this revision.Dec 8 2021, 8:54 PM
bakhtiyarneyman requested review of this revision.Dec 8 2021, 8:54 PM

The revision title mentions "AsyncParallelFor pass" but this isn't touching the pass itself right?

mlir/include/mlir/Dialect/Async/Transforms.h
35

Can you improve the doc?

Improve docs.

bakhtiyarneyman marked an inline comment as done.Dec 9 2021, 2:49 PM

The revision title mentions "AsyncParallelFor pass" but this isn't touching the pass itself right?

Correct.

bakhtiyarneyman retitled this revision from Make AsyncParallelFor pass parameterizable with a cost model which drives deciding the parallelization granularity. to Make AsyncParallelForRewrite parameterizable with a cost model which drives deciding the parallelization granularity..Dec 9 2021, 2:50 PM
ezhulenev accepted this revision.Dec 9 2021, 3:19 PM
This revision is now accepted and ready to land.Dec 9 2021, 3:19 PM
mehdi_amini accepted this revision.Dec 9 2021, 4:47 PM
mehdi_amini added inline comments.
mlir/include/mlir/Dialect/Async/Transforms.h
23

Change the comment.

bakhtiyarneyman marked an inline comment as done.Dec 9 2021, 6:00 PM

Rebase. Note that this made some of the loop unrolling logic dynamic which seeped all the way into loop building.

ezhulenev added inline comments.Dec 14 2021, 4:28 AM
mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
825

Computing numUnrollableLoops as Value (compute function argument) + runtime scf.if in the loop nest prevents LLVM from loop unrolling and vectorization, and it leads to large regressions in: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/mlir/tfrt/benchmarks/compute_function_benchmark.cc

name                                 old cpu/op  new cpu/op  delta
BM_cpurt_Fresh2/4/process_time       12.3µs ± 9%  12.3µs ±10%      ~     (p=0.961 n=17+18)
BM_cpurt_Fresh2/8/process_time       12.3µs ± 8%  12.5µs ± 7%      ~     (p=0.245 n=18+17)
BM_cpurt_Factorized0/0/process_time  21.1µs ± 4%  21.2µs ± 4%      ~     (p=0.499 n=19+18)
BM_cpurt_Factorized0/4/process_time  21.5µs ± 7%  35.9µs ± 6%   +66.93%  (p=0.000 n=19+20)
BM_cpurt_Factorized0/8/process_time  21.0µs ± 3%  35.7µs ± 5%   +69.52%  (p=0.000 n=17+20)
BM_cpurt_Factorized1/0/process_time  9.35µs ± 6%  9.31µs ± 3%      ~     (p=0.732 n=18+17)
BM_cpurt_Factorized1/4/process_time  35.9µs ± 4%  46.4µs ± 5%   +29.04%  (p=0.000 n=18+19)
BM_cpurt_Factorized1/8/process_time  36.2µs ± 3%  46.3µs ± 6%   +28.17%  (p=0.000 n=18+19)
BM_cpurt_Factorized2/0/process_time  86.7µs ± 8%  86.3µs ± 8%      ~     (p=0.832 n=18+17)
BM_cpurt_Factorized2/4/process_time   124µs ± 7%   442µs ± 3%  +255.67%  (p=0.000 n=18+18)
BM_cpurt_Factorized2/8/process_time   159µs ± 6%   459µs ± 6%  +187.73%  (p=0.000 n=17+18)

numUnrollableLoops should be known at compiled time, with dynamic minTaskSize the structure should look like this:

ParallelComputeFunction unrollableParallelComputeFunction =
        createParallelComputeFunction(op, staticBounds, numUnrollableLoops,
                                      rewriter);

ParallelComputeFunction defaultParallelComputeFunction =
        createParallelComputeFunction(op, staticBounds, numUnrollableLoops,
                                      rewriter);

b.create<scf::IfOp>(..) {
  dispatch unrollableParallelComputeFunction
} else {
  dispatch defaultParallelComputeFunction
}

There is no real need of benefit of computing numUnrollableLoop as Value, because unrolling/vectorization can only happen in trip counts (loop bounds) are know at compile time.

Address comments: lift the scf.if out of the loops.

bakhtiyarneyman marked an inline comment as done.Dec 14 2021, 5:51 PM
bakhtiyarneyman added inline comments.
mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
825

Done, PTAL.

mehdi_amini added inline comments.Dec 14 2021, 6:07 PM
mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
833

You're creating a value dynamicShouldUnroll here which isn't used in the else branch below, can you sink this in the then branch?

bakhtiyarneyman marked an inline comment as done.

Elide dynamic check if possible.

bakhtiyarneyman marked an inline comment as done.Dec 14 2021, 6:23 PM
ezhulenev accepted this revision.Dec 15 2021, 2:05 AM
ezhulenev added inline comments.Dec 15 2021, 2:14 AM
mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
847

nit: I'd move this lambda close to the dispatchNotUnrollable to reduce the nesting, and put similar things together, although it's used only inside one branch of the if. I think it's ok to put createParallelComputeFunction inside lamdba, so you don't create aligned compute function if you'll not need it.

Fix minor C++ compilation issue.

Same thing for another lambda.

Move minTaskSize definition earlier to avoid contaminating the analysis results with the partially executed rewrite logic.