Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
The revision title mentions the "AsyncParallelFor pass", but this isn't touching the pass itself, right?
mlir/include/mlir/Dialect/Async/Transforms.h:35
Can you improve the doc?
mlir/include/mlir/Dialect/Async/Transforms.h:23
Rebase. Note that this made some of the loop unrolling logic dynamic, which seeped all the way into the loop building.
mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp:867

Computing numUnrollableLoops as a Value (compute function argument) plus a runtime scf.if in the loop nest prevents LLVM from loop unrolling and vectorization, and it leads to large regressions in:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/mlir/tfrt/benchmarks/compute_function_benchmark.cc

    name                                   old cpu/op   new cpu/op   delta
    BM_cpurt_Fresh2/4/process_time         12.3µs ± 9%  12.3µs ±10%  ~         (p=0.961 n=17+18)
    BM_cpurt_Fresh2/8/process_time         12.3µs ± 8%  12.5µs ± 7%  ~         (p=0.245 n=18+17)
    BM_cpurt_Factorized0/0/process_time    21.1µs ± 4%  21.2µs ± 4%  ~         (p=0.499 n=19+18)
    BM_cpurt_Factorized0/4/process_time    21.5µs ± 7%  35.9µs ± 6%  +66.93%   (p=0.000 n=19+20)
    BM_cpurt_Factorized0/8/process_time    21.0µs ± 3%  35.7µs ± 5%  +69.52%   (p=0.000 n=17+20)
    BM_cpurt_Factorized1/0/process_time    9.35µs ± 6%  9.31µs ± 3%  ~         (p=0.732 n=18+17)
    BM_cpurt_Factorized1/4/process_time    35.9µs ± 4%  46.4µs ± 5%  +29.04%   (p=0.000 n=18+19)
    BM_cpurt_Factorized1/8/process_time    36.2µs ± 3%  46.3µs ± 6%  +28.17%   (p=0.000 n=18+19)
    BM_cpurt_Factorized2/0/process_time    86.7µs ± 8%  86.3µs ± 8%  ~         (p=0.832 n=18+17)
    BM_cpurt_Factorized2/4/process_time    124µs  ± 7%  442µs  ± 3%  +255.67%  (p=0.000 n=18+18)
    BM_cpurt_Factorized2/8/process_time    159µs  ± 6%  459µs  ± 6%  +187.73%  (p=0.000 n=17+18)

numUnrollableLoops should be known at compile time; with a dynamic minTaskSize the structure should look like this:

    ParallelComputeFunction unrollableParallelComputeFunction =
        createParallelComputeFunction(op, staticBounds, numUnrollableLoops, rewriter);
    ParallelComputeFunction defaultParallelComputeFunction =
        createParallelComputeFunction(op, staticBounds, /*numUnrollableLoops=*/0, rewriter);

    b.create<scf::IfOp>(..) {
      dispatch unrollableParallelComputeFunction
    } else {
      dispatch defaultParallelComputeFunction
    }

There is no real need or benefit in computing numUnrollableLoops as a Value, because unrolling/vectorization can only happen if trip counts (loop bounds) are known at compile time.
mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp:867
Done, PTAL.
mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp:882
You're creating a value dynamicShouldUnroll here that isn't used in the else branch below; can you sink it into the then branch?
mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp:896
nit: I'd move this lambda next to dispatchNotUnrollable to reduce the nesting and put similar things together, even though it's used only inside one branch of the if. I think it's OK to put createParallelComputeFunction inside the lambda, so you don't create the aligned compute function if you won't need it.
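The lazy-creation point in the nit above can be shown with a small C++ analogue (an illustrative sketch under assumed names; createComputeFunction and the counter are hypothetical stand-ins for emitting a compute function op): putting the expensive creation inside the dispatch lambda means it only happens if that branch is actually taken.

```cpp
#include <cassert>
#include <functional>

// Counts how many "compute functions" were built; stands in for the cost
// of emitting a new function op in the module.
static int computeFunctionsCreated = 0;

std::function<int(int)> createComputeFunction() {
  ++computeFunctionsCreated;
  return [](int x) { return x * 2; };
}

int dispatch(bool unrollable, int arg) {
  // Creation is inside the lambda, so it is deferred until the lambda runs.
  auto dispatchUnrollable = [&] {
    auto fn = createComputeFunction();
    return fn(arg);
  };
  auto dispatchDefault = [&] { return arg; };
  return unrollable ? dispatchUnrollable() : dispatchDefault();
}
```

Taking the default path leaves computeFunctionsCreated untouched, which is the behavior the review comment asks for.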
Move minTaskSize definition earlier to avoid contaminating the analysis results with the partially executed rewrite logic.