This is an archive of the discontinued LLVM Phabricator instance.

[mlir:Async] Implement recursive async work splitting for scf.parallel operation (async-parallel-for pass)
ClosedPublic

Authored by ezhulenev on Jun 24 2021, 5:27 AM.

Details

Summary

Depends On D104780

Recursive work splitting instead of sequential async task submission gives a ~20%-30% speedup in microbenchmarks.
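
As a rough illustration of the idea (a conceptual C++ sketch using std::async, not the MLIR async dialect code emitted by this patch; the dispatch and BlockFn names are hypothetical): each task hands half of its remaining block range to another thread and keeps splitting the other half, so submission itself runs in parallel and the splitting tree has logarithmic depth instead of one sequential submission loop.

  #include <functional>
  #include <future>

  // Hypothetical per-block body: processes block indices in [blockStart, blockEnd).
  using BlockFn = std::function<void(int blockStart, int blockEnd)>;

  // Recursive work splitting: hand the upper half of the block range to another
  // thread and keep splitting the lower half locally, until one block remains.
  void dispatch(int blockStart, int blockEnd, const BlockFn &compute) {
    if (blockEnd - blockStart <= 1) {
      compute(blockStart, blockEnd); // Single block left: run the body.
      return;
    }
    int mid = blockStart + (blockEnd - blockStart) / 2;
    // The upper half keeps splitting itself concurrently.
    std::future<void> upper = std::async(std::launch::async, dispatch, mid,
                                         blockEnd, std::cref(compute));
    dispatch(blockStart, mid, compute); // Lower half stays on this thread.
    upper.wait(); // The real pass joins through an async group + await_all instead.
  }

  // Usage: dispatch(0, numBlocks, [](int s, int e) { /* process block s */ });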

Algorithm outline:

  1. Collapse scf.parallel dimensions into a single dimension
  2. Compute the block size for the parallel tasks from the 1-D problem size
  3. Launch parallel tasks
  4. Each parallel task reconstructs its own bounds in the original multi-dimensional iteration space (see the sketch after this list)
  5. Each parallel task computes the original parallel operation body using an scf.for loop nest
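
A conceptual C++ sketch of steps 1 and 4 (the LoopDim, collapsedTripCount, and delinearize helpers are illustrative names, not code from the pass): collapsing multiplies the per-dimension trip counts, and each task recovers multi-dimensional induction variables from a flat index by repeated div/mod; step 5 then executes the original body over these recovered coordinates with an scf.for nest.

  #include <cstdint>
  #include <vector>

  // Hypothetical description of one scf.parallel dimension.
  struct LoopDim {
    int64_t lowerBound, upperBound, step; // Assumes step > 0 for brevity.
    int64_t tripCount() const {
      return (upperBound - lowerBound + step - 1) / step;
    }
  };

  // Step 1: collapse all dimensions into a single 1-D iteration space.
  int64_t collapsedTripCount(const std::vector<LoopDim> &dims) {
    int64_t total = 1;
    for (const LoopDim &d : dims)
      total *= d.tripCount();
    return total;
  }

  // Step 4: a task that owns a range of flat indices maps each one back into
  // the original multi-dimensional space by repeated div/mod (delinearization),
  // then turns per-dimension iteration numbers into induction variables.
  std::vector<int64_t> delinearize(int64_t flatIndex,
                                   const std::vector<LoopDim> &dims) {
    std::vector<int64_t> ivs(dims.size());
    for (int i = static_cast<int>(dims.size()) - 1; i >= 0; --i) {
      int64_t tc = dims[i].tripCount();
      ivs[i] = dims[i].lowerBound + (flatIndex % tc) * dims[i].step;
      flatIndex /= tc;
    }
    return ivs;
  }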

Diff Detail

Event Timeline

ezhulenev created this revision.Jun 24 2021, 5:27 AM
ezhulenev requested review of this revision.Jun 24 2021, 5:27 AM
ezhulenev edited the summary of this revision. (Show Details)Jun 24 2021, 5:31 AM
ezhulenev added reviewers: herhut, mehdi_amini.

Very neat! Just some nits.

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
137

Longer term this could become an op, as this functionality is needed frequently. Not here, though.

209

Why is this needed?

230

I see the convenience of this but I'd prefer if there was no invisible side-effect on offset.

242

b.create<ConstantIndexOp>(0)

246

I wonder whether ArrayRef would be the better abstraction here.

267

how about max(blockFirstIndex + blockSize, tripCount)?

278

nit: double compute?

282

nit: contains, multiple

322

nit: or

387

Is this needed?

415

nit: 'ConstantIndexOp'

446

nit: the second half

453

nit: use start?

475

nit: use start?

507

ConstantIndexOp

544

ConstantIndexOp

682

Some more ConstantIndexOp opportunities

690

Maybe use ceil_div instead of divup?

697

Out of curiosity: Why do you use a signed ceildiv everywhere? Both operands are known to be positive here and in most of the rest of this file. Wouldn't the unsigned one suffice here?

714

nit: replaced.

mlir/test/Dialect/Async/async-parallel-for-async-dispatch.mlir
26

This test is super sparse. Why even capture S and E here, as they are not used? Unless you want to ensure that the right block numbers are passed, but that needs a less sparse test.

mlir/test/Integration/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir
6

Is this broken by this change? Is that because we now pass groups through functions?

ezhulenev updated this revision to Diff 354406.Jun 24 2021, 5:47 PM
ezhulenev marked 22 inline comments as done.

Resolve comments

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
209

Removed. Leftovers that I forgot to clean up.

230

I added a note that it updates the offset, but I don't see any other way to keep the call sites one-liners and make it more explicit.

242

Cool, didn't know it existed. Updated all constants.

246

Updated all call sites.

697

Step can be negative, so I used SignedCeilDiv there for correctness, and everywhere else just for "consistency" :)
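
As a side note, signed and unsigned ceiling division agree only when both operands are non-negative; with a negative step, an unsigned formula such as (a + b - 1) / b reinterprets the sign bits and yields a nonsensical trip count. A minimal sketch of the signed variant (a hypothetical helper, not the MLIR op itself):

  #include <cassert>
  #include <cstdint>

  // Signed ceiling division: rounds the exact quotient toward +infinity.
  int64_t ceilDivSigned(int64_t a, int64_t b) {
    int64_t q = a / b; // C++ integer division truncates toward zero.
    int64_t r = a % b;
    // Bump the quotient only when there is a remainder and the exact quotient
    // is positive (i.e. the operands have the same sign).
    return q + ((r != 0 && (r > 0) == (b > 0)) ? 1 : 0);
  }

  int main() {
    assert(ceilDivSigned(7, 2) == 4);    // ceil(3.5)
    assert(ceilDivSigned(-10, -3) == 4); // e.g. trip count for lb=10, ub=0, step=-3
    assert(ceilDivSigned(-7, 2) == -3);  // ceil(-3.5)
    return 0;
  }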

mlir/test/Dialect/Async/async-parallel-for-async-dispatch.mlir
26

I added a bit more detail, but still decided not to add checks for all of the generated IR, just the main structure. It would be too many checks, and I find it easier to rely on the MLIR integration tests that run the code.

mlir/test/Integration/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir
6

There is a bug in the reference counting optimization: it incorrectly removes an addRef <-> dropRef pair and destroys the group too early. Will send a fix in one of the next PRs.

ezhulenev updated this revision to Diff 354408.Jun 24 2021, 5:58 PM

Revert accidental change

herhut accepted this revision.Jun 25 2021, 5:19 AM

Thanks!

This revision is now accepted and ready to land.Jun 25 2021, 5:19 AM