Page MenuHomePhabricator

[mlir:Async] Implement recursive async work splitting for scf.parallel operation (async-parallel-for pass)
ClosedPublic

Authored by ezhulenev on Jun 24 2021, 5:27 AM.

Details

Summary

Depends On D104780

Recursive work splitting instead of sequential async tasks submission gives ~20%-30% speedup in microbenchmarks.

Algorithm outline:

  1. Collapse scf.parallel dimensions into a single dimension
  2. Compute the block size for the parallel operations from the 1d problem size
  3. Launch parallel tasks
  4. Each parallel task reconstructs its own bounds in the original multi-dimensional iteration space
  5. Each parallel task computes the original parallel operation body using scf.for loop nest

Diff Detail

Event Timeline

ezhulenev created this revision.Jun 24 2021, 5:27 AM
ezhulenev requested review of this revision.Jun 24 2021, 5:27 AM
ezhulenev edited the summary of this revision. (Show Details)Jun 24 2021, 5:31 AM
ezhulenev added reviewers: herhut, mehdi_amini.

Very neat! Just some nits.

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
129

Longer term this could become an op, as this functionality is needed frequently. Not here, though.

187

Why is this needed?

208

I see the convenience of this but I'd prefer if there was no invisible side-effect on offset.

220

b.create<ConstantIndexOp>(0)

224

I wonder whether ArrayRef would be the better abstraction here.

245

how about max(blockFirstIndex + blockSize, tripCount)?

256

nit: double compute?

260

nit: contains, multiple

315

nit: or

383

Is this needed?

411

nit: 'ConstantIndexOp'

442

nit: the second half

449

nit: use start?

469

nit: use start?

501

ConstantIndexOp

538

ConstantIndexOp

624

Some more ConstantIndexOp opportunities

632

Maybe use ceil_div instead of divup?

639

Out of curiosity: Why do you use a signed ceildiv everywhere? Both operands are known to be positive here and in most of the rest of this file. Wouldn't the unsigned one suffice here?

656

nit replaced.

mlir/test/Dialect/Async/async-parallel-for-async-dispatch.mlir
26

This test is super sparse. Why even capture S and E here, as they are not used? Unless you want to ensure that the right block numbers are passed, but that needs a less sparse test.

mlir/test/Integration/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir
6

Is this broken by this change? Is that because we now pass groups through functions?

ezhulenev updated this revision to Diff 354406.Jun 24 2021, 5:47 PM
ezhulenev marked 22 inline comments as done.

Resolve comments

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
187

Removed. Leftovers that I forgot to cleanup.

208

I added a not that it updates the offset, but don't see any other way to keep uses oneliners and make it more explicit

220

Cool, didn't know it exists. Updated all constants.

224

Updated all callsites.

639

Step can be negative, so I used SignedCeilDiv there for correctness, and everywhere else just for "consistency" :)

mlir/test/Dialect/Async/async-parallel-for-async-dispatch.mlir
26

I added a bit more details, but still decided not to add checks for all of the generated IR, just checked the main structure. It's just too many checks, and I find it easier to rely on the mlir integration tests that run the code.

mlir/test/Integration/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir
6

There is a bug in reference counting optimization, it incorrectly removes a pair of addRef<->dropRef and destroys the group too early. Will send a fix in one of the next PRs.

ezhulenev updated this revision to Diff 354408.Jun 24 2021, 5:58 PM

Revert accidental change

herhut accepted this revision.Jun 25 2021, 5:19 AM

Thanks!

This revision is now accepted and ready to land.Jun 25 2021, 5:19 AM