This is an archive of the discontinued LLVM Phabricator instance.

[mlir][SCF] Add parallel abstraction on tensors.
ClosedPublic

Authored by nicolasvasilache on May 27 2022, 9:22 AM.

Details

Summary

This revision adds scf.foreach_thread and other supporting abstractions
that allow connecting parallel abstractions and tensors.

Discussion is available here.

Diff Detail

Event Timeline

nicolasvasilache requested review of this revision. May 27 2022, 9:22 AM
Herald added a project: Restricted Project. May 27 2022, 9:22 AM

Fix invalid test.

Fix invalid test.

Drop dead code, clarify the notion of "virtual threads of execution".

ftynse requested changes to this revision. May 30 2022, 1:48 AM
ftynse added inline comments.
mlir/include/mlir/Dialect/SCF/SCFOps.td
329

Nit: "function" is confusing here since the body is not isolated from above as a function would be and only has one block while the function would have a CFG.

342

It isn't clear how exactly the result values are formed, given that parallel_insert_slice doesn't actually return anything and neither does the terminator. Maybe add a reference to the terminator documentation here.

451

Can't this be something like llvm::iterator_range<Block::iterator>?
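For illustration, a minimal sketch of the kind of accessor being suggested; the helper name and the block it operates on are placeholders, not the actual code in this patch:

```cpp
#include "mlir/IR/Block.h"
#include "llvm/ADT/iterator_range.h"

using namespace mlir;

// Hypothetical accessor: return a lazy range over the operations of the
// single block instead of materializing them into a vector.
llvm::iterator_range<Block::iterator> getBodyOps(Block &body) {
  return llvm::make_range(body.begin(), body.end());
}
```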

455–462

We can just mark PerformConcurrentlyOp as not having a terminator at all, similarly to how ModuleOp does not. This will remove the need for this operation, and likely simplify some other logic.
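For context, a minimal sketch of the trait being referred to; the helper below is illustrative, not part of the patch:

```cpp
#include "mlir/IR/OpDefinition.h"
#include "mlir/IR/Operation.h"

using namespace mlir;

// An op that declares OpTrait::NoTerminator (as ModuleOp does) may hold a
// single block that ends without any terminator operation.
bool blockMayOmitTerminator(Operation *op) {
  return op->hasTrait<OpTrait::NoTerminator>();
}
```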

468

Nit: let's not commit commented-out code.

516–521

Let's add the namespace prefix systematically.

mlir/lib/Dialect/SCF/SCF.cpp
1059

Please add a test for all user-visible error messages.

1060

"index type index" sounds tautological.

1072

Nit: something like "type mismatch between result " << i << " (" << ... << ") of the terminator" would avoid the awkward "1th result" and "2th result".
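A sketch of the diagnostic style being suggested; the helper name and exact message wording are illustrative, not what was adopted in the patch:

```cpp
#include "mlir/IR/Operation.h"
#include "mlir/Support/LogicalResult.h"

using namespace mlir;

// Interpolating the result index and the two types avoids synthesizing
// ordinals such as "1th result" in the error message.
LogicalResult checkResultType(Operation *op, unsigned i, Type resultType,
                              Type terminatorType) {
  if (resultType != terminatorType)
    return op->emitOpError()
           << "type mismatch between result " << i << " (" << resultType
           << ") and its definition in the terminator (" << terminatorType
           << ")";
  return success();
}
```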

1114–1119

Can't we just put this part into a custom directive and use declarative assembly for the rest?

1139

Please create the block with the builder; that's why it is provided.
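A minimal sketch of what is being asked for, using OpBuilder::createBlock; the region, argument types, and location below are placeholders:

```cpp
#include "mlir/IR/Builders.h"
#include "llvm/ADT/SmallVector.h"

using namespace mlir;

// Let the provided builder create and insert the entry block rather than
// allocating a Block manually; this keeps insertion points consistent and
// notifies any builder listeners.
Block *createEntryBlock(OpBuilder &builder, Region &region,
                        TypeRange argTypes, Location loc) {
  SmallVector<Location> argLocs(argTypes.size(), loc);
  return builder.createBlock(&region, region.begin(), argTypes, argLocs);
}
```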

1273–1286

Can't this just return getBody()->without_terminator()?

This revision now requires changes to proceed. May 30 2022, 1:48 AM
nicolasvasilache marked 13 inline comments as done.

Address review comments.

nicolasvasilache marked an inline comment as done. May 30 2022, 8:38 AM
nicolasvasilache added inline comments.
mlir/lib/Dialect/SCF/SCF.cpp
1059

Not actually producible; dropped it.

1114–1119

I could not find a way to handle the "induction variable"-like bbargs in the declarative assembly format. Without that, this would just move almost all of the code into the custom hook without much benefit.

Added a TODO as it would indeed be a nice improvement to use the declarative format.

ftynse accepted this revision. May 30 2022, 8:41 AM
This revision is now accepted and ready to land. May 30 2022, 8:41 AM
This revision was automatically updated to reflect the committed changes.
christopherbate added inline comments.
mlir/lib/Dialect/SCF/SCF.cpp
1272

@nicolasvasilache Currently this operation assumes that there will be a yielded value for each "parallel_insert_slice" operation. Would it make sense to change this so that there is a single result for each unique "dest" value across all the parallel_insert_slice operations? For example, a distributed "copy" operation could have each thread copying non-contiguous portions of the source into a single destination value; currently, that would yield a separate result for each portion.
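A hypothetical sketch of the proposal in C++; the include path, the op name (ParallelInsertSliceOp), and the accessor (getDest) are assumed spellings for illustration, not necessarily those in this patch:

```cpp
#include "mlir/Dialect/SCF/SCF.h"
#include "llvm/ADT/SetVector.h"
#include "llvm/ADT/SmallVector.h"

using namespace mlir;

// Hypothetical helper: compute one result type per *unique* destination
// tensor, rather than one per parallel_insert_slice in the terminator body.
static SmallVector<Type> getDedupedResultTypes(Block &terminatorBody) {
  llvm::SetVector<Value> uniqueDests;
  for (Operation &op : terminatorBody)
    if (auto insert = dyn_cast<scf::ParallelInsertSliceOp>(&op))
      uniqueDests.insert(insert.getDest());
  SmallVector<Type> resultTypes;
  for (Value dest : uniqueDests)
    resultTypes.push_back(dest.getType());
  return resultTypes;
}
```

Under such a scheme, the distributed copy described above (many threads inserting disjoint rows into one destination) would produce a single result for the shared destination.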

mlir/lib/Dialect/SCF/SCF.cpp
1272

This is an interesting take and could make sense, thanks for proposing!
I can certainly see the case of multiple piecewise concurrent updates to the same result.
Could you elaborate a bit on your use cases?

mlir/lib/Dialect/SCF/SCF.cpp
1272

Currently I'm working on an operation that represents a 2D view into the source tensor where each row is contiguous but adjacent rows are not contiguous. A subset of the view is copied row-by-row into a destination tensor. I lower the copy to scf.foreach_thread, where the number of threads is smaller than the number of rows by some factor. At this time the number of threads and the size of the copy are known statically, so each thread should copy N rows