This is an archive of the discontinued LLVM Phabricator instance.

[mlir][vector] Add operations used for Vector distribution
Closed, Public

Authored by ThomasRaoux on Apr 13 2022, 12:16 PM.

Details

Summary

Add the vector op warp_execute_on_lane_0, which will be used to do incremental
vector distribution in order to target warp-level vector programming for
architectures with a GPU-like SIMT programming model.
The idea behind the op is discussed further on Discourse:
https://discourse.llvm.org/t/vector-vector-distribution-large-vector-to-small-vector/1983/23
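
For reference, a minimal sketch of the op (following the form that eventually landed; syntax details evolved during review). %laneid is an index SSA value holding the current lane's id:

// All 32 lanes of the warp execute the op itself; only lane 0 runs the body.
vector.warp_execute_on_lane_0(%laneid)[32] {
  // Sequential code, executed by lane 0 on behalf of the whole warp.
  "some_op"() : () -> ()
}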

Diff Detail

Event Timeline

ThomasRaoux created this revision. Apr 13 2022, 12:16 PM
Herald added a project: Restricted Project.
ThomasRaoux requested review of this revision. Apr 13 2022, 12:16 PM
antiagainst requested changes to this revision. Apr 14 2022, 4:29 PM
antiagainst added inline comments.
mlir/include/mlir/Dialect/Vector/IR/VectorOps.td
2545

This summary doesn't tell much additional information. What about "terminates and yields values from vector regions"?

2547
2564

Also here. What about "Executes operations in the associated region on lane #0 of a GPU SIMT warp"?

2565

Make it clear at the beginning that this op is for a GPU SIMT-like execution model?

2574

The description here is a bit hard for me to parse. Can we reword it to be clearer? What is the mental model for vector ops before/after vector.warp_execute_on_lane_0? (I assume a single lane in the SIMT warp?) What's the mental model for the op arguments/results? (Still the data range accessed by a single lane in the SIMT warp?) What's the mental model for the op region arguments/yields? (I think the full data range for all threads in the SIMT warp?) It would be nice if we could spell that out. (Maybe the current wording is trying to express that, but I find it hard to parse and grok.)

2591

Hmm, this is also hard for me to parse. You mean "for captured values _it's_ only available to lane 0..."?

2598

Bikeshed: I'd think something like vector.warp_execute_on_lane_0 @ 32 [%laneid] reads better: we execute on lane 0 out of 32, and the current lane's index is supplied via an indexing operator.
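
To make the comparison concrete, a sketch of the two spellings (the second is the hypothetical alternative proposed here, which was not adopted):

// Form in the patch: lane id as an operand, warp size in square brackets.
vector.warp_execute_on_lane_0(%laneid)[32] { ... }
// Proposed alternative: warp size after @, lane index via an indexing operator.
vector.warp_execute_on_lane_0 @ 32 [%laneid] { ... }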

mlir/lib/Dialect/Vector/IR/VectorOps.cpp
4716

Hmm, the terms here could be improved IMO. What about calling them "warp data vector" and "lane data vector"?

... for lane/warp data vector operands ...

4762

Any chance to add some logic to verify captured values? Like only allowing capturing constants (or whatever is needed) for now. Better to be conservative at the beginning to avoid surprises; we can relax later. Going the other way is hard.

mlir/test/Dialect/Vector/ops.mlir
774

Add tests for negative cases?

This revision now requires changes to proceed. Apr 14 2022, 4:29 PM

address review comments

ThomasRaoux marked 2 inline comments as done. Apr 14 2022, 7:49 PM
ThomasRaoux added inline comments.
mlir/include/mlir/Dialect/Vector/IR/VectorOps.td
2547

good point, rewrote it following this example.

2565

Added quite a few details to the description; hopefully it is clearer.

2574

I tried to add more details, but it is not easy to explain clearly, which is why I was hoping the lowering examples would make the semantics of the op easier to understand.

2591

Kind of; basically the op is meant to have the semantics of:

if (laneid == 0) {
  ...
}
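
A sketch of what that semantic could look like lowered to an explicit branch (assuming the arith and scf dialects; the actual lowering patterns came in follow-up patches):

%c0 = arith.constant 0 : index
%is_lane_0 = arith.cmpi eq, %laneid, %c0 : index
scf.if %is_lane_0 {
  // Body runs on lane 0 only; all other lanes skip it.
  "some_op"() : () -> ()
}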
2598

Because it is not really meant to represent indexing, I'd prefer keeping the current representation. I would like it to look more like an if (laneid == 0) kind of op.

mlir/lib/Dialect/Vector/IR/VectorOps.cpp
4716

I was thinking about it, but I think this gives the wrong impression. When using this op there isn't warp-level data and thread-level data; everything is already in SIMT mode, so all the data is thread-level. What the operands expose is that the data from all the lanes may be transferred to lane 0 so that everything can be computed on one thread.
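
A sketch of that reading ("compute" is a placeholder op; the operand/result syntax follows the landed op):

// Each lane passes its private vector<1xf32>; inside the region, lane 0 sees
// the vector<32xf32> gathered from all 32 lanes. The yielded vector<32xf32>
// is distributed back as one vector<1xf32> per lane.
%r = vector.warp_execute_on_lane_0(%laneid)[32]
    args(%per_lane : vector<1xf32>) -> (vector<1xf32>) {
^bb0(%full : vector<32xf32>):
  %sum = "compute"(%full) : (vector<32xf32>) -> vector<32xf32>
  vector.yield %sum : vector<32xf32>
}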

4762

I'm not sure there is an advantage, as it would need to be relaxed in the next commit. This is needed to support any kind of basic transformation.

mlir/test/Dialect/Vector/ops.mlir
774

Oops, forgot to upload those. Thanks for catching this.

antiagainst accepted this revision. Apr 14 2022, 8:11 PM

I know this was already reviewed by Nicolas previously, so LGTM.

mlir/include/mlir/Dialect/Vector/IR/VectorOps.td
2545

Nit: Terminates...

This revision is now accepted and ready to land. Apr 14 2022, 8:11 PM
This revision was landed with ongoing or failed builds. Apr 14 2022, 8:53 PM
This revision was automatically updated to reflect the committed changes.
mlir/include/mlir/Dialect/Vector/IR/VectorOps.td
2573

Interestingly, this is also true for more general SPMD: one may want to write any distributed program on vectors this way:

if (MPI_Rank == 0) {
  ...
}

Can we make the doc a little less GPU-specific?
No need to go into full MPI for now, but this is more general than "just GPU".

2580

I would probably write all the docs in terms of threadId / numThreads and specialize to laneid / warpSize in the particular (important) case of GPUs.

2587

In the future, could the properties of the distribution be described by extra attributes (e.g. an affine map) and operands?
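
Purely as illustration, such a form might look like the following hypothetical sketch (not part of this patch; the distribution_map attribute name and placement are invented):

%r = vector.warp_execute_on_lane_0(%laneid)[32]
    {distribution_map = affine_map<(d0) -> (d0)>} -> (vector<2xf32>) {
  %v = "produce"() : () -> vector<64xf32>
  vector.yield %v : vector<64xf32>
}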

2598

Give an example of block-cyclic distribution here?
E.g. vector<64xf32> on 32 threads maps vector[0], vector[1] to thread 0.
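
A sketch of that suggested example (assuming the blocked mapping described above, with "produce" as a placeholder op):

// vector<64xf32> distributed over 32 lanes: lane 0 owns elements [0, 1],
// lane 1 owns [2, 3], ..., lane 31 owns [62, 63], each as a vector<2xf32>.
%d = vector.warp_execute_on_lane_0(%laneid)[32] -> (vector<2xf32>) {
  %v = "produce"() : () -> vector<64xf32>
  vector.yield %v : vector<64xf32>
}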

2608

As I mentioned in a past comment (https://github.com/google/iree-llvm-sandbox/pull/147), I think readability would still benefit from clear documentation of which piece of code runs in parallel and which runs sequentially.

2626

Here, clearly spelling out distributed vs sequential regions and arguments would be helpful.

It feels a bit counter-intuitive to say that the op receives a vector<4xi32> as an operand and internally has access to a vector<128xi32>.
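
For concreteness, the case being discussed looks roughly like this sketch ("compute" is a placeholder op):

// The operand is this lane's vector<4xi32> slice; the region argument is the
// warp-wide vector<128xi32> (32 lanes x 4 elements each). Likewise, the
// yielded vector<32xf32> comes back as one vector<1xf32> per lane.
%0 = vector.warp_execute_on_lane_0(%laneid)[32]
    args(%v : vector<4xi32>) -> (vector<1xf32>) {
^bb0(%arg : vector<128xi32>):
  %1 = "compute"(%arg) : (vector<128xi32>) -> vector<32xf32>
  vector.yield %1 : vector<32xf32>
}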

2635

This example is probably the first time the reader really understands what is going on.
With an extra preliminary example that just explains what runs in parallel and what runs sequentially in each part of the code (without any detail on packing/unpacking of vector values), I think this would be easier to read for newcomers.

mlir/lib/Dialect/Vector/IR/VectorOps.cpp
4624

Hmm, parseRegionArgument? I would have thought a simple parseOperand of IndexType would be in order here?

4708

I am wondering whether we should go into such detail in the verification, or whether we should just take one type, generate the second type, and compare for equality.
I expect this will quickly grow in complexity in the future while not giving significantly more actionable information than "expected sequential type V1 to match distributed type V2 (but got V3)".

mlir/test/Dialect/Vector/invalid.mlir
1534

If we go for the simple generate + equality approach, this would be just 1 extra test for all cases.
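
A sketch of one such negative test (the diagnostic text is a placeholder, not the verifier's actual message):

func.func @warp_wrong_yield_type(%laneid: index) {
  // 32 lanes x vector<2xf32> requires yielding vector<64xf32>, so the
  // vector<128xf32> below should be rejected.
  // expected-error@+1 {{incompatible distributed and yielded vector types}}
  %0 = vector.warp_execute_on_lane_0(%laneid)[32] -> (vector<2xf32>) {
    %v = "produce"() : () -> vector<128xf32>
    vector.yield %v : vector<128xf32>
  }
  return
}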

Thanks for the review @nicolasvasilache, and sorry for the delay as I was on vacation. I'll send another patch to address your comments.

ThomasRaoux added inline comments. May 9 2022, 6:12 AM
mlir/lib/Dialect/Vector/IR/VectorOps.cpp
4708

I agree with that; however, because the distribution is currently implicit, we would need to first compute the implicit distribution to infer the distributed type. Computing the implicit distribution would require the same set of checks, so it wouldn't simplify the code right now. Once we make the distribution explicit, we can change this code to verify the distribution map and directly compute the inferred distributed type.