Add a vector op, warp_execute_on_lane_0, that will be used to do incremental vector distribution in order to target warp-level vector programming for architectures with a GPU-like SIMT programming model.
The idea behind the op is discussed further on Discourse:
https://discourse.llvm.org/t/vector-vector-distribution-large-vector-to-small-vector/1983/23
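For a concrete picture of the op, here is a minimal sketch of the intended usage (the textual syntax is still being bikeshedded in the review below, and "test.sequential_compute" is a hypothetical placeholder op):

```mlir
// Execute the region on lane 0 of a warp of 32 lanes. Each lane passes in a
// per-lane slice (vector<4xi32>); inside the region, lane 0 operates on the
// full warp-level vector (vector<128xi32> = 32 lanes x 4 elements), and the
// yielded warp-level result is distributed back to the lanes.
%r = vector.warp_execute_on_lane_0(%laneid)[32]
    args(%v : vector<4xi32>) -> (vector<4xi32>) {
^bb0(%arg : vector<128xi32>):
  %full = "test.sequential_compute"(%arg) : (vector<128xi32>) -> (vector<128xi32>)
  vector.yield %full : vector<128xi32>
}
```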
mlir/include/mlir/Dialect/Vector/IR/VectorOps.td

- Line 2545: This summary doesn't add much information. What about "terminates and yields values from vector regions"?
- Line 2547: Can we improve the doc here? Something like scf.yield? https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/SCF/SCFOps.td#L695
- Line 2564: Also here. What about "Executes operations in the associated region on lane #0 of a GPU SIMT warp"?
- Line 2565: Make clear at the beginning that this op is for a GPU SIMT-like execution model?
- Line 2574: The description here is a bit hard for me to parse. Can we reword this to be clearer? What is the mental model for vector ops before/after vector.warp_execute_on_lane_0? (I assume a single lane in the SIMT warp.) What's the mental model for the op arguments/results? (Still the data range accessed by a single lane in the SIMT warp?) What's the mental model for the op region arguments/yields? (I think the full data range for all threads in the SIMT warp?) It would be nice if we could spell that out. (Maybe the current wording is trying to express that, but I find it hard to parse and grok.)
- Line 2591: Hmm, this is also hard for me to parse. You mean "for captured values _it's_ only available to lane 0"?
- Line 2598: Bikeshed: I'd think something like vector.warp_execute_on_lane_0 @ 32 [%laneid] reads better: we execute on lane 0 out of 32, and the current lane's index is given with an indexing operator.

mlir/lib/Dialect/Vector/IR/VectorOps.cpp

- Line 4716: Hmm, the terms here can be improved IMO. What about calling them "warp data vector" and "lane data vector", as in "... for lane/warp data vector operands ..."?
- Line 4762: Any chance to add some logic to verify captured values? Like only allowing capturing constants (or whatever is needed) for now. Better to be conservative at the beginning to avoid surprises; we can relax later. It's hard to go the reverse way.

mlir/test/Dialect/Vector/ops.mlir

- Line 774: Add tests for negative cases?
mlir/include/mlir/Dialect/Vector/IR/VectorOps.td

- Line 2547: Good point, rewrote it following this example.
- Line 2565: Added quite a few details to the description; hopefully it is clearer now.
- Line 2574: I tried to add more details, but it is not easy to explain clearly, which is why I was hoping the lowering examples would make the semantics of the op easier to understand.
- Line 2591: Kind of; basically the op is meant to have the semantics of if (laneid == 0) { ... } (see the sketch after this list).
- Line 2598: Because it is not really meant to represent indexing, I'd prefer keeping the current representation. I would like it to look more like an if (laneid == 0) kind of op.
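To illustrate that if (laneid == 0) reading, here is a rough lowering sketch (the guarded form follows the author's description above; the ops that exchange data between lanes, e.g. through shared memory, are elided):

```mlir
// Conceptually, the region body is guarded so that only lane 0 executes it.
%c0 = arith.constant 0 : index
%is_lane0 = arith.cmpi eq, %laneid, %c0 : index
scf.if %is_lane0 {
  // The body runs sequentially on lane 0 with access to the full warp-level
  // vectors; transfers of operands/results to and from the other lanes are
  // elided here.
}
```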
mlir/lib/Dialect/Vector/IR/VectorOps.cpp

- Line 4716: I was thinking about it, but I think this gives the wrong impression. When using this op there isn't warp-level data and thread-level data; everything is already in SIMT mode, so all the data is thread-level. What the operands expose is that the data from all the lanes may be transferred to lane 0 so that everything can be computed on one thread.
- Line 4762: I'm not sure there is an advantage, as it would need to be relaxed in the next commit. This is needed to support any kind of basic transformation.
mlir/test/Dialect/Vector/ops.mlir

- Line 774: Oops, I forgot to upload those. Thanks for catching this! (One such case is sketched below.)
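For reference, a negative case along the lines requested above might look like the following (the shape mismatch is the point; the exact diagnostic text the verifier emits is not shown here):

```mlir
func.func @warp_wrong_yield_size(%laneid: index) {
  // Invalid: a yielded vector<64xi32> cannot distribute to vector<4xi32>
  // across 32 lanes (4 * 32 = 128, not 64), so the verifier should reject it.
  %0 = vector.warp_execute_on_lane_0(%laneid)[32] -> (vector<4xi32>) {
    %1 = "test.dummy_compute"() : () -> (vector<64xi32>)
    vector.yield %1 : vector<64xi32>
  }
  return
}
```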
I know this was already reviewed by Nicolas previously, so LGTM.
mlir/include/mlir/Dialect/Vector/IR/VectorOps.td

- Line 2545: Nit: Terminates...
mlir/include/mlir/Dialect/Vector/IR/VectorOps.td

- Line 2573: Interestingly, this is also true for more general SPMD: you could want to write any distributed program on vectors this way: if (MPI_Rank == 0) { ... }. Can we make the doc a little less GPU-specific?
- Line 2580: I would probably write all the doc in terms of threadId / numThreads and specialize to laneid / warpSize in the particular (important) case of GPUs.
- Line 2587: In the future, could the properties of the distribution be described by extra attributes (e.g. an affine map) and operands?
- Line 2598: Give an example of block-cyclic distribution here?
- Line 2608: As I mentioned in a comment in the past (https://github.com/google/iree-llvm-sandbox/pull/147), I think readability would still benefit from clear documentation of which pieces of code run in parallel and which run sequentially.
- Line 2626: Here, clearly spelling out distributed vs. sequential regions and arguments would be helpful. It feels a bit counter-intuitive to say that the op receives a vector<4xi32> as an operand and internally has access to a vector<128xi32> (see the sketch after this list).
- Line 2635: This example is probably the first time the reader really understands what is going on.
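To make that distributed/sequential distinction explicit, one way to annotate the types (the element-to-lane mapping is implicit in this version of the patch, so the ownership shown in the comments is only one plausible choice):

```mlir
// "Distributed" view outside the region: each of the 32 lanes holds a
// vector<4xi32>. "Sequential" view inside the region: lane 0 sees the full
// vector<128xi32>, with lane i conceptually owning elements [4*i, 4*i + 4).
%r = vector.warp_execute_on_lane_0(%laneid)[32]
    args(%per_lane : vector<4xi32>) -> (vector<4xi32>) {
^bb0(%full : vector<128xi32>):
  vector.yield %full : vector<128xi32>
}
```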
mlir/lib/Dialect/Vector/IR/VectorOps.cpp

- Line 4624: Hmm, parseRegionArgument? I would have thought a simple parseOperand of IndexType would be in order here?
- Line 4708: I am wondering whether we should go into such detail in the verification, or whether we should just take one type, generate the second type, and compare for equality?

mlir/test/Dialect/Vector/invalid.mlir

- Line 1534: If we go for the simple generate-and-compare approach, this would be just one extra test for all cases.
Thanks for the review @nicolasvasilache, and sorry for the delay as I was on vacation. I'll send another patch to address your comments.
mlir/lib/Dialect/Vector/IR/VectorOps.cpp

- Line 4708: I agree with that; however, because the distribution is currently implicit, we would first need to compute the implicit distribution in order to infer the distributed type, and computing it would require the same set of checks, so it wouldn't simplify the code right now. Once we make the distribution explicit, we can change this code to verify the distribution map and directly compute the inferred distributed type. (The implicit rule is sketched below.)
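For context, a sketch of the implicit relationship between the types on the two sides of the op boundary (the broadcast case for warp-uniform values matches the op as it exists upstream today; treat both cases as illustrative of why a single inference rule isn't enough):

```mlir
// Distribution case: with warp size 32, a sequential vector<128xf32> yielded
// from the region implies a distributed vector<4xf32> result (128 / 32 = 4).
%a = vector.warp_execute_on_lane_0(%laneid)[32] -> (vector<4xf32>) {
  %v = "test.def"() : () -> (vector<128xf32>)
  vector.yield %v : vector<128xf32>
}
// Broadcast case: a warp-uniform value may keep the same type on both sides,
// so "divide the dimension by the warp size" alone does not cover everything.
%b = vector.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>) {
  %v = "test.uniform_def"() : () -> (vector<1xf32>)
  vector.yield %v : vector<1xf32>
}
```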