This is the first of several step to support distributing large vectors. This adds instructions extract_map and insert_map that allow us to do incremental lowering. Right now the transformation only apply to simple pointwise operation with a vector size matching the dimension of the IDs used to distribute the vector.
This can be used to distribute large vectors to loops or SPMD.
Details
Diff Detail
Event Timeline
In general the semantics of this op isn't clear to me. It seems that we "hide" the SIMT aspect by "faking" a scalar computation.
Have you looked into structuring this in a region instead?
func @distribute_vector_add(%A: vector<32xf32>, %B: vector<32xf32>) -> vector<32xf32> { %0 = addf %A, %B : vector<32xf32> return %0: vector<32xf32> }
->
func @distribute_vector_add(%A: vector<32xf32>, %B: vector<32xf32>) -> vector<32xf32> { %C = simt.execute %A to %Aelt : vector<1xf32> affine_map<()[s0] -> (s0)> %B to %Belt : vector<1xf32> affine_map<()[s0] -> (s0)> { %0 = addf %A, %B : vector<1xf32> yield %0 } return %C: vector<32xf32> }
I haven't tried using a region. The downside of using region is that it makes the canonicalization harder. This is just the first step and those operations are meant to be transient ops that will eventually get combined with memory accesses. What do you think?
Using SIMT regions would be like creating parallel loop around the code. Here I'm trying distribute the vector on a specific ID.
That being said, @nicolasvasilache might have a better answer since he had the original idea of having such operations to allow incremental lowering for vector distribution.
Thanks for starting this @ThomasRaoux.
Parallel semantics are orthogonal to the mechanisms underlying the composition of patterns that propagate distribution of SSA values to address computations.
In particular, the patterns will also be useful to break down big vectors into smaller ones going through a loop + memory.
These will be run and tested on single-thread scalar CPU code.
Users of this pattern will include transformations that create ops with regions with parallel semantics.
I envision this is the point when the question that @mehdi_amini raises will come into play.
I would imagine reusing the abstraction that makes sense at the GPU-level for this (and which should be more general and e.g. support control-flow, divergence, synchronizations etc)?
mlir/include/mlir/Dialect/Vector/VectorOps.td | ||
---|---|---|
462 | I would rename dimension to multiplicity. | |
467 | vector size / multiplicity | |
468 | I'd rephrase to a Value such as a loop induction variable or an SPMD id. | |
471 | s/merged with/folded into ? You may also want to have something along the lines of Similarly to vector.tuple_get, this operation is used for progressive lowering and should be folded away before converting to LLVM. | |
477 | How about a syntax like: vector.extract_map optional_map %v[%id:32] So that in the future we could have: Or even, when it makes sense: | |
680 | vector size * multiplicity | |
681 | Same as above re value, induction variables and SPMD id. | |
683 | Same as above. | |
mlir/include/mlir/Dialect/Vector/VectorTransforms.h | ||
175 | Why not just take a Value here ? If it is just for testing purposes, we can write tests + passes that expect a certain structure and takes advantage of that. | |
201 | This will create many patterns when instantiated on many possible pointwise vector ops. Not necessary to do it in this PR but let's do it in the immediate next PR. | |
201 | I was thinking of structuring this differently:
Triggering the transformation can be done in a few ways:
The advantage is we only need to implement 2 canonicalization patterns (one for pull extract up through an op, the other for pushing insert down through an op) instead of 3 (insert + extract + rewrite op as "extract-op-insert"). | |
mlir/lib/Dialect/Vector/VectorTransforms.cpp | ||
2424 | Can we turn this into a canonicalization pattern on ExtractMapOp? | |
2439 | I don't think this is necessary, see my general comment above. | |
2456 | You shouldn't have this called by the rewrite pattern. | |
mlir/test/lib/Transforms/TestVectorTransforms.cpp | ||
137 | I would restructure this test pass to just take the first index of the enclosing function and use that as the value to distribute on. I would use a hardcoded custom op "test_distribution_value" that just takes an index. In the future, we can plug that to a loop induction variable and write an integration test. |
Maybe remove SPMD from the title and commit message as this is intended to also be useful for sequential vector code?
mlir/include/mlir/Dialect/Vector/VectorOps.td | ||
---|---|---|
460 | This name is confusing cause some of the terms are overloaded. My first instinct when seeing map was to expect an affine_map, but that is not the case. And there is no "extraction" happening here. How about Vector_DistributeSliceOp | |
672 | Similar to above here, but this is more complicated. Each distributed ID is inserting a slice of the vector. Are the semantics of the operation that this is a "insert and broadcast", i.e. all distributed IDs have access to the resulting inserted value? I feel like this has some implicit synchronization behavior. Are all the threads "inserting" here expected to be synchronized at this point? |
mlir/include/mlir/Dialect/Vector/VectorOps.td | ||
---|---|---|
468 | I added examples and added more details based on Nicolas comments. Let me know if you think this is still hard to understand. | |
477 | I changed the syntax as suggested. I'm planning to add the optional map in following patches. | |
mlir/include/mlir/Dialect/Vector/VectorTransforms.h | ||
175 | I was thinking we will need to create Value on the fly at some point which is why I had it that way. But I'm not sure this will be needed right now so I changed it to just pass a Value for now. | |
201 | I went for solution 1. Let me know if this is different than what you had in mind. | |
mlir/lib/Dialect/Vector/VectorTransforms.cpp | ||
2439 | Based on your general comment, the pattern is gone and I only left the transformation that can be called directly. | |
2456 | Removed this pattern based on other comments. | |
mlir/test/lib/Transforms/TestVectorTransforms.cpp | ||
137 | Ok, changed it to do that. |
mlir/include/mlir/Dialect/Vector/VectorOps.td | ||
---|---|---|
460 | The goal is to have an affine_map indeed. This is the just the first step and I will add an affine_map operand as well. This doesn't really distribute so I'm not sure about calling it distributeSliceOp. Maybe should be called getSliceOp or something similar? @nicolasvasilache, any thoughts on the best name for this op? | |
672 | This doesn't imply anything on synchronization. I think I made it confusing by mentioning SPMD but this is not directly related. Whether or not the different elements are synchronized depends on the semantic of the program. |
mlir/include/mlir/Dialect/Vector/VectorOps.td | ||
---|---|---|
672 | This makes sense. Thanks for the clarification. | |
mlir/lib/Dialect/Vector/VectorTransforms.cpp | ||
2453 | For such methods it is better to just return extract and newVec and let the caller deal with the operations. Similarly it is better to pass in the OpBuilder as an argument. If used within the DialectConversion pass the ConversionPatternRewriter needs to track operations added/deleted so that it can undo on failure. | |
mlir/test/lib/Transforms/TestVectorTransforms.cpp | ||
138 | I think if there were an Operation interface for all relevant ops, then this could be done as a pattern which would be highly preferable. |
mlir/lib/Dialect/Vector/VectorTransforms.cpp | ||
---|---|---|
2453 | Sure, I can change that. I'll wait for your second comment to be resolved first as this will work differently based on whether I use a pattern or do direct transformation. | |
mlir/test/lib/Transforms/TestVectorTransforms.cpp | ||
138 | The reason I'm not using a Pattern is not because there no Operation interface but because the rewrite pattern would run into infinite loop since this transformation leaves the original instruction unchanged and just adds extract_map/insert_map that then get propagated by the canonicalization operations. |
Thanks for the feedback, I think I addressed everything. Please take another look.
mlir/lib/Dialect/Vector/VectorTransforms.cpp | ||
---|---|---|
2453 | I changed the signature to return the extract/insert and do the replace outside the function. |
Looks fine to me. Please wait for Aart or Nicolas to approve.
mlir/include/mlir/Dialect/Vector/VectorTransforms.h | ||
---|---|---|
186 | Last comment here (sorry for the multiple rounds); return by reference is weird. In Linalg this was done by creating a struct like struct Foo { ExtractMapOp extract; InsertMapOp insert; } and return type of the method being Optional<Foo> . I am not aware of the convention here, but I find this better. Return by reference typically make sense for containers like vector, set, map, etc. to avoid a copy (though C++ semantics now is to do a move on return of such objects rather than a copy). |
Remove return by reference.
mlir/include/mlir/Dialect/Vector/VectorTransforms.h | ||
---|---|---|
186 | Changed it. |
LGTM, if a better name pops to mind please do propose.
mlir/include/mlir/Dialect/Vector/VectorOps.td | ||
---|---|---|
460 | I agree the name is not good but I can't think of a better one. So in the current form, this overlaps with insert/extract_element. In the absence of a better proposal I'd stay with this name for now but I'd very much like a better name.. | |
mlir/include/mlir/Dialect/Vector/VectorTransforms.h | ||
181 | s/going from 0 to dimension/taking *all* values in [0 .. multiplicity - 1] (e.g. loop induction variable or SPMD id) ? Please us multiplicity here and below for consistency. | |
mlir/test/lib/Transforms/TestVectorTransforms.cpp | ||
138 | The pattern is the one that propagates the insert/extract up and down and will benefit from the Interfaces. |
I find these op semantics really hard to understand right now, I think the doc needs a significant amount of improvement.
I am very unconvinced by this, but this can be just a lack of understanding, can you please bring this up to Discourse and complete a design discussion on this topic?
mlir/include/mlir/Dialect/Vector/VectorOps.td | ||
---|---|---|
503 | This IR and this transformation seems deeply broken to me at the moment. |
This op's custom print is missing a 32 somewhere. Isn't %C of type vector 32xf32? Not having that goes against IR printing guidelines.
mlir/include/mlir/Dialect/Vector/VectorOps.td | ||
---|---|---|
461 | Why are these ops called extract_map and insert_map?! They aren't extracting or inserting maps nor are they extracting *and* mapping. The phrase "maps a given multiplicity of the vector ..." isn't really meaningful to me! Why is id used? Did you mean position or index? Both the naming and the doc description appear to be completely unclear and perhaps disconnected from the op's semantics. |
Sorry for the delay. I'm starting a Discourse discussion and can update the doc based on it.
This name is confusing cause some of the terms are overloaded. My first instinct when seeing map was to expect an affine_map, but that is not the case. And there is no "extraction" happening here. How about Vector_DistributeSliceOp