This is an archive of the discontinued LLVM Phabricator instance.

[mlir][vector] First step of vector distribution transformation
ClosedPublic

Authored by ThomasRaoux on Sep 25 2020, 2:49 PM.

Details

Summary

This is the first of several step to support distributing large vectors. This adds instructions extract_map and insert_map that allow us to do incremental lowering. Right now the transformation only apply to simple pointwise operation with a vector size matching the dimension of the IDs used to distribute the vector.
This can be used to distribute large vectors to loops or SPMD.

Diff Detail

Event Timeline

ThomasRaoux created this revision.Sep 25 2020, 2:49 PM
Herald added a project: Restricted Project. · View Herald Transcript
ThomasRaoux requested review of this revision.Sep 25 2020, 2:49 PM

In general the semantics of this op isn't clear to me. It seems that we "hide" the SIMT aspect by "faking" a scalar computation.
Have you looked into structuring this in a region instead?

func @distribute_vector_add(%A: vector<32xf32>, %B: vector<32xf32>) -> vector<32xf32> {
  %0 = addf %A, %B : vector<32xf32>
  return %0: vector<32xf32>
}

->

func @distribute_vector_add(%A: vector<32xf32>, %B: vector<32xf32>) -> vector<32xf32> {
  %C = simt.execute %A to %Aelt : vector<1xf32> affine_map<()[s0] -> (s0)> 
                                   %B to %Belt : vector<1xf32> affine_map<()[s0] -> (s0)> {
    %0 = addf %A, %B : vector<1xf32>
    yield %0
  }
  return %C: vector<32xf32>
}

In general the semantics of this op isn't clear to me. It seems that we "hide" the SIMT aspect by "faking" a scalar computation.
Have you looked into structuring this in a region instead?

func @distribute_vector_add(%A: vector<32xf32>, %B: vector<32xf32>) -> vector<32xf32> {
  %0 = addf %A, %B : vector<32xf32>
  return %0: vector<32xf32>
}

->

func @distribute_vector_add(%A: vector<32xf32>, %B: vector<32xf32>) -> vector<32xf32> {
  %C = simt.execute %A to %Aelt : vector<1xf32> affine_map<()[s0] -> (s0)> 
                                   %B to %Belt : vector<1xf32> affine_map<()[s0] -> (s0)> {
    %0 = addf %A, %B : vector<1xf32>
    yield %0
  }
  return %C: vector<32xf32>
}

I haven't tried using a region. The downside of using region is that it makes the canonicalization harder. This is just the first step and those operations are meant to be transient ops that will eventually get combined with memory accesses. What do you think?

Using SIMT regions would be like creating parallel loop around the code. Here I'm trying distribute the vector on a specific ID.

That being said, @nicolasvasilache might have a better answer since he had the original idea of having such operations to allow incremental lowering for vector distribution.

nicolasvasilache requested changes to this revision.Sep 28 2020, 1:35 AM

Thanks for starting this @ThomasRaoux.

In general the semantics of this op isn't clear to me. It seems that we "hide" the SIMT aspect by "faking" a scalar computation.
Have you looked into structuring this in a region instead?

func @distribute_vector_add(%A: vector<32xf32>, %B: vector<32xf32>) -> vector<32xf32> {
  %0 = addf %A, %B : vector<32xf32>
  return %0: vector<32xf32>
}

->

func @distribute_vector_add(%A: vector<32xf32>, %B: vector<32xf32>) -> vector<32xf32> {
  %C = simt.execute %A to %Aelt : vector<1xf32> affine_map<()[s0] -> (s0)> 
                                   %B to %Belt : vector<1xf32> affine_map<()[s0] -> (s0)> {
    %0 = addf %A, %B : vector<1xf32>
    yield %0
  }
  return %C: vector<32xf32>
}

Parallel semantics are orthogonal to the mechanisms underlying the composition of patterns that propagate distribution of SSA values to address computations.
In particular, the patterns will also be useful to break down big vectors into smaller ones going through a loop + memory.
These will be run and tested on single-thread scalar CPU code.

Users of this pattern will include transformations that create ops with regions with parallel semantics.
I envision this is the point when the question that @mehdi_amini raises will come into play.
I would imagine reusing the abstraction that makes sense at the GPU-level for this (and which should be more general and e.g. support control-flow, divergence, synchronizations etc)?

mlir/include/mlir/Dialect/Vector/VectorOps.td
462

I would rename dimension to multiplicity.
In future revisions, I expect this to become variadic in either id, dimension or both.

467

vector size / multiplicity

468

I'd rephrase to a Value such as a loop induction variable or an SPMD id.

471

s/merged with/folded into ?

You may also want to have something along the lines of Similarly to vector.tuple_get, this operation is used for progressive lowering and should be folded away before converting to LLVM.

477

How about a syntax like: vector.extract_map optional_map %v[%id:32]

So that in the future we could have:
vector.extract_map optional_map %v[%id0:32][%id1:16]

Or even, when it makes sense:
vector.extract_map optional_map %v[%id0:%sz0][%id1:%sz1]

708

vector size * multiplicity

709

Same as above re value, induction variables and SPMD id.

711

Same as above.

mlir/include/mlir/Dialect/Vector/VectorTransforms.h
175

Why not just take a Value here ?
Do you foresee actually creating the Value on the fly in the intended long-term usage?
In the case that the Value is a loop induction variable this would look fishy.

If it is just for testing purposes, we can write tests + passes that expect a certain structure and takes advantage of that.

201

This will create many patterns when instantiated on many possible pointwise vector ops.
How about creating a vector::PointwiseInterface and adding it to proper operations ?

Not necessary to do it in this PR but let's do it in the immediate next PR.

201

I was thinking of structuring this differently:

  1. DistributeVectorPattern<Op> only adds a pair of extract_map, insert_map after the op.
  2. then canonicalization patterns quick-in greedily: extract_map "go up" until they fold into a transfer_read; insert_map "go down" until they fold into a transfer_write.

Triggering the transformation can be done in a few ways:

  1. either have a transform / pass explicitly call distribution to insert the pair of ops and later apply the canonicalization patterns (in that case the match failure condition is when a vector op flows into an extract_map op).
  2. add additional filtering constraints so that the patterns applies only in certain conditions: e.g. based on type (vector op of size "n", presence of an attribute, keep a set of "seen" ops in the pattern).
  3. traverse the use-def chains until we see a use of the appropriate %id, in which case bail. However, this is likely to become expensive and attribute/seen op from 2. is a memoization of this.

The advantage is we only need to implement 2 canonicalization patterns (one for pull extract up through an op, the other for pushing insert down through an op) instead of 3 (insert + extract + rewrite op as "extract-op-insert").
This will be quite simpler once we have to start dealing with permutation / indexing maps.

mlir/lib/Dialect/Vector/VectorTransforms.cpp
2424

Can we turn this into a canonicalization pattern on ExtractMapOp?

2439

I don't think this is necessary, see my general comment above.

2456

You shouldn't have this called by the rewrite pattern.
Code should be restructured to avoid this.

mlir/test/lib/Transforms/TestVectorTransforms.cpp
137

I would restructure this test pass to just take the first index of the enclosing function and use that as the value to distribute on.

I would use a hardcoded custom op "test_distribution_value" that just takes an index.
In the test IR you can then just have a func argument %arg0 : index and do test_distribution_value(%arg0).

In the future, we can plug that to a loop induction variable and write an integration test.

This revision now requires changes to proceed.Sep 28 2020, 1:35 AM

Maybe remove SPMD from the title and commit message as this is intended to also be useful for sequential vector code?

aartbik added inline comments.Sep 28 2020, 11:08 AM
mlir/include/mlir/Dialect/Vector/VectorOps.td
466

1-D to be consistent with other spelling in this file

468

this is an extremely concise description, can you provide a bit more details
an example would be helpful too

704

this is a copy-and-paste left over

706

1-D

mravishankar requested changes to this revision.Sep 28 2020, 11:33 AM
mravishankar added a subscriber: mravishankar.
mravishankar added inline comments.
mlir/include/mlir/Dialect/Vector/VectorOps.td
460

This name is confusing cause some of the terms are overloaded. My first instinct when seeing map was to expect an affine_map, but that is not the case. And there is no "extraction" happening here. How about Vector_DistributeSliceOp

700

Similar to above here, but this is more complicated. Each distributed ID is inserting a slice of the vector. Are the semantics of the operation that this is a "insert and broadcast", i.e. all distributed IDs have access to the resulting inserted value?

I feel like this has some implicit synchronization behavior. Are all the threads "inserting" here expected to be synchronized at this point?

Address review comments.

ThomasRaoux marked 11 inline comments as done.Sep 28 2020, 1:00 PM
ThomasRaoux added inline comments.
mlir/include/mlir/Dialect/Vector/VectorOps.td
468

I added examples and added more details based on Nicolas comments. Let me know if you think this is still hard to understand.

477

I changed the syntax as suggested. I'm planning to add the optional map in following patches.

mlir/include/mlir/Dialect/Vector/VectorTransforms.h
175

I was thinking we will need to create Value on the fly at some point which is why I had it that way. But I'm not sure this will be needed right now so I changed it to just pass a Value for now.

201

I went for solution 1. Let me know if this is different than what you had in mind.

mlir/lib/Dialect/Vector/VectorTransforms.cpp
2439

Based on your general comment, the pattern is gone and I only left the transformation that can be called directly.

2456

Removed this pattern based on other comments.

mlir/test/lib/Transforms/TestVectorTransforms.cpp
137

Ok, changed it to do that.

ThomasRaoux retitled this revision from [mlir][vector] First step of vector SPMD distribution transformation to [mlir][vector] First step of vector distribution transformation.Sep 28 2020, 1:17 PM
ThomasRaoux edited the summary of this revision. (Show Details)
ThomasRaoux added inline comments.Sep 28 2020, 1:22 PM
mlir/include/mlir/Dialect/Vector/VectorOps.td
460

The goal is to have an affine_map indeed. This is the just the first step and I will add an affine_map operand as well. This doesn't really distribute so I'm not sure about calling it distributeSliceOp. Maybe should be called getSliceOp or something similar? @nicolasvasilache, any thoughts on the best name for this op?

700

This doesn't imply anything on synchronization. I think I made it confusing by mentioning SPMD but this is not directly related. Whether or not the different elements are synchronized depends on the semantic of the program.
For instance this could be used to distribute elements over a serialized loop, the lowering patterns should do the right thing to not break the semantic if cross lane operations are needed.
Do you think this is something that needs to be added to the description or the latest updates are enough to clarify that this doesn't imply SPMD?

mravishankar requested changes to this revision.Sep 28 2020, 3:00 PM
mravishankar added inline comments.
mlir/include/mlir/Dialect/Vector/VectorOps.td
700

This makes sense. Thanks for the clarification.

mlir/lib/Dialect/Vector/VectorTransforms.cpp
2453

For such methods it is better to just return extract and newVec and let the caller deal with the operations. Similarly it is better to pass in the OpBuilder as an argument. If used within the DialectConversion pass the ConversionPatternRewriter needs to track operations added/deleted so that it can undo on failure.

mlir/test/lib/Transforms/TestVectorTransforms.cpp
138

I think if there were an Operation interface for all relevant ops, then this could be done as a pattern which would be highly preferable.

This revision now requires changes to proceed.Sep 28 2020, 3:00 PM

Forgot to add context.

ThomasRaoux added inline comments.Sep 28 2020, 3:11 PM
mlir/lib/Dialect/Vector/VectorTransforms.cpp
2453

Sure, I can change that. I'll wait for your second comment to be resolved first as this will work differently based on whether I use a pattern or do direct transformation.

mlir/test/lib/Transforms/TestVectorTransforms.cpp
138

The reason I'm not using a Pattern is not because there no Operation interface but because the rewrite pattern would run into infinite loop since this transformation leaves the original instruction unchanged and just adds extract_map/insert_map that then get propagated by the canonicalization operations.
(See discussion with Nicolas about that: https://reviews.llvm.org/D88341#inline-819945)
Here I went for the solution 1. suggested by Nicolas. Using solution 2. would allow making it a pattern however I'm not sure if it is really cleaner. What do you think?

Change signature of distributPointwiseVectorOp based on comments.

Thanks for the feedback, I think I addressed everything. Please take another look.

mlir/lib/Dialect/Vector/VectorTransforms.cpp
2453

I changed the signature to return the extract/insert and do the replace outside the function.

mravishankar accepted this revision.Sep 29 2020, 1:24 PM

Looks fine to me. Please wait for Aart or Nicolas to approve.

mlir/include/mlir/Dialect/Vector/VectorTransforms.h
186

Last comment here (sorry for the multiple rounds); return by reference is weird. In Linalg this was done by creating a struct like

struct Foo {
  ExtractMapOp extract;
  InsertMapOp insert;
}

and return type of the method being Optional<Foo> . I am not aware of the convention here, but I find this better. Return by reference typically make sense for containers like vector, set, map, etc. to avoid a copy (though C++ semantics now is to do a move on return of such objects rather than a copy).

Remove return by reference.

mlir/include/mlir/Dialect/Vector/VectorTransforms.h
186

Changed it.

nicolasvasilache accepted this revision.Sep 30 2020, 3:21 AM

LGTM, if a better name pops to mind please do propose.

mlir/include/mlir/Dialect/Vector/VectorOps.td
460

I agree the name is not good but I can't think of a better one.
We actually need 2 names not just one.
I originally went for insert/extract + map to mirror the existing insert/extract which behave similarly in LLVM.
There is an insertion/extraction of value happening: only 1 value is mapped.

So in the current form, this overlaps with insert/extract_element.

In the absence of a better proposal I'd stay with this name for now but I'd very much like a better name..

mlir/include/mlir/Dialect/Vector/VectorTransforms.h
181

s/going from 0 to dimension/taking *all* values in [0 .. multiplicity - 1] (e.g. loop induction variable or SPMD id) ?
Not sure if this phrasing would be clear enough but I'd like to emphasize that this is a special type of value that must take all values in the range.

Please us multiplicity here and below for consistency.

mlir/test/lib/Transforms/TestVectorTransforms.cpp
138

The pattern is the one that propagates the insert/extract up and down and will benefit from the Interfaces.
It seems to me the addition of insert/extract pairs is the responsibility of the pass/transformation which seems appropriate for the current state of the test pass.
I'd say we can revisit this later when we have enough patterns and we start looking at profitability.

This revision is now accepted and ready to land.Sep 30 2020, 3:21 AM

Re-phrase comment and replace dimension with multiplicity.

ThomasRaoux marked an inline comment as done.Sep 30 2020, 12:47 PM
ThomasRaoux added inline comments.
mlir/include/mlir/Dialect/Vector/VectorOps.td
460

I'll leave it like that for now then and we can think about a better name.

mlir/test/lib/Transforms/TestVectorTransforms.cpp
138

Leaving it like that for now.

I find these op semantics really hard to understand right now, I think the doc needs a significant amount of improvement.

I am very unconvinced by this, but this can be just a lack of understanding, can you please bring this up to Discourse and complete a design discussion on this topic?

mlir/include/mlir/Dialect/Vector/VectorOps.td
506

This IR and this transformation seems deeply broken to me at the moment.

->

func @distribute_vector_add(%A: vector<32xf32>, %B: vector<32xf32>) -> vector<32xf32> {

%C = simt.execute %A to %Aelt : vector<1xf32> affine_map<()[s0] -> (s0)> 
                                 %B to %Belt : vector<1xf32> affine_map<()[s0] -> (s0)> {
  %0 = addf %A, %B : vector<1xf32>
  yield %0
}

This op's custom print is missing a 32 somewhere. Isn't %C of type vector 32xf32? Not having that goes against IR printing guidelines.

bondhugula added inline comments.Oct 9 2020, 12:01 AM
mlir/include/mlir/Dialect/Vector/VectorOps.td
464

Why are these ops called extract_map and insert_map?! They aren't extracting or inserting maps nor are they extracting *and* mapping. The phrase "maps a given multiplicity of the vector ..." isn't really meaningful to me! Why is id used? Did you mean position or index? Both the naming and the doc description appear to be completely unclear and perhaps disconnected from the op's semantics.

I find these op semantics really hard to understand right now, I think the doc needs a significant amount of improvement.

I am very unconvinced by this, but this can be just a lack of understanding, can you please bring this up to Discourse and complete a design discussion on this topic?

Ping here? (without acknowledgement I can't know if this has been missed)

I find these op semantics really hard to understand right now, I think the doc needs a significant amount of improvement.

I am very unconvinced by this, but this can be just a lack of understanding, can you please bring this up to Discourse and complete a design discussion on this topic?

Ping here? (without acknowledgement I can't know if this has been missed)

Sorry for the delay. I'm starting a Discourse discussion and can update the doc based on it.

I find these op semantics really hard to understand right now, I think the doc needs a significant amount of improvement.

I am very unconvinced by this, but this can be just a lack of understanding, can you please bring this up to Discourse and complete a design discussion on this topic?

Ping here? (without acknowledgement I can't know if this has been missed)

Sorry for the delay. I'm starting a Discourse discussion and can update the doc based on it.

https://llvm.discourse.group/t/vector-vector-distribution-large-vector-to-small-vector/1983