This is an archive of the discontinued LLVM Phabricator instance.

[mlir][sparse] introduce vectorization pass for sparse loops
ClosedPublic

Authored by aartbik on Nov 17 2022, 1:50 PM.

Details

Summary

This brings back previous SIMD functionality, but in a separate pass.
The idea is to improve this new pass incrementally, going beyond for-loops
to while-loops for co-iteration as well (masking), while introducing new
abstractions to make the lowering more progressive. The separation of
sparsification and vectorization is a very good first step on this journey.

Also brings back ArmSVE support

Still to be fine-tuned:

+ use of "index" in SIMD loop (viz. a[i] = i)
+ check that all ops really have SIMD support
+ check all forms of reductions
+ chain reduction SIMD values

Diff Detail

Event Timeline

aartbik created this revision.Nov 17 2022, 1:50 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 17 2022, 1:50 PM
aartbik requested review of this revision.Nov 17 2022, 1:50 PM
Peiming added inline comments.Nov 17 2022, 2:20 PM
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
297

You can simply do a clone(def) here and update the operand to vx in place

336

nit: static?

345

nit: successful

377

So, this assumes that the reduction value is the first value, which is, of course, valid.

But will it be too fragile against changes? How about adding a static function to query the index, and adding an assertion there to make sure the index is always valid?

e.g., in sparsification, we can have something like
assert(codegen.redVal == block.getArgument(getReductionValueArgumentIndex()))
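The suggested pattern could be sketched as follows in plain C++; the names (`getReductionValueArgumentIndex`, the `Block` stand-in) are hypothetical illustrations, not the actual MLIR API:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: centralize the "reduction value comes first"
// convention behind one static accessor, so both the producer
// (sparsification) and the consumer (vectorization) agree on the index.
static constexpr unsigned getReductionValueArgumentIndex() { return 0; }

// Minimal stand-in for an MLIR block's argument list.
struct Block {
  std::vector<int> arguments;
  int getArgument(unsigned i) const { return arguments[i]; }
};

int lookupReductionValue(const Block &block, int redVal) {
  // The assertion documents (and enforces) the convention instead of
  // relying on a bare literal 0 scattered through the code.
  assert(block.getArgument(getReductionValueArgumentIndex()) == redVal &&
         "reduction value is expected to be the first block argument");
  return block.getArgument(getReductionValueArgumentIndex());
}
```

The point of the accessor is that a future change to the argument layout only has to touch one function, and the assertion fails loudly if producer and consumer fall out of sync.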

433–434

Any reason why you reuse the function for analysis/codegen instead of having two functions? Wouldn't that be clearer?

Peiming added a comment.EditedNov 17 2022, 2:45 PM

A general question (from a newbie in vectorization):

Is it possible to always start vectorization from the innermost for-loop? If so, you could chain the for-loop vectorization, as every inner loop will return a vector value once it has been vectorized?

It seems to me that the current implementation only vectorizes loops without dependences between reduction values?

It seems to me that the current implementation only vectorizes loops without dependences between reduction values?

Yes, that is in the TBD (see: chain reduction SIMD values).
There is even a test for this that I plan to bring back in the follow ups:

CHECK: %[[VAL_65:.*]] = vector.insertelement %[[VAL_66:.*]]#2, %[[VAL_3]]{{\[}}%[[VAL_6]] : index] : vector<8xf64>
CHECK: %[[VAL_67:.*]] = scf.for %[[VAL_68:.*]] = %[[VAL_66]]#0 to %[[VAL_22]] step %[[VAL_4]] iter_args(%[[VAL_69:.*]] = %[[VAL_65]]) -> (vector<8xf64>) {
CHECK: %[[VAL_70:.*]] = affine.min #map(%[[VAL_22]], %[[VAL_68]])
CHECK: %[[VAL_71:.*]] = vector.create_mask %[[VAL_70]] : vector<8xi1>
CHECK: %[[VAL_72:.*]] = vector.maskedload %[[VAL_11]]{{\[}}%[[VAL_68]]], %[[VAL_71]], %[[VAL_3]] : memref<?xf64>, vector<8xi1>, vector<8xf64> into vector<8xf64>
CHECK: %[[VAL_73:.*]] = arith.addf %[[VAL_69]], %[[VAL_72]] : vector<8xf64>
CHECK: %[[VAL_74:.*]] = arith.select %[[VAL_71]], %[[VAL_73]], %[[VAL_69]] : vector<8xi1>, vector<8xf64>
CHECK: scf.yield %[[VAL_74]] : vector<8xf64>
CHECK: }
CHECK: %[[VAL_75:.*]] = scf.for %[[VAL_76:.*]] = %[[VAL_66]]#1 to %[[VAL_25]] step %[[VAL_4]] iter_args(%[[VAL_77:.*]] = %[[VAL_78:.*]]) -> (vector<8xf64>) {
CHECK: %[[VAL_79:.*]] = affine.min #map(%[[VAL_25]], %[[VAL_76]])
CHECK: %[[VAL_80:.*]] = vector.create_mask %[[VAL_79]] : vector<8xi1>
CHECK: %[[VAL_81:.*]] = vector.maskedload %[[VAL_14]]{{\[}}%[[VAL_76]]], %[[VAL_80]], %[[VAL_3]] : memref<?xf64>, vector<8xi1>, vector<8xf64> into vector<8xf64>
CHECK: %[[VAL_82:.*]] = arith.addf %[[VAL_77]], %[[VAL_81]] : vector<8xf64>
CHECK: %[[VAL_83:.*]] = arith.select %[[VAL_80]], %[[VAL_82]], %[[VAL_77]] : vector<8xi1>, vector<8xf64>
CHECK: scf.yield %[[VAL_83]] : vector<8xf64>
CHECK: }
CHECK: %[[VAL_84:.*]] = vector.reduction <add>, %[[VAL_85:.*]] : vector<8xf64> into f64
CHECK: scf.yield %[[VAL_84]] : f64
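For readers unfamiliar with the masking idiom in the CHECK lines above (vector.create_mask / vector.maskedload / arith.select), a scalar C++ simulation of the semantics of one masked sum reduction might look like this; it is an illustrative model, not MLIR code:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <numeric>
#include <vector>

// Simulates a vectorized sum reduction over a buffer whose length need
// not be a multiple of the vector width: partial vectors are handled
// with a lane mask instead of a scalar cleanup loop.
constexpr std::size_t kVL = 8; // models vector<8xf64>

double maskedSumReduce(const std::vector<double> &buf) {
  std::array<double, kVL> acc{}; // models the iter_args vector, init 0
  for (std::size_t i = 0; i < buf.size(); i += kVL) {
    // models affine.min + vector.create_mask: active lanes this step
    std::size_t active = std::min(kVL, buf.size() - i);
    for (std::size_t l = 0; l < kVL; ++l) {
      bool mask = l < active;             // vector.create_mask
      double v = mask ? buf[i + l] : 0.0; // vector.maskedload (pass-through 0)
      double sum = acc[l] + v;            // arith.addf
      acc[l] = mask ? sum : acc[l];       // arith.select keeps inactive lanes
    }
  }
  // models vector.reduction <add>: horizontal sum of the lanes
  return std::accumulate(acc.begin(), acc.end(), 0.0);
}
```

The arith.select is what makes the tail correct: inactive lanes keep their previous accumulator value rather than absorbing the pass-through.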

Hi Aart,

Only high-level comments; I didn't have time to dig into the meat of the pass yet (and my review is by no means blocking :P)

Cheers,
-Quentin

mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.h
175

Maybe add a comment saying that the additional arguments are described as options in the .td file.

mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.td
273

What should we set as vector length for something that depends on the actual type?

E.g., if the vector unit is 128-bit wide, we would have 2x for i64, 4x for i32, 8x for i16.

We don't need to address that in this patch and in particular, maybe the goal here is to vectorize as much as possible and let the actual backend split things up later.
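The relationship Quentin describes is simple to state: for a fixed-width SIMD unit, the lane count follows from the element width. A trivial helper makes it concrete (purely illustrative; the pass itself takes vl as a user option and does not compute this):

```cpp
#include <cassert>

// Number of lanes a unitBits-wide vector unit provides for elements of
// elemBits bits. E.g. a 128-bit unit: 2 lanes of i64, 4 of i32, 8 of i16.
constexpr unsigned lanesFor(unsigned unitBits, unsigned elemBits) {
  return unitBits / elemBits;
}

static_assert(lanesFor(128, 64) == 2, "i64 on a 128-bit SIMD unit");
static_assert(lanesFor(128, 32) == 4, "i32 on a 128-bit SIMD unit");
static_assert(lanesFor(128, 16) == 8, "i16 on a 128-bit SIMD unit");
```

A single vl flag therefore cannot be "right" for every element type at once, which is why deferring the split to the backend is a reasonable policy.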

mlir/lib/Dialect/SparseTensor/Transforms/SparseTensorPasses.cpp
73

I see that this moves in the vectorization pass.

Couple of questions:

  • Could this be a problem for users of the SparsificationPass?
  • Should we run this as part of the vectorization pass at all?

Essentially, where I'm going with this is: should we let users decide when and where the canonicalization happens, instead of appending it to the vectorization pass (whereas it ran after the sparsification pass before)?

mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
42

You may be able to use getConstantIntValue instead.

aartbik marked 6 inline comments as done.Nov 17 2022, 6:05 PM
aartbik added inline comments.
mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.h
175

I don't mind adding such a comment, but if you look in this file, all passes are defined in the .td file and have their options defined as parameters here. So if I add something, we would need to document them all. WDYT?

mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.td
273

That is a good question, and very fundamental to the vector dialect. Note that by having the vl=xxx flag, we provide the mechanism (doing it) but not the policy (picking it).

In general, you want the VL to match the SIMD length, or a bit more. E.g., picking vl=8 on a 4-way SIMD unit will result in a vector loop with two iterations "unrolled" into registers.

See this very old thread I started with experimental docs that show the generated AVX512 code

https://discourse.llvm.org/t/case-study-docs-on-vector-dialect-cpu-codegen/1674

mlir/lib/Dialect/SparseTensor/Transforms/SparseTensorPasses.cpp
73

No problem, we never see vector code out of this pass. It was a leftover from when we removed vectorization from the sparsification pass. The V2V should have been removed at that time too.

mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
42

Ha! Let me try. Because this was an index, some of our other utilities did not work, but your suggestion is nice and short.

297

Yeah, I was looking for a more "reflective" way.
But will clone() work? Does that make a shallow or deep copy?

377

Yeah, good suggestion. Made it less brittle.

433–434

Because so much of the analysis logic is repeated in codegen, this way it is a lot less code ;-)

Matt added a subscriber: Matt.Nov 17 2022, 8:27 PM
Peiming added inline comments.Nov 17 2022, 9:49 PM
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
433–434

Okay, it seems that what you really need is a "dataflow visitor", but it seems that MLIR does not provide one.

Peiming added inline comments.Nov 17 2022, 9:51 PM
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
297

I believe that it makes a deep copy.

Peiming added inline comments.Nov 17 2022, 9:56 PM
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
297

Actually it might be a little bit ambitious here to say deep copy.

It clones the operation itself, plus all the regions in it, but the operands will remain the same.

You can then call updateRootInPlace to set the operands to other SSA values.

That was quick! Thanks, Aart! Just some minor comments. Other than that, LGTM!

mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.td
277

Is this mostly for gathers indices?

mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
140

We may have some utilities in the Linalg vectorizer that could be useful here to detect and get the combining operation.

430

Would this bail out if the loop is an outer loop?

Peiming added inline comments.Nov 18 2022, 9:05 AM
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
297

a little bit *ambiguous*

aartbik marked 4 inline comments as done.Nov 18 2022, 9:18 AM
aartbik added inline comments.
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
297

But then it is way too much redundant work, right? I would rather keep this macro logic to just create what we need. Ideally I would like to have some reflection flavor like

vexp = rewriter.createUnOp(op->kind, loc, vx);

Do we have something like that?

Thanks for working on this, Aart!

This makes sense. I've left a few inline comments, but these are just nits, typos or basic questions from an aspiring "vectorization" ninja :) Definitely non-blocking!

One thing that's not clear is how enable-simd-index32=true affects the generated code. I'm comparing output for CHECK-VEC2 and CHECK-VEC3 and they look identical (func @add_dense) 🤔 . Also, would it make sense to test other values of vl (currently it's always vl=16)?

mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
34

[nit] Would it make sense to document what the unit of vectorLength is? (i.e. bits, bytes, number of elements?)

96

What's lo and hi in this context?

143

For those of us who have no access to this book, would you be able to add a few more comments? In particular, could you explain the difference between r and rd? Thank you :)

223
334–335

Suggesting this as it's not immediately obvious what "first pass" means when looking at this method in isolation (it becomes obvious when it's being called).

348
360

[nit] Wouldn't forOpNew be more descriptive? forOp2 might suggest there are 2 "for loops" to begin with.

It would also be good to set up the expectations for this vectorizer somewhere in the documentation to make sure nobody misses that this is a transition step towards Linalg.

qcolombet added inline comments.Nov 18 2022, 10:38 AM
mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.h
175

If the canonical way is to have the documentation in the td file, what works for me :).

aartbik marked 11 inline comments as done.Nov 18 2022, 11:43 AM
aartbik added inline comments.
mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.h
175

Thanks!

mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.td
273

I added a comment in the doc. See if that makes sense.

277

yes, added that specifically in the option description

mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
34

Added more doc to struct

42

it works!

96

Documented

140

Can you point me to any? None jumped out (for something this small).

143

We all should have access to the book! But documented

297

I played around with some cloning, but nothing looks as concise as the current macro magic (which is also the most efficient way to get to the new IR). So I am leaving this as a follow-up refinement, if you are okay with that.

360

Good suggestion. My naming is always too short ;-)

430

It does since it will fail on the inner forOp. But is there a quicker way to test this?

aartbik updated this revision to Diff 476574.Nov 18 2022, 12:52 PM
aartbik marked 8 inline comments as done.

addressed comments, rebased with main

aartbik updated this revision to Diff 476594.Nov 18 2022, 1:47 PM

dotted the i's for ARMSVE

aartbik edited the summary of this revision. (Show Details)Nov 18 2022, 1:52 PM
aartbik updated this revision to Diff 476616.Nov 18 2022, 3:19 PM

clang format


Great to see this! Also, many thanks for all the additional comments :)

mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
285
397
407–408

Shouldn't the pass-through value be equal to whatever is the neutral value for the combining operation? (i.e. 1 for multiplication and 0 for addition)

mlir/test/Dialect/SparseTensor/sparse_vector.mlir
2–9

You could try more descriptive prefixes:

  • CHECK-VEC1 --> CHECK-SCALAR
  • CHECK-VEC2 --> CHECK-VEC16
  • CHECK-VEC3 --> CHECK-VEC16-32IDX
  • CHECK-VEC4 --> CHECK-VEC4-SCALABLE

You could also just drop CHECK to make it a bit concise.

This is clear either way, so feel free to ignore.

aartbik marked 2 inline comments as done.Nov 21 2022, 8:53 AM
aartbik added inline comments.
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
407–408

Ah, that is possible too, but by using the inits shown, you can save one op in the end.
E.g., a sum reduction could start with
E.g. a sum reduction could start with

v = [0,...0]
simd loop
t = sum_reduce v
r = r + t

or you can do

v = [0,...r]
simd loop
r = sum_reduce v
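The saving Aart describes can be checked with a small scalar model (illustrative only, names are mine): seeding one lane of the initial vector with the incoming reduction value r makes the trailing scalar add unnecessary, since the horizontal reduce folds r in.

```cpp
#include <array>
#include <cstddef>
#include <numeric>
#include <vector>

constexpr std::size_t kVL = 8;

// Variant 1: v = [0,...,0]; horizontally reduce, then add r afterwards.
double reduceThenAdd(double r, const std::vector<double> &xs) {
  std::array<double, kVL> v{}; // all zeros
  for (std::size_t i = 0; i < xs.size(); ++i)
    v[i % kVL] += xs[i];
  double t = std::accumulate(v.begin(), v.end(), 0.0); // t = sum_reduce v
  return r + t;                                        // extra scalar op
}

// Variant 2: seed a lane with r; the horizontal reduce already folds it in.
double seededReduce(double r, const std::vector<double> &xs) {
  std::array<double, kVL> v{};
  v[0] = 0.0 + r; // models vector.insertelement of the incoming value
  for (std::size_t i = 0; i < xs.size(); ++i)
    v[i % kVL] += xs[i];
  return std::accumulate(v.begin(), v.end(), 0.0); // no trailing add
}
```

Both variants compute the same result; the second simply emits one fewer scalar op after the loop, which is what the insertelement in the CHECK lines earlier in this review is doing.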

mlir/test/Dialect/SparseTensor/sparse_vector.mlir
2–9

This is a good suggestion, done.

aartbik updated this revision to Diff 476928.Nov 21 2022, 9:23 AM
aartbik marked 2 inline comments as done.

addressed comments, rebased with main

dcaballe accepted this revision.Nov 21 2022, 3:07 PM

Thanks, Aart! LGTM

This revision is now accepted and ready to land.Nov 21 2022, 3:07 PM
This revision was landed with ongoing or failed builds.Nov 21 2022, 4:12 PM
This revision was automatically updated to reflect the committed changes.
qcolombet added inline comments.Nov 21 2022, 4:34 PM
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
76

You could use getConstantIntValue here as well.

91

Nit: it took me some time to parse what the min was doing, because the order of the expressions and the ValueRange don't match.
I.e., the expression has min(symbol, dim0 - dim1) (i.e., symbol, dim0, dim1) while the ValueRange has dim0, dim1, symbol.

Maybe that's just me not being used to dealing with affine builders (where we know that dimensions appear first) :).

Feel free to ignore.

113

I feel I'm missing something obvious, which may indicate a need for a comment :).
How do we know that at most the last index will be a VectorType?

Couldn't we technically end up with a VectorType for several of the indices?

Going by what happens in vectorizeSubscripts, I don't see anything that would prevent generating more than one vload.

aartbik marked 3 inline comments as done.Nov 21 2022, 6:29 PM
aartbik added inline comments.
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
76

Good point.

113

Indeed, if you have to ask, it needs a comment. Coming up!

awarzynski added inline comments.Nov 22 2022, 8:01 AM
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
407–408

Ah, I see! I got confused because genVectorInvariantValue creates:

%25 = "vector.broadcast"(%arg4) : (f32) -> vector<[4]xf32>

so v = [r, ..., r] rather than v = [0, ..., r].

But then vpass is effectively replaced with %6 = vector.insertelement %3, %cst[%c0 : index] : vector<[4]xf32>, so the code is correct.