This is an archive of the discontinued LLVM Phabricator instance.

[mlir] [sparse] start of sparse tensor compiler support
ClosedPublic

Authored by aartbik on Nov 6 2020, 5:59 PM.

Details

Summary

As discussed in https://llvm.discourse.group/t/mlir-support-for-sparse-tensors/2020
this CL is the start of sparse tensor compiler support in MLIR. Starting with a
"dense" kernel expressed in the Linalg dialect together with per-dimension
sparsity annotations on the tensors, the compiler automatically lowers the
kernel to sparse code using the methods described in Fredrik Kjolstad's thesis.

Many details are still TBD. For example, the sparse "bufferization" is purely
done locally since we don't have a global solution for propagating sparsity
yet. Furthermore, code to input and output the sparse tensors is missing.
Nevertheless, with some hand modifications, the generated MLIR can be
easily converted into runnable code already.

Diff Detail

Event Timeline

aartbik created this revision.Nov 6 2020, 5:59 PM
aartbik requested review of this revision.Nov 6 2020, 5:59 PM
aartbik updated this revision to Diff 303610.Nov 6 2020, 8:07 PM

fixed lint issue

sri added a subscriber: sri.Nov 8 2020, 7:14 PM
silvas added inline comments.Nov 9 2020, 8:39 AM
mlir/test/Dialect/Linalg/sparse_1d.mlir
19

Haven't done a full review of the patch yet, but VAL_0 here doesn't seem to feed into any of the memrefs in the converted program.

aartbik added inline comments.Nov 9 2020, 9:24 AM
mlir/test/Dialect/Linalg/sparse_1d.mlir
19

The current sparse compiler does a simple "local" bufferization, where the arguments are used to get the shape, but are otherwise dropped. The alloc() that replace them are not even filled with values. You will see this in all the examples.

This is of course not a runnable (and final) solution, but it makes hand modifying it to something runnable easier.

In the long run, the sparse bufferization should replace the parameters and prototype of the method, just like true bufferization does currently.

Thanks for pushing on this Aart!

Some high-level comments before digging deeper: I get that this is essentially reimplementing TACO whose ideas have been shown to work in the past but I am wondering if we could significantly simplify things by taking advantage of the existing MLIR infra?
In particular, this is a one-shot implement all at once that I fear will be hard to break down and interoperate with.
For instance, I have difficulty seeing how tiling, fusion or vectorization will work from this form.

As discussed offline, I am wondering if we would be better off by immediately starting from:

  1. making topological ordering part of a verification pattern, the transformation part can likely be already be applied with linalg interchange, the optimization part is separate.
  2. given the existing linalg to loops lowering, adapting it to work with the sparse annotation, in particular with a level of indirection in the load/store under appropriate sparsity conditions.
  3. inserting extra control flow (additional scf.for loops + scf.if to emulate coiteration)
  4. starting to invest in scf.for + scf.if hoisting and in particular -> scf.while canonicalizations

The following comment reminds me of the first broadcast lowering which had similar recursions:

/// Recursively generates code while computing iteration lattices in order
/// to manage the complexity of implementing co-iteration over unions
/// and intersections of sparse iterations spaces.

I somewhat view this as a sign of missing abstractions and lowering too fast, which makes sense given the similarity with TACO.
I suspect a lot of the lattice part of the code could start popping out quite naturally as canonicalizations, but I have not tried so I may be completely off here.
It may also be that we don't want to fully expand the while along all paths?

I expect the discussion above to work well with reasonable interfaces that could easily be switched to operations on future sparse types when they become available.

I realize this may sound like a lot of things that have to come together to get to a similar functionality but written in a more progressive and composable fashion.

Is this an opportunity to start breaking down the problem into small composable abstractions and start distributing some of the work ?

mlir/test/Dialect/Linalg/sparse_3d.mlir
1193
%[[VAL_25:.*]] = load %[[VAL_9]][%[[VAL_24]]]

should just work.
Cases where I've seen a need for the escaping sequence is when the % was also captured.

Thanks for pushing on this Aart!
I realize this may sound like a lot of things that have to come together to get to a similar functionality but written in a more progressive and composable fashion.

Thanks for your feedback Nicolas, and of course I am pondering over all those questions myself as well already. But I guess the approach I am taking with this CL is a bit "learning to walk before you can run", i.e. start with a working solution (so people actually understand fully the direction that will be taken) and then progressively improve on it by adding useful abstractions and refactoring the working parts accordingly, similar to what is still being done for Linalg, bufferization, and even Vector dialect still. I don't see a very clear path adding such abstractions purely in isolation and then somehow hope they indeed come together well.

Taking this CL as starting point, the very first steps would of course be some testing of the generated code, some performance analysis, and addition of SIMD support, just to convince others this is a feasible approach. Then proper next steps would indeed be (1) move annotations to Linalg interface, so that we can verify early and other phases have to start thinking of how to propagate that information, (2) find a better approach to the bufferization issue (3) progressively try to find better primitives to express sparse code, including I/O of the tensors.

mlir/test/Dialect/Linalg/sparse_3d.mlir
1193

Note that all CHECKs are auto-generated (generate-test-checks.py), so the layout and constructs are no my doing. Perhaps we can improve the script to avoid escapes where not needed (also, I had to make one hand modification on a parameter match that did not work out of the box).

Thanks Aart!

I did one pass, have a bunch of nits to make this look more MLIResque in style. I agree with Nicolas that we should think about how to connect this better to the infrastructure and avoid a monolithic codegen that will grow huge over time (the same rationale as for affine code generation could apply). My knowledge of the TACO paper is residual at this point, so it's not always easy to track what is happening in the code the high level. Maybe having a relatively short comment blob that explains the algorithm without referring to the paper could be a good way for us to crystallize the understanding on how this can be fitted into MLIR. That being said, I wouldn't oppose landing a version that is correct + tests, and then iterating on refactoring and generalization while keeping the tests green.

mlir/lib/Dialect/Linalg/Transforms/Sparsification.cpp
56–57

Nit: these could use some documentation. In particular, what do these things contain for different kinds.

79

Nit: /// for function docs as well

97

Nit: emplace_back() ?

238

Nit: we use LogicalResult for such cases

243

Nit: getAttrOfType

272

We have some other implementations of topological sort, iterative IIRC, would it be possible to refactor+template those instead?

331

Nit: I'd rather return Optional<unsigned>

338–340

I was a bit lots here before realizing that kTensor refers to any values that is indexed in the GenericOp and kScalar refers to any value that is not, regardless of their actual types. kTensor may or may not be a tensor, kScalar may also be a tensor if the ops inside just consume it as a whole... Would be nice to have this documented.

342–352

How do you expect this to scale to more ops?

Generally, MLIR doesn't seem to like data structures parallel to the IR, but it may be justified if there are significant restrictions or analysis costs to process the IR every time.

371

Nit: map.getResult(i).isFunctionOfDim(idx) should work and be more future-proof

428–431

Where are these deallocated?

436

Where does this inference take place?

445

Shouldn't this be op.getInitTensor(t - numInputs)?

462

Where is this deallocated?

472

Nit: this seems to be a recurrent pattern, can we factor it out into the op class or interface ?

524

Could we have this as a function template with lambdas instead?

529

This actually closes the conditional rather than the loop, looks a strange to me. Correct, but strange.

630–631

This is not compatible with PatternRewriter, use rewriter.createBlock instead.

638

This is not compatible with PatternRewriter either. createBlock takes an arrayref of argument types for this purpose.

aartbik marked 15 inline comments as done.Nov 12 2020, 8:10 PM
aartbik added inline comments.
mlir/lib/Dialect/Linalg/Transforms/Sparsification.cpp
238

By merging with the new verifier of the annotations, this is no longer needed.

243

By merging with the other CL that makes annotations part of generic op, this is no longer needed.

272

I looked at some, but for the ones I found, I felt expressing this with code reuse would be more obscure than adding these few lines. Is there one simpler one that I overlooked?

331

Yeah, I guess I am old school C sometimes with my extra parameters and magic returns.
Changed to the more elegant Optional, which also has the advantage of stopping the build as soon as it fails.

338–340

Well, anything indexed would be a tensor for the kernel, but I see your point on the scalar issue. Would kInvariant be a better name (besides adding more doc, which I did too).

342–352

I too, am reluctant to have parallel data structures, but the lattice point manipulations really need some IR to represent arbitrary subexpressions (viz a + b + c has a + b + c, a + b. a + c, b + c, a, b etc.). I don't expect this too scale to very large IR ever...

428–431

They are not. Currently the genBuffers() is a "local" bufferization method that just adds some stub allocations that I can simply hand modify into something running. Most of this code is throw-away since it will be replaced by something more global in bufferization (or at least something changed here by modifying the parameters of the method rather than adding allocs).

436

L444.

445

Any tensor (t < numInput) is an input parameter, except potentially the output tensor (t == numInput), which has a single InitTensor. Again, I realize all code should be clear and solid, but this genBuffers will be rewritten heavily very soon. This just gets some reasonable allocs before the rewritten kernel.

462

See above. TBD ;-)

472

I see a lot of this outside this file too, let's do a cleanup later for all

529

Yeah, this was actually a result of not using { on a single statement for-loop (body is single if).
I had two {{ and }} but was not sure if style would apply to macro as well ;-)

630–631

Ah yes, you are right. Thanks for this very sharp observation, I had completely overlooked that!

aartbik marked 10 inline comments as done.Nov 12 2020, 8:26 PM

Thanks Aart!

I did one pass, have a bunch of nits to make this look more MLIResque in style. I agree with Nicolas that we should think about how to connect this better to the infrastructure and avoid a monolithic codegen that will grow huge over time (the same rationale as for affine code generation could apply). My knowledge of the TACO paper is residual at this point, so it's not always easy to track what is happening in the code the high level. Maybe having a relatively short comment blob that explains the algorithm without referring to the paper could be a good way for us to crystallize the understanding on how this can be fitted into MLIR. That being said, I wouldn't oppose landing a version that is correct + tests, and then iterating on refactoring and generalization while keeping the tests green.

Thanks for the very detailed comments, Alex, to the point as always! The endgoal is definitely not a monolithic codegen, but for now to simply bootstrap the idea, and iterate to nicer abstractions (not unlike the approach we took for many other modules). I really look forward getting this of the ground!

aartbik updated this revision to Diff 305016.Nov 12 2020, 8:34 PM

addressed Alex' comments

aartbik updated this revision to Diff 305021.Nov 12 2020, 8:50 PM

fixed typo

aartbik updated this revision to Diff 305562.Nov 16 2020, 10:57 AM
aartbik marked 2 inline comments as done.

rebased with new built-in annotations

mlir/lib/Dialect/Linalg/Transforms/Sparsification.cpp
77

If this is meant to subsist in the mid/long-term, it would be good to move it to utils and have a special test for it with simple MLIR custom ops and prints, completely independent of the codegen.

530

I find this function very hard to follow in the current spaghetti form.
It doesn't seem like it would be very hard to restructured to a more structured form with helper functions that have semantically meaningful names.

Could we please try that ?

554

braces plz, trivial statement is defined as 1-liner these days :)

aartbik marked 2 inline comments as done.Nov 16 2020, 8:42 PM
aartbik added inline comments.
mlir/lib/Dialect/Linalg/Transforms/Sparsification.cpp
77

If we indeed keep this representation and data structure in the long term, I will be happy to pull it out so it can be unit tested in isolation. But let's leave that for a follow-up restructuring.

530

Agreed. When I started this function, it was actually very readable (it more or less follows the recursive variant given in the Taco paper), but I must admit that after filling in on the blanks with actual implementation, it has become a bit massive.

I broke it up into smaller, individual methods. PTAL if this reads better.

aartbik updated this revision to Diff 305651.Nov 16 2020, 8:54 PM
aartbik marked an inline comment as done.

broke up large genStmt() method into many genXXX() methods

nicolasvasilache accepted this revision.Nov 17 2020, 6:30 AM

Thanks @aartbik for refactoring and making this more readable.

This makes a few things clear IMO that we should address before landing.
The remaining issues are mainly concentrated around goto-style programming induced by manual insertion point manipulations across function boundaries.

You should only ever need to build IR by using:

  1. OpBuilder::InsertionGuard g(b) which resets the IP upon RAII release
  2. structured builders that take lambdas and make it very clear the nesting that you emit.

Granted, there is no such builder for WhileOp, this revision should still adopt best practices and we can turn this into the structured WhileOp builder in a subsequent revision.

Approving conditioned on that.

mlir/lib/Dialect/Linalg/Transforms/Sparsification.cpp
577

no goto please

620

no goto please

731

no goto please

762

tmp does not escape and is overwritten a bunch of times, please declare inside the for loop, it makes its transient nature clearer.

768

please factor l767-777 into a meaningfully named helper function

783

Please split in half and refactor further:

// Single index
if (indices.count() == 1) {
  genForLoop(args,  [&](OpBuilder &b, Location, Value iv, ValueRange) {
    genLocals(merger, codegen, b, ...);
  };); // where lambda is a scoped body builder to be passed to the scf::ForOp builder
  return;
}

genWhileLoop(
  args, 
  [&](OpBuilder &b, Location, Value iv, ValueRange){
    genLocals(merger, codegen, b, ...);
        for (unsigned lj : merger.set(lts)) {
      if (li == lj || merger.latGT(li, lj)) {
        LatPoint latj = merger.lat(lj);
        tmp = latj.bits;
        tmp ^= lati.bits;
        if (merger.hasAnyOf(tmp, false))
          continue; // dense exhausted within if/else
        // Recurse into body of each branch.
        genIf(merger, codegen, rewriter, op, idx, latj.bits, ifOp, [](){
          ifBodyBuilder
        });
        genStmt(merger, codegen, rewriter, op, topSort, latj.exp, at + 1);
      }
    }
  }, /* regionBuilder1 */
  [&](OpBuilder &b, Location, Value iv, ValueRange){}, /* regionBuilder2 */
  );

In subsequent revisions, we can continue refactoring to give WhileOp a proper builder with lambdas.
Then all insertion point manipulations can be isolated in that builder.

788

This is confusing due to the goto-style insertion point manipulation across function boundaries.
Please refactor to use the structured scf::IfOp builder:

OpBuilderDAG<(ins "Value":$cond,
  CArg<"function_ref<void(OpBuilder &, Location)>",
       "buildTerminatedBody">:$thenBuilder,
  CArg<"function_ref<void(OpBuilder &, Location)>",
       "nullptr">:$elseBuilder)>

with proper helper functions / lambda to make it read more naturally.

Additionally, since this is only used in the while case, I expect it will compose nicely with the above suggestion to show the real structure that you emit in C++:

/// dense iterator
emitForOp(..., [](){
  body
  no conditional
})
return;

/// coiteration
emitWhileOp(..., [](...){}, [](...){
  body 
  emitIfOp(..., [](..){
    body
  });
});
804

Let's please get rid of goto-style insertion point manipulation across function boundaries.

genWhileInduction should be lifted as a lambda passed to genWhileLoop.

This revision is now accepted and ready to land.Nov 17 2020, 6:30 AM
aartbik marked 5 inline comments as done.Nov 17 2020, 11:04 AM

I addressed your remaining comments, except that I am pushing back (for now) on the insertion point vs. builder issue. Most obvious, since we don't have such a builder for while yet. But also, I am not convinced it will necessarily make for clearer code. Right now, every method leaves the insertion "cursor" at the right place for the next, even if that cursor can be something different (inside for, while, or if). I am not sure builders will improve on that, but could be convinced otherwise by trying once we actually have those abstractions. But let's not keep stalling this CL by trying to obtain a perfection that is also not met in other parts yet ;-)

aartbik updated this revision to Diff 305858.Nov 17 2020, 11:06 AM

addressed remaining comments (except builders)
also rebased with the new Affine utils that make for shorter code

ftynse added inline comments.Nov 17 2020, 11:41 AM
mlir/lib/Dialect/Linalg/Transforms/Sparsification.cpp
20–23

A bit more detail on what is the iteration graph and lattice would be welcome for future readers :)

428–431

Maybe use an alloca instead? If it is rewritten anyway, we shouldn't care if it's stack or heap, and we wouldn't be creating code that leaks memory.

783

In subsequent revisions, we can continue refactoring to give WhileOp a proper builder with lambdas.

The only reason why I did not add such a builder was the absence of uses, and we don't want untested code.

aartbik marked 3 inline comments as done.Nov 17 2020, 12:06 PM
aartbik added inline comments.
mlir/lib/Dialect/Linalg/Transforms/Sparsification.cpp
20–23

Added a bit more detail.

783

I already am very grateful you added the while loops in the first place, which saved me a lot of time. I am happy to develop more infrastructure as needed.

aartbik updated this revision to Diff 305878.Nov 17 2020, 12:18 PM
aartbik marked 2 inline comments as done.

alloc -> alloca, added a few more comments in top description

ftynse accepted this revision.Nov 17 2020, 12:53 PM
This revision was automatically updated to reflect the committed changes.
jjolly added a subscriber: jjolly.Nov 18 2020, 8:29 AM