Page MenuHomePhabricator

[mlir][sparse] add parallelization strategies to sparse compiler

Authored by aartbik on Nov 23 2020, 10:01 AM.



This CL adds the ability to request different parallelization strategies
for the generate code. Every "parallel" loop is a candidate, and converted
to a parallel op if it is an actual for-loop (not a while) and the strategy
allows dense/sparse outer/inner parallelization.

This will connect directly with the work of @ezhulenev on parallel loops.

Still TBD: vectorization strategy

Diff Detail

Event Timeline

aartbik created this revision.Nov 23 2020, 10:01 AM
aartbik requested review of this revision.Nov 23 2020, 10:01 AM
aartbik edited the summary of this revision. (Show Details)Nov 23 2020, 11:15 AM

For ease of reviewing, I split out the invariant generalization in its own review:

penpornk added inline comments.Nov 24 2020, 1:21 PM

Typo: s/stategy/strategy/


Maybe explain the types of loop first just so it's clear for people who are not familiar with TACO?

  • For loops that only iterates one tensor. (Can be called dense/sparse loop based on the compression of that tensor dimension.)
  • For loops that co-iterates tensors (at most one tensor can have sparse storage for that dimension).
  • While loops co-iterating tensors (more than one tensor have sparse storage for that dimension -- parallelization not supported).

Nit: Comma after e.g.


Nit: Missing a close parenthesis.


Typo: s/it it/it is/


Typo: s/loop loop/loop/


Please bear with my noob questions:
It looks like highs and sizes store the inferred dimension sizes. What exactly is lo (and pidxs and loops)? If it's the starting point of a dimension, isn't it 0? Why are we updating pidxs and loops through parOp.getInductionVars()?

Also, why do we need highs to be per tensor for sparse but sizes can be just per dimension? (If multiple tensors co-iterate on a dimension, the size should match anyway.)

It'd be great if you could help add more explanations about them (or examples) to struct CodeGen. (Or I can do it later once I'm more familiar with the code.)

1133 ↗(On Diff #307112)

Does alloca fill the allocated memory with 0s? If not, X has not been initialized to 0 before the loop, and the loop only set its values where A is nonzero.


If in the SCALE case (e.g., trait_ss) the inner loop can be parallelized, then I think in this case it could be parallelized too, because the amount of work solely depends on A.

aartbik marked 8 inline comments as done.Nov 24 2020, 4:11 PM
aartbik added inline comments.

Added some explanation at the top level.


Yes, I can add some more comments to the code, since it has become less self-explanatory then when I started this. In a nutshell, yes some of these are fixed before hand (like sizes), others are updated as we go (so that loops after loops iterate from where the previous loop left the induction).

1133 ↗(On Diff #307112)

None of these buffers are properly initialized. This is the temporary "local bufferization" solution we talked about earlier. Leaving these stubs here allows me to quickly hand covert this into something working. But eventually this will be replaced by proper initialization code (reading from file for example).


Well in this case, the inner loop is a "reduction". But you are right this could be parallelized too with a parallel tree reduction. And in fact, the "scf.parallel" operation supports the "scf.reduce" we may start using in the future....

aartbik updated this revision to Diff 307474.Nov 24 2020, 4:12 PM
aartbik marked 3 inline comments as done.

rebased + addressed comments

penpornk accepted this revision.Nov 24 2020, 4:41 PM
penpornk added inline comments.

Thank you very much!


Oh, right. I forgot that it is doing reduction so it's out of scope for now. Thank you for the clarification! :)

This revision is now accepted and ready to land.Nov 24 2020, 4:41 PM

Thanks for your review, Penporn! Keep those useful comments coming!