[mlir][sparse] Use incremental topological sort to compute loop order from iteration graph
Needs Review · Public

Authored by streichgeorg on May 7 2023, 3:23 AM.

Details

Summary

Refactor the iteration graph code a bit so that the loop order can be computed incrementally and constraints that would create cycles can be skipped. This makes it possible to apply constraints in a more fine-grained way. The current implementation increases the asymptotic complexity of the procedure, so it may need some optimization. Some other improvements I have in mind are the following (see the sketch after this list):

  • Add loop-type constraints (filter < parallel < reduction) via edges as well, instead of enforcing them implicitly.
  • Relax constraints for dense tensors so that we only require the innermost dimension to be last. As far as I can tell, there is no need to have a specific ordering between dimensions otherwise.
  • Add the constraints for a single dense tensor as a group (i.e., only add them if all of them can be satisfied).
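As a concrete sketch of the incremental idea, consider the following simplified C++ (hypothetical names such as IterationGraph and tryAddConstraint; this is not the actual revision's code). A constraint is offered as an edge and skipped when it would close a cycle; offering higher-priority constraints first means only lower-priority ones get dropped on conflict. The per-insertion DFS reachability check is the source of the increased asymptotic complexity mentioned above.

```cpp
// Minimal sketch of the incremental approach (hypothetical names, not the
// actual sparse-compiler API). Each constraint "loop a outside loop b" is
// an edge a -> b; an edge that would close a cycle is skipped. The DFS
// reachability check makes each insertion O(V + E).
#include <queue>
#include <vector>

class IterationGraph {
public:
  explicit IterationGraph(unsigned numLoops)
      : succ(numLoops), inDegree(numLoops, 0) {}

  // Try to add the constraint "loop a must be placed outside loop b".
  // Returns false and leaves the graph unchanged if a path b -> ... -> a
  // already exists, i.e., the new edge would create a cycle.
  bool tryAddConstraint(unsigned a, unsigned b) {
    if (reachable(b, a))
      return false;
    succ[a].push_back(b);
    ++inDegree[b];
    return true;
  }

  // Kahn's algorithm. Since cycle-closing edges were never added, the
  // graph is a DAG and the returned order always contains every loop.
  std::vector<unsigned> topologicalSort() const {
    std::vector<unsigned> degree = inDegree;
    std::vector<unsigned> order;
    std::queue<unsigned> ready;
    for (unsigned l = 0; l < degree.size(); ++l)
      if (degree[l] == 0)
        ready.push(l);
    while (!ready.empty()) {
      unsigned l = ready.front();
      ready.pop();
      order.push_back(l);
      for (unsigned s : succ[l])
        if (--degree[s] == 0)
          ready.push(s);
    }
    return order;
  }

private:
  // DFS: is `to` reachable from `from` via existing constraint edges?
  bool reachable(unsigned from, unsigned to) const {
    std::vector<bool> seen(succ.size(), false);
    std::vector<unsigned> stack{from};
    while (!stack.empty()) {
      unsigned cur = stack.back();
      stack.pop_back();
      if (cur == to)
        return true;
      if (seen[cur])
        continue;
      seen[cur] = true;
      for (unsigned s : succ[cur])
        stack.push_back(s);
    }
    return false;
  }

  std::vector<std::vector<unsigned>> succ; // edges: loop -> loops inside it
  std::vector<unsigned> inDegree;          // incoming-edge counts for Kahn
};
```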

At the moment, one test fails when running check-mlir. This is because some of the undef constraints can now be applied for conv2d, which moves the sparse loop outside.

Update:
I changed the conv2d check to match the code that is now generated.

Diff Detail

Event Timeline

streichgeorg created this revision. · May 7 2023, 3:23 AM
Herald added a project: Restricted Project.
streichgeorg edited the summary of this revision. · May 7 2023, 7:50 AM
streichgeorg edited the summary of this revision. · May 7 2023, 7:53 AM
streichgeorg published this revision for review. · May 9 2023, 11:05 PM

Squash commits

Thanks for contributing to the sparse compiler effort!

At the moment when running check-mlir one test fails. This is because some of the undef constraints can now be applied for conv2d, and the new loop order is an actual improvement (because it moves sparse loops outside).

Can you please update the CHECK test too in that case (or did you? I see a test updated, but also this comment, hence the question). Also, did you measure any actual improvements? The sparse-out is intuitive, but it does not always really result in performance gains (it is still O(nnz * n) vs. O(n * nnz)). Having some measurements to back up the claim would help.

aartbik retitled this revision from Use incremental topological sort to compute loop order from iteration graph to [mlir][sparse] Use incremental topological sort to compute loop order from iteration graph. · May 13 2023, 7:41 PM

Thank you for your reply!

I did update the check but forgot to update my comment accordingly. With the updated check, all tests in check-mlir are passing.

As for the improvement, I was only considering how many constraints are satisfied by each loop ordering, but did not look at any performance measurements. I will try to benchmark the two versions and report my findings.

streichgeorg edited the summary of this revision. · May 17 2023, 11:54 AM
Peiming added a subscriber: Peiming. · Edited · May 22 2023, 10:41 AM

Relax constraints for dense tensors so that we only require the innermost dimension to be last. As far as I can tell, there is no need to have a specific ordering between dimensions otherwise.

This might not be true. If the cache line is large enough to hold multiple rows, it is better to visit those rows in order too. So you might still want to ensure the ordering for all dimensions when that is satisfiable; what you propose could serve as a fallback order when it is not.
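To make the cache-line argument concrete, here is a minimal, hypothetical kernel (not compiler output; the sizes and the 64-byte line are assumptions). Both loop nests keep the innermost dimension k last, but only the first also preserves the outer dimension order, so each cache line holding two short K-rows is consumed in a single pass:

```cpp
// Hypothetical illustration (not generated code). With K = 4 doubles, one
// 64-byte cache line holds two K-rows (assuming aligned rows). Both nests
// keep k innermost, but the (j, i, k) order touches the two halves of such
// a line at widely separated times, while (i, j, k) uses them back to back.
#include <cstddef>

constexpr size_t I = 1024, J = 1024, K = 4;

double sumInOrder(const double *a) {   // full dimension order preserved
  double s = 0;
  for (size_t i = 0; i < I; ++i)
    for (size_t j = 0; j < J; ++j)
      for (size_t k = 0; k < K; ++k)
        s += a[(i * J + j) * K + k];
  return s;
}

double sumPermuted(const double *a) {  // only the innermost dim pinned last
  double s = 0;
  for (size_t j = 0; j < J; ++j)
    for (size_t i = 0; i < I; ++i)
      for (size_t k = 0; k < K; ++k)
        s += a[(i * J + j) * K + k];
  return s;
}
```

In the permuted order, the two halves of each line are touched roughly I inner iterations apart, so the line is typically evicted and refetched in between.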

Also, FYI: the filter loop might be removed sometime in the future, because we have a better implementation to replace it.

Quick update: I managed to get the Python benchmarking script working and added a convolution kernel as well as more matrix-multiplication configurations. As far as I can tell, there is no performance gain from the new loop order this change introduces (at least for the sizes and sparsity factors I tried).

Furthermore, I also tried leaving out the 'undef' constraints entirely, which seems to improve performance on the convolution kernel a bit but causes a regression for matrix multiplication. As a next step, I will try to figure out why that is and come up with a configuration that is fast for both.

This might not be true. If the cache line is large enough to hold multiple rows, it is better to visit those rows in order too. So you might still want to ensure the ordering for all dimensions when that is satisfiable; what you propose could serve as a fallback order when it is not.

True, I did not consider the case where dimensions are small.

We really appreciate your contributions, but we are a bit reluctant to accept this revision. It seems to over-engineer something that is currently not a bottleneck for us at all, and swapping in new code always carries the risk of breaking something unexpected.

aartbik resigned from this revision. · Aug 16 2023, 2:48 PM