This is an archive of the discontinued LLVM Phabricator instance.

[mlir][sparse] Factoring out an enumerator over elements of SparseTensorStorage
ClosedPublic

Authored by wrengr on Mar 18 2022, 8:07 PM.

Diff Detail

Event Timeline

wrengr created this revision.Mar 18 2022, 8:07 PM
wrengr requested review of this revision.Mar 18 2022, 8:07 PM
wrengr updated this revision to Diff 417400.Mar 22 2022, 2:13 PM

rebase, and fixing build error

wrengr updated this revision to Diff 417404.Mar 22 2022, 2:19 PM

redoing the diff upload

aartbik added inline comments.Mar 28 2022, 4:34 PM
mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
315

Note that the //==---- style of separation was only used for comments that introduce a whole new section. This is the first time it is used for just a class comment and it feels a bit out of place. So I would remove L239.

321

Since we are moving towards documenting more, perhaps some more information around L174, the base class, would be in place now too.

323

I like how you place this V-interface in between the typeless base class and the full tensor storage, very elegant!

327–328

Please do not refer to design doc (which is not accessible to outside world most likely and may go out of date). Also, in general, let's just document what we did, not what we could have done ;-)

335

how about making the members rev/sizes etc. part of the base class and making these non-virtual, inline getters instead? It will require a "super" constructor, of course, but it would make it a bit clearer who owns what
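The suggestion above can be sketched as follows; the member and class names are taken from the discussion, but this is an illustrative mock-up, not the actual SparseTensorUtils code: the metadata moves into the base class, is initialized through a "super" constructor, and is exposed via plain non-virtual getters.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative sketch: hoist rev/sizes into the base class and expose them
// through non-virtual inline getters, initialized by a "super" constructor.
class SparseTensorStorageBase {
public:
  SparseTensorStorageBase(std::vector<uint64_t> sizes,
                          std::vector<uint64_t> rev)
      : sizes(std::move(sizes)), rev(std::move(rev)) {}

  uint64_t getRank() const { return sizes.size(); }
  const std::vector<uint64_t> &getDimSizes() const { return sizes; }
  const std::vector<uint64_t> &getRev() const { return rev; }

private:
  const std::vector<uint64_t> sizes; // dimension sizes
  const std::vector<uint64_t> rev;   // "reverse" permutation
};
```

Derived classes would then forward their arguments to this constructor, which also makes ownership of the metadata unambiguous.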

450

if we start using such "end class" comments, let's do it everywhere

wrengr marked 2 inline comments as done.Mar 29 2022, 12:23 PM

Ah, those were notes-to-self for navigating the file during development. I think after D122061 lands we should split the file up to make navigation easier, though I'm still working out where the best splits would be.

wrengr updated this revision to Diff 419834.Apr 1 2022, 12:46 PM
wrengr marked 4 inline comments as done.

Factored out D122928 to address the request for reorganization

mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
321

Any particular documentation you'd like to see there?

327–328

I actually left this note for you :) That is, so that once you got a chance to see the final code we could revisit the designs we talked about and decide which one to go with :)

335

sgtm

wrengr updated this revision to Diff 419875.Apr 1 2022, 3:29 PM

Cleaning up how EnumerableSparseTensorStorage delegates to constructors of its base class.

wrengr updated this revision to Diff 421051.Apr 6 2022, 5:29 PM

Rebasing for D123166. Also removing a bunch of inline keywords, per MLIR style-guide.

aartbik added inline comments.Apr 8 2022, 10:20 AM
mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
456–457

Refresh my C++ knowledge ;-), but do we really need those Base:: qualifiers now? It is a method defined in a superclass, so shouldn't all the magic just work?

wrengr added inline comments.Apr 8 2022, 11:51 AM
mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
456–457

Alas, we do :*( Since the superclass is a template, C++ refuses to look up unqualified names in it, so inherited methods must be qualified. https://www.cs.technion.ac.il/users/yechiel/c++-faq/nondependent-name-lookup-members.html
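A minimal illustration of the lookup rule cited above, using hypothetical names rather than the actual classes: inside a template, unqualified names are not looked up in a dependent base class, so inherited methods must be reached via `Base::` or `this->`.

```cpp
#include <cstdint>
#include <vector>

// Names are illustrative only. The point: unqualified lookup skips dependent
// base classes, so the derived template must qualify inherited members.
template <typename V>
struct StorageBase {
  std::vector<V> values;
  V getValue(uint64_t i) const { return values[i]; }
};

template <typename V>
struct Storage : public StorageBase<V> {
  using Base = StorageBase<V>;
  V firstValue() const {
    // return getValue(0);     // error: not found in the dependent base
    return Base::getValue(0);  // OK; `this->getValue(0)` also works
  }
};
```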

wrengr marked an inline comment as done.Apr 8 2022, 11:51 AM
aartbik added inline comments.Apr 12 2022, 5:52 PM
mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
353–362

do we use this constructor currently? If not, please add it later, when it's used

383–386

given that this is not a public-facing class, do we need to explicitly delete all of these? (I assume you use this so you don't accidentally do the wrong thing)

427

I still find the original code, which first restores the original sizes, a bit more intuitive (during debugging, I always check the factory point as the entry to see if all is right), and it allows us to use the factory method newSparseTensorCOO exclusively, rather than introducing a direct new here. Given how much overhead you introduce elsewhere, is saving the extra loop really worth it?

556

Given that toCOO() is rather central to a lot of previous measured performance operations, can you please do a before/after measurement with some large tensors (see e.g. our pre-print paper), just to make sure that the use of a ElementConsumer callback does not introduce too much overhead?
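For reference, the callback pattern under discussion can be sketched as follows. The names (`Element`, `ElementConsumer`, `forEachElement`) are illustrative stand-ins, not the actual SparseTensorUtils API: the traversal yields each stored element to a consumer instead of materializing a COO container directly, which is where per-element call overhead could appear.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Illustrative element record: coordinates plus value.
template <typename V>
struct Element {
  std::vector<uint64_t> indices;
  V value;
};

// Illustrative consumer type: invoked once per stored element.
template <typename V>
using ElementConsumer = std::function<void(const std::vector<uint64_t> &, V)>;

// Enumerate elements through the callback rather than building a container.
template <typename V>
void forEachElement(const std::vector<Element<V>> &elements,
                    const ElementConsumer<V> &yield) {
  for (const auto &e : elements)
    yield(e.indices, e.value);
}
```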

723

does this need to be protected, or can it even be private, given that you "friend" this?

wrengr updated this revision to Diff 422674.Apr 13 2022, 3:07 PM
wrengr marked 3 inline comments as done.

Addressing comments, fixing an issue about "slicing", and incorporating D122936

wrengr added inline comments.Apr 13 2022, 3:09 PM
mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
383–386

Yep, these are to help prevent doing the wrong thing, since this class captures a reference that could become dangling. IMO it doesn't matter whether the class is public-facing or not; those of us working on the library are only human and thus fallible, so it's still beneficial to have the compiler help guard against mistakes (especially since there's zero runtime cost to doing so!). Though it does look like it's sufficient to explicitly delete only the copy versions, since then the move versions implicitly fall back to the defined(-as-deleted) copy versions.
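The guard described above can be sketched like this (illustrative names, not the actual classes): a class that captures a reference to external storage deletes its copy operations so a copy holding a dangling reference can't be created by accident. Deleting the copy constructor and copy assignment also suppresses the implicit move operations, per the special-member rules.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch: non-copyable because `values` is borrowed storage
// that must outlive this object.
template <typename V>
class Enumerator {
public:
  explicit Enumerator(const std::vector<V> &values) : values(values) {}
  // Deleting the copy operations also suppresses the implicit moves.
  Enumerator(const Enumerator &) = delete;
  Enumerator &operator=(const Enumerator &) = delete;

  uint64_t size() const { return values.size(); }

private:
  const std::vector<V> &values; // borrowed; must outlive this object
};
```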

427

This part of the reorg isn't about improving performance (though that is a consequence), it's about having the right abstractions. I don't think newSparseTensorCOO is a particularly good abstraction. The only thing it does is construct the pushforward array (which is itself a very good abstraction that's also used in several other places), assert non-zero sizes (which can be abstracted into several other places), and then call the constructor. If we factor out the code for constructing pushforward arrays, then newSparseTensorCOO is just a macro: newSparseTensorCOO(rank, sizes, perm, cap) === new SparseTensorCOO(pushforward(rank, perm, sizes), cap). Since SparseTensorEnumerator must already construct the pushforward array for its own reasons, I fail to see what value newSparseTensorCOO adds that would make it worth intentionally duplicating that work.

Note that this is rather different than the situation with newSparseTensor. Because newSparseTensor does relatively unique things like comparing the runtime sizes against the static shape, and in D122061 it's overloaded on the type of the final argument. I'm still not convinced that it's the best abstraction (e.g., because openSparseTensorCOO also compares the runtime sizes against the static shape, which suggests there's a better place to draw the abstraction boundary), but it is at the very least a non-trivial abstraction
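A hedged sketch of the "pushforward" factoring described above; the helper name and signature are from the comment's shorthand, not the actual code. The helper applies a permutation to a sizes array so that `out[perm[r]] = sizes[r]`, after which `newSparseTensorCOO` reduces to constructing the pushed-forward sizes and calling the constructor.

```cpp
#include <cstdint>
#include <vector>

// Illustrative helper: push a sizes array forward through a permutation,
// i.e. out[perm[r]] = sizes[r] for each source dimension r.
static std::vector<uint64_t> pushforward(const std::vector<uint64_t> &perm,
                                         const std::vector<uint64_t> &sizes) {
  std::vector<uint64_t> out(sizes.size());
  for (uint64_t r = 0, rank = sizes.size(); r < rank; ++r)
    out[perm[r]] = sizes[r];
  return out;
}
```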

556

Will do. Can you send me the shell script you used for running those experiments?

723

Yeah, they can be private. I just used protected because that seems more legible to me (oh how I wish there were a way to restrict "friendship" to only these specific methods)

Thanks Wren. If the performance results are good, this is getting close to LGTM.

mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
427

Yeah, I am not pushing back against this part of the change per se, I just like the invariant that I had in newSparseTensorCOO during debugging. I can get used to looking for another invariant in the new constructor ;-)

556

I don't think any of my experiments directly apply, since I measured reading tensors. I think you will need to tightly time some of the conversions, in particular around the toCOO method. I am just curious if the function call shows up at all or not.

I find

auto t_start = std::chrono::steady_clock::now();
...
auto t_end = std::chrono::steady_clock::now();

extremely useful to time tight sections of code, and report very specific timings for a small set of instructions.
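A self-contained helper built on the idiom above (the helper name is ours, not part of the MLIR runtime): wrap an arbitrary callable between two steady_clock samples and return the elapsed microseconds.

```cpp
#include <chrono>

// Time a callable with steady_clock; returns elapsed microseconds.
template <typename F>
long long timeMicros(F &&f) {
  auto t_start = std::chrono::steady_clock::now();
  f();
  auto t_end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(t_end - t_start)
      .count();
}
```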

723

Ah, I see your point in making it apply to just one section. But yeah, this is fine too; it causes less confusion than seeing a protected section.

wrengr updated this revision to Diff 422704.Apr 13 2022, 5:39 PM
wrengr marked 3 inline comments as done.

Removed the intermediate class EnumerableSparseTensorStorage<V>.

I finally figured out how to reuse SparseTensorStorageBase::getValues in lieu of the EnumerableSparseTensorStorage<V>::getValue method. So I've moved the other two methods into the SparseTensorStorageBase class itself. This loses a little bit of type safety since the SparseTensorEnumerator constructor now has a new type invariant. But it means the SparseTensorStorage class no longer needs to qualify all the inherited methods, which is a considerable amount of cleanup.

mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
556

Ah, I was hoping it was scripted away rather than requiring such invasive checking. I'll see if I can't find some decently reliable way to test it. (I'm used to using https://github.com/haskell/criterion which handles all manner of complications re proper benchmarking; though I've no idea whether C++ has anything remotely analogous.)

wrengr updated this revision to Diff 422716.Apr 13 2022, 7:07 PM

rebasing to fix spurious build failure

Do you have some experimental validation to report before we proceed with this?

Unfortunately the benchmarks have been... annoying. I finished writing them up and ran them at the end of last week, and they showed <1% slowdown. I was going to post a comment to that effect on Monday, but I wanted to rerun them just to be sure, and even though I hadn't touched the code (neither for this CL, nor for the benchmark itself, nor rebasing for the recent upstream changes), suddenly they showed a 2~15% slowdown. That has undermined my belief in the credibility of the benchmarks. So this week I've been trying to figure out how to improve their reliability, as well as trying to track down where the slowdown is coming from (assuming it's not spurious).

wrengr added a comment.May 6 2022, 6:15 PM

After banging away at things, I seem to have come up with a version that has -4.82~-6.79% slowdown (i.e., 5~7% speedup). I need to check a few more things to make sure these results are actually valid, then I'll upload the new version.

wrengr updated this revision to Diff 428751.May 11 2022, 1:09 PM

Refactoring to minimize overhead (namely splitting the enumerator class up so that we can avoid the cost of virtual-method calls within the loop-nest). Current benchmarks indicate this differential has no statistically significant difference in cpu time compared to the baseline; or on occasion is somewhat faster than the baseline.

Also rebasing to incorporate recent changes (D124502, D124875, D124475).
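The refactoring described in this update can be sketched as follows (illustrative names only): the hot loop is parameterized on the consumer type, so each per-element call is statically bound and inlinable, instead of going through a virtual method on every iteration of the loop nest.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch: templating on the consumer type lets the compiler
// inline the per-element call; a virtual call per element could not be
// devirtualized as easily.
template <typename V, typename Consumer>
void enumerateValues(const std::vector<V> &values, Consumer &&yield) {
  for (uint64_t i = 0, n = values.size(); i < n; ++i)
    yield(i, values[i]); // statically bound; inlinable
}
```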

wrengr marked an inline comment as done.May 11 2022, 1:12 PM
wrengr added inline comments.
mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
556

I seem to have finally made the benchmarks report more stable numbers. There's still more variation than I would like, but they reliably show this differential to have the same (or better) performance. Only in rare cases is there any regression, and those cases are still <1%.

wrengr updated this revision to Diff 428797.May 11 2022, 4:00 PM

Rerunning git-clang-format

aartbik accepted this revision.May 12 2022, 12:00 PM

Thanks for your patience during the review, Wren. It has been a long road, but nice to see this new abstraction!

This revision is now accepted and ready to land.May 12 2022, 12:00 PM
This revision was landed with ongoing or failed builds.May 12 2022, 5:06 PM
This revision was automatically updated to reflect the committed changes.