This is an archive of the discontinued LLVM Phabricator instance.

[mlir][sparse] Factoring out `finalizeSegment` and (generic) `appendIndex`
ClosedPublic

Authored by wrengr on Mar 28 2022, 5:31 PM.

Details

Summary

This change introduces two new methods, finalizeSegment and appendIndex, and removes three old methods: endDim, appendCurrentPointer, and the previous appendIndex. The two new methods better encapsulate their algorithms, which makes it possible to remove repetitious code in several other places.
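For orientation, the shape of the new interface that the review below converges on is roughly the following. This is a hedged sketch reconstructed from the discussion; the exact signatures in SparseTensorUtils.cpp may differ.

  // Close out `count` segments of dimension d, of which `full` entries have
  // already been written: record the segment end for compressed dimensions,
  // or zero-pad the remainder for dense ones.
  void finalizeSegment(uint64_t d, uint64_t full = 0, uint64_t count = 1);

  // Write index i at dimension d: append to the indices array if the
  // dimension is compressed, or zero-pad (via finalizeSegment) if dense.
  void appendIndex(uint64_t d, uint64_t full, uint64_t i);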

Depends On D122435

Diff Detail

Event Timeline

wrengr created this revision.Mar 28 2022, 5:31 PM
wrengr requested review of this revision.Mar 28 2022, 5:31 PM
wrengr updated this revision to Diff 418746.Mar 28 2022, 5:49 PM

cleaning up a comment

aartbik added inline comments.Mar 30 2022, 1:10 PM
mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
436

this should be at the place of endDim, just to make the delta in this review easier to follow.

typically when refactoring:
(1) changes remain in place, so the diff only shows the delta
(2) code that moves goes in a separate CL, with no changes to the moved code

475

I liked having appendPointer and appendIndex next to each other, but this pushes it way down. Let's keep it at the original place, which also makes the delta for the review a bit easier to follow than when code moves around like this.

wrengr updated this revision to Diff 419260.Mar 30 2022, 2:04 PM
wrengr marked 2 inline comments as done.

Rebasing for changes to D122435, and addressing comments

aartbik added inline comments.Apr 1 2022, 3:44 PM
mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
272

ah, an overflow check; please add that as a message string at the end of the assertion, to make the semantics more clear

(and other parts of MLIR will break long before this as I found out earlier ;-)

432

I like this generalization with count as parameter to all appending methods. Just checking, I assume that insert(x.end()) and x.push_back() have similar STL performance?

445

This was of course there before, but for uint64_t I, this is a no-op check.
Shall we do the comparison in even higher precision, to detect running out of that space?
(Very unlikely of course, but just an observation while we are here.)

495

it would be even shorter if we let appendIndex return the next logical value for full so that this would be

full = appendIndex(d, full, i)

and have all the compressed/dense logic hidden inside the method.

wdyt?

537–544

One of the motivations given for this change is to avoid repeating code. However, in the original code, we only had endDim() pushing zeros into the values array, and I was very used to that invariant. Now both appendIndex and finalizeSegment have this logic. It feels like this part should actually call into appendIndex instead somehow. Do you see a chance to ensure we only add to values in one place?

wrengr marked an inline comment as done.Apr 1 2022, 7:40 PM
wrengr added inline comments.
mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
272

Will do. Since I do this check in a couple of places across future CLs, I think I'll factor it out into a top-level (inline) function too.

Since the division is expensive (and the error condition rare), does MLIR have a preprocessor flag for "I need extra help debugging, so please turn on even more assertions than usual"?

432

I'm not entirely sure. I've found it notoriously difficult to get much information about STL implementations, since there are multiple vendors/versions of it. The specification https://en.cppreference.com/w/cpp/container/vector/insert requires this usage to be linear in count, but says nothing about the constant factors. I'd imagine in the worst case the overhead when count=1 is just setting up a for-loop that runs only once (since push_back would also have the same check to prove to itself that the capacity doesn't need resizing). For the cases where count is greater than one, this should be a clear win over calling push_back within a loop, since it only needs to check the capacity once rather than on every iteration, and since it might possibly do tricks like the following...

Ideally the implementation would use something like memset/memcpy to handle this, since those are usually optimized for architecture-specific issues regarding alignment and vectorization. But I'm pretty new to C++, so I'm not sure whether a random STL vendor can be relied on to do that sort of thing. Once we have some benchmarking tools in place, we can always check whether it'd be worth rolling our own implementation.

For this particular function, the count generalization is just so finalizeSegment doesn't need to repeat the P-validity checking logic; whereas the other call-sites in this and future CLs always use the count=1 default. If vec.insert(vec.end(), 1, val) ends up being notably slower than vec.push_back(val) then I can always go back to the version of the code where this was inlined into finalizeSegment.
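For reference, here is a minimal self-contained comparison of the two patterns being discussed; none of it is taken from the actual patch, it just shows that the bulk insert and the push_back loop produce the same contents while only the former performs a single capacity check.

  #include <cstdint>
  #include <vector>

  int main() {
    std::vector<uint64_t> bulk, looped;
    const uint64_t count = 5, val = 0;
    // The bulk pattern discussed above: one call, one capacity check, then a fill.
    bulk.insert(bulk.end(), count, val);
    // Equivalent push_back loop: the capacity is re-checked on every iteration.
    for (uint64_t k = 0; k < count; ++k)
      looped.push_back(val);
    return bulk == looped ? 0 : 1; // both vectors hold five zeros
  }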

445

Hopefully when I=uint64_t the assertion will get optimized away entirely (since the compiler should be able to detect that it's always true). Lifting both operands to a higher precision wouldn't change anything: it'd still always be true, just with a few more zeros in front.
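For concreteness, the kind of check being discussed is presumably of the following form (a hypothetical sketch, not the actual assertion in the patch):

  #include <cassert>
  #include <cstdint>
  #include <limits>

  // When I is uint64_t the comparison is trivially true, so the compiler
  // is free to optimize the assertion away entirely.
  template <typename I>
  void checkIndexFits(uint64_t i) {
    assert(i <= static_cast<uint64_t>(std::numeric_limits<I>::max()) &&
           "Index value is too large for the index type");
  }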

537–544

finalizeSegment only ever pushes a bunch of zeros to values and/or calls appendPointer, exactly like endDim used to. Logically finalizeSegment is doing the same thing as endDim did; it's just doing so more generally and more efficiently. Re generality: endDim could only handle the case of completely empty dimensions (initial full is zero), whereas finalizeSegment can handle cases where the initial full isn't zero, as occurs in endPath and the part of appendIndex that handles insPath. Re efficiency: endDim had a bunch of for-loops to repeatedly call the same function (not depending on the induction variable), whereas finalizeSegment accumulates those repeated calls into its count parameter, so we only call finalizeSegment at most rank *times* (rather than at most rank *depth*), and when we hit the base case all the count-many calls to push_back get combined into a single insert call (thus saving the overhead of capacity-resize checks, improving memory locality, etc).

Whereas appendIndex writes an arbitrary index to the indices array (if compressed), which is a completely different thing. The main change here over the previous version of appendIndex is that it can now also handle the case where the dimension has dense storage, in which case we insert the appropriate number of zero values. Previously this situation was handled by insPath, but I think it makes more sense as part of appendIndex, since the zero insertion is logically being done for the same reasons as writing to the indices array. As for code duplication, appendIndex calls finalizeSegment to perform the zero padding (just as insPath previously called endDim to perform it).

There is only a slight amount of code duplication, namely the one line for handling the d+1==rank case. This occurs because I adjusted the inductive hypothesis, so both finalizeSegment and appendIndex only take in d such that d<rank (matching the convention of most if not all the other methods), as opposed to the previous situation where d<=rank was allowed. While the duplication could be removed by reverting the inductive hypothesis to d<=rank, I think the d<rank version is the right way to go. What would it mean for client code to call finalizeSegment(rank)? There is no dimension numbered rank, so there are no segments of that dimension to be finalized, so I can't think of anything sensible that call could mean.
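To make the structure described above concrete, here is a rough self-contained sketch of the recursion. The member names and types are hypothetical stand-ins chosen for illustration, not the actual class in SparseTensorUtils.cpp.

  #include <cstdint>
  #include <vector>

  struct StorageSketch {
    uint64_t rank;
    std::vector<uint64_t> sizes;                 // dimension sizes
    std::vector<bool> compressedDim;             // per-dimension storage kind
    std::vector<std::vector<uint64_t>> pointers; // per-dimension pointer arrays
    std::vector<std::vector<uint64_t>> indices;  // per-dimension index arrays
    std::vector<double> values;

    void appendPointer(uint64_t d, uint64_t pos, uint64_t count = 1) {
      pointers[d].insert(pointers[d].end(), count, pos);
    }

    // Close out `count` segments of dimension d, of which `full` entries
    // were already written; requires d < rank.
    void finalizeSegment(uint64_t d, uint64_t full = 0, uint64_t count = 1) {
      if (count == 0)
        return;
      if (compressedDim[d]) {
        appendPointer(d, indices[d].size(), count); // record the segment end
      } else {
        count *= (sizes[d] - full); // remaining slots per dense segment
        if (d + 1 == rank)
          values.insert(values.end(), count, 0); // base case: one bulk zero-pad
        else
          finalizeSegment(d + 1, 0, count); // accumulate instead of looping
      }
    }
  };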

wrengr updated this revision to Diff 420301.Apr 4 2022, 2:04 PM
wrengr marked 3 inline comments as done.

Addressing comments

mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
495

I like the idea, though I'm not entirely sure of the semantics; i.e., how to document the semantics so as to retain the appropriate encapsulation. I've taken a first pass at it, so let me know if you think it'd be better phrased another way.

aartbik added inline comments.Apr 4 2022, 2:45 PM
mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
272

I am not aware of such a flag, but others may know.

But here, rather than passing by reference, how about returning the result so we can do

sz = checkedMulAssign(sz, sizes[r]);

but it is also more general, since the result does not have to be stored in the same variable as "lhs".

other = checkedMulAssign(sz, sizes[r]);

432

Thanks for checking. I am not overly worried, and we can indeed always profile and improve later.

445

Ah, yes, never mind. I was hoping to somehow detect "about to wrap-around" from i to i+1, but here we already deal with the new index value, so that will not work.

462

this is technically the index set [full..full+i], but with space for the new current element, right?
I also would not say a cardinality of i+1, since that is only the case when we start at 0?

536

please put brackets around the subtraction just for my sanity ;-)

wrengr updated this revision to Diff 420340.Apr 4 2022, 4:12 PM
wrengr marked 5 inline comments as done.

addressing comments

mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
272

Will do. Though it's worth noting that I phrased it like operator*= rather than operator* for a couple reasons: (1) I feel like it helps clarify/allow the lhs!=0 precondition, whereas for operator* I think it'd be more appropriate to check whether lhs==0 rather than assuming it; (2) it's really just intended for the case where we're accumulating a bunch of things and thus overflow becomes more likely, rather than being a more general solution for detecting overflows elsewhere.
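A minimal sketch of what such a helper could look like (hypothetical name and form, not the actual code in SparseTensorUtils.cpp), returning the result as suggested above, checking lhs==0 rather than assuming it (per the operator* phrasing), and keeping the error string and the division inside the assertion so they only cost anything in debug builds:

  #include <cassert>
  #include <cstdint>

  // Overflow-checked multiply; the quotient check is confined to the assert.
  static inline uint64_t checkedMul(uint64_t lhs, uint64_t rhs) {
    const uint64_t result = lhs * rhs;
    assert((lhs == 0 || result / lhs == rhs) && "Integer overflow detected");
    return result;
  }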

462

Why would it be [full..full+i]? If full means the number of entries already written, then before calling the method we've already written full many entries (for indices [0..full-1]); within the method we write another i-full many entries (for indices [full..i-1]), thus together we've written a total of [0..full-1]++[full..i-1] == [0..i-1].

In any case, this weird off-by-one relationship (between the i parameter to appendIndex and the return value which becomes the full for the next iteration of the while-loop in fromCOO) is what I meant when I complained about the semantics being unclear / hard to explain. If full really meant the number of entries/zeros written, then we would return full + (i - full) == i as the new number of entries/zeros written (which matches the [0..i-1] derived above); but we don't, we return i+1 instead. And if full really meant the next unwritten index (since it starts at 0 in the fromCOO while-loop, yet we haven't written the 0th index yet), then the compressed case should also return i+1 (since the ith index has now been written); but it doesn't, it returns full instead. This inconsistency was already there in the original code, so maybe it's a bug rather than intentional; I was just making sure to retain the previous behavior through this refactoring.

wrengr updated this revision to Diff 420346.Apr 4 2022, 4:38 PM

Re-adjusting the return value of appendIndex.

mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
462

Fwiw, I just checked, and having the compressed case return i+1 passes all our tests. That makes sense, since appendIndex ignores full for compressed dimensions anyway. Of course, the uniformity of that means we could just have full = i + 1; at the callsite in fromCOO, rather than returning anything from this method.

Conversely, having the dense case return i causes the assertion at the end of toCOO to fail for three tests, and causes six others to fail FileCheck. (All of this was expected, since I played around with that before.)

aartbik accepted this revision.Apr 4 2022, 5:08 PM
aartbik added inline comments.
mlir/lib/ExecutionEngine/SparseTensorUtils.cpp
272

Yeah, I figured there would be limited use cases, but the reference parameter may still surprise a future reader. Thanks for the change.

462

Ai, I hate it when typos obscure my point ;-). I obviously meant that we wrote out [full, i], as in the original for loop, so I understand that afterwards the full [0..i-1] are written, but one method invocation may write fewer than that if full was > 0 to start with. But since you reverted back to no return, the point is moot.

This revision is now accepted and ready to land.Apr 4 2022, 5:08 PM
wrengr marked 3 inline comments as done.Apr 4 2022, 5:48 PM