Page MenuHomePhabricator

Treat RangeDataVector as an augmented BST
ClosedPublic

Authored by unnar on Feb 18 2020, 3:48 AM.

Details

Summary

Since RangeDataVector is assumed to always be sorted we can treat it as an flattened BST and augment it with additional information about the ranges belonging to each "subtree". By storing the maximum endpoint in every subtree we can query for intervals in O(log n) time.

Note: this is not a complete patch, I just wanted to put it out there and see how you feel about this. Also what kind of testing could and should be done for this.

Diff Detail

Event Timeline

unnar created this revision.Feb 18 2020, 3:48 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 18 2020, 3:48 AM
jarin added a subscriber: jarin.Feb 18 2020, 5:38 AM

Thanks for putting this together, some comments below. Let us see what Pavel thinks.

lldb/include/lldb/Utility/RangeMap.h
437

BST -> binary search tree

445

Here, B() should be the min value of type B, no? Perhaps this should be std::numeric_limits<B>::min() instead of B()?

527

Hmm, weird, I am surprised this is not std::vector<T> &indexes (I realize this was in the code before).

608

I am guessing this should have the m_ prefix?

I like this idea a lot, in principle. It is much simpler than a full blown interval tree, and it should give us similar performance characteristics.

Have you done a proper complexity analysis here? I doubt the O(log n) claim is true in general. It would have to be at least O(m + log n) (m - number of elements found), but it's not clear to me whether even this is true in general. (However, I believe this can achieve ~~log(n) for non-degenerate cases.)

The implementation itself needs some work though. My incomplete list of comments is:

  • replace int with size_t and closed intervals with half-open ones
  • let's move the computation of the upper bound into the "Sort" function. sorting is O(n log(n)), this is O(n) -- we can just hide it there.
  • make private functions private
  • we should avoid the business of figuring out what is the suitable "minimum" value of B by ensuring we call the recursive function on non-empty intervals
  • clang-format the patch

For testing you should add some c++ unit tests for the relevant interfaces.

jarin added a comment.Feb 18 2020, 6:36 AM

I like this idea a lot, in principle. It is much simpler than a full blown interval tree, and it should give us similar performance characteristics.

Have you done a proper complexity analysis here? I doubt the O(log n) claim is true in general. It would have to be at least O(m + log n) (m - number of elements found), but it's not clear to me whether even this is true in general. (However, I believe this can achieve ~~log(n) for non-degenerate cases.)

Thanks for the feedback! We were aiming for something simple and efficient enough. Our preliminary results show that the lookup pretty much disappears even from the profiles it was dominating before.

The implementation is pretty much taken from the "Augmented tree" section of https://en.wikipedia.org/wiki/Interval_tree where we just use the tree induced by the pivots of the binary search as the binary search tree that we augment. I believe the complexity is O(m log n), even though the wikipedia article makes a O(m + log n) claim. This should be still much better than the current O(n) and the memory cost seems to be quite palatable (extra word per symbol).

The implementation itself needs some work though. My incomplete list of comments is:

  • replace int with size_t and closed intervals with half-open ones
  • let's move the computation of the upper bound into the "Sort" function. sorting is O(n log(n)), this is O(n) -- we can just hide it there.
  • make private functions private
  • we should avoid the business of figuring out what is the suitable "minimum" value of B by ensuring we call the recursive function on non-empty intervals
  • clang-format the patch

For testing you should add some c++ unit tests for the relevant interfaces.

If we get away with this approach them I'm completely fine with it. But I would feel better if we first have some more unit tests for RangeDataVector first.

I like this idea a lot, in principle. It is much simpler than a full blown interval tree, and it should give us similar performance characteristics.

Have you done a proper complexity analysis here? I doubt the O(log n) claim is true in general. It would have to be at least O(m + log n) (m - number of elements found), but it's not clear to me whether even this is true in general. (However, I believe this can achieve ~~log(n) for non-degenerate cases.)

Thanks for the feedback! We were aiming for something simple and efficient enough. Our preliminary results show that the lookup pretty much disappears even from the profiles it was dominating before.

The implementation is pretty much taken from the "Augmented tree" section of https://en.wikipedia.org/wiki/Interval_tree where we just use the tree induced by the pivots of the binary search as the binary search tree that we augment. I believe the complexity is O(m log n), even though the wikipedia article makes a O(m + log n) claim. This should be still much better than the current O(n) and the memory cost seems to be quite palatable (extra word per symbol).

Ok, I see. Thanks for that reference. The wikipedia description is somewhat short, but I am beginning to understand how this could work. Maybe you could include a pointer to the wikipedia page and/or the book it references next to the algorithm.

unnar updated this revision to Diff 246679.Feb 26 2020, 5:01 AM
unnar marked 5 inline comments as done.

Have you done a proper complexity analysis here? I doubt the O(log n) claim is true in general. It would have to be at least O(m + log n) (m - number of elements found), but it's not clear to me whether even this is true in general. (However, I believe this can achieve ~~log(n) for non-degenerate cases.)

I should have been more specific, searching for any interval that contains a point can be done in O(log n) but in our case where we are searching for all intervals that contain the point and as Jarin said we believe it takes O(m log n) (I admit I have not done a thorough analysis of the time complexity myself). The theoretical best you can do is indeed O(m + log n).

The implementation itself needs some work though. My incomplete list of comments is:

  • replace int with size_t and closed intervals with half-open ones

Done.

  • let's move the computation of the upper bound into the "Sort" function. sorting is O(n log(n)), this is O(n) -- we can just hide it there.

That was my first thought but I decided to have it generated lazily instead, I will move it back to the sort.

  • make private functions private

Done.

  • we should avoid the business of figuring out what is the suitable "minimum" value of B by ensuring we call the recursive function on non-empty intervals

Done.

  • clang-format the patch

Done.

For testing you should add some c++ unit tests for the relevant interfaces.

Done. Do you think we should also unit test ComputeUpperBounds or just the interface?

lldb/include/lldb/Utility/RangeMap.h
445

Removed and made sure not to recursively call in degenerate cases.

527

I suspect this function used to have a different implementation where it would return the indexes of the entries rather than the data itself similar to FindEntryIndexThatContains and it was not properly changed when it was updated.

608

Removed as it is no longer needed.

unnar updated this revision to Diff 246681.Feb 26 2020, 5:19 AM

Added reference to Wikipedia article.

Thanks for adding the tests. They look good. Could you split them off into a separate patch? I believe @teemperor wanted (and I do agree it's a good idea) to check them in first in order to demonstrate that this patch does not change the behavior (or to make it what exactly changes).

lldb/unittests/Utility/RangeMapTest.cpp
22 ↗(On Diff #246681)

returning a const value from a function is fairly useless. All it does is make the caller jump through some hoops if he really wants to modify it.

unnar updated this revision to Diff 246716.Feb 26 2020, 7:16 AM
unnar marked 2 inline comments as done.

Have you done a proper complexity analysis here? I doubt the O(log n) claim is true in general. It would have to be at least O(m + log n) (m - number of elements found), but it's not clear to me whether even this is true in general. (However, I believe this can achieve ~~log(n) for non-degenerate cases.)

I should have been more specific, searching for any interval that contains a point can be done in O(log n) but in our case where we are searching for all intervals that contain the point and as Jarin said we believe it takes O(m log n) (I admit I have not done a thorough analysis of the time complexity myself). The theoretical best you can do is indeed O(m + log n).

Yep, I can easily convince myself is m*log(n). I can believe it can be O(m+log(n)) too, but proving that would require some pretty careful accounting. In either case, I am sure this is better than what we have now.

For testing you should add some c++ unit tests for the relevant interfaces.

Done. Do you think we should also unit test ComputeUpperBounds or just the interface?

Testing the interface should be enough. In fact, the main thing that is bothering me about this patch at this stage is that the new "upper_bound" member is public and accessible to the user. I haven't decided yet what to do about it. Have you looked at what it would take to make that private somehow? Maybe by storing std::pair<Entry, bound> in the private vector, and only handing out pointers to the Entry component ?

unnar added a comment.Feb 26 2020, 7:23 AM

Sure, removed tests from this patch. They are now at D75180

lldb/unittests/Utility/RangeMapTest.cpp
22 ↗(On Diff #246681)

Noted.

unnar added a comment.Feb 26 2020, 7:28 AM

Have you done a proper complexity analysis here? I doubt the O(log n) claim is true in general. It would have to be at least O(m + log n) (m - number of elements found), but it's not clear to me whether even this is true in general. (However, I believe this can achieve ~~log(n) for non-degenerate cases.)

I should have been more specific, searching for any interval that contains a point can be done in O(log n) but in our case where we are searching for all intervals that contain the point and as Jarin said we believe it takes O(m log n) (I admit I have not done a thorough analysis of the time complexity myself). The theoretical best you can do is indeed O(m + log n).

Yep, I can easily convince myself is m*log(n). I can believe it can be O(m+log(n)) too, but proving that would require some pretty careful accounting. In either case, I am sure this is better than what we have now.

For testing you should add some c++ unit tests for the relevant interfaces.

Done. Do you think we should also unit test ComputeUpperBounds or just the interface?

Testing the interface should be enough. In fact, the main thing that is bothering me about this patch at this stage is that the new "upper_bound" member is public and accessible to the user. I haven't decided yet what to do about it. Have you looked at what it would take to make that private somehow? Maybe by storing std::pair<Entry, bound> in the private vector, and only handing out pointers to the Entry component ?

That should work...although I'm not sure how that is any different to the range or data being public. If a user modifies anything then it has essentially invalidated the whole structure anyway.

That should work...although I'm not sure how that is any different to the range or data being public. If a user modifies anything then it has essentially invalidated the whole structure anyway.

That is a fair point. I suppose the reason why I see this as different is because this field is an implementation detail of the RangeDataVector class, and so the user should not even see it -- whereas the user has a legitimate reason to at least access the other fields (and most of the methods only provide read-only access to these fields).

I'm sorry, I haven't gotten around to looking at this patch today, but I thought I'd at least say that...

That should work...although I'm not sure how that is any different to the range or data being public. If a user modifies anything then it has essentially invalidated the whole structure anyway.

That is a fair point. I suppose the reason why I see this as different is because this field is an implementation detail of the RangeDataVector class, and so the user should not even see it -- whereas the user has a legitimate reason to at least access the other fields (and most of the methods only provide read-only access to these fields).

I'm sorry, I haven't gotten around to looking at this patch today, but I thought I'd at least say that...

That is true. I am fine with changing it if that's the only thing that you see as blocking this change from passing.

That is a fair point. I suppose the reason why I see this as different is because this field is an implementation detail of the RangeDataVector class, and so the user should not even see it -- whereas the user has a legitimate reason to at least access the other fields (and most of the methods only provide read-only access to these fields).

I'm sorry, I haven't gotten around to looking at this patch today, but I thought I'd at least say that...

That is true. I am fine with changing it if that's the only thing that you see as blocking this change from passing.

I finally took a good look at the patch, and I think that is the only remaining question on my mind. Could you try implementing that to see how it looks like?

@teemperor, do you have any more thoughts on this?

lldb/include/lldb/Utility/RangeMap.h
615

Llvm style guide does not recommend using auto in this situation. Entry &entry is one character longer, but it makes it clear what is going on.

635
unnar updated this revision to Diff 247211.EditedFeb 28 2020, 3:46 AM
unnar marked 2 inline comments as done.

As discussed with @labath I made a separate struct called AugmentedEntry that is used internally but we only ever expose the Entry part to the user.

labath added a comment.Mar 2 2020, 1:38 AM

I think this is looking very good now. Just a few polishing comments.

lldb/include/lldb/Utility/RangeMap.h
397

Add a short comment about the purpose of this class. Maybe you could just move the comment from the function ComputeUpperBounds to here?

401–402

looking at the usage, it would be simpler if this constructor just took a const RangeData<B, S, T>& argument, and then initialized the base class using its copy constructor.

422–423

and then here you'd do m_entries.emplace_back(entry)

unnar updated this revision to Diff 247597.Mar 2 2020, 4:21 AM
unnar marked 3 inline comments as done.
unnar added a comment.Mar 2 2020, 4:35 AM

Thanks for the feedback! I addressed all of your comments in the latest patch.

labath accepted this revision.Mar 2 2020, 7:45 AM

Awesome. Thanks for the patch. Do you need me to commit this for you?

This revision is now accepted and ready to land.Mar 2 2020, 7:45 AM
unnar added a comment.Mar 2 2020, 7:59 AM

Yes please. :)

This revision was automatically updated to reflect the committed changes.