This is an archive of the discontinued LLVM Phabricator instance.

[BitVector] Implement find_[first/last]_[set/unset]_in
ClosedPublic

Authored by zturner on May 11 2017, 9:44 AM.

Details

Summary

This effort started when I wanted a method that did the following:

Given a BitVector and indices N and M, find the largest index less than M but greater than or equal to N which contains a set bit.

At first I just used find_prev(M) which I had added in an earlier patch, but I quickly discovered the performance of this was poor. My BitVector had over 500,000 bits in it, and in the worst case this could cause an O(n) scan all the way to the beginning. But I already had a lower bound on the range I wanted to search in, so I figured I could just make a function find_last_in(LowerBound, PriorTo).

But it quickly became apparent that if we already have this function, then find_last() is just find_last_in(0, Size) and find_prev(PriorTo) is just find_last_in(0, PriorTo).

In fact, all of the methods are generalizable this way. For example, find_first and find_next reduce to find_first_in(0, Size) and find_first_in(After, Size). The generalizations to the find_xxx_unset family of methods is obvious.

So I went all the way. There is now a find_in method for all combinations of [first, last] x [set, unset], and all find methods delegate to these methods.

Diff Detail

Repository
rL LLVM

Event Timeline

zturner created this revision.May 11 2017, 9:44 AM

And.... this apparently breaks everything. Looking into why, all my unittests passed, so it's something more subtle.

zturner updated this revision to Diff 98656.May 11 2017, 10:38 AM

Fix the breakage. Given a BitVector of size N where the last bit (e.g. index N-1) was set, the failure occurred when you called find_first_in(N-1, Size). It should have been returning N-1 but it was returning -1. A test was added for this case, and edge boundary tests were added for all other functions as well.

zturner updated this revision to Diff 98658.May 11 2017, 10:41 AM

Remove the "old" versions of the functions. I tracked down the problem by copy/pasting the original versions back in and asserting any time they didn't return the same value. Forgot to remove them in my previous upload though.

chandlerc requested changes to this revision.May 16 2017, 6:54 PM

Very nice generally. Comments only on one, but clearly apply to each variation.

Bit of a shame there isn't a clean way to share all the logic, but all the ideas to do that are *awful*. =/

llvm/include/llvm/ADT/BitVector.h
151 ↗(On Diff #98658)

Given that we return an int, I would prefer that the arguments also be int (and the intermediate variables within).

152–153 ↗(On Diff #98658)

Why isn't this an assert?

This revision now requires changes to proceed.May 16 2017, 6:54 PM
zturner added inline comments.May 16 2017, 7:08 PM
llvm/include/llvm/ADT/BitVector.h
151 ↗(On Diff #98658)

The only reason we return an int is because we have to use a sentinel value to indicate "no bits found". The methods that these originated from find_next, find_prev, etc have always taken unsigned arguments even before I started tinkering on BitVector. Do you still think it's a good idea to change all of them?

152–153 ↗(On Diff #98658)

I suppose I should assert if Begin > End, but Begin == End should probably still just return -1 I think.

chandlerc accepted this revision.May 17 2017, 8:36 AM

LGTM with the assert changes.

llvm/include/llvm/ADT/BitVector.h
151 ↗(On Diff #98658)

I mean, my view is that we're doing a lot of arithmetic on these things including potentially subtraction. So generally, unless there is a problem with using a signed integer, I would prefer that.

If we return int, then we've given up on having more than 2^31 - 1 bits in one of these containers.

That said, a bitvector is one of the few datastructures where it seems entirely conceivable that we'll actually end up with 500mb vector of bits and an int index won't work. But an unsigned won't either, so I suspect that isn't the deciding factor and if we care we need a 64-bit type.

So I guess my feeling is that we should do one of two things here:
a) Switch the return types to an unsigned type and use something like npos, or
b) Switch the argument types to a signed type and continue to use -1.

(Whether we use 32-bits or 64-bits should be orthogonal as I see no reason why for a *bit* count 2gb vs 4gb would be "enough".)

However, looking at the code, I agree that the patch as-is remains consistent, and you shouldn't do either (a) or (b) in *this* patch. But I'd somewhat like to consider doing one of them in a subsequent patch.

My personal, strong preference is to use (b) because I find asserts easier to write for signed types in the face of arithmetic and we have strong usage of UBSan to ensure we don't develop bugs here. I also have a somewhat powerful dislike of npos. But I think either (a) or (b) would be an improvement. Anyways, sorry for the digression.

152–153 ↗(On Diff #98658)

Ah, yes, that makes sense.

This revision is now accepted and ready to land.May 17 2017, 8:36 AM
This revision was automatically updated to reflect the committed changes.