This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add strspn implementation.
ClosedPublic

Authored by cgyurgyik on Aug 2 2020, 5:47 PM.

Details

Summary

[libc] Adds strspn implementation and Bitset to CPP utils. To get O(max(M,N)) complexity (where M, N are the string lengths), one can use a table to first store the characters of the string to be matched against, and then iterate over the source string to determine if it contains any.

This approach takes it a step further, and reduces the table's size by using bit manipulation techniques.

Diff Detail

Event Timeline

cgyurgyik created this revision.Aug 2 2020, 5:47 PM
cgyurgyik requested review of this revision.Aug 2 2020, 5:47 PM

This is a working draft; currently not sure what the best approach for this is. I just stuck a few helper functions in that have some duplication, as well as some constants. I'm assuming this won't be the only function to use bit manipulation; maybe there already exists one for libc that I'm just not aware of.

Thanks ahead of time for the guidance and reviews.

cgyurgyik retitled this revision from [libc] Adds strspn implementation. to [libc] Add strspn implementation..Aug 3 2020, 12:44 PM

FWIW, I think in most real world scenarios the O(N*M) approach would be faster. Granted I don't know what those scenarios are I had never heard of this function, I have no idea where this is useful.

libc/src/string/strspn.cpp
17

None of the comments in this file are necessary I would say. This code explains itself.

18

static not necessary. Also I think that the naming convention is wrong here, it should be the same as all other identifiers.

20–21

The size of the byteset used. Not necessary.
The rest if you think that is actually a concern, presumably not then we should static_assert it otherwise remove that comment too.

22

32 is supposed to be how many bytes it takes to represent 256 bits? Maybe (256 / 8) / sizeof(size_t) is easier to read? This might have actually warranted a comment

Also should this be called a bitset not a byteset?

25

inline not necessary. Instead of taking an unsigned char then assigning it to a size_t just take a size_t here and in contains_bit.

32

byteset should be a const size_t *

43–52

The difference in speed here is going to be negligible, the question is wether you think it is worth complicating the code for these edge cases.

FWIW, I think in most real world scenarios the O(N*M) approach would be faster. Granted I don't know what those scenarios are I had never heard of this function, I have no idea where this is useful.

Its hard for me to argue this claim one way or another without an appropriate data set to benchmark with. Therefore, I chose to rely solely on complexity for now.

cgyurgyik updated this revision to Diff 282870.Aug 4 2020, 5:00 AM
cgyurgyik marked 6 inline comments as done.

Address abrachet's comments.

libc/src/string/strspn.cpp
18

Yeah I couldn't find much on constants naming convention. Changed to lowercase.

20–21

I don't know any systems that have a size_t > 32 bytes. I'll remove this for now.

25

Is there any harm in inline here? I removed it, but for my own knowledge. From what I understand, you're saying the compiler will inline it on its own accord, and this is just extra unnecessary specifier?

32

Good catch thanks.

43–52

This is something Siva hit me on before as well, so I'll wait for his input here too. I understand your thinking, but I don't find it grossly complicated, and think its good to avoid creating a bitset unless absolutely necessary.

cgyurgyik added inline comments.Aug 4 2020, 5:01 AM
libc/src/string/strspn.cpp
22

I think arguments could be made for both, but I like bitset a bit more.

cgyurgyik added inline comments.Aug 4 2020, 5:10 AM
libc/src/string/strspn.cpp
33–41

I believe I accidentally resolved this, but
@abrachet 's comment from earlier:

The difference in speed here is going to be negligible, the question is whether you think it is worth complicating the code for these edge cases.

sivachandra added inline comments.Aug 4 2020, 12:04 PM
libc/src/string/strspn.cpp
22

I would prefer a bool specialization of https://github.com/llvm/llvm-project/blob/master/libc/utils/CPP/Array.h for which the internal implementation is a bit array like std::vector<bool>.

Also, why not use uintptr_t array instead of size_t array?

static constexpr size_t BitsPerUnit = 8 * sizeof(uintptr_t);
uintptr_t Data[size / BitsPerUnit + size % BitsPerUnit]; // size is of type size_t

// Setter; x is of type size_t
Data[x / BitsPerUnit] |= (uintptr_t(1) << (x % BitsPerUnit));

// Getter; x is of type size_t
return Data[x / BitsPerUnit] & (uintptr_t(1) << (x % BitsPerUnit));
33–41

I agree with @abrachet that this has almost no benefit.

cgyurgyik updated this revision to Diff 283021.Aug 4 2020, 1:52 PM

Address some of Siva's comments, still need to use array<bool> representation.

cgyurgyik marked 2 inline comments as done.Aug 4 2020, 1:53 PM
cgyurgyik added inline comments.
libc/src/string/strspn.cpp
22

Ok I'll work on this.

cgyurgyik updated this revision to Diff 283022.Aug 4 2020, 1:56 PM
cgyurgyik added inline comments.Aug 4 2020, 2:54 PM
libc/src/string/strspn.cpp
22

Discussed offline - going to add std::bitset util instead, since std::array<bool, size_t> doesn't mimic the std::vector<bool> implementation.

cgyurgyik updated this revision to Diff 283096.Aug 4 2020, 6:34 PM
cgyurgyik edited the summary of this revision. (Show Details)

Adds Bitset to CPP utils.

sivachandra added inline comments.Aug 4 2020, 11:06 PM
libc/test/src/string/strspn_test.cpp
14

Seems to me like all of these tests should just be combined into one test as the different names are making it more confusing.

libc/test/utils/CPP/bitset_test.cpp
1 ↗(On Diff #283096)

Uppercase *B*itset.

57 ↗(On Diff #283096)

Seems like SetsProperBit and DoesNotSetNeighbotBts should be combined into something like SettingXDoesNotSetY.

79 ↗(On Diff #283096)

May be EXPECT_TRUE(...).

85 ↗(On Diff #283096)

s/Should/Does ?

libc/utils/CPP/Bitset.h
18 ↗(On Diff #283096)

s/CharacterBit/BitsPerByte ?

22 ↗(On Diff #283096)

If number_of_bits is always > zero, will the calculated size be zero?

43 ↗(On Diff #283096)

Because there is a static assert to ensure NumberOfBits is not 0, why not just do:

uintptr_t Data[(NumberOfBits + BitsPerByte - 1)/BitsPerUnit] = {0};
cgyurgyik updated this revision to Diff 283211.Aug 5 2020, 5:47 AM
cgyurgyik marked 3 inline comments as done.

Address Siva's comments.

cgyurgyik marked 3 inline comments as done.Aug 5 2020, 5:48 AM
cgyurgyik added inline comments.
libc/test/src/string/strspn_test.cpp
14

Can you clarify what you mean here?
I'm trying to unit test strspn, i.e. test different behaviors so that if it fails for behavior X, then behavior X unit test should fail rather than all of them. This has helped me already with some small debugging problems when manipulating bits.

I agree that my naming isn't phenomenal, but its hard when trying to talk about spans and segments. I an open to naming changes and combining certain tests, but I don't know if combining all the unit tests is the correct approach.

With that said, I've made some changes to the names in hopes to improve this.

libc/utils/CPP/Bitset.h
43 ↗(On Diff #283096)

Let

NumberOfBits = 16
BitsPerByte = 8
BitsPerUnit = BitsPerByte * sizeof(uintptr_t) = 8 * 4
(NumberOfBits + BitsPerByte - 1) / BitsPerUnit
= (16 + 8 - 1) / (8 * 4)
= 23 / 32
= 0

If (NumberOfBits + BitsPerByte - 1) < BitsPerUnit, then we get 0.
I can simplify this further and just say if NumberOfBits < BitsPerUnit, return 1 unit.

cgyurgyik marked an inline comment as done.Aug 5 2020, 5:49 AM
sivachandra added inline comments.Aug 5 2020, 8:20 AM
libc/utils/CPP/Bitset.h
18 ↗(On Diff #283211)

Make these static members of Bitset.

36 ↗(On Diff #283211)

Sorry, I had a typo in my earlier comment. It should have been:

(NumberOfBits + BitsPerUnit - 1)/BitsPerUnit

This should handle all cases?

cgyurgyik updated this revision to Diff 283260.Aug 5 2020, 8:44 AM
cgyurgyik marked 2 inline comments as done.
cgyurgyik added inline comments.
libc/utils/CPP/Bitset.h
36 ↗(On Diff #283211)

Yep! Good call on this. That should've been the obvious next step.

sivachandra accepted this revision.Aug 5 2020, 9:03 AM
This revision is now accepted and ready to land.Aug 5 2020, 9:03 AM

Thanks for the review Siva. Will wait for @abrachet 's LGTM as well.

abrachet accepted this revision.Aug 5 2020, 12:32 PM

Missing definition in spec/stdc.td

This revision was automatically updated to reflect the committed changes.