This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Fix handling of negated character classes in regex
ClosedPublic

Authored by ldionne on Aug 9 2018, 2:10 PM.

Diff Detail

Repository
rCXX libc++

Event Timeline

ldionne created this revision.Aug 9 2018, 2:10 PM

The changeset that introduced this regression was reviewed as https://reviews.llvm.org/D37955. @timshen, I'm curious to understand why you or'ed with __neg_chars_.empty() when initializing __in_neg_chars. The logic's not clear to me, and my tests seem to show that this is wrong, but I don't understand the reasoning behind doing it in the first place, so I might be introducing another bug here. Please double-check.

timshen added a comment.EditedAug 9 2018, 4:29 PM

I'm not fully equipped with the context right now, but something doesn't add up. if __neg_chars_.empty() check is removed, the (__neg_mask_ == 0) above should be removed too. They have to be consistent.

However, there is more weirdness in it. The comment above describes the intention:

union(complement(union(__neg_chars_, __neg_mask_)), other cases...)

With the __neg_chars_.empty() and (__neg_mask_ == 0) removed, I believe that the code exactly matches the comment. Let's see what happens when users don't specify any negative class or chars. __neg_chars_ and __neg_mask_ will be empty sets, and union(complement(union(__neg_chars_, __neg_mask_)), other cases...) always evaluate to full set, which means it always matches all characters. This can't be right.

It's likely that the comment description doesn't fully describe the intended behavior. I think we need to figure that out first.

ldionne updated this revision to Diff 160169.Aug 10 2018, 12:26 PM

Rewrite the patch in light of timshen's comment, and add tests.

I'm not fully equipped with the context right now, but something doesn't add up. if __neg_chars_.empty() check is removed, the (__neg_mask_ == 0) above should be removed too. They have to be consistent.

However, there is more weirdness in it. The comment above describes the intention:

union(complement(union(__neg_chars_, __neg_mask_)), other cases...)

With the __neg_chars_.empty() and (__neg_mask_ == 0) removed, I believe that the code exactly matches the comment. Let's see what happens when users don't specify any negative class or chars. __neg_chars_ and __neg_mask_ will be empty sets, and union(complement(union(__neg_chars_, __neg_mask_)), other cases...) always evaluate to full set, which means it always matches all characters. This can't be right.

It's likely that the comment description doesn't fully describe the intended behavior. I think we need to figure that out first.

Yes, I think you were right. I think what was missing is that the comment does not apply when there's no neg_mask and no neg_chars. Otherwise we get into that situation where we're taking the complement of an empty set, and we match everything, which is not what we want.

timshen accepted this revision.Aug 10 2018, 2:28 PM

That looks more correct to me, thanks! Although I'm still puzzled by the empty check at all, it's clearly an improvement.

This revision is now accepted and ready to land.Aug 10 2018, 2:28 PM
This revision was automatically updated to reflect the committed changes.

is this fixed in 7.0 release branch too?

@hans

hans added a comment.Sep 6 2018, 1:56 AM

is this fixed in 7.0 release branch too?

@hans

I've merged it in r341529 now. Thanks for letting me know.