Page MenuHomePhabricator

[Utility] Reimplement RegularExpression on top of llvm::Regex
ClosedPublic

Authored by JDevlieghere on Aug 13 2019, 2:48 PM.

Details

Summary

Originally I wanted to remove the RegularExpression class in Utility and replace it with llvm::Regex. However, during that transition I noticed that there are different places where need the underlying regular expression string. So instead I propose to keep the RegularExpression class and make it a relatively thin wrapper around llvm::Regex.

Diff Detail

Event Timeline

JDevlieghere created this revision.Aug 13 2019, 2:48 PM
JDevlieghere retitled this revision from [Utility] Phase out RegularExpression and use llvm::Regex instead. to [Utility] Reimplement RegularExpression on top of llvm::Regex.
JDevlieghere edited the summary of this revision. (Show Details)

Overall, I am very supportive of this, though I have some comments inline. The interface seems ok, though maybe you could add a couple of quick unit tests to show how the class is meant to be used.

lldb/source/Commands/CommandObjectBreakpoint.cpp
685

I wouldn't use auto in cases like these because Optional<string> is definitely not among the things one would expect a function called GetError to return. Though maybe that means the function should really return an llvm::Error?

lldb/source/Host/common/Socket.cpp
286

Are there any cases where evaluation of the regex can succeed, but the result will not contain the expected number of matches? Should this be changed into an assert, or just dropped completely?

lldb/unittests/Utility/NameMatchesTest.cpp
52–54

This is interesting. So, llvm::regex considers an empty pattern to not match anything? That doesn't sound right. I've just tried grep, python and perl, and all of them seem perfectly happy to accept an empty string as a pattern which matches everything..

I'd say this is a bug in llvm::regex we should fix.

I like the idea of using something from llvm instead of rolling our own. The code changes look relatively simple and straightforward, so that's good.

lldb/source/Commands/CommandObjectBreakpoint.cpp
685

If Pavel hadn't pointed this out, I would have thought that error was an llvm::Error here and not a an Optional. I think something should change here.

jdoerfert resigned from this revision.Aug 14 2019, 11:07 AM

I think we need to fix the behavior for "", other than that, this looks fine to me.

lldb/source/Host/common/Socket.cpp
286

Jonas introduced this pattern everywhere he calls Execute. It does seem a little odd, since there's no way any of the patterns that he checks matches for could match but not match the () part. It makes the reader wonder if they don't understand regular expressions after all...

lldb/source/Target/ThreadPlanStepInRange.cpp
369

You can move this into the "if (return_value" It doesn't get used outside the check, and this test mostly fails so it would be good not to do more work in that case.

lldb/unittests/Utility/NameMatchesTest.cpp
52–54

Yes, that does seem odd to me. I would expect:

(lldb) break set -r ""

to match everything, not nothing.

JDevlieghere marked 7 inline comments as done.
  • Cleanup interface
  • Return llvm::Error
  • Remove size comparison
  • Add empty regex workaround
  • Add unit tests
JDevlieghere added inline comments.Aug 14 2019, 2:23 PM
lldb/unittests/Utility/NameMatchesTest.cpp
52–54

The POSIX standard says that an empty regex is not a thing, it is rejected by the grammar as defined in https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html. The BSD implementation in LLVM agrees. Unfortunately GNU seems to disagree, which is probably why you expect the old behavior. If you look at the old code, the LLDB implementation worked around this.

I've put the old workaround back, but I don't think we should "fix" the llvm::Regex implementation. Personally I think we should follow the standard in LLDB as well. I tried a few things and I couldn't find a compelling scenario. For example, setting a breakpoint with an empty regex is already not supported.

Looks fine to me. A few nits.

lldb/include/lldb/Utility/RegularExpression.h
97

Why does this need to be mutable?

lldb/source/Commands/CommandObjectFrame.cpp
1

revert

JDevlieghere marked 4 inline comments as done.Aug 14 2019, 3:38 PM
JDevlieghere added inline comments.
lldb/include/lldb/Utility/RegularExpression.h
97

llvm::Regex's match method is not const. We could unconst Execute but that would mean a bunch of churn in the API.

lldb/source/Commands/CommandObjectFrame.cpp
1

Oops

JDevlieghere updated this revision to Diff 215287.EditedAug 14 2019, 5:02 PM
JDevlieghere marked 2 inline comments as done.

Undo typo in CommandObjectFrame.cpp

I was unhappy with the () workaround, so I did some investigation. Now, I am even more unhappy, but at least better educated. :P

Basically, what I've learned is that according to POSIX https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html an empty string is not a valid regular expression. Most tools seem to ignore that, and treat it as if it matches everything (and I can't say I blame them). However, the BSD implementation (that the llvm stuff is based on) tried to do a strict implementation and it rejects that as an invalid expression.

Funnily enough, () is not valid according to POSIX either, but the BSD implementation accepts it. So does (linux) grep and python. However, perl rejects it (it still considers it a valid expression, but one that doesn't match anything).

a||b (a OR empty pattern OR b) is also not valid. grep, python and perl accept that, but the BSD regexes don't.

So, overall, it seems to me that there is a lot of confusion about what should empty (sub-)patterns do. I don't think special-casing "" in lldb_private::RegularExpression would help alleviate any of that confusion as we would be still left with all of the other inconsistencies. So, if we want to go through with this (and I still think we should), I guess we'll just have to bite the bullet and say that our expressions are now (more or less) POSIX conformant, and repeat that to anyone who comes complaining that lldb regexes behave differently than grep, python or older versions of lldb...

lldb/source/Commands/CommandObjectFrame.cpp
577–578

GetErrorStream().Format("error: {0}\n", llvm::fmt_consume(std::move(err))); would be slightly less verbose.

I was unhappy with the () workaround, so I did some investigation. Now, I am even more unhappy, but at least better educated. :P

Basically, what I've learned is that according to POSIX https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html an empty string is not a valid regular expression. Most tools seem to ignore that, and treat it as if it matches everything (and I can't say I blame them). However, the BSD implementation (that the llvm stuff is based on) tried to do a strict implementation and it rejects that as an invalid expression.

Funnily enough, () is not valid according to POSIX either, but the BSD implementation accepts it. So does (linux) grep and python. However, perl rejects it (it still considers it a valid expression, but one that doesn't match anything).

a||b (a OR empty pattern OR b) is also not valid. grep, python and perl accept that, but the BSD regexes don't.

So, overall, it seems to me that there is a lot of confusion about what should empty (sub-)patterns do. I don't think special-casing "" in lldb_private::RegularExpression would help alleviate any of that confusion as we would be still left with all of the other inconsistencies. So, if we want to go through with this (and I still think we should), I guess we'll just have to bite the bullet and say that our expressions are now (more or less) POSIX conformant, and repeat that to anyone who comes complaining that lldb regexes behave differently than grep, python or older versions of lldb...

For the record, in case my earlier response gets lost in the inline comments, I strongly support this :-)

Make RegularExpression POSIX compliant.

labath accepted this revision.Aug 16 2019, 11:41 AM

lgtm. Thanks for doing this.

This revision is now accepted and ready to land.Aug 16 2019, 11:41 AM
jingham accepted this revision.EditedAug 16 2019, 1:10 PM

Pavel said "we'll just have to bite the bullet and say that our expressions are now (more or less) POSIX conformant"... We should say this somewhere users are likely to find.

All the options that take regular expressions look like:

(lldb) help break set
...
       -p <regular-expression> ( --source-pattern-regexp <regular-expression> )

And then for any option value type, you can get help:

(lldb) help regular-expression
  <regular-expression> -- A regular expression.

Maybe it would be good to say a little more here?

With that addition I'm fine with this.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptAug 16 2019, 2:32 PM