Originally I wanted to remove the RegularExpression class in Utility and replace it with llvm::Regex. However, during that transition I noticed that there are different places where need the underlying regular expression string. So instead I propose to keep the RegularExpression class and make it a relatively thin wrapper around llvm::Regex.
Details
- Reviewers
labath clayborg jingham xiaobai jdoerfert - Group Reviewers
Restricted Project - Commits
- rG3af3f1e8e253: [Utility] Reimplement RegularExpression on top of llvm::Regex
rL369153: [Utility] Reimplement RegularExpression on top of llvm::Regex
rLLDB369153: [Utility] Reimplement RegularExpression on top of llvm::Regex
Diff Detail
Event Timeline
Overall, I am very supportive of this, though I have some comments inline. The interface seems ok, though maybe you could add a couple of quick unit tests to show how the class is meant to be used.
lldb/source/Commands/CommandObjectBreakpoint.cpp | ||
---|---|---|
685 | I wouldn't use auto in cases like these because Optional<string> is definitely not among the things one would expect a function called GetError to return. Though maybe that means the function should really return an llvm::Error? | |
lldb/source/Host/common/Socket.cpp | ||
286 | Are there any cases where evaluation of the regex can succeed, but the result will not contain the expected number of matches? Should this be changed into an assert, or just dropped completely? | |
lldb/unittests/Utility/NameMatchesTest.cpp | ||
52–53 ↗ | (On Diff #215022) | This is interesting. So, llvm::regex considers an empty pattern to not match anything? That doesn't sound right. I've just tried grep, python and perl, and all of them seem perfectly happy to accept an empty string as a pattern which matches everything.. I'd say this is a bug in llvm::regex we should fix. |
I like the idea of using something from llvm instead of rolling our own. The code changes look relatively simple and straightforward, so that's good.
lldb/source/Commands/CommandObjectBreakpoint.cpp | ||
---|---|---|
685 | If Pavel hadn't pointed this out, I would have thought that error was an llvm::Error here and not a an Optional. I think something should change here. |
I think we need to fix the behavior for "", other than that, this looks fine to me.
lldb/source/Host/common/Socket.cpp | ||
---|---|---|
286 | Jonas introduced this pattern everywhere he calls Execute. It does seem a little odd, since there's no way any of the patterns that he checks matches for could match but not match the () part. It makes the reader wonder if they don't understand regular expressions after all... | |
lldb/source/Target/ThreadPlanStepInRange.cpp | ||
369 | You can move this into the "if (return_value" It doesn't get used outside the check, and this test mostly fails so it would be good not to do more work in that case. | |
lldb/unittests/Utility/NameMatchesTest.cpp | ||
52–53 ↗ | (On Diff #215022) | Yes, that does seem odd to me. I would expect: (lldb) break set -r "" to match everything, not nothing. |
- Cleanup interface
- Return llvm::Error
- Remove size comparison
- Add empty regex workaround
- Add unit tests
lldb/unittests/Utility/NameMatchesTest.cpp | ||
---|---|---|
52–53 ↗ | (On Diff #215022) | The POSIX standard says that an empty regex is not a thing, it is rejected by the grammar as defined in https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html. The BSD implementation in LLVM agrees. Unfortunately GNU seems to disagree, which is probably why you expect the old behavior. If you look at the old code, the LLDB implementation worked around this. I've put the old workaround back, but I don't think we should "fix" the llvm::Regex implementation. Personally I think we should follow the standard in LLDB as well. I tried a few things and I couldn't find a compelling scenario. For example, setting a breakpoint with an empty regex is already not supported. |
I was unhappy with the () workaround, so I did some investigation. Now, I am even more unhappy, but at least better educated. :P
Basically, what I've learned is that according to POSIX https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html an empty string is not a valid regular expression. Most tools seem to ignore that, and treat it as if it matches everything (and I can't say I blame them). However, the BSD implementation (that the llvm stuff is based on) tried to do a strict implementation and it rejects that as an invalid expression.
Funnily enough, () is not valid according to POSIX either, but the BSD implementation accepts it. So does (linux) grep and python. However, perl rejects it (it still considers it a valid expression, but one that doesn't match anything).
a||b (a OR empty pattern OR b) is also not valid. grep, python and perl accept that, but the BSD regexes don't.
So, overall, it seems to me that there is a lot of confusion about what should empty (sub-)patterns do. I don't think special-casing "" in lldb_private::RegularExpression would help alleviate any of that confusion as we would be still left with all of the other inconsistencies. So, if we want to go through with this (and I still think we should), I guess we'll just have to bite the bullet and say that our expressions are now (more or less) POSIX conformant, and repeat that to anyone who comes complaining that lldb regexes behave differently than grep, python or older versions of lldb...
lldb/source/Commands/CommandObjectFrame.cpp | ||
---|---|---|
577–578 | GetErrorStream().Format("error: {0}\n", llvm::fmt_consume(std::move(err))); would be slightly less verbose. |
For the record, in case my earlier response gets lost in the inline comments, I strongly support this :-)
Pavel said "we'll just have to bite the bullet and say that our expressions are now (more or less) POSIX conformant"... We should say this somewhere users are likely to find.
All the options that take regular expressions look like:
(lldb) help break set ... -p <regular-expression> ( --source-pattern-regexp <regular-expression> )
And then for any option value type, you can get help:
(lldb) help regular-expression <regular-expression> -- A regular expression.
Maybe it would be good to say a little more here?
With that addition I'm fine with this.
Why does this need to be mutable?