This is an archive of the discontinued LLVM Phabricator instance.

Regex: Add static convenience functions for "match" and "sub"
Needs ReviewPublic

Authored by nlguillemot on Sep 25 2019, 2:40 PM.

Details

Reviewers
thopre
Summary

There are many cases where a Regex object used only once: It is created,
used in a single call to match() or sub(), then thrown away. This was
done in various places in LLVM using the following idiom:

Regex(Pattern).match(String)

The problem with this idiom is that if the compilation of the Regex fails,
then match() gets called with an invalid regex pattern. This scenario is
currently handled by checking inside match() if the pattern compiled successfully.
This gives match() the double-duty of handling match errors and handling regex
compilation errors. To move away from match() having this double-duty,
we created an alternative to the idiom above as follows:

Regex::match(Pattern, String)

This static member function version of the idiom behaves like syntactical sugar
for the idiom from earlier, but it checks that the regex compiled without
error before making the call to match().

If we consistently explicitly check the validity of regex before calling match(),
we can require that the regex successfully compiled as a precondition for match().
However, there is code in other projects (eg: clang) that calls match() without
checking for validity of the compiled regex, so that code must first be updated
to be more strict about using the match() API before we can require a validly
compiled regex as a precondition for match(). Updating these uses and making
the API more strict is left as future work.

A similar static convenience function was added for sub(). The constructor of Regex
was also extended to be able to return an error without requiring a subsequent
call to isValid(), for the sake of convenience and code reuse.

Uses of the previous idiom in LLVM were updated to the new style.

Diff Detail

Event Timeline

nlguillemot created this revision.Sep 25 2019, 2:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 25 2019, 2:40 PM
nlguillemot marked 3 inline comments as done.Sep 25 2019, 2:46 PM
nlguillemot added inline comments.
tools/sancov/sancov.cpp
127 ↗(On Diff #221833)

Missed making these ones const in the previous commit, so updated them in this one.

unittests/Support/RegexTest.cpp
20

This typedef might seem kind of redundant, but it's recommended by the googletest guide:

https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#death-test-naming

177

This style of conditionally executing death tests was copied from AlignmentTest.cpp

thopre added inline comments.Sep 26 2019, 1:42 AM
include/llvm/Support/Regex.h
90–92

While redundant, I'd add "which returns an error if the regex is invalid or no match is found" to be more explicit

120–122

Same as above.

tools/sancov/sancov.cpp
127 ↗(On Diff #221833)

I know it's a small change but I'd rather you split it out in a separate patch which I'll approve gladly. This allow easier revert or cherry-pick.

unittests/Support/RegexTest.cpp
20

Thanks for bringing this to my attention, I was not aware of it.

200–203

I'd exchange the two tests like you did in the Constructor test because it shows that calling the wrapper on a valid regex will clear Error while here it only shows it does not set it.

206–209

Likewise.

  • Added more comments to static match and static sub to clarify the return value and the error's value.
  • Remove updates of "static Regex" -> "static const Regex", to do them in a future separate patch instead.
  • Switch order of test lines in "ConvenienceFunctions" test.
nlguillemot marked 3 inline comments as done.Sep 26 2019, 10:14 AM
nlguillemot added inline comments.
include/llvm/Support/Regex.h
90–92

I expanded the description. Thoughts?

nlguillemot marked an inline comment as done.Sep 26 2019, 10:14 AM
nlguillemot marked an inline comment as done.Sep 26 2019, 10:33 AM
nlguillemot added inline comments.
tools/sancov/sancov.cpp
127 ↗(On Diff #221833)

Moved to a separate patch here: https://reviews.llvm.org/D68091

thopre accepted this revision.Sep 27 2019, 1:01 AM

LGTM

This revision is now accepted and ready to land.Sep 27 2019, 1:01 AM
thopre requested changes to this revision.Sep 27 2019, 4:06 AM

The following clang unit tests fail with your patch:

Format/./FormatTests/FormatTest.FunctionAnnotations
Format/./FormatTests/FormatTest.UnderstandsFunctionRefQualification

Can you have a look?

This revision now requires changes to proceed.Sep 27 2019, 4:06 AM
nlguillemot added a comment.EditedSep 27 2019, 11:08 AM

The following clang unit tests fail with your patch:

Format/./FormatTests/FormatTest.FunctionAnnotations
Format/./FormatTests/FormatTest.UnderstandsFunctionRefQualification

Can you have a look?

Oops, I didn't even think of checking if clang works. I started taking a look at it, and here's what I think we should do:

  1. We should do another Regex const correctness patch for clang's uses of Regex.
  2. We need to decide how to deal with the fact that clang depends on match()/sub()'s behavior of lazily reporting compile errors:

    Option A) Make a patch for llvm and clang to put before this one that adds explicit isValid() checks to all uses of match()/sub() with "untrusted" inputs.

    Option B) Keep the previous behavior and return false instead of asserting inside match(), for the sake of backwards compatibility.

If we do option A, we don't compromise on one of the original goals of this patch, and we make the API of match()/sub() more strict.
If we do option B, we have less chance of potentially breaking code in other projects in the LLVM ecosystem.

I think step (1) is a win-win, but I'm not sure about step (2). I'm tempted to conservatively go with the backwards compatible option B, but if we can judge that the potential impact of changing the API with option A is acceptable, we could go ahead and take that risk anyways.

Thoughts?

Opened a review that just increases the const-correctness of Regex in clang: https://reviews.llvm.org/D68155

This is the first of two steps I suggested in my previous comment.

  1. We need to decide how to deal with the fact that clang depends on match()/sub()'s behavior of lazily reporting compile errors:

    Option A) Make a patch for llvm and clang to put before this one that adds explicit isValid() checks to all uses of match()/sub() with "untrusted" inputs.

    Option B) Keep the previous behavior and return false instead of asserting inside match(), for the sake of backwards compatibility.

If we do option A, we don't compromise on one of the original goals of this patch, and we make the API of match()/sub() more strict.
If we do option B, we have less chance of potentially breaking code in other projects in the LLVM ecosystem.

I think step (1) is a win-win, but I'm not sure about step (2). I'm tempted to conservatively go with the backwards compatible option B, but if we can judge that the potential impact of changing the API with option A is acceptable, we could go ahead and take that risk anyways.

Thoughts?

I'm quite new to LLVM community but my understanding is that the general approach is to not worry about external dependencies. On the other hand the benefit is small, doing a check again in match and sub is cheap. We can still have those convenience function to avoid using a temporary variable, so I'd tend to agree with you and go for option B. Update the patch to keep the checks for errors in sub and match and I'll approve it.

Removed the assert inside match() that made it crash when match() is called with a regex pattern that isn't successfully compiled. Also removed the "death test" unit tests that tested that this assert was triggering.

Removed the assert inside match() that made it crash when match() is called with a regex pattern that isn't successfully compiled. Also removed the "death test" unit tests that tested that this assert was triggering.

Can you reword the description along the line of this new API being a convenience when one does not care about distinguishing validity of a regex from whether it matches?

nlguillemot edited the summary of this revision. (Show Details)Nov 19 2019, 9:57 AM

Removed the assert inside match() that made it crash when match() is called with a regex pattern that isn't successfully compiled. Also removed the "death test" unit tests that tested that this assert was triggering.

Can you reword the description along the line of this new API being a convenience when one does not care about distinguishing validity of a regex from whether it matches?

Did an editing pass on the review summary to reflect the new scope of the patch, and noted what is intended as future work.

thopre added a comment.EditedNov 21 2019, 6:02 AM

Not committing to remove the ability to call match/sub on an invalid regex implies that the addition of the new API needs to be justified independently of the extra check that needs to be performed in match/sub to allow the current behaviour. And in fact, the only motivation for this patch I could think of was consistency with other languages and I was finding it a hard sell and was about to apologise for the extra work I made you do.

However I overlooked one benefit of removing the ability to match an invalid regex: robustness. All the cases where a call to match/sub is guards a code other than throwing an error is likely to result in wrong behaviour in case the regex is invalid. So I think it would be safer to use a separate API for combined regex creation + match/sub as per the previous revision of this patch and prevent the existing approach, so that users are mindful when not doing separate checks.

What do you think? If you agree that would mean reverting to the previous version of this patch and the simultaneous apologies from me for requesting the current version.

Not committing to remove the ability to call match/sub on an invalid regex implies that the addition of the new API needs to be justified independently of the extra check that needs to be performed in match/sub to allow the current behaviour. And in fact, the only motivation for this patch I could think of was consistency with other languages and I was finding it a hard sell and was about to apologise for the extra work I made you do.

However I overlooked one benefit of removing the ability to match an invalid regex: robustness. All the cases where a call to match/sub is guards a code other than throwing an error is likely to result in wrong behaviour in case the regex is invalid. So I think it would be safer to use a separate API for combined regex creation + match/sub as per the previous revision of this patch and prevent the existing approach, so that users are mindful when not doing separate checks.

What do you think? If you agree that would mean reverting to the previous version of this patch and the simultaneous apologies from me for requesting the current version.

If I understand correctly, you're suggesting we re-add the assert inside match() that requires the regex to be compiled successfully as a precondition? If so, I agree that the API should be strict. The problem with the assert() was that it affected other projects like clang, but now that there's the monorepo I feel more confident that we can safely make a breaking change by fixing any uses in the other projects in the monorepo.

If we want to re-add the assert(), I would suggest that we first commit this patch to lay the groundwork, then create a separate patch that adds the assert() back in. Isolating the patch with the API breakage would make it easier to revert if necessary.

By the way, I'm going to be on vacation from tomorrow up to December 3rd, so I might not be able to answer to comments in a timely way. Apologies in advance.

thopre added a comment.Jan 2 2020, 4:27 AM

How about the following commit message:

There are many cases where a Regex object is used only once: It is created,
used in a single call to match() or sub(), then thrown away. This was
done in various places in LLVM using the following idiom:

Regex(Pattern).match(String)

The problem with this idiom is that invalid patterns result in a match
failure, which can lead to unexpected behavior if the return value of
match() is used as a condition for an if statement. To force developers
to be mindful of this aspect, an assert is added to match() to check
that the regex is valid and an new idiom is created as follows for
cases where the pattern is known to be valid:

Regex::match(Pattern, String)

This new idiom is documented as returning false when the pattern is invalid.
Code using the old idiom is thus updated to use the new idiom.

A similar static convenience function was added for sub(). The constructor of Regex
was also extended to be able to return an error without requiring a subsequent
call to isValid(), for the sake of convenience and code reuse.

How about the following commit message:

(...)

I would suggest an amendment to this part:

To force developers to be mindful of this aspect, an assert is added to match() to check
that the regex is valid and an new idiom is created as follows for
cases where the pattern is known to be valid:

As-is, this patch doesn't assert inside match(), since this makes the API more backwards compatible. The wording of the commit message should be updated to match this.

It was originally a bit tricky to track down and update all users of the API, but the monorepo makes that a lot easier. If we brought back the older version of the commit that *did* assert inside match(), and we updated all affected users of the API (eg: clang) before committing, I wouldn't be opposed.

thopre added a comment.Jan 6 2020, 1:03 PM

How about the following commit message:

(...)

I would suggest an amendment to this part:

To force developers to be mindful of this aspect, an assert is added to match() to check
that the regex is valid and an new idiom is created as follows for
cases where the pattern is known to be valid:

As-is, this patch doesn't assert inside match(), since this makes the API more backwards compatible. The wording of the commit message should be updated to match this.

It was originally a bit tricky to track down and update all users of the API, but the monorepo makes that a lot easier. If we brought back the older version of the commit that *did* assert inside match(), and we updated all affected users of the API (eg: clang) before committing, I wouldn't be opposed.

Yes I think it should be done otherwise the change does not bring benefit besides an API familiarity to users of other languages (since it's not more concise and does not help catch errors).