This is an archive of the discontinued LLVM Phabricator instance.

Change llvm::Regex to expose a fallible constructor.
Needs ReviewPublic

Authored by dlj on Dec 6 2016, 8:38 AM.

Details

Summary

Currently, if llvm::Regex is constructed with an invalid pattern, it only
exposes this fact if the call site checks isValid(). The Regex class itself also
statefully maintains a (mutable) error code internally, which means its match()
function is not const.

This change adds a "fallible constructor" to llvm::Regex:
http://llvm.org/docs/ProgrammersManual.html#fallible-constructors
and then plumbs it through LLVM code. (Follow-up changes are prepared to update
uses in Clang.)

I plan to expand RegexError to handle the error values directly (see regerror.c)
and return it from match(), which will remove the last of the error state from
Regex, allowing match() to become const. Right away, though, this change deletes
the last vestiges of stateful error codes from the internal regex implementation
(regexec.c). This allows constructing cheap errors that require only the code,
and do not need to refer to the compiled regex. These error cases (REG_ATOI and
REG_ITOA) never happen in LLVM code, and are not POSIX-standard.

Event Timeline

dlj updated this revision to Diff 80426.Dec 6 2016, 8:38 AM
dlj retitled this revision from to Change llvm::Regex to expose a fallible constructor..
dlj updated this object.
dlj added reviewers: rsmith, llvm-commits.
vsk added a subscriber: vsk.Dec 6 2016, 6:34 PM

If the goal is to make sure regex errors are checked, we could avoid a substantial amount of API churn by asserting 'no error' in every method in Regex.

If you also want to make match() const, we'll need to get rid of the isValid method and the error field. That probably means adding an optional out-param to the Regex constructor for error descriptions.

dlj updated this revision to Diff 80631.Dec 7 2016, 11:30 AM
  • Add comments for unhandled errors that will assert-fail.
  • Augment error handling in LinePrinter.
  • Fix a broken regex in ARMScheduleR52.td (empty subexpressions are not allowed by our regex library).
dlj added a comment.Dec 7 2016, 11:31 AM
In D27463#615399, @vsk wrote:

If the goal is to make sure regex errors are checked, we could avoid a substantial amount of API churn by asserting 'no error' in every method in Regex.

That's the end goal I'm aiming for in the current change, but by a different route: for regexes compiled from static char[] constants, I'm just letting the Expected<Regex> propagate the error. It's a bit subtle, but I'll add comments. With asserts disabled, a failed compilation for a constant will simply look like it didn't match.

For "constants," I think this is the right thing. It limits API churn, since things that should never fail don't need to propagate an error, they just assert. For non-constants, I'm intending to propagate errors in whatever way is consistent.

(Perhaps unsurprisingly, there are, in fact, some broken regex literals in the code base.)

If you also want to make match() const, we'll need to get rid of the isValid method and the error field. That probably means adding an optional out-param to the Regex constructor for error descriptions.

Right, I was planning to eliminate the error field in a separate change. The Expected<> wrapper obviates the need for an out-parameter to the constructor, though. Anywhere an out-param would be used can use the Expected<> instead.

dlj added a comment.Dec 7 2016, 11:32 AM
In D27463#616162, @dlj wrote:
In D27463#615399, @vsk wrote:

If the goal is to make sure regex errors are checked, we could avoid a substantial amount of API churn by asserting 'no error' in every method in Regex.

That's the end goal I'm aiming for in the current change, but by a different route: for regexes compiled from static char[] constants, I'm just letting the Expected<Regex> propagate the error.

err, rather, *not* propagate the error -- and crash with asserts enabled.

vsk added a comment.Dec 7 2016, 12:01 PM

That's the end goal I'm aiming for in the current change, but by a different route: for regexes compiled from static char[] constants, I'm just letting the Expected<Regex> propagate the error.

err, rather, *not* propagate the error -- and crash with asserts enabled.

I think it'd be better to add in 'no error' asserts to the methods in Regex. The advantages are that (1) we can catch broken regexes / broken tests up front, instead of waiting for downstream projects to opt-in to the new constructor, (2) it doesn't require any code changes outside of the Regex class (except to fix broken code), and (3) it doesn't introduce any unchecked uses of Expected, which IMO is an abuse of the error handling API.

dlj updated this revision to Diff 80669.Dec 7 2016, 2:44 PM
  • Make Regex handle empty patterns without asserting an error.
dlj added a comment.Dec 7 2016, 2:44 PM
In D27463#616195, @vsk wrote:

That's the end goal I'm aiming for in the current change, but by a different route: for regexes compiled from static char[] constants, I'm just letting the Expected<Regex> propagate the error.

err, rather, *not* propagate the error -- and crash with asserts enabled.

I think it'd be better to add in 'no error' asserts to the methods in Regex. The advantages are that (1) we can catch broken regexes / broken tests up front, instead of waiting for downstream projects to opt-in to the new constructor,

So just to be clear... you're proposing a different breaking change, such that an invalid regex causes an assertion failure, instead of silently clamping to "no match?" That still cements the need for the error field, though; and it also seems to block making match() const -- the semantics of isValid would silently change, such that isValid() would stop reporting errors that arise during match().

The alternative proposal also has a different, more philosophical problem: it opens a broad window for validation escape in new uses of Regex. I already see plenty of rollbacks for things that only fail with asserts enabled. Without a stronger contract of "never match an invalid regex," it would just add another sharp edge, and I'm worried it would start to cut. Making the error check part of the type seems like a reasonable speed bump to ensure that the right assumptions are being made.

As for the broken users: I'm fixing the ones within the LLVM projects now. Once those patches are in, I can get rid of the constructor (probably needs a short intermediate state for out-of-tree users to notice, though). It seems better to do the cleanup work now instead of "kicking the can" on the stateful error.

(2) it doesn't require any code changes outside of the Regex class (except to fix broken code),

This smells slightly like a false economy to me... there aren't many call sites, and I'm finding bugs already. I get why we wouldn't want a ton of churn, but the end state seems worth it to me (i.e., Regex becomes const-correct and has no internal error state).

(3) it doesn't introduce any unchecked uses of Expected, which IMO is an abuse of the error handling API.

Unfortunately, there just isn't a concise, usable alternative. Code with existing error paths are fine, but otherwise a debug-only assertion crash is probably the best alternative. (In fact, that's probably what I would do in most of those cases, anyhow).

One example (but not the only example) are AST matchers: the macro matchers are basically painted into an API corner, so an assertion really seems like the best option. I could probably add DEBUG() checks, but at least they can explicitly clamp to false in a release build.

Class constructors are another (plentiful) example: if there's a reasonable way to defer errors, I can do that now; but in many cases there isn't, so signalling the failure *somehow* -- even if it's only with debug assertions -- seems strictly superior.

rsmith edited edge metadata.Jan 23 2017, 3:53 PM

Generally the new interface makes sense to me, but deliberately ignoring the error to provoke an assertion seems a bit subtle.

lib/IR/AutoUpgrade.cpp
98–103

Rather than adding this comment in various places, what do you think about also adding a static Regex Regex::compileKnownValid(StringRef) that encapsulates the compile + assert-if-not-valid logic?

chandlerc added inline comments.
include/llvm/Support/YAMLTraits.h
433–436

No need for braces IMO.

437–438

In addition to what Richard said, I would also try to use a pattern that ends up working more like:

if (auto MyMatcher = Regex::compile(...))
  return MyMatcher->match(...);
else
  llvm_unreachable("boom");

This will give us nice assert like behavior in assert builds and will actually kill the other edge in optimized builds.

lib/IR/DiagnosticInfo.cpp
45–54

Do you mind reflowing all of this to use early return so that there are less braces and indentation?

dlj updated this revision to Diff 87922.Feb 9 2017, 5:13 PM
  • Add comments for unhandled errors that will assert-fail, and augment error handling in LinePrinter.
  • Added compileKnownGood, plus examples of llvm_unreachable approach.
  • Minor style fixes.
dlj updated this revision to Diff 87923.Feb 9 2017, 5:15 PM
  • Style nits.
dlj marked an inline comment as done.Feb 9 2017, 5:15 PM
dlj added inline comments.
lib/IR/DiagnosticInfo.cpp
45–54

Better?

chandlerc added inline comments.Feb 9 2017, 5:31 PM
include/llvm/Support/YAMLTraits.h
437–438

I would just re-use compileKnownValid everywhere instead of having to inline the llvm_unreachable.

Haven't made it all the way through, but I think it would be helpful to apply some of the comments so far across all the changes here.

lib/Passes/PassBuilder.cpp
140–142

Is it important to use ExitOnError here? I'm thinkign something like compileKnownValid might be more appropriate.

lib/Support/SpecialCaseList.cpp
127–130

This isn't used any more. I think you can simplify the code a lot by just getting the error, and inverting the condition below.

157

This also seems like it could be compileKnownValid, as you have validated them above.

lib/Target/AArch64/Utils/AArch64BaseInfo.cpp
91–94

compileKnownValid again?

tools/llvm-cov/CodeCoverage.cpp
617–620

I still would prefer early error exiting here so:

auto RegexOrError = ...
if (!RegexORError) {
  ...
  return ...;
}
...
tools/llvm-pdbdump/LinePrinter.h
46–52

Use continue on error to reduce non-error indentation?