This is an archive of the discontinued LLVM Phabricator instance.

[AST] Add RecoveryExpr; produce it on overload resolution failure and missing member.
Needs ReviewPublic

Authored by sammccall on May 9 2019, 3:48 AM.

Details

Summary

In many places, Sema refuses to create an Expr if it encounters an error.
This means that broken code has no representation in the AST. As well as the broken node itself, child Exprs and parent Exprs are usually dropped.

This is terrible for tools that rely on the AST and see a lot of broken code (e.g. libclang-based IDEs and clangd).
In the expression foo(a, takes2args(b)), "go to definition" doesn't work on any of the subexpressions. And after takes2args(x)., member completion doesn't work, although in practice we almost always know the type.

This patch introduces RecoveryExpr. This AST node represents broken code, and has very weak semantics. It's part of the final AST (unlike TypoExpr) but can't produce code, as it's always accompanied by an error.
It captures valid subexpressions without checking or assigning them meaning.
Some RecoveryExprs have a known type (e.g. a broken non-overloaded call).
Where the type is unknown, it is modeled as the dependent RecoveryTy.
This allows us to reuse most existing behavior for suppressing further checks.
RecoveryTy is only used in C++, in C we don't recover unless the type is known.

Initiallly, RecoveryExpr is emitted in two common cases:

  • a function call where overload resolution failed (most typically: wrong args) Callee is captured as an UnresolvedLookupExpr, and args are captured. Type is available if the "best" candidates have the same return type.
  • access of a nonexistent member Base expression is captured, type is never available.

Test changes:

  • SemaCXX/enable_if.cpp: we now emit more detailed diagnostics (the non-constexpr subexpression is inside a function call), which breaks the heuristic that tries to move the whole diagnostic to the subexpression location. This case (IMO) doesn't matter much: the subexpression is invalid, flagging it as non-constexpr isn't very useful to start with.
  • SemaTemplate/instantiate-function-params.cpp: it looks like the testcase was minimized in an unrealistic way: the only version of if_ doesn't have type and the only version of requirement_ doesn't have failed, and it seems reasonable to diagnose this. If the precise invalid form is required, I can add another 17 expect-*s to the test, or remove -verify so we only fail on crashing. Original patch adding this test is very terse: https://github.com/llvm/llvm-project/commit/5112157958438005095aff805853f9b14ba974eb Testcase subsequently modified to test for a crash: https://github.com/llvm/llvm-project/commit/a02bb341552b2b91c93e708645c32d11cc4133d2

Diff Detail

Event Timeline

sammccall created this revision.May 9 2019, 3:48 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 9 2019, 3:48 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
sammccall updated this revision to Diff 198786.May 9 2019, 4:07 AM

Fix serialization maybe.

sammccall updated this revision to Diff 199678.May 15 2019, 2:11 PM

This should work for real now: handle types properly, make tests pass, add some tests.

sammccall retitled this revision from Prototype of RecoveryExpr. Here, it's emitted when overload resolution fails. to [AST] Add RecoveryExpr; produce it on overload resolution failure and missing member..May 15 2019, 2:15 PM
sammccall edited the summary of this revision. (Show Details)
sammccall added a reviewer: rsmith.
sammccall added a subscriber: ilya-biryukov.
erik.pilkington added a subscriber: erik.pilkington.
erik.pilkington added inline comments.
include/clang/AST/BuiltinTypes.def
265

Why are you creating a new type as opposed to just using DependentTy (sorta like TypoExpr does)? It seems like if you want to recycle all the dependence-propagating code in Sema, then you need to fall back to DependentTy anyways, i.e. 1 + <recovery-expr> will have dependent type with this patch, right?

lib/Sema/SemaOverload.cpp
12248

Have you also considered handling this like delayed typos, where we try to TreeTransform into different possible recoveries later, in order to find out what the best fix is? You might be able to make a better guess as to what the right function to pick (or how to recover in general) is if you can see how the recovery would be used, but deciding here means you have a lot less information.

I expect we'll want a ContainsErrors bit on Stmt, propagated similarly to the existing InstantiationDependent / ContainsUnexpandedParameterPack / ... bits. For example, the constant evaluator will want to quickly bail out of evaluating expressions and statements containing errors, and the error recovery TreeTransform that we perform to fix typos will want that too (and maybe could be able to fix other kinds of errors for which we build these error nodes eventually?).

include/clang/AST/BuiltinTypes.def
265

Using DependentTy for TypoExpr is a mistake; it leads to all sorts of bad follow-on diagnostics incorrectly claiming that constructs have dependent types.

Rather than calling this RecoveryTy, I'd prefer to call it something a bit more general like ErrorTy, with the intent being that we eventually use it for TypoExpr too.

include/clang/AST/Expr.h
5801–5809

Do we need this at all, if we have a properly-propagated ErrorTy anyway? Instead of using this, it would seem that we could model an ill-formed expression as the corresponding AST node but with its type set to ErrorTy. Presumably the latter is what we'll do when we find a subexpression containing an error, rather than creating a RecoveryExpr at every enclosing syntactic level, so AST clients will need to deal with that anyway.

include/clang/AST/Type.h
2423–2424

Experience with TypoExpr suggests that treating non-dependent constructs as dependent is a mistake in the long term.

sammccall marked 3 inline comments as done.May 16 2019, 2:51 AM

Thanks for the useful insights!

I'm going to work on a version of this patch that doesn't rely on type dependency and treats errors as a new concept instead.
I'm not convinced it's feasible to drop RecoveryExpr and reuse existing nodes, so I'll keep that at least for now. (I tried and failed, and don't have new ideas there - more detail below)

I expect we'll want a ContainsErrors bit on Stmt, propagated similarly to the existing InstantiationDependent / ContainsUnexpandedParameterPack / ... bits.

This sounds good (as an alternative to using dependent bits for this).

Do you think we want a similar bit on Type (set on e.g. vector<ErrorTy>) , or should be ensure ErrorTy never gets used for compound types?

For example, the constant evaluator will want to quickly bail out of evaluating expressions and statements containing errors, and the error recovery TreeTransform that we perform to fix typos will want that too (and maybe could be able to fix other kinds of errors for which we build these error nodes eventually?).

Definitely possible, I don't know whether it's worth it or not:

  1. errors that can be recovered eagerly: They are, and RecoveryExpr or equivalent never kicks in here. This seems ideal.
  2. errors that can never be recovered: returned as RecoveryExpr or equivalent. This is a difference from current TypoExpr which never does this.
  3. errors that must be recovered lazily: we could e.g. add TypoExpr-style recovery callbacks to RecoveryExpr and merge the two concepts. I don't have a clear idea of how large/important this category is compared to 1 though.
include/clang/AST/BuiltinTypes.def
265

Why are you creating a new type as opposed to just using DependentTy (sorta like TypoExpr does)?

Indeed it's probably redundant in this form of the patch. While I was still trying to get this to work in C mode, I was checking for RecoveryTy instead of isDependent() in various places. (Maybe that approach would always miss cases, as you suggest)

(Obviously if we *don't* reuse the concept of dependent types, then some new type/type concepts are needed.)

include/clang/AST/Expr.h
5801–5809

properly-propagated ErrorTy

An important part here (especially for code completion, but not only) is that there are RecoveryExprs where the real type is known. So ErrorTy doesn't mean "subexpression has errors".

The motivating case is:

string foo(param1, param2, param3, param4);
foo(param1, param2, param3);

Here foo has type string, not ErrorTy.


But we can certainly add a bit to Stmt for "subexpression has errors", so the question stands:

as the corresponding AST node but [with the HasError bit set]. Presumably the latter is what we'll do when we find a subexpression containing an error, rather than creating a RecoveryExpr at every enclosing syntactic level, so AST clients will need to deal with that anyway.

Yes. I did try this and we debated it a lot. Somehow I forgot to mention this central design question in the patch description :-(

It certainly preserves more information/structure, and allows existing code to work with broken nodes.
However:

  • it breaks existing invariants, so a large fraction of consuming code needs to be modified but it's not obvious how. By comparison, most consumers are robust to adding a node type (or it'll break the compile in an obvious way with -Wswitch etc).
  • every time we want to add more recovery, we break new invariants. Whereas code that handles RecoveryExpr/ErrorTy can be pretty generic.
  • code tends to be partitioned by node type, so making errors a separate node type (mostly) isolates the complexity of error handling. Adding error conditions to many node types means this complexity is spread throughout consuming code.
  • it's particularly important that error-recovery code be "correct by construction" because test coverage of all error-recovery situations is very hard. Using the type system helps here.

In practice with this approach I wasn't able to fix the crashes either in clang or client code in any principled way, and I didn't have any confidence that the code was going to be correct even if I got the tests to pass.

include/clang/AST/Type.h
2423–2424

Experience with TypoExpr suggests that treating non-dependent constructs as dependent is a mistake in the long term.

It's certainly a tradeoff. Reusing dependent types isn't quite accurate, but adding a separate concept complicates the model a lot I think. Could be that

  • TypoExpr should have used a separate concept
  • dependent types were the best option, the alternatives would have been even worse
  • TypoExpr never should have happened, as both alternatives are too bad

Certainly this is a chance to try the other approach.

I expect we'll want a ContainsErrors bit on Stmt, propagated similarly to the existing InstantiationDependent / ContainsUnexpandedParameterPack / ... bits.

This sounds good (as an alternative to using dependent bits for this).

Do you think we want a similar bit on Type (set on e.g. vector<ErrorTy>) , or should be ensure ErrorTy never gets used for compound types?

The simplest thing to do might be to ensure that vector<ErrorTy> is itself ErrorTy. There might be some cases where we can get better recovery by keeping a more precise type (eg, maybe we can infer something from an operation on ErrorTy* that we couldn't infer from merely ErrorTy), but I suspect the added complexity isn't worthwhile.

For example, the constant evaluator will want to quickly bail out of evaluating expressions and statements containing errors, and the error recovery TreeTransform that we perform to fix typos will want that too (and maybe could be able to fix other kinds of errors for which we build these error nodes eventually?).

Definitely possible, I don't know whether it's worth it or not:

  1. errors that can be recovered eagerly: They are, and RecoveryExpr or equivalent never kicks in here. This seems ideal.
  2. errors that can never be recovered: returned as RecoveryExpr or equivalent. This is a difference from current TypoExpr which never does this.
  3. errors that must be recovered lazily: we could e.g. add TypoExpr-style recovery callbacks to RecoveryExpr and merge the two concepts. I don't have a clear idea of how large/important this category is compared to 1 though.

Well, typo correction would be in case 1: we can recover from typos eagerly, but we choose to defer them because we might be able to figure out a better recovery based on looking at context that we don't have yet. The same is probably true for a lot of our error recovery, but it's probably not worth it most of the time. So I think it's more of a question of balancing when it's worthwhile to save enough state to recover.

What kinds of cases are you thinking of that would be in your second category? The examples you give for RecoveryExpr (a call that can't be resolved and a member access that can't be resolved) are *already* in case 3 when the reason they can't be resolved is due to a typo within them. We model that today by treating them as dependent in the initial parse, but that's problematic and we will want to use whatever replacement mechanism we come up with here instead.

Looking at your current uses for RecoveryExpr, I'm not sure whether adding a new AST node is the right model:

a function call where overload resolution failed (most typically: wrong args) Callee is captured as an UnresolvedLookupExpr, and args are captured. Type is available if the "best" candidates have the same return type.

I would assume that we would represent a function call for which the callee or an argument contains an error as a CallExpr with the "contains error" flag set and with ErrorTy as its type. Could we use the same representation here? (That is, represent this case as a CallExpr whose type is ErrorTy with the "contains error" flag set, with no RecoveryExpr.)

access of a nonexistent member Base expression is captured, type is never available.

Using UnresolvedMemberExpr for this case might make sense. I think we should think about how we'd represent a member access that can't be fully analyzed because the object expression contains errors, and think about whether it makes sense to use that same representation for the case where the error is immediately in forming the member access itself. (And similarly across all other kinds of recovery expression we might create.)

If we want to enable (eg) AST matchers to operate on erroneous AST, using a homogeneous RecoveryExpr for all different syntaxes that originate errors seems like it would introduce complexity, especially if all consumers of the AST already need to deal with the case that some existing AST node is marked as being erroneous so don't get any benefit from having a concrete RecoveryExpr as a known stopping point. (If you get there, you've probably already gone too far.)

The simplest thing to do might be to ensure that vector<ErrorTy> is itself ErrorTy.

You're probably right, I'll try this.

For example, the constant evaluator will want to quickly bail out of evaluating expressions and statements containing errors

Thinking more - how important is this? These expressions are expected to be rare.
Discussing offline with @gribozavr, it seemed likely we could get away with the "errorness" of an expr being a local property, which would be simplifying. (Whereas the errorness of a type needs to be a transitive one).

What kinds of cases are you thinking of that would be in your second category?

Incomplete code, either new code or during refactoring

  • not enough arguments
  • symbol not found due to missing #include
  • member not added yet
  • function not defined yet

In an IDE context, it's *much* more important that diagnostics are useful and AST nodes aren't discarded, than that recovery correctly "fixes" the AST to have the intended semantics.

From that point of view, typo recovery/TypoExpr spends too much complexity budget on accurate recovery, rather than generality. RecoveryExpr is a different part of the design space for sure.

I would assume that we would represent a function call for which the callee or an argument contains an error as a CallExpr with the "contains error" flag set and with ErrorTy as its type.

Almost:

  • if the argument has an error, but its type is still known, then overload resolution runs as normal and this CallExpr doesn't itself have an error. (If there's a transitive HasError bit on Expr, it would be set).
  • if the argument has an error and has ErrorTy, overload resolution may still succeed (depending on the rules we choose for ErrorTy). In which case again, the CallExpr itself has no error.
  • if the argument has an error and has ErrorTy, and this causes overload resolution to fail, but all (best) overloads have the same type, we get a RecoveryExpr/CallExpr-with-error with a real type
  • if the argument has an error and has ErrorTy, and this causes overload resolution to fail, and the type is unknown, then we get a RecoveryExpr/CallExpr-with-error with ErrorTy.

Could we use the same representation here? (That is, represent this case as a CallExpr whose type is ErrorTy with the "contains error" flag set, with no RecoveryExpr.)

We can, but it breaks a lot of client code. These two cases are pretty different for clients:

  • for an otherwise-valid CallExpr where one argument has errors, all the *local* invariants are preserved.
  • for a CallExpr where callee is null, or the #args don't match the function, the client code often crashed or did the wrong thing.

My view is that we're fundamentally better having two types to keep the guarantees around CallExpr stronger. That we want matchers to "match through" RecoveryExpr, but callExpr shouldn't match here.
@ilya-biryukov disagreed and thought the main virtue of RecoveryExpr is backwards-compatibility.
Either way, I'm not optimistic about being able to get recovery changes to stick under this model. I'm willing to try if you feel strongly.

access of a nonexistent member Base expression is captured, type is never available.

Using UnresolvedMemberExpr for this case might make sense. I think we should think about how we'd represent a member access that can't be fully analyzed because the object expression contains errors, and think about whether it makes sense to use that same representation for the case where the error is immediately in forming the member access itself. (And similarly across all other kinds of recovery expression we might create.)

I agree it's important to be consistent here. It's hard work to be consistent and also precise!
Using RecoveryExpr for both cases seems sufficient to me, and one of its advantages is that it avoids having to make lots of hard decisions one-by-one, many of which will be wrong :-)

But that aside, I agree the best way to model that in the existing AST is make UnresolvedMemberExpr available in C, allow it to have no candidate decls, etc.
For the object expression, the error can be nested arbitrarily deep and I think we just have to apply whatever usual "subexpression-has-errors" strategy to UnresolvedMemberExpr and MemberExpr.

If we want to enable (eg) AST matchers to operate on erroneous AST, using a homogeneous RecoveryExpr for all different syntaxes that originate errors seems like it would introduce complexity, especially if all consumers of the AST already need to deal with the case that some existing AST node is marked as being erroneous so don't get any benefit from having a concrete RecoveryExpr as a known stopping point. (If you get there, you've probably already gone too far.)

Based on experiments, I think the desired behavior for ASTMatchers is that callExpr() wouldn't match a broken call, but descendant/ancestor matching would still traverse "through" broken calls.
This gets most of the possible benefit for little cost: matchers can "for free" match the rest of the expression tree that gets dropped today, but e.g. existing clang-tidy checks won't bind directly to the broken parts of the AST.
Matchers/options to opt-in can be added. e.g. brokenCallExpr(...) or recoveryExpr(attemptedExpr(CallExpr), ...) as the case may be.

I thought @klimek was on board with this approach, but he can correct me if not...