Page MenuHomePhabricator

[AST] Add a flag indicating if any subexpression had errors
Needs ReviewPublic

Authored by ilya-biryukov on Aug 1 2019, 10:27 AM.

Details

Reviewers
rsmith
Summary

The only subexpression that is considered an error now is TypoExpr, but
we plan to add expressions with errors to improve editor tooling on broken
code. We intend to use the same mechanism to guard against spurious
diagnostics on those as well.

See the follow-up revision for an actual usage of the flag.

Diff Detail

Unit TestsFailed

TimeTest
50 msLLVM.tools/llvm-ar::mri-utf8.test
Script: -- : 'RUN: at line 4'; rm -rf /mnt/disks/ssd0/agent/workspace/Phabricator/build/test/tools/llvm-ar/Output/mri-utf8.test.tmp && mkdir -p /mnt/disks/ssd0/agent/workspace/Phabricator/build/test/tools/llvm-ar/Output/mri-utf8.test.tmp/extracted
40 msLLVM.tools/llvm-objdump/X86::disassemble-functions.test
Script: -- : 'RUN: at line 4'; /mnt/disks/ssd0/agent/workspace/Phabricator/build/bin/yaml2obj -o /mnt/disks/ssd0/agent/workspace/Phabricator/build/test/tools/llvm-objdump/X86/Output/disassemble-functions.test.tmp.out /mnt/disks/ssd0/agent/workspace/Phabricator/llvm/test/tools/llvm-objdump/X86/Inputs/simple-executable-x86_64.yaml

Event Timeline

ilya-biryukov created this revision.Aug 1 2019, 10:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 1 2019, 10:27 AM
Herald added a subscriber: jfb. · View Herald Transcript

Gentle ping. @rsmith, please take a look when you have time.

Rather than adding a bit onto Expr to specify whether it's erroneous, have you considered making this a property of the type system by introducing an ErrorExpr AST node that other nodes can inherit from? I think that approach would work more naturally for things like AST matchers while solving some problems we have with being able to pretty printing or AST dump erroneous ASTs. The ErrorExpr node could contain a subnode of a partially-valid Expr object that would retain as much of the information as we've been able to gather, but the error node (or its subclasses) could contain other information useful when handling errors, such as storing the original source text (or range) for the expression, potential fixes, etc.

It seems that these two options are not exactly the same right ? The ContainsError bit is useful to quickly answer "Does this expression contains an invalid sub-expression" without doing the search, while adding an ErrorExpr node is useful to note that this sub-expression is invalid (and as Aaron says the hypothetical ErrorExpr node can carry more info about the error).

clang/include/clang/AST/Expr.h
1526

/*ContainsError=*/false here and elsewhere ?

3607

What is this dead2 ?

clang/lib/AST/ExprObjC.cpp
29

s/ContainsErros/ContainsErrors

It seems that these two options are not exactly the same right ? The ContainsError bit is useful to quickly answer "Does this expression contains an invalid sub-expression" without doing the search, while adding an ErrorExpr node is useful to note that this sub-expression is invalid (and as Aaron says the hypothetical ErrorExpr node can carry more info about the error).

Exactly right, thanks for summing this up!

Rather than adding a bit onto Expr to specify whether it's erroneous, have you considered making this a property of the type system by introducing an ErrorExpr AST node that other nodes can inherit from? I think that approach would work more naturally for things like AST matchers while solving some problems we have with being able to pretty printing or AST dump erroneous ASTs. The ErrorExpr node could contain a subnode of a partially-valid Expr object that would retain as much of the information as we've been able to gather, but the error node (or its subclasses) could contain other information useful when handling errors, such as storing the original source text (or range) for the expression, potential fixes, etc.

Having something similar to ErrorExpr is actually the next logical step I'm currently exploring. The idea of this bit is to carry information on whether any (recursive) child of an expression had an error that was already diagnosed.
Detecting those expression is useful to avoid producing additional irrelevant diagnostics (see the dependent revision that uses the bit to get rid of one such diagnostic) and avoid running various functions that are not prepared to handle subexpressions that have errors (e.g. constant evaluation).

It seems that these two options are not exactly the same right ? The ContainsError bit is useful to quickly answer "Does this expression contains an invalid sub-expression" without doing the search, while adding an ErrorExpr node is useful to note that this sub-expression is invalid (and as Aaron says the hypothetical ErrorExpr node can carry more info about the error).

That's true. I had figured that answering "does this expression contain an invalid sub-expression" could be implemented with a walk of the expression tree rather than consuming a bit. To consumers of containsErrors(), there shouldn't be much difference aside from performance (and that may be sufficient reason to go with a bit, but I think I'd like to see performance measurements that show the bit is necessary).

It seems that these two options are not exactly the same right ? The ContainsError bit is useful to quickly answer "Does this expression contains an invalid sub-expression" without doing the search, while adding an ErrorExpr node is useful to note that this sub-expression is invalid (and as Aaron says the hypothetical ErrorExpr node can carry more info about the error).

That's true. I had figured that answering "does this expression contain an invalid sub-expression" could be implemented with a walk of the expression tree rather than consuming a bit. To consumers of containsErrors(), there shouldn't be much difference aside from performance (and that may be sufficient reason to go with a bit, but I think I'd like to see performance measurements that show the bit is necessary).

Fair enough, especially given that IIRC there are not many bits left in these bit-fields classes.

ilya-biryukov added a comment.EditedAug 12 2019, 7:10 AM

It seems that these two options are not exactly the same right ? The ContainsError bit is useful to quickly answer "Does this expression contains an invalid sub-expression" without doing the search, while adding an ErrorExpr node is useful to note that this sub-expression is invalid (and as Aaron says the hypothetical ErrorExpr node can carry more info about the error).

That's true. I had figured that answering "does this expression contain an invalid sub-expression" could be implemented with a walk of the expression tree rather than consuming a bit. To consumers of containsErrors(), there shouldn't be much difference aside from performance (and that may be sufficient reason to go with a bit, but I think I'd like to see performance measurements that show the bit is necessary).

Are expression bits scarce?
We don't have any checks if expressions contain errors now, we simply drop too many invalid expressions and never put them into the AST.
It's impossible to do the measurements at this point, but it would be nice if adding those checks was cheap.

We can also assume they're cheap, use the visitor-based implementation and later switch if this turn out to be a problem.
I think we should prefer the approach that guarantees the compiler is not getting slow ever rather than chase the slowness later.

Fair enough, especially given that IIRC there are not many bits left in these bit-fields classes.

May be reasonable in that case, but we should be aware of a potential O(n^2) and even exponential complexities in the frontend arising from calling containsErrors in too many places in that case.
If we ever run out of bits, we could remove this bit and instead introduce a cache (something like set<Expr*> InvalidExprs) in ASTContext.

It seems that these two options are not exactly the same right ? The ContainsError bit is useful to quickly answer "Does this expression contains an invalid sub-expression" without doing the search, while adding an ErrorExpr node is useful to note that this sub-expression is invalid (and as Aaron says the hypothetical ErrorExpr node can carry more info about the error).

That's true. I had figured that answering "does this expression contain an invalid sub-expression" could be implemented with a walk of the expression tree rather than consuming a bit. To consumers of containsErrors(), there shouldn't be much difference aside from performance (and that may be sufficient reason to go with a bit, but I think I'd like to see performance measurements that show the bit is necessary).

Are expression bits scarce?
We don't have any checks if expressions contain errors now, we simply drop too many invalid expressions and never put them into the AST.
It's impossible to do the measurements at this point, but it would be nice if adding those checks was cheap.

Sort of... I went through the list of bit-fields classes and if I count correctly there is 4 (possibly 5) bits left in Stmt or Expr which can be used without doing anything else. I guess that using one of them is okay.

We can also assume they're cheap, use the visitor-based implementation and later switch if this turn out to be a problem.
I think we should prefer the approach that guarantees the compiler is not getting slow ever rather than chase the slowness later.

Fair enough, especially given that IIRC there are not many bits left in these bit-fields classes.

May be reasonable in that case, but we should be aware of a potential O(n^2) and even exponential complexities in the frontend arising from calling containsErrors in too many places in that case.
If we ever run out of bits, we could remove this bit and instead introduce a cache (something like set<Expr*> InvalidExprs) in ASTContext.

It seems that these two options are not exactly the same right ? The ContainsError bit is useful to quickly answer "Does this expression contains an invalid sub-expression" without doing the search, while adding an ErrorExpr node is useful to note that this sub-expression is invalid (and as Aaron says the hypothetical ErrorExpr node can carry more info about the error).

That's true. I had figured that answering "does this expression contain an invalid sub-expression" could be implemented with a walk of the expression tree rather than consuming a bit. To consumers of containsErrors(), there shouldn't be much difference aside from performance (and that may be sufficient reason to go with a bit, but I think I'd like to see performance measurements that show the bit is necessary).

Are expression bits scarce?
We don't have any checks if expressions contain errors now, we simply drop too many invalid expressions and never put them into the AST.
It's impossible to do the measurements at this point, but it would be nice if adding those checks was cheap.

We can also assume they're cheap, use the visitor-based implementation and later switch if this turn out to be a problem.
I think we should prefer the approach that guarantees the compiler is not getting slow ever rather than chase the slowness later.

The problem is: those bits are not infinite and we only have a handful left until bumping the allocation size; is this use case critical enough to use one of those bits? I don't think it will be -- it seems like premature optimization. Also, by calculating rather than using a bit, you don't have to touch every Expr constructor, which reduces the complexity of the patch.

Some other things I think are missing from the patch (regardless of whether you go with a bit or calculate on the fly):

  • Do you need some changes to AST serialization and deserialization? Does anything special need to happen for modules?
  • I would expect to see this information reflected in an AST dump
  • How should this impact AST matching interfaces?
  • Test cases

The problem is: those bits are not infinite and we only have a handful left until bumping the allocation size; is this use case critical enough to use one of those bits? I don't think it will be -- it seems like premature optimization. Also, by calculating rather than using a bit, you don't have to touch every Expr constructor, which reduces the complexity of the patch.

Alternatively, we could keep an equivalent of set<Expr*> InvalidExprs in ASTContext. Gives the same computational complexity, but has a higher constant overhead factor.
Does that look reasonable?

Some other things I think are missing from the patch (regardless of whether you go with a bit or calculate on the fly):

  • Do you need some changes to AST serialization and deserialization?

Good point, will update the patch.

  • Does anything special need to happen for modules?

Not sure. What are the potential problems you foresee?

  • I would expect to see this information reflected in an AST dump

Good point. Will do. Although it's a little hard to test in this patch, since it's hard to catch a TypoExpr in the AST dump.

  • How should this impact AST matching interfaces?

We could add a matcher that filters on this flag, but I would start with adding more expressions first (something similar to ErrorExpr);
For the purposes of this patch, I'd keep the matcher interfaces untouched.

  • Test cases

Again, since it's hard to catch a TypoExpr in the final AST dump, it's hard to catch this bit. See the dependent revision for a bogus diagnostic not being emitted anymore.

The problem is: those bits are not infinite and we only have a handful left until bumping the allocation size; is this use case critical enough to use one of those bits? I don't think it will be -- it seems like premature optimization. Also, by calculating rather than using a bit, you don't have to touch every Expr constructor, which reduces the complexity of the patch.

Alternatively, we could keep an equivalent of set<Expr*> InvalidExprs in ASTContext. Gives the same computational complexity, but has a higher constant overhead factor.
Does that look reasonable?

Yup, that is also very reasonable.

Some other things I think are missing from the patch (regardless of whether you go with a bit or calculate on the fly):

  • Do you need some changes to AST serialization and deserialization?

Good point, will update the patch.

  • Does anything special need to happen for modules?

Not sure. What are the potential problems you foresee?

I'm not certain there are real problems there, but I am wondering whether something like a BMI should include serialized error AST nodes or not. Would a consumer of the module expect that? I suppose it could if we wanted it to.

  • I would expect to see this information reflected in an AST dump

Good point. Will do. Although it's a little hard to test in this patch, since it's hard to catch a TypoExpr in the AST dump.

Ah, drat, I was hoping we had at least one test that already did this, but I don't see one.

  • How should this impact AST matching interfaces?

We could add a matcher that filters on this flag, but I would start with adding more expressions first (something similar to ErrorExpr);
For the purposes of this patch, I'd keep the matcher interfaces untouched.

I think that makes sense. The sort of things I'm wondering about are: if we are going to start retaining error nodes in the AST more regularly, should AST matchers opt into matching on those nodes, or opt out of matching on them? I think the answer should be that we only AST match on valid AST nodes, but I could see arguments being made either way, so maybe this part needs an RFC for more opinions.

  • Test cases

Again, since it's hard to catch a TypoExpr in the final AST dump, it's hard to catch this bit. See the dependent revision for a bogus diagnostic not being emitted anymore.

Yeah, I was hoping we'd have some AST dumping mechanism for testing this. If not, though, perhaps we could still use a unit test?

I did some experiments with adding something similar to the ErrorExpr Aaron suggest. It has this flag set and does not get removed from the AST.
I believe the best way to address Aaron's comments is to add tests mentioning this expression instead of trying to catch TypoExpr, which normally get removed from the AST.

Would need some time to fix a few remaining test failures (bogus diagnostics), I'll update this patch when I have something (hopefully tomorrow).

  • Add the added bit to serialization
  • Mention contains-errors in the AST dump. Still not tests in this revision, see D69330 for an expression that is actually preserved and has this bit.

Build result: fail - 59344 tests passed, 2 failed and 812 were skipped.

failed: LLVM.tools/llvm-ar/mri-utf8.test
failed: LLVM.tools/llvm-objdump/X86/disassemble-functions.test

Log files: cmake-log.txt, ninja_check_all-log.txt, CMakeCache.txt