This is an archive of the discontinued LLVM Phabricator instance.

[clang][USR] Prevent crashes on incomplete FunctionDecls
ClosedPublic

Authored by kadircet on May 3 2023, 2:37 AM.

Details

Summary

FunctionDecls can be created with null types (D124351 added such a new
code path), to be filled in later. But parsing can stop before
completing the Decl (e.g. if code completion
point is reached).
Unfortunately most of the methods in FunctionDecl and its derived
classes assume a complete decl and don't perform null-checks.
Since we're not encountring crashes in the wild along other code paths
today introducing extra checks into quite a lot of places didn't feel
right (due to extra complexity && run time checks).
I believe another alternative would be to change Parser & Sema to never
create decls with invalid types, but I can't really see an easy way of
doing that, as most of the pieces are structured around filling that
information as parsing proceeds.

Diff Detail

Event Timeline

kadircet created this revision.May 3 2023, 2:37 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2023, 2:37 AM
Herald added a subscriber: arphaman. · View Herald Transcript
kadircet requested review of this revision.May 3 2023, 2:37 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMay 3 2023, 2:37 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
ilya-biryukov added a subscriber: erichkeane.EditedMay 3 2023, 2:59 AM

It's true that FunctionType having a null type does break a lot of even its own methods,
e.g. even something as simple as isVariadic will dereference a null pointer as it uses getType()->getAs<FunctionType>().

I am not entirely sure what the contract there should be either.

  • Is it the client code that's responsible for checking the null type or methods of FunctionDecl?
  • Would it be easier to avoid creating the functions altogether? (I remember seeing some legitimate use-cases for null types there, but I wonder if those could be modelled differently).

@aaron.ballman, @erichkeane what is Clang's stance on those? How should we proceed here?

FWIW, https://github.com/llvm/llvm-project/blob/main/clang/lib/AST/Decl.cpp#L2987 is the assertion that suggests "null type is fine for functiondecls", but as pointed out pretty much all of the methods assume it's non-null && most of the users also don't check for validness before use.
so my 2cents, in the past this intermediate state of a functiondecl wasn't observable outside, but overtime that "constraint" got violated in various places.

Hmm... So I don't see us being able to change how we create FunctionDecls, the way we are doing it with a 'build up to the final thing' is about all we can do. However, it should never escape its 'creation' functions without its type set.

IF I were to design this from scratch, I probably would have designed a FunctionDeclBuilder type thing that modeled a FunctionDecl, but asserted when being finalized (to get the FunctionDecl from it) if it wasn't complete. However, we don't really have anything that models that in this codebase, and I'm not sure we really want to add something like that...

I don't recall the case of a Null type being valid for functions (perhaps Aaron does? Or perhaps its an error condition?). But otherwise, I would expect FunctionDecl to have a type as soon as it escapes the function responsible for 'making' it (that is, the one that calls the ::Create and is filling in everything). If that patch added a path that doesn't set the return type, perhaps we should just fix that?

I don't recall the case of a Null type being valid for functions (perhaps Aaron does? Or perhaps its an error condition?). But otherwise, I would expect FunctionDecl to have a type as soon as it escapes the function responsible for 'making' it (that is, the one that calls the ::Create and is filling in everything).

I guess that's where the issue comes from. In theory Parser::ParseLambdaExpressionAfterIntroducer calls both Sema::ActOnLambdaExpressionAfterIntroducer (which creates incomplete method decl) and Sema::ActOnStartOfLambdaDefinition (which populates type info), so once we exit the parse function the methoddecl should have a valid type.
But the way clang implements code completion breaks that guarantee, as parser can hand out those semi-complete declarations to code completion consumers if code completion point is reached while parsing the tokens in between.
More importantly I don't know if this "variant" holds in all the places that creates a function decl (even in this example we can actually take an early exit but we at least mark the decl invalid before doing so).

If that patch added a path that doesn't set the return type, perhaps we should just fix that?

it's not just the return type, it's the full type for the function, see here (line 919 of clang/lib/Sema/SemaLambda.cpp in case the link doesn't work).
due to above mentioned reasons, i don't think there's anything wrong with that patch in particular (either it's fine or we've a bunch more cases that are wrong and hard to fix due to the first reason you mentioned), it just separated introduction and finalisation of a decl. https://github.com/llvm/llvm-project/blob/main/clang/lib/Sema/SemaDeclCXX.cpp#L14469 is another example (and we've got bunch more).

Adding in some of the clang-repl folks because they are likely to run into this as well, and adding Richard in case he has more historical context.

What I understand of the historical context here is that we've assumed very early on that inspecting state of partially-constructed AST nodes only happens when debugging (through things like calls to dump(), etc). So it was assumed that having a partially constructed object which is updated as we continue parsing and semantic analysis is fine because the only way to break the invariant is from the debugger. However, over time, we've invalidated that assumption more and more (REPL, libclang, etc) and now we're paying the price.

I think that refactoring the way we construct AST nodes so that they're not in a partially-constructed state by the time we've called Create() is a nonstarter; I think we've got too much reliance built up on that pattern. For example, we do CreateEmpty() + building up state as a matter of course when deserializing the AST for PCH or modules. However, I'm not keen on us playing whack-a-mole with the kinds of checks from this review. For starters, that's going to have a long-tail that makes it hard to know if we've ever gotten to the bottom of the issue. But also, each one of these checks is basically useless for the primary way in which Clang is consumed (as a compiler), so this increases compile times for limited benefit to "most" users. In this particular case, we may be fine as this is limited to libclang and so it shouldn't add overhead for the common path, but I suspect we're going to find cases that need fixing in more common areas of the compiler that could be more troublesome.

I wish I had a good answer for how to address this, but I don't have one off the top of my head. About the closest I can think of is to use a bit on the declaration to say "I am being worked on still" and add a Finalize() method to say "I am no longer being worked on" and perform sanity checks and a getter method to tell us if the AST node is still under construction or not. That gives us a generic way to quickly test "should we be inspecting this further" in circumstances where it matters and there is precedent with how we handle parsing declaration specifiers (https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Sema/DeclSpec.h#L844), but this isn't all that much better than ad hoc tests like checking it the type is null or not (so I'm not suggesting this is the best way to solve the problem, just spitballing ideas).

However, I'm not keen on us playing whack-a-mole with the kinds of checks from this review. For starters, that's going to have a long-tail that makes it hard to know if we've ever gotten to the bottom of the issue. But also, each one of these checks is basically useless for the primary way in which Clang is consumed (as a compiler), so this increases compile times for limited benefit to "most" users.

I completely agree, that's the reason why I've stayed away from adding those checks to various FunctionDecl helpers (isVariadic, params, etc.).

In this particular case, we may be fine as this is limited to libclang and so it shouldn't add overhead for the common path, but I suspect we're going to find cases that need fixing in more common areas of the compiler that could be more troublesome.

Agreed, and I am also pretty sure this is not the only place that can be affected from incomplete decls/types. But this is the only one showing up quite frequently ever since changes to lambda parsing.
I think there's some strategy decision to be made about clang's invariants:

  • whether to accept declarations/types can be inspected in the middle of parsing as a new constraint, declare all the existing violations as bugs and fix them as we go (without introducing new ones, which is quite hard) and give people the means to construct ast nodes "safely".
  • claim that variants are WAI and it's on use cases that perform such inspections to figure out how to deal with consequences (e.g. in code-completion consumers).

But in either case, I don't think this review is the right medium to make that decision. Surely it contains a lot of useful pointers, and I am happy to move them to a discourse thread (LMK if I should do it, or you'd like to kick it off @aaron.ballman) to raise awareness, but in the end this review is just trying to fix an issue by adding extra checks to only the applications that can violate contracts of clang parsing. So unless we've specific concerns about the issue being addressed in this patch, I'd like to move forward.

However, I'm not keen on us playing whack-a-mole with the kinds of checks from this review. For starters, that's going to have a long-tail that makes it hard to know if we've ever gotten to the bottom of the issue. But also, each one of these checks is basically useless for the primary way in which Clang is consumed (as a compiler), so this increases compile times for limited benefit to "most" users.

I completely agree, that's the reason why I've stayed away from adding those checks to various FunctionDecl helpers (isVariadic, params, etc.).

In this particular case, we may be fine as this is limited to libclang and so it shouldn't add overhead for the common path, but I suspect we're going to find cases that need fixing in more common areas of the compiler that could be more troublesome.

Agreed, and I am also pretty sure this is not the only place that can be affected from incomplete decls/types. But this is the only one showing up quite frequently ever since changes to lambda parsing.
I think there's some strategy decision to be made about clang's invariants:

  • whether to accept declarations/types can be inspected in the middle of parsing as a new constraint, declare all the existing violations as bugs and fix them as we go (without introducing new ones, which is quite hard) and give people the means to construct ast nodes "safely".
  • claim that variants are WAI and it's on use cases that perform such inspections to figure out how to deal with consequences (e.g. in code-completion consumers).

But in either case, I don't think this review is the right medium to make that decision. Surely it contains a lot of useful pointers, and I am happy to move them to a discourse thread (LMK if I should do it, or you'd like to kick it off @aaron.ballman) to raise awareness, but in the end this review is just trying to fix an issue by adding extra checks to only the applications that can violate contracts of clang parsing. So unless we've specific concerns about the issue being addressed in this patch, I'd like to move forward.

I think we probably should have a broader discussion before moving forward here. It's not that this isn't incremental progress fixing an issue, but it's more that this same justification works to add the workaround 200 more times without ever addressing the underlying architectural concerns.

That said, is this issue blocking further work so you need to land this in order to make forward progress elsewhere?

I think we probably should have a broader discussion before moving forward here. It's not that this isn't incremental progress fixing an issue, but it's more that this same justification works to add the workaround 200 more times without ever addressing the underlying architectural concerns.

I can see the desire to fix such issues at a higher level, and not wanting to play whack-a-mole in various places. But that's the reason why this patch specifically increases robustness in a piece of code that already has some logic to deal with invalid code. Rather than core clang pieces, which has different assumptions about invariants of the AST (e.g. not adding the null-check to every getType call in FunctionDecl methods).
If we're not willing to land such fixes that only add relevant checks to pieces of clang that's suppose to be more robust against invalid code, I don't see how we can have any stable tooling in the short term. Most of the feature work that introduces handling for new language constructs introduces regressions on invalid code as it's not really a concern and never verified. Hence we fix those afterwards as we notice them, fixing the implementation whenever it's feasible or increasing robustness in rest of the pieces when fixing the implementation is infeasible (or implementation is just right and the application was relying on some assurances that were incidentally there before). As we happen to hit some increasing resistance towards these robustness improvement patches, it'd be nice to understand what should be the way to fix them going forward. e.g. treat them as usual crashes, find the offending patch and just revert it with the reproducer?

That said, is this issue blocking further work so you need to land this in order to make forward progress elsewhere?

This is resulting in crashes in clangd whenever the users trigger code completion in such code patterns (which is ~100s times a day in our production setup, so not rare in practice).
So it isn't linked to any bigger task, but rather general robustness/QoL improvements for invalid code handling in clang(d) which are quite important especially on big projects. As a crash could cost minutes of warm-up times because we need to rebuild all of the caches.

aaron.ballman accepted this revision.May 15 2023, 7:10 AM

I think we probably should have a broader discussion before moving forward here. It's not that this isn't incremental progress fixing an issue, but it's more that this same justification works to add the workaround 200 more times without ever addressing the underlying architectural concerns.

I can see the desire to fix such issues at a higher level, and not wanting to play whack-a-mole in various places. But that's the reason why this patch specifically increases robustness in a piece of code that already has some logic to deal with invalid code. Rather than core clang pieces, which has different assumptions about invariants of the AST (e.g. not adding the null-check to every getType call in FunctionDecl methods).
If we're not willing to land such fixes that only add relevant checks to pieces of clang that's suppose to be more robust against invalid code, I don't see how we can have any stable tooling in the short term. Most of the feature work that introduces handling for new language constructs introduces regressions on invalid code as it's not really a concern and never verified. Hence we fix those afterwards as we notice them, fixing the implementation whenever it's feasible or increasing robustness in rest of the pieces when fixing the implementation is infeasible (or implementation is just right and the application was relying on some assurances that were incidentally there before). As we happen to hit some increasing resistance towards these robustness improvement patches, it'd be nice to understand what should be the way to fix them going forward. e.g. treat them as usual crashes, find the offending patch and just revert it with the reproducer?

On the one hand, I agree with you about the fact that this code is already trying to be robust against invalid code and thus it's reasonable to add more checks for that. No argument from me about that! But at the same time, it's become more obvious (at least to me) that clangd has features that don't work with all of the invariants in Clang and I don't know that we ever really stopped to figure out whether that's reasonable or not. That's why I think we need a broader discussion. The problem is not ideological, it's one of maintainability of the primary product. For example, the community could decide "it is not Clang's job to be resilient to this sort of thing". Or we could decide "we need to be resilient to this but only if it doesn't introduce more than X% overhead". And so on. Each time we land another one of these whack-a-mole changes, we potentially make it harder to get to a more principled approach.

(My personal feelings are that invariants about internal object state are going to be hard for us to change or introduce unwarranted overhead in at least some circumstances. In those circumstances, I think the onus is on clangd to determine how to work within those invariants and not on clang to change unless there isn't another reasonable option. Refactoring or adding smoke tests can introduce overhead that typical compilation scenarios should never have to pay the cost for, and we should avoid that as best we can. But I also realize this adds burden to the clangd folks to have compiler performance statistics for changes or refactoring that relate to invariants which may or may not be reasonable.)

That said, is this issue blocking further work so you need to land this in order to make forward progress elsewhere?

This is resulting in crashes in clangd whenever the users trigger code completion in such code patterns (which is ~100s times a day in our production setup, so not rare in practice).
So it isn't linked to any bigger task, but rather general robustness/QoL improvements for invalid code handling in clang(d) which are quite important especially on big projects. As a crash could cost minutes of warm-up times because we need to rebuild all of the caches.

The fact that this is happening 100s of times a day for you suggests we should land the changes in this patch so there's less pressure when having the broader discussion about where the division of labor is between clangd and clang. So LGTM!

This revision is now accepted and ready to land.May 15 2023, 7:10 AM

But at the same time, it's become more obvious (at least to me) that clangd has features that don't work with all of the invariants in Clang and I don't know that we ever really stopped to figure out whether that's reasonable or not.

Just to be clear, this behaviour we observe is not clangd specific. it's triggered through code completion callbacks that's implemented through Sema, which is used by clangd and libclang but are not part of clangd.
You can basically trigger the same behaviour using -Xclang -code-completion-at=xxx, if you trigger it while parsing the lambda you receive an incomplete object COMPLETION: a : [#auto#]a (which I believe is the part that breaks the invariants of clang AST, but I am happy to go with the perspective that sema code completion doesn't get to rely on AST nodes being complete because it can trigger in the middle of parsing, which I believe implies all of the code completion users needs to deal with consequences) and if you trigger it after parsing the lambda, you get the proper object COMPLETION: a : [#void#]a(<#Foo<int>#>)[# const#].

Inside clangd we don't do anything special with regards to feeding code to clang. The only difference is we feed it incomplete code most of the time. The expectation from our side is clang generating a ~okay shaped AST & diagnostics rather than crashes. Any crashes that we trigger in our functionality while traversing that AST, we mostly try to handle inside clangd (when we seem to rely on some incidental contracts). But there are also cases like this, which are not really AST handling on the clangd side but some clang pieces triggering a crash and the only reason why they seem related to clangd is we're just more likely to hit them as regular use-case of clang doesn't involve being triggered at every keystroke :D
I feel like this approach is quite minimally intrusive but if you disagree, I am happy to follow up and change our approach going forward.

That's why I think we need a broader discussion. The problem is not ideological, it's one of maintainability of the primary product. For example, the community could decide "it is not Clang's job to be resilient to this sort of thing". Or we could decide "we need to be resilient to this but only if it doesn't introduce more than X% overhead". And so on. Each time we land another one of these whack-a-mole changes, we potentially make it harder to get to a more principled approach.

Again, this change is only introducing this overhead to USRGeneration which is only used by libclang and clangd, and both of these use cases are suppose to not crash on incomplete code when trying to code-complete. Hence I don't think the maintenance and overhead arguments do actually apply in this case (I know you're talking about the general case, but I believe we should also be aware of the distinction going forward. As despite not affecting rest of the callers neither code-complexity nor runtime overhead-wise, we still held up this particular patch for ~2 weeks).

(My personal feelings are that invariants about internal object state are going to be hard for us to change or introduce unwarranted overhead in at least some circumstances. In those circumstances, I think the onus is on clangd to determine how to work within those invariants and not on clang to change unless there isn't another reasonable option. Refactoring or adding smoke tests can introduce overhead that typical compilation scenarios should never have to pay the cost for, and we should avoid that as best we can. But I also realize this adds burden to the clangd folks to have compiler performance statistics for changes or refactoring that relate to invariants which may or may not be reasonable.)

We're actually interested in overall performance of the compiler as well, hence I think we'd be fine with tracking implications of such changes whenever we feel like clang is the right layer to have them. But to do best of my knowledge, we don't really have a great way of testing out effects of such changes on compile times. Only useful tool that I know about is https://llvm-compile-time-tracker.com/, are there any other sources that explain how to run such benchmarks?

The fact that this is happening 100s of times a day for you suggests we should land the changes in this patch so there's less pressure when having the broader discussion about where the division of labor is between clangd and clang. So LGTM!

Thanks a lot!

This revision was landed with ongoing or failed builds.May 15 2023, 9:12 AM
This revision was automatically updated to reflect the committed changes.

But at the same time, it's become more obvious (at least to me) that clangd has features that don't work with all of the invariants in Clang and I don't know that we ever really stopped to figure out whether that's reasonable or not.

Just to be clear, this behaviour we observe is not clangd specific. it's triggered through code completion callbacks that's implemented through Sema, which is used by clangd and libclang but are not part of clangd.
You can basically trigger the same behaviour using -Xclang -code-completion-at=xxx, if you trigger it while parsing the lambda you receive an incomplete object COMPLETION: a : [#auto#]a (which I believe is the part that breaks the invariants of clang AST, but I am happy to go with the perspective that sema code completion doesn't get to rely on AST nodes being complete because it can trigger in the middle of parsing, which I believe implies all of the code completion users needs to deal with consequences) and if you trigger it after parsing the lambda, you get the proper object COMPLETION: a : [#void#]a(<#Foo<int>#>)[# const#].

code completions exist for clangd support, don't they? (We didn't have that functionality before we supported clangd, did we?)

Inside clangd we don't do anything special with regards to feeding code to clang. The only difference is we feed it incomplete code most of the time. The expectation from our side is clang generating a ~okay shaped AST & diagnostics rather than crashes. Any crashes that we trigger in our functionality while traversing that AST, we mostly try to handle inside clangd (when we seem to rely on some incidental contracts). But there are also cases like this, which are not really AST handling on the clangd side but some clang pieces triggering a crash and the only reason why they seem related to clangd is we're just more likely to hit them as regular use-case of clang doesn't involve being triggered at every keystroke :D
I feel like this approach is quite minimally intrusive but if you disagree, I am happy to follow up and change our approach going forward.

It's good to know that usually you handle this on the clangd side, that makes me more comfortable. I've been seeing a fair amount of changes to clang to support clangd's and clang-repl's needs for producing ~okay ASTs for malformed code, but I typically only review the clang side of those changes, so it's possible I'm not seeing how much goes on behind the scenes to avoid making changes to clang.

That's why I think we need a broader discussion. The problem is not ideological, it's one of maintainability of the primary product. For example, the community could decide "it is not Clang's job to be resilient to this sort of thing". Or we could decide "we need to be resilient to this but only if it doesn't introduce more than X% overhead". And so on. Each time we land another one of these whack-a-mole changes, we potentially make it harder to get to a more principled approach.

Again, this change is only introducing this overhead to USRGeneration which is only used by libclang and clangd, and both of these use cases are suppose to not crash on incomplete code when trying to code-complete.

Today. Within clang. But there are in-tree uses of libclang (like clangd) that are impacted by this, and there's ever reason to believe that at some point, USRs might wind up hoisted into Clang proper. But your point is still valid that for the moment, this is pretty self-contained.

Hence I don't think the maintenance and overhead arguments do actually apply in this case (I know you're talking about the general case, but I believe we should also be aware of the distinction going forward. As despite not affecting rest of the callers neither code-complexity nor runtime overhead-wise, we still held up this particular patch for ~2 weeks).

Just to make sure we're on the same page: I don't consider it a bug that it took ~2 weeks for us to determine if this patch was reasonable or not. Clang ships to millions of users; we SHOULD be quite concerned about ancillary tools pushing Clang into a worse shape. Those tools are secondary to doing what is right for Clang's primary use case, which is a C[++]/Objective-C[++] compiler.

(My personal feelings are that invariants about internal object state are going to be hard for us to change or introduce unwarranted overhead in at least some circumstances. In those circumstances, I think the onus is on clangd to determine how to work within those invariants and not on clang to change unless there isn't another reasonable option. Refactoring or adding smoke tests can introduce overhead that typical compilation scenarios should never have to pay the cost for, and we should avoid that as best we can. But I also realize this adds burden to the clangd folks to have compiler performance statistics for changes or refactoring that relate to invariants which may or may not be reasonable.)

We're actually interested in overall performance of the compiler as well, hence I think we'd be fine with tracking implications of such changes whenever we feel like clang is the right layer to have them. But to do best of my knowledge, we don't really have a great way of testing out effects of such changes on compile times. Only useful tool that I know about is https://llvm-compile-time-tracker.com/, are there any other sources that explain how to run such benchmarks?

That is the primary tool we've been using to help gauge compile time impacts. If there are other tools, I'd love to know what they are!