This is an archive of the discontinued LLVM Phabricator instance.

[clang][dataflow] Analyze calls to in-TU functions
ClosedPublic

Authored by samestep on Jul 21 2022, 2:34 PM.

Details

Summary

This patch adds initial support for context-sensitive analysis of simple functions whose definition is available in the translation unit, guarded by the ContextSensitive flag in the new TransferOptions struct. When this option is true, the VisitCallExpr case in the builtin transfer function has a fallthrough case which checks for a direct callee with a body. In that case, it constructs a CFG from that callee body, uses the new pushCall method on the Environment to make an environment to analyze the callee, and then calls runDataflowAnalysis with a NoopAnalysis (disabling context-sensitive analysis on that sub-analysis, to avoid problems with recursion). After the sub-analysis completes, the Environment from its exit block is simply assigned back to the environment at the callsite.

The pushCall method (which currently only supports non-method functions with some restrictions) maps the SourceLocations for all the parameters to the existing source locations for the corresponding arguments from the callsite.

This patch adds a few tests to check that this context-sensitive analysis works on simple functions. More sophisticated functionality will be added later; the most important next step is to explicitly model context in some fields of the DataflowAnalysisContext class, as mentioned in a FIXME comment in the pushCall implementation.

Diff Detail

Event Timeline

samestep created this revision.Jul 21 2022, 2:34 PM
Herald added a project: Restricted Project. · View Herald Transcript
samestep requested review of this revision.Jul 21 2022, 2:34 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 21 2022, 2:34 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
samestep edited the summary of this revision. (Show Details)Jul 21 2022, 2:48 PM
samestep added reviewers: ymandel, gribozavr2, sgatev.
xazax.hun added a comment.EditedJul 21 2022, 4:15 PM

There are many ways to introduce context sensitivity into the framework, this patch seems to take the "inline substitution" approach, the same approach the Clang Static Analyzer is taking. While this approach is relatively easy to implement and has great precision, it also has some scalability problems. Did you also consider a summary-based approach? In general, I believe the inline substitution approach results in an easier to use interface for the users of the framework, but I am a bit concerned about the scalability problems.

Some other related questions:

  • Why call noop analysis? As far as I understand, this would only update the environment but not the lattice of the current analysis, i.e., if the analysis is computing some information like liveness, that information would not be context sensitive. Do I miss something?
  • Why limit the call depth to 1? The patch mentions recursive functions. In case of the Clang Static Analyzer, the call depth is 4. I think if we go with the inline substitution approach, we want this parameter to be tunable, because different analyses might have different sweet spots for the call stack depth.
  • The CSA also has other tunables, e.g., small functions are always inlined and large functions are never inlined.
  • Currently, it looks like the framework assumes functions that cannot be inlined are not doing anything. This is an unsound assumption, and I wonder if we should change that before we try to settle on the best values for the tunable parameters.
  • The current code might do a bit too much work. E.g. consider:
while (...) {
  inlinableCall();
}

As far as I understand, the current approach would start the analysis of inlinableCall from scratch in each iteration. I wonder if we actually want to preserve the state between the iterations, so we do not always reevaluate the call from scratch. Currently, it might not be a big deal as the fixed-point iteration part is disabled. But this could be a perf problem in the future, unless I miss something.

  • I think the environment currently is not really up to context-sensitive evaluation. Consider a recursive function:
void f(int a) {
  ++a;
  if (a < 10)
      f(a);
}

Here, when the recursive call is evaluated, it is ambiguous what a refers to. Is it the argument of the caller or the callee? To solve this ambiguity, the calling context needs to be the part of the keys in the environment.

Sam, this is a great start! I'm really excited to see that you have a core working so quickly.

clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp
208
220

I wonder how this will work between caller and callee. Do we need separate global var state in the frame? If so, maybe mention that as well in the FIXME above.

237

I'm pretty sure we want SkipPast::Reference. That will ensure that the parameter and argument share the same underlying location. Otherwise, in the case of references, the parameter will point to the reference location object rather than just directly to the location.

clang/lib/Analysis/FlowSensitive/Transfer.cpp
517

here and below: s/TODO/FIXME.

524

This seems worth a FIXME or, at least, an explanation. It implies that with the current design, we can't support general-purpose analyses, which we should probably fix. Given our goal of supporting models that don't involve specialized lattices, I think this is a good compromise for the short term, but not a stable solution for the framework (hence FIXME sounds right).

samestep updated this revision to Diff 446830.Jul 22 2022, 7:24 AM

Update parent patch

There are many ways to introduce context sensitivity into the framework, this patch seems to take the "inline substitution" approach, the same approach the Clang Static Analyzer is taking. While this approach is relatively easy to implement and has great precision, it also has some scalability problems. Did you also consider a summary-based approach? In general, I believe the inline substitution approach results in an easier to use interface for the users of the framework, but I am a bit concerned about the scalability problems.

Good point, thanks! Yes, we considered a summary-based approach, but we decided not to use it because (as you mentioned) it would be much more difficult to implement, especially for callees with nontrivial CFGs, which would result in a nontrivial flow condition instead of just Values in the Environment.

Could you elaborate on what specific scalability problems you are concerned about? The main one that comes to mind for me is unpredictable cost due to the potential for arbitrary callee bodies to be present in the translation unit. While this particular patch doesn't address that concern, we definitely have plans to do so: I'm guessing that will take the form of providing the analysis an allowlist of symbols of which it is allowed to analyze the bodies, so it would treat any symbols not in that list as if their bodies are not available in the TU regardless.

To clarify, the main reason we want context-sensitive analysis is to allow us to simplify our definitions of some models, such as the optional checker model. The goal is to provide the analysis a mock implementation of an optional type, and then use context-sensitive analysis (probably just one or two layers deep) to model the constructors and methods.

Some other related questions:

  • Why call noop analysis? As far as I understand, this would only update the environment but not the lattice of the current analysis, i.e., if the analysis is computing some information like liveness, that information would not be context sensitive. Do I miss something?

The alternative in this case would be to use the same analysis for the callee that is already being used for the caller? I agree that would be nicer to serve a broader category of use cases. I didn't do that in this patch for a couple reasons:

  1. In the short term, that would require threading more information through to the builtin transfer function, while this patch was meant to just be a minimum viable product.
  2. In the longer term, we probably don't need that for our specific goals (just modeling simple fields of mock classes) mentioned above.

However, if you have a suggestion for a way to construct an instance of the outer analysis here, that would definitely be useful.

  • Why limit the call depth to 1? The patch mentions recursive functions. In case of the Clang Static Analyzer, the call depth is 4. I think if we go with the inline substitution approach, we want this parameter to be tunable, because different analyses might have different sweet spots for the call stack depth.

There's no particular reason for this. We plan to support more call stack depth soon. This would probably make sense as a field in the TransferOptions struct.

  • The CSA also has other tunables, e.g., small functions are always inlined and large functions are never inlined.

See my response earlier about an allowlist for symbols to inline.

  • Currently, it looks like the framework assumes functions that cannot be inlined are not doing anything. This is an unsound assumption, and I wonder if we should change that before we try to settle on the best values for the tunable parameters.

Agreed, this assumption is unsound. However, the framework already makes many other unsound assumptions in a similar spirit, so this one doesn't immediately strike me as one that needs to change. I'll defer to people with more context.

  • The current code might do a bit too much work. E.g. consider:
while (...) {
  inlinableCall();
}

As far as I understand, the current approach would start the analysis of inlinableCall from scratch in each iteration. I wonder if we actually want to preserve the state between the iterations, so we do not always reevaluate the call from scratch. Currently, it might not be a big deal as the fixed-point iteration part is disabled. But this could be a perf problem in the future, unless I miss something.

Agreed, this is somewhat wasteful. Note that not everything is thrown away, because the same DataflowAnalysisContext is reused when analyzing the callee. Still, we would like to handle this in a smarter way in the future, as you mention. For now, though, while just building up functionality behind a feature flag, we don't plan to worry as much about this particular performance concern.

  • I think the environment currently is not really up to context-sensitive evaluation. Consider a recursive function:
void f(int a) {
  ++a;
  if (a < 10)
      f(a);
}

Here, when the recursive call is evaluated, it is ambiguous what a refers to. Is it the argument of the caller or the callee? To solve this ambiguity, the calling context needs to be the part of the keys in the environment.

Yes, I mentioned this in the patch description and in a comment in the pushCall implementation: we plan to do this by modifying the set of fields in the DataflowAnalysisContext class. I plan to do this in my next patch, once this one is landed.

samestep added inline comments.Jul 22 2022, 7:53 AM
clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp
208

OK, I'll change this; would you like for me to replace all the other TODOs with FIXMEs, as well?

220

Could you clarify what you mean? Perhaps I just don't understand exactly what is meant by "global vars" here.

237

OK, thank you! I'll make that change.

clang/lib/Analysis/FlowSensitive/Transfer.cpp
517

Will do.

524

Good point, and @xazax.hun pointed this out as well. I'll add a FIXME here, at least.

ymandel added inline comments.Jul 22 2022, 8:10 AM
clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp
208

Just those in this patch.

220

https://github.com/llvm/llvm-project/blob/main/clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp#L131-L135

/// Initializes global storage values that are declared or referenced from
/// sub-statements of `S`.
// FIXME: Add support for resetting globals after function calls to enable
// the implementation of sound analyses.

Since this already mentions a need to reset after function calls, seemed relevant here.

samestep updated this revision to Diff 446842.Jul 22 2022, 8:11 AM

Address Yitzie's comments

samestep added inline comments.Jul 22 2022, 8:14 AM
clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp
220

Hmm, OK. I pretty much just pattern-matched from the Environment constructor right above this method implementation. Would it be better for me to instead just remove this initGlobalVars call for now and replace it with a FIXME saying that we'll probably need to call this but it's unclear how exactly to do so?

samestep updated this revision to Diff 446855.Jul 22 2022, 8:46 AM

Fix typo in comment

ymandel added inline comments.Jul 22 2022, 8:53 AM
clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp
220

SGTM. We can note this constraint on models' reference implementations. Specifically, that they cannot reference globals.

gribozavr2 added inline comments.Jul 22 2022, 8:57 AM
clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp
231

The Clang AST includes argument expressions for defaulted arguments, so I believe there shouldn't be anything left to do here, it should just work.

clang/unittests/Analysis/FlowSensitive/TransferTest.cpp
3896
samestep updated this revision to Diff 446914.Jul 22 2022, 11:31 AM

Don't allow globals

samestep added inline comments.Jul 22 2022, 11:34 AM
clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp
220

Done.

231

Oh nice! I'm updating this comment, thanks.

clang/unittests/Analysis/FlowSensitive/TransferTest.cpp
3896

Good idea, done.

samestep updated this revision to Diff 446915.Jul 22 2022, 11:36 AM

Update comment about default parameters

li.zhe.hua added inline comments.Jul 22 2022, 12:58 PM
clang/include/clang/Analysis/FlowSensitive/TypeErasedDataflowAnalysis.h
64

Nit: -Wmissing-field-initializers is apparently enabled, and starts warning on this.

clang/unittests/Analysis/FlowSensitive/TransferTest.cpp
42

For the purposes of the test, there's really only 3 states:

  1. No built-in transfer
  2. Built-in transfer, no context-sensitive
  3. Built-in transfer, with context-sensitive

It may be more readable for tests to have a 3-state enum, that runDataFlow will then use to produce the corresponding DataflowAnalysisOptions. As is, a snippet like

{/*.ApplyBuiltinTransfer=*/true,
 /*.BuiltinTransferOptions=*/{/*.ContextSensitive=*/false}});

is rough to read. Good enum names with good comments would probably make this much better. WDYT?

67

Nit: -Wmissing-field-initializers is apparently enabled, and starts warning on this.

samestep added inline comments.Jul 22 2022, 1:08 PM
clang/include/clang/Analysis/FlowSensitive/TypeErasedDataflowAnalysis.h
64

Ah thanks, will fix.

clang/unittests/Analysis/FlowSensitive/TransferTest.cpp
42

I agree that there are only 3 states, but I also think that conceptually this really is a multi-layered thing; either we apply the built-in transfer or not, and then if we do apply the builtin transfer, there's some set of options we pass to it. Thus, it doesn't seem right to just collapse it into a flat enum; I'm not sure though.

67

Same here; thanks!

samestep updated this revision to Diff 446958.Jul 22 2022, 1:11 PM

Appease -Wmissing-field-initializers

xazax.hun added a comment.EditedJul 22 2022, 2:34 PM

Yes, we considered a summary-based approach, but we decided not to use it because (as you mentioned) it would be much more difficult to implement, especially for callees with nontrivial CFGs, which would result in a nontrivial flow condition instead of just Values in the Environment.

Thanks! I think this is a major design decision, and I'd love to see a discussion about the alternatives considered before jumping to an implementation. Specifically, I'd like to know if summaries are out of scope or something you would consider in the future. Knowing this is useful because this can influence how the code should be reviewed. E.g., if you plan to have multiple ways to do context sensitive analysis in the future, we should make sure that the current code will not lock us in in the inline substitution approach. If summaries are not planned at all, this is not a concern.

Could you elaborate on what specific scalability problems you are concerned about?

The Clang Static Analyzer is using this approach and it was a long way to iron out all the corner cases where the performance was bad. It has many cuts including maximum number of visits for a basic block, maximum call stack depth, not inlining functions after a certain size threshold and so on. Basically, it took some time to get the right performance and precision. But overall, the inline substitution approach will never scale to large call stacks/long calling contexts. On the other hand, a summary-based approach can potentially find bugs across a really large number of function calls with reasonable costs. Mainly because the same function is not reanalyzed for every context.

To clarify, the main reason we want context-sensitive analysis is to allow us to simplify our definitions of some models, such as the optional checker model. The goal is to provide the analysis a mock implementation of an optional type, and then use context-sensitive analysis (probably just one or two layers deep) to model the constructors and methods.

Thanks, this is also useful context!

Some other related questions:

  • Why call noop analysis?

The alternative in this case would be to use the same analysis for the callee that is already being used for the caller? I agree that would be nicer to serve a broader category of use cases. I didn't do that in this patch for a couple reasons:

  1. In the short term, that would require threading more information through to the builtin transfer function, while this patch was meant to just be a minimum viable product.
  2. In the longer term, we probably don't need that for our specific goals (just modeling simple fields of mock classes) mentioned above.

I see. This was not clear to me from the description of the patch notes. It seems to me that the goal here is to simplify the modeling of certain types, not general context-sensitive analysis. I reviewed this patch with the wrong idea in mind. If that is the goal, the current approach makes sense. But I think the comments should make clear what the intended use case and the limitations of the current approach is.

  • Currently, it looks like the framework assumes functions that cannot be inlined are not doing anything. This is an unsound assumption, and I wonder if we should change that before we try to settle on the best values for the tunable parameters.

Agreed, this assumption is unsound. However, the framework already makes many other unsound assumptions in a similar spirit, so this one doesn't immediately strike me as one that needs to change. I'll defer to people with more context.

The main reason I wanted to call this out because it increasingly seems to be whenever a decision needs to be made, the framework is getting less and less sound. Basically, many design decisions here are very similar to what the Clang Static Analyzer is doing. Since Clang already has an analysis framework for unsound analysis, I wanted to avoid you reinventing the wheel and doing something similar to CSA instead of providing something new for the use cases that cannot be covered by the CSA.

  • I think the environment currently is not really up to context-sensitive evaluation. Consider a recursive function:

Yes, I mentioned this in the patch description and in a comment in the pushCall implementation: we plan to do this by modifying the set of fields in the DataflowAnalysisContext class. I plan to do this in my next patch, once this one is landed.

Oh, my bad! I somehow blinked and totally skipped that comment.

samestep edited the summary of this revision. (Show Details)Jul 25 2022, 7:28 AM

Thanks! I think this is a major design decision, and I'd love to see a discussion about the alternatives considered before jumping to an implementation. Specifically, I'd like to know if summaries are out of scope or something you would consider in the future. Knowing this is useful because this can influence how the code should be reviewed. E.g., if you plan to have multiple ways to do context sensitive analysis in the future, we should make sure that the current code will not lock us in in the inline substitution approach. If summaries are not planned at all, this is not a concern.

This is a good question. I shared with you our design doc, which may help clarify somewhat; please let me know if you have further concerns.

The Clang Static Analyzer is using this approach and it was a long way to iron out all the corner cases where the performance was bad. It has many cuts including maximum number of visits for a basic block, maximum call stack depth, not inlining functions after a certain size threshold and so on. Basically, it took some time to get the right performance and precision. But overall, the inline substitution approach will never scale to large call stacks/long calling contexts. On the other hand, a summary-based approach can potentially find bugs across a really large number of function calls with reasonable costs. Mainly because the same function is not reanalyzed for every context.

That makes a lot of sense. From what you're saying, it sounds like we'll avoid that in our plan by keeping contexts small due to only context-sensitively analyzing simple models that we write ourselves.

I see. This was not clear to me from the description of the patch notes. It seems to me that the goal here is to simplify the modeling of certain types, not general context-sensitive analysis. I reviewed this patch with the wrong idea in mind. If that is the goal, the current approach makes sense. But I think the comments should make clear what the intended use case and the limitations of the current approach is.

Fair point. Hopefully the intended use case will become more clear in the code itself as I follow up with further patches; if not then I can modify the comments to clarify that. Are there any specific things you'd like to see written comments in this patch itself before landing?

The main reason I wanted to call this out because it increasingly seems to be whenever a decision needs to be made, the framework is getting less and less sound. Basically, many design decisions here are very similar to what the Clang Static Analyzer is doing. Since Clang already has an analysis framework for unsound analysis, I wanted to avoid you reinventing the wheel and doing something similar to CSA instead of providing something new for the use cases that cannot be covered by the CSA.

This makes sense; in this case, though, the unsoundness is already present (this patch does nothing to change the way we analyze calls to functions for which we can't see the body), so if anything, unsoundness is reduced here. I'll let @ymandel respond to this in more detail, though.

Oh, my bad! I somehow blinked and totally skipped that comment.

No worries! This patch is a bit large so it's easy to miss. Thanks for taking the time to review in such detail!

The main reason I wanted to call this out because it increasingly seems to be whenever a decision needs to be made, the framework is getting less and less sound. Basically, many design decisions here are very similar to what the Clang Static Analyzer is doing. Since Clang already has an analysis framework for unsound analysis, I wanted to avoid you reinventing the wheel and doing something similar to CSA instead of providing something new for the use cases that cannot be covered by the CSA.

This makes sense; in this case, though, the unsoundness is already present (this patch does nothing to change the way we analyze calls to functions for which we can't see the body), so if anything, unsoundness is reduced here. I'll let @ymandel respond to this in more detail, though.

Gabor, I fully agree. We need to start paying down the debt on the unsoundness, reducing it where possible and otherwise giving users control over whether to incur it. However, as Sam wrote, we did not expect to be incurring any new unsoundness here.

Thanks! Knowing the context, I am much happier with the direction overall. Is the plan to analyze a mock of std::optional instead of the actual code in the STL? How will that mock be shipped? Would that be embedded in the binary?

Gabor, I fully agree. We need to start paying down the debt on the unsoundness, reducing it where possible and otherwise giving users control over whether to incur it.

I'm glad that this is still on the roadmap. I am a bit worried about how hard it will be to make the current memory model sound. Generally, I saw some researchers arguing that the access path approach approach is a better fit for dataflow analysis. See this paper as an example. Although admittedly, I do not fully agree with everything they claim, they focus on distributive problems in that paper, and I found that most actual problems that we want to solve are not distributive. But whatever model we end up using, to restore soundness we might need to introduce a way to summarize certain constructs in our memory model. There are some ideas in this survey paper.

However, as Sam wrote, we did not expect to be incurring any new unsoundness here.

I fully agree that this patch itself will not introduce new unsoundness (after the fixme mentioned in the description is resolved). My main concerns were:

  • The framework started to feel more similar to symbolic execution than abstract interpretation, see the answer to this question on the differences.
  • I was a bit worried that adding more features before fixing the soundness issues might make fixing those problems harder.
  • I was not sure the current approach would scale to general context-sensitive analysis. But looks like that is a non-goal for now. Doing this to model certain types of interest makes sense to me and is a good first step.

Overall, I am excited for context-sensitive analysis, and some of my concerns are addressed. Looking forward to the follow-up patches :)

Thanks! Knowing the context, I am much happier with the direction overall. Is the plan to analyze a mock of std::optional instead of the actual code in the STL? How will that mock be shipped? Would that be embedded in the binary?

Glad to hear it! Yes, the current plan is to analyze a mock of std::optional instead of the actual type. One reason for this is that we would like to use the same mock to model multiple different optional types (e.g. absl::optional) using the same mock. Our current plan is to embed it directly in the binary.

Overall, I am excited for context-sensitive analysis, and some of my concerns are addressed. Looking forward to the follow-up patches :)

Thanks Gábor! I'll let @ymandel and others respond to your other points; also, thanks for the links, those resources look very helpful.

@xazax.hun Do you have anything else you'd like addressed/changed (either here or in the doc we shared with you) before I land this?

xazax.hun accepted this revision.Jul 26 2022, 7:07 AM

@xazax.hun Do you have anything else you'd like addressed/changed (either here or in the doc we shared with you) before I land this?

Nope, most of my concerns are unrelated to this patch. Sorry for hijacking the conversation with some offtopics. Feel free to land.

This revision is now accepted and ready to land.Jul 26 2022, 7:07 AM
ymandel accepted this revision.Jul 26 2022, 8:40 AM
ymandel added inline comments.
clang/include/clang/Analysis/FlowSensitive/TypeErasedDataflowAnalysis.h
64

optional, but in my experience, being explicit can help readability/findability in certain places.

samestep updated this revision to Diff 447750.Jul 26 2022, 10:15 AM

Be explicit when constructing TransferOptions

This revision was landed with ongoing or failed builds.Jul 26 2022, 10:27 AM
This revision was automatically updated to reflect the committed changes.
samestep reopened this revision.Jul 26 2022, 10:37 AM
This revision is now accepted and ready to land.Jul 26 2022, 10:37 AM
samestep updated this revision to Diff 447758.Jul 26 2022, 10:40 AM

Use different name for TransferOptions field

samestep edited the summary of this revision. (Show Details)Jul 26 2022, 10:42 AM
This revision was landed with ongoing or failed builds.Jul 26 2022, 10:54 AM
This revision was automatically updated to reflect the committed changes.

A few variables cause warinings in -Asserts.

clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp
218

It is used only in assert.

224

ditto

A few variables cause warinings in -Asserts.

Thanks for pointing this out! How should I address this? Should I just inline the definitions of those variables into the asserts themselves?

A few variables cause warinings in -Asserts.

Thanks for pointing this out! How should I address this? Should I just inline the definitions of those variables into the asserts themselves?

Someone took care of it: https://github.com/llvm/llvm-project/commit/1f8ae9d7e7e4afcc4e76728b28e64941660ca3eb