This is an archive of the discontinued LLVM Phabricator instance.

[clang][dataflow] Add transfer function for VarDecl statements
ClosedPublic

Authored by sgatev on Dec 29 2021, 3:39 AM.

Details

Summary

This is part of the implementation of the dataflow analysis framework.
See "[RFC] A dataflow analysis framework for Clang AST" on cfe-dev.

Diff Detail

Event Timeline

sgatev created this revision.Dec 29 2021, 3:39 AM
sgatev requested review of this revision.Dec 29 2021, 3:39 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 29 2021, 3:39 AM
sgatev updated this revision to Diff 396516.Dec 29 2021, 5:03 AM

Minor changes to names and comments.

Nice! A few small comments on the headers...

clang/include/clang/Analysis/FlowSensitive/DataflowAnalysisContext.h
57

The term "assignment" is overloaded. :) Maybe instead "Associates Loc with D"? Or, expand on the meaning of the assignment? e.g. "Assigns Loc as the storage location of D?

clang/include/clang/Analysis/FlowSensitive/DataflowEnvironment.h
41

here and below. If it never returns null, why not return a ref?
For that matter, all of the pointer parameters with comments "must not be null" -- can we just make those refs?

clang/include/clang/Analysis/FlowSensitive/StorageLocation.h
10

Indent two spaces? (vs 1) same below.

41–43

I don't think its required, but LLVM style typically puts private field decls at the beginning of the class, without the explicit "private:" decl.

53

ref?

84
clang/include/clang/Analysis/FlowSensitive/Value.h
38–39

Move to beginning of class decl?

42

nit: Here and below. I think that "Models..." would be as good or better than "Class that models...".

sgatev updated this revision to Diff 396529.Dec 29 2021, 7:03 AM
sgatev marked 8 inline comments as done.

Address reviewers' comments.

sgatev added inline comments.Dec 29 2021, 7:11 AM
clang/include/clang/Analysis/FlowSensitive/DataflowAnalysisContext.h
57

I opted for "Assigns Loc as the storage location of D."

clang/include/clang/Analysis/FlowSensitive/DataflowEnvironment.h
41

I replaced the pointers with references where possible.

clang/include/clang/Analysis/FlowSensitive/StorageLocation.h
41–43

Ack. I prefer this order because it puts the public interface higher so the important bits can be found more easily. If this goes against the LLVM coding style I'd be happy to change it, but I prefer to do this separately because I already used this style in some of the other dataflow classes.

53

I believe this needs to accept a pointer to be compatible with operations defined in llvm/Support/Casting.h.

clang/include/clang/Analysis/FlowSensitive/Value.h
38–39

Ack. I'll apply this separately if necessary.

42

Yeap, that's better. Updated all comments.

sgatev updated this revision to Diff 396537.Dec 29 2021, 7:41 AM

Convert pointers to references.

Thanks! I have some questions inline.

clang/include/clang/Analysis/FlowSensitive/DataflowAnalysisContext.h
62

Are other language constructs coming in a follow-up PR? (E.g.: FieldDecl, locations on the heap). I'd like to some TODO comments if that is the case. Or if they are not supported by design, I'd love to some some comments about that.

Environment is referencing ValueDecl as opposed to VarDecl. What is the reason for this asymmetry?
I think I overall find it confusing that some of the DeclToLoc mapping is here some is in the Environment.
Does this has something to do with the fact that certain mappings should be the same regardless of which basic blocks we are processing (i.e., invariant globally so we do not need to duplicate it)? If that is the case please make it clear in the comments.

77

Just curious: would it make sense to deduplicate these values?

clang/include/clang/Analysis/FlowSensitive/Value.h
34

= default;

79

I wonder if getLocation is ambiguous. (The pointer's own location vs the pointee location.) How about something like getPointeeLoc?

clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp
33

Shouldn't we add operator== instead?

47

I wonder if these algorithms should rather be somewhere in the support library.

89

Could this end up creating an overly large state? There might be objects with quite a lot fields but each function would only access a small subset of those. Alternatively, we could attempt to create the representations for fields on demand (this is the approach what the clang static analyzer is using).

118

I think this function implementation is better to be moved closer to the called initValueInStorageLocationUnlessSelfReferential for readability.

145

Are we planning to track float values in the near future? If not, would it make sense to defer creating these objects until then?

155

Do we want to canonicalize types before inserting into this set?

clang/lib/Analysis/FlowSensitive/Transfer.cpp
34

Maybe add a TODO for int a, b; and similar codes?

42

Instead of using a long if else and dynamic casts, don't we want to use an StmtVisitor instead?

sgatev updated this revision to Diff 396660.Dec 30 2021, 6:08 AM
sgatev marked 12 inline comments as done.

Address reviewers' comments.

sgatev added inline comments.Dec 30 2021, 6:11 AM
clang/include/clang/Analysis/FlowSensitive/DataflowAnalysisContext.h
62

Are other language constructs coming in a follow-up PR? (E.g.: FieldDecl, locations on the heap). I'd like to some TODO comments if that is the case. Or if they are not supported by design, I'd love to some some comments about that.

Yes, more is coming in a follow-up. Nothing unsupported by design comes to mind at this point. I added FIXMEs to indicate the language constructs that we plan to implement first, but we will certainly need to add more going forward. Please let me know if there's something in particular that you'd like to see implemented early.

Environment is referencing ValueDecl as opposed to VarDecl. What is the reason for this asymmetry?

This is not intentional. I updated DataflowAnalysisContext to use ValueDecl as well.

I think I overall find it confusing that some of the DeclToLoc mapping is here some is in the Environment.
Does this has something to do with the fact that certain mappings should be the same regardless of which basic blocks we are processing (i.e., invariant globally so we do not need to duplicate it)? If that is the case please make it clear in the comments.

DeclToLoc in Environment holds the set of ValueDecl to StorageLocation assignments that are in scope for a particular basic block. DeclToLoc in DataflowAnalysisContext aggregates the assignments across all basic blocks and is mainly used to produce stable storage locations when the same basic blocks are evaluated multiple times. I added comments in the code explaining this.

77

I think no because even if the values are equal we'd like to track them separately as they represent different identities which is important for inference. For example:

std::optional<int> a = foo();
std::optional<int> b = bar();
std::optional<int> c = a;
// at this point we can model `a`, `b`, and `c` with the same value, however
if (a.has_value()) {
  // here we gain knowledge only about the values of `a` and `c`
  a.value(); // safe
  c.value(); // safe
  b.value(); // unsafe 
}

In our analysis we model the above by assigning the same value to the storage locations for a and c and assigning a different value to the storage location for b, even if the two values are equal.

clang/include/clang/Analysis/FlowSensitive/Value.h
79

That's much better! I also changed the StorageLocation pointer to a reference in the constructor and the getter.

clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp
33

I'd be happy to do that. Do we need reviews from other folks for it? Would it make sense to move the code to the other location in a follow-up PR, to limit the scope of the change?

47

I'd be happy to do that. Do we need reviews from other folks for it? Would it make sense to move the code to the other location in a follow-up PR, to limit the scope of the change?

89

That's a great point, however I don't think initialization on demand plays well with the dataflow algorithm. For example:

struct S {
  int Val;
};

void target(bool b) {
  // basic block 1:
  S Foo;
  int Bar;
  if (b) {
    // basic block 2:
    Bar = Foo.Val;
  } else {
    // basic block 3:
    Bar = Foo.Val;
  }
  // basic block 4:
  ...
}

In basic block 4 we should be able to infer that the value that is assigned to the storage location for Bar is unambiguous. However, since Foo.Value isn't created in basic block 1, this means that it's not inherited by basic block 2 and basic block 3. Each of these blocks will end up creating distinct values to assign to the storage location for Foo.Value and so in basic block 4 the value for Bar will end up being ambiguous.

Alternatively, we can do a pass over all statements in all basic blocks before running the analysis to identify which fields are used in the function. We can use this information here to initialize only parts that are being used.

What do you think?

145

No immediate plans to track floats. Removed it.

clang/lib/Analysis/FlowSensitive/Transfer.cpp
42

Makes a lot of sense. Replaced it with a ConstStmtVisitor instance.

sgatev updated this revision to Diff 396668.Dec 30 2021, 6:51 AM

Canonicalize types.

xazax.hun accepted this revision.Dec 30 2021, 9:53 AM

I have one nit inline, and some topics that should probably be deferred to follow-up PRs. Overall, looks good to me, thanks!

clang/include/clang/Analysis/FlowSensitive/DataflowAnalysisContext.h
77

I see. This makes sense, thanks!

clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp
33

Actually, I think DenseMaps should already have operator== so we should be able to remove this code.

47

Yep, I'm fine with moving this in a follow-up PR.

89

I am not convinced about this example. I think it should not matter where and when we create the location for Foo.Val, it always should be equivalent to the others, i.e., they should have the same identity regardless their creation. Basically, I think the identity of such a location should be determined by the base object (Foo in this case), and the access path to the subobject.

This would be a non-trivial change, so I'm ok with deferring this discussion to some follow-up PRs. But I'd love to see some alternatives explored because eagerly evaluating all fields used to bite me in the past in other static analysis tools performance-wise.

This revision is now accepted and ready to land.Dec 30 2021, 9:53 AM
sgatev updated this revision to Diff 396778.Dec 31 2021, 5:26 AM

Address reviewers' comments.

sgatev marked 4 inline comments as done.Dec 31 2021, 5:46 AM
sgatev added inline comments.
clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp
33

Thanks, I must have missed that. Removed the local implementation.

89

Right. I agree with the comment about storage locations. The issue I see has to do with values that are assigned to the respective storage locations. In general, different values can be assigned to the same storage location in different basic blocks. I don't see a straightforward way to incorporate on-demand initialization while preserving the behavior with respect to values. Perhaps we can achieve this by storing references to parent environments in Environment instead of merging their LocToVal maps when creating environments for successor basic blocks, but this patch is already pretty big so I'd rather defer that. I do agree that this is an important problem we should solve so I added a FIXME to remind us.

xazax.hun added inline comments.Dec 31 2021, 9:41 AM
clang/lib/Analysis/FlowSensitive/DataflowEnvironment.cpp
89

The issue I see has to do with values that are assigned to the respective storage locations.

Oh, I see. Generally, Foo.Val could be the initial value (representing the lack of knowledge) in both basic blocks, so merging them would be fine. But we potentially want to assign symbols to the unknown values to be able to describe flow constraints and solve them. So the problem is, we want Foo.Val to evaluate to the same unknown symbol in both basic blocks, so we know the values at Foo.Val are actually the same in those branches.

While it might not be straightforward, I think there are ways around this problem. As far as I understand, from the values' point of view the only problematic case is when we access a value for the first time. We could either derive the unknown symbols from the location to ensure that we initialize Foo.Val to the same symbol in both cases (so the identity of the unknown symbol would describe the fact that this symbols is the initial value for that location). Alternatively, we could have a separate mapping for the initial values and use it to look the right symbol up. The problematic case is the following:

struct S {
  int Val;
};

void invalidate(S&);

void target(bool b) {
  // basic block 1:
  S Foo;
  int Bar;
  if (b) {
    // basic block 2:
    Bar = Foo.Val;
  } else {
    // basic block 3:
    invalidate(S);
    Bar = Foo.Val;
  }
  // basic block 4:
  ...
}

Whatever we come up with, we should ensure that Bar will not have the same value in block 2 and 3 in this case.

Alternatively, we can do a pass over all statements in all basic blocks before running the analysis to identify which fields are used in the function.

This is not an ideal solution. During the fixed-point iteration we might have more information, e.g. we could resolve some of the pointers/references. So in most of the cases creating Locations for fields on demand could be more precise than doing them upfront.

The issue I see has to do with values that are assigned to the respective storage locations.

I agree, we should not block this patch and this is better solved separately.

sgatev marked 3 inline comments as done.Jan 1 2022, 9:34 AM

Thanks for the reviews!

This revision was landed with ongoing or failed builds.Jan 4 2022, 1:22 AM
This revision was automatically updated to reflect the committed changes.