This is an archive of the discontinued LLVM Phabricator instance.

[clang][dataflow] Add support for structured bindings of tuple-like types.
ClosedPublic

Authored by ymandel on Dec 7 2022, 8:08 AM.

Details

Summary

This patch adds interpretation of binding declarations resulting from a
structured binding (DecompositionDecl) to a tuple-like type. Currently, the
framework only supports binding to a struct.

Fixes issue #57252.

Diff Detail

Event Timeline

ymandel created this revision.Dec 7 2022, 8:08 AM
Herald added a project: Restricted Project. · View Herald Transcript
ymandel requested review of this revision.Dec 7 2022, 8:08 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 7 2022, 8:08 AM

If CFG can be updated so that the
synthesized variables occur in the block *before* the DeclStmt containing the
DecompositionDecl, then we can drop the workaround that is included in this
patch.

I'm not sure that this is possible.

If you look at the AST of the following example

std::pair<int, int> p{1, 2};

auto [a, b] = p;

you get

`-DecompositionDecl 0x55b0ac862428 'std::pair<int, int>':'std::pair<int, int>' cinit
  |-CXXConstructExpr 
  | `-ImplicitCastExpr
  |   `-DeclRefExpr  'p' 
  |-BindingDecl a
  | |-VarDecl a
  | | `-CallExpr 
  | |   |-ImplicitCastExpr 
  | |   | `-DeclRefExpr  'get' 
  | |   `-ImplicitCastExpr 
  | |     `-DeclRefExpr Decomposition 0x55b0ac862428 '' 
  | `-DeclRefExpr 'a' 
  ` ...

Basically get<>() is getting the element from the copy of the struct, that's being created when the DecompositionDecl happens.
The CFG reflects this accordingly.

 7: p
 8: [B1.7] (ImplicitCastExpr, NoOp, const pair<int, int>)
 9: [B1.8] (CXXConstructExpr, [B1.10], std::pair<int, int>)
10: auto = p;
11: get<0UL>
12: [B1.11] (ImplicitCastExpr, FunctionToPointerDecay, typename tuple_element<0UL, pair<int, int> >::type &&(*)(pair<int, int> &&) noexcept)
13:
14: [B1.13] (ImplicitCastExpr, NoOp, std::pair<int, int>)
15: [B1.12]([B1.14])
16: std::tuple_element<0, std::pair<int, int>>::type a = get<0UL>();

Here 13 would be the DecompositionDecl. The variable declarations must appear after the DecompositionDecl,
otherwhise the tuple, from which the variables are initialized cannot be accessed.

Thanks for clarifying the issues here. A few thoughts:

  1. Thanks for clarifying about the blank lines. I was wondering how to read that and I think that fills in an important missing detail for me.
  1. I'm particularly interested in the case of tuple-like containers, which is a little different (though I think the same point holds):

Example: https://godbolt.org/z/qcWMdG33T

#include <tuple>
std::tuple<bool, int> mk();

void f() {
  auto [b, i] = mk();
  (void)i;
}

CFG:

void f()
 [B2 (ENTRY)]
   Succs (1): B1

 [B1]
   1: mk
   2: [B1.1] (ImplicitCastExpr, FunctionToPointerDecay, std::tuple<_Bool, int> (*)(void))
   3: [B1.2]() (CXXRecordTypedCall, [B1.4])
   4: auto = mk();
   5: get<0UL>
   6: [B1.5] (ImplicitCastExpr, FunctionToPointerDecay, typename tuple_element<0UL, tuple<_Bool, int> >::type &&(*)(tuple<_Bool, int> &&) noexcept)
   7: 
   8: [B1.7] (ImplicitCastExpr, NoOp, std::tuple<_Bool, int>)
   9: [B1.6]([B1.8])
  10: std::tuple_element<0, std::tuple<bool, int>>::type b = get<0UL>();
  11: get<1UL>
  12: [B1.11] (ImplicitCastExpr, FunctionToPointerDecay, typename tuple_element<1UL, tuple<_Bool, int> >::type &&(*)(tuple<_Bool, int> &&) noexcept)
  13: 
  14: [B1.13] (ImplicitCastExpr, NoOp, std::tuple<_Bool, int>)
  15: [B1.12]([B1.14])
  16: std::tuple_element<1, std::tuple<bool, int>>::type i = get<1UL>();
  17: i
  18: (void)[B1.17] (CStyleCastExpr, ToVoid, void)
   Preds (1): B2
   Succs (1): B0

 [B0 (EXIT)]
   Preds (1): B1

Here, both lines 7 and 13 are blank but seem to be the underlying tuple that's being deconstructed. Are these different instances of the DecompositionDecl or are they (unprinted) references back to line 4?

  1. I'm still bothered that this seems to violate the design/philosophy of the CFG, which I take as ensuring that elements proceed in the order of operation. I see the problem here -- that the AST includes backedge, but can we solve it by adding another synthesized element for the object being deconstructed? Or, perhaps we could add new CFG kind that marks the appearance of the DecompositionDecl an extra time? That is, first time is so that the DecompositionDecl can be referenced as it is now, but second time is so that the statement can be evaluated in the proper order (after all of its constituent elements)?

Finally, a nit: why doesn't line 13 in your example, and lines 7 and 13 in my example, print? Is that something that I could add to the CFG printer?

Gabor -- An alternative solution to this problem, assuming the current structure of the CFG, is to add a map to the Environment of type VarDecl -> BindingDecl that registers BindingDecls in need of initialization. Then, I could modify the DeclStmt interpretation to check this map for each VarDecl it processes, processing the corresponding BindingDecl if there is one. This solution, while a bit more complicated, has the mild advantage of incurring less cost (on average) since VarDecls are less common than DeclRefExpr. Also, it's eager instead of lazy, which seems somewhat easier to reason about.

I'm particularly interested in the case of tuple-like containers, which is a little different (though I think the same point holds)

std::pair<> is actually tuple-like if that's what you meant. Only tuple-like types have holding vars inside a DecompositionDecl.

Here, both lines 7 and 13 are blank but seem to be the underlying tuple that's being deconstructed. Are these different instances of the DecompositionDecl or are they (unprinted) references back to line 4?

Those lines are what you see as DeclRefExpr Decomposition '' in the AST. Which are indeed references back to line 4 as you mentioned.

I'm still bothered that this seems to violate the design/philosophy of the CFG, which I take as ensuring that elements proceed in the order of operation.

I think in this case the elements do proceed in order of operation. When you create a structured binding by value to a tuple like type, the copy constructor of the type is invoked first.

Let's assume that you have a tuple-like std::pair<int, int> p{1, 2}; and you create the structured binding auto [a, b] = p;.

  1. The copy constructor of p is invoked first and a copy get's created. Let's call this copy tmp. Note that if the copy constructor has side effects, they will also be visible.
  2. A new variable for each element in the tuple is created and is initialized by get<idx>(tmp);. It is important that the initialization happens from the copy, tmp and not from the original tuple p.

The CFG resembles this behaviour. If you look at both of the example CFGs in this discussion, you can see that both of them begin with the constructor invocation.

 7: p
 8: [B1.7] (ImplicitCastExpr, NoOp, const pair<int, int>)
 9: [B1.8] (CXXConstructExpr, [B1.10], std::pair<int, int>)
10: auto = p;
1: mk
2: [B1.1] (ImplicitCastExpr, FunctionToPointerDecay, std::tuple<_Bool, int> (*)(void))
3: [B1.2]() (CXXRecordTypedCall, [B1.4])
4: auto = mk();

Points 4 and 10 are a bit weird, but that's where the copy is created, however the copy is unnamed. This unnamed copy is what I refered to as tmp.

Note that when you do a regular copy-constructor invocation like

std::tuple<bool, int> mk;
std::tuple<bool, int> mk2 = mk;

you would get

3: mk
4: [B1.3] (ImplicitCastExpr, NoOp, const class std::tuple<_Bool, int>)
5: [B1.4] (CXXConstructExpr, [B1.6], std::tuple<_Bool, int>)
6: std::tuple<bool, int> mk2 = mk;
                         ^^^This is the name that's missing from the CFGs above.

After this the get<>() calls simply use this unnamed copy to initialize the elements from first to last, so everything seems to proceed in order in the CFG.

Finally, a nit: why doesn't line 13 in your example, and lines 7 and 13 in my example, print? Is that something that I could add to the CFG printer?

If I remember correctly (and my intuition is correct based on the examples) a DeclRefExpr in the CFG is the name of the variable being referenced.
So for example DeclRefExpr 'p' is simply p in the CFG.

If you look at the AST in my example, you can see that the DeclRefExpr to the decomposition is DeclRefExpr Decomposition '', where ''
is supposed to be the name of the variable. So I think the empty line comes from here.

Domján -- thanks for the detailed explanations -- this has been really helpful.

After this the get<>() calls simply use this unnamed copy to initialize the elements from first to last, so everything seems to proceed in order in the CFG.

Agreed -- it definitely should appear before, but I think it would be good to appear _after_ as well. That is, I think there's a catch-22: the temporary's construction should appear before, but the binding decls, which use the synthetic variables, should appear after.

What do you think of duplicating it in the CFG, possibly with a new (statement) element kind (to mark it as a special)? There are other workarounds that we could do on our end, but the duplication would avoid the need for such -- for us and any other clients.

the temporary's construction should appear before, but the binding decls, which use the synthetic variables, should appear after

I'm confused a bit here. Right now the CFG looks like this:

<constructor_call>
<binding_decl_0>
...
<binding_decl_n>

Based on what you say I assume you want something to happen between constructor_call and binding_decl_0.
Could you visualize somehow how you think the ideal CFG here looks like?

What do you think of duplicating it in the CFG, possibly with a new (statement) element kind (to mark it as a special)?
There are other workarounds that we could do on our end, but the duplication would avoid the need for such -- for us and any other clients.

Well, I don't think duplicating a statement in the CFG is a good idea as long as there are other solutions for this problem. This seems to be a
very specific problem, so I think a local solution should be pursued instead if possible. What do you think @xazax.hun @NoQ ?

the temporary's construction should appear before, but the binding decls, which use the synthetic variables, should appear after

I'm confused a bit here. Right now the CFG looks like this:

<constructor_call>
<binding_decl_0>
...
<binding_decl_n>

Based on what you say I assume you want something to happen between constructor_call and binding_decl_0.
Could you visualize somehow how you think the ideal CFG here looks like?

Sorry, I should have explained in more detail (refernce example: https://godbolt.org/z/dE9ojcjj7). *Perhaps I'm still misreading the CFG, but I don't see the binding decls serialized in the CFG, only the synthesized var decls.* That is, here's my reading of the CFG:
The resulting simplified CFG would be:

<constructor_call>
<synthesized_var_decl_0>
...
<synthesized_var_decl_n>

In more detail, here is my annotated reading of the CFG, based on our discussion:

[B1]
   1: mk
   2: [B1.1] (ImplicitCastExpr, FunctionToPointerDecay, std::tuple<int, int> (*)(void))
   3: [B1.2]() (CXXRecordTypedCall, [B1.4])
   4: auto = mk();  // <constructor_call> (with <binding_decl_0>, <binding_decl_1> nested as children)
   5: get<0UL>
   6: [B1.5] (ImplicitCastExpr, FunctionToPointerDecay, typename tuple_element<0, tuple<int, int> >::type (*)(tuple<int, int>))
   7:  // <ref to anonymous temporary defined on line 4>
   8: [B1.7] (ImplicitCastExpr, NoOp, std::tuple<int, int>)
   9: [B1.8] (CXXConstructExpr, [B1.10]+0, tuple<int, int>)
  10: [B1.6]([B1.9])
  11: [B1.10]
  12: std::tuple_element<0, std::tuple<int, int>>::type i0 = get<0UL>(); // <synth_var_decl_0>
  13: get<1UL>
  14: [B1.13] (ImplicitCastExpr, FunctionToPointerDecay, typename tuple_element<1, tuple<int, int> >::type (*)(tuple<int, int>))
  15: // <ref to anonymous temporary defined on line 4>
  16: [B1.15] (ImplicitCastExpr, NoOp, std::tuple<int, int>)
  17: [B1.16] (CXXConstructExpr, [B1.18]+0, tuple<int, int>)
  18: [B1.14]([B1.17])
  19: [B1.18]
  20: std::tuple_element<1, std::tuple<int, int>>::type i1 = get<1UL>(); // <synth_var_decl_1>
  21: i0 // ref to <binding_decl_0>
  22: (void)[B1.21] (CStyleCastExpr, ToVoid, void)

Here is the crux of the issue: per the AST, line 21 is reference is to the *BindingDecl*, not the (synthesized) VarDecl:

-DeclRefExpr <col:9> 'std::tuple_element<0, std::tuple<int, int>>::type':'int' lvalue Binding 0x559942acf350 'i0' 'std::tuple_element<0, std::tuple<int, int>>::type':'int'

So, in order to make sense of that reference, I need to define the BindingDecl. But, that in turn is defined in terms of the (synthesized) VarDecl. So, I want to insert between elements 20 and 21:

20.a: <binding_decl_0> // references synthesized variable `i0` from line 12 in its binding
20.b: <binding_decl_1> // references synthesized variable `i1` from line 20 in its binding

The resulting simplified CFG would be:

<constructor_call>
<synthesized_var_decl_0>
...
<synthesized_var_decl_n>
<binding_decl_0>
...
<binding_decl_n>

I suggested repeating the DecompositionDecl because it includes the BindingDecls, but spelling them out explicitly would probably be even better. Right now, the only chance I get to visit the BindingDecls is when I encounter line 4, as children of the DecompositionDecl, at which point the binding of each decl reference as-yet-undefined (synthesized) VarDecls, hence my dilemma.

I'm fine with a custom solution -- that's indeed what this patch contains and I'm happy to leave it at that.

Thank you, again, for your patience hashing this out!

Those BindingDecls never appear in the CFG unless one of the is used.

Take a look at the other cases.

1.) Arrays

void foo() {
    int arr[2];

    auto [i0, i1] = arr;
}
[B1]
   1: int arr[2];
   2: arr
   3: [B1.2] (ImplicitCastExpr, ArrayToPointerDecay, int *)
   4: *
   5: [B1.3][[B1.4]]
   6: [B1.5] (ImplicitCastExpr, LValueToRValue, int)
   7: {[B1.6]}
   8: auto = {arr[*]};
   Preds (1): B2
   Succs (1): B0

2.) Structs

struct S {
    int x;
    int y;
};

void foo() {
    S s;

    auto [i0, i1] = s;
}
[B1]
   1:  (CXXConstructExpr, [B1.2], S)
   2: S s;
   3: s
   4: [B1.3] (ImplicitCastExpr, NoOp, const struct S)
   5: [B1.4] (CXXConstructExpr, [B1.6], S)
   6: auto = s;
   Preds (1): B2
   Succs (1): B0

i0 and i1 are not in the CFG. They only get added, when one of them is actually used.

int x = i0;
 9: i0
10: [B1.9] (ImplicitCastExpr, LValueToRValue, int)
11: int x = i0;

That means. when you see the DeclRefExpr you mentioned, the synthesized variable has already been created and the BindingDecl can see it.

-DeclRefExpr <col:9> 'std::tuple_element<0, std::tuple<int, int>>::type':'int' lvalue Binding 0x559942acf350 'i0' 'std::tuple_element<0, std::tuple<int, int>>::type':'int'

By the time you see this node, the variable has already been created and you can access it using BindingDecl::getHoldingVar(). You want to analyze these
BindingDecls when they are used not when you encounter the DecompositionDecl.

Those BindingDecls never appear in the CFG unless one of the is used.

Indeed. That''s the issue (which this discussion has helped clarify) -- the binding decls are not in the CFG and I was expecting them to be.

By the time you see this node, the variable has already been created and you can access it using BindingDecl::getHoldingVar(). You want to analyze these BindingDecls when they are used not when you encounter the DecompositionDecl.

Great, that's exactly what this patch does. So, it sounds like we have what we need. Thank you!

Gabor -- now that I understand the issues involved, I think there are two options here:

  1. When we encounter a DeclRefExpr to a BindingDecl, just treat that DeclRefExpr as the underlying binding expression. So, BindingDecl's are never interpreted directly -- we always see through them to their binding expression.
  2. (What I have here): When we encounter a DeclRefExpr to a BindingDecl, force the evaluation of the BindingDecl and leave the handling of the DeclRefExpr as is.

Based on this discussion, I'm inclined to go with the first option. Moreover, we could use it *for all* BindingDecls, not just tuple-typed decls and thereby entirely drop the custom handling of BindingDecls. WDYT?

Those BindingDecls never appear in the CFG unless one of the is used.

Sorry, I should fix my response above: I *never* see the BindingDecls in the CFG, whether or not they are used. Can you send me an example CFG that contains the BindingDecls?

Sorry, I should fix my response above: I *never* see the BindingDecls in the CFG, whether or not they are used.

It was my fault, I was wrong with the wording. I meant the DeclRefExprs that refer to the BindingDecls. You can
access the BindingDecls through the DeclRefExprs, they are not present in the CFG directly.

That's also how the static analyzer handles the BindingDecls in ExprEngine::VisitCommonDeclRefExpr().

ymandel updated this revision to Diff 481651.Dec 9 2022, 7:42 AM

Simplifies the implementation and removes mention of any potential CFG problem.

This new version does the work eagerly by forcing a location for the synthesized variable.

Sorry, I should fix my response above: I *never* see the BindingDecls in the CFG, whether or not they are used.

It was my fault, I was wrong with the wording. I meant the DeclRefExprs that refer to the BindingDecls. You can
access the BindingDecls through the DeclRefExprs, they are not present in the CFG directly.

That's also how the static analyzer handles the BindingDecls in ExprEngine::VisitCommonDeclRefExpr().

Got it, thanks!

So, I think everything is resolved now, you've been incredibly helpful. Based on my now understanding the CFG representation of DecompositionDecls a lot better, I found a simpler solution that doesn't even need to involve our handling of DeclRefExprs and is therefore consistent with how we already handle the struct references.

I think there's a separate discussion we might want to have about whether it would be valuable (to the static analyzer and the dataflow framework, among others) to update CFG to explicitly represent BindingDecls and the AST nodes in their bindings. But, that seems better for a discourse thread. :)

ymandel edited the summary of this revision. (Show Details)Dec 9 2022, 7:50 AM

This approach looks good to me. Some context: we kept the CFGs lightweight because it looks like we did not need to do any extensions for the Clang Static Analyzer. I'm glad the dataflow framework can also work with the current representation. On the other hand, I think structured bindings in C++ are really limited, e.g., we cannot nest them, and it is nowhere near what pattern matching can do in other languages like Rust. I do remember seeing papers about extending structured bindings in at least some dimensions like nesting. I wonder if making the representation more explicit in the CFG as structured bindings get more complex will simplify implementations in the future, but that is a problem for tomorrow :)

clang/lib/Analysis/FlowSensitive/Transfer.cpp
191

Nit commented out code.

ymandel updated this revision to Diff 481677.Dec 9 2022, 10:07 AM

remove commented out code

ymandel marked an inline comment as done.Dec 9 2022, 10:07 AM

Thanks!

This approach looks good to me. Some context: we kept the CFGs lightweight because it looks like we did not need to do any extensions for the Clang Static Analyzer. I'm glad the dataflow framework can also work with the current representation. On the other hand, I think structured bindings in C++ are really limited, e.g., we cannot nest them, and it is nowhere near what pattern matching can do in other languages like Rust. I do remember seeing papers about extending structured bindings in at least some dimensions like nesting. I wonder if making the representation more explicit in the CFG as structured bindings get more complex will simplify implementations in the future, but that is a problem for tomorrow :)

Heh. Very much agree.

xazax.hun accepted this revision.Dec 9 2022, 10:17 AM
This revision is now accepted and ready to land.Dec 9 2022, 10:17 AM
This revision was landed with ongoing or failed builds.Dec 9 2022, 10:58 AM
This revision was automatically updated to reflect the committed changes.