This is an archive of the discontinued LLVM Phabricator instance.

[-Wunsafe-buffer-usage] Group variables associated by pointer assignments
ClosedPublic

Authored by t-rasmud on Mar 9 2023, 4:14 PM.

Details

Summary
  • For assignments involving pointer types, if the machine decides that the LHS pointer has to be fixed to std::span, this patch introduces the same fixit for the RHS pointer.
  • It groups all such pointer variables (that transitively depend on the unsafe pointer) and suggests that they all be fixed atomically to correctly propagate bounds information among them.
  • TODOs: UUC, variables claimed by multiple gadgets (ex: int *p = q), handle parameters.

Diff Detail

Event Timeline

t-rasmud created this revision.Mar 9 2023, 4:14 PM
Herald added a project: Restricted Project. ยท View Herald TranscriptMar 9 2023, 4:14 PM
t-rasmud requested review of this revision.Mar 9 2023, 4:14 PM
NoQ added a comment.Mar 13 2023, 3:51 PM

Brilliant!!

I guess we can abandon D143133 now in favor of this patch?

Before I jump to nitpicking, I think we should sync up on how the end result should look like, in terms of warnings and notes that we display. This may affect the overall algorithm, like require us to gather more data.

clang/test/SemaCXX/warn-unsafe-buffer-usage-multi-decl-warnings.cpp
5โ€“14

We probably want to build some finer narrative here ๐Ÿค” In this case p and q are both independently unsafe and there's a one-way connection between them. Would it make sense to emit just one warning for both of them? I suspect it also makes sense to leave a note at q = p with a text like "note: bounds information needs to propagate from p to q here" (I've no idea if it'll scale to more than 2 variables, or how much more information we'll need to gather).

57โ€“63

In this case we definitely want to spanify q, for two independent reasons. We need to figure out how to tell that to the user though ๐Ÿค”

I have one general question: suppose we have two variables A and B in the same group. So fix-its emitted for A includes fix-its for B and vice verse, right? Does it mean that we cannot apply all fix-its at once from the command line?

clang/include/clang/Analysis/Analyses/UnsafeBufferUsage.h
42

It could be nice to have a document for this function since it is in the header.

clang/lib/Analysis/UnsafeBufferUsage.cpp
277

So every Fixable needs to implement this. I would love to see a document for this function.

587

DeclPtrAtLeft is not used.

1834

It seems like we do not care about the direction of edges in the graph, maybe we should call it UndirectedGraph?

clang/lib/Sema/AnalysisBasedWarnings.cpp
2267

would it be better if to make VariableGroups a map from VarDecls to groups ?

t-rasmud updated this revision to Diff 507863.Mar 23 2023, 1:40 PM
ziqingluo-90 added inline comments.Mar 29 2023, 11:03 AM
clang/lib/Sema/AnalysisBasedWarnings.cpp
2250

These note messages are nice. Maybe we can have some tests for them. It will also help telling whether variables are grouped correctly.

clang/test/SemaCXX/warn-unsafe-buffer-usage-multi-decl-fixits-test.cpp
139

Could we have an example where there is a cyclic dependency? Such as p = q; q = r; r = p;.

t-rasmud updated this revision to Diff 509440.Mar 29 2023, 12:31 PM

Adds tests that were accidentally deleted in the last diff.

t-rasmud updated this revision to Diff 511227.Apr 5 2023, 3:53 PM
  • Handle the case where implied variables don't have a fixit strategy
  • TODO: address feedback for the previous diff
t-rasmud updated this revision to Diff 511492.Apr 6 2023, 11:32 AM
t-rasmud marked 3 inline comments as done.
  • Address review comments.
t-rasmud marked 4 inline comments as done.Apr 6 2023, 11:48 AM
t-rasmud added inline comments.
clang/lib/Analysis/UnsafeBufferUsage.cpp
1834

I changed it to PtrAssignmentGraph which seems (atleast to me) to give more context. It is still a directed graph because Deps[0] represents the LHS pointer and Deps[1] the RHS pointer. Hopefully the new documentation on getStrategyImplications makes this clear!

clang/test/SemaCXX/warn-unsafe-buffer-usage-multi-decl-warnings.cpp
5โ€“14

The new note is more descriptive and summarizes the reason for grouping variables together. We still don't inform the user of the direction of bounds propagation or the fact that buffers are independently unsafe. Like we discussed offline, we could consider these refinements in future versions of the tool.

t-rasmud edited the summary of this revision. (Show Details)Apr 6 2023, 11:52 AM
ziqingluo-90 added inline comments.Apr 6 2023, 2:44 PM
clang/lib/Analysis/UnsafeBufferUsage.cpp
579

I was wondering if we need this sub-matcher declRefExpr(expr()).bind(PointerAssignRHSTag). It is eventually used as an inner matcher of binaryOperator. I doubt it will match anything.

Ditto for line 582-583.

1667

How about using FixablesForUnsafeVars.byVar[Var] instead of the outer loop?

t-rasmud updated this revision to Diff 511552.Apr 6 2023, 3:38 PM
t-rasmud marked 2 inline comments as done.

Address review comments.

t-rasmud marked 2 inline comments as done.Apr 6 2023, 3:40 PM
t-rasmud added inline comments.
clang/lib/Analysis/UnsafeBufferUsage.cpp
579

Great catch @ziqingluo-90! I was experimenting to match assignments at declarations and never intended for those sub-matchers to be part of this patch.

NoQ added a comment.Apr 6 2023, 5:33 PM

I think it'd be really valuable to document the algorithm.

It's essential for the reader to understand why this is a two-step process and a simple flood-fill through bidirectional implications graph isn't going to be correct.

Then, you did something very clever to avoid finding intersections between directed "zones of influence" of every variable (and then joining them), and I really like it. In inline comments i have a couple suggestions on how to demonstrate to the reader that it's equivalent to (my) intuitive understanding of the problem. It looks like your algorithm is pretty much linear over the size of the AST, which is great because this means we don't have to worry too much about budgets! - this is probably worth proclaiming as well.

clang/lib/Analysis/UnsafeBufferUsage.cpp
277โ€“281

You're saying this is a two-element list with well-defined LHS and RHS, but the caller implementation accepts arbitrarily long list and treats it as implication from item [0] to items [1 ... N-1].

I'm not really sure how it *should* behave in the more-than-two-elements case.

IIRC our primary motivating example for more than 2 elements is foo(p1, p2, ..., pN) where foo has attribute [[unsafe_buffer_usage]] with respect to all N parameters and has a safe version that accepts N spans, but no safe versions for smaller subsets of parameters. If we try to represent it with these "directed" vectors, it becomes quadratic over N as you'll need edges from each argument to each argument.

Maybe before we get to handling the other case, it makes sense to have an abstract class StrategyImplication, so that assignment-like implication was one concrete subclass, and a symmetric set of N related variables was another concrete subclass. Then the call site can handle these cases separately. For the purposes of this patch it probably makes sense for this function to just return optional<pair<VarDecl *>>, so that to manage the reader's expectations :)

1768

Probably better to compare pointers. There could be variables with the same name in the same function (in smaller scopes). Also IIRC getName() crashes on anonymous declarations (not sure if it can happen in this case).

1826

I was about to say that, there could be fixables that don't connect any unsafe variables, but connect two implicated variables together, so you need to iterate over *all* fixables instead.

But then I noticed that FixablesForUnsafeVars is actually mis-named ๐Ÿ™ According to groupFixablesByVar() contains fixables for *all* variables, sorted by variable, and nobody ever checks whether it's unsafe. It'd be really valuable if we could rename it before or together with this patch! โ€“ as it makes the code much harder to read and reason about, there could even be existing bugs based on this misunderstanding.

Also, iterating over FixablesForUnsafeVars.byVar here would cause you to visit multivariable fixables twice (once for each variable), so there could be a bit of duplicated work here, maybe it still makes sense to iterate over the whole list.

1855โ€“1856

Using Var here instead of CurrentVar is really clever. I think there needs to be a comment bragging about this decision! I was quite confused initially when I was thinking of it as essential to the algorithm, whereas in reality it looks like it's just an optimization.

If CurrentVar was used, this would have straightforwardly meant "Just keep the edges in the original graph that are reachable from unsafe variables, and add reverse edges; and also no need to explore from the same var more than once". It's somewhat obvious that it would correctly define the graph for the second part of the algorithm.

By using Var you're most likely achieving the same result, but dramatically cutting the amount of hoops the second part of the algorithm needs to jump through, given that a lot of these connections become "direct". But this results in a dramatically different graph, and the shape of that graph is non-deterministic (depends on iteration order in this loop).

So I guess my point is, it's valuable to tell the reader that they shouldn't try to imagine how the graph is transformed because of using Var instead of CurrentVar. It's still essentially the same graph, just a bit shallower. The reader can continue reading as if CurrentVar was used, and think about why it's equivalent later.

This looks good!
I am trying to catch up with this patch and my comments are meant as "Did we consider case X?" rather than "This doesn't work for case X."

clang/lib/Analysis/UnsafeBufferUsage.cpp
577

Shall we back-port how matcher() and getFixits() methods split responsibilities that we've started to use in more recent patches?
For other fixables we now check if the DRE refers to a var decl already in the matcher like this:
declRefExpr(to(varDecl()))

1663

Array subscript operator adds an entry to the map if the key is not found.
https://en.cppreference.com/w/cpp/container/map/operator_at
That might or might not be ok here.
But we might also just take FixablesForUnsafeVars as a const reference and use find() instead.
https://en.cppreference.com/w/cpp/container/map/find

1714

Note: VarGrpMap being std::map inserts a record if VD is not among the keys yet.

1721

I am trying to understand how do the Fixables matched to other variables in the VarGroupForVD learn that they shouldn't emit any Fix-It.
I was naively expecting that we'd do:

for (const VarDecl * V : VarGroupForVD) {
    FixItsForVariable.erase(VD);
}
jkorous added inline comments.Apr 7 2023, 6:04 PM
clang/lib/Analysis/UnsafeBufferUsage.cpp
1768

+1 to comparing pointers
That's what we do above.

1833

IIUC we're copying elements from an std::vector to an std::set.
Nit: Maybe we could avoid the explicit iteration?

assert(Deps.size() > 1);
PtrAssignmentGraph[Deps[0]].insert(Deps.begin() + 1, Deps.end());

But I feel we might want to handle the case where Deps has exactly one element regardless.

1855โ€“1856

+1 to adding comments
Our future selves might be pretty grateful if they have to debug this code :)

clang/test/SemaCXX/warn-unsafe-buffer-usage-multi-decl-warnings.cpp
2

I am looking for a test that checks that if one of the variables in the group is referred to by at least one DRE for which we don't have a Fix-It (e. g. negative index for span strategy) that the whole group will remain Fix-It-less (and note-less).
I can't find any and maybe that's just because I didn't look hard enough but it makes me wonder - should we add a short explanation to such test (either an existing one or the one we should add)?

t-rasmud updated this revision to Diff 512973.Apr 12 2023, 2:11 PM
t-rasmud marked an inline comment as done.

Address partial feedback.

I'm working on fixing multiple parameters at a time. It is based on this patch. So I suddenly had a few more questions.

clang/lib/Analysis/UnsafeBufferUsage.cpp
1738

If VD is removed (it means giving up on fixing VD), do we also need to remove the whole group of VD?

1772

Could we do the grouping after all variables having their fix-its generated so that we can directly copy their fix-its instead of re-generating them?
This requires fix-its generated for two different variables to be independent. And, I think so far they are independent.

t-rasmud marked 6 inline comments as done.Apr 25 2023, 12:00 PM
t-rasmud added inline comments.
clang/lib/Analysis/UnsafeBufferUsage.cpp
1738

That is right and a good optimization to have. I'll include it in the next iteration of the patch along.

t-rasmud updated this revision to Diff 516910.Apr 25 2023, 2:25 PM

Address more feedback.

t-rasmud updated this revision to Diff 516923.Apr 25 2023, 2:36 PM

Minor fix.

t-rasmud marked 2 inline comments as done.Apr 25 2023, 2:38 PM
t-rasmud marked an inline comment as done.Apr 25 2023, 3:26 PM
t-rasmud updated this revision to Diff 517385.Apr 26 2023, 4:40 PM

Address feedback.

t-rasmud marked 3 inline comments as done.Apr 26 2023, 4:44 PM
t-rasmud added inline comments.
clang/lib/Analysis/UnsafeBufferUsage.cpp
277โ€“281

You are right, hadn't considered the case with more than two elements and will address it in a separate patch.

t-rasmud updated this revision to Diff 518064.Apr 28 2023, 2:59 PM
t-rasmud marked an inline comment as done.

Add documentation.

t-rasmud marked 2 inline comments as done.Apr 28 2023, 2:59 PM
t-rasmud retitled this revision from [-Wunsafe-buffer-usage][WIP] Group variables associated by pointer assignments to [-Wunsafe-buffer-usage] Group variables associated by pointer assignments.Apr 28 2023, 3:00 PM
NoQ added a comment.May 2 2023, 2:12 PM

Alright I think everything looks mostly good from high-level perspective! I have a few minor nitpicks.

clang/include/clang/Analysis/Analyses/UnsafeBufferUsage.h
35โ€“37

We can remove the old interface now right? You removed both the definition and the call site.

43

We should probably turn VarGrpMap into a const reference. Otherwise it'll be deep-copied every time we call the function.

It's probably also better to make a typedef for this map type, as it shows up multiple times.

Does this map need to be ordered/tree-based? Maybe we should use std::unordered_map or llvm::DenseMap?

clang/lib/Analysis/UnsafeBufferUsage.cpp
579

Looks like the hasOperatorName("=") check is duplicated on both sides.

597โ€“598

We're confident that this cast always succeeds. And we really don't want to see null pointers inside these pairs.

1134โ€“1135
clang/lib/Sema/AnalysisBasedWarnings.cpp
2215

We should probably find a way to put all these plain text pieces into DiagnosticSemaKinds.td (except probably the and part). I wonder if a

... %select{|, and change %2 to %select{std::span|std::array|std::span::iterator}1 to propagate bounds information betwen them'.}`

would work (like, nest more format specifiers into a %select). Or we could keep using the old diagnostic id when we don't need extra stuff at the end.

(It might be a good idea to make a new format specifier for this purpose, like plural but with lists. But this is probably an overkill if there's just one warning of this kind.)

clang/test/SemaCXX/warn-unsafe-buffer-usage-multi-decl-warnings.cpp
44

This is a FIXME test right? In theory we want to propagate all the way up to a, but we only support assignments so far, not initializations.

NoQ added a comment.May 2 2023, 2:28 PM

TODOs: UUC

I think now is a good time to address this. The patch isn't going to be correct without it so we can't land it until the context is checked.

t-rasmud updated this revision to Diff 519273.May 3 2023, 3:08 PM

Address comments.

t-rasmud marked 4 inline comments as done.May 3 2023, 3:10 PM
t-rasmud marked an inline comment as done.
t-rasmud added inline comments.May 3 2023, 3:21 PM
clang/include/clang/Analysis/Analyses/UnsafeBufferUsage.h
35โ€“37

Assuming you're referring to handleFixableVariable I have already removed it in this patch. Did I miss something here?

t-rasmud updated this revision to Diff 519297.May 3 2023, 4:23 PM

Modify diagnostic message.

t-rasmud marked an inline comment as done.May 9 2023, 11:28 AM
t-rasmud updated this revision to Diff 520821.May 9 2023, 2:38 PM

Use nested %select in format specifier of the diagnostic message.

t-rasmud marked an inline comment as done.May 9 2023, 2:42 PM

Friendly Ping.

NoQ added a comment.May 9 2023, 4:05 PM

Ok this looks great to me, almost ready to land! Still needs the UUC work.

clang/include/clang/Analysis/Analyses/UnsafeBufferUsage.h
35โ€“37

Yeah nvm I misread, you're absolutely right!

clang/include/clang/Basic/DiagnosticSemaKinds.td
11793โ€“11796

Ok this one's also unused right?

11796

Found a typo!

Also I think we shouldn't have single quotes around %0 here, but instead we're supposed to stuff our const VarDecl * directly into the diagnostic stream, just like we already do in warn_unsafe_buffer_variable, which automatically surrounds it with single quotes.

(unfortunately we can't do the same with %2 so we add quotes manually when we build the dynamic string)

clang/lib/Analysis/UnsafeBufferUsage.cpp
557
t-rasmud updated this revision to Diff 523596.May 18 2023, 4:09 PM

Address comments.

t-rasmud marked 3 inline comments as done.May 18 2023, 4:11 PM
NoQ accepted this revision.May 22 2023, 11:55 AM

Ok I think this is good to go! Amazing work!!

This revision is now accepted and ready to land.May 22 2023, 11:55 AM
This revision was landed with ongoing or failed builds.May 24 2023, 4:21 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. ยท View Herald TranscriptMay 24 2023, 4:21 PM
Herald added a subscriber: cfe-commits. ยท View Herald Transcript

Hi, just a heads-up, some bots seem to be unhappy with the test:

https://lab.llvm.org/buildbot/#/builders/216/builds/21765

error: 'note' diagnostics expected but not seen: 
  File Z:\b\llvm-clang-x86_64-sie-win\llvm-project\clang\test\SemaCXX\warn-unsafe-buffer-usage-multi-decl-warnings.cpp Line 169: {{^change type of 'q' to 'std::span' to preserve bounds information, and change 'r' and 'p' to 'std::span' to propagate bounds information between them$}}
error: 'note' diagnostics seen but not expected: 
  File Z:\b\llvm-clang-x86_64-sie-win\llvm-project\clang\test\SemaCXX\warn-unsafe-buffer-usage-multi-decl-warnings.cpp Line 169: change type of 'q' to 'std::span' to preserve bounds information, and change 'p' and 'r' to 'std::span' to propagate bounds information between them
2 errors generated.
dyung added a subscriber: dyung.May 24 2023, 10:36 PM

Hi, just a heads-up, some bots seem to be unhappy with the test:

https://lab.llvm.org/buildbot/#/builders/216/builds/21765

error: 'note' diagnostics expected but not seen: 
  File Z:\b\llvm-clang-x86_64-sie-win\llvm-project\clang\test\SemaCXX\warn-unsafe-buffer-usage-multi-decl-warnings.cpp Line 169: {{^change type of 'q' to 'std::span' to preserve bounds information, and change 'r' and 'p' to 'std::span' to propagate bounds information between them$}}
error: 'note' diagnostics seen but not expected: 
  File Z:\b\llvm-clang-x86_64-sie-win\llvm-project\clang\test\SemaCXX\warn-unsafe-buffer-usage-multi-decl-warnings.cpp Line 169: change type of 'q' to 'std::span' to preserve bounds information, and change 'p' and 'r' to 'std::span' to propagate bounds information between them
2 errors generated.

The test seems to randomly pass and fail on the bot. It seems that the order of 'p' and 'r' in the output string may not be deterministic? Can you make the test more reliable or make it handle each situation (if appropriate)?

dyung added a comment.May 25 2023, 2:12 AM

@t-rasmud I am sorry, but I had to revert your change, it is randomly failing on the Windows bots causing a lot of instability. See the revert commit message for examples.

Hi Douglas,

No worries, I know the root cause for the issue. I will make the necessary changes and re-commit the patch.

Thanks,
Rashmi

NoQ added a comment.May 25 2023, 12:46 PM

Thank you Sergei for catching this! Thank you Rashmi for fixing this!