- For assignments involving pointer types, if the machine decides that the LHS pointer has to be fixed to std::span, this patch introduces the same fixit for the RHS pointer.
- It groups all such pointer variables (that transitively depend on the unsafe pointer) and suggests that they all be fixed atomically to correctly propagate bounds information among them.
- TODOs: UUC, variables claimed by multiple gadgets (ex: int *p = q), handle parameters.
Details
Diff Detail
Event Timeline
Brilliant!!
I guess we can abandon D143133 now in favor of this patch?
Before I jump to nitpicking, I think we should sync up on how the end result should look like, in terms of warnings and notes that we display. This may affect the overall algorithm, like require us to gather more data.
clang/test/SemaCXX/warn-unsafe-buffer-usage-multi-decl-warnings.cpp | ||
---|---|---|
5โ14 | We probably want to build some finer narrative here ๐ค In this case p and q are both independently unsafe and there's a one-way connection between them. Would it make sense to emit just one warning for both of them? I suspect it also makes sense to leave a note at q = p with a text like "note: bounds information needs to propagate from p to q here" (I've no idea if it'll scale to more than 2 variables, or how much more information we'll need to gather). | |
57โ63 | In this case we definitely want to spanify q, for two independent reasons. We need to figure out how to tell that to the user though ๐ค |
I have one general question: suppose we have two variables A and B in the same group. So fix-its emitted for A includes fix-its for B and vice verse, right? Does it mean that we cannot apply all fix-its at once from the command line?
clang/include/clang/Analysis/Analyses/UnsafeBufferUsage.h | ||
---|---|---|
42 | It could be nice to have a document for this function since it is in the header. | |
clang/lib/Analysis/UnsafeBufferUsage.cpp | ||
277 | So every Fixable needs to implement this. I would love to see a document for this function. | |
587 | DeclPtrAtLeft is not used. | |
1834 | It seems like we do not care about the direction of edges in the graph, maybe we should call it UndirectedGraph? | |
clang/lib/Sema/AnalysisBasedWarnings.cpp | ||
2267 | would it be better if to make VariableGroups a map from VarDecls to groups ? |
clang/lib/Sema/AnalysisBasedWarnings.cpp | ||
---|---|---|
2250 | These note messages are nice. Maybe we can have some tests for them. It will also help telling whether variables are grouped correctly. | |
clang/test/SemaCXX/warn-unsafe-buffer-usage-multi-decl-fixits-test.cpp | ||
139 | Could we have an example where there is a cyclic dependency? Such as p = q; q = r; r = p;. |
- Handle the case where implied variables don't have a fixit strategy
- TODO: address feedback for the previous diff
clang/lib/Analysis/UnsafeBufferUsage.cpp | ||
---|---|---|
1834 | I changed it to PtrAssignmentGraph which seems (atleast to me) to give more context. It is still a directed graph because Deps[0] represents the LHS pointer and Deps[1] the RHS pointer. Hopefully the new documentation on getStrategyImplications makes this clear! | |
clang/test/SemaCXX/warn-unsafe-buffer-usage-multi-decl-warnings.cpp | ||
5โ14 | The new note is more descriptive and summarizes the reason for grouping variables together. We still don't inform the user of the direction of bounds propagation or the fact that buffers are independently unsafe. Like we discussed offline, we could consider these refinements in future versions of the tool. |
clang/lib/Analysis/UnsafeBufferUsage.cpp | ||
---|---|---|
579 | I was wondering if we need this sub-matcher declRefExpr(expr()).bind(PointerAssignRHSTag). It is eventually used as an inner matcher of binaryOperator. I doubt it will match anything. Ditto for line 582-583. | |
1667 | How about using FixablesForUnsafeVars.byVar[Var] instead of the outer loop? |
clang/lib/Analysis/UnsafeBufferUsage.cpp | ||
---|---|---|
579 | Great catch @ziqingluo-90! I was experimenting to match assignments at declarations and never intended for those sub-matchers to be part of this patch. |
I think it'd be really valuable to document the algorithm.
It's essential for the reader to understand why this is a two-step process and a simple flood-fill through bidirectional implications graph isn't going to be correct.
Then, you did something very clever to avoid finding intersections between directed "zones of influence" of every variable (and then joining them), and I really like it. In inline comments i have a couple suggestions on how to demonstrate to the reader that it's equivalent to (my) intuitive understanding of the problem. It looks like your algorithm is pretty much linear over the size of the AST, which is great because this means we don't have to worry too much about budgets! - this is probably worth proclaiming as well.
clang/lib/Analysis/UnsafeBufferUsage.cpp | ||
---|---|---|
277โ281 | You're saying this is a two-element list with well-defined LHS and RHS, but the caller implementation accepts arbitrarily long list and treats it as implication from item [0] to items [1 ... N-1]. I'm not really sure how it *should* behave in the more-than-two-elements case. IIRC our primary motivating example for more than 2 elements is foo(p1, p2, ..., pN) where foo has attribute [[unsafe_buffer_usage]] with respect to all N parameters and has a safe version that accepts N spans, but no safe versions for smaller subsets of parameters. If we try to represent it with these "directed" vectors, it becomes quadratic over N as you'll need edges from each argument to each argument. Maybe before we get to handling the other case, it makes sense to have an abstract class StrategyImplication, so that assignment-like implication was one concrete subclass, and a symmetric set of N related variables was another concrete subclass. Then the call site can handle these cases separately. For the purposes of this patch it probably makes sense for this function to just return optional<pair<VarDecl *>>, so that to manage the reader's expectations :) | |
1768 | Probably better to compare pointers. There could be variables with the same name in the same function (in smaller scopes). Also IIRC getName() crashes on anonymous declarations (not sure if it can happen in this case). | |
1826 | I was about to say that, there could be fixables that don't connect any unsafe variables, but connect two implicated variables together, so you need to iterate over *all* fixables instead. But then I noticed that FixablesForUnsafeVars is actually mis-named ๐ According to groupFixablesByVar() contains fixables for *all* variables, sorted by variable, and nobody ever checks whether it's unsafe. It'd be really valuable if we could rename it before or together with this patch! โ as it makes the code much harder to read and reason about, there could even be existing bugs based on this misunderstanding. Also, iterating over FixablesForUnsafeVars.byVar here would cause you to visit multivariable fixables twice (once for each variable), so there could be a bit of duplicated work here, maybe it still makes sense to iterate over the whole list. | |
1855โ1856 | Using Var here instead of CurrentVar is really clever. I think there needs to be a comment bragging about this decision! I was quite confused initially when I was thinking of it as essential to the algorithm, whereas in reality it looks like it's just an optimization. If CurrentVar was used, this would have straightforwardly meant "Just keep the edges in the original graph that are reachable from unsafe variables, and add reverse edges; and also no need to explore from the same var more than once". It's somewhat obvious that it would correctly define the graph for the second part of the algorithm. By using Var you're most likely achieving the same result, but dramatically cutting the amount of hoops the second part of the algorithm needs to jump through, given that a lot of these connections become "direct". But this results in a dramatically different graph, and the shape of that graph is non-deterministic (depends on iteration order in this loop). So I guess my point is, it's valuable to tell the reader that they shouldn't try to imagine how the graph is transformed because of using Var instead of CurrentVar. It's still essentially the same graph, just a bit shallower. The reader can continue reading as if CurrentVar was used, and think about why it's equivalent later. |
This looks good!
I am trying to catch up with this patch and my comments are meant as "Did we consider case X?" rather than "This doesn't work for case X."
clang/lib/Analysis/UnsafeBufferUsage.cpp | ||
---|---|---|
577 | Shall we back-port how matcher() and getFixits() methods split responsibilities that we've started to use in more recent patches? | |
1663 | Array subscript operator adds an entry to the map if the key is not found. | |
1714 | Note: VarGrpMap being std::map inserts a record if VD is not among the keys yet. | |
1721 | I am trying to understand how do the Fixables matched to other variables in the VarGroupForVD learn that they shouldn't emit any Fix-It. for (const VarDecl * V : VarGroupForVD) { FixItsForVariable.erase(VD); } |
clang/lib/Analysis/UnsafeBufferUsage.cpp | ||
---|---|---|
1768 | +1 to comparing pointers | |
1833 | IIUC we're copying elements from an std::vector to an std::set. assert(Deps.size() > 1); PtrAssignmentGraph[Deps[0]].insert(Deps.begin() + 1, Deps.end()); But I feel we might want to handle the case where Deps has exactly one element regardless. | |
1855โ1856 | +1 to adding comments | |
clang/test/SemaCXX/warn-unsafe-buffer-usage-multi-decl-warnings.cpp | ||
2 | I am looking for a test that checks that if one of the variables in the group is referred to by at least one DRE for which we don't have a Fix-It (e. g. negative index for span strategy) that the whole group will remain Fix-It-less (and note-less). |
I'm working on fixing multiple parameters at a time. It is based on this patch. So I suddenly had a few more questions.
clang/lib/Analysis/UnsafeBufferUsage.cpp | ||
---|---|---|
1738 | If VD is removed (it means giving up on fixing VD), do we also need to remove the whole group of VD? | |
1772 | Could we do the grouping after all variables having their fix-its generated so that we can directly copy their fix-its instead of re-generating them? |
clang/lib/Analysis/UnsafeBufferUsage.cpp | ||
---|---|---|
1738 | That is right and a good optimization to have. I'll include it in the next iteration of the patch along. |
clang/lib/Analysis/UnsafeBufferUsage.cpp | ||
---|---|---|
277โ281 | You are right, hadn't considered the case with more than two elements and will address it in a separate patch. |
Alright I think everything looks mostly good from high-level perspective! I have a few minor nitpicks.
clang/include/clang/Analysis/Analyses/UnsafeBufferUsage.h | ||
---|---|---|
35โ37 | We can remove the old interface now right? You removed both the definition and the call site. | |
43 | We should probably turn VarGrpMap into a const reference. Otherwise it'll be deep-copied every time we call the function. It's probably also better to make a typedef for this map type, as it shows up multiple times. Does this map need to be ordered/tree-based? Maybe we should use std::unordered_map or llvm::DenseMap? | |
clang/lib/Analysis/UnsafeBufferUsage.cpp | ||
579 | Looks like the hasOperatorName("=") check is duplicated on both sides. | |
597โ598 | We're confident that this cast always succeeds. And we really don't want to see null pointers inside these pairs. | |
1134โ1135 | (valid use case for auto according to https://llvm.org/docs/CodingStandards.html#use-auto-type-deduction-to-make-code-more-readable) | |
clang/lib/Sema/AnalysisBasedWarnings.cpp | ||
2215 | We should probably find a way to put all these plain text pieces into DiagnosticSemaKinds.td (except probably the and part). I wonder if a ... %select{|, and change %2 to %select{std::span|std::array|std::span::iterator}1 to propagate bounds information betwen them'.}` would work (like, nest more format specifiers into a %select). Or we could keep using the old diagnostic id when we don't need extra stuff at the end. (It might be a good idea to make a new format specifier for this purpose, like plural but with lists. But this is probably an overkill if there's just one warning of this kind.) | |
clang/test/SemaCXX/warn-unsafe-buffer-usage-multi-decl-warnings.cpp | ||
44 | This is a FIXME test right? In theory we want to propagate all the way up to a, but we only support assignments so far, not initializations. |
TODOs: UUC
I think now is a good time to address this. The patch isn't going to be correct without it so we can't land it until the context is checked.
clang/include/clang/Analysis/Analyses/UnsafeBufferUsage.h | ||
---|---|---|
35โ37 | Assuming you're referring to handleFixableVariable I have already removed it in this patch. Did I miss something here? |
Ok this looks great to me, almost ready to land! Still needs the UUC work.
clang/include/clang/Analysis/Analyses/UnsafeBufferUsage.h | ||
---|---|---|
35โ37 | Yeah nvm I misread, you're absolutely right! | |
clang/include/clang/Basic/DiagnosticSemaKinds.td | ||
11793โ11796 | Ok this one's also unused right? | |
11796 | Found a typo! Also I think we shouldn't have single quotes around %0 here, but instead we're supposed to stuff our const VarDecl * directly into the diagnostic stream, just like we already do in warn_unsafe_buffer_variable, which automatically surrounds it with single quotes. (unfortunately we can't do the same with %2 so we add quotes manually when we build the dynamic string) | |
clang/lib/Analysis/UnsafeBufferUsage.cpp | ||
557 |
Hi, just a heads-up, some bots seem to be unhappy with the test:
https://lab.llvm.org/buildbot/#/builders/216/builds/21765
error: 'note' diagnostics expected but not seen: File Z:\b\llvm-clang-x86_64-sie-win\llvm-project\clang\test\SemaCXX\warn-unsafe-buffer-usage-multi-decl-warnings.cpp Line 169: {{^change type of 'q' to 'std::span' to preserve bounds information, and change 'r' and 'p' to 'std::span' to propagate bounds information between them$}} error: 'note' diagnostics seen but not expected: File Z:\b\llvm-clang-x86_64-sie-win\llvm-project\clang\test\SemaCXX\warn-unsafe-buffer-usage-multi-decl-warnings.cpp Line 169: change type of 'q' to 'std::span' to preserve bounds information, and change 'p' and 'r' to 'std::span' to propagate bounds information between them 2 errors generated.
The test seems to randomly pass and fail on the bot. It seems that the order of 'p' and 'r' in the output string may not be deterministic? Can you make the test more reliable or make it handle each situation (if appropriate)?
@t-rasmud I am sorry, but I had to revert your change, it is randomly failing on the Windows bots causing a lot of instability. See the revert commit message for examples.
Hi Douglas,
No worries, I know the root cause for the issue. I will make the necessary changes and re-commit the patch.
Thanks,
Rashmi
We can remove the old interface now right? You removed both the definition and the call site.