This is an archive of the discontinued LLVM Phabricator instance.

[-Wunsafe-buffer-usage] Initiate Fix-it generation for local variable declarations
ClosedPublic

Authored by ziqingluo-90 on Dec 9 2022, 12:10 PM.

Details

Summary

Use clang fix-its to transform declarations of local variables, which are used for buffer access , to be of std::span type.

We placed a few limitations to keep the solution simple:

  1. it only transforms local variable declarations (no parameter declaration)
  2. it only considers single level pointers, i.e., pointers of type T * regardless of whether T is again a pointer.
  3. it only transforms to std::span types (no std::array, or std::span::iterator, or ...)
  4. it can only transform a VarDecl that belongs to a DeclStmt whose has a single child.

One of the purposes of keeping this patch simple enough is to first evaluate if fix-it is an appropriate approach to do the transformation.

Diff Detail

Event Timeline

ziqingluo-90 created this revision.Dec 9 2022, 12:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 9 2022, 12:10 PM
Herald added a subscriber: rnkovacs. · View Herald Transcript
ziqingluo-90 requested review of this revision.Dec 9 2022, 12:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 9 2022, 12:10 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
jkorous added a reviewer: sgatev.
NoQ added a comment.Dec 12 2022, 1:40 PM

Ok so at the conference @ymandel suggested us to use libTooling's Transformers API to make our lives easier (https://clang.llvm.org/docs/ClangTransformerTutorial.html). I'm fairly certain we should at least consider using them, but I haven't looked into it deeply enough yet. "Changing the type of a variable" is a problem that ideally should be solved only once, because it's a problem that's easy to formulate but difficult to "get right". Transformers appear to be a collection of solutions to problems of this kind. We should see if they already provide an out-of-the-box solution, or if they have bits and pieces of machinery we can reuse in our solution.

Ok so at the conference @ymandel suggested us to use libTooling's Transformers API to make our lives easier (https://clang.llvm.org/docs/ClangTransformerTutorial.html). I'm fairly certain we should at least consider using them, but I haven't looked into it deeply enough yet. "Changing the type of a variable" is a problem that ideally should be solved only once, because it's a problem that's easy to formulate but difficult to "get right". Transformers appear to be a collection of solutions to problems of this kind. We should see if they already provide an out-of-the-box solution, or if they have bits and pieces of machinery we can reuse in our solution.

  1. Transformer is good when you are basing your changes around AST matchers. It is a collection of combinators over the MatchResult type. From what I'm seeing here, though, you don't seem to be basing it on AST matchers, in which case the libraries are not as useful.
  2. We (google) have a tool for changing the type of a variable which takes into account all cascading changes to other declarations that result from that type change. It is built on Transformer and another library that builds a graph of relationships in the AST, relevant to the type-rewriting problem. Unfortunately, that's all internal -- not for IP reasons but simply because we never had reason to upstream it (nor an obvioius place to put it). So, I'd say that this advanced tooling is useful here if you want to _selectively_ change type T to type S, so you need tooling to tell you specifically which S's to update. If you're changing all T to S, then the problem is simpler.
NoQ added a comment.Dec 13 2022, 6:44 PM
  1. We (google) have a tool for changing the type of a variable which takes into account all cascading changes to other declarations that result from that type change. It is built on Transformer and another library that builds a graph of relationships in the AST, relevant to the type-rewriting problem. Unfortunately, that's all internal -- not for IP reasons but simply because we never had reason to upstream it (nor an obvioius place to put it). So, I'd say that this advanced tooling is useful here if you want to _selectively_ change type T to type S, so you need tooling to tell you specifically which S's to update. If you're changing all T to S, then the problem is simpler.

Yeah we want to be selective, so in our case there's an extra analysis step between noticing that the variable is of a certain type and realizing that we want to change it to a different type.

I guess I'm curious if there are smaller building blocks of your machine that we could reuse. Like, for instance, the procedure of updating the type of the variable (without even updating all uses): it's a somewhat self-contained problem, and it's difficult enough on its own (you need to split up multi-variable DeclStmts in two or three, find the right source locations for everything, and so on). I wonder if we could avoid reinventing the wheel there somehow.

  1. We (google) have a tool for changing the type of a variable which takes into account all cascading changes to other declarations that result from that type change. It is built on Transformer and another library that builds a graph of relationships in the AST, relevant to the type-rewriting problem. Unfortunately, that's all internal -- not for IP reasons but simply because we never had reason to upstream it (nor an obvioius place to put it). So, I'd say that this advanced tooling is useful here if you want to _selectively_ change type T to type S, so you need tooling to tell you specifically which S's to update. If you're changing all T to S, then the problem is simpler.

Yeah we want to be selective, so in our case there's an extra analysis step between noticing that the variable is of a certain type and realizing that we want to change it to a different type.

I guess I'm curious if there are smaller building blocks of your machine that we could reuse. Like, for instance, the procedure of updating the type of the variable (without even updating all uses): it's a somewhat self-contained problem, and it's difficult enough on its own (you need to split up multi-variable DeclStmts in two or three, find the right source locations for everything, and so on). I wonder if we could avoid reinventing the wheel there somehow.

Sure. I'll try to get you some sample code later today.

did a rebase

Thanks for the rebase!

Nit: I'd just replace std::function with auto as it saves us repeating the parameter types (and #include <functional>).

clang/lib/Analysis/UnsafeBufferUsage.cpp
676
680
684

To follow LLVM's convention that global variables better have types that do NOT require construction, I change the type of the global variable from std::string to constexpr const char * const.

Rebase the patch with respect to the re-architecture: https://reviews.llvm.org/D140062.

Since we do not have any FixableGadget to trigger fix-its at this point, I let fix-its of local variable declarations always be emitted for the purpose of testing.

jkorous added inline comments.Jan 9 2023, 2:30 PM
clang/include/clang/Analysis/Analyses/UnsafeBufferUsage.h
56

Should we rather pick something that is syntactically incorrect in C++ in order to prevent accidental silent corruption of the sources?
FWIW Xcode uses <#placeholder#> syntax.

clang/test/SemaCXX/warn-unsafe-buffer-usage-fixits.cpp
1 ↗(On Diff #487527)

I am starting to think we should split up our tests to allow for less conflicts between patches.
Could we please rename the file to a more specific name e. g. warn-unsafe-buffer-usage-fixits-local-var-span.cpp?

1 ↗(On Diff #487527)

Also, let's check only correctness of the FixIts in this test.
I would remove RUN lines with -verify and expected-warnings.

NoQ added a comment.Jan 10 2023, 5:54 PM

Since we do not have any FixableGadget to trigger fix-its at this point, I let fix-its of local variable declarations always be emitted for the purpose of testing.

It sounds to me as if by landing the patch we'll temporarily make the compiler emit incorrect fixes. I think we should avoid doing that. Is it possible to wait until our first proof-of-concept FixableGadget lands before landing this patch? I think there shouldn't be too much conflict between such patches.

It sounds to me as if by landing the patch we'll temporarily make the compiler emit incorrect fixes. I think we should avoid doing that. Is it possible to wait until our first proof-of-concept FixableGadget lands before landing this patch? I think there shouldn't be too much conflict between such patches.

Any patch with FixableGadget will face the same problem - i. e. we still won't emit the fixit until this patch has landed.
We decided to include a simple Fixable in this patch to unlock testing.

jkorous added inline comments.Jan 11 2023, 3:44 PM
clang/include/clang/Analysis/Analyses/UnsafeBufferUsage.h
56
ziqingluo-90 added inline comments.Jan 11 2023, 5:25 PM
clang/include/clang/Analysis/Analyses/UnsafeBufferUsage.h
56

It looks like we could put different messages in between <# and #> to hint users in different situations. I'll make this member a function then.

Rebased the code w.r.t. a series of refactoring in [-Wunsafe-buffer-usage].
Added a FixableGadget for array subscripts of the form DRE[*] in the context of lvalue-to-rvalue casting.

Also did a refactoring at the place where matchers are applied: Matchers of a FixableGadget and of a WarningGadget cat match for the same AST node. So they should not be put in an anyOf group, otherwise they can disable each other due to short-circuiting.

NoQ added inline comments.Jan 19 2023, 4:34 PM
clang/lib/Analysis/UnsafeBufferUsage.cpp
385

There's already a forward declaration on line 144!

412

Subscripts [0] might be safe, but this doesn't mean we should avoid emitting fixits when we see them. I think this clause should be dropped here. (Might be worth adding a test).

723

Uhh, I'm worried about this approach. I'm not sure these pretty-printers are even correct. They try to look similar to the underlying code, but I don't think they're required to. And even if they were, It only takes one forgotten override in StmtPrinter to produce incorrect pretty-prints that won't compile. So it's very unreliable technology.

I think the intended way to do these things is to simply avoid overwriting code that you want to preserve. Instead, modify code around it. Eg., if you want to change

int *x = foo();

to

std::span<int> x { foo(), N };

then you add std::span< before int, then don't touch int, then replace * with > , then replace = with {, then don't touch foo(), then add , N } immediately after.

If you actually need to move code around (eg., make preserved chunks appear in a different order), I think this should go through a facility like Lexer::getSourceText() which would give you the exact source code as written.

Refactored the fix-it generation code to stop using the pretty-printer.

Addressed the minor comments

NoQ added a comment.Jan 24 2023, 4:07 PM

Awesome, we're getting closer to producing our first fix!

I think the most important thing right now is to add a lot of SourceLocation::isMacroID() checks everywhere, so that to *never* try to fix code that's fully or partially expanded from macros. I suspect that we can do that "after the fact": just scan the list of emitted fixits for any source ranges that start or end in macros, and if even one such range is found, abandon the fix. Might be worth double-checking that the compiler doesn't already do that automatically for all emitted fixits.

I see that the patch doesn't touch handleFixableVariable(). Do we still attach fixits to the warning, or did we already change it to be attached to a note associated with the warning? Because fixits with placeholders aren't allowed on warnings, we should make sure it's attached to note before landing this patch.

clang/include/clang/Analysis/Analyses/UnsafeBufferUsage.h
42

Do we really want this method in the public interface? If the idea is to let people the user override it, maybe let's actually allow that by dropping static and adding virtual? I think this is actually a good idea, maybe if somebody uses our analysis outside of the clang binary, they may want a different placeholder.

43

const StringRef & seems redundant given that StringRef is already a "Ref". Just pass by value.

clang/lib/Analysis/UnsafeBufferUsage.cpp
407

A bit more concise.

637

Hmm, did this need to be moved? I don't think you're calling this function from the new code.

729
771

Excellent observation!

774

I wonder how do we end up in such situation. If it's not array new, this means it's also not a buffer of multiple elements. But I guess the pointer can be reassigned later. Yeah I think you're right, this is the correct solution. Maybe let's leave a comment explaining how this could happen?

776–777

That's another case where .getTypePtr() is unnecessary: dyn_cast<T>(QT.getTypePtr()) is equivalent to`QT->getAs<T>() where -> is QualType`'s overloaded operator and getAs is Type::getAs.

Additionally, array types are special - see doxygen comments for ASTContext::getAsArrayType(). For some reason you're suppposed to consult ASTContext when casting them, which I did in the suggested fix. I'm not sure if it matters in this case when we just want to extract the size.

783–784
int x = 1;
char *ptr = &x; // std::span<char> ptr { &x, 4 };

This is valid code. I suspect we want to check types as well, to see that type sizes match.

Most of the time code like this violates strict aliasing, but char is exceptional, and even if it did violate strict aliasing, people can compile with -fno-strict-aliasing to define away the UB, so we have to respect that.

821

getDesugaredType() looks redundant, getPointeeType() probably already does that.

852

Interesting, so you don't need to discriminate between situation "fix is impossible" and situation "fix is not necessary" because the latter is impossible because the variable type definitely needs fixing? Which is different from the behavior of FixableGadget::getFixits() that returns an empty optional when the fix is impossible and a non-empty optional with zero fixes when the fix is not necessary. I guess let's leave a comment at fixVariable() documenting this behavior?

I also suspect that eventually we'll want variable declarations to simply act as another fixable gadget. In this case we might have to undo this solution and go back to optionals.

859–861

I suspect this code will crash whenever such declarations are encountered, so this is probably not something we want to commit.

Typically assert(..., "not implemented yet") makes sense only when the crashing codepath is also unreachable because no other facility in the compiler exercises it.

It probably make sense to return {} in these cases: the function is still responsible for them, we just haven't implemented them yet.

Let's also add a FIXME test to demonstrate that there's no fix yet in such situations.

clang/test/SemaCXX/warn-unsafe-buffer-usage-fixits-local-var-span.cpp
56

In the new int[n] case it also technically doesn't have a constant value. The code correctly checks for side effects, the comment should also probably say something about side effects.

jkorous added inline comments.Jan 27 2023, 11:51 AM
clang/lib/Analysis/UnsafeBufferUsage.cpp
126

Nit: I think we could simplify this to just

static auto isInUnspecifiedLvalueContext(internal::Matcher<Expr> innerMatcher) {
  return implicitCastExpr(hasCastKind(CastKind::CK_LValueToRValue),
                          castSubExpr(innerMatcher));
}
ziqingluo-90 marked 8 inline comments as done.Jan 27 2023, 12:33 PM
ziqingluo-90 added inline comments.
clang/lib/Analysis/UnsafeBufferUsage.cpp
637

it does look like I moved it. Will change it back.

783–784

This code is not valid in C++. An explicit cast is needed in front of &x. I will add a test to show that

int x = 1;
char * ptr = (char *)&x;

will have a place holder for the span size.

Addressing comments

ziqingluo-90 marked an inline comment as done.Jan 27 2023, 12:49 PM
NoQ added a comment.Jan 27 2023, 5:48 PM

Thank you!! I think we're almost ready to commit but this concern is still hanging:

I see that the patch doesn't touch handleFixableVariable(). Do we still attach fixits to the warning, or did we already change it to be attached to a note associated with the warning? Because fixits with placeholders aren't allowed on warnings, we should make sure it's attached to note before landing this patch.

clang/lib/Analysis/UnsafeBufferUsage.cpp
783–784

Yes you're right! It's only valid in C where these fixits don't apply.

I am sorry I haven't notice this earlier - let's fix this before we land the patch.

clang/lib/Analysis/UnsafeBufferUsage.cpp
695

We either need a zero to terminate the string or pass the size of Txt to the std::string constructor here. (While toString's name might sound like it'll take care of that it does not.)

Simplified testcase:

void local_ptr_to_array() {
  int tmp;
  int a[10];
  int *p = a;
  tmp = p[5];
}

what I get is (something like this):

void local_ptr_to_array() {
  int tmp;
  int a[10];
  std::span<int> p {a, 10�o};
  tmp = p[5];
}

The problem is that APInt::toString stores '1' and '0' to Txt but is missing the terminating \0 character that std::string constructor expects.

695
NoQ added inline comments.Jan 31 2023, 5:08 PM
clang/lib/Analysis/UnsafeBufferUsage.cpp
699
ziqingluo-90 marked 2 inline comments as done.

To attach fix-its to notes instead of warnings.
Fix the ending '\0' issue raised by @jkorous

ziqingluo-90 marked an inline comment as done.Feb 1 2023, 1:27 PM
jkorous added a comment.EditedFeb 1 2023, 2:53 PM

I got a test failure in SemaCXX/warn-unsafe-buffer-usage-fixits-local-var-span.cpp which I believe is caused solely by the fact that we emit the diagnostics in different order.
I am not sure it matters and since the Fix-Its clearly specify what source lines they apply to I suggest we simply replace every // CHECK: with // CHECK-DAG:.
That fixed the problem for me.

https://llvm.org/docs/CommandGuide/FileCheck.html#the-check-dag-directive

jkorous added inline comments.Feb 1 2023, 8:15 PM
clang/lib/Analysis/UnsafeBufferUsage.cpp
871

I believe we should add another condition here: VD->isLocalVarDecl() as we don't support globals (yet?).
We run the matcher with any_ds tag only on function bodies so we won't discover globals anyway and the assert(It != Defs.end() && "Definition never discovered!"); would fail.

jkorous added inline comments.Feb 2 2023, 7:11 PM
clang/lib/Analysis/UnsafeBufferUsage.cpp
680

I am afraid I might have found one more problem :(
I believe that for span strategy we have to make sure the index is > 0. Otherwise
That means either an unsigned integer or signed or unsigned literal that is greater than 0.
For the literal you can take inspiration here:
https://reviews.llvm.org/D142795

address comments

ziqingluo-90 marked 2 inline comments as done.Feb 3 2023, 10:56 AM
NoQ accepted this revision.Feb 3 2023, 1:43 PM

As far as I'm concerned, I think this is great for the initial implementation! Let's commit as soon as Jan confirms that his problem is addressed.

clang/lib/Analysis/UnsafeBufferUsage.cpp
871

I think this check should happen much earlier. Like, we shouldn't define a strategy for globals, and we shouldn't build fixables out of them. And then assert() here, just to double-check.

This revision is now accepted and ready to land.Feb 3 2023, 1:43 PM
jkorous added inline comments.Feb 3 2023, 2:08 PM
clang/lib/Analysis/UnsafeBufferUsage.cpp
680

@ziqingluo-90 Sorry, looks like I wasn't clear here.
One case (that you've already addressed) is ptr[-5] - for that we can't use std::span::operator[] as it would immediately trap.
But there's the other case of:

uint8_t foo(uint8_t *ptr, int idx) {
  return ptr[idx]
}

If anyone uses a value that's both signed and not a compile-time constant then our compile-time analysis can not prove that the index is always >= 0 and consequently we can't use std::span::operator[] as a replacement.
That's why I think we really need to make sure that the index is ether a) positive literal or b) unsigned.
WDYT?

jkorous added inline comments.Feb 3 2023, 2:10 PM
clang/lib/Analysis/UnsafeBufferUsage.cpp
680

And yes ... I was wrong - literal 0 is totally fine. Thanks for spotting that!

871

Fair enough, we can do that. Maybe a follow-up patch?

ziqingluo-90 updated this revision to Diff 495250.EditedFeb 6 2023, 12:47 PM

Fixed a bug in manipulating source locations:

Using SourceLocation::getLocWithOffset to get a new source location L with a relative offset is convenient. But it requires extra care to make sure that L has the same file ID as the location where L is originated. Otherwise, L may be associated with a wrong file ID (or macro ID). Feeding such an L to a Lexer could cause failures.

To avoid emitting incorrect fix-its for array subscripts on span objects:

  • For unsafe operations of the form s[e] where s is being transformed to be a span, we only emit fix-its for s[e] when e is a non-negative integer literal or e has an unsigned integer type.
clang/lib/Analysis/UnsafeBufferUsage.cpp
680

I think you are right. Fixed it.

jkorous accepted this revision.Feb 6 2023, 2:15 PM

LGTM
Thank you!

This revision was landed with ongoing or failed builds.Feb 7 2023, 1:18 PM
This revision was automatically updated to reflect the committed changes.