This is an archive of the discontinued LLVM Phabricator instance.

[-Wunsafe-buffer-usage] Add Fixable for dereference of simple ptr arithmetic
ClosedPublic

Authored by ziqingluo-90 on Jan 27 2023, 6:26 PM.

Details

Summary

For each expression e of the form *(DRE + n) (or *(n + DRE)), where DRE has a pointer type and n is an integer literal, e will be transformed to DRE[n] (or n[DRE] respectively), if

  • 1. e is at the left-hand side of an assignment or is an lvalue being casted to an rvalue; and
  • 2. the variable referred by DRE is going to be transformed to be of std::span type.

Diff Detail

Event Timeline

jkorous created this revision.Jan 27 2023, 6:26 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 27 2023, 6:26 PM
Herald added a subscriber: ChuanqiXu. · View Herald Transcript

I think that my current name of the Fixable is pretty bad - open to suggestions!

clang/lib/Analysis/UnsafeBufferUsage.cpp
608

I promise I will change the format - this one is just natural to me and I used it to create the matcher.

jkorous added inline comments.Jan 27 2023, 6:33 PM
clang/lib/Analysis/UnsafeBufferUsage.cpp
839

We will do this in: https://reviews.llvm.org/D139737
I will remove it from this patch.

NoQ added a comment.Jan 30 2023, 1:32 PM

Nice! I have some nitpicks.

clang/include/clang/Analysis/Analyses/UnsafeBufferUsageGadgets.def
33–36

Spacing is a bit weird here.

clang/lib/Analysis/UnsafeBufferUsage.cpp
608

Looks quite readable to me!

633–635

Interesting, you're using std::nullopt to indicate both that fix isn't implemented yet (below) and that the fix is fundamentally impossible (here). I think a slightly cleaner design would be to make this check part of the matcher: if it's not fixable, let's not construct a Fixable. Then return std::nullopt will always indicate "well, it's probably fixable but not implemented yet". This way later we'll be able to "profile" our fixit coverage by catching these std::nullopt return values and classifying them; it'd be harder to do if there are "valid" std::nullopt results.

This also implies that std::nullopt will never indicate that "the strategy isn't appropriate for this fixable pattern". The gadget will need a different channel to communicate that. But then I realize that this same gadget is quite fixable for eg. iterator strategy (which can deal with negative values easily). So if we add this check to the matcher, we'll have to make a separate gadget for potentially negative values. Which may start conflicting with this gadget because the new gadget will also happily accept positive values. So I guess a better thing to do would be to make the "different channel" more flexible, so that it could say "Well, I matched, but I'm not going to do *this* strategy because I'm negative, please try a different strategy". In this case we'll move the check from the matcher to the strategy feedback method.

So, yeah, there's some room for design discussions, but as of today we probably don't care.

647

This needs some clang-format.

clang/test/SemaCXX/warn-unsafe-buffer-usage-fixits-deref-simple-ptr-arith.cpp
77
jkorous added inline comments.Jan 30 2023, 3:32 PM
clang/lib/Analysis/UnsafeBufferUsage.cpp
633–635

Oh! This is interesting!

Let me start by saying that I agree that at some point we should discuss the contract of this part of the machinery. I actually would be quite surprised if we won't have more broader design discussion once we look into things like more advanced type replacement strategies, etc. in next couple months. That makes me want to stick to our current design for a bit longer because despite its limitations it will allow us to still get more data that we can use to plan the next iteration better.

Now, about the current design. My understanding was that the AST pattern defines "area of authority" for a particular Fixable and returning std::nullopt from getFixits method just means that a given strategy can't be achieved (for whatever reason).
Since we expect to have a pretty diverse set of strategies (e. g. array, vector, span, span_iterator, ...) I implicitly assumed that it is almost inevitable that some Fixables won't be able to provide a Fix-It for each and every strategy and thought that's how the design is intended to work. But I guess we never really discussed these details because we didn't have a specific example and we all just assumed something - not necessarily the same thing though :)

Related to how do we learn about what hasn't been implemented - I actually started imagining we might use some logging feature (possibly only for the debug mode) which would tell us why are fixits not emitted. But since it is just a solution and not necessarily the optimal one I'd also prefer to first finish our initial batch of Fixable and go through the exercise of verifying them against real codebases before we decide on "if && what" is necessary.

Anyways, would you find it reasonable to add a FIXME here and/or to the FixableGadget::getFixits() declaration but keep using std::nullopt for both cases described above for now?

ziqingluo-90 added inline comments.Feb 9 2023, 3:36 PM
clang/lib/Analysis/UnsafeBufferUsage.cpp
568

The parameter Loc is always the begin location of some token. So maybe we could instead use a for-loop over the token length to avoid any risk of looping forever?

645

I wonder if this would be an at-least-not-worse fix-it: replacing *(pointer + 123) with pointer[123] in one step? I think it could reduce the whitespace problem.

NoQ added inline comments.Feb 16 2023, 6:11 PM
clang/lib/Analysis/UnsafeBufferUsage.cpp
569

Looks like this will crash with assertion failure if Loc is at the beginning of the file. Probably not a real issue but might be worth checking.

633–635

Ok sure let's add a FIXME and handle this later!

ziqingluo-90 added inline comments.Feb 21 2023, 6:14 PM
clang/lib/Analysis/UnsafeBufferUsage.cpp
648

Just realized that we can achieve the same goal without using getBeginOfPrecHWSpace.

649
ziqingluo-90 commandeered this revision.Feb 22 2023, 1:26 PM
ziqingluo-90 edited reviewers, added: jkorous; removed: ziqingluo-90.

Taking care of this revision on behalf of @jkorous

Address some of the comments.
Add handling for both *(ptr + n) and *(n + ptr).
Add handling for *((..(ptr + n) .. )
Remove the white space handling

ziqingluo-90 retitled this revision from [-Wunsafe-buffer-usage][WIP] Add Fixable for dereference of simple ptr arithmetic to [-Wunsafe-buffer-usage] Add Fixable for dereference of simple ptr arithmetic.Feb 22 2023, 2:36 PM
ziqingluo-90 edited the summary of this revision. (Show Details)
NoQ added inline comments.Feb 22 2023, 5:17 PM
clang/lib/Analysis/UnsafeBufferUsage.cpp
868–869

You can combine dyn_cast() with assertion by using cast() instead. But in any case, I'm not sure the assertion is actually correct here (see also), I think it's a good idea to add a test case for BindingDecl here as well.

Looking over this patch, I'm a little concerned that you seem to have re-invented the infrastructure of RewriteRules from Transformer. Not the details of the patch itself, but the infrastructure its built on -- the FixableGadget and FixitLists all seem very similar. While you're welcome to do as you please, I imagine the project would be best off if there was a common infrastructure. Have you taken a look at clang::transformer? If so, is there a reason that it doesn't meet your needs? https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Tooling/Transformer/RewriteRule.h is the starting point, if you're not familiar with it. The doc page is here: https://intel.github.io/llvm-docs/clang/ClangTransformerTutorial.html.

NoQ added a comment.EditedFeb 23 2023, 4:54 PM

@ymandel I think we should definitely try it.

I don't think our FixableGadgets correspond nicely to RewriteRules. Our story is that we gather *global* understanding of the entire function by collecting gadgets inside it, which is then used for discovering the best "strategy" for transforming the function, which may activate certain (not necessarily all) fixable gadgets in a specific manner. For instance, once D143133 gets fully developed, it'll involve fixpoint iteration over all fixables to discover variables that need to be transformed simultaneously. So I suspect that RewriteRule is too high-level / restrictive for our purposes.

But I do think that we can use the EditGenerator infrastructure to greatly simplify fixit generation once we figure out the strategy. So we can have something like

class FixableGadget {
  virtual EditGenerator generator(Strategy &S) = 0;

  FixItList getFixits(Strategy &S) { // non-virtual!
    return generator(S)(MatchResult).somehowFlattenAllTheseFixits();
  }
};

class DerefSimplePtrArithFixableGadget : public FixableGadget {
  static constexpr const char *const AddOpTag = "AddOp";
  static constexpr const char *const OffsetTag = "Offset";

  static StmtMatcher matcher() { /* matcher that makes all these bindings */ }

  EditGenerator generator(Strategy &S) override {
    return changeTo(node(AddOpTag), cat(node(BaseDRETag), "[", node(OffsetTag), "]"));
  }
};

which is much simpler than what we have. So I agree that we should consider this, at least for new code.

Things I don't immediately understand, which we'll need to figure out:

  • How do we preserve MatchResult long enough? It needs to be kept alive while we build the Strategy object, so that later we didn't have to rerun the matchers for fixit purposes.
  • We're somewhat conscious about minimizing our fixit's source ranges. In particular, we really need the fixit replacement range to not overlap with the OffsetTag range in this example. This is because the offset may contain other fixable operations inside it! So we won't be happy if the fixit produced by the generator throws out the entire expression and replaces it with concatenated text, we need something more targeted.
NoQ added a comment.Feb 23 2023, 5:16 PM

How do we preserve MatchResult long enough?

(it looks easily copyable as it's basically a std::map<std::string, Stmt *>; but it also looks like nobody tried this before, so we'll have to define a copy constructor, or like a .clone() method if we're worried about accidental copies)

NoQ added a comment.Feb 23 2023, 5:51 PM

It also sounds like this is going to be the first use of libClangTransformers in clang proper, so we'll have to link the clang binary to it.

NoQ added a comment.Feb 23 2023, 5:56 PM

Also, @ymandel would you mind if we commit this patch as-is and experiment with Transformers in a follow-up patch? 'Cause we have like 6 patches waiting for this one to land, and we'd rather avoid large-scale rebasing as we fight to keep our backlogs short.

Very sorry for the delayed response! Also, feel free to move this discussion to some other forum. Responses inline:

@ymandel I think we should definitely try it.

I don't think our FixableGadgets correspond nicely to RewriteRules. Our story is that we gather *global* understanding of the entire function by collecting gadgets inside it, which is then used for discovering the best "strategy" for transforming the function, which may activate certain (not necessarily all) fixable gadgets in a specific manner. For instance, once D143133 gets fully developed, it'll involve fixpoint iteration over all fixables to discover variables that need to be transformed simultaneously. So I suspect that RewriteRule is too high-level / restrictive for our purposes.

This makes a lot of sense. We've actually moved in the same direction ourself, away from one-shot rules (though we still use them for simpler tasks) towards analysis (or even, an analysis pipeline) followed by application of edits. I think the RewriteRule concept needs to expand to encompass this approach and would hope that feedback from your use case could help drive that. In the meantime, we've found taken two approaches:

  1. Embed the analysis in the matchers and thread that state to the edit generator. So, the rule looks something like makeRule(matcherThatAnchorsAnalysis(SharedState), generatorFromState(SharedState)). Sometimes the generator is a custom function, but we can often use the standard combinators.
  2. Generate _metadata_ from the rewrite rule, and don't generate edits. Then, followup passes can consume the metadata and turn them into edits (using EditGenerators or otherwise). The metadata can also just bundle edits directly, but in a form suited to followup processing rather than the standard output from rules that's intended for direct application.

Regardless, I'd love to work together to find a format that works for these use cases.

But I do think that we can use the EditGenerator infrastructure to greatly simplify fixit generation once we figure out the strategy. So we can have something like

class FixableGadget {
  virtual EditGenerator generator(Strategy &S) = 0;

  FixItList getFixits(Strategy &S) { // non-virtual!
    return generator(S)(MatchResult).somehowFlattenAllTheseFixits();
  }
};

class DerefSimplePtrArithFixableGadget : public FixableGadget {
  static constexpr const char *const AddOpTag = "AddOp";
  static constexpr const char *const OffsetTag = "Offset";

  static StmtMatcher matcher() { /* matcher that makes all these bindings */ }

  EditGenerator generator(Strategy &S) override {
    return changeTo(node(AddOpTag), cat(node(BaseDRETag), "[", node(OffsetTag), "]"));
  }
};

which is much simpler than what we have. So I agree that we should consider this, at least for new code.

Things I don't immediately understand, which we'll need to figure out:

  • How do we preserve MatchResult long enough? It needs to be kept alive while we build the Strategy object, so that later we didn't have to rerun the matchers for fixit purposes.

I'm not sure I understand the issue, can you expand? I'd sooner expect that you would eagerly generate the Edits, rather than building up EditGenerators and then you don't need to keep around the MatchResult.

  • We're somewhat conscious about minimizing our fixit's source ranges. In particular, we really need the fixit replacement range to not overlap with the OffsetTag range in this example. This is because the offset may contain other fixable operations inside it! So we won't be happy if the fixit produced by the generator throws out the entire expression and replaces it with concatenated text, we need something more targeted.

Yes, this is a common concern. We tend to apply targeted fixes where possible, using the combinators to correctly select the parts of the syntax to target. Ideally, the framework could do that for you -- that is, you could do larger edit for clarity and it would diff only produce the smaller edit -- but that's "future work" with no plans at this point.

NoQ added a comment.Mar 7 2023, 5:27 PM

@ymandel I think we should definitely try it.

I don't think our FixableGadgets correspond nicely to RewriteRules. Our story is that we gather *global* understanding of the entire function by collecting gadgets inside it, which is then used for discovering the best "strategy" for transforming the function, which may activate certain (not necessarily all) fixable gadgets in a specific manner. For instance, once D143133 gets fully developed, it'll involve fixpoint iteration over all fixables to discover variables that need to be transformed simultaneously. So I suspect that RewriteRule is too high-level / restrictive for our purposes.

This makes a lot of sense. We've actually moved in the same direction ourself, away from one-shot rules (though we still use them for simpler tasks) towards analysis (or even, an analysis pipeline) followed by application of edits. I think the RewriteRule concept needs to expand to encompass this approach and would hope that feedback from your use case could help drive that. In the meantime, we've found taken two approaches:

  1. Embed the analysis in the matchers and thread that state to the edit generator. So, the rule looks something like makeRule(matcherThatAnchorsAnalysis(SharedState), generatorFromState(SharedState)). Sometimes the generator is a custom function, but we can often use the standard combinators.
  2. Generate _metadata_ from the rewrite rule, and don't generate edits. Then, followup passes can consume the metadata and turn them into edits (using EditGenerators or otherwise). The metadata can also just bundle edits directly, but in a form suited to followup processing rather than the standard output from rules that's intended for direct application.

Regardless, I'd love to work together to find a format that works for these use cases.

Yeah this is a really interesting subject, I'd love to have a bigger conversation about it. We've got a lot of very interesting examples of after-the-fact interactions, and a lot of them seem to be specific to our work. A lot of them stem from the fact that we aren't settling for any fix that would compile and silence the warning, but we aspire to generate code with the maximum security benefit. And when we cannot, we find it better to abandon the entire fixit and recommend manual transformation, than to recommend a less-than-ideal automatic solution. Because the less-than-ideal solution would silence the warning, and the code won't be revisited for later security improvements or flagged for future security audit. So it's better to have our tool teach the developer by example how to write better code whenever it can, and offer the developer a chance to demonstrate what they learned when it cannot find a good solution automatically.

The ideal solution often involves substituting multiple variables at a time, so that, say, the span object didn't need to be unpacked and repacked. In fact, because span size is a thing (but pointer size isn't), there's no easy analysis that would help us eliminate span-repacking code after the fact! Partially spanified code is much harder to analyze than fully unspanified code because there's now this entire extra aspect of the program's behavior (span sizes) that we need to either preserve or prove away as irrelevant. Whereas in the original unspanified program it's obviously irrelevant from the start. So even though we want to successfully analyze partially spanified code in some cases, it's clearly a much harder problem and ideally we don't want to block the tool on solving it, so instead we're trying to make larger leaps from unspanified code directly to ideal code.

(This sounds as if it relates to our discussion in D143128 but it really doesn't, that patch's dilemma was about two fixes that are both equally (un-)ideal, and it's probably easy to transform one into another "after the fact".)

  • How do we preserve MatchResult long enough? It needs to be kept alive while we build the Strategy object, so that later we didn't have to rerun the matchers for fixit purposes.

I'm not sure I understand the issue, can you expand? I'd sooner expect that you would eagerly generate the Edits, rather than building up EditGenerators and then you don't need to keep around the MatchResult.

I suspect that as our machine grows larger, that'd cause us to eagerly generate around 5-10x more edits than we ever emit. Just because a code pattern is fixable, doesn't mean we want to fix it. Even if we do, there's also like five different ways to fix it (encoded in our Strategy class, currently we implement just one way: "span"). And answers to these questions - whether we want to fix the pattern, and how - depend not only on this pattern, but also on properties of the space of all patterns discovered in the function. This isn't something we can decide upon in the constructor of the fixable pattern, when half of patterns aren't even constructed yet. We either need to see all MatchResults together before emitting any edits, or ~90% of our edits go to waste.

  • We're somewhat conscious about minimizing our fixit's source ranges. In particular, we really need the fixit replacement range to not overlap with the OffsetTag range in this example. This is because the offset may contain other fixable operations inside it! So we won't be happy if the fixit produced by the generator throws out the entire expression and replaces it with concatenated text, we need something more targeted.

Yes, this is a common concern. We tend to apply targeted fixes where possible, using the combinators to correctly select the parts of the syntax to target. Ideally, the framework could do that for you -- that is, you could do larger edit for clarity and it would diff only produce the smaller edit -- but that's "future work" with no plans at this point.

That's another fascinating subject.

We have a weak signal that it might not even be the right thing to do. Say, if you're replacing a function call foo with foo2, it's arguably better to have the fixit look like this:

foo();
^~~
foo2

than to have it look like this:

foo();
   ^
   2

The second solution may be alright in some cases (especially when you hardcode the names of the function in the compiler, so you know for sure that "adding a suffix to the function" is an accurate way to describe the fix from user's point of view) but in the general case (say, when you discover functions through attributes) the second solution may be very surprising. But a purely textual "fixit minimizer" machine will not be able to tell the difference and recognize that the first approach is more appropriate in this case.

It could also be bad in the same way as the command-line diff tool sometimes completely misunderstands the nature of the diff (say, when it tells us that we've added } foo() { instead of foo() {}). So I think there's a few good reasons to retain full control over the shape of the edits.

ziqingluo-90 added inline comments.Mar 9 2023, 5:15 PM
clang/lib/Analysis/UnsafeBufferUsage.cpp
868–869

Good point! Actually an example with BindingDecls could break our code. The Tracker collects all declRefExpr(to(varDecl())) but most of the Gadget matchers simply look for declRefExpr()s. If a Gadget is associated to a DRE to a BindingDecl, it has an unclaimed DRE.

We probably need another patch to add to(varDecl()) to places where its missing and add proper tests.

We agreed with @ymandel to have a separate discussion about how can Safe Buffers Fix-Its benefit from using libClangTransformers .
Our tentative plan is to it for new code that we write and eventually change all our code to use it.

NoQ added a comment.Mar 16 2023, 1:51 PM

Other than the to(varDecl()) discussion, LGTM!

clang/lib/Analysis/UnsafeBufferUsage.cpp
868–869

Aha, sure, we probably need to fix all places, but I also don't want this patch to introduce a new source of crashes, where other gadgets already act defensively. So let's at least replace assert with an early return before we land this patch?

Address comments

ziqingluo-90 marked 2 inline comments as done.Mar 16 2023, 2:49 PM
jkorous accepted this revision.Mar 20 2023, 2:41 PM

LGTM

This revision is now accepted and ready to land.Mar 20 2023, 2:41 PM
NoQ accepted this revision.Mar 20 2023, 2:48 PM
This revision was landed with ongoing or failed builds.Mar 20 2023, 5:07 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptMar 20 2023, 5:07 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript