This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] Support tainted objects in GenericTaintChecker
Needs ReviewPublic

Authored by boga95 on Dec 15 2019, 5:20 AM.

Details

Summary

I extended the supported C++ features:

  • The this pointer can be tainted (0. argument)
  • All std::basic_istream objects are tainted unconditionally (std::cin, std::ifstream, etc.)
  • std::getline and some member function of std::string propagates taint
  • Extraction operator and copy assignment propagate taint

Diff Detail

Event Timeline

boga95 created this revision.Dec 15 2019, 5:20 AM

I strongly agree with this comment: D70878#1780513, maybe the placement of functions like getArg and getNumArgs would be most appropriate in CallDescription. How about we try to cut down on duplicating functionalities?

I realize that many, if not all of my comments should've been placed at D70878, I was unfortunately not available then, but I think it would be a lot better if we settled on this before making the eventual changes too hard to switch. Checkers that were written with CallDescriptionMap (D70818, D63915) or have recently been converted to it (D67706, D62557, D68165) are a joy to look at, and reduce the overall complexity of the codebase greatly. Are there strong arguments against it?

The overall direction is great, as demonstrated by the test files. Its very exciting to see taintness analysis improving this much lately!

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
105–135

I know this isn't really relevant, but isn't this redundant with CallDescription?

142

Extraction operator? Is that a thing?

204

http://llvm.org/docs/ProgrammersManual.html#other-map-like-container-options

We never use hash_set and unordered_set because they are generally very expensive (each insertion requires a malloc) and very non-portable.

369

I would be shocked if this is the first time I've come across very similar function -- in any case, could you rename it to something like getArgIgnoringImplicitThis?

809

In this case, maybe llvm::Regex is overkill? [[https://llvm.org/doxygen/classllvm_1_1StringRef.html#a82369bea2700347f68e1f43e30d2d47b | llvm::StringRef::find()]] may be sufficient.

xazax.hun added inline comments.Feb 6 2020, 3:47 PM
clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
270–271

If we consider Stdin and Stdstream to be tainted does it make sense to fold them into isTainted so we never miss checking for them?

A portion of my concerns are answered by this patch: D72035.

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
105–135

Ah, okay, so CallDescription doesn't really have the FunctionDecl, but this still feels like a duplication of functionalities.

boga95 updated this revision to Diff 246120.Feb 23 2020, 12:50 PM
boga95 marked 5 inline comments as done.
Szelethus requested changes to this revision.Feb 24 2020, 7:45 AM
Szelethus added a reviewer: steakhal.

This patch is really cool, but I still feel anxious a bit about duplicating so much functionality, especially since we're working very hard to make CallEvent a widespread thing. @steakhal's patch D72035 already made some great progress in this direction, and should land before this one. @NoQ, do you agree that we should maybe stop for a bit before making the code a bit more consistent with the rest of the checkers?

I added @steakhal as reviewer if you don't mind.

This revision now requires changes to proceed.Feb 24 2020, 7:45 AM
boga95 marked 6 inline comments as done.Feb 24 2020, 9:10 AM

@steakhal's revision is on the top of this. Changing the order will only cause unnecessary work on both sides.

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
105–135

We have already thought about that and it is on our TODO list.

142

I can call it operator>> if you think that is better.

204

I didn't find any multimap among the alternatives. I think the performance won't be an issue here, because the elements are inserted at the beginning of the analysis if there is any. Otherwise, we are planning to replace it with CallDescriptionMap.

270–271

Then we have to pass the CheckerContext to the isTainted functions. I think this function wraps it well.

@steakhal's revision is on the top of this. Changing the order will only cause unnecessary work on both sides.

He recently rebased on top of master. I'm no fan of creating unnecessary work for either of you, and it might just be the case that my worry is greater then necessary. But for now, landing more patches with this infrastructure and having to painstakingly redo them all sounds like more work.

It'd be best to wait for @NoQ's feedback, after all, he did approve earlier patches. And again, sorry for not being available earlier.

balazske added inline comments.
clang/test/Analysis/taint-generic.cpp
191

These std declarations are at a better place in system-header-simulator-cxx.h or a similar file.

@steakhal's revision is on the top of this. Changing the order will only cause unnecessary work on both sides.

I would happily rebase this patch if you want.

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
142

I think extraction operator is the right term for this.
It is used in the standard: http://eel.is/c++draft/input.streams#istream.extractors

clang/test/Analysis/taint-generic.cpp
191

In the current form, it seems to be a bit verbose.
Why don't we create a minimal std::string which does not inherit from anything and implements the features and behavior only what is necessary.

After minimizing this class there would be no benefit moving to the system-header-simulator-cxx.h header.

@steakhal's revision is on the top of this. Changing the order will only cause unnecessary work on both sides.

I would happily rebase this patch if you want.

Here is the rebased version of your patch.
https://github.com/steakhal/llvm-project/compare/use-callevent...steakhal:boga-rebased-tainted-objects-D71524

Minor modifications applied:

  • clang-formatted every modified lines
  • simplified the getArgWithImplicitThis function body
  • simplified the getNumArgsWithImplicitThis function body
  • omitting the spelling of the TaintPropagationRule type name where obvious
boga95 updated this revision to Diff 249019.Mar 8 2020, 3:01 PM
boga95 marked 2 inline comments as done.

Rebase to master.

So, as far as I understand, your thinking here is that CXXOperatorCallExprs (which are global, non-member operators) when they are >>, they will propagate taintedness from the 0th argument to the 1st and the return value, correct? So, if I do this:

struct Panda {};

// Turn integers into pandas!
bool operator>>(int i, Panda p) {
  return printf("Panda integer party!");
}

// Turn characters into pandas!
bool operator>>(char i, Panda p) {
  return printf("Panda character party!");
}

int main() {
  Panda p1, p2;
  bool success1 = getTaintedInt() >> p1;
  bool success2 = getTaintedChar() >> p2;
  // success1, success2, p1, p2 are all tainted
}

Are we sure this is what we want? If this is a heuristic, we should document it well, and even then I'm not sure whether we want it. I'm also pretty sure this would make the eventual change to CallDescriptionMap more difficult, because the way taintedness is propagated around std::basic_istream not really the property of the global >> operator and what its parameters are, but rather the property of std::basic_istream<CharT,Traits>::operator>>, right? What do you think?

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
290–294

Hmm, is this the appropriate place to put these? It seems like this job is handled in getTaintPropagationRule. I thought CustomPropagations are reserved for the config file.

582

As far as I know, Call.getOriginExpr() may be null, we should probably use dyn_cast_or_null.

NoQ added inline comments.Mar 15 2020, 9:15 PM
clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
290

So 0 stands for this? Can we have a named constant please? ^.^

clang/lib/StaticAnalyzer/Checkers/Taint.cpp
155–162

Aha, ok, so this is your plan: only ever deal with fully conjured objects.

I suspect that it'll fail when the function that produces the object is fully inlined. It's probably unlikely to happen with things like std::cin but it's pretty likely to happen with custom taint sources defined by the user. Once you want to fix this, you'll probably have to iterate through all direct bindings in the structure and mark them tainted as well.

clang/test/Analysis/taint-generic.cpp
131

Let's add a header simulator for this if we didn't already.

Szelethus added inline comments.Mar 16 2020, 4:01 AM
clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
142

Well, I consider myself proven wrong :)

Are we sure this is what we want? If this is a heuristic, we should document it well, and even then I'm not sure whether we want it. I'm also pretty sure this would make the eventual change to CallDescriptionMap more difficult, because the way taintedness is propagated around std::basic_istream not really the property of the global >> operator and what its parameters are, but rather the property of std::basic_istream<CharT,Traits>::operator>>, right? What do you think?

I think CallDescription can only identify objects/functions which has IdefntifyerInfo in them. AFAIK operators don't have such. Though somehow AST matchers of Clang Tidy were triggered with this: functionDecl(hasName("operator>>"))
I'm afraid it needs to be a different patch to replace with CallDescriptionMap, even though I agree with you.

I think CallDescription can only identify objects/functions which has IdefntifyerInfo in them. AFAIK operators don't have such. Though somehow AST matchers of Clang Tidy were triggered with this: functionDecl(hasName("operator>>"))

Yup, we could make improvements on CallDescription as well :)

I'm afraid it needs to be a different patch to replace with CallDescriptionMap, even though I agree with you.

I don't mean to deploy the map in this patch, but if this lands as-is, the proposed logic will have to be revisited again.

boga95 updated this revision to Diff 256839.Apr 12 2020, 12:47 AM
boga95 marked 9 inline comments as done.Sep 20 2020, 1:44 PM

Ping

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
290–294

We are planning to move all of the propagation rules into a configuration file.

I figured you're still working on this, sorry! I'd really like to chat about my earlier comment D71524#1917251, as it kind of challenges the high level idea.

I figured you're still working on this, sorry! I'd really like to chat about my earlier comment D71524#1917251, as it kind of challenges the high level idea.

What about marking the std::cin object itself as tainted and any object created by ifstream::ifstream(const char*) or similar functions.
Then propagate taint via the extraction operator (operator>>) only if the stream was tainted.
This way we could reduce the false-positives of this crude heuristic. What do you think?

boga95 marked an inline comment as done.Sep 30 2020, 2:29 PM

I figured you're still working on this, sorry! I'd really like to chat about my earlier comment D71524#1917251, as it kind of challenges the high level idea.

What about marking the std::cin object itself as tainted and any object created by ifstream::ifstream(const char*) or similar functions.
Then propagate taint via the extraction operator (operator>>) only if the stream was tainted.
This way we could reduce the false-positives of this crude heuristic. What do you think?

As far as I remember I tried to make std::cin tainted, but it was complicated. I run the checker against many projects and there wasn't any false positive related to this heuristic.
We can restrict the operator>> to std::basic_stream and cover only the standard library. I think most of the programmers will use this in a conventional way, therefore it should work for their implementation too.

As far as I remember I tried to make std::cin tainted, but it was complicated.

What sort of issues did you find by implementing that approach? Could you elaborate?

I run the checker against many projects and there wasn't any false positive related to this heuristic.

It seems a bit hand-wavy to me.

We can restrict the operator>> to std::basic_stream and cover only the standard library.

If we choose this approach, then we have to do something about this, yes. This could be a reasonable choice.

We are arguing about whether the operator>> is a propagation or a source function.
IMO this operator should propagate taint, from the stream to an object.

I've opened the cppreference to find an example where your logic would introduce serious false positives.
I've spent like 2 minutes to come up with this, there might be many more:

std::istringstream is("hello, world");
char arr[10];
is >> std::setw(6) >> arr;
std::cout << "Input from \"" << is.str() << "\" with setw(6) gave \""
          << arr << "\"\n";

godbolt

You would deliberately mark arr as a tainted.
Fortunately/unfortunately it's quite hard to remove taint - so any unnecessarily tainted value is a serious problem, thus requires careful design decisions.

This example might seem to be a made-up one, however, we should consider developers who want to extract data from streams - which aren't tainted.
Marking the stream itself tainted, and propagating taint over the extraction operator would not have such false positives.