This is an archive of the discontinued LLVM Phabricator instance.

[Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place
ClosedPublic

Authored by dkrupp on Feb 17 2023, 7:36 AM.

Details

Summary

This patch improves the diagnostics of the alpha.security.taint.TaintPropagation and taint related checkers by showing the "Taint originated here" note at the correct place, where the attacker may inject it. This greatly improves the understandability of the taint reports.

Taint Analysis: The attacker injects the malicious data at the taint source (e.g. getenv() call) which is then propagated and used at taint sink (e.g. exec() call) causing a security vulnerability (e.g. shell injection vulnerability), without data sanitation.

The goal of the checker is to discover and show to the user these potential taint source, sink pairs and the propagation call chain.

In the baseline the taint source was pointing to an invalid location, typically somewhere between the real taint source and sink.

After the fix, the "Taint originated here" tag is correctly shown at the taint source. This is the function call where the attacker can inject a malicious data (e.g. reading from environment variable, reading from file, reading from standard input etc.).

Before the patch the clang static analyzer puts the taint origin note wrongly to the strtol(..) call.

int main(){
  char *pathbuf;
  char *user_data=getenv("USER_INPUT"); 
  char *end;  
  long size=strtol(user_data, &end, 10); // note: Taint originated here. 
  if (size > 0){
    pathbuf=(char*) malloc(size+1);//note: Untrusted data is used to specify the buffer size ...
    // ... 
    free(pathbuf);
  }
  return 0;
}

After the fix, the taint origin point is correctly annotated at getenv() where the attacker really injects the value.

int main(){
  char *pathbuf;
  char *user_data=getenv("USER_INPUT");  // note: Taint originated here. 
  char *end;
  long size=strtol(user_data, &end, 10); 
  if (size > 0){
    pathbuf=(char*) malloc(size+1);//note: Untrusted data is used to specify the buffer size ...
    // ... 
    free(pathbuf);
  }
  return 0;
}

The BugVisitor placing the note was wrongly going back only until introduction of the tainted SVal in the sink.

This patch removes the BugVisitor from the implmeentation and replaces it with 2 new NoteTags. One, in the taintOriginTrackerTag() prints the "taint originated here" Note and the other in taintPropagationExplainerTag() explaining how the taintedness is propagating from argument to argument or to the return value ("Taint propagated to the xth argument").
This implemetnation uses the interestingess bugReport utility to track back the tainted symbols through propagating function calls to the point where the taintedness was introduced by a source function call.

The checker which wishes to emit a Taint related diagnostic must use the categories::TaintedData BugType category and must mark the tainted symbols as interesting. Then the TaintPropagationChecker will automatically generate the "Taint originated here" and the "Taint propagated to..." diagnostic notes.

You can find the new improved reports

here

And the old reports (look out for "Taint originated here" notes. They are at the wrong place, close to the end of the reports)

here

A simple example report from curl:
Basline:


New:

Diff Detail

Event Timeline

dkrupp created this revision.Feb 17 2023, 7:36 AM
Herald added a project: Restricted Project. · View Herald Transcript
dkrupp requested review of this revision.Feb 17 2023, 7:36 AM
dkrupp edited the summary of this revision. (Show Details)Feb 17 2023, 7:37 AM
dkrupp edited the summary of this revision. (Show Details)
dkrupp edited the summary of this revision. (Show Details)Feb 17 2023, 7:44 AM
dkrupp updated this revision to Diff 498795.Feb 20 2023, 4:17 AM

Added documentation to the newly introduced types: TaintData, TaintBugReport.

I haven't checked the implementation, but fundamentally patching the TaintBugVisitor is not how we should improve the diagnostic for taint issues.
I saw that this patch is not about NoteTags, so I didn't go any further that point.

What we should do instead, to add a fancy NoteTags to each of the Post transitions to propagate interestingness to the taint sources.
Where each NoteTag does:

  • checks if any of the taint destinations are actually 'interesting', if none then just return an empty note.
  • take the taint source arguments and mark their pre-call values as interesting
  • construct a descriptive message explaining what happened:
    • If the transition had no taint sources, then it must be a "taint source"
    • If we had tainted sources, tell the user that X', Y', and Z' arguments were tainted, hence we propagated taint
    • take all the "interesting" taint destinations and tell the user that X, Y and Z arguments become tainted due to the propagation rule.

I'm attaching my proposed version for improving the diagnostics where I demonstrate all what I said.


Note that my patch is really crude, and I just finished hacking it to get all tests pass in a couple hours.

Let me know if it would be a good way to refine your patch or I should review your current implementation.

NoQ added a comment.Feb 21 2023, 4:16 PM

I completely agree with @steakhal, these should be note tags:

  • The "visitor way" is to reverse-engineer the exploded graph after the fact.
  • The "slightly more sophisticated visitor way" is have checker callbacks leave extra hints in the graph to assist reverse engineering, which is what you appear to be trying to do.
  • The "note tag" way is to simply capture that information from inside checker callbacks in the form of lambda captures. It eliminates the need to think about how to store the information in the state (it's stored in the program point instead), or how to structure it.

I also completely agree with @steakhal that the intermediate notes are valuable. In the motivating example, ideally both strtol and getenv need a note ("taint propagated here" and "taint originated here" respectively).

The challenging part with note tags is how do you figure out whether your bug report is taint-related. The traditional solution is to check the BugType but in this case an indeterminate amount of checkers may emit taint-related reports. I think now's a good time to include a "generic data map"-like data structure in PathSensitiveBugReport objects, so that checkers could put some data there during emitReport(), which can be picked up by note tags and potentially mutated in the process. For example, you can introduce a set of tracked tainted symbols there, which will be pre-populated by the checker with the final tainted symbol, then every time a note tag discovers that a symbol in the set becomes a target of taint propagation, it removes the symbol from the set and replaces it with the symbols from which its taint originated, so that later note tags would react on these new symbols instead.

@steakhal, @NoQ

thanks for your reviews.

Please note that I am not extending TaintBugVisitor. On the contrary I removed it.
Instead I use NoteTag to generate the "Taint Originated here" text (see GenericTaintChecker.cpp:156).

I can also add additional NoteTags for generating propagation messages "Taint was propagated here" easily.

The challenging part with note tags is how do you figure out whether your bug report is taint-related.

I solved this by checking if the report is an instance of TaintBugReport a new BugReport type, which should be used by all taint related reports (ArrayBoundCheckerV2 checker and divisionbyzero checker
was changed to use this new report type for taint related reports).

The tricky part was is to how to show only that single "Taint originated here" note tag at the taint source only which is relevant to the report. This is done by remembering the unique flowid in the
NoteTag in the forward analysis direction (see GenericTaintChecker:cpp:859) InjectionTag = createTaintOriginTag(C, TaintFlowId); and then filtering out the irrelevant
NoteTags when the the report is generated (when the lamdba callback is called). See that flowid which reaches the sink is backpropagated in the PathSensitiveBugreport (see GenericTaintCHekcer.cpp:167).

FlowIds are unique and increased at every taint source (GenericTaintChecker:869) and it is stored as an additional simple int in the program state along with the already existing (Taint.cpp:22) TaintTagType.

My fear with the interestingness is that it may propagating backwards according to different "rules" than whot the taintedness is popagating in the foward direction even with the "extensions" pointed out by steakhal.
So the two types of propagation may be or can get out of sync.

So if the above is not a concern and you think implementing this with interestingness is more elegant, idiomatic and causes less maintenance burden, I am happy to create an alternative patch with that solution.

Szelethus added a comment.EditedFeb 23 2023, 5:42 AM

The challenging part with note tags is how do you figure out whether your bug report is taint-related. The traditional solution is to check the BugType but in this case an indeterminate amount of checkers may emit taint-related reports.

Yeah, this is why we created a new type. Not sure what is the better infrastructure design, whether to create a subtype of BugType or BugReport, but it fundamentally achieves the same thing.

My fear with the interestingness is that it may propagating backwards according to different "rules" than whot the taintedness is popagating in the foward direction even with the "extensions" pointed out by steakhal.
So the two types of propagation may be or can get out of sync.

So if the above is not a concern and you think implementing this with interestingness is more elegant, idiomatic and causes less maintenance burden, I am happy to create an alternative patch with that solution.

@dkrupp and I discussed in detail whether to use FlowID's (what is currently implemented in the patch) or something similar, or reuse interestingness. Here's why we decided against reusing interestiness as is.

Interestingness, as it stands now, mostly expresses data-dependency, and is propageted with using the analyzers usualy somewhat conservative approach. While the length of a string is strictly speaking data dependent on the actual string, I don't think analyzer currently understand that. We approach taint very differently, and propagete it in some sense more liberally.

As I best recall, however, interestingness may be propagated through other means as well. If we reused interestingness, I fear that the interestiness set could be greater than the actual interesting tainted set, causing more notes to be emitted than needed.

For these reasons, which I admit are a result of some speculation, we concluded that interstingness as it is and taint are two different properties that are best separated.

I think now's a good time to include a "generic data map"-like data structure in PathSensitiveBugReport objects, so that checkers could put some data there during emitReport(), which can be picked up by note tags and potentially mutated in the process.

Maybe a new interestingness kind (D65723)? Not sure how this design aged, but we don't really need to store an ID for this, so a simple interestingness flag (just not the default Thorough interestiness) is good enough.

The tricky part was is to how to show only that single "Taint originated here" note tag at the taint source only which is relevant to the report. This is done by remembering the unique flowid in the
NoteTag in the forward analysis direction (see GenericTaintChecker:cpp:859) InjectionTag = createTaintOriginTag(C, TaintFlowId); and then filtering out the irrelevant
NoteTags when the the report is generated (when the lamdba callback is called). See that flowid which reaches the sink is backpropagated in the PathSensitiveBugreport (see GenericTaintCHekcer.cpp:167).

FlowIds are unique and increased at every taint source (GenericTaintChecker:869) and it is stored as an additional simple int in the program state along with the already existing (Taint.cpp:22) TaintTagType.

If you propagate this property during analysis, those IDs may be needed, but a simple flag should suffice when BugReporter does it.

NoQ added a comment.Feb 23 2023, 2:04 PM

Yeah looks like I replied without properly reading the patch.

TaintBugReport is brilliant and we already have a precedent for subclassing BugReport in another checker. However I'm somewhat worried that once we start doing more of this, we'll eventually end up with multiple inheritance situations when the report needs multiple kinds of information. So at a glance my approach with a "generic data map" in bugreport objects looks a bit more future-proof to me. Also a bit easier to set up, no need to deal with custom RTTI.

So I think interestingness is just an example of such "generic data" attached to bug report. Interestingness is also somewhat confusing, because indeed, there are existing interesting rules, and I don't think anybody remembers what they are or what was even the purpose of having interestingness in the first place. Interestingness is currently used for tracking symbols with trackExpressionValue(), and we have those tracking kinds added by @Szelethus to make tracking behave slightly differently. So, yeah, I think interestingness shouldn't be used; it's already in use. I think it should be generalized upon i.e. just let checkers track whatever/however they want.

I guess my main point is, there shouldn't be a need to assist tracking by adding extra information to the program state. Information in the state should ideally be "material" to program execution, "tangible", it has to describe something that's actually stored somewhere in memory (either by directly defining it, or by constraining it). In particular, if two nodes result in indistinguishable future behavior of the program, we're supposed to merge them; but any "immaterial" bits of information in the state will prevent that from happening.

In our case it should be enough to have the lambda for propagation method ask "Hey, is this freshly produced propagation target value relevant to this specific report?" and if yes, mark the corresponding propagation source value as relevant to the report as well; also emit a note and "consume" the mark on the target value. Such chain of local decisions can easily replace the global taint flow identifier, and it's more flexible because this way the flow doesn't need to be "linear", it may branch in various ways and that's ok.

TaintBugReport is brilliant and we already have a precedent for subclassing BugReport in another checker. However I'm somewhat worried that once we start doing more of this, we'll eventually end up with multiple inheritance situations when the report needs multiple kinds of information. So at a glance my approach with a "generic data map" in bugreport objects looks a bit more future-proof to me. Also a bit easier to set up, no need to deal with custom RTTI.

Adding a data map (like a string->sval map) to the PathSensitiveBugreport instead of relying on dynamic casting sounds an easy addition. I will update the patch with this. Or you specifically mean this kind of datamap ? typedef llvm::ImmutableMap<void*, void*> GenericDataMap; (ProgramState.h:74) I guess it should not be immutable…

I guess my main point is, there shouldn't be a need to assist tracking by adding extra information to the program state. Information in the state should ideally be "material" to program execution, "tangible", it has to describe something that's actually stored somewhere in memory (either by directly defining it, or by constraining it). In particular, if two nodes result in indistinguishable future behavior of the program, we're supposed to merge them; but any "immaterial" bits of information in the state will prevent that from happening.

@NoQ aha! Now I see where you are coming from! If an SVal is tainted on both analysis branches, but their taint flow value is different (meaning that they carry taint values from different taint sources), then they cannot be merged which causes inefficiency.
I understand the generic principle, but I wonder how frequent would that be in practice. I would think not too much, because taint sources are uncommon. Especially having multiple taint sources in the same source file/Translation Unit (only that creates different taint flow ids).

In our case it should be enough to have the lambda for propagation method ask "Hey, is this freshly produced propagation target value relevant to this specific report?" and if yes, mark the corresponding propagation source value as relevant to the report as well; also emit a note and "consume" the mark on the target value. Such chain of local decisions can easily replace the global taint flow identifier, and it's more flexible because this way the flow doesn't need to be "linear", it may branch in various ways and that's ok.

Taint propagation is not only handled in the GenericTaintChecker:892, where we calculate that the taintedness should propagate from function argument x to y or return value, but also it spreads in peculiar ways within expressions, from subregion to parent region etc. handled in the addtaint(..) and addPartialTaint(..) functions in Taint.cpp. What your proposed solution would essentially mean that we would need to implement the taint propagation in backward direction too. I think this design would be fragile and difficult to maintain (especially if taint propagation would change in the future).

I definitely don’t have the full picture here, so if you think that the sval backtracking is the better way, because of the potential performance penalty of the taintflow solution with the merges, I will go down that road and start to work on an alternative patch.

First and foremost, I want to deeply apologize about my rushed response. When I say that the Taint originated here note remained, I wrongly draw conclusions. Won't happen again.

Taint-flow IDs:
I would challenge that we need a flow ID (number) because for me the tainted symbol already has a unique ID which should be suitable for this purpose.
What I can see is that it would be useful to know directly what was the first tainted symbols of any given tainted symbol. (SymbolData -> SymbolData mapping)
This is pretty much analog with what you implemented by piggybacking on the taint kind. Although, I'd argue that having an explicit mapping would be cleaner, but I can see that it's more on personal preference.
In addition to that, I think there is value in minimizing the number of places where we introduce such static counters, so I would adwise against that unless we have a good reason for doing so.

I'm thinking that although it's handy to have the originally tainted symbol directly at the error-node at hand, I'm still not sure if that couldn't be calculated and tracked back to the place where we introduced taint.

int n;
scanf("%d", &n); // binds $1, not important
int v = mytaintsource(n); // taint originated here, returns $2
int z = taintprop(v); // taint propagated here, returns $3
int x = 42 / z; // 42 div $3

Soundness of the back-propagation:
The back-propagation is always in-sync given that the state-transition of introducing taint also attaches the NoteTag explaining what it did and why.
This basically means that after we call addTaint() we must also add a NoteTag when we call addTransition(). Under these circumstances I find it easier to argue that the back-propagation is consistent (sound). If a specific checker (other than the GenericTaintChecker) models taint, it should stick to the rules described and emit the right NoteTag. That way even downstream checkers would play nicely with the taint-tracking and benefit from it.

Interestingness:
The concerns seem valid about that the set of interesting symbols could be larger than the actually required (and desired) set. However, I wouldn't worried about this much unless we have an example on which we can continue discussing that this is a real concern.
What I can see is that as-of-now we use the interestingness notion for this, and deviating or introducing something else would introduce complexity. So, I'm not strictly against having something else, but I would lean towards using interestingness here as well, unless we have clear-cut examples demonstrating the need for something more sophisticated.

Finally, I'd like to thank you for investing your time into this subject. We really should have done it much earlier. Without these similar improvements like this one, it's just half way done. Thank you.

If we worry about having taint-related reports without a Note message explaining where the taint was introduced, we could just assert that in a BugReportVisitor at the finalizeVisitor() callback. I think such an assertion would make a lot of sense.
To achieve this, we could take the R.getNotes() and check if any of them refers to a specific one produced by the NoteTag callback for taint sources, let's say TaintSourceTag for that PathDiagnosticNotePiece.

void MyVisitor::finalizeVisitor(BugReporterContext &, const ExplodedNode *, PathSensitiveBugReport &R) {
  assert(llvm::any_of(R.getNotes(),
                      [](const auto &Piece) { return Piece->getTag() == TaintSourceTag; }) &&
         "Each taint report should have at least one taint-source");
}

With this assertion, we would gain confidence that the taint reports are complete, or at least they all have at least one taint source.

dkrupp planned changes to this revision.Feb 27 2023, 8:13 AM

@steakhal , @NoQ thanks for the reviews. I will try to implement an alternative solution based on your suggestions.

dkrupp updated this revision to Diff 510108.Mar 31 2023, 1:28 PM

This is a totally rewritten version of the patch which solely relies on the existing "interestingness" utility to track back the taint propagation. (And does not introduce a new FlowID in the ProgramState as requested in the reviews.)

-The new version also places a Note, when the taintedness is propagated to an argument or to a return value. So it should be easier for the user to follow how the taint information is spreading.
-"The taint originated here" is printed correctly at the taint source function, which introduces taintedness. (Main goal of this patch.)

Implementation:
-The createTaintPreTag() function places a NoteTag at the taint propagation function calls, if taintedness is propagated. Then at report creation, the tainted arguments are marked interesting if propagated taintedness is relevant for the bug report.

  • The isTainted() function is extended to return the actually tainted SymbolRef. This is important to be able to consistently mark relevant symbol interesting which carries the taintedness in a complex expression.

-createTaintPostTag(..) function places a NoteTag to the taint generating function calls to mark them interesting if they are relevant for a taintedness report. So if they propagated taintedness to interesting symbol(s).

The tests are passing and the reports on the open source projects are much better understandable than before (twin, tmux, curl):

https://codechecker-demo.eastus.cloudapp.azure.com/Default/reports?run=curl_curl-7_66_0_dkrupp_taint_origin_fix_new&run=tmux_2.6_dkrupp_taint_origin_fix_new&run=twin_v0.8.1_dkrupp_taint_origin_fix_new&is-unique=on&diff-type=New&checker-msg=%2auntrusted%2a&checker-msg=Out%20of%20bound%20memory%20access%20%28index%20is%20tainted%29

This is a totally rewritten version of the patch which solely relies on the existing "interestingness" utility to track back the taint propagation. (And does not introduce a new FlowID in the ProgramState as requested in the reviews.)

-The new version also places a Note, when the taintedness is propagated to an argument or to a return value. So it should be easier for the user to follow how the taint information is spreading.
-"The taint originated here" is printed correctly at the taint source function, which introduces taintedness. (Main goal of this patch.)

Implementation:
-The createTaintPreTag() function places a NoteTag at the taint propagation function calls, if taintedness is propagated. Then at report creation, the tainted arguments are marked interesting if propagated taintedness is relevant for the bug report.

  • The isTainted() function is extended to return the actually tainted SymbolRef. This is important to be able to consistently mark relevant symbol interesting which carries the taintedness in a complex expression.

So this is how you circumvent introducing "transitive interestingness", because now you know which symbol to track.

-createTaintPostTag(..) function places a NoteTag to the taint generating function calls to mark them interesting if they are relevant for a taintedness report. So if they propagated taintedness to interesting symbol(s).

The tests are passing and the reports on the open source projects are much better understandable than before (twin, tmux, curl):

https://codechecker-demo.eastus.cloudapp.azure.com/Default/reports?run=curl_curl-7_66_0_dkrupp_taint_origin_fix_new&run=tmux_2.6_dkrupp_taint_origin_fix_new&run=twin_v0.8.1_dkrupp_taint_origin_fix_new&is-unique=on&diff-type=New&checker-msg=%2auntrusted%2a&checker-msg=Out%20of%20bound%20memory%20access%20%28index%20is%20tainted%29

I've looked at the results you attached and they look good to me.

Do you have an example where two tainted values are contributing to the same bug? Something like:

n <- read();
m <- read();
malloc(n+m)

Will both of these values tracked back? What do the notes look like?


All in all, I'm pleased to see the improvements. It looks much better now IMO.
Using two NoteTags cleans up the implementation quite a bit, kudos.
I don't think there are major problems with this implementation, so I decided to spew your code with my nitpicks :D

Remember to clang-format your code. See clang/tools/clang-format/clang-format-diff.py.
And there are a few overloads of getNoteTag(), and there could be a better fit for usecases; you should decide.

I also find it difficult to track how a variable got tainted across assignments and computations like in this case.
This observation is completely orthogonal to your patch, I'm just noting it. it was bad previously as well.
Maybe we could have a visitor for explaining the taint tracking across assignments and computations to complement the NoteTag.

I hope I didn't miss much this time :D

clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
266–272
clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
48–49

It feels odd to have both const char* and a std::string on the same line.
Should we update const char* to a more sophisticated type?

I'm thinking of StringRef. It seems like that type should be used for the Category as well since the PathSensitiveBugReport constructor takes that, so we don't need to have an owning type here.

106–114
clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
139–140

I'm not sure about passing both the CheckerContext and the State.
The CheckerContext already encapsulates a State, which opens up the possibilities for misuse.

For example, the getPointeeOf() is called in this function, and that will eventually call C.getState() under the hood. So, for me it feels like a bad API design.
What we could do instead, is to pass an ASTContext and a State; resolving this discrepancy.

Could you please check if this is a real concern?

190

This function should be an implementation detail, as such I wonder if we should make it static.

How about naming this function differently?
I'm thinking of taintOriginTrackerTag, IDK.

194

If the Call parameter is only used for acquiring the LocationContext, wouldn't it be more descriptive to directly pass the LocationContext to the function instead?
I'm also puzzled that we use getCalleeStackFrame here. I rarely ever see this function, so I'm a bit worried if this pick was intentional. That we pass the 0 as the BlockCount argument only reinforces this instinct.

197

It should consume the TaintedSymbols and TaintedArgs variables, as such you should std::move out from the original parameters like this:

[TaintedSymbols = std::move(TaintedSymbols), TaintedArgs = std::move(TaintedArgs)](){...}
201

What does the first half of this condition guard against?
Do you have a test for it to demonstrate?

205–208
209–215
224

How about calling this taintPropagationExplainerTag?

229

Please assert that the size of TaintedSymbols must be the same as TaintedArgs. Same in the other function.

230
246

So, if TaintedSymbols.size() > 1, then the note message will look weird.
Could you please have a test for this?

868

I cannot see a test against the "Strange" string. Is this dead code?
Same for the other block.

911–914

I just want to get it confirmed that this hunk is unrelated to your change per-say. Is it?
BTW I don't mind this change being part of this patch, rather the opposite. Finally, we will have it.

932–950

I prefer to move declarations close to their uses.

993–996
1018–1022
clang/lib/StaticAnalyzer/Checkers/Taint.cpp
222–226

I'd probably swap the two branches, so that the index would be tracked if both the region and the index are tainted.
I also wonder if this edge-case could be tested at all.

306–311

What does TSR abbreviate? I would find TaintedSym more descriptive.

clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
244–252

Why don't we use a distinct BugType for this?

clang/test/Analysis/taint-diagnostic-visitor.c
47

I'd suggest to count from 1 instead of 0 when referring to Nth arguments.
You can also use the llvm::getOrdinalSuffix(N) to get nicer messages.
We already count from 1 in the std library checker.

dkrupp updated this revision to Diff 511078.EditedApr 5 2023, 6:27 AM
dkrupp marked 21 inline comments as done.

@steakhal thanks for your review. I tried to address all your concerns.
I added an extra test case too (multipleTaintSources(..)) which highlights the limitation of the current patch: If multiple tainted "variables" reach a sink, we only generate diagnostics for one of them. The main reason is that the isTainted() function returns a single tainted Symbolref instead of a vector of Symbolrefs if there are multiple instances.
I highlighted this in the test and the implementation.

I think this could be still an acceptable limitation for now, because as the user sanitizes one of the tainted variables, he will get a new diagnostics for the remaining one(s).

Should we address this limitation in follow-up patche(s) or here?

dkrupp added a comment.Apr 5 2023, 6:30 AM

All comments addressed. Thanks for the review @steakhal .

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
194

The call.getCalleeStackFrame(0) gets the location context of the actual call that we are analyzing (in the pre or postcall), and that's what we need to mark interesting. It is intentionally used like this. I changed the parameter to locationcontext as use suggested.

201

To only follow taint propagation to function calls which actually result in tainted variables used in the report and not every function which returns a tainted variable.

char* taintDiagnosticPropagation2() is such a test which is failing without this due to giving extra unrelated propagation notes.

246

Test added multipleTaintedArgs (). I could not provoke the multi-argument message as we only track-back one tainted symbol now.

868

It was a debugging code, which I removed. I noticed that in some cases (e.g. if the argument pointer is pointing to an unknown area) we don't get back a tainted symbol even though we call the addtaint on the arg/return value.

911–914

It is related in the sense that in isTainted() function call did not return a valid SymbolRef for stdin if we did not make the stdin tainted when we first see it. Caused testcase to fail as it was before. Now it is handled similarly to other tainted symbols.

clang/lib/StaticAnalyzer/Checkers/Taint.cpp
306–311

TSR = Tainted Symbol Ref

but I changed it as you suggested.

clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
244–252

You mean a new bug type instances? Would there be an advantage for that? Seemed to be simpler this way. To distinguish identify the tainted reports with the bug category.

Looks even better. Only minor concerns remained, mostly about style and suggestions of llvm utilities.

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
130–131

BTW I don't know but State->getStateManager().getContext() can give you an ASTContext. And we tend to not put const to variable declarations. See readability-avoid-const-params-in-decls

In other places we tend to refer to ASTContext by the ACtx I think.
We also prefer const refs over mutable refs. Is the mutable ref justified for this case?

157

My bad. In LLVM style we use UpperCamelCase for variable names.

171–173

Generally, in LLVM style we don't put braces to single block statements unless it would hurt readability, which I don't think applies here.

steakhal added inline comments.Apr 5 2023, 9:04 AM
clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
174–178

I was also bad with this recommendation.
I think we can now use structured bindings to get the index and value right there, like:
for (auto [Idx, Sym] : llvm::enumerate(TaintedSymbols))
See

194

Okay.

202

Same here about structured bindings.

208–213

I'd recommend using llvm::interleaveComma() in such cases.
You can probably get rid of nofTaintedArgs as well - by using this function.

210

For clang diagnostics we usually use ordinary suffixes like {st,nd,rd,th}. It would be nice to align with the rest of the clang diagnostics on this.
It would require a bit of work on the wording though, I admit.

212

I believe this branch is uncovered by tests.

220

I think since you explicitly specify the return type of the lambda, you could omit the spelling of std::string here.

863–864

We tend to fuse such declarations:
I've seen other cases like this elsewhere. Please check.

875–878

You could iterate over the symbol dependencies of the SymExpr (of the *V SVal).

SymbolRef PointeeAsSym = V->getAsSymbol();
// eee, can it be null? Sure it can. See isTainted(Region),... for those cases we would need to descend and check their symbol dependencies.
for (SymbolRef SubSym : llvm::make_range(PointeeAsSym->symbol_begin(), PointeeAsSym->symbol_end())) {
  // TODO: check each if it's also tainted, and update the `TaintedSymbols` accordingly, IDK.
}

Something like this should work for most cases (except when *V refers to a tainted region instead of a symbol), I think.

clang/lib/StaticAnalyzer/Checkers/Taint.cpp
278–282

Here we still have the TSR token.

clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
244–252

You mean a new bug type instances? Would there be an advantage for that? Seemed to be simpler this way. To distinguish identify the tainted reports with the bug category.

I never checked how BugTypes constitute to bugreport construction, but my gut instinct suggests that we should have two separate instances like we frequently do for other checkers.

I think as we converge, now it could be time to update the summary of the patch to reflect the current implementation. (e.g. flowids etc.)

dkrupp updated this revision to Diff 513556.Apr 14 2023, 5:57 AM
dkrupp marked 11 inline comments as done.
dkrupp edited the summary of this revision. (Show Details)

-All remarks from @steakhal was fixed. Thanks for the review!
-Now we can generate diagnostics for all tainted values when they reach a sink.

Se for example the following test case:

void multipleTaintedArgs(void) {
  int x,y;
  scanf("%d %d", &x, &y); // expected-note {{Taint originated here}}
                          // expected-note@-1 {{Taint propagated to the 2nd argument, 3rd argument}}
  int* ptr = (int*) malloc(x + y); // expected-warning {{Untrusted data is used to specify the buffer size}}
                                   // expected-note@-1{{Untrusted data is used to specify the buffer size}}
  free (ptr);
}

All remarks from @steakhal has been fixed. Thanks for the review.
This new version now can handle the tracking back of multiple symbols!

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
130–131

Thanks for the suggestion. I took out ASTContext from the signature.

208–213

I chose another solution. I hope that is ok too.

212

Now it is covered. See multipleTaintedSArgs(..) test in taint-diagnostic-visitor.c

220

not sure. Got a "cannot convert raw_svector_ostream::str() from llvm:StringRef" error.

875–878

I implememented a new function getTaintedSymbols(..) in Taint.cpp which returns all tainted symbols for a complex expr, SVal etc. With this addition, now we can track back multiple tainted symbols reaching a sink.

dkrupp edited the summary of this revision. (Show Details)Apr 15 2023, 1:10 AM

You can find the improved reports on tmux, postgres, twin, openssl here:

here

dkrupp edited the summary of this revision. (Show Details)Apr 15 2023, 1:13 AM
dkrupp edited the summary of this revision. (Show Details)Apr 15 2023, 1:16 AM
steakhal requested changes to this revision.Apr 19 2023, 2:42 AM

Nice improvement!
I only have minor nitpicks and some recommendations for the taint API.

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
205

We usually use the operator[] on vector instead of the at(). When in doubt, we assert that the index is in idx < size().
Apply this for all .at() uses.

208–213

Looks good. I'm not expecting many cases propagating to multiple arguments anyway.

clang/lib/StaticAnalyzer/Checkers/Taint.cpp
147–151

TBH I'm not sure if I like that now we allocate unbounded amount of times (because of getTaintedSymbols() is recursive and returns by value), where we previously did not.

What we could possibly do is to compute the elements of this sequence lazily.
I'm thinking of the llvm::mapped_iterator, but I'm not sure if it's possible to have something like that as a return type as it might encode the map function in the type or something like that.
Anyway, I'm just saying that it would be nice to not do more than it's necessary, and especially not allocate a lot of short-lived objects there.

Do you think there is a way to have the cake and eat it too?


I did some investigation, and one could get pretty far in the implementation, and maybe even complete it but it would be really complicated as of now. Maybe we could revisit this subject when we have coroutines.

So, I would suggest to have two sets of APIs:

  • the usual isTainted(.) -> bool
  • and a getTaintedSymbols(.) -> vector<Sym>

The important point would be that the isTainted() version would not eagerly collect all tainted sub-syms but return on finding the first one.
While, the getTaintedSymbols() would collect eagerly all of them, as its name suggests.

Imagine if getTaintedSymbolsImpl() had an extra flag like bool returnAfterFirstMatch. This way isTainted() can call it like that. While in the other case, the parameter would be false, and eagerly collect all symbols.

This is probably the best of both worlds, as it prevents isTainted from doing extra work and if we need to iterate over the tainted symbols, we always iterate over all of them, so doing it lazily wouldn't gain us much in that case anyway.
As a bonus, the user-facing API would be self-descriptive.

WDYT?

306–311

For such constructs, I would prefer this.

clang/test/Analysis/taint-diagnostic-visitor.c
12–13

The premerge bots are complaining about these two lines on Windows:

error: 'warning' diagnostics seen but not expected: 
  File C:\ws\w8\llvm-project\premerge-checks\clang\test\Analysis\taint-diagnostic-visitor.c Line 12: incompatible redeclaration of library function 'strlen'
  File C:\ws\w8\llvm-project\premerge-checks\clang\test\Analysis\taint-diagnostic-visitor.c Line 13: incompatible redeclaration of library function 'malloc'
error: 'note' diagnostics seen but not expected: 
  File C:\ws\w8\llvm-project\premerge-checks\clang\test\Analysis\taint-diagnostic-visitor.c Line 12: 'strlen' is a builtin with type 'unsigned long long (const char *)'
  File C:\ws\w8\llvm-project\premerge-checks\clang\test\Analysis\taint-diagnostic-visitor.c Line 13: 'malloc' is a builtin with type 'void *(unsigned long long)'
4 errors generated.

I think it's because size_t should be defined as unsigned long long on x86_64. This also means that you should pin the target to x86_64 to satisfy this test on all platforms.

53–67

I know this is subjective, but I'd suggest to reformat the tests to match LLVM style guidelines, unless the formatting is important for the test.
Consistency helps the reader and reviewer, as code and tests are read many more times than written.

This applies to the rest of the touched tests.

This revision now requires changes to proceed.Apr 19 2023, 2:42 AM
dkrupp updated this revision to Diff 514973.Apr 19 2023, 8:50 AM
dkrupp marked an inline comment as done.
  • Implemented early return in getTaintedSymbols() when it is called by isTainted() for efficiency
  • Fixed test incompatibility on Windows
dkrupp marked an inline comment as done.Apr 19 2023, 8:51 AM

@steakhal thanks for your review. All your remarks have been fixed.

clang/lib/StaticAnalyzer/Checkers/Taint.cpp
147–151

Good idea. I implemented the early return option in getTaintedSymbols(). This is used now by the isTainted() function.

dkrupp marked an inline comment as done.Apr 21 2023, 1:19 AM

@steakhal is there anything else to do before we merge this? Thanks.

steakhal requested changes to this revision.Apr 21 2023, 6:59 AM

I didn't go through the whole revision this time, but I think the next steps are already clear for the next round.
My impression was that I might not expressed my intent and expectations about the directions of the next step.
I hope I managed this time. Let me know if you have questions.

clang/include/clang/StaticAnalyzer/Checkers/Taint.h
82–103

The overloads having the extra ReturnFirstOnly parameter shouldn't be visible here in the header.
That is an implementation detail that no users should know about.
Note that having a single default argument overload potentially doubles the variations the user might need to keep in mind when choosing the right one. So there is value in simplicity.

clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
254–260
clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
62
108–109
clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
863–868
875–879

In these cases, the code would acquire all the tainted subsymbols, which then we throw away and keep only the first one.
This is why I suggested the approach I did I'm my last review.

936–940

Here getTaintedPointeeOrPointer would be called two times, unnecessarily.

947–948
1008–1010
1018–1022
clang/lib/StaticAnalyzer/Checkers/Taint.cpp
149–150

We usually pass booleans by "name".

316

If returnFirstOnly is true, this getTaintedSymbols() call would still eagerly (needlessly) collect all of the symbols.
I'd recommend propagating the returnFirstOnly parameter to the recursive calls to avoid this problem.
I also encourage you to make use of the llvm::append_range() whenever makes sense.

clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
230–232

Ah, how awesome it would be to have a markInteresting(llvm::ArrayRef<SymbolRef>) overload.

clang/test/Analysis/taint-diagnostic-visitor.c
53–67

Originally I meant this to the rest of the test cases you change or add part of this patch. I hope it clarifies.

This revision now requires changes to proceed.Apr 21 2023, 6:59 AM
dkrupp updated this revision to Diff 516077.Apr 22 2023, 8:32 AM
dkrupp marked 13 inline comments as done.

-getTaintedSymbols(.) -> getTaintedSymbolsImpl() proxy function introduced for interface safety
-Other minor fixes based on comments from @steakhal

@steakhal your comments are fixed. Thanks for the review.

clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
108–109

We cannot get rid off the getTaintedSymbols() call, as we need to pass all tainted symbols to reportTaintBug if we want to track back multiple variables. taintedSyms is a parameter of reportTaintBug(...)

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
863–868

I think this suggested solution would not be correct here, as ArgSym might not be the actual _tainted_ symbol (inside a more complex expression).

So I would prefer to leave it like this for correctness.

875–879

I think this suggested solution would not be correct here, as ArgSym might not be the actual _tainted_ symbol (inside a more complex expression).

So I would prefer to leave it like this for correctness.

clang/lib/StaticAnalyzer/Checkers/Taint.cpp
147–151

First I wanted to avoid the getTaintedSymbols()->getTaintedSymbolsImpl() proxy calls as it is too bloated IMHO.
But I see your point that it is safer. So I changed it.

316

You are perfectly right. I overlooked these calls and because of the the default parameter did got get a warning. now fixed.

To conclude the review, please respond to the "Not Done" inline comments, and mark them "Done" if you think they are resolved.
Thank you for your patience.

clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
108–109

Yes, makes sense. mb.
One more thing: if reportTaintBug() takes the taintedSyms vector "by-value", you should express your intent by std::move()-ing your collection expressing that it's meant to be consumed instead of taking a copy.
Otherwise, you could express this intent if the reportTaintBug() take a view type for the collection, such as llvm::ArrayRef<SymbolRef> - which would neither copy nor move from the callsite's vector, being more performant and expressive.

I get that this vector is small and bugreport construction will dominate the runtime anyway so I'm not complaining about this specific case, I'm just noting this for the next time. So, here I'm not expecting any actions.

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
875–879

Okay, I also checked out the code and verified this. Indeed we would have failing tests with my recommendation.
I still think it's suboptimal. This is somewhat related to the tainted API. It shouldn't make sense to have tainted regions in the first place, which bites here again. Let's keep it as-is, no actions are required.

920

This unused variable generates a compiler warning.

947–948

I observed you didn't take any action about this suggestion.
It leaves me wonder if this suggestion - in general - makes sense or if there are other reasons what I cannot foresee.
I've seen you using the fully spelled-out version in total 8 times.
Shouldn't we prefer the shorter, more expressive version instead?

994
clang/lib/StaticAnalyzer/Checkers/Taint.cpp
271

The second part of the conjunction should be tautologically true.

dkrupp updated this revision to Diff 516380.Apr 24 2023, 6:16 AM
dkrupp marked 10 inline comments as done.

-append_range(..) used instead of std::vector.insert(...) to improve readability
-minor updates based on @steakhal comments

dkrupp updated this revision to Diff 516389.Apr 24 2023, 6:45 AM
dkrupp marked an inline comment as done.

-using llvm::ArrayRef<SymbolRef> in the reportTaintBug(..) function in the DivZero Checker

@steakhal thanks for the review. I fixed all outstanding remarks.
I left the test taint-diagnostic-visitor.c formatting as is to remain consistent with the rest of the file. I think we should keep it as is, or reformat the whole file.

clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
108–109

Fixed as suggested. thanks.

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
947–948

Sorry I overlooked this comment. I like this shorter version. It is so much consize! Changed at all places. Thanks for the suggestion.

clang/test/Analysis/taint-diagnostic-visitor.c
53–67

I made some formatting changes you suggested, but
I would like to leave the //expected-note tags as they are now, because then it remains consistent with the rest of the test cases.

Would it be okay like this, or should I reformat the whole file (untouched parts too)?

steakhal accepted this revision.Apr 24 2023, 10:54 PM

LGTM.
About formatting the tests:
Personally, I would have preferred to "clean as you code", but I can see your point. Leave it as-is.
Land it, please.

This revision is now accepted and ready to land.Apr 24 2023, 10:54 PM