This is an archive of the discontinued LLVM Phabricator instance.

Paths

clang/include/clang/StaticAnalyzer/Checkers/Taint.h
clang/include/clang/StaticAnalyzer/Core/BugReporter/CommonBugCategories.h
clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
clang/lib/StaticAnalyzer/Checkers/Taint.cpp
clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
clang/lib/StaticAnalyzer/Core/CommonBugCategories.cpp
clang/test/Analysis/taint-diagnostic-visitor.c
clang/test/Analysis/taint-tester.c

Differential D144269

[Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place
ClosedPublic

Authored by dkrupp on Feb 17 2023, 7:36 AM.

Details

Reviewers

Szelethus
NoQ
steakhal
gamesh411

Summary

This patch improves the diagnostics of the alpha.security.taint.TaintPropagation and taint related checkers by showing the "Taint originated here" note at the correct place, where the attacker may inject it. This greatly improves the understandability of the taint reports.

Taint Analysis: The attacker injects malicious data at the taint source (e.g. a getenv() call), which is then propagated and used at a taint sink (e.g. an exec() call) without sanitization, causing a security vulnerability (e.g. a shell injection vulnerability).

The goal of the checker is to discover these potential taint source/sink pairs and show them to the user together with the propagation call chain.

In the baseline, the taint source note pointed to an invalid location, typically somewhere between the real taint source and the sink.

After the fix, the "Taint originated here" tag is correctly shown at the taint source. This is the function call where the attacker can inject malicious data (e.g. reading from an environment variable, reading from a file, reading from standard input, etc.).

Before the patch, the Clang Static Analyzer wrongly puts the taint origin note at the strtol(..) call.

#include <stdlib.h>

int main() {
  char *pathbuf;
  char *user_data = getenv("USER_INPUT");
  char *end;
  long size = strtol(user_data, &end, 10); // note: Taint originated here.
  if (size > 0) {
    pathbuf = (char *)malloc(size + 1); // note: Untrusted data is used to specify the buffer size ...
    // ...
    free(pathbuf);
  }
  return 0;
}

After the fix, the taint origin point is correctly annotated at getenv() where the attacker really injects the value.

#include <stdlib.h>

int main() {
  char *pathbuf;
  char *user_data = getenv("USER_INPUT"); // note: Taint originated here.
  char *end;
  long size = strtol(user_data, &end, 10);
  if (size > 0) {
    pathbuf = (char *)malloc(size + 1); // note: Untrusted data is used to specify the buffer size ...
    // ...
    free(pathbuf);
  }
  return 0;
}

The BugVisitor placing the note was wrongly going back only as far as the introduction of the tainted SVal in the sink.

This patch removes the BugVisitor from the implementation and replaces it with 2 new NoteTags. One, in taintOriginTrackerTag(), prints the "taint originated here" note, and the other, in taintPropagationExplainerTag(), explains how the taintedness propagates from argument to argument or to the return value ("Taint propagated to the xth argument").
This implementation uses the interestingness BugReport utility to track back the tainted symbols through propagating function calls to the point where the taintedness was introduced by a source function call.

The checker which wishes to emit a Taint related diagnostic must use the categories::TaintedData BugType category and must mark the tainted symbols as interesting. Then the TaintPropagationChecker will automatically generate the "Taint originated here" and the "Taint propagated to..." diagnostic notes.

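You can find the new, improved reports here (link not preserved in this archive).

And the old reports here (link not preserved in this archive). Look out for the "Taint originated here" notes: they are in the wrong place, close to the end of the reports.

A simple example report from curl (baseline and new screenshots not preserved in this archive).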

Event Timeline

dkrupp created this revision. Feb 17 2023, 7:36 AM

Herald added a reviewer: NoQ. Feb 17 2023, 7:36 AM

Herald added a project: Restricted Project.

Herald added subscribers: steakhal, manas, ASDenysPetrov and 10 others.

dkrupp requested review of this revision. Feb 17 2023, 7:36 AM

Herald added a subscriber: cfe-commits. Feb 17 2023, 7:36 AM

dkrupp edited the summary of this revision. Feb 17 2023, 7:37 AM

dkrupp edited the summary of this revision.

dkrupp edited the summary of this revision. Feb 17 2023, 7:44 AM

Harbormaster completed remote builds in B214415: Diff 498366. Feb 17 2023, 8:50 AM

Added documentation to the newly introduced types: TaintData, TaintBugReport.

dkrupp added a reviewer: steakhal. Feb 20 2023, 4:18 AM

dkrupp added a reviewer: gamesh411. Feb 20 2023, 5:52 AM

I haven't checked the implementation, but fundamentally patching the TaintBugVisitor is not how we should improve the diagnostic for taint issues.
I saw that this patch is not about NoteTags, so I didn't go any further than that point.

What we should do instead is add fancy NoteTags to each of the Post transitions to propagate interestingness to the taint sources.
Where each NoteTag does:

1. checks if any of the taint destinations are actually 'interesting'; if none, then just return an empty note.
2. takes the taint source arguments and marks their pre-call values as interesting.
3. constructs a descriptive message explaining what happened:
- If the transition had no taint sources, then it must be a "taint source".
- If we had tainted sources, tell the user that the X', Y', and Z' arguments were tainted, hence we propagated taint.
- Take all the "interesting" taint destinations and tell the user that the X, Y and Z arguments became tainted due to the propagation rule.
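The steps above can be sketched as a small standalone model. This is only an illustration under loud assumptions: `Transition`, `explainTaint`, and the integer symbol IDs are hypothetical stand-ins, not Clang types or APIs, and the backward loop merely imitates how the bug reporter would invoke each NoteTag while walking from the sink towards the start of the path.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy model only: none of these types are Clang's. A Transition stands in
// for a post-call node that carries a NoteTag.
struct Transition {
  std::string Callee;
  std::vector<int> Sources;      // symbols that were tainted before the call
  std::vector<int> Destinations; // symbols that became tainted by the call
};

// Walk the path backwards from the sink and let each "note tag" decide
// locally whether it has anything to say about this report.
std::vector<std::string> explainTaint(const std::vector<Transition> &Path,
                                      std::set<int> Interesting) {
  std::vector<std::string> Notes;
  for (auto It = Path.rbegin(); It != Path.rend(); ++It) {
    // Step 1: is any taint destination interesting for this report?
    bool Relevant = false;
    for (int Dst : It->Destinations)
      Relevant |= Interesting.count(Dst) != 0;
    if (!Relevant)
      continue; // corresponds to returning an empty note

    if (It->Sources.empty()) {
      // No tainted inputs: this call must be a taint source.
      Notes.push_back("Taint originated here: " + It->Callee);
    } else {
      // Step 2: mark the sources interesting so earlier tags react to them.
      Interesting.insert(It->Sources.begin(), It->Sources.end());
      Notes.push_back("Taint propagated here: " + It->Callee);
    }
  }
  return Notes;
}
```

In this model a taint source that feeds no interesting symbol stays silent, which is the filtering behaviour the comment asks for.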

I'm attaching my proposed version for improving the diagnostics, where I demonstrate everything I said.

proposed.patch (211 KB)

Note that my patch is really crude, and I just finished hacking it to get all tests pass in a couple hours.

Let me know if it would be a good way to refine your patch or I should review your current implementation.

I completely agree with @steakhal, these should be note tags:

The "visitor way" is to reverse-engineer the exploded graph after the fact.
The "slightly more sophisticated visitor way" is to have checker callbacks leave extra hints in the graph to assist reverse engineering, which is what you appear to be trying to do.
The "note tag" way is to simply capture that information from inside checker callbacks in the form of lambda captures. It eliminates the need to think about how to store the information in the state (it's stored in the program point instead), or how to structure it.

I also completely agree with @steakhal that the intermediate notes are valuable. In the motivating example, ideally both strtol and getenv need a note ("taint propagated here" and "taint originated here" respectively).

The challenging part with note tags is how do you figure out whether your bug report is taint-related. The traditional solution is to check the BugType but in this case an indeterminate amount of checkers may emit taint-related reports. I think now's a good time to include a "generic data map"-like data structure in PathSensitiveBugReport objects, so that checkers could put some data there during emitReport(), which can be picked up by note tags and potentially mutated in the process. For example, you can introduce a set of tracked tainted symbols there, which will be pre-populated by the checker with the final tainted symbol, then every time a note tag discovers that a symbol in the set becomes a target of taint propagation, it removes the symbol from the set and replaces it with the symbols from which its taint originated, so that later note tags would react on these new symbols instead.
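A minimal sketch of the "generic data map" idea described above, assuming a hypothetical `ToyReport` type and a `"tainted-symbols"` key; neither exists in Clang, and the real PathSensitiveBugReport has no such member in this form. The point it shows is that the note tag keeps its own facts in lambda captures and only consults, then mutates, data stored in the report.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a bug report that carries checker-defined data.
struct ToyReport {
  std::map<std::string, std::set<int>> Data; // the "generic data map"
};

// A note tag is a callback that may inspect and mutate the report.
using ToyNoteTag = std::function<std::optional<std::string>(ToyReport &)>;

// Built inside the checker callback; Target and Sources live in the lambda
// captures, so nothing extra needs to be stored in the program state.
ToyNoteTag makePropagationTag(int Target, std::vector<int> Sources) {
  return [Target, Sources = std::move(Sources)](
             ToyReport &R) -> std::optional<std::string> {
    std::set<int> &Tracked = R.Data["tainted-symbols"];
    // "Consume" the mark on the target; bail out if it was not tracked.
    if (Tracked.erase(Target) == 0)
      return std::nullopt;
    // Later (earlier-in-path) tags now react to the source symbols instead.
    Tracked.insert(Sources.begin(), Sources.end());
    return Sources.empty() ? "Taint originated here"
                           : "Taint propagated here";
  };
}
```

The checker would pre-populate the set with the final tainted symbol during report emission; the tags are then called in reverse path order.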

@steakhal, @NoQ

thanks for your reviews.

Please note that I am not extending TaintBugVisitor. On the contrary I removed it.
Instead I use NoteTag to generate the "Taint Originated here" text (see GenericTaintChecker.cpp:156).

I can also add additional NoteTags for generating propagation messages "Taint was propagated here" easily.

The challenging part with note tags is how do you figure out whether your bug report is taint-related.

I solved this by checking whether the report is an instance of TaintBugReport, a new BugReport type that all taint-related reports should use (the ArrayBoundCheckerV2 checker and the division-by-zero checker
were changed to use this new report type for taint-related reports).

The tricky part was how to show only the single "Taint originated here" note tag at the taint source that is relevant to the report. This is done by remembering the unique flow ID in the
NoteTag in the forward analysis direction (see GenericTaintChecker.cpp:859), InjectionTag = createTaintOriginTag(C, TaintFlowId);, and then filtering out the irrelevant
NoteTags when the report is generated (when the lambda callback is called). The flow ID that reaches the sink is back-propagated in the PathSensitiveBugReport (see GenericTaintChecker.cpp:167).

Flow IDs are unique and incremented at every taint source (GenericTaintChecker.cpp:869), and each is stored as an additional simple int in the program state along with the already existing (Taint.cpp:22) TaintTagType.

My fear with interestingness is that it may propagate backwards according to different "rules" than those by which taintedness propagates in the forward direction, even with the "extensions" pointed out by steakhal.
So the two types of propagation may be, or may get, out of sync.

So if the above is not a concern and you think implementing this with interestingness is more elegant, idiomatic and causes less maintenance burden, I am happy to create an alternative patch with that solution.

In D144269#4143066, @NoQ wrote:

The challenging part with note tags is how do you figure out whether your bug report is taint-related. The traditional solution is to check the BugType but in this case an indeterminate amount of checkers may emit taint-related reports.

Yeah, this is why we created a new type. Not sure what is the better infrastructure design, whether to create a subtype of BugType or BugReport, but it fundamentally achieves the same thing.

In D144269#4146809, @dkrupp wrote:

My fear with interestingness is that it may propagate backwards according to different "rules" than those by which taintedness propagates in the forward direction, even with the "extensions" pointed out by steakhal.
So the two types of propagation may be, or may get, out of sync.

So if the above is not a concern and you think implementing this with interestingness is more elegant, idiomatic and causes less maintenance burden, I am happy to create an alternative patch with that solution.

@dkrupp and I discussed in detail whether to use flow IDs (what is currently implemented in the patch) or something similar, or to reuse interestingness. Here's why we decided against reusing interestingness as is.

Interestingness, as it stands now, mostly expresses data dependency, and is propagated using the analyzer's usual, somewhat conservative approach. While the length of a string is, strictly speaking, data-dependent on the actual string, I don't think the analyzer currently understands that. We approach taint very differently, and propagate it, in some sense, more liberally.

As best I recall, however, interestingness may be propagated through other means as well. If we reused interestingness, I fear that the interestingness set could be greater than the actual interesting tainted set, causing more notes to be emitted than needed.

For these reasons, which I admit are a result of some speculation, we concluded that interestingness as it is and taint are two different properties that are best separated.

In D144269#4143066, @NoQ wrote:

I think now's a good time to include a "generic data map"-like data structure in PathSensitiveBugReport objects, so that checkers could put some data there during emitReport(), which can be picked up by note tags and potentially mutated in the process.

Maybe a new interestingness kind (D65723)? Not sure how this design aged, but we don't really need to store an ID for this, so a simple interestingness flag (just not the default Thorough interestingness) is good enough.

In D144269#4146809, @dkrupp wrote:

The tricky part was how to show only the single "Taint originated here" note tag at the taint source that is relevant to the report. This is done by remembering the unique flow ID in the
NoteTag in the forward analysis direction (see GenericTaintChecker.cpp:859), InjectionTag = createTaintOriginTag(C, TaintFlowId);, and then filtering out the irrelevant
NoteTags when the report is generated (when the lambda callback is called). The flow ID that reaches the sink is back-propagated in the PathSensitiveBugReport (see GenericTaintChecker.cpp:167).

Flow IDs are unique and incremented at every taint source (GenericTaintChecker.cpp:869), and each is stored as an additional simple int in the program state along with the already existing (Taint.cpp:22) TaintTagType.

If you propagate this property during analysis, those IDs may be needed, but a simple flag should suffice when BugReporter does it.

Yeah looks like I replied without properly reading the patch.

TaintBugReport is brilliant and we already have a precedent for subclassing BugReport in another checker. However I'm somewhat worried that once we start doing more of this, we'll eventually end up with multiple inheritance situations when the report needs multiple kinds of information. So at a glance my approach with a "generic data map" in bugreport objects looks a bit more future-proof to me. Also a bit easier to set up, no need to deal with custom RTTI.

So I think interestingness is just an example of such "generic data" attached to a bug report. Interestingness is also somewhat confusing, because indeed, there are existing interestingness rules, and I don't think anybody remembers what they are or what was even the purpose of having interestingness in the first place. Interestingness is currently used for tracking symbols with trackExpressionValue(), and we have those tracking kinds added by @Szelethus to make tracking behave slightly differently. So, yeah, I think interestingness shouldn't be used; it's already in use. I think it should be generalized, i.e. just let checkers track whatever/however they want.

I guess my main point is, there shouldn't be a need to assist tracking by adding extra information to the program state. Information in the state should ideally be "material" to program execution, "tangible", it has to describe something that's actually stored somewhere in memory (either by directly defining it, or by constraining it). In particular, if two nodes result in indistinguishable future behavior of the program, we're supposed to merge them; but any "immaterial" bits of information in the state will prevent that from happening.
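The merging concern can be illustrated with a toy model. Everything here is an assumption made for illustration: `ToyState` and `countDistinct` are hypothetical, and deduplicating through a `std::set` only imitates the way nodes with identical states are unified; it is not how the analyzer implements it.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <tuple>

// Toy program state: which symbols are tainted ("material" information),
// plus an optional per-symbol flow ID ("immaterial" bookkeeping).
struct ToyState {
  std::set<int> Tainted;
  std::map<int, int> FlowId;

  bool operator<(const ToyState &O) const {
    return std::tie(Tainted, FlowId) < std::tie(O.Tainted, O.FlowId);
  }
};

// Equal states collapse into one node; returns how many distinct nodes
// remain after inserting both states.
std::size_t countDistinct(const ToyState &A, const ToyState &B) {
  return std::set<ToyState>{A, B}.size();
}
```

Two states that taint the same symbol collapse into one, but once they record different flow IDs for it they stay separate, even though the program would behave identically from either.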

In our case it should be enough to have the lambda for propagation method ask "Hey, is this freshly produced propagation target value relevant to this specific report?" and if yes, mark the corresponding propagation source value as relevant to the report as well; also emit a note and "consume" the mark on the target value. Such chain of local decisions can easily replace the global taint flow identifier, and it's more flexible because this way the flow doesn't need to be "linear", it may branch in various ways and that's ok.

TaintBugReport is brilliant and we already have a precedent for subclassing BugReport in another checker. However I'm somewhat worried that once we start doing more of this, we'll eventually end up with multiple inheritance situations when the report needs multiple kinds of information. So at a glance my approach with a "generic data map" in bugreport objects looks a bit more future-proof to me. Also a bit easier to set up, no need to deal with custom RTTI.

Adding a data map (like a string->SVal map) to the PathSensitiveBugReport instead of relying on dynamic casting sounds like an easy addition. I will update the patch with this. Or do you specifically mean this kind of data map? typedef llvm::ImmutableMap<void*, void*> GenericDataMap; (ProgramState.h:74) I guess it should not be immutable…

I guess my main point is, there shouldn't be a need to assist tracking by adding extra information to the program state. Information in the state should ideally be "material" to program execution, "tangible", it has to describe something that's actually stored somewhere in memory (either by directly defining it, or by constraining it). In particular, if two nodes result in indistinguishable future behavior of the program, we're supposed to merge them; but any "immaterial" bits of information in the state will prevent that from happening.

@NoQ aha! Now I see where you are coming from! If an SVal is tainted on both analysis branches, but their taint flow value is different (meaning that they carry taint values from different taint sources), then they cannot be merged which causes inefficiency.
I understand the generic principle, but I wonder how frequent would that be in practice. I would think not too much, because taint sources are uncommon. Especially having multiple taint sources in the same source file/Translation Unit (only that creates different taint flow ids).

In our case it should be enough to have the lambda for propagation method ask "Hey, is this freshly produced propagation target value relevant to this specific report?" and if yes, mark the corresponding propagation source value as relevant to the report as well; also emit a note and "consume" the mark on the target value. Such chain of local decisions can easily replace the global taint flow identifier, and it's more flexible because this way the flow doesn't need to be "linear", it may branch in various ways and that's ok.

Taint propagation is not only handled in GenericTaintChecker.cpp:892, where we calculate that the taintedness should propagate from function argument x to y or to the return value; it also spreads in peculiar ways within expressions, from subregion to parent region, etc., which is handled in the addTaint(..) and addPartialTaint(..) functions in Taint.cpp. Your proposed solution would essentially mean that we would need to implement the taint propagation in the backward direction too. I think this design would be fragile and difficult to maintain (especially if taint propagation changes in the future).

I definitely don’t have the full picture here, so if you think that the sval backtracking is the better way, because of the potential performance penalty of the taintflow solution with the merges, I will go down that road and start to work on an alternative patch.

First and foremost, I want to deeply apologize for my rushed response. When I said that the "Taint originated here" note remained, I drew the wrong conclusions. Won't happen again.

Taint-flow IDs:
I would challenge that we need a flow ID (number), because for me the tainted symbol already has a unique ID which should be suitable for this purpose.
What I can see is that it would be useful to know directly what the first tainted symbol of any given tainted symbol was. (SymbolData -> SymbolData mapping)
This is pretty much analogous to what you implemented by piggybacking on the taint kind. Although I'd argue that having an explicit mapping would be cleaner, I can see that it's more a matter of personal preference.
In addition to that, I think there is value in minimizing the number of places where we introduce such static counters, so I would advise against that unless we have a good reason for doing so.

I'm thinking that although it's handy to have the originally tainted symbol directly at the error-node at hand, I'm still not sure if that couldn't be calculated and tracked back to the place where we introduced taint.

int n;
scanf("%d", &n); // binds $1, not important
int v = mytaintsource(n); // taint originated here, returns $2
int z = taintprop(v); // taint propagated here, returns $3
int x = 42 / z; // 42 div $3

Soundness of the back-propagation:
The back-propagation is always in-sync given that the state-transition of introducing taint also attaches the NoteTag explaining what it did and why.
This basically means that after we call addTaint() we must also add a NoteTag when we call addTransition(). Under these circumstances I find it easier to argue that the back-propagation is consistent (sound). If a specific checker (other than the GenericTaintChecker) models taint, it should stick to the rules described and emit the right NoteTag. That way even downstream checkers would play nicely with the taint-tracking and benefit from it.

Interestingness:
The concern that the set of interesting symbols could be larger than the actually required (and desired) set seems valid. However, I wouldn't worry about this much unless we have an example on which we can continue discussing whether this is a real concern.
What I can see is that as of now we use the interestingness notion for this, and deviating or introducing something else would introduce complexity. So, I'm not strictly against having something else, but I would lean towards using interestingness here as well, unless we have clear-cut examples demonstrating the need for something more sophisticated.

Finally, I'd like to thank you for investing your time into this subject. We really should have done it much earlier. Without improvements like this one, it's just halfway done. Thank you.

If we worry about having taint-related reports without a Note message explaining where the taint was introduced, we could just assert that in a BugReportVisitor at the finalizeVisitor() callback. I think such an assertion would make a lot of sense.
To achieve this, we could take the R.getNotes() and check if any of them refers to a specific one produced by the NoteTag callback for taint sources, let's say TaintSourceTag for that PathDiagnosticNotePiece.

void MyVisitor::finalizeVisitor(BugReporterContext &, const ExplodedNode *, PathSensitiveBugReport &R) {
  assert(llvm::any_of(R.getNotes(),
                      [](const auto &Piece) { return Piece->getTag() == TaintSourceTag; }) &&
         "Each taint report should have at least one taint-source");
}

With this assertion, we would gain confidence that the taint reports are complete, or at least they all have at least one taint source.

@steakhal , @NoQ thanks for the reviews. I will try to implement an alternative solution based on your suggestions.

This is a totally rewritten version of the patch which solely relies on the existing "interestingness" utility to track back the taint propagation. (And does not introduce a new FlowID in the ProgramState as requested in the reviews.)

- The new version also places a note when the taintedness is propagated to an argument or to a return value, so it should be easier for the user to follow how the taint information is spreading.
- "The taint originated here" is printed correctly at the taint source function, which introduces taintedness. (Main goal of this patch.)

Implementation:
- The createTaintPreTag() function places a NoteTag at the taint propagation function calls, if taintedness is propagated. Then, at report creation, the tainted arguments are marked interesting if the propagated taintedness is relevant for the bug report.

The isTainted() function is extended to return the actually tainted SymbolRef. This is important to be able to consistently mark as interesting the relevant symbol which carries the taintedness in a complex expression.

- The createTaintPostTag(..) function places a NoteTag at the taint-generating function calls to mark them interesting if they are relevant for a taintedness report, that is, if they propagated taintedness to interesting symbol(s).

The tests are passing, and the reports on the open source projects are much easier to understand than before (twin, tmux, curl):

https://codechecker-demo.eastus.cloudapp.azure.com/Default/reports?run=curl_curl-7_66_0_dkrupp_taint_origin_fix_new&run=tmux_2.6_dkrupp_taint_origin_fix_new&run=twin_v0.8.1_dkrupp_taint_origin_fix_new&is-unique=on&diff-type=New&checker-msg=%2auntrusted%2a&checker-msg=Out%20of%20bound%20memory%20access%20%28index%20is%20tainted%29

Harbormaster completed remote builds in B223071: Diff 510108. Mar 31 2023, 2:35 PM

In D144269#4237539, @dkrupp wrote:

This is a totally rewritten version of the patch which solely relies on the existing "interestingness" utility to track back the taint propagation. (And does not introduce a new FlowID in the ProgramState as requested in the reviews.)

- The new version also places a note when the taintedness is propagated to an argument or to a return value, so it should be easier for the user to follow how the taint information is spreading.
- "The taint originated here" is printed correctly at the taint source function, which introduces taintedness. (Main goal of this patch.)

Implementation:
- The createTaintPreTag() function places a NoteTag at the taint propagation function calls, if taintedness is propagated. Then, at report creation, the tainted arguments are marked interesting if the propagated taintedness is relevant for the bug report.

The isTainted() function is extended to return the actually tainted SymbolRef. This is important to be able to consistently mark as interesting the relevant symbol which carries the taintedness in a complex expression.

So this is how you circumvent introducing "transitive interestingness", because now you know which symbol to track.

- The createTaintPostTag(..) function places a NoteTag at the taint-generating function calls to mark them interesting if they are relevant for a taintedness report, that is, if they propagated taintedness to interesting symbol(s).

The tests are passing, and the reports on the open source projects are much easier to understand than before (twin, tmux, curl):

https://codechecker-demo.eastus.cloudapp.azure.com/Default/reports?run=curl_curl-7_66_0_dkrupp_taint_origin_fix_new&run=tmux_2.6_dkrupp_taint_origin_fix_new&run=twin_v0.8.1_dkrupp_taint_origin_fix_new&is-unique=on&diff-type=New&checker-msg=%2auntrusted%2a&checker-msg=Out%20of%20bound%20memory%20access%20%28index%20is%20tainted%29

I've looked at the results you attached and they look good to me.

Do you have an example where two tainted values are contributing to the same bug? Something like:

n <- read();
m <- read();
malloc(n+m)

Will both of these values be tracked back? What do the notes look like?

All in all, I'm pleased to see the improvements. It looks much better now IMO.
Using two NoteTags cleans up the implementation quite a bit, kudos.
I don't think there are major problems with this implementation, so I decided to spew your code with my nitpicks :D

Remember to clang-format your code. See clang/tools/clang-format/clang-format-diff.py.
And there are a few overloads of getNoteTag(), and there could be a better fit for usecases; you should decide.

I also find it difficult to track how a variable got tainted across assignments and computations like in this case.
This observation is completely orthogonal to your patch, I'm just noting it. It was bad previously as well.
Maybe we could have a visitor for explaining the taint tracking across assignments and computations to complement the NoteTag.

I hope I didn't miss much this time :D

clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
266–272
clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
48–49	It feels odd to have both `const char` and a `std::string` on the same line. Should we update `const char` to a more sophisticated type? I'm thinking of `StringRef`. It seems like that type should be used for the `Category` as well since the `PathSensitiveBugReport` constructor takes that, so we don't need to have an owning type here.
106–114
clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
139–140	I'm not sure about passing both the `CheckerContext` and the `State`. The `CheckerContext` already encapsulates a `State`, which opens up the possibilities for misuse. For example, the `getPointeeOf()` is called in this function, and that will eventually call `C.getState()` under the hood. So, for me it feels like a bad API design. What we could do instead, is to pass an `ASTContext` and a `State`; resolving this discrepancy. Could you please check if this is a real concern?
190	This function should be an implementation detail, as such I wonder if we should make it `static`. How about naming this function differently? I'm thinking of `taintOriginTrackerTag`, IDK.
194	If the `Call` parameter is only used for acquiring the `LocationContext`, wouldn't it be more descriptive to directly pass the `LocationContext` to the function instead? I'm also puzzled that we use `getCalleeStackFrame` here. I rarely ever see this function, so I'm a bit worried if this pick was intentional. That we pass the `0` as the `BlockCount` argument only reinforces this instinct.
197	It should consume the `TaintedSymbols` and `TaintedArgs` variables, as such you should `std::move` out from the original parameters like this: [TaintedSymbols = std::move(TaintedSymbols), TaintedArgs = std::move(TaintedArgs)](){...}
201	What does the first half of this condition guard against? Do you have a test for it to demonstrate?
205–208
209–215
224	How about calling this `taintPropagationExplainerTag`?
229	Please assert that the size of `TaintedSymbols` must be the same as `TaintedArgs`. Same in the other function.
230
246	So, if `TaintedSymbols.size() > 1`, then the note message will look weird. Could you please have a test for this?
868	I cannot see a test against the "Strange" string. Is this dead code? Same for the other block.
911–914	I just want to get it confirmed that this hunk is unrelated to your change per se. Is it? BTW I don't mind this change being part of this patch, rather the opposite. Finally, we will have it.
932–950	I prefer to move declarations close to their uses.
993–996
1018–1022
clang/lib/StaticAnalyzer/Checkers/Taint.cpp
222–226	I'd probably swap the two branches, so that the index would be tracked if both the region and the index are tainted. I also wonder if this edge-case could be tested at all.
306–311	What does `TSR` abbreviate? I would find `TaintedSym` more descriptive.
clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
244–252	Why don't we use a distinct BugType for this?
clang/test/Analysis/taint-diagnostic-visitor.c
47	I'd suggest to count from 1 instead of 0 when referring to Nth arguments. You can also use the `llvm::getOrdinalSuffix(N)` to get nicer messages. We already count from 1 in the std library checker.
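Two of the inline suggestions above (moving `TaintedSymbols` and `TaintedArgs` into the lambda with init-captures, asserting that they have the same size, and counting arguments from 1 with an ordinal suffix) can be sketched together. This is a standalone illustration, not the patch's code: `makeTag` is hypothetical, the symbols are plain ints, and `ordinalSuffix` is a local stand-in written so the sketch builds without LLVM; in-tree code would use `llvm::getOrdinalSuffix` instead.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Local stand-in for llvm::getOrdinalSuffix (assumption: same English rules).
std::string ordinalSuffix(unsigned N) {
  unsigned LastTwo = N % 100;
  if (LastTwo >= 11 && LastTwo <= 13)
    return "th";
  switch (N % 10) {
  case 1:
    return "st";
  case 2:
    return "nd";
  case 3:
    return "rd";
  default:
    return "th";
  }
}

// Hypothetical tag factory: takes the vectors by value and moves them into
// the closure, so the closure owns its data and no extra copy is made.
std::function<std::string()> makeTag(std::vector<int> TaintedSymbols,
                                     std::vector<unsigned> TaintedArgs) {
  assert(TaintedSymbols.size() == TaintedArgs.size() &&
         "expected one tainted symbol per tainted argument");
  return [TaintedSymbols = std::move(TaintedSymbols),
          TaintedArgs = std::move(TaintedArgs)]() {
    if (TaintedSymbols.empty())
      return std::string();
    unsigned N = TaintedArgs.front() + 1; // 0-based index to 1-based ordinal
    return "Taint propagated to the " + std::to_string(N) + ordinalSuffix(N) +
           " argument";
  };
}
```

Only the first tainted argument is reported here, mirroring the single-symbol tracking discussed in this review; a multi-argument message would need its own wording and test.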

@steakhal thanks for your review. I tried to address all your concerns.
I added an extra test case too (multipleTaintSources(..)) which highlights the limitation of the current patch: if multiple tainted "variables" reach a sink, we only generate diagnostics for one of them. The main reason is that the isTainted() function returns a single tainted SymbolRef instead of a vector of SymbolRefs if there are multiple instances.
I highlighted this in the test and the implementation.

I think this could still be an acceptable limitation for now, because as the user sanitizes one of the tainted variables, they will get a new diagnostic for the remaining one(s).

Should we address this limitation in follow-up patch(es) or here?
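The limitation described above can be pictured with a toy model, assuming hypothetical helpers `firstTaintedSymbol` and `allTaintedSymbols` over integer symbol IDs. The real isTainted() works on SVals and program states; this sketch only contrasts "return the first hit" with "collect every hit".

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <vector>

// Toy model: an expression is just the list of symbols it is built from,
// e.g. n + m is {n, m}.
using ToyExpr = std::vector<int>;

// Mirrors the current limitation: stop at the first tainted operand.
std::optional<int> firstTaintedSymbol(const ToyExpr &E,
                                      const std::set<int> &Tainted) {
  for (int Sym : E)
    if (Tainted.count(Sym))
      return Sym;
  return std::nullopt;
}

// One possible follow-up: collect every tainted operand, so that each of
// them could be marked interesting and tracked back to its own source.
std::vector<int> allTaintedSymbols(const ToyExpr &E,
                                   const std::set<int> &Tainted) {
  std::vector<int> Result;
  for (int Sym : E)
    if (Tainted.count(Sym))
      Result.push_back(Sym);
  return Result;
}
```

With both operands tainted, only the first one is reported; once it is sanitized, the next query surfaces the remaining one, which is the workflow described in the comment.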

All comments addressed. Thanks for the review @steakhal .

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
194	The call.getCalleeStackFrame(0) gets the location context of the actual call that we are analyzing (in the pre- or post-call), and that's what we need to mark interesting. It is intentionally used like this. I changed the parameter to LocationContext as you suggested.
201	To only follow taint propagation to function calls which actually result in tainted variables used in the report and not every function which returns a tainted variable. char* taintDiagnosticPropagation2() is such a test which is failing without this due to giving extra unrelated propagation notes.
246	Test added multipleTaintedArgs (). I could not provoke the multi-argument message as we only track-back one tainted symbol now.
868	It was debugging code, which I removed. I noticed that in some cases (e.g. if the argument pointer is pointing to an unknown area) we don't get back a tainted symbol even though we call addTaint on the arg/return value.
911–914	It is related in the sense that the isTainted() call did not return a valid SymbolRef for stdin if we did not make stdin tainted when we first saw it. This caused a test case to fail with the code as it was before. Now it is handled similarly to other tainted symbols.
clang/lib/StaticAnalyzer/Checkers/Taint.cpp
306–311	TSR = Tainted Symbol Ref but I changed it as you suggested.
clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
244–252	You mean new bug type instances? Would there be an advantage to that? It seemed simpler this way: the tainted reports are distinguished by the bug category.

Harbormaster completed remote builds in B223788: Diff 511078.Apr 5 2023, 7:07 AM

Looks even better. Only minor concerns remained, mostly about style and suggestions of llvm utilities.

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
130–131	BTW I don't know but `State->getStateManager().getContext()` can give you an `ASTContext`. And we tend to not put `const` to variable declarations. See readability-avoid-const-params-in-decls In other places we tend to refer to `ASTContext` by the `ACtx` I think. We also prefer const refs over mutable refs. Is the mutable ref justified for this case?
157	My bad. In LLVM style we use `UpperCamelCase` for variable names.
171–173	Generally, in LLVM style we don't put braces to single block statements unless it would hurt readability, which I don't think applies here.

steakhal added inline comments.Apr 5 2023, 9:04 AM

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
174–178	I was also bad with this recommendation. I think we can now use structured bindings to get the index and value right there, like: `for (auto [Idx, Sym] : llvm::enumerate(TaintedSymbols))` See
194	Okay.
202	Same here about structured bindings.
208–213	I'd recommend using `llvm::interleaveComma()` in such cases. You can probably get rid of `nofTaintedArgs` as well - by using this function.
210	For clang diagnostics we usually use ordinal suffixes like `{st,nd,rd,th}`. It would be nice to align with the rest of the clang diagnostics on this. It would require a bit of work on the wording though, I admit.
212	I believe this branch is uncovered by tests.
220	I think since you explicitly specify the return type of the lambda, you could omit the spelling of `std::string` here.
863–864	We tend to fuse such declarations: I've seen other cases like this elsewhere. Please check.
875–878	You could iterate over the symbol dependencies of the SymExpr (of the `V` SVal).
SymbolRef PointeeAsSym = V->getAsSymbol();
// eee, can it be null? Sure it can. See isTainted(Region),... for those cases we would need to descend and check their symbol dependencies.
for (SymbolRef SubSym : llvm::make_range(PointeeAsSym->symbol_begin(), PointeeAsSym->symbol_end())) {
  // TODO: check each if it's also tainted, and update the `TaintedSymbols` accordingly, IDK.
}
Something like this should work for most cases (except when `V` refers to a tainted region instead of a symbol), I think.
clang/lib/StaticAnalyzer/Checkers/Taint.cpp
278–282	Here we still have the `TSR` token.
clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
244–252	You mean new bug type instances? Would there be an advantage to that? It seemed simpler this way: the tainted reports are distinguished by the bug category. I never checked how `BugTypes` contribute to bug report construction, but my gut instinct suggests that we should have two separate instances like we frequently do for other checkers.

I think as we converge, now it could be time to update the summary of the patch to reflect the current implementation. (e.g. flowids etc.)

-All remarks from @steakhal were fixed. Thanks for the review!
-Now we can generate diagnostics for all tainted values when they reach a sink.

See, for example, the following test case:

void multipleTaintedArgs(void) {
  int x,y;
  scanf("%d %d", &x, &y); // expected-note {{Taint originated here}}
                          // expected-note@-1 {{Taint propagated to the 2nd argument, 3rd argument}}
  int* ptr = (int*) malloc(x + y); // expected-warning {{Untrusted data is used to specify the buffer size}}
                                   // expected-note@-1{{Untrusted data is used to specify the buffer size}}
  free (ptr);
}

All remarks from @steakhal have been fixed. Thanks for the review.
This new version now can handle the tracking back of multiple symbols!

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
130–131	Thanks for the suggestion. I took out ASTContext from the signature.
208–213	I chose another solution. I hope that is ok too.
212	Now it is covered. See the multipleTaintedArgs(..) test in taint-diagnostic-visitor.c
220	Not sure. I got a "cannot convert raw_svector_ostream::str() from llvm::StringRef" error.
875–878	I implemented a new function getTaintedSymbols(..) in Taint.cpp which returns all tainted symbols for a complex expr, SVal etc. With this addition, now we can track back multiple tainted symbols reaching a sink.

Harbormaster completed remote builds in B225590: Diff 513556.Apr 14 2023, 6:50 AM

dkrupp edited the summary of this revision. (Show Details)Apr 15 2023, 1:10 AM

You can find the improved reports on tmux, postgres, twin, openssl here:

dkrupp edited the summary of this revision. (Show Details)Apr 15 2023, 1:13 AM

dkrupp edited the summary of this revision. (Show Details)Apr 15 2023, 1:16 AM

Nice improvement!
I only have minor nitpicks and some recommendations for the taint API.

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
205	We usually use the `operator[]` on vector instead of the `at()`. When in doubt, we assert that the index is in `idx < size()`. Apply this for all `.at()` uses.
208–213	Looks good. I'm not expecting many cases propagating to multiple arguments anyway.
clang/lib/StaticAnalyzer/Checkers/Taint.cpp
147–151	TBH I'm not sure if I like that now we allocate unbounded amount of times (because of `getTaintedSymbols()` is recursive and returns by value), where we previously did not. What we could possibly do is to compute the elements of this sequence lazily. I'm thinking of the `llvm::mapped_iterator`, but I'm not sure if it's possible to have something like that as a return type as it might encode the map function in the type or something like that. Anyway, I'm just saying that it would be nice to not do more than it's necessary, and especially not allocate a lot of short-lived objects there. Do you think there is a way to have the cake and eat it too? I did some investigation, and one could get pretty far in the implementation, and maybe even complete it but it would be really complicated as of now. Maybe we could revisit this subject when we have coroutines. So, I would suggest to have two sets of APIs: the usual `isTainted(.) -> bool` and a `getTaintedSymbols(.) -> vector<Sym>` The important point would be that the `isTainted()` version would not eagerly collect all tainted sub-syms but return on finding the first one. While, the `getTaintedSymbols()` would collect eagerly all of them, as its name suggests. Imagine if `getTaintedSymbolsImpl()` had an extra flag like `bool returnAfterFirstMatch`. This way `isTainted()` can call it like that. While in the other case, the parameter would be `false`, and eagerly collect all symbols. This is probably the best of both worlds, as it prevents `isTainted` from doing extra work and if we need to iterate over the tainted symbols, we always iterate over all of them, so doing it lazily wouldn't gain us much in that case anyway. As a bonus, the user-facing API would be self-descriptive. WDYT?
306–311	For such constructs, I would prefer this.
clang/test/Analysis/taint-diagnostic-visitor.c
12–13	The premerge bots are complaining about these two lines on Windows:
error: 'warning' diagnostics seen but not expected:
  File C:\ws\w8\llvm-project\premerge-checks\clang\test\Analysis\taint-diagnostic-visitor.c Line 12: incompatible redeclaration of library function 'strlen'
  File C:\ws\w8\llvm-project\premerge-checks\clang\test\Analysis\taint-diagnostic-visitor.c Line 13: incompatible redeclaration of library function 'malloc'
error: 'note' diagnostics seen but not expected:
  File C:\ws\w8\llvm-project\premerge-checks\clang\test\Analysis\taint-diagnostic-visitor.c Line 12: 'strlen' is a builtin with type 'unsigned long long (const char *)'
  File C:\ws\w8\llvm-project\premerge-checks\clang\test\Analysis\taint-diagnostic-visitor.c Line 13: 'malloc' is a builtin with type 'void *(unsigned long long)'
4 errors generated.
I think it's because `size_t` should be defined as `unsigned long long` on `x86_64`. This also means that you should pin the target to `x86_64` to satisfy this test on all platforms.
53–67	I know this is subjective, but I'd suggest to reformat the tests to match LLVM style guidelines, unless the formatting is important for the test. Consistency helps the reader and reviewer, as code and tests are read many more times than written. This applies to the rest of the touched tests.
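The two-API idea from the Taint.cpp comment above (a boolean `isTainted()` that stops at the first match and an eager `getTaintedSymbols()`, both backed by one implementation function with a `returnAfterFirstMatch` flag) can be sketched in isolation. This is a toy model under stated assumptions: `Sym` is a hypothetical miniature expression tree rather than the analyzer's `SymExpr`, and the `detail` namespace and all signatures are invented for the example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical miniature of a symbol expression tree; not the analyzer's SymExpr.
struct Sym {
  bool Tainted = false;
  std::vector<const Sym *> Children;
};

namespace detail {
// Implementation detail shared by both public entry points. When
// ReturnAfterFirstMatch is true, the traversal stops at the first tainted
// symbol instead of collecting all of them.
inline std::vector<const Sym *>
getTaintedSymbolsImpl(const Sym *S, bool ReturnAfterFirstMatch) {
  std::vector<const Sym *> Result;
  if (!S)
    return Result;
  if (S->Tainted) {
    Result.push_back(S);
    if (ReturnAfterFirstMatch)
      return Result;
  }
  for (const Sym *Child : S->Children) {
    // Forward the flag so that nested calls also stop early.
    std::vector<const Sym *> Sub =
        getTaintedSymbolsImpl(Child, ReturnAfterFirstMatch);
    Result.insert(Result.end(), Sub.begin(), Sub.end());
    if (ReturnAfterFirstMatch && !Result.empty()) {
      assert(Result.size() == 1 && "early return should yield one symbol");
      return Result;
    }
  }
  return Result;
}
} // namespace detail

// Cheap query: does not collect more than one symbol.
inline bool isTainted(const Sym *S) {
  return !detail::getTaintedSymbolsImpl(S, /*ReturnAfterFirstMatch=*/true)
              .empty();
}

// Eager query: collects every tainted symbol reachable from S.
inline std::vector<const Sym *> getTaintedSymbols(const Sym *S) {
  return detail::getTaintedSymbolsImpl(S, /*ReturnAfterFirstMatch=*/false);
}
```

Keeping the flag out of the two public functions means callers only choose between "is it tainted?" and "give me everything", and the early-return behaviour stays an implementation detail.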

This revision now requires changes to proceed.Apr 19 2023, 2:42 AM

Implemented early return in getTaintedSymbols() when it is called by isTainted() for efficiency
Fixed test incompatibility on Windows

@steakhal thanks for your review. All your remarks have been fixed.

clang/lib/StaticAnalyzer/Checkers/Taint.cpp
147–151	Good idea. I implemented the early return option in getTaintedSymbols(). This is used now by the isTainted() function.

Harbormaster completed remote builds in B226618: Diff 514973.Apr 19 2023, 9:28 AM

@steakhal is there anything else to do before we merge this? Thanks.

I didn't go through the whole revision this time, but I think the next steps are already clear for the next round.
My impression was that I might not have expressed my intent and expectations about the direction of the next step clearly.
I hope I managed this time. Let me know if you have questions.

clang/include/clang/StaticAnalyzer/Checkers/Taint.h
82–103	The overloads having the extra `ReturnFirstOnly` parameter shouldn't be visible here in the header. That is an implementation detail that no users should know about. Note that having a single default argument overload potentially doubles the variations the user might need to keep in mind when choosing the right one. So there is value in simplicity.
clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
254–260
clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
62
108–109
clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
863–868
875–879	In these cases, the code would acquire all the tainted subsymbols, which then we throw away and keep only the first one. This is why I suggested the approach I did in my last review.
936–940	Here `getTaintedPointeeOrPointer` would be called two times, unnecessarily.
947–948
1008–1010
1018–1022
clang/lib/StaticAnalyzer/Checkers/Taint.cpp
149–150	We usually pass booleans by "name".
316	If `returnFirstOnly` is `true`, this `getTaintedSymbols()` call would still eagerly (needlessly) collect all of the symbols. I'd recommend propagating the `returnFirstOnly` parameter to the recursive calls to avoid this problem. I also encourage you to make use of the `llvm::append_range()` whenever makes sense.
clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
230–232	Ah, how awesome it would be to have a `markInteresting(llvm::ArrayRef<SymbolRef>)` overload.
clang/test/Analysis/taint-diagnostic-visitor.c
53–67	Originally I meant this to the rest of the test cases you change or add part of this patch. I hope it clarifies.

This revision now requires changes to proceed.Apr 21 2023, 6:59 AM

-getTaintedSymbols(.) -> getTaintedSymbolsImpl() proxy function introduced for interface safety
-Other minor fixes based on comments from @steakhal

@steakhal your comments are fixed. Thanks for the review.

clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
108–109	We cannot get rid of the getTaintedSymbols() call, as we need to pass all tainted symbols to reportTaintBug if we want to track back multiple variables. taintedSyms is a parameter of reportTaintBug(...)
clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
863–868	I think this suggested solution would not be correct here, as ArgSym might not be the actual _tainted_ symbol (inside a more complex expression). So I would prefer to leave it like this for correctness.
875–879	I think this suggested solution would not be correct here, as ArgSym might not be the actual _tainted_ symbol (inside a more complex expression). So I would prefer to leave it like this for correctness.
clang/lib/StaticAnalyzer/Checkers/Taint.cpp
147–151	First I wanted to avoid the getTaintedSymbols()->getTaintedSymbolsImpl() proxy calls as it is too bloated IMHO. But I see your point that it is safer. So I changed it.
316	You are perfectly right. I overlooked these calls, and because of the default parameter I did not get a warning. Now fixed.

Harbormaster completed remote builds in B227461: Diff 516077.Apr 22 2023, 9:32 AM

To conclude the review, please respond to the "Not Done" inline comments, and mark them "Done" if you think they are resolved.
Thank you for your patience.

clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
108–109	Yes, makes sense. mb. One more thing: if `reportTaintBug()` takes the `taintedSyms` vector "by-value", you should express your intent by `std::move()`-ing your collection, expressing that it's meant to be consumed instead of taking a copy. Otherwise, you could express this intent if `reportTaintBug()` took a view type for the collection, such as `llvm::ArrayRef<SymbolRef>`, which would neither copy nor move from the callsite's vector, being more performant and expressive. I get that this vector is small and bugreport construction will dominate the runtime anyway, so I'm not complaining about this specific case, I'm just noting this for the next time. So, here I'm not expecting any actions.
clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
875–879	Okay, I also checked out the code and verified this. Indeed we would have failing tests with my recommendation. I still think it's suboptimal. This is somewhat related to the tainted API. It shouldn't make sense to have tainted regions in the first place, which bites here again. Let's keep it as-is, no actions are required.
920	This unused variable generates a compiler warning.
947–948	I observed you didn't take any action about this suggestion. It leaves me wondering whether this suggestion makes sense in general, or whether there are other reasons that I cannot foresee. I've seen you using the fully spelled-out version a total of 8 times. Shouldn't we prefer the shorter, more expressive version instead?
994
clang/lib/StaticAnalyzer/Checkers/Taint.cpp
271	The second part of the conjunction should be tautologically true.
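The by-value versus view-type remark on DivZeroChecker.cpp above can be illustrated without LLVM. In this sketch `ArrayView` is a hypothetical, minimal stand-in for `llvm::ArrayRef`, `SymbolId` stands in for `SymbolRef`, and both report functions are invented for the example; its `operator[]` also follows the "assert that the index is in range" convention mentioned earlier in the review.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, minimal stand-in for llvm::ArrayRef<T>: a non-owning view
// over contiguous elements. It must not outlive the container it refers to.
template <typename T> struct ArrayView {
  const T *Data;
  std::size_t Size;

  // Implicit on purpose, so a std::vector can be passed directly.
  ArrayView(const std::vector<T> &V) : Data(V.data()), Size(V.size()) {}

  const T *begin() const { return Data; }
  const T *end() const { return Data + Size; }
  const T &operator[](std::size_t I) const {
    assert(I < Size && "index out of range");
    return Data[I];
  }
};

using SymbolId = int; // hypothetical stand-in for SymbolRef

// Option 1: a sink parameter taken by value. Callers that no longer need
// their vector should std::move() it in to avoid a copy.
inline std::size_t reportByValue(std::vector<SymbolId> TaintedSyms) {
  return TaintedSyms.size();
}

// Option 2: a view parameter. Nothing is copied or moved; the function only
// reads the caller's elements.
inline std::size_t reportByView(ArrayView<SymbolId> TaintedSyms) {
  std::size_t Count = 0;
  for (SymbolId Sym : TaintedSyms) {
    (void)Sym; // a real checker would mark each symbol interesting here
    ++Count;
  }
  return Count;
}
```

A caller can pass the same vector to `reportByView` repeatedly and keep using it afterwards, whereas `reportByValue(std::move(Syms))` leaves `Syms` in a moved-from state.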

-append_range(..) used instead of std::vector.insert(...) to improve readability
-minor updates based on @steakhal comments

-using llvm::ArrayRef<SymbolRef> in the reportTaintBug(..) function in the DivZero Checker

@steakhal thanks for the review. I fixed all outstanding remarks.
I left the test taint-diagnostic-visitor.c formatting as is to remain consistent with the rest of the file. I think we should keep it as is, or reformat the whole file.

clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
108–109	Fixed as suggested. thanks.
clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
947–948	Sorry, I overlooked this comment. I like this shorter version. It is so much more concise! Changed at all places. Thanks for the suggestion.
clang/test/Analysis/taint-diagnostic-visitor.c
53–67	I made some formatting changes you suggested, but I would like to leave the //expected-note tags as they are now, because then it remains consistent with the rest of the test cases. Would it be okay like this, or should I reformat the whole file (untouched parts too)?

Harbormaster completed remote builds in B227715: Diff 516389.Apr 24 2023, 7:35 AM

LGTM.
About formatting the tests:
Personally, I would have preferred to "clean as you code", but I can see your point. Leave it as-is.
Land it, please.

This revision is now accepted and ready to land.Apr 24 2023, 10:54 PM

Committed in 343bdb10940cb2387c0b9bd3caccee7bb56c937b.

Revision Contents

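Path	Size
clang/include/clang/StaticAnalyzer/Checkers/Taint.h	54 lines
clang/include/clang/StaticAnalyzer/Core/BugReporter/CommonBugCategories.h	1 line
clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp	47 lines
clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp	46 lines
clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp	177 lines
clang/lib/StaticAnalyzer/Checkers/Taint.cpp	180 lines
clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp	59 lines
clang/lib/StaticAnalyzer/Core/CommonBugCategories.cpp	1 line
clang/test/Analysis/taint-diagnostic-visitor.c	79 lines
clang/test/Analysis/taint-tester.c	2 lines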

Diff 516389

clang/include/clang/StaticAnalyzer/Checkers/Taint.h

	Show First 20 Lines • Show All 73 Lines • ▼ Show 20 Lines
	bool isTainted(ProgramStateRef State, SymbolRef Sym,			bool isTainted(ProgramStateRef State, SymbolRef Sym,
	TaintTagType Kind = TaintTagGeneric);			TaintTagType Kind = TaintTagGeneric);

	/// Check if the pointer represented by the region is tainted in the given			/// Check if the pointer represented by the region is tainted in the given
	/// state.			/// state.
	bool isTainted(ProgramStateRef State, const MemRegion *Reg,			bool isTainted(ProgramStateRef State, const MemRegion *Reg,
	TaintTagType Kind = TaintTagGeneric);			TaintTagType Kind = TaintTagGeneric);

				/// Returns the tainted Symbols for a given Statement and state.
				std::vector<SymbolRef> getTaintedSymbols(ProgramStateRef State, const Stmt *S,
				const LocationContext *LCtx,
				TaintTagType Kind = TaintTagGeneric);

				/// Returns the tainted Symbols for a given SVal and state.
				std::vector<SymbolRef> getTaintedSymbols(ProgramStateRef State, SVal V,
				TaintTagType Kind = TaintTagGeneric);

				/// Returns the tainted Symbols for a SymbolRef and state.
				std::vector<SymbolRef> getTaintedSymbols(ProgramStateRef State, SymbolRef Sym,
				TaintTagType Kind = TaintTagGeneric);

				/// Returns the tainted (index, super/sub region, symbolic region) symbols
				/// for a given memory region.
				std::vector<SymbolRef> getTaintedSymbols(ProgramStateRef State,
				const MemRegion *Reg,
				TaintTagType Kind = TaintTagGeneric);

				std::vector<SymbolRef> getTaintedSymbolsImpl(ProgramStateRef State,
				const Stmt *S,
				const LocationContext *LCtx,
				TaintTagType Kind,
				bool returnFirstOnly);

				std::vector<SymbolRef> getTaintedSymbolsImpl(ProgramStateRef State, SVal V,
				TaintTagType Kind,
				bool returnFirstOnly);

				std::vector<SymbolRef> getTaintedSymbolsImpl(ProgramStateRef State,
				SymbolRef Sym, TaintTagType Kind,
				bool returnFirstOnly);

				std::vector<SymbolRef> getTaintedSymbolsImpl(ProgramStateRef State,
				const MemRegion *Reg,
				TaintTagType Kind,
				bool returnFirstOnly);

	void printTaint(ProgramStateRef State, raw_ostream &Out, const char *nl = "\n",			void printTaint(ProgramStateRef State, raw_ostream &Out, const char *nl = "\n",
	const char *sep = "");			const char *sep = "");

	LLVM_DUMP_METHOD void dumpTaint(ProgramStateRef State);			LLVM_DUMP_METHOD void dumpTaint(ProgramStateRef State);

	/// The bug visitor prints a diagnostic message at the location where a given
	/// variable was tainted.
	class TaintBugVisitor final : public BugReporterVisitor {
	private:
	const SVal V;

	public:
	TaintBugVisitor(const SVal V) : V(V) {}
	void Profile(llvm::FoldingSetNodeID &ID) const override { ID.Add(V); }

	PathDiagnosticPieceRef VisitNode(const ExplodedNode *N,
	BugReporterContext &BRC,
	PathSensitiveBugReport &BR) override;
	};

	} // namespace taint			} // namespace taint
	} // namespace ento			} // namespace ento
	} // namespace clang			} // namespace clang

	#endif			#endif

clang/include/clang/StaticAnalyzer/Core/BugReporter/CommonBugCategories.h

	Show All 16 Lines
	extern const char *const LogicError;			extern const char *const LogicError;
	extern const char *const MemoryRefCount;			extern const char *const MemoryRefCount;
	extern const char *const MemoryError;			extern const char *const MemoryError;
	extern const char *const UnixAPI;			extern const char *const UnixAPI;
	extern const char *const CXXObjectLifecycle;			extern const char *const CXXObjectLifecycle;
	extern const char *const CXXMoveSemantics;			extern const char *const CXXMoveSemantics;
	extern const char *const SecurityError;			extern const char *const SecurityError;
	extern const char *const UnusedCode;			extern const char *const UnusedCode;
				extern const char *const TaintedData;
	} // namespace categories			} // namespace categories
	} // namespace ento			} // namespace ento
	} // namespace clang			} // namespace clang
	#endif			#endif

clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp

Show All 27 Lines

using namespace clang;

using namespace ento;

using namespace taint;

namespace {

class ArrayBoundCheckerV2 :

public Checker<check::Location> {

mutable std::unique_ptr<BuiltinBug> BT;

mutable std::unique_ptr<BugType> TaintBT;

enum OOB_Kind { OOB_Precedes, OOB_Excedes, OOB_Tainted };

enum OOB_Kind { OOB_Precedes, OOB_Excedes };

void reportOOB(CheckerContext &C, ProgramStateRef errorState, OOB_Kind kind,

void reportOOB(CheckerContext &C, ProgramStateRef errorState,

std::unique_ptr<BugReporterVisitor> Visitor = nullptr) const;

OOB_Kind kind) const;

void reportTaintOOB(CheckerContext &C, ProgramStateRef errorState,

SVal TaintedSVal) const;

public:

void checkLocation(SVal l, bool isLoad, const Stmt*S,

CheckerContext &C) const;

};

// FIXME: Eventually replace RegionRawOffset with this class.

class RegionRawOffsetV2 {

▲ Show 20 Lines • Show All 153 Lines • ▼ Show 20 Lines

do {

ProgramStateRef state_exceedsUpperBound, state_withinUpperBound;

std::tie(state_exceedsUpperBound, state_withinUpperBound) =

state->assume(*upperboundToCheck);

// If we are under constrained and the index variables are tainted, report.

if (state_exceedsUpperBound && state_withinUpperBound) {

SVal ByteOffset = rawOffset.getByteOffset();

if (isTainted(state, ByteOffset)) {

reportOOB(checkerContext, state_exceedsUpperBound, OOB_Tainted,

reportTaintOOB(checkerContext, state_exceedsUpperBound, ByteOffset);

std::make_unique<TaintBugVisitor>(ByteOffset));

return;

}

} else if (state_exceedsUpperBound) {

// If we are constrained enough to definitely exceed the upper bound,

// report.

assert(!state_withinUpperBound);

reportOOB(checkerContext, state_exceedsUpperBound, OOB_Excedes);

return;

}

assert(state_withinUpperBound);

state = state_withinUpperBound;

}

while (false);

checkerContext.addTransition(state);

}

void ArrayBoundCheckerV2::reportTaintOOB(CheckerContext &checkerContext,

ProgramStateRef errorState,

SVal TaintedSVal) const {

ExplodedNode *errorNode = checkerContext.generateErrorNode(errorState);

if (!errorNode)

return;

if (!TaintBT)

TaintBT.reset(

new BugType(this, "Out-of-bound access", categories::TaintedData));

void ArrayBoundCheckerV2::reportOOB(

SmallString<256> buf;

CheckerContext &checkerContext, ProgramStateRef errorState, OOB_Kind kind,

llvm::raw_svector_ostream os(buf);

std::unique_ptr<BugReporterVisitor> Visitor) const {

os << "Out of bound memory access (index is tainted)";

auto BR =

std::make_unique<PathSensitiveBugReport>(*TaintBT, os.str(), errorNode);

// Track back the propagation of taintedness.

for (SymbolRef Sym : getTaintedSymbols(errorState, TaintedSVal)) {

BR->markInteresting(Sym);

}

checkerContext.emitReport(std::move(BR));

}

void ArrayBoundCheckerV2::reportOOB(CheckerContext &checkerContext,

ProgramStateRef errorState,

OOB_Kind kind) const {

ExplodedNode *errorNode = checkerContext.generateErrorNode(errorState);

steakhalUnsubmitted

Done

std::make_unique<PathSensitiveBugReport>(*TaintBT, os.str(), errorNode);

- std::vector<SymbolRef> TaintedSyms =

- getTaintedSymbols(errorState, TaintedSVal);

- // Mark all tainted symbols interesting

- // to track back the propagation of taintedness.

- for (auto Sym : TaintedSyms) {

+ // Track back the propagation of taintedness.

+ for (SymbolRef Sym : getTaintedSymbols(errorState, TaintedSVal)) {

BR->markInteresting(Sym);

}

checkerContext.emitReport(std::move(BR));

steakhal:

if (!errorNode)

return;

if (!BT)

BT.reset(new BuiltinBug(this, "Out-of-bound access"));

// FIXME: This diagnostics are preliminary. We should get far better

// diagnostics for explaining buffer overruns.

SmallString<256> buf;

llvm::raw_svector_ostream os(buf);

os << "Out of bound memory access ";

steakhalUnsubmitted

Done

if (!BT) {

- if (kind == OOB_Tainted)

BT.reset(

- new BugType(this, "Out-of-bound access", categories::TaintedData));

- else

- BT.reset(

- new BugType(this, "Out-of-bound access", categories::LogicError));

+ new BugType(this, "Out-of-bound access", kind == OOB_Tainted ? categories::TaintedData : categories::LogicError));

}

// FIXME: This diagnostics are preliminary. We should get far better

steakhal:

switch (kind) {

case OOB_Precedes:

os << "(accessed memory precedes memory block)";

break;

case OOB_Excedes:

os << "(access exceeds upper limit of memory block)";

break;

case OOB_Tainted:

os << "(index is tainted)";

break;

}

auto BR = std::make_unique<PathSensitiveBugReport>(*BT, os.str(), errorNode);

BR->addVisitor(std::move(Visitor));

checkerContext.emitReport(std::move(BR));

}

#ifndef NDEBUG

LLVM_DUMP_METHOD void RegionRawOffsetV2::dump() const {

dumpToStream(llvm::errs());

}

▲ Show 20 Lines • Show All 93 Lines • Show Last 20 Lines

clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp

Show All 19 Lines

#include <optional> #include <optional>

using namespace clang; using namespace clang;

using namespace ento; using namespace ento;

using namespace taint; using namespace taint;

namespace { namespace {

class DivZeroChecker : public Checker< check::PreStmt<BinaryOperator> > { class DivZeroChecker : public Checker< check::PreStmt<BinaryOperator> > {

mutable std::unique_ptr<BuiltinBug> BT; mutable std::unique_ptr<BugType> BT;

void reportBug(const char *Msg, ProgramStateRef StateZero, CheckerContext &C, mutable std::unique_ptr<BugType> TaintBT;

std::unique_ptr<BugReporterVisitor> Visitor = nullptr) const; void reportBug(StringRef Msg, ProgramStateRef StateZero,

CheckerContext &C) const;

void reportTaintBug(StringRef Msg, ProgramStateRef StateZero,

CheckerContext &C,

llvm::ArrayRef<SymbolRef> TaintedSyms) const;

public: public:

void checkPreStmt(const BinaryOperator *B, CheckerContext &C) const; void checkPreStmt(const BinaryOperator *B, CheckerContext &C) const;

}; };

} // end anonymous namespace } // end anonymous namespace

static const Expr *getDenomExpr(const ExplodedNode *N) { static const Expr *getDenomExpr(const ExplodedNode *N) {

const Stmt *S = N->getLocationAs<PreStmt>()->getStmt(); const Stmt *S = N->getLocationAs<PreStmt>()->getStmt();

if (const auto *BE = dyn_cast<BinaryOperator>(S)) if (const auto *BE = dyn_cast<BinaryOperator>(S))

return BE->getRHS(); return BE->getRHS();

return nullptr; return nullptr;

} }

void DivZeroChecker::reportBug( void DivZeroChecker::reportBug(StringRef Msg, ProgramStateRef StateZero,

const char *Msg, ProgramStateRef StateZero, CheckerContext &C, CheckerContext &C) const {

steakhal (inline comment, marked Done):

It feels odd to have both `const char*` and a `std::string` on the same line. Should we update `const char*` to a more sophisticated type?

I'm thinking of `StringRef`. It seems like that type should be used for the `Category` as well, since the `PathSensitiveBugReport` constructor takes that, so we don't need to have an owning type here.

std::unique_ptr<BugReporterVisitor> Visitor) const {

if (ExplodedNode *N = C.generateErrorNode(StateZero)) { if (ExplodedNode *N = C.generateErrorNode(StateZero)) {

if (!BT) if (!BT)

BT.reset(new BuiltinBug(this, "Division by zero")); BT.reset(new BugType(this, "Division by zero", categories::LogicError));

auto R = std::make_unique<PathSensitiveBugReport>(*BT, Msg, N); auto R = std::make_unique<PathSensitiveBugReport>(*BT, Msg, N);

R->addVisitor(std::move(Visitor));

bugreporter::trackExpressionValue(N, getDenomExpr(N), *R); bugreporter::trackExpressionValue(N, getDenomExpr(N), *R);

C.emitReport(std::move(R)); C.emitReport(std::move(R));

} }

void DivZeroChecker::reportTaintBug(

StringRef Msg, ProgramStateRef StateZero, CheckerContext &C,

llvm::ArrayRef<SymbolRef> TaintedSyms) const {

steakhalUnsubmitted

Done

CheckerContext &C,

- std::vector<SymbolRef> TaintedSyms) const {

+ llvm::ArrayRef<SymbolRef> TaintedSyms) const {

if (ExplodedNode *N = C.generateErrorNode(StateZero)) {

steakhal:

if (ExplodedNode *N = C.generateErrorNode(StateZero)) {

if (!TaintBT)

TaintBT.reset(

new BugType(this, "Division by zero", categories::TaintedData));

auto R = std::make_unique<PathSensitiveBugReport>(*TaintBT, Msg, N);

bugreporter::trackExpressionValue(N, getDenomExpr(N), *R);

for (auto Sym : TaintedSyms)

R->markInteresting(Sym);

C.emitReport(std::move(R));

}

void DivZeroChecker::checkPreStmt(const BinaryOperator *B, void DivZeroChecker::checkPreStmt(const BinaryOperator *B,

CheckerContext &C) const { CheckerContext &C) const {

BinaryOperator::Opcode Op = B->getOpcode(); BinaryOperator::Opcode Op = B->getOpcode();

if (Op != BO_Div && if (Op != BO_Div &&

Op != BO_Rem && Op != BO_Rem &&

Op != BO_DivAssign && Op != BO_DivAssign &&

Op != BO_RemAssign) Op != BO_RemAssign)

return; return;

Show All 14 Lines void DivZeroChecker::checkPreStmt(const BinaryOperator *B,

ProgramStateRef stateNotZero, stateZero; ProgramStateRef stateNotZero, stateZero;

std::tie(stateNotZero, stateZero) = CM.assumeDual(C.getState(), *DV); std::tie(stateNotZero, stateZero) = CM.assumeDual(C.getState(), *DV);

if (!stateNotZero) { if (!stateNotZero) {

assert(stateZero); assert(stateZero);

reportBug("Division by zero", stateZero, C); reportBug("Division by zero", stateZero, C);

return; return;

} }

bool TaintedD = isTainted(C.getState(), *DV); if ((stateNotZero && stateZero)) {

if ((stateNotZero && stateZero && TaintedD)) { std::vector<SymbolRef> taintedSyms = getTaintedSymbols(C.getState(), *DV);

reportBug("Division by a tainted value, possibly zero", stateZero, C, if (!taintedSyms.empty()) {

steakhalUnsubmitted

Done

if ((stateNotZero && stateZero)) {

- std::vector<SymbolRef> taintedSyms = getTaintedSymbols(C.getState(), *DV);

- if (!taintedSyms.empty()) {

+ if (isTainted(C.getState(), *DV)) {

reportTaintBug("Division by a tainted value, possibly zero", stateZero, C,

steakhal:

dkruppAuthorUnsubmitted

Done

We cannot get rid off the getTaintedSymbols() call, as we need to pass all tainted symbols to reportTaintBug if we want to track back multiple variables. taintedSyms is a parameter of reportTaintBug(...)

steakhalUnsubmitted

Done

Yes, makes sense. mb.
One more thing: if reportTaintBug() takes the taintedSyms vector "by-value", you should express your intent by std::move()-ing your collection, showing that it's meant to be consumed instead of copied.
Otherwise, you could express this intent if reportTaintBug() takes a view type for the collection, such as llvm::ArrayRef<SymbolRef>, which would neither copy nor move from the callsite's vector, being more performant and expressive.

I get that this vector is small and bugreport construction will dominate the runtime anyway so I'm not complaining about this specific case, I'm just noting this for the next time. So, here I'm not expecting any actions.
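The two options mentioned above can be sketched outside of LLVM roughly as follows. This is only an illustration: `IntView` is a hypothetical stand-in for a non-owning view such as llvm::ArrayRef, and `consumeByValue` / `sumThroughView` are made-up names, not functions from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a non-owning view such as llvm::ArrayRef<T>:
// it stores a pointer and a length and never copies the elements.
struct IntView {
  const int *Data;
  std::size_t Size;
  IntView(const std::vector<int> &V) : Data(V.data()), Size(V.size()) {}
};

// Option 1: a sink that takes ownership. A caller that no longer needs its
// vector passes it with std::move() so that no copy is made.
std::size_t consumeByValue(std::vector<int> Syms) { return Syms.size(); }

// Option 2: a reader that only inspects the elements through a view.
// Nothing is copied or moved; the caller's vector stays intact.
int sumThroughView(IntView Syms) {
  int Sum = 0;
  for (std::size_t I = 0; I < Syms.Size; ++I)
    Sum += Syms.Data[I];
  return Sum;
}
```

With the view, the call site simply passes the vector (`sumThroughView(V)`); with the by-value sink, the call site spells out the hand-over (`consumeByValue(std::move(V))`).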

dkruppAuthorUnsubmitted

Done

Fixed as suggested. thanks.

std::make_unique<taint::TaintBugVisitor>(*DV)); reportTaintBug("Division by a tainted value, possibly zero", stateZero, C,

taintedSyms);

return; return;

} }

}

steakhalUnsubmitted

Done

return;

}

- SymbolRef TSR = isTainted(C.getState(), *DV);

- if ((stateNotZero && stateZero && TSR)) {

- reportBug("Division by a tainted value, possibly zero",

- categories::TaintedData, stateZero, C, TSR);

- return;

- }

+ if ((stateNotZero && stateZero)) {

+ if (SymbolRef TaintedSym = isTainted(C.getState(), *DV)) {

+ reportBug("Division by a tainted value, possibly zero",

+ categories::TaintedData, stateZero, C, TaintedSym);

+ return;

+ }

// If we get here, then the denom should not be zero. We abandon the implicit

// If we get here, then the denom should not be zero. We abandon the implicit // If we get here, then the denom should not be zero. We abandon the implicit

// zero denom case for now. // zero denom case for now.

C.addTransition(stateNotZero); C.addTransition(stateNotZero);

} }

void ento::registerDivZeroChecker(CheckerManager &mgr) { void ento::registerDivZeroChecker(CheckerManager &mgr) {

mgr.registerChecker<DivZeroChecker>(); mgr.registerChecker<DivZeroChecker>();

} }

bool ento::shouldRegisterDivZeroChecker(const CheckerManager &mgr) { bool ento::shouldRegisterDivZeroChecker(const CheckerManager &mgr) {

return true; return true;

} }

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp

#include "clang/StaticAnalyzer/Checkers/Taint.h" #include "clang/StaticAnalyzer/Checkers/Taint.h"

#include "clang/StaticAnalyzer/Core/BugReporter/BugType.h" #include "clang/StaticAnalyzer/Core/BugReporter/BugType.h"

#include "clang/StaticAnalyzer/Core/Checker.h" #include "clang/StaticAnalyzer/Core/Checker.h"

#include "clang/StaticAnalyzer/Core/CheckerManager.h" #include "clang/StaticAnalyzer/Core/CheckerManager.h"

#include "clang/StaticAnalyzer/Core/PathSensitive/CallDescription.h" #include "clang/StaticAnalyzer/Core/PathSensitive/CallDescription.h"

#include "clang/StaticAnalyzer/Core/PathSensitive/CallEvent.h" #include "clang/StaticAnalyzer/Core/PathSensitive/CallEvent.h"

#include "clang/StaticAnalyzer/Core/PathSensitive/CheckerContext.h" #include "clang/StaticAnalyzer/Core/PathSensitive/CheckerContext.h"

#include "clang/StaticAnalyzer/Core/PathSensitive/ProgramStateTrait.h" #include "clang/StaticAnalyzer/Core/PathSensitive/ProgramStateTrait.h"

#include "llvm/ADT/StringExtras.h"

#include "llvm/Support/YAMLTraits.h" #include "llvm/Support/YAMLTraits.h"

#include <limits> #include <limits>

#include <memory> #include <memory>

#include <optional> #include <optional>

#include <utility> #include <utility>

#include <vector>

#define DEBUG_TYPE "taint-checker" #define DEBUG_TYPE "taint-checker"

using namespace clang; using namespace clang;

using namespace ento; using namespace ento;

using namespace taint; using namespace taint;

using llvm::ImmutableSet; using llvm::ImmutableSet;

if (D->getName().contains("stdin") && D->isExternC()) {

if (Ty->isPointerType()) if (Ty->isPointerType())

return Ty->getPointeeType() == FILETy; return Ty->getPointeeType() == FILETy;

} }

return false; return false;

} }

SVal getPointeeOf(const CheckerContext &C, Loc LValue) { SVal getPointeeOf(ProgramStateRef State, Loc LValue) {

const QualType ArgTy = LValue.getType(C.getASTContext()); const QualType ArgTy = LValue.getType(State->getStateManager().getContext());

if (!ArgTy->isPointerType() || !ArgTy->getPointeeType()->isVoidType()) if (!ArgTy->isPointerType() || !ArgTy->getPointeeType()->isVoidType())

return C.getState()->getSVal(LValue); return State->getSVal(LValue);

// Do not dereference void pointers. Treat them as byte pointers instead. // Do not dereference void pointers. Treat them as byte pointers instead.

// FIXME: we might want to consider more than just the first byte. // FIXME: we might want to consider more than just the first byte.

return C.getState()->getSVal(LValue, C.getASTContext().CharTy); return State->getSVal(LValue, State->getStateManager().getContext().CharTy);

} }

/// Given a pointer/reference argument, return the value it refers to. /// Given a pointer/reference argument, return the value it refers to.

std::optional<SVal> getPointeeOf(const CheckerContext &C, SVal Arg) { std::optional<SVal> getPointeeOf(ProgramStateRef State, SVal Arg) {

if (auto LValue = Arg.getAs<Loc>()) if (auto LValue = Arg.getAs<Loc>())

steakhalUnsubmitted

Done

BTW I don't know but State->getStateManager().getContext() can give you an ASTContext. And we tend to not put const to variable declarations. See readability-avoid-const-params-in-decls

In other places we tend to refer to ASTContext by the ACtx I think.
We also prefer const refs over mutable refs. Is the mutable ref justified for this case?

dkruppAuthorUnsubmitted

Done

Thanks for the suggestion. I took out ASTContext from the signature.

return getPointeeOf(C, *LValue); return getPointeeOf(State, *LValue);

return std::nullopt; return std::nullopt;

} }

/// Given a pointer, return the SVal of its pointee or if it is tainted, /// Given a pointer, return the SVal of its pointee or if it is tainted,

/// otherwise return the pointer's SVal if tainted. /// otherwise return the pointer's SVal if tainted.

/// Also considers stdin as a taint source. /// Also considers stdin as a taint source.

std::optional<SVal> getTaintedPointeeOrPointer(const CheckerContext &C, std::optional<SVal> getTaintedPointeeOrPointer(ProgramStateRef State,

SVal Arg) { SVal Arg) {

steakhalUnsubmitted

Done

I'm not sure about passing both the CheckerContext and the State.
The CheckerContext already encapsulates a State, which opens up the possibilities for misuse.

For example, the getPointeeOf() is called in this function, and that will eventually call C.getState() under the hood. So, for me it feels like a bad API design.
What we could do instead, is to pass an ASTContext and a State; resolving this discrepancy.

Could you please check if this is a real concern?
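A toy model of the concern, using made-up types (`ToyState`, `ToyContext`) rather than the analyzer's real classes: when a helper receives both a context that already holds a state and a separate state argument, it can silently read the wrong one.

```cpp
#include <cassert>

// Toy stand-ins, not the analyzer's CheckerContext / ProgramStateRef.
struct ToyState {
  int TaintedCount;
};

struct ToyContext {
  ToyState Current;
  const ToyState &getState() const { return Current; }
};

// Misuse-prone signature: the helper receives two states (one inside the
// context, one explicit) and here reads the one stored in the context,
// ignoring the caller's updated State argument.
int countTaintedAmbiguous(const ToyContext &C, const ToyState &State) {
  (void)State; // the explicit argument is silently ignored
  return C.getState().TaintedCount;
}

// Single source of truth: only the state is passed.
int countTainted(const ToyState &State) { return State.TaintedCount; }
```

If the caller has already derived a new state with two tainted values while the context still holds the old one, the ambiguous helper reports 0 and the single-argument helper reports 2.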

const ProgramStateRef State = C.getState(); if (auto Pointee = getPointeeOf(State, Arg))

if (auto Pointee = getPointeeOf(C, Arg))

if (isTainted(State, *Pointee)) // FIXME: isTainted(...) ? Pointee : None; if (isTainted(State, *Pointee)) // FIXME: isTainted(...) ? Pointee : None;

return Pointee; return Pointee;

if (isTainted(State, Arg)) if (isTainted(State, Arg))

return Arg; return Arg;

return std::nullopt;

}

// FIXME: This should be done by the isTainted() API. bool isTaintedOrPointsToTainted(ProgramStateRef State, SVal ExprSVal) {

if (isStdin(Arg, C.getASTContext())) return getTaintedPointeeOrPointer(State, ExprSVal).has_value();

return Arg; }

return std::nullopt; /// Helps in printing taint diagnostics.

/// Marks the incoming parameters of a function interesting (to be printed)

/// when the return value, or the outgoing parameters are tainted.

const NoteTag *taintOriginTrackerTag(CheckerContext &C,

steakhalUnsubmitted

Done

My bad. In LLVM style we use UpperCamelCase for variable names.

std::vector<SymbolRef> TaintedSymbols,

std::vector<ArgIdxTy> TaintedArgs,

const LocationContext *CallLocation) {

return C.getNoteTag([TaintedSymbols = std::move(TaintedSymbols),

TaintedArgs = std::move(TaintedArgs), CallLocation](

PathSensitiveBugReport &BR) -> std::string {

SmallString<256> Msg;

// We give diagnostics only for taint related reports

if (!BR.isInteresting(CallLocation) ||

BR.getBugType().getCategory() != categories::TaintedData) {

return "";

}

if (TaintedSymbols.empty())

return "Taint originated here";

for (auto Sym : TaintedSymbols) {

steakhalUnsubmitted

Done

Generally, in LLVM style we don't put braces around single-statement blocks unless omitting them would hurt readability, which I don't think applies here.

BR.markInteresting(Sym);

}

for (auto Arg : TaintedArgs) {

LLVM_DEBUG(llvm::dbgs()

<< "Taint Propagated from argument " << Arg + 1 << "\n");

steakhalUnsubmitted

Done

I was also bad with this recommendation.
I think we can now use structured bindings to get the index and value right there, like:
for (auto [Idx, Sym] : llvm::enumerate(TaintedSymbols))
See

}

return "";

});

} }

bool isTaintedOrPointsToTainted(const Expr *E, const ProgramStateRef &State, /// Helps in printing taint diagnostics.

CheckerContext &C) { /// Marks the function interesting (to be printed)

return getTaintedPointeeOrPointer(C, C.getSVal(E)).has_value(); /// when the return value, or the outgoing parameters are tainted.

const NoteTag *taintPropagationExplainerTag(

CheckerContext &C, std::vector<SymbolRef> TaintedSymbols,

std::vector<ArgIdxTy> TaintedArgs, const LocationContext *CallLocation) {

assert(TaintedSymbols.size() == TaintedArgs.size());

steakhalUnsubmitted

Done

This function should be an implementation detail, as such I wonder if we should make it static.

How about naming this function differently?
I'm thinking of taintOriginTrackerTag, IDK.

return C.getNoteTag([TaintedSymbols = std::move(TaintedSymbols),

TaintedArgs = std::move(TaintedArgs), CallLocation](

PathSensitiveBugReport &BR) -> std::string {

SmallString<256> Msg;

steakhalUnsubmitted

Done

If the Call parameter is only used for acquiring the LocationContext, wouldn't it be more descriptive to directly pass the LocationContext to the function instead?
I'm also puzzled that we use getCalleeStackFrame here. I rarely ever see this function, so I'm a bit worried if this pick was intentional. That we pass the 0 as the BlockCount argument only reinforces this instinct.

dkruppAuthorUnsubmitted

Done

Call.getCalleeStackFrame(0) gets the location context of the actual call that we are analyzing (in the pre- or post-call callback), and that's what we need to mark interesting. It is intentionally used like this. I changed the parameter to a LocationContext as you suggested.

steakhalUnsubmitted

Done

Okay.

llvm::raw_svector_ostream Out(Msg);

// We give diagnostics only for taint related reports

if (TaintedSymbols.empty() ||

steakhalUnsubmitted

Done

It should consume the TaintedSymbols and TaintedArgs variables, as such you should std::move out from the original parameters like this:

[TaintedSymbols = std::move(TaintedSymbols), TaintedArgs = std::move(TaintedArgs)](){...}
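A minimal, LLVM-free sketch of the init-capture idiom suggested above. `makeCounter` is a hypothetical helper; the point is only that a by-value parameter is moved into the closure instead of being copied.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: builds a callback that owns its data. The by-value
// parameter is moved into the closure via a C++14 init-capture, so the
// vector's buffer is handed over rather than duplicated.
std::function<std::size_t()> makeCounter(std::vector<std::string> Names) {
  return [Names = std::move(Names)]() { return Names.size(); };
}
```

The returned callback stays valid after the caller's original vector is gone, because the closure owns the moved-in elements.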

BR.getBugType().getCategory() != categories::TaintedData) {

return "";

}

int nofTaintedArgs = 0;

steakhalUnsubmitted

Done

What does the first half of this condition guard against?
Do you have a test for it to demonstrate?

dkruppAuthorUnsubmitted

Done

To only follow taint propagation to function calls which actually result in tainted variables used in the report and not every function which returns a tainted variable.

char* taintDiagnosticPropagation2() is such a test which is failing without this due to giving extra unrelated propagation notes.

for (auto [Idx, Sym] : llvm::enumerate(TaintedSymbols)) {

steakhalUnsubmitted

Done

Same here about structured bindings.

if (BR.isInteresting(Sym)) {

BR.markInteresting(CallLocation);

if (TaintedArgs[Idx] != ReturnValueIndex) {

steakhalUnsubmitted

Done

We usually use operator[] on a vector instead of at(). When in doubt, we assert that idx < size().
Apply this for all .at() uses.
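A small sketch of that convention on two parallel vectors, in plain C++ with made-up names (`argIndexOf`, integer "symbols"); it is not code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper over two parallel vectors: Syms[I] belongs to Args[I].
// The size invariant is asserted once, after which operator[] is used
// instead of the bounds-checked at().
int argIndexOf(const std::vector<int> &Syms, const std::vector<int> &Args,
               int Sym) {
  assert(Syms.size() == Args.size() && "parallel vectors must match");
  for (std::size_t I = 0; I < Syms.size(); ++I)
    if (Syms[I] == Sym)
      return Args[I]; // I < Args.size() follows from the assertion above
  return -1;          // not found
}
```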

LLVM_DEBUG(llvm::dbgs() << "Taint Propagated to argument "

<< TaintedArgs[Idx] + 1 << "\n");

if (nofTaintedArgs == 0)

steakhalUnsubmitted

Done

return "";

}

- if (TaintedSymbols.empty()){

- Out << "Taint originated here";

- return std::string(Out.str());

- }

+ if (TaintedSymbols.empty())

+ return "Taint originated here";

int i = 0;

Out << "Taint propagated to the ";

else

steakhalUnsubmitted

Done

For clang diagnostics we usually use ordinal suffixes like {st,nd,rd,th}. It would be nice to align with the rest of the clang diagnostics on this.
It would require a bit of work on the wording though, I admit.
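For illustration, a standalone sketch of how an ordinal suffix can be computed. `ordinalSuffix` and `ordinalArgument` are hypothetical names written for this example; the patch itself calls llvm::getOrdinalSuffix() from llvm/ADT/StringExtras.h.

```cpp
#include <cassert>
#include <string>

// Standalone sketch of an ordinal-suffix helper (an assumption for
// illustration; the patch uses llvm::getOrdinalSuffix() instead).
std::string ordinalSuffix(unsigned N) {
  unsigned LastTwo = N % 100;
  if (LastTwo >= 11 && LastTwo <= 13)
    return "th"; // 11th, 12th, 13th are irregular
  switch (N % 10) {
  case 1:
    return "st";
  case 2:
    return "nd";
  case 3:
    return "rd";
  default:
    return "th";
  }
}

// Formats a 1-based argument position the way the note text does.
std::string ordinalArgument(unsigned N) {
  return std::to_string(N) + ordinalSuffix(N) + " argument";
}
```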

Out << ", ";

Out << TaintedArgs[Idx] + 1

steakhalUnsubmitted

Done

I believe this branch is uncovered by tests.

dkruppAuthorUnsubmitted

Done

Now it is covered. See the multipleTaintedArgs(..) test in taint-diagnostic-visitor.c

<< llvm::getOrdinalSuffix(TaintedArgs[Idx] + 1) << " argument";

steakhalUnsubmitted

Done

I'd recommend using llvm::interleaveComma() in such cases.
You can probably get rid of nofTaintedArgs as well - by using this function.
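The idea behind that recommendation, sketched without LLVM: a helper emits the separator between elements so the caller needs no counter. `joinWithComma` and `describeArgs` are hypothetical stand-ins written for this example and only approximate what llvm::interleaveComma() offers.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a comma-interleaving helper: calls Each for
// every element and writes ", " between consecutive elements only.
template <typename Container, typename EachFn>
void joinWithComma(const Container &Items, std::ostream &Out, EachFn Each) {
  bool First = true;
  for (const auto &Item : Items) {
    if (!First)
      Out << ", ";
    First = false;
    Each(Item);
  }
}

// Example use: no manual "how many did I print so far" counter is needed.
std::string describeArgs(const std::vector<unsigned> &Args) {
  std::ostringstream Out;
  joinWithComma(Args, Out, [&Out](unsigned A) { Out << "arg " << A; });
  return Out.str();
}
```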

dkruppAuthorUnsubmitted

Done

I chose another solution. I hope that is ok too.

steakhalUnsubmitted

Done

Looks good. I'm not expecting many cases propagating to multiple arguments anyway.

nofTaintedArgs++;

} else {

steakhalUnsubmitted

Done

return std::string(Out.str());

}

- int i = 0;

- for (SymbolRef SR : TaintedSymbols) {

+ for (auto Sym : llvm::enumerate(TaintedSymbols)) {

LLVM_DEBUG(llvm::dbgs() << "Taint Propagated from argument"

- << TaintedArgs.at(i) << "\n");

- BR.markInteresting(SR);

- i++;

- }

+ << TaintedArgs.at(Sym.index()) << "\n");

+ BR.markInteresting(Sym.value());

+ }

return std::string(Out.str());

LLVM_DEBUG(llvm::dbgs() << "Taint Propagated to return value.\n");

Out << "Taint propagated to the return value";

}

steakhalUnsubmitted

Done

I think since you explicitly specify the return type of the lambda, you could omit the spelling of std::string here.

dkruppAuthorUnsubmitted

Done

not sure. Got a "cannot convert raw_svector_ostream::str() from llvm:StringRef" error.
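A likely explanation for that error, sketched with a toy type: if a string-view class only offers an explicit conversion to std::string (assumed here to mirror llvm::StringRef), a `return` statement cannot use it, because returning copy-initializes the result. `ToyStringRef` and `toOwned` are made-up names.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-in for a string view whose conversion to std::string
// is explicit (an assumption mirroring llvm::StringRef).
struct ToyStringRef {
  const char *Data;
  explicit operator std::string() const { return std::string(Data); }
};

std::string toOwned(ToyStringRef S) {
  // return S;           // would not compile: return copy-initializes the
  //                     // result, and explicit conversions are skipped
  return std::string(S); // direct-initialization may use the explicit one
}
```

This is why spelling out `std::string(Out.str())` is still needed even though the lambda declares `-> std::string`.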

return std::string(Out.str());

});

} }

steakhalUnsubmitted

Done

How about calling this taintPropagationExplainerTag?

/// ArgSet is used to describe arguments relevant for taint detection or /// ArgSet is used to describe arguments relevant for taint detection or

/// taint application. A discrete set of argument indexes and a variadic /// taint application. A discrete set of argument indexes and a variadic

/// argument list signified by a starting index are supported. /// argument list signified by a starting index are supported.

class ArgSet { class ArgSet {

public: public:

steakhalUnsubmitted

Done

Please assert that the size of TaintedSymbols must be the same as TaintedArgs. Same in the other function.

ArgSet() = default; ArgSet() = default;

steakhalUnsubmitted

Done

const LocationContext* LC = Call.getCalleeStackFrame(0);

- const NoteTag *InjectionTag = C.getNoteTag(

+ return C.getNoteTag(

[TaintedSymbols, TaintedArgs, LC](PathSensitiveBugReport &BR) -> std::string {

ArgSet(ArgVecTy &&DiscreteArgs, ArgSet(ArgVecTy &&DiscreteArgs,

std::optional<ArgIdxTy> VariadicIndex = std::nullopt) std::optional<ArgIdxTy> VariadicIndex = std::nullopt)

: DiscreteArgs(std::move(DiscreteArgs)), : DiscreteArgs(std::move(DiscreteArgs)),

VariadicIndex(std::move(VariadicIndex)) {} VariadicIndex(std::move(VariadicIndex)) {}

bool contains(ArgIdxTy ArgIdx) const { bool contains(ArgIdxTy ArgIdx) const {

if (llvm::is_contained(DiscreteArgs, ArgIdx)) if (llvm::is_contained(DiscreteArgs, ArgIdx))

return true; return true;

return VariadicIndex && ArgIdx >= *VariadicIndex; return VariadicIndex && ArgIdx >= *VariadicIndex;

} }

bool isEmpty() const { return DiscreteArgs.empty() && !VariadicIndex; } bool isEmpty() const { return DiscreteArgs.empty() && !VariadicIndex; }

private: private:

ArgVecTy DiscreteArgs; ArgVecTy DiscreteArgs;

steakhalUnsubmitted

Done

So, if TaintedSymbols.size() > 1, then the note message will look weird.
Could you please have a test for this?

dkruppAuthorUnsubmitted

Done

Added the multipleTaintedArgs() test. I could not provoke the multi-argument message as we only track back one tainted symbol now.

std::optional<ArgIdxTy> VariadicIndex; std::optional<ArgIdxTy> VariadicIndex;

}; };

/// A struct used to specify taint propagation rules for a function. /// A struct used to specify taint propagation rules for a function.

/// ///

/// If any of the possible taint source arguments is tainted, all of the /// If any of the possible taint source arguments is tainted, all of the

/// destination arguments should also be tainted. If ReturnValueIndex is added /// destination arguments should also be tainted. If ReturnValueIndex is added

/// to the dst list, the return value will be tainted. /// to the dst list, the return value will be tainted.

class GenericTaintRule { class GenericTaintRule {

/// Arguments which are taints sinks and should be checked, and a report /// Arguments which are taints sinks and should be checked, and a report

/// should be emitted if taint reaches these. /// should be emitted if taint reaches these.

ArgSet SinkArgs; ArgSet SinkArgs;

/// Arguments which should be sanitized on function return. /// Arguments which should be sanitized on function return.

ArgSet FilterArgs; ArgSet FilterArgs;

/// Arguments which can participate in taint propagationa. If any of the /// Arguments which can participate in taint propagation. If any of the

/// arguments in PropSrcArgs is tainted, all arguments in PropDstArgs should /// arguments in PropSrcArgs is tainted, all arguments in PropDstArgs should

/// be tainted. /// be tainted.

ArgSet PropSrcArgs; ArgSet PropSrcArgs;

ArgSet PropDstArgs; ArgSet PropDstArgs;

/// A message that explains why the call is sensitive to taint. /// A message that explains why the call is sensitive to taint.

std::optional<StringRef> SinkMsg; std::optional<StringRef> SinkMsg;

public:

void printState(raw_ostream &Out, ProgramStateRef State, const char *NL, void printState(raw_ostream &Out, ProgramStateRef State, const char *NL,

const char *Sep) const override; const char *Sep) const override;

/// Generate a report if the expression is tainted or points to tainted data. /// Generate a report if the expression is tainted or points to tainted data.

bool generateReportIfTainted(const Expr *E, StringRef Msg, bool generateReportIfTainted(const Expr *E, StringRef Msg,

CheckerContext &C) const; CheckerContext &C) const;

private: private:

const BugType BT{this, "Use of Untrusted Data", "Untrusted Data"}; const BugType BT{this, "Use of Untrusted Data", categories::TaintedData};

bool checkUncontrolledFormatString(const CallEvent &Call, bool checkUncontrolledFormatString(const CallEvent &Call,

CheckerContext &C) const; CheckerContext &C) const;

void taintUnsafeSocketProtocol(const CallEvent &Call, void taintUnsafeSocketProtocol(const CallEvent &Call,

CheckerContext &C) const; CheckerContext &C) const;

/// Default taint rules are initilized with the help of a CheckerContext to /// Default taint rules are initalized with the help of a CheckerContext to

/// access the names of built-in functions like memcpy. /// access the names of built-in functions like memcpy.

void initTaintRules(CheckerContext &C) const; void initTaintRules(CheckerContext &C) const;

/// CallDescription currently cannot restrict matches to the global namespace /// CallDescription currently cannot restrict matches to the global namespace

/// only, which is why multiple CallDescriptionMaps are used, as we want to /// only, which is why multiple CallDescriptionMaps are used, as we want to

/// disambiguate global C functions from functions inside user-defined /// disambiguate global C functions from functions inside user-defined

/// namespaces. /// namespaces.

// TODO: Remove separation to simplify matching logic once CallDescriptions // TODO: Remove separation to simplify matching logic once CallDescriptions

void GenericTaintChecker::checkPostCall(const CallEvent &Call,

LLVM_DEBUG(for (ArgIdxTy I LLVM_DEBUG(for (ArgIdxTy I

: *TaintArgs) { : *TaintArgs) {

llvm::dbgs() << "PostCall<"; llvm::dbgs() << "PostCall<";

Call.dump(llvm::dbgs()); Call.dump(llvm::dbgs());

llvm::dbgs() << "> actually wants to taint arg index: " << I << '\n'; llvm::dbgs() << "> actually wants to taint arg index: " << I << '\n';

}); });

const NoteTag *InjectionTag = nullptr;

std::vector<SymbolRef> TaintedSymbols;

std::vector<ArgIdxTy> TaintedIndexes;

for (ArgIdxTy ArgNum : *TaintArgs) { for (ArgIdxTy ArgNum : *TaintArgs) {

// Special handling for the tainted return value. // Special handling for the tainted return value.

if (ArgNum == ReturnValueIndex) { if (ArgNum == ReturnValueIndex) {

State = addTaint(State, Call.getReturnValue()); State = addTaint(State, Call.getReturnValue());

std::vector<SymbolRef> TaintedSyms =

getTaintedSymbols(State, Call.getReturnValue());

steakhalUnsubmitted

Done

State = addTaint(State, Call.getReturnValue());

- SymbolRef TaintedSym = isTainted(State, Call.getReturnValue());

- if (TaintedSym) {

+ if (SymbolRef TaintedSym = isTainted(State, Call.getReturnValue())) {

TaintedSymbols.push_back(TaintedSym);

We tend to fuse such declarations:
I've seen other cases like this elsewhere. Please check.
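A plain C++ sketch of what "fusing" the declaration into the condition looks like, including the C++17 if-with-initializer form that appears in later suggestions in this review. `findArgIndex` and its callers are hypothetical helpers, not analyzer code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical lookup that returns nullptr when the name is unknown.
const int *findArgIndex(const std::map<std::string, int> &Table,
                        const std::string &Name) {
  auto It = Table.find(Name);
  return It == Table.end() ? nullptr : &It->second;
}

int argIndexOrDefault(const std::map<std::string, int> &Table,
                      const std::string &Name) {
  // Declaration fused into the condition: Idx only exists inside the if.
  if (const int *Idx = findArgIndex(Table, Name))
    return *Idx;
  return -1;
}

bool hasPositiveIndex(const std::map<std::string, int> &Table,
                      const std::string &Name) {
  // C++17 if-with-initializer: declare first, then test a richer condition.
  if (const int *Idx = findArgIndex(Table, Name); Idx && *Idx > 0)
    return true;
  return false;
}
```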

if (!TaintedSyms.empty()) {

TaintedSymbols.push_back(TaintedSyms[0]);

TaintedIndexes.push_back(ArgNum);

}

steakhalUnsubmitted

Done

I cannot see a test against the "Strange" string. Is this dead code?
Same for the other block.

dkruppAuthorUnsubmitted

Done

It was debugging code, which I removed. I noticed that in some cases (e.g. if the argument pointer is pointing to an unknown area) we don't get back a tainted symbol even though we call addTaint() on the arg/return value.

steakhalUnsubmitted

Done

State = addTaint(State, Call.getReturnValue());

- std::vector<SymbolRef> TaintedSyms =

- getTaintedSymbols(State, Call.getReturnValue());

- if (!TaintedSyms.empty()) {

- TaintedSymbols.push_back(TaintedSyms[0]);

+ if (SymbolRef RetSym = Call.getReturnValue().getAsSymbol(); RetSym && isTainted(State, RetSym)) {

+ TaintedSymbols.push_back(RetSym);

TaintedIndexes.push_back(ArgNum);

}

continue;

dkruppAuthorUnsubmitted

Done

I think this suggested solution would not be correct here, as ArgSym might not be the actual _tainted_ symbol (inside a more complex expression).

So I would prefer to leave it like this for correctness.

continue; continue;

} }

// The arguments are pointer arguments. The data they are pointing at is // The arguments are pointer arguments. The data they are pointing at is

// tainted after the call. // tainted after the call.

if (auto V = getPointeeOf(C, Call.getArgSVal(ArgNum))) if (auto V = getPointeeOf(State, Call.getArgSVal(ArgNum))) {

State = addTaint(State, *V); State = addTaint(State, *V);

std::vector<SymbolRef> TaintedSyms = getTaintedSymbols(State, *V);

if (!TaintedSyms.empty()) {

TaintedSymbols.push_back(TaintedSyms[0]);

TaintedIndexes.push_back(ArgNum);

steakhalUnsubmitted

Done

You could iterate over the symbol dependencies of the SymExpr (of the *V SVal).

SymbolRef PointeeAsSym = V->getAsSymbol();
// eee, can it be null? Sure it can. See isTainted(Region),... for those cases we would need to descend and check their symbol dependencies.
for (SymbolRef SubSym : llvm::make_range(PointeeAsSym->symbol_begin(), PointeeAsSym->symbol_end())) {
  // TODO: check each if it's also tainted, and update the `TaintedSymbols` accordingly, IDK.
}

Something like this should work for most cases (except when *V refers to a tainted region instead of a symbol), I think.

dkruppAuthorUnsubmitted

Done

I implemented a new function getTaintedSymbols(..) in Taint.cpp which returns all tainted symbols for a complex expr, SVal etc. With this addition, we can now track back multiple tainted symbols reaching a sink.
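To make the idea concrete, here is a toy model of collecting every tainted leaf of a compound expression. It is not the implementation in Taint.cpp: `ToyExpr`, `collectTainted` and `taintedLeaves` are invented for this sketch, and taint is modelled as a plain set of symbol names.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Toy expression tree (not the analyzer's SymExpr): leaves carry a symbol
// name, inner nodes only carry operands.
struct ToyExpr {
  std::string Name;
  std::vector<std::shared_ptr<ToyExpr>> Operands;
};

// Depth-first walk that appends every tainted leaf symbol to Out.
void collectTainted(const ToyExpr &E, const std::set<std::string> &Tainted,
                    std::vector<std::string> &Out) {
  if (E.Operands.empty()) {
    if (Tainted.count(E.Name))
      Out.push_back(E.Name);
    return;
  }
  for (const auto &Op : E.Operands)
    collectTainted(*Op, Tainted, Out);
}

// Returns all tainted leaves, not just the first one found.
std::vector<std::string> taintedLeaves(const ToyExpr &E,
                                       const std::set<std::string> &Tainted) {
  std::vector<std::string> Out;
  collectTainted(E, Tainted, Out);
  return Out;
}
```

In this model the top-level expression has no symbol name of its own, so asking only for "the symbol of the expression" would miss the tainted operands; walking the operands finds each of them.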

} }

steakhalUnsubmitted

Done

State = addTaint(State, *V);

- std::vector<SymbolRef> TaintedSyms = getTaintedSymbols(State, *V);

- if (!TaintedSyms.empty()) {

- TaintedSymbols.push_back(TaintedSyms[0]);

+ if (SymbolRef ArgSym = V->getAsSymbol(); ArgSym && isTainted(State, ArgSym)) {

+ TaintedSymbols.push_back(ArgSym);

TaintedIndexes.push_back(ArgNum);

}

// Create a NoteTag callback, which prints to the user where the taintedness

In these cases, the code would acquire all the tainted subsymbols, which we then throw away, keeping only the first one.
This is why I suggested the approach I did in my last review.

dkruppAuthorUnsubmitted

Done

I think this suggested solution would not be correct here, as ArgSym might not be the actual _tainted_ symbol (inside a more complex expression).

So I would prefer to leave it like this for correctness.

steakhalUnsubmitted

Done

Okay, I also checked out the code and verified this. Indeed we would have failing tests with my recommendation.
I still think it's suboptimal. This is somewhat related to the tainted API. It shouldn't make sense to have tainted regions in the first place, which bites here again. Let's keep it as-is, no actions are required.

}

// Create a NoteTag callback, which prints to the user where the taintedness

// was propagated to.

InjectionTag = taintPropagationExplainerTag(C, TaintedSymbols, TaintedIndexes,

Call.getCalleeStackFrame(0));

// Clear up the taint info from the state. // Clear up the taint info from the state.

State = State->remove<TaintArgsOnPostVisit>(CurrentFrame); State = State->remove<TaintArgsOnPostVisit>(CurrentFrame);

C.addTransition(State); C.addTransition(State, InjectionTag);

} }

void GenericTaintChecker::printState(raw_ostream &Out, ProgramStateRef State, void GenericTaintChecker::printState(raw_ostream &Out, ProgramStateRef State,

const char *NL, const char *Sep) const { const char *NL, const char *Sep) const {

printTaint(State, Out, NL, Sep); printTaint(State, Out, NL, Sep);

} }

void GenericTaintRule::process(const GenericTaintChecker &Checker, void GenericTaintRule::process(const GenericTaintChecker &Checker,

const CallEvent &Call, CheckerContext &C) const { const CallEvent &Call, CheckerContext &C) const {

ProgramStateRef State = C.getState(); ProgramStateRef State = C.getState();

const ArgIdxTy CallNumArgs = fromArgumentCount(Call.getNumArgs()); const ArgIdxTy CallNumArgs = fromArgumentCount(Call.getNumArgs());

/// Iterate every call argument, and get their corresponding Expr and SVal. /// Iterate every call argument, and get their corresponding Expr and SVal.

const auto ForEachCallArg = [&C, &Call, CallNumArgs](auto &&Fun) { const auto ForEachCallArg = [&C, &Call, CallNumArgs](auto &&Fun) {

for (ArgIdxTy I = ReturnValueIndex; I < CallNumArgs; ++I) { for (ArgIdxTy I = ReturnValueIndex; I < CallNumArgs; ++I) {

const Expr *E = GetArgExpr(I, Call); const Expr *E = GetArgExpr(I, Call);

Fun(I, E, C.getSVal(E)); Fun(I, E, C.getSVal(E));

} }

}; };

/// Check for taint sinks. /// Check for taint sinks.

ForEachCallArg([this, &Checker, &C, &State](ArgIdxTy I, const Expr *E, SVal) { ForEachCallArg([this, &Checker, &C, &State](ArgIdxTy I, const Expr *E, SVal) {

if (SinkArgs.contains(I) && isTaintedOrPointsToTainted(E, State, C)) // Add taintedness to stdin parameters

if (isStdin(C.getSVal(E), C.getASTContext())) {

State = addTaint(State, C.getSVal(E));

}

steakhalUnsubmitted

Done

I just want to get it confirmed that this hunk is unrelated to your change per se. Is it?
BTW I don't mind this change being part of this patch, rather the opposite. Finally, we will have it.

dkruppAuthorUnsubmitted

Done

It is related in the sense that the isTainted() call did not return a valid SymbolRef for stdin unless we made stdin tainted when we first see it. That caused a test case to fail as it was before. Now stdin is handled similarly to other tainted symbols.

if (SinkArgs.contains(I) && isTaintedOrPointsToTainted(State, C.getSVal(E)))

Checker.generateReportIfTainted(E, SinkMsg.value_or(MsgCustomSink), C); Checker.generateReportIfTainted(E, SinkMsg.value_or(MsgCustomSink), C);

}); });

/// Check for taint filters. /// Check for taint filters.

ForEachCallArg([this, &C, &State](ArgIdxTy I, const Expr *E, SVal S) { ForEachCallArg([this, &State](ArgIdxTy I, const Expr *E, SVal S) {

steakhalUnsubmitted

Done

/// Check for taint filters.

- ForEachCallArg([this, &C, &State](ArgIdxTy I, const Expr *E, SVal S) {

+ ForEachCallArg([this, &State](ArgIdxTy I, const Expr *E, SVal S) {

if (FilterArgs.contains(I)) {

This unused variable generates a compiler warning.

steakhal: This unused variable generates a compiler warning.

if (FilterArgs.contains(I)) { if (FilterArgs.contains(I)) {

State = removeTaint(State, S); State = removeTaint(State, S);

if (auto P = getPointeeOf(C, S)) if (auto P = getPointeeOf(State, S))

State = removeTaint(State, *P); State = removeTaint(State, *P);

} }

}); });

/// Check for taint propagation sources. /// Check for taint propagation sources.

/// A rule is relevant if PropSrcArgs is empty, or if any of its signified /// A rule is relevant if PropSrcArgs is empty, or if any of its signified

/// args are tainted in context of the current CallEvent. /// args are tainted in context of the current CallEvent.

bool IsMatching = PropSrcArgs.isEmpty(); bool IsMatching = PropSrcArgs.isEmpty();

ForEachCallArg( std::vector<SymbolRef> TaintedSymbols;

[this, &C, &IsMatching, &State](ArgIdxTy I, const Expr *E, SVal) { std::vector<ArgIdxTy> TaintedIndexes;

IsMatching = IsMatching || (PropSrcArgs.contains(I) && ForEachCallArg([this, &C, &IsMatching, &State, &TaintedSymbols,

isTaintedOrPointsToTainted(E, State, C)); &TaintedIndexes](ArgIdxTy I, const Expr *E, SVal) {

std::optional<SVal> TaintedSVal =

getTaintedPointeeOrPointer(State, C.getSVal(E));

IsMatching =

IsMatching || (PropSrcArgs.contains(I) && TaintedSVal.has_value());

steakhalUnsubmitted

Done

&TaintedIndexes](ArgIdxTy I, const Expr *E, SVal) {

- IsMatching =

- IsMatching || (PropSrcArgs.contains(I) &&

- isTaintedOrPointsToTainted(State, C.getSVal(E)));

- std::optional<SVal> TaintedSVal =

+ std::optional<SVal> TaintedSVal =

getTaintedPointeeOrPointer(State, C.getSVal(E));

+ IsMatching |= (PropSrcArgs.contains(I) && TaintedSVal.has_value());

// We track back tainted arguments except for stdin

Here getTaintedPointeeOrPointer would be called two times, unnecessarily.

steakhal: Here `getTaintedPointeeOrPointer` would be called two times, unnecessarily.
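
The point about the duplicated lookup can be shown outside the analyzer. The following is a minimal sketch, assuming a hypothetical `lookupTainted` that stands in for `getTaintedPointeeOrPointer` and counts its own invocations; none of these names come from the patch. The result is computed once per argument and reused for both the matching flag and the collected values.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for an expensive lookup such as
// getTaintedPointeeOrPointer(); it counts how often it is called.
static int LookupCalls = 0;

std::optional<int> lookupTainted(int Arg) {
  ++LookupCalls;
  if (Arg < 0) // for illustration, negative values count as "tainted"
    return Arg;
  return std::nullopt;
}

// Performs the lookup once per argument and reuses the result both for the
// "rule is matching" flag and for collecting the tainted values.
bool collectTainted(const std::vector<int> &Args, std::vector<int> &Tainted) {
  bool IsMatching = false;
  for (int Arg : Args) {
    std::optional<int> TaintedVal = lookupTainted(Arg); // single call
    IsMatching |= TaintedVal.has_value();
    if (TaintedVal)
      Tainted.push_back(*TaintedVal);
  }
  return IsMatching;
}
```

With four arguments the counter ends at four, one lookup per argument, regardless of how many of them are tainted.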

// We track back tainted arguments except for stdin

if (TaintedSVal && !isStdin(*TaintedSVal, C.getASTContext())) {

std::vector<SymbolRef> TaintedArgSyms =

getTaintedSymbols(State, *TaintedSVal);

if (!TaintedArgSyms.empty()) {

llvm::append_range(TaintedSymbols, TaintedArgSyms);

TaintedIndexes.push_back(I);

}

steakhalUnsubmitted

Done

if (!TaintedArgSyms.empty()) {

- TaintedSymbols.insert(TaintedSymbols.begin(), TaintedArgSyms.begin(),

- TaintedArgSyms.end());

+ llvm::append_range(TaintedSymbols, TaintedArgSyms);

TaintedIndexes.push_back(I);

steakhalUnsubmitted

Done

I observed you didn't take any action on this suggestion.
It leaves me wondering whether this suggestion makes sense in general, or whether there are other reasons that I cannot foresee.
I've seen you use the fully spelled-out version 8 times in total.
Shouldn't we prefer the shorter, more expressive version instead?

steakhal: I observed you didn't take any action on this suggestion. It leaves me wondering if this…

dkruppAuthorUnsubmitted

Done

Sorry, I overlooked this comment. I like this shorter version. It is so much more concise! Changed in all places. Thanks for the suggestion.

dkrupp: Sorry, I overlooked this comment. I like this shorter version. It is so much more concise! Changed…
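
For readers without LLVM at hand, the difference between the two spellings can be sketched with plain standard containers. `appendRange` below is a hypothetical local equivalent of `llvm::append_range`, and `prependRange` mirrors the spelled-out `insert(begin(), ...)` form from the hunk; both helper names are assumptions made for this illustration.

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical local equivalent of llvm::append_range, for illustration:
// appends every element of R to the end of C.
template <typename Container, typename Range>
void appendRange(Container &C, const Range &R) {
  C.insert(C.end(), std::begin(R), std::end(R));
}

// Spelled-out variant as in the original hunk: inserts at the front.
template <typename Container, typename Range>
void prependRange(Container &C, const Range &R) {
  C.insert(C.begin(), std::begin(R), std::end(R));
}
```

Note that the two forms produce different element orders. For a collection of symbols that are only marked as interesting afterwards, the order presumably does not matter, but that is an assumption about how the result is consumed.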

}

}); });

steakhalUnsubmitted

Done

I prefer to move declarations close to their uses.

steakhal: I prefer to move declarations close to their uses.

if (!IsMatching) if (!IsMatching)

return; return;

const auto WouldEscape = [](SVal V, QualType Ty) -> bool { const auto WouldEscape = [](SVal V, QualType Ty) -> bool {

if (!isa<Loc>(V)) if (!isa<Loc>(V))

return false; return false;

Show All 26 Lines ForEachCallArg(

llvm::dbgs() << "> prepares tainting arg index: " << I << '\n'; llvm::dbgs() << "> prepares tainting arg index: " << I << '\n';

}); });

Result = F.add(Result, I); Result = F.add(Result, I);

} }

}); });

if (!Result.isEmpty()) if (!Result.isEmpty())

State = State->set<TaintArgsOnPostVisit>(C.getStackFrame(), Result); State = State->set<TaintArgsOnPostVisit>(C.getStackFrame(), Result);

C.addTransition(State); const NoteTag *InjectionTag = taintOriginTrackerTag(

C, std::move(TaintedSymbols), std::move(TaintedIndexes),

steakhalUnsubmitted

Done

const NoteTag *InjectionTag = taintOriginTrackerTag(

- C, TaintedSymbols, TaintedIndexes, Call.getCalleeStackFrame(0));

+ C, std::move(TaintedSymbols), std::move(TaintedIndexes), Call.getCalleeStackFrame(0));

C.addTransition(State, InjectionTag);
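
The suggested edit hands the two vectors over with `std::move` instead of copying them. A minimal sketch of that ownership transfer, assuming a hypothetical `makeOriginNote` factory in place of `taintOriginTrackerTag` and plain `int` values in place of symbols and argument indexes:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a note-tag factory: takes ownership of the
// collected data and returns a callback that uses it later.
std::function<std::string()> makeOriginNote(std::vector<int> Symbols,
                                            std::vector<int> ArgIndexes) {
  // The by-value parameters are moved into the closure, so the vectors are
  // not copied again when the callback is created.
  return [Syms = std::move(Symbols), Idx = std::move(ArgIndexes)]() {
    return "tracked " + std::to_string(Syms.size()) + " symbol(s) from " +
           std::to_string(Idx.size()) + " argument(s)";
  };
}
```

A caller that no longer needs its local vectors would write `makeOriginNote(std::move(Syms), std::move(Idx))`; the callback then owns the data for as long as it lives.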

Call.getCalleeStackFrame(0));

C.addTransition(State, InjectionTag);

steakhalUnsubmitted

Done

State = State->set<TaintArgsOnPostVisit>(C.getStackFrame(), Result);

- InjectionTag = createTaintPreTag(C,TaintedSymbols,TaintedIndexes,Call);

+ const NoteTag *InjectionTag = createTaintPreTag(C,TaintedSymbols,TaintedIndexes,Call);

C.addTransition(State, InjectionTag);

} }

bool GenericTaintRule::UntrustedEnv(CheckerContext &C) { bool GenericTaintRule::UntrustedEnv(CheckerContext &C) {

return !C.getAnalysisManager() return !C.getAnalysisManager()

.getAnalyzerOptions() .getAnalyzerOptions()

.ShouldAssumeControlledEnvironment; .ShouldAssumeControlledEnvironment;

} }

bool GenericTaintChecker::generateReportIfTainted(const Expr *E, StringRef Msg, bool GenericTaintChecker::generateReportIfTainted(const Expr *E, StringRef Msg,

CheckerContext &C) const { CheckerContext &C) const {

assert(E); assert(E);

std::optional<SVal> TaintedSVal{getTaintedPointeeOrPointer(C, C.getSVal(E))}; std::optional<SVal> TaintedSVal =

getTaintedPointeeOrPointer(C.getState(), C.getSVal(E));

steakhalUnsubmitted

Done

assert(E);

- std::optional<SVal> TaintedSVal{

- getTaintedPointeeOrPointer(C.getState(), C.getSVal(E))};

+ std::optional<SVal> TaintedSVal =

+ getTaintedPointeeOrPointer(C.getState(), C.getSVal(E));

if (!TaintedSVal)

if (!TaintedSVal) if (!TaintedSVal)

return false; return false;

// Generate diagnostic. // Generate diagnostic.

if (ExplodedNode *N = C.generateNonFatalErrorNode()) { if (ExplodedNode *N = C.generateNonFatalErrorNode()) {

auto report = std::make_unique<PathSensitiveBugReport>(BT, Msg, N); auto report = std::make_unique<PathSensitiveBugReport>(BT, Msg, N);

report->addRange(E->getSourceRange()); report->addRange(E->getSourceRange());

report->addVisitor(std::make_unique<TaintBugVisitor>(*TaintedSVal)); for (auto TaintedSym : getTaintedSymbols(C.getState(), *TaintedSVal)) {

report->markInteresting(TaintedSym);

}

C.emitReport(std::move(report)); C.emitReport(std::move(report));

steakhalUnsubmitted

Done

report->addRange(E->getSourceRange());

- SymbolRef TSR = isTainted(C.getState(),*TaintedSVal);

- if (TSR)

+ if (SymbolRef TaintedSym = isTainted(C.getState(),*TaintedSVal))

report->markInteresting(TSR);

C.emitReport(std::move(report));

steakhalUnsubmitted

Done

report->addRange(E->getSourceRange());

- std::vector<SymbolRef> TaintedSyms =

- getTaintedSymbols(C.getState(), *TaintedSVal);

- for (auto TaintedSym : TaintedSyms) {

- report->markInteresting(TaintedSym);

+ for (SymbolRef Sym : getTaintedSymbols(C.getState(), *TaintedSVal)) {

+ report->markInteresting(Sym);

}

C.emitReport(std::move(report));

return true; return true;

} }

return false; return false;

} }

/// TODO: remove checking for printf format attributes and socket whitelisting /// TODO: remove checking for printf format attributes and socket whitelisting

/// from GenericTaintChecker, and that means the following functions: /// from GenericTaintChecker, and that means the following functions:

/// getPrintfFormatArgumentNum, /// getPrintfFormatArgumentNum,

▲ Show 20 Lines • Show All 74 Lines • Show Last 20 Lines

clang/lib/StaticAnalyzer/Checkers/Taint.cpp

Show First 20 Lines • Show All 138 Lines • ▼ Show 20 Lines ProgramStateRef taint::addPartialTaint(ProgramStateRef State,

TaintedSubRegions Regs = SavedRegs ? *SavedRegs : F.getEmptyMap(); TaintedSubRegions Regs = SavedRegs ? *SavedRegs : F.getEmptyMap();

Regs = F.add(Regs, SubRegion, Kind); Regs = F.add(Regs, SubRegion, Kind);

ProgramStateRef NewState = State->set<DerivedSymTaint>(ParentSym, Regs); ProgramStateRef NewState = State->set<DerivedSymTaint>(ParentSym, Regs);

assert(NewState); assert(NewState);

return NewState; return NewState;

} }

bool taint::isTainted(ProgramStateRef State, const Stmt *S, bool taint::isTainted(ProgramStateRef State, const Stmt *S,

const LocationContext *LCtx, TaintTagType Kind) { const LocationContext *LCtx, TaintTagType Kind) {

SVal val = State->getSVal(S, LCtx); return !getTaintedSymbolsImpl(State, S, LCtx, Kind, /*ReturnFirstOnly=*/true)

return isTainted(State, val, Kind); .empty();

steakhalUnsubmitted

Done

const LocationContext *LCtx, TaintTagType Kind) {

- return !getTaintedSymbols(State, S, LCtx, Kind, true).empty();

+ return !getTaintedSymbols(State, S, LCtx, Kind, /*ReturnFirstOnly=*/true).empty();

}

bool taint::isTainted(ProgramStateRef State, SVal V, TaintTagType Kind) {

We usually pass booleans by "name".

steakhal: We usually pass booleans by "name".

} }

steakhalUnsubmitted

Done

TBH I'm not sure I like that we now allocate an unbounded number of times (because getTaintedSymbols() is recursive and returns by value), where we previously did not.

What we could possibly do is to compute the elements of this sequence lazily.
I'm thinking of the llvm::mapped_iterator, but I'm not sure if it's possible to have something like that as a return type as it might encode the map function in the type or something like that.
Anyway, I'm just saying that it would be nice not to do more than necessary, and especially not to allocate a lot of short-lived objects there.

Do you think there is a way to have the cake and eat it too?

I did some investigation, and one could get pretty far in the implementation, and maybe even complete it but it would be really complicated as of now. Maybe we could revisit this subject when we have coroutines.

So, I would suggest having two sets of APIs:

the usual isTainted(.) -> bool
and a getTaintedSymbols(.) -> vector<Sym>

The important point is that the isTainted() version would not eagerly collect all tainted sub-symbols but would return on finding the first one,
while getTaintedSymbols() would eagerly collect all of them, as its name suggests.

Imagine if getTaintedSymbolsImpl() had an extra flag like bool returnAfterFirstMatch. This way isTainted() can call it like that. While in the other case, the parameter would be false, and eagerly collect all symbols.

This is probably the best of both worlds, as it prevents isTainted from doing extra work and if we need to iterate over the tainted symbols, we always iterate over all of them, so doing it lazily wouldn't gain us much in that case anyway.
As a bonus, the user-facing API would be self-descriptive.

WDYT?

steakhal: TBH I'm not sure I like that we now allocate an unbounded number of times (because…

dkruppAuthorUnsubmitted

Done

Good idea. I implemented the early return option in getTaintedSymbols(). This is used now by the isTainted() function.

dkrupp: Good idea. I implemented the early return option in getTaintedSymbols(). This is used now by…

dkruppAuthorUnsubmitted

Done

First I wanted to avoid the getTaintedSymbols()->getTaintedSymbolsImpl() proxy calls as it is too bloated IMHO.
But I see your point that it is safer. So I changed it.

dkrupp: First I wanted to avoid the getTaintedSymbols()->getTaintedSymbolsImpl() proxy calls as it is…
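
The API split agreed on in this thread can be sketched on a toy symbol tree. This is only an illustration under stated assumptions: `Sym`, `getTaintedIdsImpl`, and `getTaintedIds` are hypothetical names, and the real implementation walks analyzer symbols and memory regions rather than a simple tree.

```cpp
#include <cassert>
#include <vector>

// Hypothetical symbol tree: a node may be tainted and may depend on children.
struct Sym {
  int Id;
  bool Tainted;
  std::vector<const Sym *> Children;
};

// Shared implementation. With ReturnFirstOnly set, the traversal stops as
// soon as one tainted symbol has been found.
std::vector<int> getTaintedIdsImpl(const Sym *S, bool ReturnFirstOnly) {
  std::vector<int> Result;
  if (!S)
    return Result;
  if (S->Tainted) {
    Result.push_back(S->Id);
    if (ReturnFirstOnly)
      return Result;
  }
  for (const Sym *Child : S->Children) {
    // Propagate the flag so that recursive calls can also stop early.
    std::vector<int> Sub = getTaintedIdsImpl(Child, ReturnFirstOnly);
    Result.insert(Result.end(), Sub.begin(), Sub.end());
    if (ReturnFirstOnly && !Result.empty())
      return Result;
  }
  return Result;
}

// Boolean query: only needs to know whether any tainted symbol exists.
bool isTainted(const Sym *S) {
  return !getTaintedIdsImpl(S, /*ReturnFirstOnly=*/true).empty();
}

// Collecting query: eagerly gathers every tainted symbol.
std::vector<int> getTaintedIds(const Sym *S) {
  return getTaintedIdsImpl(S, /*ReturnFirstOnly=*/false);
}
```

The two public entry points differ only in the flag they forward, which keeps the early-return logic in one place and makes the call sites self-descriptive.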

bool taint::isTainted(ProgramStateRef State, SVal V, TaintTagType Kind) { bool taint::isTainted(ProgramStateRef State, SVal V, TaintTagType Kind) {

if (SymbolRef Sym = V.getAsSymbol()) return !getTaintedSymbolsImpl(State, V, Kind, /*ReturnFirstOnly=*/true)

return isTainted(State, Sym, Kind); .empty();

if (const MemRegion *Reg = V.getAsRegion())

return isTainted(State, Reg, Kind);

return false;

} }

bool taint::isTainted(ProgramStateRef State, const MemRegion *Reg, bool taint::isTainted(ProgramStateRef State, const MemRegion *Reg,

TaintTagType K) { TaintTagType K) {

if (!Reg) return !getTaintedSymbolsImpl(State, Reg, K, /*ReturnFirstOnly=*/true)

return false; .empty();

}

// Element region (array element) is tainted if either the base or the offset bool taint::isTainted(ProgramStateRef State, SymbolRef Sym, TaintTagType Kind) {

// are tainted. return !getTaintedSymbolsImpl(State, Sym, Kind, /*ReturnFirstOnly=*/true)

if (const ElementRegion *ER = dyn_cast<ElementRegion>(Reg)) .empty();

return isTainted(State, ER->getSuperRegion(), K) || }

isTainted(State, ER->getIndex(), K);

std::vector<SymbolRef> taint::getTaintedSymbols(ProgramStateRef State,

const Stmt *S,

const LocationContext *LCtx,

TaintTagType Kind) {

return getTaintedSymbolsImpl(State, S, LCtx, Kind, /*ReturnFirstOnly=*/false);

}

if (const SymbolicRegion *SR = dyn_cast<SymbolicRegion>(Reg)) std::vector<SymbolRef> taint::getTaintedSymbols(ProgramStateRef State, SVal V,

return isTainted(State, SR->getSymbol(), K); TaintTagType Kind) {

return getTaintedSymbolsImpl(State, V, Kind, /*ReturnFirstOnly=*/false);

}

if (const SubRegion *ER = dyn_cast<SubRegion>(Reg)) std::vector<SymbolRef> taint::getTaintedSymbols(ProgramStateRef State,

return isTainted(State, ER->getSuperRegion(), K); SymbolRef Sym,

TaintTagType Kind) {

return getTaintedSymbolsImpl(State, Sym, Kind, /*ReturnFirstOnly=*/false);

}

return false; std::vector<SymbolRef> taint::getTaintedSymbols(ProgramStateRef State,

const MemRegion *Reg,

TaintTagType Kind) {

return getTaintedSymbolsImpl(State, Reg, Kind, /*ReturnFirstOnly=*/false);

} }

bool taint::isTainted(ProgramStateRef State, SymbolRef Sym, TaintTagType Kind) { std::vector<SymbolRef> taint::getTaintedSymbolsImpl(ProgramStateRef State,

const Stmt *S,

const LocationContext *LCtx,

TaintTagType Kind,

bool returnFirstOnly) {

SVal val = State->getSVal(S, LCtx);

return getTaintedSymbolsImpl(State, val, Kind, returnFirstOnly);

}

std::vector<SymbolRef> taint::getTaintedSymbolsImpl(ProgramStateRef State,

SVal V, TaintTagType Kind,

bool returnFirstOnly) {

if (SymbolRef Sym = V.getAsSymbol())

return getTaintedSymbolsImpl(State, Sym, Kind, returnFirstOnly);

if (const MemRegion *Reg = V.getAsRegion())

return getTaintedSymbolsImpl(State, Reg, Kind, returnFirstOnly);

return {};

}

std::vector<SymbolRef> taint::getTaintedSymbolsImpl(ProgramStateRef State,

const MemRegion *Reg,

TaintTagType K,

bool returnFirstOnly) {

std::vector<SymbolRef> TaintedSymbols;

if (!Reg)

return TaintedSymbols;

// Element region (array element) is tainted if either the base or the offset

// are tainted.

if (const ElementRegion *ER = dyn_cast<ElementRegion>(Reg)) {

std::vector<SymbolRef> TaintedIndex =

getTaintedSymbolsImpl(State, ER->getIndex(), K, returnFirstOnly);

llvm::append_range(TaintedSymbols, TaintedIndex);

if (returnFirstOnly && !TaintedSymbols.empty())

return TaintedSymbols; // return early if needed

steakhalUnsubmitted

Done

if (const ElementRegion *ER = dyn_cast<ElementRegion>(Reg)) {

- SymbolRef TSR;

- if (TSR = isTainted(State, ER->getSuperRegion(), K))

+ if (SymbolRef TaintedSym = isTainted(State, ER->getIndex(), K))

return TSR;

- else if (TSR = isTainted(State, ER->getIndex(), K))

+ if (SymbolRef TaintedSym = isTainted(State, ER->getSuperRegion(), K))

return TSR;

}

if (const SymbolicRegion *SR = dyn_cast<SymbolicRegion>(Reg))

I'd probably swap the two branches, so that the index would be tracked if both the region and the index are tainted.
I also wonder if this edge-case could be tested at all.

steakhal: I'd probably swap the two branches, so that the index would be tracked if both the region and…

std::vector<SymbolRef> TaintedSuperRegion =

getTaintedSymbolsImpl(State, ER->getSuperRegion(), K, returnFirstOnly);

llvm::append_range(TaintedSymbols, TaintedSuperRegion);

if (returnFirstOnly && !TaintedSymbols.empty())

return TaintedSymbols; // return early if needed

}

if (const SymbolicRegion *SR = dyn_cast<SymbolicRegion>(Reg)) {

std::vector<SymbolRef> TaintedRegions =

getTaintedSymbolsImpl(State, SR->getSymbol(), K, returnFirstOnly);

llvm::append_range(TaintedSymbols, TaintedRegions);

if (returnFirstOnly && !TaintedSymbols.empty())

return TaintedSymbols; // return early if needed

}

if (const SubRegion *ER = dyn_cast<SubRegion>(Reg)) {

std::vector<SymbolRef> TaintedSubRegions =

getTaintedSymbolsImpl(State, ER->getSuperRegion(), K, returnFirstOnly);

llvm::append_range(TaintedSymbols, TaintedSubRegions);

if (returnFirstOnly && !TaintedSymbols.empty())

return TaintedSymbols; // return early if needed

}

return TaintedSymbols;

}

std::vector<SymbolRef> taint::getTaintedSymbolsImpl(ProgramStateRef State,

SymbolRef Sym,

TaintTagType Kind,

bool returnFirstOnly) {

std::vector<SymbolRef> TaintedSymbols;

if (!Sym) if (!Sym)

return false; return TaintedSymbols;

// Traverse all the symbols this symbol depends on to see if any are tainted. // Traverse all the symbols this symbol depends on to see if any are tainted.

for (SymExpr::symbol_iterator SI = Sym->symbol_begin(), for (SymExpr::symbol_iterator SI = Sym->symbol_begin(),

SE = Sym->symbol_end(); SE = Sym->symbol_end();

SI != SE; ++SI) { SI != SE; ++SI) {

if (!isa<SymbolData>(*SI)) if (!isa<SymbolData>(*SI))

continue; continue;

if (const TaintTagType *Tag = State->get<TaintMap>(*SI)) { if (const TaintTagType *Tag = State->get<TaintMap>(*SI)) {

if (*Tag == Kind) if (*Tag == Kind) {

return true; TaintedSymbols.push_back(*SI);

if (returnFirstOnly)

steakhalUnsubmitted

Done

TaintedSymbols.push_back(*SI);

- if (returnFirstOnly && !TaintedSymbols.empty())

+ if (returnFirstOnly)

return TaintedSymbols; // return early if needed

The second part of the conjunction should be tautologically true.

steakhal: The second part of the conjunction should be tautologically true.

return TaintedSymbols; // return early if needed

}

} }

if (const auto *SD = dyn_cast<SymbolDerived>(*SI)) { if (const auto *SD = dyn_cast<SymbolDerived>(*SI)) {

// If this is a SymbolDerived with a tainted parent, it's also tainted. // If this is a SymbolDerived with a tainted parent, it's also tainted.

if (isTainted(State, SD->getParentSymbol(), Kind)) std::vector<SymbolRef> TaintedParents = getTaintedSymbolsImpl(

return true; State, SD->getParentSymbol(), Kind, returnFirstOnly);

llvm::append_range(TaintedSymbols, TaintedParents);

if (returnFirstOnly && !TaintedSymbols.empty())

return TaintedSymbols; // return early if needed

steakhalUnsubmitted

Done

Here we still have the TSR token.

steakhal: Here we still have the `TSR` token.

// If this is a SymbolDerived with the same parent symbol as another // If this is a SymbolDerived with the same parent symbol as another

// tainted SymbolDerived and a region that's a sub-region of that tainted // tainted SymbolDerived and a region that's a sub-region of that

// symbol, it's also tainted. // tainted symbol, it's also tainted.

if (const TaintedSubRegions *Regs = if (const TaintedSubRegions *Regs =

State->get<DerivedSymTaint>(SD->getParentSymbol())) { State->get<DerivedSymTaint>(SD->getParentSymbol())) {

const TypedValueRegion *R = SD->getRegion(); const TypedValueRegion *R = SD->getRegion();

for (auto I : *Regs) { for (auto I : *Regs) {

// FIXME: The logic to identify tainted regions could be more // FIXME: The logic to identify tainted regions could be more

// complete. For example, this would not currently identify // complete. For example, this would not currently identify

// overlapping fields in a union as tainted. To identify this we can // overlapping fields in a union as tainted. To identify this we can

// check for overlapping/nested byte offsets. // check for overlapping/nested byte offsets.

if (Kind == I.second && R->isSubRegionOf(I.first)) if (Kind == I.second && R->isSubRegionOf(I.first)) {

return true; TaintedSymbols.push_back(SD->getParentSymbol());

if (returnFirstOnly && !TaintedSymbols.empty())

return TaintedSymbols; // return early if needed

}

} }

// If memory region is tainted, data is also tainted. // If memory region is tainted, data is also tainted.

if (const auto *SRV = dyn_cast<SymbolRegionValue>(*SI)) { if (const auto *SRV = dyn_cast<SymbolRegionValue>(*SI)) {

if (isTainted(State, SRV->getRegion(), Kind)) std::vector<SymbolRef> TaintedRegions =

return true; getTaintedSymbolsImpl(State, SRV->getRegion(), Kind, returnFirstOnly);

llvm::append_range(TaintedSymbols, TaintedRegions);

if (returnFirstOnly && !TaintedSymbols.empty())

return TaintedSymbols; // return early if needed

} }

steakhalUnsubmitted

Done

What does TSR abbreviate? I would find TaintedSym more descriptive.

steakhal: What does `TSR` abbreviate? I would find `TaintedSym` more descriptive.

dkruppAuthorUnsubmitted

Done

TSR = Tainted Symbol Ref

but I changed it as you suggested.

dkrupp: TSR = Tainted Symbol Ref but I changed it as you suggested.

steakhalUnsubmitted

Done

if (const auto *SRV = dyn_cast<SymbolRegionValue>(*SI)) {

- std::vector<SymbolRef> TaintedRegions =

- getTaintedSymbols(State, SRV->getRegion(), Kind);

- TaintedSymbols.insert(TaintedSymbols.begin(), TaintedRegions.begin(),

- TaintedRegions.end());

+ llvm::append_range(TaintedSymbols, getTaintedSymbols(State, SRV->getRegion(), Kind));

}

// If this is a SymbolCast from a tainted value, it's also tainted.

For such constructs, I would prefer this.

steakhal: For such constructs, I would prefer this.

// If this is a SymbolCast from a tainted value, it's also tainted. // If this is a SymbolCast from a tainted value, it's also tainted.

if (const auto *SC = dyn_cast<SymbolCast>(*SI)) { if (const auto *SC = dyn_cast<SymbolCast>(*SI)) {

if (isTainted(State, SC->getOperand(), Kind)) std::vector<SymbolRef> TaintedCasts =

return true; getTaintedSymbolsImpl(State, SC->getOperand(), Kind, returnFirstOnly);

steakhalUnsubmitted

Done

If returnFirstOnly is true, this getTaintedSymbols() call would still eagerly (needlessly) collect all of the symbols.
I'd recommend propagating the returnFirstOnly parameter to the recursive calls to avoid this problem.
I also encourage you to make use of llvm::append_range() whenever it makes sense.

steakhal: If `returnFirstOnly` is `true`, this `getTaintedSymbols()` call would still eagerly…

dkruppAuthorUnsubmitted

Done

You are perfectly right. I overlooked these calls and, because of the default parameter, did not get a warning. Now fixed.

dkrupp: You are perfectly right. I overlooked these calls and, because of the default parameter, did…
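
Why a default argument can hide this kind of omission is easy to reproduce in isolation. The sketch below is hypothetical (the `Node` type, both functions, and the `Visited` counter are assumptions for illustration): the first variant compiles although its recursive call drops the flag, while the second forces every call site to state it.

```cpp
#include <cassert>
#include <vector>

struct Node {
  bool Tainted;
  std::vector<const Node *> Children;
};

static int Visited = 0; // counts visited nodes, for illustration only

// Hypothetical buggy variant: the defaulted flag lets the recursive call
// omit it without any diagnostic, so subtrees are always searched eagerly.
int countTaintedDefaulted(const Node *N, bool ReturnFirstOnly = false) {
  ++Visited;
  int Found = N->Tainted ? 1 : 0;
  for (const Node *C : N->Children) {
    if (ReturnFirstOnly && Found > 0)
      return Found;
    Found += countTaintedDefaulted(C); // flag silently dropped
  }
  return Found;
}

// Variant without a default: every call site must pass the flag explicitly.
int countTaintedExplicit(const Node *N, bool ReturnFirstOnly) {
  ++Visited;
  int Found = N->Tainted ? 1 : 0;
  for (const Node *C : N->Children) {
    if (ReturnFirstOnly && Found > 0)
      return Found;
    Found += countTaintedExplicit(C, ReturnFirstOnly);
  }
  return Found;
}
```

On a tree where one subtree holds several tainted leaves, the defaulted variant visits all of them even when asked for the first match only, whereas the explicit variant stops after the first hit.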

} llvm::append_range(TaintedSymbols, TaintedCasts);

} if (returnFirstOnly && !TaintedSymbols.empty())

return TaintedSymbols; // return early if needed

return false;

} }

PathDiagnosticPieceRef TaintBugVisitor::VisitNode(const ExplodedNode *N,

BugReporterContext &BRC,

PathSensitiveBugReport &BR) {

// Find the ExplodedNode where the taint was first introduced

if (!isTainted(N->getState(), V) ||

isTainted(N->getFirstPred()->getState(), V))

return nullptr;

const Stmt *S = N->getStmtForDiagnostics();

if (!S)

return nullptr;

const LocationContext *NCtx = N->getLocationContext();

PathDiagnosticLocation L =

PathDiagnosticLocation::createBegin(S, BRC.getSourceManager(), NCtx);

if (!L.isValid() || !L.asLocation().isValid())

return nullptr;

return std::make_shared<PathDiagnosticEventPiece>(L, "Taint originated here");

} }

return TaintedSymbols;

}

No newline at end of file

clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp

Show All 29 Lines

using namespace ento; using namespace ento;

using namespace taint; using namespace taint;

namespace { namespace {

class VLASizeChecker class VLASizeChecker

: public Checker<check::PreStmt<DeclStmt>, : public Checker<check::PreStmt<DeclStmt>,

check::PreStmt<UnaryExprOrTypeTraitExpr>> { check::PreStmt<UnaryExprOrTypeTraitExpr>> {

mutable std::unique_ptr<BugType> BT; mutable std::unique_ptr<BugType> BT;

enum VLASize_Kind { mutable std::unique_ptr<BugType> TaintBT;

VLA_Garbage, enum VLASize_Kind { VLA_Garbage, VLA_Zero, VLA_Negative, VLA_Overflow };

VLA_Zero,

VLA_Tainted,

VLA_Negative,

VLA_Overflow

};

/// Check a VLA for validity. /// Check a VLA for validity.

/// Every dimension of the array and the total size is checked for validity. /// Every dimension of the array and the total size is checked for validity.

/// Returns null or a new state where the size is validated. /// Returns null or a new state where the size is validated.

/// 'ArraySize' will contain SVal that refers to the total size (in char) /// 'ArraySize' will contain SVal that refers to the total size (in char)

/// of the array. /// of the array.

ProgramStateRef checkVLA(CheckerContext &C, ProgramStateRef State, ProgramStateRef checkVLA(CheckerContext &C, ProgramStateRef State,

const VariableArrayType *VLA, SVal &ArraySize) const; const VariableArrayType *VLA, SVal &ArraySize) const;

/// Check a single VLA index size expression for validity. /// Check a single VLA index size expression for validity.

ProgramStateRef checkVLAIndexSize(CheckerContext &C, ProgramStateRef State, ProgramStateRef checkVLAIndexSize(CheckerContext &C, ProgramStateRef State,

const Expr *SizeE) const; const Expr *SizeE) const;

void reportBug(VLASize_Kind Kind, const Expr *SizeE, ProgramStateRef State, void reportBug(VLASize_Kind Kind, const Expr *SizeE, ProgramStateRef State,

CheckerContext &C, CheckerContext &C) const;

std::unique_ptr<BugReporterVisitor> Visitor = nullptr) const;

void reportTaintBug(const Expr *SizeE, ProgramStateRef State,

CheckerContext &C, SVal TaintedSVal) const;

public: public:

void checkPreStmt(const DeclStmt *DS, CheckerContext &C) const; void checkPreStmt(const DeclStmt *DS, CheckerContext &C) const;

void checkPreStmt(const UnaryExprOrTypeTraitExpr *UETTE, void checkPreStmt(const UnaryExprOrTypeTraitExpr *UETTE,

CheckerContext &C) const; CheckerContext &C) const;

}; };

} // end anonymous namespace } // end anonymous namespace

▲ Show 20 Lines • Show All 94 Lines • ▼ Show 20 Lines ProgramStateRef VLASizeChecker::checkVLAIndexSize(CheckerContext &C,

// See if the size value is known. It can't be undefined because we would have // See if the size value is known. It can't be undefined because we would have

// warned about that already. // warned about that already.

if (SizeV.isUnknown()) if (SizeV.isUnknown())

return nullptr; return nullptr;

// Check if the size is tainted. // Check if the size is tainted.

if (isTainted(State, SizeV)) { if (isTainted(State, SizeV)) {

reportBug(VLA_Tainted, SizeE, nullptr, C, reportTaintBug(SizeE, State, C, SizeV);

std::make_unique<TaintBugVisitor>(SizeV));

return nullptr; return nullptr;

} }

// Check if the size is zero. // Check if the size is zero.

DefinedSVal SizeD = SizeV.castAs<DefinedSVal>(); DefinedSVal SizeD = SizeV.castAs<DefinedSVal>();

ProgramStateRef StateNotZero, StateZero; ProgramStateRef StateNotZero, StateZero;

std::tie(StateNotZero, StateZero) = State->assume(SizeD); std::tie(StateNotZero, StateZero) = State->assume(SizeD);

Show All 24 Lines if (StateNeg && !StatePos) {

return nullptr; return nullptr;

} }

State = StatePos; State = StatePos;

} }

return State; return State;

} }

void VLASizeChecker::reportBug( void VLASizeChecker::reportTaintBug(const Expr *SizeE, ProgramStateRef State,

VLASize_Kind Kind, const Expr *SizeE, ProgramStateRef State, CheckerContext &C, SVal TaintedSVal) const {

CheckerContext &C, std::unique_ptr<BugReporterVisitor> Visitor) const { // Generate an error node.

ExplodedNode *N = C.generateErrorNode(State);

if (!N)

return;

if (!TaintBT)

TaintBT.reset(

new BugType(this, "Dangerous variable-length array (VLA) declaration",

categories::TaintedData));

SmallString<256> buf;

llvm::raw_svector_ostream os(buf);

os << "Declared variable-length array (VLA) ";

os << "has tainted size";

auto report = std::make_unique<PathSensitiveBugReport>(*TaintBT, os.str(), N);

report->addRange(SizeE->getSourceRange());

bugreporter::trackExpressionValue(N, SizeE, *report);

// The vla size may be a complex expression where multiple memory locations

// are tainted.

for (auto Sym : getTaintedSymbols(State, TaintedSVal))

report->markInteresting(Sym);

C.emitReport(std::move(report));

steakhalUnsubmitted

Done

// are tainted.

- std::vector<SymbolRef> TaintedSyms = getTaintedSymbols(State, TaintedSVal);

- for (auto Sym : TaintedSyms)

+ for (SymbolRef Sym : getTaintedSymbols(State, TaintedSVal))

report->markInteresting(Sym);

C.emitReport(std::move(report));

Ah, how awesome it would be to have a markInteresting(llvm::ArrayRef<SymbolRef>) overload.

steakhal: Ah, how awesome it would be to have a `markInteresting(llvm::ArrayRef<SymbolRef>)` overload.
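
Such an overload does not exist in the patch; the following is only a sketch of what the wish could look like, using a toy `Report` class and `int` as a stand-in for `SymbolRef`. With LLVM types the parameter would presumably be `llvm::ArrayRef<SymbolRef>`, which is an assumption here.

```cpp
#include <cassert>
#include <set>
#include <vector>

using SymbolId = int; // hypothetical stand-in for SymbolRef

class Report {
  std::set<SymbolId> Interesting;

public:
  void markInteresting(SymbolId Sym) { Interesting.insert(Sym); }

  // Hypothetical convenience overload: marks every symbol in the sequence,
  // so callers do not have to write the loop themselves.
  void markInteresting(const std::vector<SymbolId> &Syms) {
    for (SymbolId Sym : Syms)
      markInteresting(Sym);
  }

  bool isInteresting(SymbolId Sym) const {
    return Interesting.count(Sym) != 0;
  }
};
```

A call site would then shrink from a loop to a single statement, for example `R.markInteresting(TaintedSyms)`.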

}

void VLASizeChecker::reportBug(VLASize_Kind Kind, const Expr *SizeE,

ProgramStateRef State, CheckerContext &C) const {

// Generate an error node. // Generate an error node.

ExplodedNode *N = C.generateErrorNode(State); ExplodedNode *N = C.generateErrorNode(State);

if (!N) if (!N)

return; return;

if (!BT) if (!BT)

BT.reset(new BuiltinBug( BT.reset(new BugType(this,

this, "Dangerous variable-length array (VLA) declaration")); "Dangerous variable-length array (VLA) declaration",

categories::LogicError));

SmallString<256> buf; SmallString<256> buf;

llvm::raw_svector_ostream os(buf); llvm::raw_svector_ostream os(buf);

os << "Declared variable-length array (VLA) "; os << "Declared variable-length array (VLA) ";

switch (Kind) { switch (Kind) {

case VLA_Garbage: case VLA_Garbage:

os << "uses a garbage value as its size"; os << "uses a garbage value as its size";

steakhalUnsubmitted

Done

Why don't we use a distinct BugType for this?

steakhal: Why don't we use a distinct BugType for this?

dkruppAuthorUnsubmitted

Done

You mean a new bug type instance? Would there be an advantage to that? It seemed simpler this way: to identify the tainted reports by the bug category.

dkrupp: You mean a new bug type instance? Would there be an advantage to that? It seemed simpler…

steakhalUnsubmitted

Done

> You mean a new bug type instance? Would there be an advantage to that? It seemed simpler this way: to identify the tainted reports by the bug category.

I never checked how BugTypes contribute to bug report construction, but my gut instinct suggests that we should have two separate instances, as we frequently do for other checkers.

steakhal: > You mean a new bug type instances? Would there be an advantage for that? Seemed to be simpler…
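
The "two separate instances" pattern can be sketched without analyzer headers. Everything below is a simplified assumption: `BugType` is reduced to a description and a category string, and `Checker` only models the lazy creation of the two members. The category strings match `categories::LogicError` and the `TaintedData` string added by this patch.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical minimal BugType: a description plus a category.
struct BugType {
  std::string Description;
  std::string Category;
};

class Checker {
  // Created lazily on first use, one instance per bug category.
  mutable std::unique_ptr<BugType> BT;
  mutable std::unique_ptr<BugType> TaintBT;

public:
  const BugType &logicBug() const {
    if (!BT)
      BT = std::make_unique<BugType>(
          BugType{"Dangerous variable-length array (VLA) declaration",
                  "Logic error"});
    return *BT;
  }

  const BugType &taintBug() const {
    if (!TaintBT)
      TaintBT = std::make_unique<BugType>(
          BugType{"Dangerous variable-length array (VLA) declaration",
                  "Tainted data used"});
    return *TaintBT;
  }
};
```

Keeping the instances apart means a report's category follows directly from which bug type it was built with, rather than from a switch on an enumerator.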

break; break;

case VLA_Zero: case VLA_Zero:

os << "has zero size"; os << "has zero size";

break; break;

case VLA_Tainted:

os << "has tainted size";

break;

case VLA_Negative: case VLA_Negative:

os << "has negative size"; os << "has negative size";

break; break;

case VLA_Overflow: case VLA_Overflow:

os << "has too large size"; os << "has too large size";

break; break;

} }

auto report = std::make_unique<PathSensitiveBugReport>(*BT, os.str(), N); auto report = std::make_unique<PathSensitiveBugReport>(*BT, os.str(), N);

report->addVisitor(std::move(Visitor));

report->addRange(SizeE->getSourceRange()); report->addRange(SizeE->getSourceRange());

bugreporter::trackExpressionValue(N, SizeE, *report); bugreporter::trackExpressionValue(N, SizeE, *report);

C.emitReport(std::move(report)); C.emitReport(std::move(report));

} }

void VLASizeChecker::checkPreStmt(const DeclStmt *DS, CheckerContext &C) const { void VLASizeChecker::checkPreStmt(const DeclStmt *DS, CheckerContext &C) const {

if (!DS->isSingleDecl()) if (!DS->isSingleDecl())

return; return;


clang/lib/StaticAnalyzer/Core/CommonBugCategories.cpp

	const char *const MemoryRefCount =			const char *const MemoryRefCount =
	"Memory (Core Foundation/Objective-C/OSObject)";			"Memory (Core Foundation/Objective-C/OSObject)";
	const char *const MemoryError = "Memory error";			const char *const MemoryError = "Memory error";
	const char *const UnixAPI = "Unix API";			const char *const UnixAPI = "Unix API";
	const char *const CXXObjectLifecycle = "C++ object lifecycle";			const char *const CXXObjectLifecycle = "C++ object lifecycle";
	const char *const CXXMoveSemantics = "C++ move semantics";			const char *const CXXMoveSemantics = "C++ move semantics";
	const char *const SecurityError = "Security error";			const char *const SecurityError = "Security error";
	const char *const UnusedCode = "Unused code";			const char *const UnusedCode = "Unused code";
				const char *const TaintedData = "Tainted data used";
	} // namespace categories			} // namespace categories
	} // namespace ento			} // namespace ento
	} // namespace clang			} // namespace clang

clang/test/Analysis/taint-diagnostic-visitor.c

// RUN: %clang_cc1 -analyze -analyzer-checker=alpha.security.taint,core,alpha.security.ArrayBoundV2 -analyzer-output=text -verify %s // RUN: %clang_cc1 -analyze -analyzer-checker=alpha.security.taint,core,alpha.security.ArrayBoundV2 -analyzer-output=text -verify %s

// This file is for testing enhanced diagnostics produced by the GenericTaintChecker // This file is for testing enhanced diagnostics produced by the GenericTaintChecker

typedef __typeof(sizeof(int)) size_t;

struct _IO_FILE;

typedef struct _IO_FILE FILE;

int scanf(const char *restrict format, ...); int scanf(const char *restrict format, ...);

int system(const char *command); int system(const char *command);

char* getenv( const char* env_var );

size_t strlen( const char* str );

void *malloc(size_t size );

steakhal (Unsubmitted)

Done

The premerge bots are complaining about these two lines on Windows:

error: 'warning' diagnostics seen but not expected: 
  File C:\ws\w8\llvm-project\premerge-checks\clang\test\Analysis\taint-diagnostic-visitor.c Line 12: incompatible redeclaration of library function 'strlen'
  File C:\ws\w8\llvm-project\premerge-checks\clang\test\Analysis\taint-diagnostic-visitor.c Line 13: incompatible redeclaration of library function 'malloc'
error: 'note' diagnostics seen but not expected: 
  File C:\ws\w8\llvm-project\premerge-checks\clang\test\Analysis\taint-diagnostic-visitor.c Line 12: 'strlen' is a builtin with type 'unsigned long long (const char *)'
  File C:\ws\w8\llvm-project\premerge-checks\clang\test\Analysis\taint-diagnostic-visitor.c Line 13: 'malloc' is a builtin with type 'void *(unsigned long long)'
4 errors generated.

I think it's because size_t should be defined as unsigned long long on x86_64. This also means that you should pin the target to x86_64 to satisfy this test on all platforms.
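The test file in this diff declares `typedef __typeof(sizeof(int)) size_t;`, which derives the type from the compiler rather than spelling it out. As a rough C++ analogue of why that avoids the mismatch, the sketch below checks that the type produced by `sizeof` is the platform's `size_t`. The alias `size_from_sizeof` and the function `sketch::strlen_like` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <type_traits>

// Derive the size type from the result of sizeof, so it matches whatever
// the current target uses (unsigned long, unsigned long long, ...).
using size_from_sizeof = decltype(sizeof(int));

// The type of a sizeof expression is std::size_t on every target.
static_assert(std::is_same<size_from_sizeof, std::size_t>::value,
              "sizeof yields std::size_t");

namespace sketch {
// Hypothetical strlen-like function declared with the derived type, so its
// signature does not depend on a hard-coded integer width.
inline size_from_sizeof strlen_like(const char *S) {
  size_from_sizeof N = 0;
  while (S[N] != '\0')
    ++N;
  return N;
}
} // namespace sketch
```

Declaring library functions against a derived type like this keeps the declarations compatible with the builtin signatures regardless of target, which is an alternative to pinning the test to a single triple.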


void free( void *ptr );

char *fgets(char *str, int n, FILE *stream);

FILE *stdin;

void taintDiagnostic(void) void taintDiagnostic(void)

{ {

char buf[128]; char buf[128];

scanf("%s", buf); // expected-note {{Taint originated here}} scanf("%s", buf); // expected-note {{Taint originated here}}

// expected-note@-1 {{Taint propagated to the 2nd argument}}

system(buf); // expected-warning {{Untrusted data is passed to a system call}} // expected-note {{Untrusted data is passed to a system call (CERT/STR02-C. Sanitize data passed to complex subsystems)}} system(buf); // expected-warning {{Untrusted data is passed to a system call}} // expected-note {{Untrusted data is passed to a system call (CERT/STR02-C. Sanitize data passed to complex subsystems)}}

} }

int taintDiagnosticOutOfBound(void) { int taintDiagnosticOutOfBound(void) {

int index; int index;

int Array[] = {1, 2, 3, 4, 5}; int Array[] = {1, 2, 3, 4, 5};

scanf("%d", &index); // expected-note {{Taint originated here}} scanf("%d", &index); // expected-note {{Taint originated here}}

// expected-note@-1 {{Taint propagated to the 2nd argument}}

return Array[index]; // expected-warning {{Out of bound memory access (index is tainted)}} return Array[index]; // expected-warning {{Out of bound memory access (index is tainted)}}

// expected-note@-1 {{Out of bound memory access (index is tainted)}} // expected-note@-1 {{Out of bound memory access (index is tainted)}}

} }

int taintDiagnosticDivZero(int operand) { int taintDiagnosticDivZero(int operand) {

scanf("%d", &operand); // expected-note {{Value assigned to 'operand'}} scanf("%d", &operand); // expected-note {{Value assigned to 'operand'}}

// expected-note@-1 {{Taint originated here}} // expected-note@-1 {{Taint originated here}}

// expected-note@-2 {{Taint propagated to the 2nd argument}}

return 10 / operand; // expected-warning {{Division by a tainted value, possibly zero}} return 10 / operand; // expected-warning {{Division by a tainted value, possibly zero}}

// expected-note@-1 {{Division by a tainted value, possibly zero}} // expected-note@-1 {{Division by a tainted value, possibly zero}}

} }

void taintDiagnosticVLA(void) { void taintDiagnosticVLA(void) {

int x; int x;

scanf("%d", &x); // expected-note {{Value assigned to 'x'}} scanf("%d", &x); // expected-note {{Value assigned to 'x'}}

// expected-note@-1 {{Taint originated here}} // expected-note@-1 {{Taint originated here}}

// expected-note@-2 {{Taint propagated to the 2nd argument}}

steakhal (Unsubmitted)

Done

I'd suggest to count from 1 instead of 0 when referring to Nth arguments.
You can also use the llvm::getOrdinalSuffix(N) to get nicer messages.
We already count from 1 in the std library checker.
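A small standalone sketch of the suggested 1-based wording follows. `ordinalSuffix` is a local stand-in written to mirror how llvm::getOrdinalSuffix is expected to behave, so the example compiles without LLVM headers. `propagatedToArgNote` is a hypothetical helper, not the function used in the patch.

```cpp
#include <string>

// Local stand-in for llvm::getOrdinalSuffix: returns "st", "nd", "rd" or
// "th" for a 1-based position, treating 11, 12 and 13 as special cases.
inline const char *ordinalSuffix(unsigned N) {
  switch (N % 100) {
  case 11:
  case 12:
  case 13:
    return "th";
  default:
    switch (N % 10) {
    case 1:
      return "st";
    case 2:
      return "nd";
    case 3:
      return "rd";
    default:
      return "th";
    }
  }
}

// Hypothetical helper: builds the note text from a 0-based argument index,
// converting it to the 1-based position shown to the user.
inline std::string propagatedToArgNote(unsigned ArgIndex) {
  unsigned OneBased = ArgIndex + 1;
  return "Taint propagated to the " + std::to_string(OneBased) +
         ordinalSuffix(OneBased) + " argument";
}
```

For `scanf("%d", &x)`, the tainted output is argument index 1, so `propagatedToArgNote(1)` gives the "2nd argument" wording that the test expectations in this file use.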


int vla[x]; // expected-warning {{Declared variable-length array (VLA) has tainted size}} int vla[x]; // expected-warning {{Declared variable-length array (VLA) has tainted size}}

// expected-note@-1 {{Declared variable-length array (VLA) has tainted size}} // expected-note@-1 {{Declared variable-length array (VLA) has tainted size}}

} }

// Tests if the originated note is correctly placed even if the path is

// propagating through variables and expressions

char *taintDiagnosticPropagation(){

char *pathbuf;

char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}

// expected-note@-1 {{Taint propagated to the return value}}

if (pathlist){ // expected-note {{Assuming 'pathlist' is non-null}}

// expected-note@-1 {{Taking true branch}}

pathbuf=(char*) malloc(strlen(pathlist)+1); // expected-warning{{Untrusted data is used to specify the buffer size}}

// expected-note@-1{{Untrusted data is used to specify the buffer size}}

// expected-note@-2 {{Taint propagated to the return value}}

return pathbuf;

}

return 0;

}

steakhal (Unsubmitted)

Done

}

- //Tests if the originated note is correctly placed even if the path is

- //propagating through variables and expressions

- char* taintDiagnosticPropagation(){

+ // Tests if the originated note is correctly placed even if the path is

+ // propagating through variables and expressions

+ char *taintDiagnosticPropagation(){

char *pathbuf;

- char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}

- // expected-note@-1 {{Taint propagated to the return value}}

- if (pathlist){ // expected-note {{Assuming 'pathlist' is non-null}}

- // expected-note@-1 {{Taking true branch}}

- pathbuf=(char*) malloc(strlen(pathlist)+1); // expected-warning{{Untrusted data is used to specify the buffer size}}

- // expected-note@-1{{Untrusted data is used to specify the buffer size}}

- // expected-note@-2 {{Taint propagated to the return value}}

+ char *pathlist = getenv("PATH");

+ // expected-note@-1 {{Taint originated here}}

+ // expected-note@-2 {{Taint propagated to the return value}}

+ // expected-note@+2 {{Assuming 'pathlist' is non-null}}

+ // expected-note@+1 {{Taking true branch}}

+ if (pathlist) {

+ pathbuf = (char*)malloc(strlen(pathlist) + 1);

+ // expected-warning@-1 {{Untrusted data is used to specify the buffer size}}

+ // expected-note@-2 {{Untrusted data is used to specify the buffer size}}

+ // expected-note@-3 {{Taint propagated to the return value}}

return pathbuf;

}

return 0;

}

//Taint origin should be marked correctly even if there are multiple taint

I know this is subjective, but I'd suggest to reformat the tests to match LLVM style guidelines, unless the formatting is important for the test.
Consistency helps the reader and reviewer, as code and tests are read many more times than written.

This applies to the rest of the touched tests.


steakhal (Unsubmitted)

Not Done

Originally I meant this for the rest of the test cases you change or add as part of this patch. I hope that clarifies it.


dkrupp (Author, Unsubmitted)

Done

I made some formatting changes you suggested, but
I would like to leave the //expected-note tags as they are now, because then it remains consistent with the rest of the test cases.

Would it be okay like this, or should I reformat the whole file (untouched parts too)?


// Taint origin should be marked correctly even if there are multiple taint

// sources in the function

char *taintDiagnosticPropagation2(){

char *pathbuf;

char *user_env2=getenv("USER_ENV_VAR2");//unrelated taint source

char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}

// expected-note@-1 {{Taint propagated to the return value}}

char *user_env=getenv("USER_ENV_VAR");//unrelated taint source

if (pathlist){ // expected-note {{Assuming 'pathlist' is non-null}}

// expected-note@-1 {{Taking true branch}}

pathbuf=(char*) malloc(strlen(pathlist)+1); // expected-warning{{Untrusted data is used to specify the buffer size}}

// expected-note@-1{{Untrusted data is used to specify the buffer size}}

// expected-note@-2 {{Taint propagated to the return value}}

return pathbuf;

}

return 0;

}

void testReadStdIn(){

char buf[1024];

fgets(buf, sizeof(buf), stdin);// expected-note {{Taint originated here}}

// expected-note@-1 {{Taint propagated to the 1st argument}}

system(buf);// expected-warning {{Untrusted data is passed to a system call}}

// expected-note@-1 {{Untrusted data is passed to a system call (CERT/STR02-C. Sanitize data passed to complex subsystems)}}

}

void multipleTaintSources(void) {

int x,y,z;

scanf("%d", &x); // expected-note {{Taint originated here}}

// expected-note@-1 {{Taint propagated to the 2nd argument}}

scanf("%d", &y); // expected-note {{Taint originated here}}

// expected-note@-1 {{Taint propagated to the 2nd argument}}

scanf("%d", &z);

int* ptr = (int*) malloc(y + x); // expected-warning {{Untrusted data is used to specify the buffer size}}

// expected-note@-1{{Untrusted data is used to specify the buffer size}}

free (ptr);

}

void multipleTaintedArgs(void) {

int x,y;

scanf("%d %d", &x, &y); // expected-note {{Taint originated here}}

// expected-note@-1 {{Taint propagated to the 2nd argument, 3rd argument}}

int* ptr = (int*) malloc(x + y); // expected-warning {{Untrusted data is used to specify the buffer size}}

// expected-note@-1{{Untrusted data is used to specify the buffer size}}

free (ptr);

}

clang/test/Analysis/taint-tester.c

	void stdinTest2(FILE *pIn) {			void stdinTest2(FILE *pIn) {
	FILE *p = stdin;			FILE *p = stdin;
	FILE *pp = p;			FILE *pp = p;
	int ii;			int ii;

	fscanf(pp, "%d", &ii);			fscanf(pp, "%d", &ii);
	int jj = ii;// expected-warning + {{tainted}}			int jj = ii;// expected-warning + {{tainted}}

	fscanf(p, "%d", &ii);			fscanf(p, "%d", &ii);// expected-warning + {{tainted}}
	int jj2 = ii;// expected-warning + {{tainted}}			int jj2 = ii;// expected-warning + {{tainted}}

	ii = 3;			ii = 3;
	int jj3 = ii;// no warning			int jj3 = ii;// no warning

	p = pIn;			p = pIn;
	fscanf(p, "%d", &ii);			fscanf(p, "%d", &ii);
	int jj4 = ii;// no warning			int jj4 = ii;// no warning

Revision Contents

Diff 516389

clang/include/clang/StaticAnalyzer/Checkers/Taint.h

clang/include/clang/StaticAnalyzer/Core/BugReporter/CommonBugCategories.h

clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp

clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp

clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp

clang/lib/StaticAnalyzer/Checkers/Taint.cpp

clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp

clang/lib/StaticAnalyzer/Core/CommonBugCategories.cpp

clang/test/Analysis/taint-diagnostic-visitor.c

clang/test/Analysis/taint-tester.c
