This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] ArrayBoundCheckerV2 should check the region for taint as well
AbandonedPublic

Authored by steakhal on Aug 29 2023, 8:20 AM.

Details

Summary

Previously, we didn't report OOB accesses if the pointer itself was
tainted. This looks weird, but there is weird code out there, code like
inside the Juliet benchmark.

Diff Detail

Event Timeline

steakhal created this revision.Aug 29 2023, 8:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 29 2023, 8:20 AM
steakhal requested review of this revision.Aug 29 2023, 8:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 29 2023, 8:20 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

I was thinking about adding an improvement like this for the sake of consistency, but I fear that this might cause a surprising amount of false positives. (Did you test it on larger codebases?)

As far as I understand (correct me if I'm wrong), there are situations when TaintChecker marks the memory region of e.g. a returned string as tainted to mark that the contents of the region are unreliable. (There may be other more exotic situations when even the pointer value or the extent becomes unreliable e.g. your testcases where you scanf into a pointer.)

I fear that the overzealous handling of tainted data would produce too many false positives in situations when code indexes strings that contain user input and the SA engine cannot understand the logic that calculates the indices. For example if I understand it correctly a function like

char f(int n) {
  char *var = getenv("FOO");
  return var[n];
}

would be reported as an out of bounds memory access, because the region is tainted and the index value is not known. This is especially problematic on the underflow side (which is introduced by your commit), because I'd assume that it's common to have numeric values (e.g. function arguments) where the programmer knows that they're nonnegative, but the analyzer cannot deduce this.

Also note that the error message "Out of bound memory access (index is tainted)" becomes misleading after this patch -- but don't spend too much time on resolving this, because right now I'm working on a complete rewrite the message generation to replace the current spartan messages with useful error reporting.

clang/test/Analysis/taint-generic.c
1133–1141

Also add the "opposite" of this where you test if (n < BUFSIZE).

I was thinking about adding an improvement like this for the sake of consistency, but I fear that this might cause a surprising amount of false positives. (Did you test it on larger codebases?)

I'm just done backporting the first batch to our fork and scheduling a large-scale measurement.
But my gut instinct suggests there won't be too many new reports, as I believe "tainted" pointers shouldn't be really a thing, except for really exotic cases, like scanf a pointer - which is dubious at best in presence of ASLR.

As far as I understand (correct me if I'm wrong), there are situations when TaintChecker marks the memory region of e.g. a returned string as tainted to mark that the contents of the region are unreliable. (There may be other more exotic situations when even the pointer value or the extent becomes unreliable e.g. your testcases where you scanf into a pointer.)

Taint can only bind to symbols, thus a memregion per-say cannot be tainted, only the "conjured address of it" or the "symbol bound to the pointee's/referred lvalue" can be tainted.
WDYM by "are unreliable" and "becomes unreliable"?

I fear that the overzealous handling of tainted data would produce too many false positives in situations when code indexes strings that contain user input and the SA engine cannot understand the logic that calculates the indices. For example if I understand it correctly a function like

char f(int n) {
  char *var = getenv("FOO");
  return var[n];
}

would be reported as an out of bounds memory access, because the region is tainted and the index value is not known. This is especially problematic on the underflow side (which is introduced by your commit), because I'd assume that it's common to have numeric values (e.g. function arguments) where the programmer knows that they're nonnegative, but the analyzer cannot deduce this.

To me, this is a TP. The envvar might not be present, and even if it's present, the relation of the string length of var to n is not expressed or checked; thus this deserves a report.
But I see your concern, however, I cannot think of valid counterexamples, aka. FPs off the top of my head. I'll come back once I see the real results.

Also note that the error message "Out of bound memory access (index is tainted)" becomes misleading after this patch -- but don't spend too much time on resolving this, because right now I'm working on a complete rewrite the message generation to replace the current spartan messages with useful error reporting.

I agree, but I'm not sure how much more readable something more accurate like "the location where this pointer points to potentially depends on untrusted tainted data". (Note that this would also work for tainted indexes as well).

I must admit, that reporting and diagnostics were low priorities for this patch stack, so I'd focus on addressing logic concerns first for the whole stack and then come back to reporting to make it consistent across the stack.

I was thinking about adding an improvement like this for the sake of consistency, but I fear that this might cause a surprising amount of false positives. (Did you test it on larger codebases?)

I'm just done backporting the first batch to our fork and scheduling a large-scale measurement.
But my gut instinct suggests there won't be too many new reports, as I believe "tainted" pointers shouldn't be really a thing, except for really exotic cases, like scanf a pointer - which is dubious at best in presence of ASLR.

As far as I understand (correct me if I'm wrong), there are situations when TaintChecker marks the memory region of e.g. a returned string as tainted to mark that the contents of the region are unreliable. (There may be other more exotic situations when even the pointer value or the extent becomes unreliable e.g. your testcases where you scanf into a pointer.)

Taint can only bind to symbols, thus a memregion per-say cannot be tainted, only the "conjured address of it" or the "symbol bound to the pointee's/referred lvalue" can be tainted.

Thanks for the clarification! As I dig deeper I see that the functions that look like "let's taint a MemRegion" are in fact only affecting SymbolicRegions (targeting the corresponding symbol). This taint may be seen by the logic that checks the taintedness of the ElementRegion -- but as it's limited to symbolic regions it probably won't appear in the underflow test [which is skipped for symbolic regions in unknown space]. (There are also some calls which taint the pointee's/referred value -- that won't affect us.)

WDYM by "are unreliable" and "becomes unreliable"?

"may be controlled by a malicious actor", basically a synonym of "tainted" but I used "tainted" for things that are marked as tainted by the SA code and "unreliable" for things that would be marked as tainted by an ideal algorithm.

I fear that the overzealous handling of tainted data would produce too many false positives in situations when code indexes strings that contain user input and the SA engine cannot understand the logic that calculates the indices. For example if I understand it correctly a function like

char f(int n) {
  char *var = getenv("FOO");
  return var[n];
}

would be reported as an out of bounds memory access, because the region is tainted and the index value is not known. This is especially problematic on the underflow side (which is introduced by your commit), because I'd assume that it's common to have numeric values (e.g. function arguments) where the programmer knows that they're nonnegative, but the analyzer cannot deduce this.

To me, this is a TP. The envvar might not be present, and even if it's present, the relation of the string length of var to n is not expressed or checked; thus this deserves a report.
But I see your concern, however, I cannot think of valid counterexamples, aka. FPs off the top of my head. I'll come back once I see the real results.

Hmm, you're right that my example is a TP as written; however I could imagine similar FPs where we e.g. check that n < strlen(var) but don't check that n is nonnegative (because that's the responsibility of the caller).

Anyway, let's wait for the test results, they'll be enlightening. (You mostly convinced me that your commit will work, but Murphy's laws are still standing...)

Also note that the error message "Out of bound memory access (index is tainted)" becomes misleading after this patch -- but don't spend too much time on resolving this, because right now I'm working on a complete rewrite the message generation to replace the current spartan messages with useful error reporting.

I agree, but I'm not sure how much more readable something more accurate like "the location where this pointer points to potentially depends on untrusted tainted data". (Note that this would also work for tainted indexes as well).

I must admit, that reporting and diagnostics were low priorities for this patch stack, so I'd focus on addressing logic concerns first for the whole stack and then come back to reporting to make it consistent across the stack.

Improving reporting and diagnostics is also on my TODO list, so we should coordinate progress in that area to avoid redundant work. If you wish to work on it, then I'm happy to review; otherwise I'll do something with it soon (perhaps after you merged these commits to avoid merge conflicts).

steakhal planned changes to this revision.Aug 30 2023, 8:52 AM

I fear that the overzealous handling of tainted data would produce too many false positives in situations when code indexes strings that contain user input and the SA engine cannot understand the logic that calculates the indices. For example if I understand it correctly a function like

char f(int n) {
  char *var = getenv("FOO");
  return var[n];
}

would be reported as an out of bounds memory access, because the region is tainted and the index value is not known. This is especially problematic on the underflow side (which is introduced by your commit), because I'd assume that it's common to have numeric values (e.g. function arguments) where the programmer knows that they're nonnegative, but the analyzer cannot deduce this.

To me, this is a TP. The envvar might not be present, and even if it's present, the relation of the string length of var to n is not expressed or checked; thus this deserves a report.
But I see your concern, however, I cannot think of valid counterexamples, aka. FPs off the top of my head. I'll come back once I see the real results.

Hmm, you're right that my example is a TP as written; however I could imagine similar FPs where we e.g. check that n < strlen(var) but don't check that n is nonnegative (because that's the responsibility of the caller).

Anyway, let's wait for the test results, they'll be enlightening. (You mostly convinced me that your commit will work, but Murphy's laws are still standing...)

I have the results now, I just need to evaluate it. It turns out to require some scripting, as we sooner detect out-of-bounds accesses (by not allowing to form an lvalue to such) the locations change.
Because of this changed location, I have to filter out the cases when the "same line" has a report "disappearing" and "appearing".
The sooner detection also means that I have more paths killed, thus "in the close proximity below the appearing issue some others disappear - if they are dominated".
I have to manually check these because that diff is sort of intentional; to see what's left and look for unintentional cases.

I can already confirm an issue though, similar to the one you showed, and similar variants:

char *str = getenv(name);
if (str && str[0])
  return atoi(str);
return 0;

This sort of code looks good, and I'll think about how to suppress them.
Probably, because of this, we have a huge relative increase for Out of bound memory access (index is tainted) diagnostics (+72%, +447).
The second biggest relative change was for Out of bound memory access (access exceeds upper limit of memory block) (+18%, +1862).
These are preliminary numbers, and I'll continue the validation.

Also note that the error message "Out of bound memory access (index is tainted)" becomes misleading after this patch -- but don't spend too much time on resolving this, because right now I'm working on a complete rewrite the message generation to replace the current spartan messages with useful error reporting.

I agree, but I'm not sure how much more readable something more accurate like "the location where this pointer points to potentially depends on untrusted tainted data". (Note that this would also work for tainted indexes as well).

I must admit, that reporting and diagnostics were low priorities for this patch stack, so I'd focus on addressing logic concerns first for the whole stack and then come back to reporting to make it consistent across the stack.

Improving reporting and diagnostics is also on my TODO list, so we should coordinate progress in that area to avoid redundant work. If you wish to work on it, then I'm happy to review; otherwise I'll do something with it soon (perhaps after you merged these commits to avoid merge conflicts).

I don't plan to work on the checker.

I'll post the result of the evaluation, once I'm happy with the findings. This might need some iterations.

Thanks for the review!
And sorry that I haven't done this evaluation before I posted these for review. Usually, it takes more time for the community to review patches - but this was a pleasant surprise!

I'm thinking of a heuristic like this:

bool IsAtOffsetZero = [ByteOffset] {
  const auto *Int = ByteOffset.getAsInteger();
  return Int && Int->isZero();
}();
// ...
if (state_precedesLowerBound && state_withinLowerBound && !IsAtOffsetZero &&
    isTainted(state, location)) {
  reportTaintOOB(checkerContext, state_precedesLowerBound, location);
  return;
}

It will definitely not work for all cases but should filter out the most annoying ones.

int testTaintedPtr() {
  // Read a pointer from a tainted source and dereference it.
  char *app = getenv("APP_NAME");
  if (!app)
    return -1;
  if (app[0] == '\0') return 0; // no-warning: Allow manual checking for null terminator at offset zero.
  if (app[1] == '\0') return 1; // expected-warning {{Out of bound memory access (index is tainted)}} Any other index than 0 is disallowed.
  if (app[2] == '\0') return 2; // no-warning: The path already sunk.
  return -1;
}
clang/test/Analysis/taint-generic.c
1133–1141

Makes sense.

There are still a few FPs of the kind, where they iterate over the result of getenv in a loop, and continuously checks the character against the zero terminator.
I refined the suppression heuristic as follows:

  • If the offset is zero, don't report taint issue. (as I suggested in the previous heuristic)
  • If the offset is non-zero, calculate the offset for the previous element and check if the value there is proven to be non-zero. If it cannot be zero, don't report this taint issue.

I'll check the results tomorrow.

There are still a few FPs of the kind, where they iterate over the result of getenv in a loop, and continuously checks the character against the zero terminator.
I refined the suppression heuristic as follows:

  • If the offset is zero, don't report taint issue. (as I suggested in the previous heuristic)
  • If the offset is non-zero, calculate the offset for the previous element and check if the value there is proven to be non-zero. If it cannot be zero, don't report this taint issue.

I'll check the results tomorrow.

There are still FPs. I'll refine the heuristic to accept any constraint.

As I thought more about this commit I realized that I have two vague but significant concerns. I'm sorry that I wasn't able to describe these before you started to dig into the details of the heuristics.

(1) I don't think that code like int *ptr; scanf("%ld", &ptr); should be covered by an ArrayBound checker: this is not just a "be careful with the value of the index" issue, but a raw pointer that's completely controlled by an untrusted source. I'd try to cover this kind of bug with either the StdLibrary checker or a small new checker [or perhaps just a clang-tidy check] which would report pointers coming from scanf or other user input even if it doesn't see a dereference of the pointer value.

(2) Taintedness of a memory region (i.e. isTainted(state, location) where location is presumably a MemRegionVal) is poorly defined and might mean three different things:
(a) The 'begin' pointer value of the region is tainted (controlled by an untrusted source). This may be a "we're already dead" situation (e.g. the pointer is an arbitrary value from scanf) or something completely harmless (we are examining something like the second subscript operator in large_array[user_input].field[idx] after verifying that the first subscript with the tainted index is valid), but in either case, the reliability of the 'begin' pointer should not affect the comparison between the region extent and the index value.
(b) The extent of the region is tainted (controlled by an untrusted source, e.g. it's an user input string with an unknown length). In this case it's reasonable to turn on the "paranoid" comparison mode -- however, this situation should be recorded by marking the extent symbol of the region as tainted instead of marking the region itself as tainted. I'd be happy to see a commit that ensures that the extent symbol of user input strings is tainted and its taintedness is handled in ArrayBoundsV2.
(c) The contents of the region are tainted (controlled by an untrusted source). Arguably this should be recorded by marking the contents as tainted, but that's difficult to implement on a string/array, so it's a somewhat reasonable shortcut to put the taint on the region itself. This kind of taint is completely irrelevant for out of bounds memory access and should not affect the comparisons.

In summary, I'd say that this checker should not be affected by the taintedness of the memory region itself (but it should continue to monitor tainted indices, and if possible, it should be extended to handle tainted extent values). I'm not completely opposed to merging this patch if you add enough heuristics and workarounds to make it stable in practice, but I feel that this is not the right way forward.

It's fortunate that we'll meet in person on Monday because there we can discuss this topic (and other plans) in more detail.

As I thought more about this commit I realized that I have two vague but significant concerns. I'm sorry that I wasn't able to describe these before you started to dig into the details of the heuristics.

(1) I don't think that code like int *ptr; scanf("%ld", &ptr); should be covered by an ArrayBound checker: this is not just a "be careful with the value of the index" issue, but a raw pointer that's completely controlled by an untrusted source. I'd try to cover this kind of bug with either the StdLibrary checker or a small new checker [or perhaps just a clang-tidy check] which would report pointers coming from scanf or other user input even if it doesn't see a dereference of the pointer value.

(2) Taintedness of a memory region (i.e. isTainted(state, location) where location is presumably a MemRegionVal) is poorly defined and might mean three different things:
(a) The 'begin' pointer value of the region is tainted (controlled by an untrusted source). This may be a "we're already dead" situation (e.g. the pointer is an arbitrary value from scanf) or something completely harmless (we are examining something like the second subscript operator in large_array[user_input].field[idx] after verifying that the first subscript with the tainted index is valid), but in either case, the reliability of the 'begin' pointer should not affect the comparison between the region extent and the index value.
(b) The extent of the region is tainted (controlled by an untrusted source, e.g. it's an user input string with an unknown length). In this case it's reasonable to turn on the "paranoid" comparison mode -- however, this situation should be recorded by marking the extent symbol of the region as tainted instead of marking the region itself as tainted. I'd be happy to see a commit that ensures that the extent symbol of user input strings is tainted and its taintedness is handled in ArrayBoundsV2.
(c) The contents of the region are tainted (controlled by an untrusted source). Arguably this should be recorded by marking the contents as tainted, but that's difficult to implement on a string/array, so it's a somewhat reasonable shortcut to put the taint on the region itself. This kind of taint is completely irrelevant for out of bounds memory access and should not affect the comparisons.

In summary, I'd say that this checker should not be affected by the taintedness of the memory region itself (but it should continue to monitor tainted indices, and if possible, it should be extended to handle tainted extent values). I'm not completely opposed to merging this patch if you add enough heuristics and workarounds to make it stable in practice, but I feel that this is not the right way forward.

I see your point, makes sense and I'd agree.
Probably, by another checker/check, we could indeed sidestep the problem by handling these situations separately. I'll think about it.

To me, the whole boils down to null-terminated buffers, such as one returned by getenv.
There, we have an implicit dependency between the extent of the region and the position of the null-terminator inside; as such it affects what locations are allowed to be read.
Locations are usually checked by the array-bounds checker (V1 xor V2). Consequently, it's still not clear to me if this falls in/out of this checker's responsibility.
We don't have many APIs like getenv, but users could define custom rules that would cause similar problems, thus ATM only getenv suffers from these new FPs.

The fact that isTainted() could mean so many things, as you enumerated has 3 different meanings, which is a problem on its own. The design decision was to make it deliberately vague to let users express their custom APIs more easily in their yaml files.

The measurement is still running, and I'll post a few words about that once I had a look; but anyway, I feel that this patch is controversial and needs more discussion before it could land.
I plan to abandon this and think about it. We can continue the discussion of course.

steakhal added a comment.EditedSep 1 2023, 5:12 AM

The measurement is still running, and I'll post a few words about that once I had a look; but anyway, I feel that this patch is controversial and needs more discussion before it could land.
I plan to abandon this and think about it. We can continue the discussion of course.

Here are the results after applying the suppression heuristic for getenv, plus D159108 and D159109:

  • =X: Given message appeared X times in baseline.
  • +X (+Y%): Patch added X new reports with this message, and that is Y percent increase compared to the baseline.
  • -X (-Y%): Removed X reports with this message, and that is Y percent decrease compared to the baseline.
* Out of bound memory access (accessed memory precedes memory block)
  =1221, +0 (+0%), -4 (-0%)
* Out of bound memory access (access exceeds upper limit of memory block)
  =10547, +7 (+0%), -5 (-0%)
* Out of bound memory access (index is tainted)
  =624, +71 (+11%), -1 (-0%)

* Memory copy function overflows the destination buffer
  =436, +45 (+10%), -1 (-0%)
* Memory copy function accesses out-of-bound array element
  =443, +19 (+4%), -1 (-0%)
* Memory comparison function accesses out-of-bound array element
  =102, +1 (+1%), -1 (-1%)
* Memory set function overflows the destination buffer
  =119, +10 (+8%), -0 (-0%)
* String copy function overflows the destination buffer
   =42, +3 (+7%), -0 (-0%)

* This was not the most recently acquired lock. Possible lock order reversal
  =81, +0 (+0%), -1 (-1%)
* Division by zero
  =775, +1 (+0%), -0 (-0%)
* Called C++ object pointer is null
  =3719, +0 (+0%), -2 (-0%)
* Potential leak of memory pointed to by
  =1996, +0 (+0%), -2 (-0%)
* Returned pointer value points outside the original object (potential buffer overflow)
  =625, +2 (+0%), -0 (-0%)

I think the mentioned heuristic works quite well, and really decreased the number of FPs for Out of bound memory access (index is tainted).
I still don't want to commit to this patch, so I'm abandoning this.

I don't plan to repeat the measurement for D159108 and D159109, as the numbers related to those messages look nice.
I had a look at the reports for those, and as expected it's barely impossible to make sense of them. But at least nothing dumb stood out.
If you are okay, I'll commit D159108 and D159109 later today. @donat.nagy


EDIT: The evaluation was done on a large and varied set of projects (180+).

steakhal abandoned this revision.Sep 11 2023, 5:24 AM