This is an archive of the discontinued LLVM Phabricator instance.

[asan] Add an interceptor for strtok
ClosedPublic

Authored by mrigger on Feb 26 2017, 11:15 AM.

Diff Detail

Event Timeline

mrigger created this revision.Feb 26 2017, 11:15 AM
ygribov edited edge metadata.Feb 27 2017, 1:45 AM

I suggest to regenerate patch with -U99999, to get more context.

asan_interceptors.cc
549

Please use ASAN_READ_STRING instead, similar to other string interceptors. str_size should be checked only if strict_strings is enabled, otherwise you can only check range up to first delimiter (inclusively).

tests/asan_str_test.cc
118

Better avoid aliasing, strtok arguments are restricted (more below).

121

delim is somewhat misleading name cause it's also used in other context here.

123

What does this add to previous tests?

mrigger updated this revision to Diff 89941.Feb 27 2017, 2:57 PM

Thanks a lot for your feedback @ygribov! I improved the test case based on it. Additionally, I changed the interceptor to use ASAN_READ_STRING_OF_LEN instead of directly calling ASAN_READ_RANGE. However, I think I have not yet understood why using strict_strings should only verify the string until the first delimiter. If I select the passed n to be smaller than strlen(str) + 1 then the test case fails, since it does not detect the overflow. What did I understand wrong? Should I instead implement the test case in projects/compiler-rt/test/asan/TestCases/ and set RUN: %env_asan_opts=strict_string_checks=true? If yes, should I select n to be strlen(str) (like the other interceptors) or 1?

kcc added inline comments.Feb 27 2017, 5:22 PM
lib/asan/asan_interceptors.cc
541 ↗(On Diff #89941)

This should go to lib/sanitizer_common/sanitizer_common_interceptors.inc so that other sanitizers can benefit

However, I think I have not yet understood why using strict_strings should only verify the string until the first delimiter.

I actually learned this lesson the hard way. C/C++ standards require arguments of standard string functions (strcpy, strchr`, etc.) to be strings i.e. zero-terminated arrays. But unfortunately a lot of existing code seems to ignore this rule and also call string functions on arbitrary C arrays. For example

char a[10];
memcpy(a, "11112", 5);  // Non-zero terminated
strchr(a, '2');  // Works in practice, even though a isn't a string, because strchr only checks first 5 chars

By default we'd like to not warn about uninteresting errors so we only check string prefix (unless user explicitly asks us with strict_strings runtime flag).

If I select the passed n to be smaller than strlen(str) + 1 then the test case fails, since it does not detect the overflow. What did I understand wrong?

Yes, I guess the test needs to be updated. You can take a look at how other string tests are done (e.g. test/asan/TestCases//strstr*.c, you'll probly need __asan_poison_memory_region hack).

mrigger updated this revision to Diff 91474.Mar 11 2017, 8:32 AM

Thanks for feedback @ygribov and @kcc and sorry for the delay in addressing the issues in the change! I moved the implementation of the interceptor to lib/sanitizer_common/sanitizer_common_interceptors.inc, fixed the issues that existed when having strict_string_checks disabled, and improved the test case.

kcc edited edge metadata.

LGTM, but I'd like Aleksey to take a closer look.
and thanks!

alekseyshl requested changes to this revision.Mar 14 2017, 6:17 PM
alekseyshl added inline comments.
lib/sanitizer_common/sanitizer_common_interceptors.inc
475 ↗(On Diff #91474)

#if SANITIZER_INTERCEPT_STRTOK and define it to 1 for now.

485 ↗(On Diff #91474)

Why are we not checking result string here? When non-null, it's expected to be a part of the original str and to be readable and valid, right?

489 ↗(On Diff #91474)

What if your first call to strtok returns nullptr?

491 ↗(On Diff #91474)

result_length + 1); and indent it to align with other arguments

494 ↗(On Diff #91474)

First, it should be ..., del_length, del_length + 1); and second, it does not make sense to use COMMON_INTERCEPTOR_READ_STRING_OF_LEN here,

COMMON_INTERCEPTOR_READ_RANGE(ctx, delimiters, REAL(strlen)(delimiters) + 1);

would be enough.

499 ↗(On Diff #91474)

I'd move it below strcasestr interceptor (strstr and strcasestr are related).

test/asan/TestCases/strtok.c
1 ↗(On Diff #91474)

Why not have strtok-1.c, strtok-2.c etc. files? Tests would be quite a bit more readable.

22 ↗(On Diff #91474)

Why other tests do not have comments? Please split them into separate files and add a top line comment explaining what exactly is tested.

This revision now requires changes to proceed.Mar 14 2017, 6:17 PM
kcc added inline comments.Mar 14 2017, 6:29 PM
test/asan/TestCases/strtok.c
22 ↗(On Diff #91474)

Not worth it, IMHO
But what I did miss is that this test does "%clang_asan %s -o %t" 5 times, while 1 is enough.

alekseyshl added inline comments.Mar 14 2017, 6:44 PM
test/asan/TestCases/strtok.c
12 ↗(On Diff #91474)

And, if we go with one file, please wrap these long lines like this:

RUN: ... | \
RUN: FileCheck ...

22 ↗(On Diff #91474)

Ok, fine, one file is acceptable too, but still, comments would be much welcomed!

mrigger updated this revision to Diff 91904.Mar 15 2017, 11:17 AM
mrigger edited edge metadata.

Thanks for the feedback @alekseyshl! Check the in-line comments to see which issues I already fixed, and to which ones I still have questions.

lib/sanitizer_common/sanitizer_common_interceptors.inc
475 ↗(On Diff #91474)

Done.

485 ↗(On Diff #91474)

Other interceptors also do not check the result of the libc functions. See for example strchr or strpbrk. I think all interceptors are based on the assumption that the libc functions do not contain any errors.

489 ↗(On Diff #91474)

Good catch. I fixed the error and added a test case.

491 ↗(On Diff #91474)

Done.

494 ↗(On Diff #91474)

I implemented it like you suggested in my first revision. The current implementation is based on @ygribov's feedback and his explanations about strict_strings. Did I understand them wrong?

499 ↗(On Diff #91474)

Done.

test/asan/TestCases/strtok.c
1 ↗(On Diff #91474)

I agree that the tests would be more readable. However, I observed that other larger *.str test cases follow this structure; for example, see strtol_strict, strncasecmp_strict, and strncmp_strict.

12 ↗(On Diff #91474)

The other test cases, from which I copied this pattern, are all written in one line. Should I still break the existing "convention"?

22 ↗(On Diff #91474)

I added a comment for each test case. Additionally, the test is now only compiled once.

alekseyshl requested changes to this revision.Mar 15 2017, 11:43 AM

Awesome, thank you for doing it!

lib/sanitizer_common/sanitizer_common_interceptors.inc
485 ↗(On Diff #91474)

Ah, right, thanks.

494 ↗(On Diff #91474)

You're explicitly calling strlen on that string, it will look for the \0 anyway. Check what COMMON_INTERCEPTOR_READ_STRING_OF_LEN is doing, substitute your arguments.

Your last argument being just 1 should definitely be fixed, now you're checking just the first char, right?

test/asan/TestCases/strtok.c
12 ↗(On Diff #91474)

Style evolves. This is not critical, especially with your last change, lines are shorter and easier to parse, but still, I prefer to keep all lines under 80 chars. If it is not too big of a problem, I'd appreciate breaking those exceeding 80 chars at FileCheck, as I suggested, and please indent FileCheck:

RUN: FileCheck ...

This revision now requires changes to proceed.Mar 15 2017, 11:43 AM
mrigger added inline comments.Mar 15 2017, 1:18 PM
lib/sanitizer_common/sanitizer_common_interceptors.inc
494 ↗(On Diff #91474)

Note, that we use the not-intercepted strlen. Following, an out-of-bounds access might not result in an error here.

Yes, we only check the first delimiter (for strict_string_checks=0), as @ygribov suggested, if I understood him right. @ygribov, can you please comment to clear things up?

Here is a demonstration of the behavior on an example:

#include <stdio.h>
#include <string.h>

int main() {
    char buf[] = "abcb";
    char del[] = {'b'}; // missing NULL terminator
    char* result = strtok(buf, del);
    printf("%s\n", result);
}

Executing the program with ASAN_OPTIONS=strict_string_checks=1 ./a.out lets ASan output an error. However, executing with ASAN_OPTIONS=strict_string_checks=0 ./a.out does not result in an error.

Here is also the documentation for strict_string_checks:

	strict_string_checks
		- If set check that string arguments are properly null-terminated

I'm happy to change it, but I want to be sure that we implement the expected behavior.

alekseyshl added inline comments.Mar 15 2017, 1:53 PM
lib/sanitizer_common/sanitizer_common_interceptors.inc
494 ↗(On Diff #91474)

The code I proposed uses non-intercepted strlen as well:

COMMON_INTERCEPTOR_READ_RANGE(ctx, delimiters, REAL(strlen)(delimiters) + 1);

The problem I'm trying to point out is not about the first delimiter (that's fine, but it's about the other COMMON_INTERCEPTOR_READ_STRING_OF_LEN use, the one above), it's that you pass n=1 to your second macro, this one:

COMMON_INTERCEPTOR_READ_STRING_OF_LEN(ctx, delimiters, del_length, 1);

which means "1 char". I believe, it was not your intention.

ygribov added inline comments.Mar 15 2017, 2:56 PM
lib/sanitizer_common/sanitizer_common_interceptors.inc
500 ↗(On Diff #91904)

Bind star to ctx?

502 ↗(On Diff #91904)

I'd suggest to handle !intercept_strtok case here as this will allow to reduce function nesting. It's minor though.

508 ↗(On Diff #91904)

Strtok will return NULL or zero-terminated string so you should be able to do plain COMMON_INTERCEPTOR_READ_RANGE on str (probably only when strict_strings is off as otherwise str has been completely checked on the first call).

510 ↗(On Diff #91904)

I agree with Alex - we shouldn't call strlen here, because if str isn't zero-terminated we risk to trigger runtime error. The best we can do is

COMMON_INTERCEPTOR_READ_STRING(ctx, str, 1);

Note that 1 is suboptimal here - we could have used strspn to find the first occurence of delimeters and use that to better estimate the length.

515 ↗(On Diff #91904)

Same to line 508

COMMON_INTERCEPTOR_READ_RANGE(ctx, result, REAL(strlen)(result) + 1);

?

516 ↗(On Diff #91904)

I think else-branch shouldn't be neglected - you can safely do

COMMON_INTERCEPTOR_READ_RANGE(ctx, str, REAL(strlen) + 1)

(we know that none of delimeters has been found so str must have been scanned to the terminating 0).

519 ↗(On Diff #91904)

Same issue with strlen here. Just do

COMMON_INTERCEPTOR_READ_STRING(ctx, delimeters, 1);

(1 is again suboptimal).

494 ↗(On Diff #91474)

Sorry guys, I was out. Frankly I agree with @alekseyshl in that we shouldn't unconditionally call strlen on str and delimeters. They are not necessarily null-terminated so by calling REAL(strlen) when strict_strings is off, you risk triggering a runtime error.

See my suggestions below for how to change the code to avoid strlens.

ygribov added inline comments.Mar 15 2017, 2:59 PM
lib/sanitizer_common/sanitizer_common_interceptors.inc
494 ↗(On Diff #91474)

Your last argument being just 1 should definitely be fixed,
now you're checking just the first char, right?

We can't easily check for more when strict_strings is off (unless we reimplement most of strtok in interceptor).

mrigger updated this revision to Diff 92059.Mar 16 2017, 2:35 PM
mrigger edited edge metadata.

Thanks again for the review @ygribov and @alekseyshl! I believe that I've addressed all the issues that you raised. @ygribov, I followed your suggestions and also split the strict_string_checks and !strict_string_checks paths since I think that the code is more readable like this.

lib/sanitizer_common/sanitizer_common_interceptors.inc
500 ↗(On Diff #91904)

Sorry, I didn't get this comment. Should I change something in the code?

test/asan/TestCases/strtok.c
12 ↗(On Diff #91474)

Done. Note that I also removed intercept_strtok=false since it is not necessary in the current implementation.

alekseyshl accepted this revision.Mar 17 2017, 2:29 PM
alekseyshl added inline comments.
lib/sanitizer_common/sanitizer_common_interceptors.inc
523 ↗(On Diff #92059)

Shouldn't this also use COMMON_INTERCEPTOR_READ_RANGE?

529 ↗(On Diff #92059)

If I get the earlier @ygribov's suggestion correctly, we should check the else branch too:

if (result != nullptr) {
  COMMON_INTERCEPTOR_READ_RANGE(ctx, result, REAL(strlen)(result) + 1);
} else if (str != nullptr) {
  // No delimiter were found, it's safe to assume that the entire str was scanned.
  COMMON_INTERCEPTOR_READ_RANGE(ctx, str, REAL(strlen)(str) + 1);
}
This revision is now accepted and ready to land.Mar 17 2017, 2:29 PM
mrigger updated this revision to Diff 92215.Mar 17 2017, 4:00 PM

Thanks for all the help! Is there still something that we should improve before merging the change?

lib/sanitizer_common/sanitizer_common_interceptors.inc
523 ↗(On Diff #92059)

Good point! It is clearer to use COMMON_INTERCEPTOR_READ_RANGE here.

529 ↗(On Diff #92059)

Right, I forgot about that case.

Looks good, thanks!

Can someone with commit access commit the change?

Okay, thanks!

This revision was automatically updated to reflect the committed changes.