This patch adds support for the following libc functions in DFSAN: strnlen, strncat, strsep and sscanf. It further enables _tolower via the options file.
Details
Diff Detail
- Repository
- rCRT Compiler Runtime
Event Timeline
Hello! Thank you for the patch.
I think your changes for __dfsw_strsep are conflating the taint value of a pointer with the taint value of the value(s) it points to.
This is subtle - I believe I made the same mistake when I first worked with the DFSan code :)
With this in mind, please check the other functions.
Since this is a large patch and string manipulations need careful review, please send separate changes for each group of functions (e.g. first patch is __dfsw_strsep + __dfso_strsep + tests for this code).
| compiler-rt/lib/dfsan/dfsan_custom.cpp | ||
|---|---|---|
| 213 | The s_label represents the taint label for s (the pointer). This line would clobber the taint label of the pointer (s) with a taint label from s[0][0..n]. I think this line should be deleted. | |
| 214 | I think res && (res != base) is never true. Checking an implementation of strsep (http://git.musl-libc.org/cgit/musl/tree/src/string/strsep.c) I think this line should be (res != *s). This is different because *s may be changed by the call to strsep just above. | |
| 239 | Delete this line. Like above, s_origin represents the origin value for the pointer s, not the origin value for for data in base[0..n]. | |
| compiler-rt/test/dfsan/custom.cpp | ||
| 1638 | Also | |
| 1643 | Also | |
| 1645 | ASSERT_LABEL(rv, dfsan_union(some_different_label_x, n_label)); | |
| 1649 | Also | |
| 1654 | Change this to ASSERT_READ_LABEL(rv, strlen(rv), m_label); | |
| 1657 | some_different_label_y would also be set on rv. | |
Hello browneee,
Thank you very much for the comments! I will take a closer look at the issue you pointed out.
Do you prefer me to send a couple of diffs (one per each function) at once or one by one?
Kind regards,
Please send separate changes one by one (e.g. first patch would be for strsep alone and would include several functions of implementation __dfsw_strsep + __dfso_strsep + test code for this).
| compiler-rt/lib/dfsan/dfsan_custom.cpp | ||
|---|---|---|
| 213 | Agree, s_label represents the taint associated with the **s pointer. However I am now wondering if that is the taint wich we would like to return. *ret_label = res ? s_label : 0; We would taint the return value with the value of the pointer, not the data. It means that if we operate on a string for which the characters are tainted, but the pointer itself isn't, we are likely going to return label 0. My understanding was that we are more concerned with the taint of the data, not the pointer, am I missing something? | |
| 214 | good point, my bad | |
| compiler-rt/lib/dfsan/dfsan_custom.cpp | ||
|---|---|---|
| 213 | Yes, we are usually more concerned with the taint of the data, not the pointer. With strict dependencies: // If the input data is tainted, the output data would be tainted (because it is derived from the input data). Because s[0] == ret (or ret==null), (for the non-null case) the output shadow bytes are the same bytes as input shadow bytes and so these taint labels for the string data in shadow memory do not need to be explicitly propagated in this function. I think the only case actually changing/copying string data is writing a delimiter byte to NULL, which you handled. | |
| compiler-rt/lib/dfsan/dfsan_custom.cpp | ||
|---|---|---|
| 213 | I am working on the changes and I came across a strange behavior that I would like to ask about. It turned out that if we do char *s = " ... "; Then, the s_label within the handler is 0, not "label". This is unexpected, as we would like the pointer itself to be labeled here. | |
| compiler-rt/lib/dfsan/dfsan_custom.cpp | ||
|---|---|---|
| 213 | I'm not sure what p_s is in your example. Could you provide a more complete example? Here is an example showing taint labels at different levels of indirection: #include <assert.h>
#include <string.h>
#include <sanitizer/dfsan_interface.h>
int main(void) {
  char *s = " ... ";
  char **p_s = &s;
  char ***pp_s = &p_s;
  dfsan_label i_label = 1 << 0;
  dfsan_label j_label = 1 << 1;
  dfsan_label k_label = 1 << 2;
  dfsan_label m_label = 1 << 3;
  // string data
  dfsan_set_label(i_label, s, strlen(s));
  // pointer s
  dfsan_set_label(j_label, &s, sizeof(s));
  // pointer to pointer s
  dfsan_set_label(k_label, &p_s, sizeof(p_s));
  // pointer to pointer to pointer s
  dfsan_set_label(m_label, &pp_s, sizeof(pp_s));
  assert(pp_s[0][0] == s);
  // string data
  assert(dfsan_read_label(s, strlen(s)) == i_label);
  // pointer s
  assert(dfsan_read_label(&s, sizeof(s)) == j_label);
  // pointer to pointer s
  assert(dfsan_read_label(&p_s, sizeof(p_s)) == k_label);
  // pointer to pointer to pointer s
  assert(dfsan_read_label(&pp_s, sizeof(pp_s)) == m_label);
  return 0;
} | |
| compiler-rt/lib/dfsan/dfsan_custom.cpp | ||
|---|---|---|
| 213 | Hello, Thank you for the comment. I should have provided a more complete example. char *s = "Hello world/"; char *delim = " /"; dfsan_set_label(label, &s, sizeof(&s)); char *rv = strep(&s, delim); If I get this right, the line dfsan_set_label(label, &s, sizeof(&s)); should have applied a taint to the s pointer. SANITIZER_INTERFACE_ATTRIBUTE char *__dfsw_strsep(char **s, const char *delim,
                                                  dfsan_label s_label,
                                                  dfsan_label delim_label,
                                                  dfsan_label *ret_label) {
    fprintf(stderr, "s_label = %d\n", s_label); // -> this prints out "0"If we do exactly the same trick with the delim pointer, the associated delim_label will be equal to label, as expected. It seems to me that either I am missing some important point here or the s_label is not propagated correctly by the compiler runtime for a double pointer to the user-provided handler (in my case __dfsw_strsep). I believe that strsep is currently the first function within dfsan_custom.cpp in which we operate on or check the label of the double pointer. I would like to ask if you maybe came across this behavior. It might be that I misunderstood how this should work and would be grateful for your help. | |
| compiler-rt/lib/dfsan/dfsan_custom.cpp | ||
|---|---|---|
| 213 | 
 
 The difference is &s adds another level of indirection, delim does not I haven't tested, but I expect this example would work... char *s = "Hello world/";
char **ps = &s;
char *delim = " /";
dfsan_set_label(label, &ps, sizeof(&ps));
char *rv = strep(ps, delim);
// Inside __dfsw_strsep ...
    fprintf(stderr, "s_label = %d\n", s_label); // should now print nonzero label | |
| compiler-rt/lib/dfsan/dfsan_custom.cpp | ||
|---|---|---|
| 213 | You're right, thank you! | |
| compiler-rt/lib/dfsan/dfsan_custom.cpp | ||
|---|---|---|
| 221 | When res != NULL, then res is derived from *s, not from s. e.g. *ret_label = res ? dfsan_get_label(base) : 0; | |
| 224 | I think dfsan_get_label(base) should also be a part of this output. | |
| 241 | As above, when res != NULL, then res is derived from *s, not from s. if (res)
    *ret_origin = dfsan_get_origin(base); | |
| 243 | As above, I think dfsan_get_origin(base) should be the first priority, as it is the origin with the most direct value flow. Note that dfsan_get_origin(base) gives the origin id for base whereas dfsan_read_origin_of_first_taint(base, ...) gives origin id for *base. | |
| compiler-rt/test/dfsan/custom.cpp | ||
| 1659 | I think this test can call strsep fewer times. It just needs to set all the different labels first. Different inputs to taint (each one should be a distinct taint label): 
 This is less than 8, so we can do it all at once (unless you want to test different taint labels on different bytes of the input strings). | |
| compiler-rt/lib/dfsan/dfsan_custom.cpp | ||
|---|---|---|
| 221 | Apologies for a delay. I came across some difficulty with using dfsan_get_label inside the dfsan_custom.cpp file. | |
| compiler-rt/lib/dfsan/dfsan_custom.cpp | ||
|---|---|---|
| 221 | char *base = *s; Here base is loaded, so we can also load the shadow. dfsan_get_label(base) == dfsan_read_label(s, sizeof(*s)) If base was an argument, then it would have a corresponding label argument as part of the wrapper function, so you also wouldn't need dfsan_get_label. | |
there is a strange build error which seems unrelated to my change - please let me know if that's an issue, I will try to rebase to newest master then
We're getting really close!
Yes, that build error looks unrelated. Someone should fix it soon.
| compiler-rt/lib/dfsan/dfsan_custom.cpp | ||
|---|---|---|
| 221 | base, sizeof(base) does not make sense - the size does not correspond to the pointer used. Either dfsan_read_label(base, sizeof(*base)) // first byte pointed to by base dfsan_read_label(base, strlen(base)) // whole string pointed to by base dfsan_read_label(&base, sizeof(base)) // the base pointer In this case I think we want the base pointer. dfsan_read_label(&base, sizeof(base))  // the base pointer Lets go with the second option to avoid taking the address of a variable. This is semantically equivalent to my first suggestion: dfsan_get_label(base) == dfsan_read_label(&base, sizeof(base)) == base_label. | |
| 224 | Also use base_label here. | |
| 240 | dfsan_get_origin(base) == dfsan_read_origin_of_first_taint(&base, sizeof(base)) As noted above base, strlen(base) is a meaningfully valid pointer, length - but it is not the level of indirection we want here. I think we want the same solution as above. dfsan_origin base_origin = dfsan_read_origin_of_first_taint(s, sizeof(*s)) at the start of the function, just after declaring and assigning base. | |
| 243 | Also use base_origin here. | |
| compiler-rt/test/dfsan/custom.cpp | ||
| 1641 | remove the & It still works because sizeof(pointer_var) == sizeof(&pointer_var), but it doesn't logically match up like it should. (Sorry, this one is my fault - I made this mistake giving an example in an earlier comment.) | |
| 1645 | remove & in sizeof | |
| 1650 | The value rv has a data flow from the the string pointer, not the string bytes. This is related to line 221 (using base instead of &base) in the other file. | |
| 1654 | rv is a pointer, and I think it's origin should match the pointer s, not the bytes in the string. I think this should be: ASSERT_INIT_ORIGIN_EQ_ORIGIN(&rv, s); This is related to line 240 (*ret_origin = dfsan_read_origin_of_first_taint(base, strlen(base));) in the other file being the wrong level of indirection. (Side note: the existing ASSERT_INIT_ORIGIN_EQ_ORIGIN macro feels a bit odd in that the arguments are at different levels of indirection - but not something to fix as part of this change) | |
| 1660 | remove & in sizeof | |
| compiler-rt/lib/dfsan/dfsan_custom.cpp | ||
|---|---|---|
| 221 | Thank you for your comments! Just to clarify: In other words, would that be correct to: 
 | |
| compiler-rt/lib/dfsan/dfsan_custom.cpp | ||
|---|---|---|
| 221 | 
 No, *ret_label holds the taint label for the return value, which is a pointer. The return value pointer is derived from *s, aka base. This is the pointer to the string contents, not the string contents themselves. 
 I think we should only use base_label here. 
 Correct. 
 No, other way around. We want the base_label, but not the string data. 
 Yes, we could have both base_label and the taints of the string data in the else case. | |
Thank you, I have uploaded a new diff with "&" fixed in the test function.
How should be proceed with the other functions once strsep is done? Should I open another review or add the code to the current patch?
Separate review.
Keep patches as small as possible, do one review for each patch.
If you'd like me to submit it for you, please provide your email to credit the patch to.
Got it, thank you.
Please submit the patch for e-mail t.kuchta@samsung.com
I will prepare next reviews for the other functions.
Thank you!
The s_label represents the taint label for s (the pointer).
This line would clobber the taint label of the pointer (s) with a taint label from s[0][0..n].
I think this line should be deleted.