This is an archive of the discontinued LLVM Phabricator instance.

[DFSAN] Add support for strnlen, strncat, strsep, sscanf and _tolower
ClosedPublic

Authored by tkuchta on Jan 10 2023, 7:51 AM.

Details

Summary

This patch adds support for the following libc functions in DFSAN: strnlen, strncat, strsep and sscanf. It further enables _tolower via the options file.

Diff Detail

Event Timeline

tkuchta created this revision.Jan 10 2023, 7:51 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 10 2023, 7:51 AM
Herald added a subscriber: Enna1. · View Herald Transcript
tkuchta requested review of this revision.Jan 10 2023, 7:51 AM
tkuchta updated this revision to Diff 487936.Jan 10 2023, 12:04 PM

applied newer version of clang-format

browneee requested changes to this revision.Jan 10 2023, 6:10 PM

Hello! Thank you for the patch.

I think your changes for __dfsw_strsep are conflating the taint value of a pointer with the taint value of the value(s) it points to.
This is subtle - I believe I made the same mistake when I first worked with the DFSan code :)

With this in mind, please check the other functions.

Since this is a large patch and string manipulations need careful review, please send separate changes for each group of functions (e.g. first patch is __dfsw_strsep + __dfso_strsep + tests for this code).

compiler-rt/lib/dfsan/dfsan_custom.cpp
213

The s_label represents the taint label for s (the pointer).

This line would clobber the taint label of the pointer (s) with a taint label from s[0][0..n].

I think this line should be deleted.

214

I think res && (res != base) is never true.

Checking an implementation of strsep (http://git.musl-libc.org/cgit/musl/tree/src/string/strsep.c)
res can either be NULL or the same value as base.

I think this line should be (res != *s). This is different because *s may be changed by the call to strsep just above.

239

Delete this line.

Like above, s_origin represents the origin value for the pointer s, not the origin value for for data in base[0..n].

compiler-rt/test/dfsan/custom.cpp
1641

Also
dfsan_set_label(<some_different_label_x>, &p_delim, sizeof(&p_delim));

1646

Also
ASSERT_READ_ZERO_LABEL(rv, strlen(rv))

1648

ASSERT_LABEL(rv, dfsan_union(some_different_label_x, n_label));

1652

Also
dfsan_set_label(<some_different_label_y>, &p_s, sizeof(&p_s));

1657

Change this to

ASSERT_READ_LABEL(rv, strlen(rv), m_label);
ASSERT_LABEL(rv, some_different_label_y);

1660

some_different_label_y would also be set on rv.

This revision now requires changes to proceed.Jan 10 2023, 6:10 PM

Hello browneee,
Thank you very much for the comments! I will take a closer look at the issue you pointed out.
Do you prefer me to send a couple of diffs (one per each function) at once or one by one?

Kind regards,

Please send separate changes one by one (e.g. first patch would be for strsep alone and would include several functions of implementation __dfsw_strsep + __dfso_strsep + test code for this).

tkuchta added inline comments.Jan 16 2023, 5:51 AM
compiler-rt/lib/dfsan/dfsan_custom.cpp
213

Agree, s_label represents the taint associated with the **s pointer. However I am now wondering if that is the taint wich we would like to return.
For example, if we have
if (flags().strict_data_dependencies) {

*ret_label = res ? s_label : 0;

We would taint the return value with the value of the pointer, not the data. It means that if we operate on a string for which the characters are tainted, but the pointer itself isn't, we are likely going to return label 0. My understanding was that we are more concerned with the taint of the data, not the pointer, am I missing something?

214

good point, my bad

browneee added inline comments.Jan 17 2023, 12:35 AM
compiler-rt/lib/dfsan/dfsan_custom.cpp
213

Yes, we are usually more concerned with the taint of the data, not the pointer.

With strict dependencies:
// If the input pointer is tainted, the output pointer would be tainted (because it is derived from the input pointer - maybe the same value).
taint(s[0]) == dfsan_read_label(s, sizeof(s)) ====> taint(ret) == ret_label[0]

// If the input data is tainted, the output data would be tainted (because it is derived from the input data).
taint(s[0][0]) == MEM_TO_SHADOW(s[0])[0] ====> taint(ret[0]) == MEM_TO_SHADOW(ret)[0]

Because s[0] == ret (or ret==null), (for the non-null case) the output shadow bytes are the same bytes as input shadow bytes and so these taint labels for the string data in shadow memory do not need to be explicitly propagated in this function.

I think the only case actually changing/copying string data is writing a delimiter byte to NULL, which you handled.

tkuchta added inline comments.Jan 30 2023, 12:15 PM
compiler-rt/lib/dfsan/dfsan_custom.cpp
213

I am working on the changes and I came across a strange behavior that I would like to ask about.

It turned out that if we do

char *s = " ... ";
dfsan_set_label(label, &p_s, sizeof(&p_s));

Then, the s_label within the handler is 0, not "label". This is unexpected, as we would like the pointer itself to be labeled here.
I believe this has something to do with the fact that the input string in strsep is a double pointer. For example this works as expected for the delimiter string, which is a single pointer.
It's either I'm doing something incorrectly or there is some issue with propagating labels for double pointers.
Have you perhaps come across this behavior before?

browneee added inline comments.Feb 2 2023, 4:52 AM
compiler-rt/lib/dfsan/dfsan_custom.cpp
213

I'm not sure what p_s is in your example. Could you provide a more complete example?
(and maybe all in one function, not half in __dfsw_strsep and half in another function)

Here is an example showing taint labels at different levels of indirection:

#include <assert.h>
#include <string.h>
#include <sanitizer/dfsan_interface.h>

int main(void) {
  char *s = " ... ";
  char **p_s = &s;
  char ***pp_s = &p_s;

  dfsan_label i_label = 1 << 0;
  dfsan_label j_label = 1 << 1;
  dfsan_label k_label = 1 << 2;
  dfsan_label m_label = 1 << 3;

  // string data
  dfsan_set_label(i_label, s, strlen(s));
  // pointer s
  dfsan_set_label(j_label, &s, sizeof(s));
  // pointer to pointer s
  dfsan_set_label(k_label, &p_s, sizeof(p_s));
  // pointer to pointer to pointer s
  dfsan_set_label(m_label, &pp_s, sizeof(pp_s));

  assert(pp_s[0][0] == s);

  // string data
  assert(dfsan_read_label(s, strlen(s)) == i_label);
  // pointer s
  assert(dfsan_read_label(&s, sizeof(s)) == j_label);
  // pointer to pointer s
  assert(dfsan_read_label(&p_s, sizeof(p_s)) == k_label);
  // pointer to pointer to pointer s
  assert(dfsan_read_label(&pp_s, sizeof(pp_s)) == m_label);

  return 0;
}
tkuchta added inline comments.Feb 2 2023, 6:08 AM
compiler-rt/lib/dfsan/dfsan_custom.cpp
213

Hello,

Thank you for the comment.

I should have provided a more complete example.
What I meant is a behavior I've found while working on the tests.
In the test file I have something like that:

char *s = "Hello world/";
char *delim = " /";
dfsan_set_label(label, &s, sizeof(&s));
char *rv = strep(&s, delim);

If I get this right, the line

dfsan_set_label(label, &s, sizeof(&s));

should have applied a taint to the s pointer.
However, when inside strsep, if we check the s_label, it's 0, not label

SANITIZER_INTERFACE_ATTRIBUTE char *__dfsw_strsep(char **s, const char *delim,
                                                  dfsan_label s_label,
                                                  dfsan_label delim_label,
                                                  dfsan_label *ret_label) {
    fprintf(stderr, "s_label = %d\n", s_label); // -> this prints out "0"

If we do exactly the same trick with the delim pointer, the associated delim_label will be equal to label, as expected.

It seems to me that either I am missing some important point here or the s_label is not propagated correctly by the compiler runtime for a double pointer to the user-provided handler (in my case __dfsw_strsep).

I believe that strsep is currently the first function within dfsan_custom.cpp in which we operate on or check the label of the double pointer.

I would like to ask if you maybe came across this behavior. It might be that I misunderstood how this should work and would be grateful for your help.
Thank you

browneee added inline comments.Feb 3 2023, 3:53 AM
compiler-rt/lib/dfsan/dfsan_custom.cpp
213

If we do exactly the same trick with the delim pointer, the associated delim_label will be equal to label, as expected.

char *rv = strep(&s, delim);

The difference is &s adds another level of indirection, delim does not

I haven't tested, but I expect this example would work...

char *s = "Hello world/";
char **ps = &s;
char *delim = " /";
dfsan_set_label(label, &ps, sizeof(&ps));
char *rv = strep(ps, delim);

// Inside __dfsw_strsep ...
    fprintf(stderr, "s_label = %d\n", s_label); // should now print nonzero label
tkuchta added inline comments.Feb 4 2023, 1:10 PM
compiler-rt/lib/dfsan/dfsan_custom.cpp
213

You're right, thank you!

tkuchta updated this revision to Diff 495268.Feb 6 2023, 1:52 PM

Please find the patch for strsep only attached

browneee added inline comments.Feb 7 2023, 8:02 AM
compiler-rt/lib/dfsan/dfsan_custom.cpp
221

When res != NULL, then res is derived from *s, not from s.

e.g.

*ret_label = res ? dfsan_get_label(base) : 0;
224

I think dfsan_get_label(base) should also be a part of this output.

241

As above, when res != NULL, then res is derived from *s, not from s.

if (res)
    *ret_origin = dfsan_get_origin(base);
243

As above, I think dfsan_get_origin(base) should be the first priority, as it is the origin with the most direct value flow.

Note that dfsan_get_origin(base) gives the origin id for base whereas dfsan_read_origin_of_first_taint(base, ...) gives origin id for *base.

compiler-rt/test/dfsan/custom.cpp
1662

I think this test can call strsep fewer times. It just needs to set all the different labels first.

Different inputs to taint (each one should be a distinct taint label):

  • *p_delim (with dfsan_set_label(..., p_delim, strlen(p_delim));)
  • p_delim (with dfsan_set_label(..., &p_delim, sizeof(&p_delim));)
  • *s (with dfsan_set_label(..., s, strlen(s));)
  • s aka p_s (with dfsan_set_label(..., &s, sizeof(&s)); or dfsan_set_label(..., &p_s, sizeof(&p_s));)
  • pp_s (with dfsan_set_label(..., &pp_s, sizeof(&pp_s));)

This is less than 8, so we can do it all at once (unless you want to test different taint labels on different bytes of the input strings).

vitalybuka resigned from this revision.Feb 7 2023, 3:22 PM
tkuchta added inline comments.Mar 1 2023, 2:08 PM
compiler-rt/lib/dfsan/dfsan_custom.cpp
221

Apologies for a delay.

I came across some difficulty with using dfsan_get_label inside the dfsan_custom.cpp file.
It seems that including the dfsan_interface.h header there, which would be needed for dfsan_get_label, causes other conflicts and build errors.
Would there be another way to use that function inside dfsan_custom.cpp?

browneee added inline comments.Mar 2 2023, 9:10 AM
compiler-rt/lib/dfsan/dfsan_custom.cpp
221
char *base = *s;

Here base is loaded, so we can also load the shadow.

dfsan_get_label(base) == dfsan_read_label(s, sizeof(*s))

If base was an argument, then it would have a corresponding label argument as part of the wrapper function, so you also wouldn't need dfsan_get_label.

tkuchta updated this revision to Diff 510041.Mar 31 2023, 8:07 AM
tkuchta marked an inline comment as done.

Hello, I applied the review comments for strsep.

tkuchta updated this revision to Diff 510047.Mar 31 2023, 8:22 AM

Updates after the review of strsep.

there is a strange build error which seems unrelated to my change - please let me know if that's an issue, I will try to rebase to newest master then

We're getting really close!

Yes, that build error looks unrelated. Someone should fix it soon.

compiler-rt/lib/dfsan/dfsan_custom.cpp
221

base, sizeof(base) does not make sense - the size does not correspond to the pointer used.

Either

dfsan_read_label(base, sizeof(*base))   // first byte pointed to by base
dfsan_read_label(base, strlen(base))     // whole string pointed to by base
dfsan_read_label(&base, sizeof(base))  // the base pointer

In this case I think we want the base pointer.

dfsan_read_label(&base, sizeof(base)) // the base pointer
should be equivalent to doing
dfsan_label base_label = dfsan_read_label(s, sizeof(*s)) up at the start of the function, just after we declare base then use base_label here.

Lets go with the second option to avoid taking the address of a variable.

This is semantically equivalent to my first suggestion: dfsan_get_label(base) == dfsan_read_label(&base, sizeof(base)) == base_label.
Sorry I didn't consider the other constraints (no dfsan_get_label in this file because the pass is not run on this code; avoid taking address of variable) and suggest this in the first place.

224

Also use base_label here.

240

dfsan_get_origin(base) == dfsan_read_origin_of_first_taint(&base, sizeof(base))

As noted above base, strlen(base) is a meaningfully valid pointer, length - but it is not the level of indirection we want here.

I think we want the same solution as above.

dfsan_origin base_origin = dfsan_read_origin_of_first_taint(s, sizeof(*s)) at the start of the function, just after declaring and assigning base.

243

Also use base_origin here.

compiler-rt/test/dfsan/custom.cpp
1644

remove the &

It still works because sizeof(pointer_var) == sizeof(&pointer_var), but it doesn't logically match up like it should.

(Sorry, this one is my fault - I made this mistake giving an example in an earlier comment.)

1648

remove & in sizeof

1653

The value rv has a data flow from the the string pointer, not the string bytes.
It should have m_label not k_label.

This is related to line 221 (using base instead of &base) in the other file.

1657

rv is a pointer, and I think it's origin should match the pointer s, not the bytes in the string.

I think this should be:

ASSERT_INIT_ORIGIN_EQ_ORIGIN(&rv, s);

This is related to line 240 (*ret_origin = dfsan_read_origin_of_first_taint(base, strlen(base));) in the other file being the wrong level of indirection.

(Side note: the existing ASSERT_INIT_ORIGIN_EQ_ORIGIN macro feels a bit odd in that the arguments are at different levels of indirection - but not something to fix as part of this change)

1663

remove & in sizeof

tkuchta added inline comments.Mar 31 2023, 2:33 PM
compiler-rt/lib/dfsan/dfsan_custom.cpp
221

Thank you for your comments! Just to clarify:
If we have strict data dependencies we would like *ret_label to be related to the taints of the contents of the string, is that correct? Should we use the base_label there too? My understanding is that base_label represents a taint applied to the pointer, not to the data?

In other words, would that be correct to:

  1. use taints of the string data in the first case (strict dependencies) -> therefore no base_label there as it represents the taint of the pointer
  2. use the taints of the string data + the taints of the pointer in the second case -> therefore using base_label there
browneee added inline comments.Mar 31 2023, 4:35 PM
compiler-rt/lib/dfsan/dfsan_custom.cpp
221

If we have strict data dependencies we would like *ret_label to be related to the taints of the contents of the string, is that correct?

No, *ret_label holds the taint label for the return value, which is a pointer. The return value pointer is derived from *s, aka base. This is the pointer to the string contents, not the string contents themselves.

Should we use the base_label there too?

I think we should only use base_label here.

My understanding is that base_label represents a taint applied to the pointer, not to the data?

Correct.

In other words, would that be correct to:

  1. use taints of the string data in the first case (strict dependencies) -> therefore no base_label there as it represents the taint of the pointer

No, other way around. We want the base_label, but not the string data.

  1. use the taints of the string data + the taints of the pointer in the second case -> therefore using base_label there

Yes, we could have both base_label and the taints of the string data in the else case.

tkuchta updated this revision to Diff 510442.Apr 3 2023, 3:28 AM

Applied the latest review comments for strsep

browneee accepted this revision.Apr 13 2023, 9:57 AM

Please fix the additional &.

Do you need me to submit it for you?

compiler-rt/test/dfsan/custom.cpp
1644

Remove & in sizeof.

1648

Remove & in sizeof.

This revision is now accepted and ready to land.Apr 13 2023, 9:57 AM
tkuchta updated this revision to Diff 513333.Apr 13 2023, 1:12 PM

Thank you, I have uploaded a new diff with "&" fixed in the test function.

How should be proceed with the other functions once strsep is done? Should I open another review or add the code to the current patch?

Separate review.

Keep patches as small as possible, do one review for each patch.

If you'd like me to submit it for you, please provide your email to credit the patch to.

Got it, thank you.
Please submit the patch for e-mail t.kuchta@samsung.com
I will prepare next reviews for the other functions.
Thank you!

This revision was automatically updated to reflect the committed changes.
Herald added a subscriber: Restricted Project. · View Herald TranscriptApr 24 2023, 1:15 PM