[TySan] A Type Sanitizer (Clang)
Needs ReviewPublic

Authored by hfinkel on Tue, Apr 18, 4:15 PM.

Details

Summary

This patch introduces the runtime components of a type sanitizer: a sanitizer for type-based aliasing violations.

C/C++ have type-based aliasing rules, and LLVM's optimizer can exploit these given TBAA metadata added by Clang. Roughly, a pointer of given type cannot be used to access an object of a different type (with, of course, certain exceptions). Unfortunately, there's a lot of code in the wild that violates these rules (e.g. for type punning), and such code often must be built with -fno-strict-aliasing. Performance is often sacrificed as a result. Part of the problem is the difficulty of finding TBAA violations. Hopefully, this sanitizer will help.

https://reviews.llvm.org/D32197 (Runtime)
https://reviews.llvm.org/D32198 (LLVM)

The Clang changes seems mostly formulaic, the one specific change being that when the TBAA sanitizer is enabled, TBAA is always generated, even at -O0.

Clang's TBAA representation currently has a problem representing unions, as demonstrated by the one XFAIL'd test in the runtime patch. We'll update the TBAA representation to fix this, and at the same time, update the sanitizer.

Diff Detail

hfinkel created this revision.Tue, Apr 18, 4:15 PM

As a note: As follow-up work, we might want to extend Clang to add TBAA metadata to allocas and lifetime.start intrinsics so that the instrumentation pass can mark the types of memory upon declaration/construction instead of only upon first access.

pcc added a subscriber: pcc.Wed, Apr 19, 10:05 AM
filcab added a subscriber: filcab.Wed, Apr 19, 10:19 AM

Missing a sanitizer-ld.c test for freebsd.

include/clang/Basic/Attr.td
1849 ↗(On Diff #95654)

Shouldn't you be extending the no_sanitize attribute instead, as it says in the comment?

lib/CodeGen/CodeGenModule.cpp
128

hasOneOf(SanitizerKind::Thread | SanitizerKind::TBAA)?

lib/CodeGen/CodeGenTBAA.cpp
96

Interesting that TSan needs TBAA (per the comment in the chunk above), but is not in this condition.

Missing a sanitizer-ld.c test for freebsd.

Thanks for pointing this out. I'm going to remove the freebsd change for now. I suspect it works, but I've not tested it yet.

include/clang/Basic/Attr.td
1849 ↗(On Diff #95654)

Indeed. I think that happened also. I'll make sure the tests reflect that.

lib/CodeGen/CodeGenTBAA.cpp
96

Yea, I didn't understand the TSan-needs-TBAA bit either (considering that it does not need it at -O0?).

hfinkel updated this revision to Diff 95829.Wed, Apr 19, 2:58 PM

Updated per review comments (use only no_sanitize("tbaa") instead of adding no_sanitize_tbaa and don't touch freebsd for now).

I don't like calling this a "TBAA sanitizer". What we're sanitizing is the object model and effective type rules; it seems irrelevant which specific compiler analysis passes would result in your program misbehaving if you break the rules. I would also expect that we will extend this in future to assign types to storage even in cases where there is no store (for instance, we should be able to catch float f() { int n; return *(float*)&n; } despite there being no TBAA violation in the naive IR).

How about renaming this to something more like -fsanitize=type?

...

I would also expect that we will extend this in future to assign types to storage even in cases where there is no store (for instance, we should be able to catch float f() { int n; return *(float*)&n; } despite there being no TBAA violation in the naive IR).

Yes. My thought was that we'd make Clang generate tbaa metadata on allocas and lifetime.start intrinsics (and globals) so that we can mark the memory types upon creation. Would that catch everything?

How about renaming this to something more like -fsanitize=type?

I'm fine with that. Do you like TypeSanitizer or TypeAccessSantizer or TypeAliasingSanitizer best?

One potential concern with calling it the type sanitizer is that we have an abbreviation overlap with the thread sanitizer.

...

I would also expect that we will extend this in future to assign types to storage even in cases where there is no store (for instance, we should be able to catch float f() { int n; return *(float*)&n; } despite there being no TBAA violation in the naive IR).

Yes. My thought was that we'd make Clang generate tbaa metadata on allocas and lifetime.start intrinsics (and globals) so that we can mark the memory types upon creation. Would that catch everything?

Also, we'd also want to do something similar for byval arguments. This may just be additional motivation to allow metadata on function arguments (there's an RFC on llvm-dev about this presently).

How about renaming this to something more like -fsanitize=type?

I'm fine with that. Do you like TypeSanitizer or TypeAccessSantizer or TypeAliasingSanitizer best?

One potential concern with calling it the type sanitizer is that we have an abbreviation overlap with the thread sanitizer.

! In D32199#731252, @hfinkel wrote:

How about renaming this to something more like -fsanitize=type?

I'm fine with that. Do you like TypeSanitizer or TypeAccessSantizer or TypeAliasingSanitizer best?

I think calling it a type aliasing sanitizer would somewhat conflate the details of the mechanism with the fundamentals of the check itself. For example:

variant<int, float> v;
int &n = v.get<int>;
v = 1.3f;
int m = n;

... is a lifetime bug, not an aliasing bug, but would be caught by this check just the same. I'd be tempted to suggest EffectiveTypeSanitizer, since we seem to be more-or-less directly implementing C's effective type rules, except that name isn't so good for the C++ case. And in the longer term we will probably want to provide an option to enforce the real C++ lifetime rules whereby a store with certain !tbaa metadata is not sufficient to change the type of storage.

One potential concern with calling it the type sanitizer is that we have an abbreviation overlap with the thread sanitizer.

Perhaps we could abbreviate it as "tysan"? *shrug*

! In D32199#731252, @hfinkel wrote:

How about renaming this to something more like -fsanitize=type?

I'm fine with that. Do you like TypeSanitizer or TypeAccessSantizer or TypeAliasingSanitizer best?

I think calling it a type aliasing sanitizer would somewhat conflate the details of the mechanism with the fundamentals of the check itself. For example:

variant<int, float> v;
int &n = v.get<int>;
v = 1.3f;
int m = n;

... is a lifetime bug, not an aliasing bug, but would be caught by this check just the same.

Good point.

I'd be tempted to suggest EffectiveTypeSanitizer, since we seem to be more-or-less directly implementing C's effective type rules, except that name isn't so good for the C++ case. And in the longer term we will probably want to provide an option to enforce the real C++ lifetime rules whereby a store with certain !tbaa metadata is not sufficient to change the type of storage.

As I've currently implemented it, both reads and writes set the type of previously-unknown storage, and after that it says fixed (unless you memcpy to it, memset it, or its lifetime ends (the type gets reset on lifetime.start/end and for malloc/allocas/etc.). There's a flag to enable the "writes always set the type" rule, but that's not the default. Is this too strict?

One potential concern with calling it the type sanitizer is that we have an abbreviation overlap with the thread sanitizer.

Perhaps we could abbreviate it as "tysan"? *shrug*

SGTM.

As I've currently implemented it, both reads and writes set the type of previously-unknown storage, and after that it says fixed (unless you memcpy to it, memset it, or its lifetime ends (the type gets reset on lifetime.start/end and for malloc/allocas/etc.). There's a flag to enable the "writes always set the type" rule, but that's not the default. Is this too strict?

That seems like it will have at least three flavors of false positive:

  1. C's "effective type" rule allows writes to set the type pretty much unconditionally, unless the storage is for a variable with a declared type
  2. After a placement new in C++, you should be able to use the storage as a new type
  3. Storing directly to a member access on a union (ie, with the syntax x.a = b) in C++ permits using the storage as the new type

If we want to follow the relevant language rules by default, that would suggest that "writes always set the type" should be enabled by default in C and disabled by default in C++. That may not be the right decision for other reasons, though. In C++, writes through union members and new-expressions should probably (re)set the type (do you have intrinsics the frontend can use to do so?).

As I've currently implemented it, both reads and writes set the type of previously-unknown storage, and after that it says fixed (unless you memcpy to it, memset it, or its lifetime ends (the type gets reset on lifetime.start/end and for malloc/allocas/etc.). There's a flag to enable the "writes always set the type" rule, but that's not the default. Is this too strict?

That seems like it will have at least three flavors of false positive:

  1. C's "effective type" rule allows writes to set the type pretty much unconditionally, unless the storage is for a variable with a declared type
  2. After a placement new in C++, you should be able to use the storage as a new type
  3. Storing directly to a member access on a union (ie, with the syntax x.a = b) in C++ permits using the storage as the new type

    If we want to follow the relevant language rules by default, that would suggest that "writes always set the type" should be enabled by default in C and disabled by default in C++. That may not be the right decision for other reasons, though. In C++, writes through union members and new-expressions should probably (re)set the type

Fair enough. For now we'll default to write-sets-the-type as the default. We can always add 'sticky' types later to correspond to types set by declaration.

(do you have intrinsics the frontend can use to do so?).

I had thought that we could just use a lifetime.end/start pair to mark the placement new, etc. However, it might be better to use some dedicated intrinsic for this purpose?

As I've currently implemented it, both reads and writes set the type of previously-unknown storage, and after that it says fixed (unless you memcpy to it, memset it, or its lifetime ends (the type gets reset on lifetime.start/end and for malloc/allocas/etc.). There's a flag to enable the "writes always set the type" rule, but that's not the default. Is this too strict?

That seems like it will have at least three flavors of false positive:

  1. C's "effective type" rule allows writes to set the type pretty much unconditionally, unless the storage is for a variable with a declared type

To come back to this point: We don't really implement these rules now, and it is not clear that we will. The problem here is that, if we take the specification literally, then we can't use our current TBAA at all. The problem is that if we have:

write x, !tbaa "int"
read x, !tbaa "int"
write x, !tbaa "float"

TBAA will currently tell us that the "float" write aliases with neither the preceding read nor the preceding write. As a result, we might move the "float" write to before the read (which is wrong) or before the write (also wrong). It seems to me that following the effective-type rules strictly will require that we only emit TBAA for memory accesses we can prove are to declared variables (as these are the only ones whose types don't get changed just by writing to them). We could certainly do that (*), although it is going to make TBAA awfully limited. As @dberlin has asserted, GCC does not implement these rules either.

To be fair, there are inferences we might draw from TBAA on all access that are not incorrect. For example, if we have:

write x, !tbaa "int"
write y, !tbaa "float"
read x, !tbaa "int"

and we can indeed conclude that the write to y and the read from x don't alias (because the write happens before the read). This is because the effective type of y must be float after the write and so we know that the read from x must be accessing a different object. We can also conclude that the writes don't alias, but only because of the later read. The sad part is that if we use this information to reorder the read before the write to y (which we might do to eliminate the read), we now lose our ability to use TBAA to tell us anything.

Also, a strict reading of C's access rules seems to rule out the premise underlying our struct-path TBAA entirely. So long as I'm accessing a value using a struct that has some member, including recursively, with that type, then it's fine. The matching of the relative offsets is a sufficient, but not necessary, condition for well-defined access. C++ has essentially the same language (and, thus, potentially the same problem).

While I'd like the sanitizer to follow the rules, that's really useful only to the extent that the compilers follow the rules. If the compilers are making stronger assumptions, then I think that the sanitizer should also. OTOH, maybe we should change our TBAA representation/implementation to actually follow the rules, and then have a sanitizer that does the same.

(*) The best way I can think of to do this is to tag globals and allocas with tbaa according to their declared type, something similar for function arguments, and then in TBAA, instead of comparing the TBAA metadata of both operands directly, we call getUnderlyingObjects on the pointers, get the corresponding most-generic TBAA from the objects themselves, and then compare that TBAA to the TBAA from the other access.

  1. After a placement new in C++, you should be able to use the storage as a new type
  2. Storing directly to a member access on a union (ie, with the syntax x.a = b) in C++ permits using the storage as the new type

Yes, although for the sake of discussion, this is only true if a is a "non-class, non-array type, or of a class type with a trivial default constructor that is not
deleted, or an array of such types." That seems potentially useful.

If we want to follow the relevant language rules by default, that would suggest that "writes always set the type" should be enabled by default in C and disabled by default in C++. That may not be the right decision for other reasons, though. In C++, writes through union members and new-expressions should probably (re)set the type (do you have intrinsics the frontend can use to do so?).

Also, in C, memcpy gets to copy the type for a destination that does not have declared types. It looks like the same is true for C++ for trivially-copyable types. Is the first read/write sets the unknown type (i.e. memory from malloc/calloc/memset, etc.) correct for C++ also? I recall discussing something along these lines in Kona.

If you're going to try to enforce the declared type of memory, you'll also need something like the C effective type rule to handle char buffers in C++. As far as I can tell, it's not actually legal under the spec to cast an array of chars to an arbitrary type and access it that way — you have to do something to establish that there's an object of that type there first. If you memcpy'ed into that buffer from an object of the right type, that would be sufficient to create a new formal object of that type, but I don't see any way to sensibly apply that rule to e.g. the POSIX "read" function. It seems to me that you at least need to have a rule saying that it's okay to access a formal object of type char/char[] using an arbitrarily-typed l-value.

If you're going to try to enforce the declared type of memory, you'll also need something like the C effective type rule to handle char buffers in C++. As far as I can tell, it's not actually legal under the spec to cast an array of chars to an arbitrary type and access it that way — you have to do something to establish that there's an object of that type there first.
If you memcpy'ed into that buffer from an object of the right type, that would be sufficient to create a new formal object of that type, but I don't see any way to sensibly apply that rule to e.g. the POSIX "read" function. It seems to me that you at least need to have a rule saying that it's okay to access a formal object of type char/char[] using an arbitrarily-typed l-value.

I agree. That's exactly what the current implementation does (I get that for free from our TBAA setup). I get this for free from the TBAA scheme because the current checks are symmetric (just like the TBAA checks in the optimizer). I had wondered whether this symmetry was an over-approximation in some cases, but perhaps it is not.

  1. C's "effective type" rule allows writes to set the type pretty much unconditionally, unless the storage is for a variable with a declared type

To come back to this point: We don't really implement these rules now, and it is not clear that we will. The problem here is that, if we take the specification literally, then we can't use our current TBAA at all. The problem is that if we have:

write x, !tbaa "int"
read x, !tbaa "int"
write x, !tbaa "float"

TBAA will currently tell us that the "float" write aliases with neither the preceding read nor the preceding write.

Right, C's TBAA rules do not (in general) permit a store to be reordered before a memory operation of a different type, they only allow loads to be moved before stores. (Put another way, they do not tell you that pointers point to distinct memory locations, just that a stored value cannot be observed by a load of a different type.) You get the more general "distinct memory locations" result only for objects of a declared type.

C++ is similar, except that (because object lifetimes do not currently begin magically due to a store) you /can/ reorder stores past a memory operation of a different type if you know no object's lifetime began in between. (But currently we do not record all lifetime events in IR, so we can't do that today. Also, we may be about to lose the property that you can statically determine a small number of places that might start an object lifetime.)

Also, a strict reading of C's access rules seems to rule out the premise underlying our struct-path TBAA entirely. So long as I'm accessing a value using a struct that has some member, including recursively, with that type, then it's fine. The matching of the relative offsets is a sufficient, but not necessary, condition for well-defined access. C++ has essentially the same language (and, thus, potentially the same problem).

I agree this rule is garbage, but it's not as permissive as I think you're suggesting. The rule says that you can use an lvalue of struct type to access memory of struct field type. In C this happens during struct assignment, for instance. It does *not* permit using an lvalue of struct field type to access unrelated fields of the same struct. So C appears to allow this nonsense:

char *p = malloc(8);
*(int*)p = 0;
*(int*)(p + 4) = 0;
struct S {int n; float f;} s = *(struct S*)p; // use lvalue of type `struct S` to access object of effective type `int`, to initialize a `float`

but not this nonsense:

float q = ((struct S*)p)->f; // ub, cannot use lvalue of type `float` to access object of effective type `int`

... which just means that we can't make much use of TBAA when emitting struct copies in C.

In C++, on the other hand, the rule is even more garbage, since there is no way to perform a memory access with a glvalue of class type. (The closest you get is that a defaulted union construction/assignment copies the object representation, but that's expressed in terms of copying a sequence of unsigned chars, and in any case those are member functions and so already require an object of the correct type to exist.) See wg21.link/cwg2051

While I'd like the sanitizer to follow the rules, that's really useful only to the extent that the compilers follow the rules. If the compilers are making stronger assumptions, then I think that the sanitizer should also.

I agree.

If we want to follow the relevant language rules by default, that would suggest that "writes always set the type" should be enabled by default in C and disabled by default in C++. That may not be the right decision for other reasons, though. In C++, writes through union members and new-expressions should probably (re)set the type (do you have intrinsics the frontend can use to do so?).

Also, in C, memcpy gets to copy the type for a destination that does not have declared types. It looks like the same is true for C++ for trivially-copyable types. Is the first read/write sets the unknown type (i.e. memory from malloc/calloc/memset, etc.) correct for C++ also?

As I recall, "store can create an object" is the broad direction that SG12 agreed on for the cases where you have a pointer into a raw storage buffer (that is, a char array), and we want the low-level storage allocation functions to give us such a buffer.

  1. C's "effective type" rule allows writes to set the type pretty much unconditionally, unless the storage is for a variable with a declared type

To come back to this point: We don't really implement these rules now, and it is not clear that we will. The problem here is that, if we take the specification literally, then we can't use our current TBAA at all. The problem is that if we have:

write x, !tbaa "int"
read x, !tbaa "int"
write x, !tbaa "float"

TBAA will currently tell us that the "float" write aliases with neither the preceding read nor the preceding write.

Right, C's TBAA rules do not (in general) permit a store to be reordered before a memory operation of a different type, they only allow loads to be moved before stores. (Put another way, they do not tell you that pointers point to distinct memory locations, just that a stored value cannot be observed by a load of a different type.) You get the more general "distinct memory locations" result only for objects of a declared type.

C++ is similar, except that (because object lifetimes do not currently begin magically due to a store) you /can/ reorder stores past a memory operation of a different type if you know no object's lifetime began in between. (But currently we do not record all lifetime events in IR, so we can't do that today. Also, we may be about to lose the property that you can statically determine a small number of places that might start an object lifetime.)

Also, a strict reading of C's access rules seems to rule out the premise underlying our struct-path TBAA entirely. So long as I'm accessing a value using a struct that has some member, including recursively, with that type, then it's fine. The matching of the relative offsets is a sufficient, but not necessary, condition for well-defined access. C++ has essentially the same language (and, thus, potentially the same problem).

I agree this rule is garbage, but it's not as permissive as I think you're suggesting. The rule says that you can use an lvalue of struct type to access memory of struct field type. In C this happens during struct assignment, for instance. It does *not* permit using an lvalue of struct field type to access unrelated fields of the same struct. So C appears to allow this nonsense:

char *p = malloc(8);
*(int*)p = 0;
*(int*)(p + 4) = 0;
struct S {int n; float f;} s = *(struct S*)p; // use lvalue of type `struct S` to access object of effective type `int`, to initialize a `float`

but not this nonsense:

float q = ((struct S*)p)->f; // ub, cannot use lvalue of type `float` to access object of effective type `int`

... which just means that we can't make much use of TBAA when emitting struct copies in C.

In C++, on the other hand, the rule is even more garbage, since there is no way to perform a memory access with a glvalue of class type. (The closest you get is that a defaulted union construction/assignment copies the object representation, but that's expressed in terms of copying a sequence of unsigned chars, and in any case those are member functions and so already require an object of the correct type to exist.) See wg21.link/cwg2051

Our struct-path TBAA does the following:

struct X { int a, b; };
X x { 50, 100 };
X *o = (X*) (((int*) &x) + 1);

int a_is_b = o->a; // This is UB (or so we say)?

Because we assume that the (type, offset) tuples are identified entities in the type-aliasing tree. Practically speaking, this certainly makes sense to me. However, I don't see anything in the language that actually forbids this behavior. In case it matters, because in the above case the type of the struct actually matches, we similarly forbid:

struct X { int a, b; };
struct Y { int a; float b; };
X x { 50, 100 };
Y *o = (X*) (((int*) &x) + 1);

int a_is_b = o->a; // This is UB (or so we say)?

as is this:

struct X { int a, b; };
struct Y { int a; float b; X h; /* in case this matters for the aggregate members thing */ };
X x { 50, 100 };
Y *o = (X*) (((int*) &x) + 1);

int a_is_b = o->a; // This is UB (or so we say)?

(although, as you say, this shouldn't matter in C++ because we don't have struct glvalues)

In any case, am I missing something?

While I'd like the sanitizer to follow the rules, that's really useful only to the extent that the compilers follow the rules. If the compilers are making stronger assumptions, then I think that the sanitizer should also.

I agree.

If we want to follow the relevant language rules by default, that would suggest that "writes always set the type" should be enabled by default in C and disabled by default in C++. That may not be the right decision for other reasons, though. In C++, writes through union members and new-expressions should probably (re)set the type (do you have intrinsics the frontend can use to do so?).

Also, in C, memcpy gets to copy the type for a destination that does not have declared types. It looks like the same is true for C++ for trivially-copyable types. Is the first read/write sets the unknown type (i.e. memory from malloc/calloc/memset, etc.) correct for C++ also?

As I recall, "store can create an object" is the broad direction that SG12 agreed on for the cases where you have a pointer into a raw storage buffer (that is, a char array), and we want the low-level storage allocation functions to give us such a buffer.

What about a read after a calloc (or memset)?

hfinkel updated this revision to Diff 96135.Fri, Apr 21, 7:18 AM
hfinkel retitled this revision from [TBAASan] A TBAA Sanitizer (Clang) to [TySan] A Type Sanitizer (Clang).
hfinkel edited the summary of this revision. (Show Details)

Rename TBAASanitizer -> TypeSanitizer

hfinkel updated this revision to Diff 96264.Fri, Apr 21, 5:16 PM

Output metadata to provide the types of globals (similar to how Clang marks globals for asan).