This is an archive of the discontinued LLVM Phabricator instance.

Implement function type checker for the undefined behavior sanitizer.
ClosedPublic

Authored by pcc on Aug 9 2013, 1:30 AM.

Details

Summary

This uses function prefix data to store
function type information at the function pointer.

Diff Detail

Event Timeline

pcc updated this revision to Unknown Object (????).Aug 9 2013, 1:31 PM
  • Also test for the RTTI pointer check code
pcc updated this revision to Unknown Object (????).Sep 15 2013, 6:32 PM

Rebase.

pcc added a comment.Sep 15 2013, 6:35 PM

The LLVM-side dependency for this has landed now.

pcc updated this revision to Unknown Object (????).Oct 10 2013, 7:08 PM
Rebase.

Clever! The approach only makes me feel a little uneasy =)

It would be polite to issue a diagnostic if someone explicitly asks for this (not via -fsanitize=undefined) and we don't support it for their target.

lib/CodeGen/CGExpr.cpp
3128 ↗(On Diff #4825)

This is a bit hard to parse. (!TargetDecl || !isa<FunctionDecl>(TargetDecl)) maybe?

3129 ↗(On Diff #4825)

* on the right.

3133 ↗(On Diff #4825)

Some indication of what "PD" stands for here would be useful.

3133–3134 ↗(On Diff #4825)

We usually format this as

llvm::Type *PDStructTyElems[] = {
  PDSig->getType(),
  FTRTTIConst->getType()
};

or similar.

3138–3142 ↗(On Diff #4825)

What happens if the callee address is at the end of a page and doesn't have the extra data? Could this load trigger a segfault?

3155–3156 ↗(On Diff #4825)

Likewise.

lib/CodeGen/CodeGenFunction.cpp
523–524 ↗(On Diff #4825)

It'd be nice if we could only do this for address-taken functions. When we take the address of a function, we could emit linkonce_odr thunk with the extra data, and use that instead of the original address... except that would break address-of-function comparisons between instrumented and uninstrumented code. Can we provide an option to enable that behavior for the case where we're OK with an ABI change?

529–530 ↗(On Diff #4825)

Does this include the calling convention attributes?

531 ↗(On Diff #4825)

Spaces around braces.

lib/CodeGen/TargetInfo.cpp
607–608 ↗(On Diff #4825)

What do 'F' and 'T' demangle as? An invalid instruction encoding would give me more confidence here.

pcc updated this revision to Unknown Object (????).Oct 18 2013, 4:06 PM
  • Address code review comments
pcc added a comment.Oct 18 2013, 4:09 PM

It would be polite to issue a diagnostic if someone explicitly asks for this (not via -fsanitize=undefined) and we don't support it for their target.

Hmm, we should probably be doing this for other sanitizers too (e.g. MSan and DFSan are only supported on 64-bit Linux).

lib/CodeGen/CGExpr.cpp
3128 ↗(On Diff #4825)

Done.

3129 ↗(On Diff #4825)

Done.

3133 ↗(On Diff #4825)

Globally replaced 'PD' with 'Prefix'.

3133–3134 ↗(On Diff #4825)

Done. clang-format prefers something closer to the formatting I originally used; I'll raise this with the clang-format folks.

3138–3142 ↗(On Diff #4825)

It could in principle. In practice, Clang always aligns function bodies to 16 bytes, GCC does it at -O>=2 and GCC (4.4 and 4.6) codegens an empty function body at -O0 with >4 bytes. A quick survey of a number of system libraries (libc, libm, libstdc++) on an Ubuntu machine reveals that the functions are suitably aligned. I think the circumstances in which a segfault might occur are sufficiently rare as to justify doing a single load. If it does turn out to be a problem in practice we could consider splitting the load (perhaps conditional on a command line flag).

3155–3156 ↗(On Diff #4825)

Done.

lib/CodeGen/CodeGenFunction.cpp
523–524 ↗(On Diff #4825)

I like the idea of doing this as an option, but I would prefer to land this version first.

529–530 ↗(On Diff #4825)

Not with the Itanium ABI, which does not have a representation for calling conventions, but the Microsoft ABI (for which calling conventions matter more) does, so once RTTI has been implemented for that ABI, this should respect calling conventions there.

531 ↗(On Diff #4825)

Done.

lib/CodeGen/TargetInfo.cpp
607–608 ↗(On Diff #4825)

"rex.RX push %rsp", according to objdump. But I don't think it really matters what this decodes to. If the instruction pointer finds itself here the situation isn't dissimilar to jumping into the middle of a multibyte instruction. (Besides, the RTTI pointer could decode to anything.)

rsmith accepted this revision.Oct 18 2013, 4:34 PM

LGTM

lib/CodeGen/CGExpr.cpp
3128 ↗(On Diff #5042)

It'd be nice if we could erase these checks if the optimizer resolves the call.

lib/CodeGen/TargetInfo.cpp
607–608 ↗(On Diff #4825)

I think it matters: it's not unreasonable for a function to start as:

jmp +8
6 byte loop body
check loop condition
conditional jmp back to loop body

IIUC, your check would flag a false positive on indirect calls to such a function.

However, since the 'F' is an instruction prefix that is redundant on a 'T' (pushq %rsp) instruction, I think it's reasonable to assume that the FT signature will not occur in uninstrumented code.

pcc added a comment.Oct 18 2013, 5:59 PM

What about the compiler-rt changes?

lib/CodeGen/CGExpr.cpp
3128 ↗(On Diff #5042)

Sure. A design point of function prefix data is that it permits this sort of optimization. Maybe if I get a chance I'll see if I can teach the optimizer to do this.

lib/CodeGen/TargetInfo.cpp
607–608 ↗(On Diff #4825)

I see. I think I agree with your reasoning.

pcc closed this revision.Oct 20 2013, 2:33 PM

Closed by commit rL193058 (authored by @pcc).