This is an archive of the discontinued LLVM Phabricator instance.

[dfsan] Add origin tls/move/read APIs
ClosedPublic

Authored by stephan.yichao.zhao on Feb 11 2021, 4:52 PM.

Details

Summary

This is part of https://reviews.llvm.org/D95835.

Added

  1. TLS storage
  2. a weak global that is set by instrumented code
  3. move origins

These APIs are similar to MSan's:

https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/msan/msan_poisoning.cpp

We first improved MSan's implementation in https://reviews.llvm.org/D94572 and https://reviews.llvm.org/D94552,
so the correctness of this approach has already been verified in MSan.
After the DFSan instrumentation code is ready, we will add more test cases.

  4. read origins

To reduce origin tracking cost, some of the read APIs return only
the origin of the first tainted byte in a range.

Note that we did not add origin-set APIs here: they are tied to code
instrumentation and will be added later with the IR transformation code.
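
As a rough sketch of the shape of such a read API (the function name and the shadow_for/origin_for helpers below are assumptions for illustration, not necessarily the exact code in this patch):

// Return the origin of the first tainted byte in [addr, addr + size),
// or 0 when no byte in the range is tainted.
static dfsan_origin GetOriginIfTainted(uptr addr, uptr size) {
  for (uptr i = 0; i < size; ++i) {
    if (*shadow_for((void *)(addr + i)))  // first tainted byte wins
      return *(dfsan_origin *)origin_for((void *)(addr + i));
  }
  return 0;
}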

Diff Detail

Event Timeline

stephan.yichao.zhao requested review of this revision. Feb 11 2021, 4:52 PM
stephan.yichao.zhao created this revision.
Herald added a project: Restricted Project. Feb 11 2021, 4:52 PM
Herald added a subscriber: Restricted Project.
stephan.yichao.zhao edited the summary of this revision. Feb 11 2021, 4:52 PM

rebased

https://github.com/llvm/llvm-project/commit/f2133f2e318d8ce87e9972d288485d9bc49fef5c should fix the failure related to MemProfiler-x86_64-linux-dynamic.TestCases::test_malloc_load_store.c

morehouse added inline comments. Feb 16 2021, 11:01 AM
compiler-rt/lib/dfsan/dfsan.cpp
261

Why return them in the same u64? Would it be simpler to return the label and write the origin to a pointer passed in as an arg?

268–276

Can be simplified.

328

Why do we need to check the shadow addr? Shouldn't it always be valid?

546

Let's create constants kOriginAlign = sizeof(dfsan_origin) and kOriginAlignMask = ~(kOriginAlign - 1), and use those everywhere instead of hardcoding 3.

Also, maybe short functions for AlignUp and AlignDown would improve readability.
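
For illustration, the suggested constants and helpers might look like this (a sketch using the names from the comment above; dfsan_origin is assumed to be 4 bytes):

static const uptr kOriginAlign = sizeof(dfsan_origin);       // 4 bytes
static const uptr kOriginAlignMask = ~(kOriginAlign - 1UL);  // ...fffffffc

static uptr AlignUp(uptr addr) {
  return (addr + kOriginAlign - 1) & kOriginAlignMask;
}

static uptr AlignDown(uptr addr) { return addr & kOriginAlignMask; }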

551

Please fix this lint.

559
564–569

Similar structure to the above; may be a good candidate for a helper function or lambda.

auto ChainAndWriteOriginIfTainted = [=](uptr src, uptr size, uptr dest) {
...
};
572

Nit: Can reduce nesting if we invert the condition: if (beg == end) return;.

574

This skips the origin for the unaligned start of src.

585
598

Can we use this also for memcpy? It looks like we're checking byte-by-byte either way, so why not share the implementation?

653

Under what circumstances would we not have valid shadow addresses for the range?

657
658

Any downside to always copying in reverse order for simplicity?

stephan.yichao.zhao marked 12 inline comments as done.

update

compiler-rt/lib/dfsan/dfsan.cpp
261

We will be using this in the instrumented code. I have seen today's IR sometimes lower returning a pair like {i32, i32*} into returning a 64-bit int as an optimization.
And returning {i32, i32*} is itself an optimization over returning one element as the return value and the other through a pointer argument.
So I did this manually.

But I do not have a strong opinion. Which one do you feel is better in this case?
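
For reference, the packing described above is roughly the following (an illustrative sketch; putting the origin in the high half is an assumption):

// Pack a (label, origin) pair into one u64 so instrumented code can read
// both from a single return register.
static u64 PackLabelAndOrigin(dfsan_label label, dfsan_origin origin) {
  return (u64)origin << 32 | label;
}

// Caller side:
//   dfsan_label label = (dfsan_label)packed;
//   dfsan_origin origin = (dfsan_origin)(packed >> 32);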

328

We know the current DFSan memory layout is not correct. For example, addresses in (0, 0x10000) are mapped to themselves.
For unknown reasons, accesses in that range can happen in memmove or memcpy called inside a signal handler.
This caused a crash at runtime, and the gdb coredump shows it, but I am not sure of the full execution path that leads here.

MSan's layout is different: it maps 0-0x10000 to 0x01....
We could try removing this check after fixing the memory layout issue, once 8-bit shadow is supported.

Added comments.
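
A minimal sketch of the kind of guard being discussed (the helper name and the exact bound are illustrative assumptions, not the patch's actual code):

// Reject shadow addresses that fall in the low range the current layout
// identity-maps (see the layout issue described above).
static bool IsShadowAddrValid(uptr shadow_addr) {
  return shadow_addr >= 0x10000;  // illustrative bound only
}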

551

https://github.com/llvm/llvm-project/commit/a7538fee3a0256a8891e746823f7b0f0ade84e62 commented out ChainOrigin to fix compilation warnings. Uncommented it.

564–569

Defined a static ChainAndWriteOriginIfTainted helper.
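
Roughly, such a helper chains the current event onto the source origin and writes the result, along these lines (a sketch; GetOriginIfTainted, ChainOrigin, and the StackTrace parameter are assumed from the surrounding patch):

static void ChainAndWriteOriginIfTainted(uptr src, uptr size, uptr dest,
                                         StackTrace *stack) {
  dfsan_origin o = GetOriginIfTainted(src, size);
  if (o) {
    o = ChainOrigin(o, stack);  // record this event in the origin chain
    *(dfsan_origin *)dest = o;  // dest points into the origin shadow
  }
}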

574

The first "if (beg < d)" block handles those bytes.

598

I feel this could be a small optimization if CPUs behave differently when prefetching data in different directions. Based on the experiments in the link, latency also depends on the code pattern, such as the order in which the original data was initialized. So if origins are copied or moved in the same order as the user's memcpy/memmove, performance should be consistent with that memcpy/memmove.

https://stackoverflow.com/questions/13339582/why-is-linux-memmove-implemented-the-way-it-is shows that most memcpy implementations copy forward and most memmove implementations copy backward. This matches CopyOrigin and ReverseCopyOrigin.

Actually, I am not sure how much this helps performance in practice, but since we already have both ReverseCopyOrigin and CopyOrigin (following MSan), it seems fine to keep both.
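
In miniature, the direction difference between the two variants (illustrative only; each step copies one 4-byte origin):

static void CopyOriginForward(dfsan_origin *d, const dfsan_origin *s, uptr n) {
  for (uptr i = 0; i < n; ++i)  // memcpy-like: front to back
    d[i] = s[i];
}

static void CopyOriginBackward(dfsan_origin *d, const dfsan_origin *s, uptr n) {
  for (uptr i = n; i-- > 0;)    // memmove-like: back to front, safe when
    d[i] = s[i];                // dst starts inside the src range
}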

653

This is similar to the check in GetOriginIfTainted. We will revisit it after the memory layout is fixed. Added comments.

658

Added comments.

morehouse added inline comments. Feb 17 2021, 3:45 PM
compiler-rt/lib/dfsan/dfsan.cpp
574

What if dst is aligned properly but src is not? Then we skip the beg < d case.

585

Looks like this comment wasn't actually addressed.

stephan.yichao.zhao marked 2 inline comments as done. Feb 17 2021, 6:26 PM
compiler-rt/lib/dfsan/dfsan.cpp
574

This is a good point; I missed this behavior as well. I feel it is a side effect of four bytes sharing the same origin.

Consider copying 5 bytes from 0x12 to 0x20.
B0 and B1 share an origin O0; B2-B4 share another origin O1.
After the copy, B0-B3 should share one origin, but at src they carry either O0 or O1, so we can keep only one.

Skipping the copy for B0 and B1 is equivalent to copying their origin and then having it overwritten.

The same thing can happen at the end: there the code does copy both origins, but the last one wins.

I added some comments. Hopefully unaligned copies are not common.
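
To make the example concrete (4-byte origin granules assumed; addresses as in the example above):

src:  0x12 0x13 | 0x14 0x15 0x16     B0,B1 in granule [0x10,0x14) -> O0
                                     B2-B4 in granule [0x14,0x18) -> O1
dst:  0x20 0x21 0x22 0x23 | 0x24     B0-B3 land in granule [0x20,0x24),
                                     which can record only one of O0/O1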

This revision is now accepted and ready to land. Feb 18 2021, 7:37 AM