This is an archive of the discontinued LLVM Phabricator instance.

[dfsan] Track origin at loads
ClosedPublic

Authored by stephan.yichao.zhao on Apr 21 2021, 9:27 AM.

Details

Summary
The first version of origin tracking tracks only memory stores. Although
this is sufficient for understanding correct flows, it is hard to figure
out where an undefined value is read from. To find reading undefined values,
we still have to do a reverse binary search from the last store in the chain
with printing and logging at possible code paths. This is
quite inefficient.

Tracking memory load instructions can help this case. The main issues of
tracking loads are performance and code size overheads.

With tracking only stores, the code size overhead is 38%,
memory overhead is 1x, and cpu overhead is 3x. In practice #load is much
larger than #store, so both code size and cpu overhead increases. The
first blocker is code size overhead: link fails if we inline tracking
loads. The workaround is using external function calls to propagate
metadata. This is also the workaround ASan uses. The cpu overhead
is ~10x. This is a trade off between debuggability and performance, 
and will be used only when debugging cases that tracking only stores
is not enough.

Diff Detail

Event Timeline

stephan.yichao.zhao requested review of this revision.Apr 21 2021, 9:27 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptApr 21 2021, 9:27 AM
Herald added subscribers: llvm-commits, Restricted Project. · View Herald Transcript
stephan.yichao.zhao edited the summary of this revision. (Show Details)Apr 21 2021, 9:28 AM
gbalats added inline comments.Apr 21 2021, 11:13 AM
compiler-rt/test/dfsan/origin_track_ld.c
2
3

This is redundant as it's the default.

llvm/lib/Transforms/Instrumentation/DataFlowSanitizer.cpp
256–262

Using different integer values to encode the level of tracking is hard to understand without looking at this exact comment right here. Why can't we use an enum instead with descriptive names?

E.g.,

enum OriginTrackingLevel {
  None,
  StoresOnly,
  LoadsAndStores
};

https://llvm.org/docs/CommandLine.html#selecting-an-alternative-from-a-set-of-possibilities

644

You mean InstAlignment? By the way, for documenting specific arguments of functions, I think the special \param Doxygen syntax would be clearer.

https://llvm.org/docs/CodingStandards.html#doxygen-use-in-documentation-comments

647
709
745

What do you mean by this? The description doesn't indicate what's the different with loadShadowOrigin. Maybe loadShadowOriginSansLoadTracking?

2441–2442

This function should be called only when tracking origins. Why not make that an assertion, instead of the if-stmt?

llvm/test/Instrumentation/DataFlowSanitizer/origin_track_load.ll
1–2
stephan.yichao.zhao marked 7 inline comments as done.

updated

llvm/lib/Transforms/Instrumentation/DataFlowSanitizer.cpp
256–262

Thank you.

This enum is helpful. It uses optimization levels as an example.
If we considered our use case as some debug levels, it would work

  1. track the basic traces
  2. track more chains to get more details but slow. In the next step, maybe callsites can also be added into this option. I updated the comments to mention this.

Changing this from int to enum will change existing test cases. if we wanted it to use enum, I prefer to doing so in a different CL.

MSan's msan-track-origins is defined like dfsan-track-origins, with 0, 1 and 2 to control different levels.
At least they are consistent for the time being.

644

Thank you,

gbalats added inline comments.Apr 21 2021, 2:45 PM
llvm/lib/Transforms/Instrumentation/DataFlowSanitizer.cpp
256–262

Addressing this in separate change SGTM. Thanks!

644

I'm not entirely sure you can use \param inlined within the text. Or, if you have to add a separate line per parameter. Most cases I see do the former.
Please check.

stephan.yichao.zhao marked an inline comment as done.

updated

gbalats accepted this revision.Apr 21 2021, 5:08 PM
This revision is now accepted and ready to land.Apr 21 2021, 5:08 PM