This is an archive of the discontinued LLVM Phabricator instance.

[ASan] Enable optional ASan recovery
ClosedPublic

Authored by ygribov on Aug 25 2015, 5:53 AM.

Details

Summary

This patch adds flags to clang and libasan to continue execution after detection of first error (in hope that more errors can be detected during single run). Such mode allows for a faster run-report-fix loop on embedded system where fixing a bug and preparing new firmware may take a significant time. To enable, user should build his code with -fsanitize-recover=address and then run it under ASAN_OPTIONS=keep_going=1.

I know kcc's generally negative attitude towards keep_going mode but I want the patch to at least be available publicly so that poor souls dealing with complex systems may access it easily.

Diff Detail

Repository
rL LLVM

Event Timeline

ygribov updated this revision to Diff 33064.Aug 25 2015, 5:53 AM
ygribov retitled this revision from to [ASan] Enable optional ASan recovery.
ygribov updated this object.
ygribov added reviewers: kcc, samsonov.
ygribov set the repository for this revision to rL LLVM.
ygribov added subscribers: llvm-commits, dvyukov, m.ostapenko.
eugenis added inline comments.
include/clang/Frontend/CodeGenOptions.def
118 ↗(On Diff #33064)

Could you make it not specific to ASan? Smth like SanitizeRecover so that it could affect MSan as well in the future.

lib/CodeGen/BackendUtil.cpp
343 ↗(On Diff #33064)

Why do you need separate "noabort" functions for this? See addMemorySanitizerPass which inspects CodeGenOptions in the function body.

lib/Transforms/Instrumentation/AddressSanitizer.cpp
125 ↗(On Diff #33064)

Lets stick to existing terminology (we already have "recover" and "keep-going" names for the same concept).

lib/asan/asan_flags.inc
147 ↗(On Diff #33064)

call it max_errors, similarly to the compiler flag?

lib/asan/asan_poisoning.cc
221 ↗(On Diff #33064)

why "false"? Do you need to check the keep_going flag?

kcc added inline comments.Aug 25 2015, 6:50 PM
lib/asan/asan_report.cc
652 ↗(On Diff #33064)

OMG

ygribov added inline comments.Aug 26 2015, 2:53 AM
include/clang/Frontend/CodeGenOptions.def
118 ↗(On Diff #33064)

Right.

lib/CodeGen/BackendUtil.cpp
343 ↗(On Diff #33064)

Ok.

lib/Transforms/Instrumentation/AddressSanitizer.cpp
125 ↗(On Diff #33064)

Ok, I'll use "recover" then. Most of this was done in pre-recover times.

lib/asan/asan_flags.inc
147 ↗(On Diff #33064)

Sure.

lib/asan/asan_poisoning.cc
221 ↗(On Diff #33064)

Hm, probably yes.

lib/asan/asan_report.cc
652 ↗(On Diff #33064)

This section could be removed if we are ok with ignoring bugs that run in parallel.

ygribov added inline comments.Aug 26 2015, 2:59 AM
lib/asan/asan_report.cc
652 ↗(On Diff #33064)

Could you suggest a cleaner way to do this?

ygribov updated this revision to Diff 33203.Aug 26 2015, 8:10 AM
ygribov added a reviewer: eugenis.
ygribov removed a subscriber: eugenis.

Updated based on initial feedback. It seems we have way too many different names for recovery feature (UBSan's trap-on-error, MSan's keep-going, ASan's recover, noabort, etc.).

samsonov added inline comments.Aug 26 2015, 1:44 PM
include/clang/Basic/Sanitizers.h
82 ↗(On Diff #33203)

Do you actually use it anywhere?

lib/CodeGen/BackendUtil.cpp
214 ↗(On Diff #33203)

Is recovery supposed to work for -fsanitize=kernel-address? Should you mark it as Unrecoverable?

lib/asan/asan_flags.inc
144 ↗(On Diff #33203)

This is really sad, but we tend to use halt_on_error runtime flag for that. This name is already used in TSan, MSan and UBSan.

lib/asan/asan_interface_internal.h
134 ↗(On Diff #33203)

This function is also declared in public header: include/sanitizer/asan_interface.h (which is kind of stupid), so you have to fix it in both places. Looks like it's not an ABI break, though.

lib/asan/asan_internal.h
129 ↗(On Diff #33203)

We have atomic_fetch_add

samsonov edited edge metadata.Aug 26 2015, 1:46 PM

Updated based on initial feedback. It seems we have way too many different names for recovery feature (UBSan's trap-on-error, MSan's keep-going, ASan's recover, noabort, etc.).

Right, naming is hard :( Speaking of user-visible flags, we should use -fsanitize-recover=address flag (UBSan's trap-on-error corresponds to -fsanitize-trap= and is slightly different).

ygribov added inline comments.Aug 26 2015, 1:49 PM
include/clang/Basic/Sanitizers.h
82 ↗(On Diff #33203)

No, that's remains of some removed code.

lib/CodeGen/BackendUtil.cpp
214 ↗(On Diff #33203)

KAsan is actually recoverable by default. Perhaps I should do explicit Recover for CompileKernel as well.

lib/asan/asan_internal.h
129 ↗(On Diff #33203)

Yeah but it may be slower.

filcab added a subscriber: filcab.Aug 26 2015, 2:59 PM

What about the file test/asan/TestCases/Darwin/interface_symbols_darwin.c? Shouldn't it be changed too?

lib/asan/asan_report.cc
695 ↗(On Diff #33203)

Comment on line 632 would apply here too.

697 ↗(On Diff #33203)

The usual "I see no benefit in using LIKELY here".

lib/asan/scripts/asan_symbolize.py
429 ↗(On Diff #33203)

Why? I might have missed the reason for this change.

test/Driver/fsanitize.c
168 ↗(On Diff #33203)

Shouldn't address be added here too?

filcab added inline comments.Aug 26 2015, 2:59 PM
lib/asan/asan_internal.h
125 ↗(On Diff #33203)

Technically it might be unlikely, but this is noise in the code.
If we're not recovering, we will blow up anyway. Removing UNLIKELY would make the code easier to read with 0 practical difference in performance for any case.
In fact, it we're not silencing, *then* we want to be as fast as possible. (I'd go with the simple if, though)

129 ↗(On Diff #33203)

Slower is better than undefined behavior, no? ;-)

lib/asan/asan_report.cc
632 ↗(On Diff #33203)

cmpxchg should work, no?

646 ↗(On Diff #33203)

Again, no benefit in tagging these as UNLIKELY.

652 ↗(On Diff #33203)

At the very least, you can create a helper function with the code executed when you get the lock, no?

659 ↗(On Diff #33203)

This comment should be changed, too. It's not true after this diff gets applied.

661 ↗(On Diff #33203)

Why did you remove the first part of the comment?
It's still the reason why we have the sleep call here.

Thanks a lot for detailed review!

What about the file test/asan/TestCases/Darwin/interface_symbols_darwin.c? Shouldn't it be changed too?

Yup.

lib/asan/asan_internal.h
125 ↗(On Diff #33203)

Sorry, could you give more details? I indeed favor the default (non-recovery) case here. As for the UNLIKELY, we've seen cases where it significantly improved performance.

129 ↗(On Diff #33203)

Again, I was probably too scary to slow down the default path. I'll switch to atomics in next version then.

lib/asan/asan_poisoning.cc
221 ↗(On Diff #33203)

BTW what are these sanitizer_unaligned functions for? I don't think compiler emits then but they are also missing in public headers.

lib/asan/asan_report.cc
632 ↗(On Diff #33203)

Yup.

652 ↗(On Diff #33203)

Ah, so it's all about this tiny goto (not about hacky trial-and-error mutex acquisition). Ok, I'll put the code to a separate function.

659 ↗(On Diff #33203)

Right.

661 ↗(On Diff #33203)

I moved the first part of the comment above because it's applicable to both branches.

lib/asan/scripts/asan_symbolize.py
429 ↗(On Diff #33203)

Hm, I don't remember doing this hunk at all. I'll double check when at work.

test/Driver/fsanitize.c
168 ↗(On Diff #33203)

Ok.

ygribov updated this revision to Diff 33321.Aug 27 2015, 6:28 AM
ygribov edited edge metadata.

Updated based on reviews.

Still missing the Darwin interface symbols file.

Thanks for working on this.

Filipe

lib/asan/asan_interceptors.cc
73 ↗(On Diff #33321)

Why false? Shouldn't it be the flag for halt_on_error?

lib/asan/asan_internal.h
125 ↗(On Diff #33321)

So, I see two things that can happen:

halt_on_error is true:
We go through the if once (maybe more times, if we find several errors in several threads).
QuickCheckSilence returns false
LIKELY Isn't really doing much. Tweaking a jump that is executed once doesn't really matter.

halt_on_error is false:
We go through the if roughly max_errors times (let's say around 100, which seems to me is already a too big number for max_errors (if you actually manage to skip 100 errors and still execute code, that's a huge amount of luck!)

From these two, I would actually say that, at most, you should have an UNLIKELY, and not a LIKELY. *If* we execute that if more than once, then I should always expect it to be false.

Still, even in the case where we execute it "a bunch of times" we're still "almost never" executing it. QuickCheckSilence is called only when reporting errors, which means that:
If halt_on_error, then the program is terminating right after this report. We're not winning anything.
If halt_on_error, then the LIKELY is reversed. And we're still not winning much, we're not expecting this code to be executed more than max_errors, which, I'm assuming, will always be a small number.

My opinion is that it's not making the code faster, and it's making it less clear.

You mentioned performance increases. What tests were you running, and how?

131 ↗(On Diff #33321)

You can change the previous atomic_load to be an atomic_fetch_add of 0 or 1, no?

test/Driver/fsanitize.c
168 ↗(On Diff #33321)

Why didn't you update the number too? Does the test pass?

test/asan/TestCases/Linux/halt_on_error-torture.cc
1 ↗(On Diff #33321)

Why is this test linux only?

kcc added inline comments.Aug 27 2015, 3:17 PM
lib/asan/asan_flags.inc
147 ↗(On Diff #33321)

Shouldn't this go to sanitizer_common?

lib/asan/asan_interceptors.cc
52 ↗(On Diff #33321)

why do we need it here?

lib/asan/asan_internal.h
124 ↗(On Diff #33321)

Does it have to be static inline?
Also, I don't like the name. But it depends on another question later.

lib/asan/asan_report.cc
29 ↗(On Diff #33321)

Don't define this macro, just use the C++ line as is

697 ↗(On Diff #33321)

Agree.

1014 ↗(On Diff #33321)

I think we should revert the name and the meaning of the flag:
make it "int recover" instead of "int fatal" to match the flag.
No?

Also, we need to make the parameter more clear.
One way is to declare a enum {ABORT_ON_ERROR, RECOVER_ON_ERROR} and pass it as the parameter.
Another is to
define two more internal functions
asan_report_error_and_abort() &
asan_report_error_and_maybe_recover()
and instead of calling __asan_report_error with a constant in the last parameter call one of these.

At the very least, write /*recover=*/false

test/asan/TestCases/Linux/halt_on_error-torture.cc
29 ↗(On Diff #33321)

don't we want to check that multiple errors were reported?

check how the tests set ASAN_OPTIONS nowadays.
There were some massive changes...

44 ↗(On Diff #33321)

use C++ new/delete

Still missing the Darwin interface symbols file.

Right, I iterated Phab's comments and completely forgot about this item. I can add a test but I don't have Mac at hand so I'll have to XFAIL it if problem arises after commit.

lib/asan/asan_flags.inc
147 ↗(On Diff #33321)

Hm, perhaps we should implement it for UBSan/MSan and then move to sanitizer_common?

lib/asan/asan_interceptors.cc
52 ↗(On Diff #33321)

This enables fast path once error reporting limit is reached. Otherwise an error in a busy loop may cause significant (100x) performance degradation. One might argue that we should not be interested in continuing execution at all after we can't report more errors but in fully sanitized distribution it may be beneficial to continue execution to provoke errors in dependent processes.

73 ↗(On Diff #33321)

This is covered by quick check above.

lib/asan/asan_internal.h
124 ↗(On Diff #33321)

It doesn't but I thought this would be more efficient.

125 ↗(On Diff #33321)

halt_on_error is true:
We go through the if once (maybe more times, if we find several errors in several threads).

I don't quite see it like this. Quick check may be called in non-fatal context (e.g. to enable fast path in ACCESS_MEMORY_RANGE). Information about likeliness may inform compiler to e.g. allocate registers in favor of preferred code or align it in a more appropriate way (e.g. to the same cache line).

You mentioned performance increases. What tests were you running, and how?

Check https://www.mail-archive.com/address-sanitizer@googlegroups.com/msg00403.html . Adding unlikely hint to noreturning if improved code by some 40%.

131 ↗(On Diff #33321)

Well yeah but this would add a risk of overflow.

lib/asan/asan_report.cc
29 ↗(On Diff #33321)

Ok.

1014 ↗(On Diff #33321)

Frankly I don't like the idea of adding fatal/recover parameter to public interface at all. Fatality should be controlled by ASAN_OPTIONS and be transparent for user code. Thoughts?

test/Driver/fsanitize.c
168 ↗(On Diff #33321)

Yup, it did. I'll reexamine it.

test/asan/TestCases/Linux/halt_on_error-torture.cc
1 ↗(On Diff #33321)

Mainly because I'm afraid that Mac/BSD maintainers will shout at me when I break their bots)

29 ↗(On Diff #33321)

Ok.

44 ↗(On Diff #33321)

Ok.

ygribov updated this revision to Diff 33570.Aug 31 2015, 4:37 AM

Updated based on review. I haven't yet made torture test generic because I'm yet to test this patch on Mac (I plan to do this for next iteration).

ygribov added inline comments.Aug 31 2015, 4:41 AM
test/Driver/fsanitize.c
168 ↗(On Diff #33570)

This should be ok - recovery is not enabled by default for ASan so resulting number didn't change.

Folks, any feedback on this?

kcc edited edge metadata.Sep 4 2015, 11:36 AM

The patch is already too long and has too many comments.
Some parts are less controversial, some are more.
I suggest you to split it at this point and start reviewing/submitting it in parts.
Probably start from the driver/flags changes.

lib/Transforms/Instrumentation/AddressSanitizer.cpp
1099 ↗(On Diff #33570)

I think we use {} in such cases (if followed by an else with {})

lib/asan/asan_flags.inc
141 ↗(On Diff #33570)

No, just add it to sanitizer_common and write something like ("currently supported by asan") in the description.

If we add things that are potentially useful to other sanitizers to one of them (not to sanitizer_common)
we then end up having duplicates in all sanitizers.

lib/asan/asan_internal.h
122 ↗(On Diff #33570)

I don't like the name,
I don't like the fact that it has a parameter,
I don't like that it's static inline,
I don't like its behavior.

We have two situations:

  1. We are reporting an error and don't want to do it (is_error=true).

There is no harm in making this function non-inline in this case.
bool WantToReportMoreError();

  1. We don't know whether there will be an error, but we are shutting down the checking functionality

just in case (or to win a bit of performance).
I don't think we need to do this at all.
The win is not clear, but the added complexity is.

lib/asan/asan_report.cc
626 ↗(On Diff #33570)

This is all exceptionally horrible.
I've spent quite a bit of time and failed to understand how it works,
also I don't understand how to test it.
Why not just grab a lock for the entire duration of error reporting?

The current logic was there because we never reported more than one bug,
now that we do -- simply grab a lock!

698 ↗(On Diff #33570)

class members have the _ suffix.

lib/asan/asan_rtl.cc
126 ↗(On Diff #33570)

please leave __asan_report_error here. It is used as a target for break points in debugger.

You may then check halt_on_error inside __asan_report_error

kcc added a comment.Sep 4 2015, 12:50 PM

One more interesting question that came to my mind: is this going to be async-signal-safe?
If you think it will, please add tests.
If not, the whole idea is kind of questionable.

Some parts are less controversial, some are more.

What's your general impression on the feature though?

I suggest you to split it at this point and start reviewing/submitting it in parts.
Probably start from the driver/flags changes.

Hm but I don't see any functionality which can be moved to independent commit (except the flags part which is really minor). We could split to separate patches for rt, clang and llvm but I don't see how this is going to simplify review.

lib/asan/asan_flags.inc
141 ↗(On Diff #33570)

If we add things that are potentially useful to other sanitizers to one of them (not to sanitizer_common)
we then end up having duplicates in all sanitizers.

That makes sense, we already have too many semi-duplicate flags.

lib/asan/asan_internal.h
122 ↗(On Diff #33570)

Problem is that lack of this check would cause program to crawl in some cases. But ok, this may live in a private branch.

lib/asan/asan_report.cc
626 ↗(On Diff #33570)

also I don't understand how to test it.

Did you ever understand how to test original code :) ? The behavior for fatal errors didn't change by the way.

Why not just grab a lock for the entire duration of error reporting?

Because we may run into second error while reporting the first one in the same thread. Grabbing a lock would cause a deadlock in this case.

698 ↗(On Diff #33570)

I'm not sure that's true for statics, I see numerous places where they aren't undered.

One more interesting question that came to my mind: is this going to be async-signal-safe?
If not, the whole idea is kind of questionable.

I don't see why it shouldn't - the only problematic place is acquiring the mutex which will either wait (in case reporting_tid is different from current_tid) or fail with informative message after timeout.

If you think it will, please add tests.

I could generate errors in endless loop in a thread and bombard it with signals but that's probably not very reliable.

kcc added a comment.Sep 4 2015, 1:35 PM

One more interesting question that came to my mind: is this going to be async-signal-safe?
If not, the whole idea is kind of questionable.

I don't see why it shouldn't - the only problematic place is acquiring the mutex which will either wait (in case reporting_tid is different from current_tid) or fail with informative message after timeout.

If you think it will, please add tests.

I could generate errors in endless loop in a thread and bombard it with signals but that's probably not very reliable.

The more interesting would be to test the case where we have asan-ish reports in signal handlers (and at the same time in regular code).

kcc added a comment.Sep 4 2015, 1:39 PM

What's your general impression on the feature though?

Same as before.
It is an anti-feature for the majority of users and it will bring much additional complexity to the code
and extra maintenance cost.

I am willing to go forward with it because for some (still obscure) reasons you need this feature
and because your contribution is valuable.

kcc added a comment.Sep 4 2015, 1:42 PM

Hm but I don't see any functionality which can be moved to independent commit (except the flags part which is really minor). We could split to separate patches for rt, clang and llvm but I don't see how this is going to simplify review.

Review is an O(N^2) process. The smaller are the patches the quicker you get through.

kcc added inline comments.Sep 4 2015, 1:47 PM
lib/asan/asan_report.cc
626 ↗(On Diff #33570)

Did you ever understand how to test original code :) ?

That's not an argument. We don't want to make things worse just because they are not good at the moment.

Because we may run into second error while reporting the first one in the same thread. Grabbing a lock would cause a deadlock in this case.

Yea. Async signal safety...

Folks,

In case anyone wants this feature it's probably good time to speak up. If there's no interest, I'll probably abandon this revision in a few days.

-Y

eugenis edited edge metadata.Sep 16 2015, 1:05 PM

I think this is a useful feature but I'm a bit worried about the implementation complexity.

We've been running entire Android devices with ASan, and having them crash at boot due to a single bug is moderately annoying.

For the reference, https://source.android.com/devices/tech/debug/asan.html#sanitize_target.

I think this is a useful feature but I'm a bit worried about the implementation complexity.

You mean the report lock part? I agree, although I'm not sure how to simplify it. Perhaps always abort in case of conflict?

We've been running entire Android devices with ASan, and having them crash at boot due to a single bug is moderately annoying.

Yup, that's my use-case as well.

kcc added a subscriber: kcc.Sep 16 2015, 10:25 PM

Let's try to finish it.
I won't be able to participate in the review before 28-th, but feel free to
continue/finish w/o me.

eugenis added inline comments.Sep 21 2015, 3:09 PM
lib/asan/asan_report.cc
649 ↗(On Diff #33570)

I think this timeout complicates things without any benefit. Why not wait indefinitely?

I think the logic should be like this:

  1. if the current thread is already reporting, die immediately
  2. if (recover) {lock()} else { if (!try_lock()) sleep(inf) }
  3. report
  4. unlock()
1006–1007 ↗(On Diff #33570)

I agree, if the outside caller thinks the error is fatal, it can just abort() or something.

samsonov added inline comments.Sep 21 2015, 4:46 PM
lib/asan/asan_internal.h
122 ↗(On Diff #33570)

The name is really confusing. It should be smth. like IgnoreError. Also, instead of passing is_error as a parameter, and doing conditional atomic_fetch_add, can you just split this function into two: one would check if need to ignore the error (halt_on_error is true, or number of reported errors is too large already), and another one would increment the number of detected errors. (could you then call the latter one from ScopedInErrorReport constructor)?

Can you make it ALWAYS_INLINE?

lib/asan/asan_report.cc
698 ↗(On Diff #33570)

There are no trailing _ in simple structs, but there should be in classes.

Sorry, I got distracted by other projects. I'm mostly done with all remarks with just one issue (can't commonize halt_on_error).

lib/asan/asan_flags.inc
141 ↗(On Diff #33570)

Here's the problem - we need two halt_on_error's for ASan and integrated UBSan. Another problem is that msan does some local magic to alias current halt_on_error to keep_going flag.

lib/asan/asan_report.cc
626 ↗(On Diff #33570)

Funny enough, even current ASan isn't asynch signal stable. If you have repeated error in nodeferred signals which are being send to a thread, it may deadlock during error reporting in ScopedInErrorLock. The sequence is roughly:

  • kernel delivers signal
  • trigger error
  • lock report file
  • call internal_write
  • enter kernel
  • kernel delivers next nodeferred signal
  • trigger the same error
  • etc.
ygribov updated this revision to Diff 38931.Nov 2 2015, 9:02 AM
ygribov edited edge metadata.
ygribov removed rL LLVM as the repository for this revision.

I've addressed the remarks from previous review except the one for halt_on_error (we want separate halt_on_error for ASan and UBSan). The llvm and clang parts of the patch will be submitted for review separately per Kostya's request.

BTW I haven't yet ported tests to Darwin, this is something tbd.

ygribov added inline comments.Nov 3 2015, 11:32 PM
lib/asan/asan_report.cc
644 ↗(On Diff #38931)

This could be much simpler if we had recursive locks.

eugenis accepted this revision.Nov 5 2015, 11:33 AM
eugenis edited edge metadata.

LGTM
Thanks for working on this!

This revision is now accepted and ready to land.Nov 5 2015, 11:33 AM
This revision was automatically updated to reflect the committed changes.

BTW I haven't yet ported tests to Darwin, this is something tbd.

Filipe, Kuba, the feature has been merged to upstream and seems to work robustly. But I don't think I'll have time to test it on OS X in near future (I don't have easy access to hardware and my customer does not need it). So I was wondering whether you could help? I don't expect any errors, just maybe some really simple portability fixes.

BTW I haven't yet ported tests to Darwin, this is something tbd.

Filipe, Kuba, the feature has been merged to upstream and seems to work robustly. But I don't think I'll have time to test it on OS X in near future (I don't have easy access to hardware and my customer does not need it). So I was wondering whether you could help? I don't expect any errors, just maybe some really simple portability fixes.

Hi, are you talking about the halt_on_error-signals and halt_on_error-torture testcases? On my OS X machine, they both pass, so you might just want to move them out of the Linux/ directory. Let me know if the OS X buildbot fails and I'll take a look at it.

On my OS X machine, they both pass, so you might just want to move them out of the Linux/ directory

Great, thanks!