This is an archive of the discontinued LLVM Phabricator instance.

[ASan] Enable optional ASan recovery
ClosedPublic

Authored by ygribov on Aug 25 2015, 5:53 AM.

Details

Summary

This patch adds flags to clang and libasan to continue execution after detection of first error (in hope that more errors can be detected during single run). Such mode allows for a faster run-report-fix loop on embedded system where fixing a bug and preparing new firmware may take a significant time. To enable, user should build his code with -fsanitize-recover=address and then run it under ASAN_OPTIONS=keep_going=1.

I know kcc's generally negative attitude towards keep_going mode but I want the patch to at least be available publicly so that poor souls dealing with complex systems may access it easily.

Diff Detail

Repository
rL LLVM

Event Timeline

ygribov updated this revision to Diff 33064.Aug 25 2015, 5:53 AM
ygribov retitled this revision from to [ASan] Enable optional ASan recovery.
ygribov updated this object.
ygribov added reviewers: kcc, samsonov.
ygribov set the repository for this revision to rL LLVM.
ygribov added subscribers: llvm-commits, dvyukov, m.ostapenko.
eugenis added inline comments.
include/clang/Frontend/CodeGenOptions.def
118

Could you make it not specific to ASan? Smth like SanitizeRecover so that it could affect MSan as well in the future.

lib/CodeGen/BackendUtil.cpp
343

Why do you need separate "noabort" functions for this? See addMemorySanitizerPass which inspects CodeGenOptions in the function body.

lib/Transforms/Instrumentation/AddressSanitizer.cpp
125

Lets stick to existing terminology (we already have "recover" and "keep-going" names for the same concept).

lib/asan/asan_flags.inc
147

call it max_errors, similarly to the compiler flag?

lib/asan/asan_poisoning.cc
221

why "false"? Do you need to check the keep_going flag?

kcc added inline comments.Aug 25 2015, 6:50 PM
lib/asan/asan_report.cc
652

OMG

ygribov added inline comments.Aug 26 2015, 2:53 AM
include/clang/Frontend/CodeGenOptions.def
118

Right.

lib/CodeGen/BackendUtil.cpp
343

Ok.

lib/Transforms/Instrumentation/AddressSanitizer.cpp
125

Ok, I'll use "recover" then. Most of this was done in pre-recover times.

lib/asan/asan_flags.inc
147

Sure.

lib/asan/asan_poisoning.cc
221

Hm, probably yes.

lib/asan/asan_report.cc
652

This section could be removed if we are ok with ignoring bugs that run in parallel.

ygribov added inline comments.Aug 26 2015, 2:59 AM
lib/asan/asan_report.cc
652

Could you suggest a cleaner way to do this?

ygribov updated this revision to Diff 33203.Aug 26 2015, 8:10 AM
ygribov added a reviewer: eugenis.
ygribov removed a subscriber: eugenis.

Updated based on initial feedback. It seems we have way too many different names for recovery feature (UBSan's trap-on-error, MSan's keep-going, ASan's recover, noabort, etc.).

samsonov added inline comments.Aug 26 2015, 1:44 PM
include/clang/Basic/Sanitizers.h
82

Do you actually use it anywhere?

lib/CodeGen/BackendUtil.cpp
216

Is recovery supposed to work for -fsanitize=kernel-address? Should you mark it as Unrecoverable?

lib/asan/asan_flags.inc
144

This is really sad, but we tend to use halt_on_error runtime flag for that. This name is already used in TSan, MSan and UBSan.

lib/asan/asan_interface_internal.h
134

This function is also declared in public header: include/sanitizer/asan_interface.h (which is kind of stupid), so you have to fix it in both places. Looks like it's not an ABI break, though.

lib/asan/asan_internal.h
129

We have atomic_fetch_add

samsonov edited edge metadata.Aug 26 2015, 1:46 PM

Updated based on initial feedback. It seems we have way too many different names for recovery feature (UBSan's trap-on-error, MSan's keep-going, ASan's recover, noabort, etc.).

Right, naming is hard :( Speaking of user-visible flags, we should use -fsanitize-recover=address flag (UBSan's trap-on-error corresponds to -fsanitize-trap= and is slightly different).

ygribov added inline comments.Aug 26 2015, 1:49 PM
include/clang/Basic/Sanitizers.h
82

No, that's remains of some removed code.

lib/CodeGen/BackendUtil.cpp
216

KAsan is actually recoverable by default. Perhaps I should do explicit Recover for CompileKernel as well.

lib/asan/asan_internal.h
129

Yeah but it may be slower.

filcab added a subscriber: filcab.Aug 26 2015, 2:59 PM

What about the file test/asan/TestCases/Darwin/interface_symbols_darwin.c? Shouldn't it be changed too?

lib/asan/asan_report.cc
695–700

Comment on line 632 would apply here too.

697

The usual "I see no benefit in using LIKELY here".

lib/asan/scripts/asan_symbolize.py
429

Why? I might have missed the reason for this change.

test/Driver/fsanitize.c
168

Shouldn't address be added here too?

filcab added inline comments.Aug 26 2015, 2:59 PM
lib/asan/asan_internal.h
125

Technically it might be unlikely, but this is noise in the code.
If we're not recovering, we will blow up anyway. Removing UNLIKELY would make the code easier to read with 0 practical difference in performance for any case.
In fact, it we're not silencing, *then* we want to be as fast as possible. (I'd go with the simple if, though)

129

Slower is better than undefined behavior, no? ;-)

lib/asan/asan_report.cc
632–644

cmpxchg should work, no?

646–664

Again, no benefit in tagging these as UNLIKELY.

652

At the very least, you can create a helper function with the code executed when you get the lock, no?

659

This comment should be changed, too. It's not true after this diff gets applied.

661

Why did you remove the first part of the comment?
It's still the reason why we have the sleep call here.

Thanks a lot for detailed review!

What about the file test/asan/TestCases/Darwin/interface_symbols_darwin.c? Shouldn't it be changed too?

Yup.

lib/asan/asan_internal.h
125

Sorry, could you give more details? I indeed favor the default (non-recovery) case here. As for the UNLIKELY, we've seen cases where it significantly improved performance.

129

Again, I was probably too scary to slow down the default path. I'll switch to atomics in next version then.

lib/asan/asan_poisoning.cc
221

BTW what are these sanitizer_unaligned functions for? I don't think compiler emits then but they are also missing in public headers.

lib/asan/asan_report.cc
632–644

Yup.

652

Ah, so it's all about this tiny goto (not about hacky trial-and-error mutex acquisition). Ok, I'll put the code to a separate function.

659

Right.

661

I moved the first part of the comment above because it's applicable to both branches.

lib/asan/scripts/asan_symbolize.py
429

Hm, I don't remember doing this hunk at all. I'll double check when at work.

test/Driver/fsanitize.c
168

Ok.

ygribov updated this revision to Diff 33321.Aug 27 2015, 6:28 AM
ygribov edited edge metadata.

Updated based on reviews.

Still missing the Darwin interface symbols file.

Thanks for working on this.

Filipe

lib/asan/asan_interceptors.cc
73

Why false? Shouldn't it be the flag for halt_on_error?

lib/asan/asan_internal.h
125

So, I see two things that can happen:

halt_on_error is true:
We go through the if once (maybe more times, if we find several errors in several threads).
QuickCheckSilence returns false
LIKELY Isn't really doing much. Tweaking a jump that is executed once doesn't really matter.

halt_on_error is false:
We go through the if roughly max_errors times (let's say around 100, which seems to me is already a too big number for max_errors (if you actually manage to skip 100 errors and still execute code, that's a huge amount of luck!)

From these two, I would actually say that, at most, you should have an UNLIKELY, and not a LIKELY. *If* we execute that if more than once, then I should always expect it to be false.

Still, even in the case where we execute it "a bunch of times" we're still "almost never" executing it. QuickCheckSilence is called only when reporting errors, which means that:
If halt_on_error, then the program is terminating right after this report. We're not winning anything.
If halt_on_error, then the LIKELY is reversed. And we're still not winning much, we're not expecting this code to be executed more than max_errors, which, I'm assuming, will always be a small number.

My opinion is that it's not making the code faster, and it's making it less clear.

You mentioned performance increases. What tests were you running, and how?

131

You can change the previous atomic_load to be an atomic_fetch_add of 0 or 1, no?

test/Driver/fsanitize.c
168

Why didn't you update the number too? Does the test pass?

test/asan/TestCases/Linux/halt_on_error-torture.cc
1 ↗(On Diff #33321)

Why is this test linux only?

kcc added inline comments.Aug 27 2015, 3:17 PM
lib/asan/asan_flags.inc
147

Shouldn't this go to sanitizer_common?

lib/asan/asan_interceptors.cc
52

why do we need it here?

lib/asan/asan_internal.h
124

Does it have to be static inline?
Also, I don't like the name. But it depends on another question later.

lib/asan/asan_report.cc
29

Don't define this macro, just use the C++ line as is

697

Agree.

1014

I think we should revert the name and the meaning of the flag:
make it "int recover" instead of "int fatal" to match the flag.
No?

Also, we need to make the parameter more clear.
One way is to declare a enum {ABORT_ON_ERROR, RECOVER_ON_ERROR} and pass it as the parameter.
Another is to
define two more internal functions
asan_report_error_and_abort() &
asan_report_error_and_maybe_recover()
and instead of calling __asan_report_error with a constant in the last parameter call one of these.

At the very least, write /*recover=*/false

test/asan/TestCases/Linux/halt_on_error-torture.cc
29 ↗(On Diff #33321)

don't we want to check that multiple errors were reported?

check how the tests set ASAN_OPTIONS nowadays.
There were some massive changes...

44 ↗(On Diff #33321)

use C++ new/delete

Still missing the Darwin interface symbols file.

Right, I iterated Phab's comments and completely forgot about this item. I can add a test but I don't have Mac at hand so I'll have to XFAIL it if problem arises after commit.

lib/asan/asan_flags.inc
147

Hm, perhaps we should implement it for UBSan/MSan and then move to sanitizer_common?

lib/asan/asan_interceptors.cc
52

This enables fast path once error reporting limit is reached. Otherwise an error in a busy loop may cause significant (100x) performance degradation. One might argue that we should not be interested in continuing execution at all after we can't report more errors but in fully sanitized distribution it may be beneficial to continue execution to provoke errors in dependent processes.

73

This is covered by quick check above.

lib/asan/asan_internal.h
124

It doesn't but I thought this would be more efficient.

125

halt_on_error is true:
We go through the if once (maybe more times, if we find several errors in several threads).

I don't quite see it like this. Quick check may be called in non-fatal context (e.g. to enable fast path in ACCESS_MEMORY_RANGE). Information about likeliness may inform compiler to e.g. allocate registers in favor of preferred code or align it in a more appropriate way (e.g. to the same cache line).

You mentioned performance increases. What tests were you running, and how?

Check https://www.mail-archive.com/address-sanitizer@googlegroups.com/msg00403.html . Adding unlikely hint to noreturning if improved code by some 40%.

131

Well yeah but this would add a risk of overflow.

lib/asan/asan_report.cc
29

Ok.

1014

Frankly I don't like the idea of adding fatal/recover parameter to public interface at all. Fatality should be controlled by ASAN_OPTIONS and be transparent for user code. Thoughts?

test/Driver/fsanitize.c
168

Yup, it did. I'll reexamine it.

test/asan/TestCases/Linux/halt_on_error-torture.cc
1 ↗(On Diff #33321)

Mainly because I'm afraid that Mac/BSD maintainers will shout at me when I break their bots)

29 ↗(On Diff #33321)

Ok.

44 ↗(On Diff #33321)

Ok.

ygribov updated this revision to Diff 33570.Aug 31 2015, 4:37 AM

Updated based on review. I haven't yet made torture test generic because I'm yet to test this patch on Mac (I plan to do this for next iteration).

ygribov added inline comments.Aug 31 2015, 4:41 AM
test/Driver/fsanitize.c
168

This should be ok - recovery is not enabled by default for ASan so resulting number didn't change.

Folks, any feedback on this?

kcc edited edge metadata.Sep 4 2015, 11:36 AM

The patch is already too long and has too many comments.
Some parts are less controversial, some are more.
I suggest you to split it at this point and start reviewing/submitting it in parts.
Probably start from the driver/flags changes.

lib/Transforms/Instrumentation/AddressSanitizer.cpp
1081–1086

I think we use {} in such cases (if followed by an else with {})

lib/asan/asan_flags.inc
147

No, just add it to sanitizer_common and write something like ("currently supported by asan") in the description.

If we add things that are potentially useful to other sanitizers to one of them (not to sanitizer_common)
we then end up having duplicates in all sanitizers.

lib/asan/asan_internal.h
124

I don't like the name,
I don't like the fact that it has a parameter,
I don't like that it's static inline,
I don't like its behavior.

We have two situations:

  1. We are reporting an error and don't want to do it (is_error=true).

There is no harm in making this function non-inline in this case.
bool WantToReportMoreError();

  1. We don't know whether there will be an error, but we are shutting down the checking functionality

just in case (or to win a bit of performance).
I don't think we need to do this at all.
The win is not clear, but the added complexity is.

lib/asan/asan_report.cc
633

This is all exceptionally horrible.
I've spent quite a bit of time and failed to understand how it works,
also I don't understand how to test it.
Why not just grab a lock for the entire duration of error reporting?

The current logic was there because we never reported more than one bug,
now that we do -- simply grab a lock!

703

class members have the _ suffix.

lib/asan/asan_rtl.cc
131

please leave __asan_report_error here. It is used as a target for break points in debugger.

You may then check halt_on_error inside __asan_report_error

kcc added a comment.Sep 4 2015, 12:50 PM

One more interesting question that came to my mind: is this going to be async-signal-safe?
If you think it will, please add tests.
If not, the whole idea is kind of questionable.

Some parts are less controversial, some are more.

What's your general impression on the feature though?

I suggest you to split it at this point and start reviewing/submitting it in parts.
Probably start from the driver/flags changes.

Hm but I don't see any functionality which can be moved to independent commit (except the flags part which is really minor). We could split to separate patches for rt, clang and llvm but I don't see how this is going to simplify review.

lib/asan/asan_flags.inc
147

If we add things that are potentially useful to other sanitizers to one of them (not to sanitizer_common)
we then end up having duplicates in all sanitizers.

That makes sense, we already have too many semi-duplicate flags.

lib/asan/asan_internal.h
124

Problem is that lack of this check would cause program to crawl in some cases. But ok, this may live in a private branch.

lib/asan/asan_report.cc
633

also I don't understand how to test it.

Did you ever understand how to test original code :) ? The behavior for fatal errors didn't change by the way.

Why not just grab a lock for the entire duration of error reporting?

Because we may run into second error while reporting the first one in the same thread. Grabbing a lock would cause a deadlock in this case.

703

I'm not sure that's true for statics, I see numerous places where they aren't undered.

One more interesting question that came to my mind: is this going to be async-signal-safe?
If not, the whole idea is kind of questionable.

I don't see why it shouldn't - the only problematic place is acquiring the mutex which will either wait (in case reporting_tid is different from current_tid) or fail with informative message after timeout.

If you think it will, please add tests.

I could generate errors in endless loop in a thread and bombard it with signals but that's probably not very reliable.

kcc added a comment.Sep 4 2015, 1:35 PM

One more interesting question that came to my mind: is this going to be async-signal-safe?
If not, the whole idea is kind of questionable.

I don't see why it shouldn't - the only problematic place is acquiring the mutex which will either wait (in case reporting_tid is different from current_tid) or fail with informative message after timeout.

If you think it will, please add tests.

I could generate errors in endless loop in a thread and bombard it with signals but that's probably not very reliable.

The more interesting would be to test the case where we have asan-ish reports in signal handlers (and at the same time in regular code).

kcc added a comment.Sep 4 2015, 1:39 PM

What's your general impression on the feature though?

Same as before.
It is an anti-feature for the majority of users and it will bring much additional complexity to the code
and extra maintenance cost.

I am willing to go forward with it because for some (still obscure) reasons you need this feature
and because your contribution is valuable.

kcc added a comment.Sep 4 2015, 1:42 PM

Hm but I don't see any functionality which can be moved to independent commit (except the flags part which is really minor). We could split to separate patches for rt, clang and llvm but I don't see how this is going to simplify review.

Review is an O(N^2) process. The smaller are the patches the quicker you get through.

kcc added inline comments.Sep 4 2015, 1:47 PM
lib/asan/asan_report.cc
633

Did you ever understand how to test original code :) ?

That's not an argument. We don't want to make things worse just because they are not good at the moment.

Because we may run into second error while reporting the first one in the same thread. Grabbing a lock would cause a deadlock in this case.

Yea. Async signal safety...

Folks,

In case anyone wants this feature it's probably good time to speak up. If there's no interest, I'll probably abandon this revision in a few days.

-Y

eugenis edited edge metadata.Sep 16 2015, 1:05 PM

I think this is a useful feature but I'm a bit worried about the implementation complexity.

We've been running entire Android devices with ASan, and having them crash at boot due to a single bug is moderately annoying.

For the reference, https://source.android.com/devices/tech/debug/asan.html#sanitize_target.

I think this is a useful feature but I'm a bit worried about the implementation complexity.

You mean the report lock part? I agree, although I'm not sure how to simplify it. Perhaps always abort in case of conflict?

We've been running entire Android devices with ASan, and having them crash at boot due to a single bug is moderately annoying.

Yup, that's my use-case as well.

kcc added a subscriber: kcc.Sep 16 2015, 10:25 PM

Let's try to finish it.
I won't be able to participate in the review before 28-th, but feel free to
continue/finish w/o me.

eugenis added inline comments.Sep 21 2015, 3:09 PM
lib/asan/asan_report.cc
671

I think this timeout complicates things without any benefit. Why not wait indefinitely?

I think the logic should be like this:

  1. if the current thread is already reporting, die immediately
  2. if (recover) {lock()} else { if (!try_lock()) sleep(inf) }
  3. report
  4. unlock()
1014–1015

I agree, if the outside caller thinks the error is fatal, it can just abort() or something.

samsonov added inline comments.Sep 21 2015, 4:46 PM
lib/asan/asan_internal.h
124

The name is really confusing. It should be smth. like IgnoreError. Also, instead of passing is_error as a parameter, and doing conditional atomic_fetch_add, can you just split this function into two: one would check if need to ignore the error (halt_on_error is true, or number of reported errors is too large already), and another one would increment the number of detected errors. (could you then call the latter one from ScopedInErrorReport constructor)?

Can you make it ALWAYS_INLINE?

lib/asan/asan_report.cc
703

There are no trailing _ in simple structs, but there should be in classes.

Sorry, I got distracted by other projects. I'm mostly done with all remarks with just one issue (can't commonize halt_on_error).

lib/asan/asan_flags.inc
147

Here's the problem - we need two halt_on_error's for ASan and integrated UBSan. Another problem is that msan does some local magic to alias current halt_on_error to keep_going flag.

lib/asan/asan_report.cc
633

Funny enough, even current ASan isn't asynch signal stable. If you have repeated error in nodeferred signals which are being send to a thread, it may deadlock during error reporting in ScopedInErrorLock. The sequence is roughly:

  • kernel delivers signal
  • trigger error
  • lock report file
  • call internal_write
  • enter kernel
  • kernel delivers next nodeferred signal
  • trigger the same error
  • etc.
ygribov updated this revision to Diff 38931.Nov 2 2015, 9:02 AM
ygribov edited edge metadata.
ygribov removed rL LLVM as the repository for this revision.

I've addressed the remarks from previous review except the one for halt_on_error (we want separate halt_on_error for ASan and UBSan). The llvm and clang parts of the patch will be submitted for review separately per Kostya's request.

BTW I haven't yet ported tests to Darwin, this is something tbd.

ygribov added inline comments.Nov 3 2015, 11:32 PM
lib/asan/asan_report.cc
651

This could be much simpler if we had recursive locks.

eugenis accepted this revision.Nov 5 2015, 11:33 AM
eugenis edited edge metadata.

LGTM
Thanks for working on this!

This revision is now accepted and ready to land.Nov 5 2015, 11:33 AM
This revision was automatically updated to reflect the committed changes.

BTW I haven't yet ported tests to Darwin, this is something tbd.

Filipe, Kuba, the feature has been merged to upstream and seems to work robustly. But I don't think I'll have time to test it on OS X in near future (I don't have easy access to hardware and my customer does not need it). So I was wondering whether you could help? I don't expect any errors, just maybe some really simple portability fixes.

BTW I haven't yet ported tests to Darwin, this is something tbd.

Filipe, Kuba, the feature has been merged to upstream and seems to work robustly. But I don't think I'll have time to test it on OS X in near future (I don't have easy access to hardware and my customer does not need it). So I was wondering whether you could help? I don't expect any errors, just maybe some really simple portability fixes.

Hi, are you talking about the halt_on_error-signals and halt_on_error-torture testcases? On my OS X machine, they both pass, so you might just want to move them out of the Linux/ directory. Let me know if the OS X buildbot fails and I'll take a look at it.

On my OS X machine, they both pass, so you might just want to move them out of the Linux/ directory

Great, thanks!