Page MenuHomePhabricator

[TSAN/AArch64/Android] Changes for AArch64/Android
AbandonedPublic

Authored by rengolin on Jul 27 2015, 11:56 AM.

Details

Summary

There are several problems getting TSAN on Android/AArch64 to work.
Android is missing TLS support with __thread keyword.

Using pthread api to emulate TLS is not going to work, because
of several reasons:

  1. bionic's pthread_create calls the INTERCEPTED version of pthread_mutex_lock
  1. the key cleanup code calls free() and such, but this is AFTER the main

thread has exited. This means that calling pthread_getspecific again returns
an odd address. So this means that the TLS ThreadState pointer now becomes
corrupt DURING the exit process.

So instead, I opted for a non-interceptible workaround for TLS support.

TSAN on Android also requires patches for bionic.
Please refer to https://code.google.com/p/android/issues/detail?id=179564
for details.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

But what are we to do for NOW when the "proper" fix is maybe months off, or possibly longer?

I'm sorry if I'm sceptical, but from what I've seen of bionic, and indeed of the Android stack, most of it is hack-based "waiting" for the proper fix that will never come.

Leaking this mentality out of the Android project into others to help people experiment with a feature is not the kind of thing that is generally acceptable upstream.

If it was me, I'd keep this internally until the day you do have a proper fix. After all, running TSAN on Android is not stopping any release from seeing the light of day.

But these are just my personal opinions. I'll let Kostya and Eugenis argue about the viability of such a workaround.

cheers,
--renato

jasonk updated this revision to Diff 30749.Jul 27 2015, 3:40 PM

Got rid of a lot of debugging cruft.

jasonk updated this revision to Diff 30753.Jul 27 2015, 3:50 PM

Forgot to wrap ASAN + AArch64 under COMPILER_RT_FORCE_TSAN_AARCH64

jasonk updated this revision to Diff 30754.Jul 27 2015, 3:58 PM
jasonk marked 3 inline comments as done.

Quashed whitespace nits (hopefully)

Okay, most easy nits have now been addressed.

This still needs to be reconciled with D11484

rengolin added inline comments.Jul 27 2015, 4:34 PM
lib/sanitizer_common/sanitizer_linux.cc
911 ↗(On Diff #30754)

Is this an Android only function?

Even if it is, wouldn't that break K and older?

lib/sanitizer_common/sanitizer_platform_limits_posix.h
325

this comment is redundant

409

Is this meaning to be 32 or 64 bits? Maybe make it clearer (by using uintNN_t) in the structure above would remove the need to duplicate it?

lib/tsan/rtl/tsan_interceptors.cc
115

This is *really* the wrong place for this...

lib/tsan/rtl/tsan_platform.h
16–17

oops

144

Please, rebase your patch on top of Adhemerval's and you'll get this right.

danalbert added inline comments.Jul 27 2015, 10:21 PM
lib/sanitizer_common/sanitizer_linux.cc
911 ↗(On Diff #30754)

It's actually not an API 21+ thing, but an LP64 thing. The system properties API never should have been exposed, but our time machine isn't working yet, so all we could do was hide them in the new ABIs.

We shouldn't be using this at all, and AFAICT we don't need it either (I didn't even know it was an option). Using $ASAN_OPTIONS has been sufficient for us so far.

zatrazz added inline comments.Jul 29 2015, 7:26 AM
lib/sanitizer_common/sanitizer_linux.cc
878 ↗(On Diff #30754)

This version you are working with seems old, the MIPS clone has been reimplemented in assembly 86d8101714e7e8191a50189dc6c9d2001df2e930 (06/05). I would suggest you to rebase against master. Also, my port to aarch64/linux have the assembly implementation that seems to be working (I am not sure if it would be correct approach to Android).

lib/sanitizer_common/sanitizer_stoptheworld_linux_libcdep.cc
17 ↗(On Diff #30754)

Split to fit on 80 lines.

490 ↗(On Diff #30754)

I would prefer the to not change semantics for other arches and let now only aarch64 to use PTRACE_GETREGSET. The other arches maintainers could check if it would be desirable to change as well.

Also, for linux/aarch64 afaik the PTRACE_GETREGSET should be used with iovec interface:

struct iovec regset_io;
regset_io.iov_base = &regs;
regset_io.iov_len = sizeof(regs_struct);
bool isErr = internal_iserror(internal_ptrace(PTRACE_GETREGSET, tid,
                              (void*)NT_PRSTATUS, (void*)&regset_io),
                              &pterrno);

Does this change works for linux/aarch64 besides Android?

lib/tsan/rtl/tsan_aarch64_tls_workaround.cc
27

'This' refers to the cur_thread_placeholder field (as the code suggest) or the how TSanTLSHack struct?

71

Since it will be called by each TLS access and prime division tend to be slow on modern hardware, I would suggest to change to a multiplicative hash instead (like Knuth's Multiplicative).

lib/tsan/rtl/tsan_interceptors.cc
16

What is the idea of this counter? It is initialized and incremented, but nothing else.

lib/tsan/rtl/tsan_interface.cc
36

Both '__tsan_init_count' initialization and increment should be inside SANITIZER_TLS_WORKAROUND_NEEDED.

lib/tsan/rtl/tsan_platform.h
179

Do we really need a special mapping for android/aarch64? I could get a 39-bit VMA mapping for linux/aarch64 without special handling.

zatrazz added inline comments.Jul 29 2015, 10:00 AM
lib/tsan/rtl/tsan_aarch64_tls_workaround.cc
71

My comment does not really apply here since the prime number is constant (and the compiler will optimize it out using montgomery modular multiplication).

jasonk added a comment.Aug 3 2015, 2:57 PM

Thanks for the review guys. Updating to a recent tip broke something in TSAN.
Fixing now....
Sigh...

Once I get past this, I will rebase on top of http://reviews.llvm.org/D11484 and re-upload.

My svn revisions currently are
243502 llvm
243489 llvm/tools/clang
243494 llvm/projects/compiler-rt/

lib/tsan/rtl/tsan_platform.h
179

Well, ASLR seems to be turned off (forced?) in the recent nexus9 kernels. Therefore the existing address math does not work :-(

eugenis added inline comments.Aug 3 2015, 4:17 PM
lib/tsan/rtl/tsan_platform.h
179

This looks like the behavior described in this issue:
https://llvm.org/bugs/show_bug.cgi?id=24155
Is it still random?
On x86_64 it only affects MSan, but on AArch64 with smaller address space, I guess TSan mapping will have to altered, too.

If at all possible, we should try and come up with one mapping that works in both cases, otherwise a kernel upgrade on Linux would break TSan programs.

rengolin edited edge metadata.Aug 14 2015, 2:12 PM

Hi Jason,

Can you reply to the many questions and comments about many parts of the patch as well? Just rebasing but keeping the code unchanged won't help much in getting the patch in good shape.

cheers,
--renato

I just uploaded the first "limping along" version to http://reviews.llvm.org/D11532
It has been rebased on top of http://reviews.llvm.org/D11484

There are some 60 tests passing, with around 130 marked as UNSUPPORTED for aarch64-android
Each of those unsupported tests either fail outright (possibly differently from x86_64-linux), or are flaky.

Please take a look.

I suppose next steps are:
0. code review (please :-)

  1. visit google MTV to square away the required AOSP bionic patch https://android-review.googlesource.com/#/c/162104/
  2. also visit google MTV to discuss adding a new aarch64-linux buildbot labs.llvm.org as well as some oddball stuff I noticed in the existing TSAN tests.

Thanks for all your help!

jasonk marked 8 inline comments as done and an inline comment as not done.Aug 14 2015, 2:43 PM
jasonk added inline comments.
cmake/config-ix.cmake
428–438

guarded with new CMAKE flah COMPILER_RT_FORCE_TSAN_AARCH64

lib/sanitizer_common/sanitizer_linux.cc
989 ↗(On Diff #32187)

newer compiler-rt already takes care of this.

lib/sanitizer_common/sanitizer_platform_limits_posix.h
409

Its for both

lib/tsan/rtl/tsan_aarch64_tls_workaround.cc
28

This struct is used to hold all of the TLS variables used in TSAN.

The cur_thread_placeholder is used originally as a plain C struct that contains the ThreadState TLS var. I just carried that notion along.

lib/tsan/rtl/tsan_interceptors.cc
115

Err. I don't know of a better place to put it.

In order to use REAL(...) I need to DECLARE_REAL()..
in the same file.

lib/tsan/rtl/tsan_platform.h
179

Not sure if https://llvm.org/bugs/show_bug.cgi?id=24155 applies.

ASLR is apparently turned OFF, as in text/data wind up with 0x55 high byte regardless of whether aslr is turned on in /proc switches...

jasonk updated this revision to Diff 32189.Aug 14 2015, 2:44 PM
jasonk marked an inline comment as done and an inline comment as not done.
jasonk edited edge metadata.

some more code review nits fixed.

Wierd thing with phabricator -- my updated commit message seem to have gotten lost.
(!)

jasonk marked an inline comment as done.Aug 14 2015, 2:46 PM
jasonk added inline comments.
lib/sanitizer_common/sanitizer_linux.cc
871 ↗(On Diff #32189)

done

Hi Jason,

This is too huge to review as single chunk. Please split it into smaller pieces.
E.g. all test changes can go in separately.
lib/sanitizer_common/sanitizer_linux.cc changes can be factored into separate "support internal_clone on aarch64".
sanitizer_stoptheworld_linux_libcdep.cc changes into separate "support stoptheworld on aarch64"
lib/sanitizer_common/sanitizer_platform_limits_posix.h changes into separate "provide definitions of aarch64 kernel structures"
etc

I will try to do one pass over this change as is now.

dvyukov added inline comments.Aug 17 2015, 8:22 AM
lib/interception/interception_linux.cc
24

Please do this only #ifdef ANDROID and only for known functions (i.e. internal_strcmp(func_name, "foo") == 0).
It's better to keep such hacks in a very limited, understandable scope. Otherwise it can lead to bad failure modes for other platforms/functions.

lib/sanitizer_common/sanitizer_libc.h
77 ↗(On Diff #32189)

not used, remove

lib/sanitizer_common/sanitizer_platform_limits_posix.h
740

"uptr _size" should work for both 64- and 32-bit mode, isn't it?

lib/tsan/rtl/tsan_aarch64_tls_workaround.cc
20

Place this all into __tsan namespace.

29

Constants must be prefixed with "k"

47

Both TLSHack and Count are globals, they are zero initialized already. Remove this.

53

Call this from Initialize and then you don't need TSanTLSWorkAroundHasInited check. Initialize is executed exactly once.

54

Can you move this into InitializeInterceptors where similar setup is done already?

lib/tsan/rtl/tsan_aarch64_tls_workaround.h
30

Please don't prefix names in __tsan namespace with TSan, they are all TSan-something. It is a pointless prefix that introduces clutter without any benefit.
The same for the other names here.
Also remove extern "C", it is not needed inside of TSan namespace.

33

Looks like debugging leftovers, please remove.

lib/tsan/rtl/tsan_interceptors.cc
101–102

I see that you changed pthread_yield calls to internal_sched_yield below. So I guess you can now just remove this declaration.

121

Can you move this to tsan_interceptors.cc where freebsd handles this?
That would fix both tsan_interceptors.cc and tsan_new_delete.cc

176

What's wrong with errno name?

1505

spaces around &&

1517

spaces around &&

lib/tsan/rtl/tsan_interceptors.h
25

I guess you added this because the existing debug print in ScopedInterceptor::ScopedInterceptor is not always executed.
If you want to commit this, please prefix it with thread id like all other debug output; change INTERCEPT to intercept and, I think, you can change it to DPrint like the current interceptor output. DPrintf2 is used mainly in memory access interceptors as they are executed _very_ frequently.

lib/tsan/rtl/tsan_interface.cc
33

This should go into Initialize function. It is called not only from __tsan_init.

39

This better go into tsan_interceptors.cc.
tsan_interceptors.cc contains system-dependent glues, while this file is more or less platform-independent and "portable".

lib/tsan/rtl/tsan_rtl.cc
349

spaces around &&

jasonk updated this revision to Diff 32605.Aug 19 2015, 1:40 PM
jasonk marked 15 inline comments as done.

Should address all of dvyukov's feedback.
This patch still needs to be split up.

Okay, just uploaded a new version (pre-split).
It should address all of the feedback from dvyukov.

lib/tsan/rtl/tsan_aarch64_tls_workaround.cc
53

Actually, it turns out that Initialize() is called multiple times (!)
because __tsan_init() is also called twice.

But its a no-op because as you correctly pointed out, I don't need to memset a static var :-)

54

It turns out that my bionic patch makes this unnecessary.
removed.

lib/tsan/rtl/tsan_interceptors.cc
121

It almost works.

I can't find a nice way to move the DECLARE_REAL(...) into tsan_interceptors.h (you probably meant the header file, not the .cc, which is *this* file :-)

The #defines also is headachy to move to the header file.
Sorry :-(

176

causes a compiler error.

lib/tsan/rtl/tsan_interface.cc
33

most of this is gone -- only thing I left behind was the init count and Dprintf()

jasonk updated this revision to Diff 32639.Aug 19 2015, 4:22 PM

Updated to r245453

dvyukov added inline comments.Aug 20 2015, 12:47 AM
lib/tsan/rtl/tsan_aarch64_tls_workaround.cc
53

Initialize already has a guard against multiple calls (it is intended to be called multiple times).
So if you need to do any init, do it in Initialize after the guard check.

lib/tsan/rtl/tsan_interceptors.cc
176

Do you mean that <errno.h> is transitively included on android and that header provides own definition of errno?

How is it get included? This file does not include any system headers (for historical reasons).

Jason,

This diff still has all the other changes that you have split. Please, remove the extra stuff and leave just the delta.

Also, I though we'd agreed to base the implementation on Joerg's TLS emulation layer in RT (D12001), and this patch still mentions (and uses) the local TLS work around. Please, rebase onto Joerg's work in progress.

The TSAN for AArch64 is on, and works well on machines with 39-bits VMA, but it's still disabled because it failed the buildbot (which I believe is 42-bits). Adhemerval is getting 42-bits support upstream soon, and we'll re-enable TSAN on AArch64 (since it worked on our 42-bits machine, too).

If all goes well, you'll also be able to remove some of the logic you need for AArch64 enablement, and only leave the Android specific stuff.

cheers,
--renato

rengolin requested changes to this revision.Aug 20 2015, 6:50 AM
rengolin edited edge metadata.
This revision now requires changes to proceed.Aug 20 2015, 6:50 AM
chh added a subscriber: chh.Aug 20 2015, 9:13 AM

Jason,

This diff still has all the other changes that you have split. Please, remove the extra stuff and leave just the delta.

Sorry, I don't know what you mean.

I have already split the patches out. This is hanging around just for reference. I am not expecting this patch to go in as is.

Also, I though we'd agreed to base the implementation on Joerg's TLS emulation layer in RT (D12001), and this patch still mentions (and uses) the local TLS work around. Please, rebase onto Joerg's work in progress.

The TLS emulation is wholly inappropriate for TSAN, as it uses calls that are themselves intercepted by TSAN. EVERY memory access, and every API call made in the user app (including calls to pthread mutex lock, malloc, ett al) REQUIRES TLS vars to already exist. There is a chicken-vs-egg that can't be resolved here.

Sorry, but for now at least, it doesn't look like emutls is appropriate for TSAN.

The TSAN for AArch64 is on, and works well on machines with 39-bits VMA, but it's still disabled because it failed the buildbot (which I believe is 42-bits). Adhemerval is getting 42-bits support upstream soon, and we'll re-enable TSAN on AArch64 (since it worked on our 42-bits machine, too).

Huh. I was under the impression that aarch64 kernel being limited to 39 bits was primarily for performance reasons. For 42 bits, every memory access needs to access full 3 levels of in-memory tables (as opposed to two levels for 39 bits), which would, approximately, add 33% overhead for each memory access???

Or at least that's what I was told by a local android kernel guy

I guess we'll see.

If all goes well, you'll also be able to remove some of the logic you need for AArch64 enablement, and only leave the Android specific stuff.

Errr. okay.

I am not sure what you mean, but I'll do what it takes to get this (not this patch specifically, but the work in TSAN+android+aarch64 in general) integrated.

I have already split the patches out. This is hanging around just for reference. I am not expecting this patch to go in as is.

Right, than it's better to "abandon" this change. It'll still be here for people to look at, but the intent will be clear.

The TLS emulation is wholly inappropriate for TSAN, as it uses calls that are themselves intercepted by TSAN. EVERY memory access, and every API call made in the user app (including calls to pthread mutex lock, malloc, ett al) REQUIRES TLS vars to already exist. There is a chicken-vs-egg that can't be resolved here.

That's not what the agreement in the thread was. I'm not an expert in TLS, so I won't opine, but you should get consensus in the LLVM list before proposing a patch of this magnitude. Discussing on the sanitizer's list is not enough, as most of us don't participate in that list, so you won't hit the people that can really help you.

If the sanitizer sources were completely independent from Compiler-RT, I understand, but they're not, so you should get consensus here.

Huh. I was under the impression that aarch64 kernel being limited to 39 bits was primarily for performance reasons. For 42 bits, every memory access needs to access full 3 levels of in-memory tables (as opposed to two levels for 39 bits), which would, approximately, add 33% overhead for each memory access???

AArch64 can use 39, 42 and 48 bits VMA. For now, it was limited to 39, but some distributions are enabling 42-bits (like Fedora), and it seems that it will be a kernel default to move to 48 bits pretty soon. We have to work with all scenarios.

Or at least that's what I was told by a local android kernel guy

Android is a closed ecosystem, and the decisions are taken internally and may not be in line with the mainline kernel, Linux or BSD distributions, etc. You can't assume that the way Android works on AArch64 is AArch64's behaviour.

I am not sure what you mean, but I'll do what it takes to get this (not this patch specifically, but the work in TSAN+android+aarch64 in general) integrated.

I agree. Focus on the functionality, not an implementation. I guess it's a shame that the sanitizer discussion is happening in a different list, especially when it gets so down to the hardware behaviour. We can't assume all the hardware folks will be in all the lists, so if you want to discuss about Android/TSAN on hardware A, you should at least copy some of the folks from that community.

If I was copied, I'd have brought the discussion to the LLVM list a lot sooner, and we'd have had a lot less disagreements.

At least now we have two fast buildbots on AArch64, one on 39bits and one on 42bits, so we can be more confident once the patches start landing.

Let's more this discussion to the list, and to the specific patches that spawned, and let's abandon this one.

cheers,
--renato

jasonk edited edge metadata.Oct 21 2015, 2:49 PM
jasonk added a reviewer: adasgupt.
jasonk added a subscriber: adasgupt.
rengolin commandeered this revision.Jun 27 2016, 6:43 AM
rengolin abandoned this revision.
rengolin edited reviewers, added: jasonk; removed: rengolin.