This is an archive of the discontinued LLVM Phabricator instance.

tsan: make obtaining current PC faster
ClosedPublic

Authored by dvyukov on Jul 15 2021, 2:05 AM.

Details

Summary

We obtain the current PC is all interceptors and collectively
common interceptor code contributes to overall slowdown
(in particular cheaper str/mem* functions).

The current way to obtain the current PC involves:

4493e1:       e8 3a f3 fe ff          callq  438720 <_ZN11__sanitizer10StackTrace12GetCurrentPcEv>
4493e9:       48 89 c6                mov    %rax,%rsi

and the called function is:

uptr StackTrace::GetCurrentPc() {

438720:       48 8b 04 24             mov    (%rsp),%rax
438724:       c3                      retq

The new way uses address of a local label and involves just:

44a888:       48 8d 35 fa ff ff ff    lea    -0x6(%rip),%rsi

I am not switching all uses of StackTrace::GetCurrentPc to GET_CURRENT_PC
because it may lead some differences in produced reports and break tests.
The difference comes from the fact that currently we have PC pointing
to the CALL instruction, but the new way does not yield any code on its own
so the PC points to a random instruction in the function and symbolizing
that instruction can produce additional inlined frames (if the random
instruction happen to relate to some inlined function).

Diff Detail

Event Timeline

dvyukov created this revision.Jul 15 2021, 2:05 AM
dvyukov requested review of this revision.Jul 15 2021, 2:05 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 15 2021, 2:05 AM
Herald added a subscriber: Restricted Project. · View Herald Transcript
melver accepted this revision.Jul 15 2021, 5:56 AM
melver added inline comments.
compiler-rt/lib/sanitizer_common/sanitizer_stacktrace.h
199

Previously the PC was obtained by a call and we sort of were guaranteed that the compiler wouldn't move code around.

Should this now include compiler barriers before/after?

This revision is now accepted and ready to land.Jul 15 2021, 5:56 AM
dvyukov added inline comments.Jul 15 2021, 8:00 AM
compiler-rt/lib/sanitizer_common/sanitizer_stacktrace.h
199

I am not sure what a compiler barrier will give us here. It can pessimize codegen and the PC still points to a random instruction. So I would wait until we have a reason to add the barrier.
And in the end compiler can move code around calls as well.

This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Jul 15 2021, 9:19 AM

This seems to break building on arm macs: http://45.33.8.238/macm1/14037/step_4.txt

Please take a look.

Looking. Thanks for the report.

Looks like it's Mac builds on general, my Intel Mac is just slow to cycle: http://45.33.8.238/mac/33533/step_4.txt

The tree's been broken for over 6h by now. Can we revert for now? (Or, ideally, like 5h ago?)

dvyukov reopened this revision.Jul 16 2021, 9:40 AM
This revision is now accepted and ready to land.Jul 16 2021, 9:40 AM
dvyukov updated this revision to Diff 359370.Jul 16 2021, 9:56 AM

Rework the change after numerious bot breakages and reverts.
As some bot breakages shown new inlined frames can also affect
stack printing in sanitizer_common (for some malloc-related reports),
so removing the inlined frames only in tsan does not work.
Also relying on hardcoded function names is fragile.
Additionally with reduced debug info compiler may strip namespaces
and class names from debug info, so that e.g.
"__tsan::ScopedInterceptor::Foo" becomes just "Foo". Stripping such
frames is especially problematic and can collide with user frames.
I've also tried a different approach: for some stacks throw away
inlined functions for the top frame. This does not rely on matching
function names. It almost worked except that for tsan racing stacks
we don't really know if top PC is a user PC (need inline frames)
or it comes from an str/mem interceptor (don't need inline frames).
The current version gives up on introducing no code and adds a NOP,
which becomes the carrier of the proper debug info. This makes
everything much simpler and hopefully more reliable. But unfortunately
it makes the code arch-dependent, so for now it's x86-only.

Benchmarking shows it's still ~4x faster:

#include <stdio.h>

attribute((noinline)) unsigned long get_current_pc1() {
return (unsigned long)__builtin_return_address(0);
}

#define get_current_pc2() ({ here: asm("nop"); ((unsigned long) && here) + 1; })

int main(int argc, char** argv) {
const unsigned long iters = 1e10;
volatile unsigned long pc = 0;
if (argc == 1) {

		for (unsigned long i = 0; i < iters; i++)
			pc = get_current_pc1();

} else {

		for (unsigned long i = 0; i < iters; i++)
			pc = get_current_pc2();

}
}

pc1: 10.38
pc2: 2.58
pc1: 10.87
pc2: 2.56
pc1: 10.99
pc2: 2.59

vitalybuka accepted this revision.Jul 16 2021, 11:20 AM
pcc added a subscriber: pcc.Jul 16 2021, 11:29 AM
pcc added inline comments.
compiler-rt/lib/sanitizer_common/sanitizer_stacktrace.h
209

Maybe since this has to be arch-dependent anyway, instead of the NOP, you could put the LEA instruction here?

dvyukov updated this revision to Diff 359598.Jul 17 2021, 11:40 PM

Switch to using LEA RIP directly.

dvyukov added inline comments.Jul 17 2021, 11:52 PM
compiler-rt/lib/sanitizer_common/sanitizer_stacktrace.h
209

Right!
If we are x86-only and use asm anyway, that's reasonable.
This removes nop byte and avoids pedantic warnings re taking address of a label.
Unfortunately RIP is not supported in MOV instructions (?) (we need 0 offset anyway), so we still inject 4 byte offset.
Also, an asm block can be inlined now, so we could use this to simply speed up StackTrace::GetCurrentPc() for everybody. But StackTrace::GetCurrentPc() would appear on top of most stacks, and again unfortunately attribute artificial does not do what it claims to do. Well, maybe it marks the code with some fancy artificial attribute in later versions of DWARF, but it does not help, I still see it in our stacks. Maybe we could improve llvm-symbolizer to ignore artificial functions, we there does not seem to be a way to do this with addr2line symbolizer.
There is also attribute((no_debug)) (which I learnt about from D43259), but it's clang-only. I am also not sure if we it will work reliably, or it's merely a coincidence that it works. If we simply leave an instruction w/o debug info, I think some DWARF users will use line info from the previous instructions, which is not necessary the caller function, but may be some other inlined function in the caller.
So for now I stayed with a macro.

And now the test started failing because 2 subsequent GET_CURRENT_PC() invocations started to yield the same PC. Compiler figured out that asm blocks can be merged. I guess we could avoid that with volatile asm, but I think it's fine, we don't need specifically different PCs.

PTAL

melver accepted this revision.Jul 19 2021, 2:48 AM

Given this is x86-only now, hopefully there won't be any new surprises.

This revision was landed with ongoing or failed builds.Jul 19 2021, 4:05 AM
This revision was automatically updated to reflect the committed changes.