This is an archive of the discontinued LLVM Phabricator instance.

[TSan] Optimize handling of racy address
ClosedPublic

Authored by protze.joachim on Jul 11 2020, 8:10 AM.

Details

Summary

This patch splits the handling of racy address and racy stack into separate functions. If a race was already reported for the address, we can avoid the cost for collecting the involved stacks.

This patch also removes the race condition in storing the racy address / racy stack. This race allows all threads to report the race. Because all threads get the read lock first, it is quite probable that they all finish the lookup before one thread gets the chance to aquire the write lock.

For certain data race patterns in OpenMP programs, this patch significantly reduces the execution time. As an example the execution times for below code:
master (report_bugs=1): real 0m24s
master (report_bugs=0): real 0m0.2s
patch (report_bugs=1): real 0m0.5s
patch (report_bugs=0): real 0m0.2s

// RUN: clang -fopenmp -fsanitize=thread %s -o %t; env TSAN_OPTIONS="ignore_noninstrumented_modules=1" %t
#include <stdio.h>
int main(void)
{
  long sum=0;
  #pragma omp parallel num_threads(4) //reduction(+:sum)
  for(int i=0; i<1000000; i++)
  {
    sum++;
  } 
  printf("Sum: %ld\n",sum);
}

Diff Detail

Event Timeline

protze.joachim created this revision.Jul 11 2020, 8:10 AM
Herald added subscribers: Restricted Project, sstefan1. · View Herald Transcript
protze.joachim edited the summary of this revision. (Show Details)Jul 11 2020, 8:12 AM

Hi Joachim,

I think this is useful.
As far a I see this changes behavior a bit. Previously if we match existing racy addrs, we still memorize racy stacks. So reports can be suppressed transitively: second report matches addr, but different stacks; then third report matches stacks with the second one, but a new address; and so on. Now we will match address and bail out.
But I think it's OK. I don't remember now why that transitive behavior was important, and if it was at all or not.

Now looking at the code I am confused about AddRacyStacks. Shouldn't we have both address and stacks already memorized at that point? Maybe we should remove it while we are here.

compiler-rt/lib/tsan/rtl/tsan_rtl_report.cpp
444

Please add a helper function for this check b/c it's now duplicated twice.
The same for the address check.

I moved the loops into functions.
I think for the previous controlflow, AddRacyStacks was necessary because of the early return false in line 475. I removed it, as the functions now add the entry directly to the vector.

protze.joachim added a comment.EditedJul 14 2020, 6:01 AM

I agree, that the patch changes the transitive suppression of reports. Unfortunately, this would bring back the overhead of reconstructing the remote stack trace.

As a follow-up patch I was thinking about adding a suppress_equal_location flag.
I would only compare the local stack trace with previously reported stack traces in racy_stacks.
Initially I was thinking about suppress_equal_pc, but especially for wrapper functions, the top pc is function pointer of the wrapper.

Also, I plan to add a suppress_max_stack_depth=n flag. The HandleRacyStacks function will only compare the top n frames in this case.
For race in recursive algorithms, this will allow to remove duplicates with different recursion depth. (For OpenMP, this specifically helps for tasking programs)

I added a missing a return statement

protze.joachim marked an inline comment as done.Jul 14 2020, 6:09 AM

After a pull from master, all tests were successful with this patch.

dvyukov accepted this revision.Jul 15 2020, 3:40 AM

Looks good to me.
You have access, right? If so, please land it.

This revision is now accepted and ready to land.Jul 15 2020, 3:40 AM

I agree, that the patch changes the transitive suppression of reports. Unfortunately, this would bring back the overhead of reconstructing the remote stack trace.

As a follow-up patch I was thinking about adding a suppress_equal_location flag.
I would only compare the local stack trace with previously reported stack traces in racy_stacks.
Initially I was thinking about suppress_equal_pc, but especially for wrapper functions, the top pc is function pointer of the wrapper.

Right, it can do more harm.

Also, I plan to add a suppress_max_stack_depth=n flag. The HandleRacyStacks function will only compare the top n frames in this case.
For race in recursive algorithms, this will allow to remove duplicates with different recursion depth. (For OpenMP, this specifically helps for tasking programs)

What will be the default? What is the useful value for OpenMP programs?

I agree, that the patch changes the transitive suppression of reports. Unfortunately, this would bring back the overhead of reconstructing the remote stack trace.

As a follow-up patch I was thinking about adding a suppress_equal_location flag.
I would only compare the local stack trace with previously reported stack traces in racy_stacks.
Initially I was thinking about suppress_equal_pc, but especially for wrapper functions, the top pc is function pointer of the wrapper.

Right, it can do more harm.

What do you think about comparing the local stack against the two hashes in racy_stacks?
I would propably have this defaulting to false unless we find it convenient for the typical usecase.

Also, I plan to add a suppress_max_stack_depth=n flag. The HandleRacyStacks function will only compare the top n frames in this case.
For race in recursive algorithms, this will allow to remove duplicates with different recursion depth. (For OpenMP, this specifically helps for tasking programs)

What will be the default? What is the useful value for OpenMP programs?

My suggestion for the default is infinite / MAX_INT, i.e. not change existing behavior.

Is there a runtime interface to adjust the flags? This would allow libarcher to tweak this flag during initialization, and I could also choose different defaults for other flags.
I always suggest ignore_noninstrumented_modules=1 for OpenMP and Fortran codes to suppress false reports from libomp and libgfortran.

This revision was automatically updated to reflect the committed changes.

Right, it can do more harm.

What do you think about comparing the local stack against the two hashes in racy_stacks?
I would propably have this defaulting to false unless we find it convenient for the typical usecase.

I would avoid doing this right now because the use case for this is moot, and this is additional code, code reviews and slowdown.

Also, I plan to add a suppress_max_stack_depth=n flag. The HandleRacyStacks function will only compare the top n frames in this case.
For race in recursive algorithms, this will allow to remove duplicates with different recursion depth. (For OpenMP, this specifically helps for tasking programs)

What will be the default? What is the useful value for OpenMP programs?

My suggestion for the default is infinite / MAX_INT, i.e. not change existing behavior.

And what do you plan to use/suggest for openmp? I am asking because if it's, say, 50, then we could probably simply make it the default and no flag at all. But if it's 5, then probably we can't.

Is there a runtime interface to adjust the flags? This would allow libarcher to tweak this flag during initialization, and I could also choose different defaults for other flags.
I always suggest ignore_noninstrumented_modules=1 for OpenMP and Fortran codes to suppress false reports from libomp and libgfortran.

No, I don't think there such interface.
There is something like __default_options global var. It's supposed for end program (b/c there can be only 1). It also needs to override the weak definition we have in the runtime.
You may try. If we are super lucky then it may work and end program may even be able to override it again. But it all will depend on how the openmp library is linked.

I reverted the commit, because all buildbots failed for these tests:

TEST 'ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.RaceWithOffset' FAILED
TEST 'ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.RaceWithOffset2' FAILED

As I understand the issue, the two tests trigger data race on different memory locations, but only the first is reported and all following are suppressed for the same stack.
I updated the tests, so that the PC of the "memory access" is in the stacktrace.
I have no idea why this did not fail before.

I think, the better solution is to use the _pc version of the access functions to use the "real" access location.

No idea why it broke now. THe change to tsan_test_util_posix.cpp looks fine to me.

hans added a subscriber: hans.Jul 20 2020, 4:43 AM

Joachim suggested cherry-picking this (https://reviews.llvm.org/rG7358a1104a02) to the 11.x branch. Dmitry, what do you think?

Joachim suggested cherry-picking this (https://reviews.llvm.org/rG7358a1104a02) to the 11.x branch. Dmitry, what do you think?

I dunno, never dealt with clang backports (continuous releases rulez).

I personally don't mind. We have some tests, so it should not break things completely. But I don't know what risks are involved.

hans added a comment.Jul 20 2020, 6:06 AM

I personally don't mind. We have some tests, so it should not break things completely. But I don't know what risks are involved.

We're early in the process, so I think the risk is low. Joachim, please let me know if there are any problems or follow-ups to this patch.

Pushed to 11.x as 96313d2de45ace49d40606dda71f03396f13ddef.