This is an archive of the discontinued LLVM Phabricator instance.

tsan: Add new annotation functions to promote all accesses to atomic
Needs Review · Public

Authored by protze.joachim on Aug 13 2021, 11:33 AM.

Details

Summary

The idea of the new annotation functions is to work very similarly to AnnotateIgnoreWritesBegin/End.
The difference is that the memory accesses are not ignored; instead, they are treated like atomic accesses.
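
A minimal usage sketch (the function names AnnotateAtomicAccessesBegin/End are placeholders modeled on the existing annotation interface; the patch's actual entry points may differ):

// Placeholder declarations following the AnnotateIgnoreWritesBegin/End shape.
extern "C" void AnnotateAtomicAccessesBegin(const char *file, int line);
extern "C" void AnnotateAtomicAccessesEnd(const char *file, int line);

void reduce_into(long *target, long contribution) {
  AnnotateAtomicAccessesBegin(__FILE__, __LINE__);
  // Accesses in this region are not ignored; tsan treats them as atomic.
  *target += contribution;
  AnnotateAtomicAccessesEnd(__FILE__, __LINE__);
}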

I want to use the new annotation to improve the modeling of OpenMP reduction semantics. With this patch, I can detect the data race between the increment of var by the master thread and the other threads performing the reduction.
Without this patch, I miss the data race, because the current strategy for OpenMP reductions is to ignore the accesses performed by the OpenMP runtime library (see the changes in ompt-tsan.cpp).
The presence of the reduction does not introduce any synchronization other than ensuring that the consistently reduced result is available at the next synchronization point.

Does this patch make sense to you, or do you think it goes in a completely wrong direction?
I must admit that this patch as it is causes a ~10% runtime increase for one of my benchmarks that doesn't even use this annotation. If the patch is OK in general, we can optimize the performance.

I can split the OpenMP-specific changes out into a separate patch.

Diff Detail

Event Timeline

protze.joachim requested review of this revision. Aug 13 2021, 11:33 AM
Herald added a project: Restricted Project.
Herald added subscribers: openmp-commits, Restricted Project, sstefan1.

I must admit that this patch as it is causes a ~10% runtime increase for one of my benchmarks that doesn't even use this annotation. If the patch is OK in general, we can optimize the performance.

This is quite unfortunate. And it seems that the slowdown is inherent to this approach. Or do you have some ideas for how to remove the overhead from the hot path?

What I wonder is: OpenMP should compile reductions (all? most?) to actual atomic operations, no? Or it should insert some kind of mutex lock/unlock around the reduction code.
Either way, if we just expose the actual situation to tsan, it should work the way you want.
This side-channel turning of non-atomic accesses into atomics looks strange. If the actual accesses are atomic, then we should just not lie to tsan that they are non-atomic in the first place. And if the actual accesses are non-atomic, then we are masking real bugs.
What am I missing?

protze.joachim added a comment (edited). Aug 13 2021, 12:15 PM

I must admit that this patch as it is causes a ~10% runtime increase for one of my benchmarks that doesn't even use this annotation. If the patch is OK in general, we can optimize the performance.

This is quite unfortunate. And it seems that the slowdown is inherent to this approach. Or do you have some ideas for how to remove the overhead from the hot path?

I just realized that I did not really compare against the same base, but rather used a build from 8/3 for comparison. After rebuilding my base version from the same main commit, the performance is consistent.
This also means that within the last 10 days a 10% performance regression was introduced into main.

I'll run some bisecting to find the commit.

What I wonder is: OpenMP should compile reductions (all? most?) to actual atomic operations, no? Or it should insert some kind of mutex lock/unlock around the reduction code.
Either way, if we just expose the actual situation to tsan, it should work the way you want.
This side-channel turning of non-atomic accesses into atomics looks strange. If the actual accesses are atomic, then we should just not lie to tsan that they are non-atomic in the first place. And if the actual accesses are non-atomic, then we are masking real bugs.
What am I missing?

The OpenMP runtime / compiler codegen use different strategies to implement the reduction. In some cases atomics are actually used; in other cases, the reduction is implemented as part of a barrier. Wearing my OpenMP application developer hat, I trust the OpenMP implementation. From this perspective, casting the memory accesses that implement the reduction to atomic makes sense to me.
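
For context, a minimal OpenMP reduction; at this level it is invisible whether the runtime combines the partial results with atomics or inside the barrier:

#include <cstdio>

int main() {
  long sum = 0;
  // Each thread accumulates into a private copy of sum; the runtime
  // combines the copies with atomics or as part of the closing barrier.
#pragma omp parallel for reduction(+ : sum)
  for (int i = 0; i < 1000; ++i)
    sum += i;
  std::printf("%ld\n", sum);  // always 499500, regardless of strategy
}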

Semantically, I could also annotate Mutex semantics. But from my understanding, this introduces HappensBefore edges into the analysis where I don't want to have them. They would hide even more data races than the ones I am trying to expose.

I successfully removed the performance regression from the patch.

I couldn't reproduce the performance regression on main that I had seen in intermediate tests.

Going back to the branch instead of the multiplication doesn't seem to have a performance impact.

Semantically, I could also annotate Mutex semantics. But from my understanding, this introduces HappensBefore edges into the analysis where I don't want to have them. They would hide even more data races than the ones I am trying to expose.

I see. This makes sense.

Re performance: what matters more now is the performance of the new runtime that is slowly being upstreamed in pieces:
https://github.com/dvyukov/llvm-project/pull/3

This change itself can be ported to the new runtime, but it still adds instructions to the fast paths. More importantly, it turns what is currently a compile-time constant into a runtime value (we rely heavily on inlining, so the compiler statically knows kAtomic to be 0; any branches are eliminated and any derived values are known). I think it will have an even more negative impact on the new runtime, where we also trace the atomic flag:

ev->isAtomic = !!(typ & kAccessAtomic);

and use it to create vector consts:

const m128 access_read_atomic = _mm_set1_epi32((typ & (kAccessRead | kAccessAtomic)) << 30);

If kAccessAtomic is not a constant, the compiler will need to create several globals and load one or the other. Also, for non-atomic writes this value is 0, so the compiler can simply clear the register.
I am also afraid of the long-term effect. Say we add "if (kAtomic)" (some slow path for atomics only, which does not affect normal accesses at all); with this change that becomes problematic (not only does it add a branch, it also affects register allocation, code layout, etc.).
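
An illustration of that point (not the actual tsan sources): with kAtomic as a compile-time parameter, the branch below folds away entirely; as a runtime flag, every access pays for the test and the derived values stop being constants.

using uptr = unsigned long;

void HandleFastPath(uptr addr);        // stand-ins for the real handlers
void HandleAtomicSlowPath(uptr addr);

template <bool kAtomic>
void MemoryAccess(uptr addr) {
  if (kAtomic) {  // dead code when instantiated with kAtomic == false
    HandleAtomicSlowPath(addr);
    return;
  }
  HandleFastPath(addr);
}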

So I would like to consider the alternatives as much as possible first.

I see that OpenMP reductions are always tied to a single variable (?):
https://www.openmp.org/spec-html/5.0/openmpsu107.html
If at least the modified variables are known, then I think it's possible to do the following: ignore all reduction accesses as is done now, plus emit additional (fake) tsan atomic store annotations for the target variables/memory locations.
I think it should have the same effect without the long-term tsan runtime tax. Unfortunately, the tsan atomic hooks do the actual atomic operation as well, but for starters you can try __atomic_fetch_add(0).
Will this work? Am I missing anything?
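
A sketch of that suggestion (the wrapper and variable names are made up for illustration): the real reduction writes stay ignored, and a no-op atomic read-modify-write afterwards makes tsan record an atomic access to the reduced location:

#include <cstdint>

// Fetch-add of 0 leaves the value unchanged, but under tsan
// instrumentation it is handled as an atomic access to *var.
// Relaxed ordering avoids introducing extra happens-before edges.
void mark_reduced(std::int64_t *var) {
  __atomic_fetch_add(var, 0, __ATOMIC_RELAXED);
}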

Semantically, I could also annotate Mutex semantics. But from my understanding, this introduces HappensBefore edges into the analysis where I don't want to have them. They would hide even more data races than the ones I am trying to expose.

I see. This makes sense.

Re performance: what matters more now is the performance of the new runtime that is slowly being upstreamed in pieces:
https://github.com/dvyukov/llvm-project/pull/3

This is interesting. I'm curious to see whether the new runtime is similarly sensitive to contended shared read accesses by the application.
With the current runtime, I observed 100-1000x runtime overhead if all threads on a node concurrently read the same data (like the vector in a vector-matrix multiplication).
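
The pattern in question, for concreteness (an assumed shape of the workload, not the author's actual benchmark):

// Every thread reads the shared vector x concurrently; no one writes it.
void matvec(const double *A, const double *x, double *y, int n) {
#pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    double acc = 0.0;
    for (int j = 0; j < n; ++j)
      acc += A[i * n + j] * x[j];  // contended shared reads of x
    y[i] = acc;
  }
}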

This change itself can be ported to the new runtime, but it still adds instructions to the fast paths. More importantly, it turns what is currently a compile-time constant into a runtime value (we rely heavily on inlining, so the compiler statically knows kAtomic to be 0; any branches are eliminated and any derived values are known). I think it will have an even more negative impact on the new runtime, where we also trace the atomic flag:

ev->isAtomic = !!(typ & kAccessAtomic);

and use it to create vector consts:

const m128 access_read_atomic = _mm_set1_epi32((typ & (kAccessRead | kAccessAtomic)) << 30);

If kAccessAtomic is not a constant, the compiler will need to create several globals and load one or the other. Also, for non-atomic writes this value is 0, so the compiler can simply clear the register.
I am also afraid of the long-term effect. Say we add "if (kAtomic)" (some slow path for atomics only, which does not affect normal accesses at all); with this change that becomes problematic (not only does it add a branch, it also affects register allocation, code layout, etc.).

So I would like to consider the alternatives as much as possible first.

I see. So I'll do some further experiments. Casting the memory accesses and treating them like atomics just came up as an idea yesterday.
In fact, we are looking for a solution that works the same way with gcc-compiled code, to cover Fortran codes as well. So we prefer not to modify the compiler.

I see that OpenMP reductions are always tied to a single variable (?):
https://www.openmp.org/spec-html/5.0/openmpsu107.html

The OpenMP reduction clause can take a list of variables ("one or more list items") and always reduces the thread-local copies of the variables into a single copy. Besides the predefined reduction operations, the application can also define its own reduction operations.
A user-defined reduction could be maxloc(struct{double,size_t}), which finds the maximum value and its index. Each thread would first find its local maximum and index, and the reduction just combines the results of all threads. The OpenMP runtime takes care of the synchronization between the reduction steps, but again, this synchronization should not impact the analysis of the application.
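
A sketch of what such a user-defined maxloc reduction could look like (illustrative, not from the patch; the identifier maxloc_red is made up):

#include <cfloat>
#include <cstddef>

struct maxloc {
  double val;
  std::size_t idx;
};

// The combiner merges two partial results; the initializer gives each
// thread's private copy a neutral starting value.
#pragma omp declare reduction(maxloc_red : maxloc :                    \
    omp_out = (omp_in.val > omp_out.val ? omp_in : omp_out))           \
    initializer(omp_priv = maxloc{-DBL_MAX, 0})

maxloc find_max(const double *a, std::size_t n) {
  maxloc m = {-DBL_MAX, 0};
#pragma omp parallel for reduction(maxloc_red : m)
  for (std::size_t i = 0; i < n; ++i)
    if (a[i] > m.val) { m.val = a[i]; m.idx = i; }
  return m;
}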

If at least the modified variables are known, then I think it's possible to do the following: ignore all reduction accesses as is done now, plus emit additional (fake) tsan atomic store annotations for the target variables/memory locations.

For a portable implementation, libarcher performs the HB annotations and ignore-write annotations in OpenMP tool callbacks defined in the OpenMP standard. At that point, we don't know the memory range of the reduction variables.
The semantics of the reduction callbacks are: "all reduction-related memory accesses are performed between the reduction-begin/end callbacks". These callbacks do not even reflect the synchronization semantics and can occur outside of the critical path, i.e., outside of the locked region.
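
Roughly, the callback-side view (a sketch along the lines of ompt-tsan.cpp; registration boilerplate is omitted and the real handler differs in detail):

#include <omp-tools.h>

// Existing tsan annotation entry points, declared here for the sketch.
extern "C" void AnnotateIgnoreWritesBegin(const char *file, int line);
extern "C" void AnnotateIgnoreWritesEnd(const char *file, int line);

// The tool only learns that reduction-related accesses happen between
// the begin and end callbacks; it never sees the variables' addresses,
// so it cannot emit per-location fake atomic stores.
static void ompt_tsan_reduction(ompt_sync_region_t kind,
                                ompt_scope_endpoint_t endpoint,
                                ompt_data_t *parallel_data,
                                ompt_data_t *task_data,
                                const void *codeptr_ra) {
  if (endpoint == ompt_scope_begin)
    AnnotateIgnoreWritesBegin(__FILE__, __LINE__);
  else
    AnnotateIgnoreWritesEnd(__FILE__, __LINE__);
}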

I think it should have the same effect without the long-term tsan runtime tax. Unfortunately, the tsan atomic hooks do the actual atomic operation as well, but for starters you can try __atomic_fetch_add(0).
Will this work? Am I missing anything?

I agree that it should not impact the critical path for the majority of cases.
Another possible use case that I see for this kind of annotation is a heuristic "lockset" analysis approach: assume that no pair of memory accesses under a lock races, but don't derive HB from the locking. This would make it possible to find data races that are currently hidden by the HB edges introduced by locked regions.

Another possible use case that I see for this kind of annotation is a heuristic "lockset" analysis approach: assume that no pair of memory accesses under a lock races, but don't derive HB from the locking. This would make it possible to find data races that are currently hidden by the HB edges introduced by locked regions.

Interesting, and I think it may be possible.
Do I understand correctly that such mutexes would synchronize memory, but only for the duration of the critical section? Complete support for this could be implemented by creating a shadow copy of the thread's vector clock on mutex lock and then restoring that shadow copy on mutex unlock.
I think there is also a somewhat hacky way to achieve the same thing differently. When we report a race, we restore the MutexSet for the previous access, and we know the current MutexSet. If there is any intersection on such "lockset" mutexes (or maybe just on write-locked mutexes), then we stop reporting the race.
Or this idea could actually be combined with this change as follows. We keep most of this change but don't alter the race detection logic itself; a thread still knows when it is inside an "atomic section". But we also add atomic-section entry/exit events to the trace, so that we can restore them for previous memory accesses. Then, in ReportRace, we check whether the current thread is inside an atomic section and whether the previous memory access was inside one as well.
But this involves replaying the trace for the other thread, so I am not sure which will be faster: creating a shadow vector clock or this... For the new tsan runtime, creating a shadow vector clock may be faster, since all vector clocks are fixed size (512 bytes).
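
A rough sketch of the shadow-clock variant (all names hypothetical, not the actual runtime structures; nesting and multiple held mutexes are ignored):

// Hypothetical stand-ins for tsan's thread state and sync variable;
// only the shape of the idea matters here.
struct VectorClock { /* fixed-size (512-byte) clock in the new runtime */ };
struct ThreadState { VectorClock clock, saved_clock; };
struct SyncVar     { VectorClock clock; };

void Acquire(VectorClock &dst, const VectorClock &src);  // dst = max(dst, src)
void Release(VectorClock &dst, const VectorClock &src);

void LocksetMutexLock(ThreadState *thr, SyncVar *s) {
  thr->saved_clock = thr->clock;  // snapshot before synchronizing
  Acquire(thr->clock, s->clock);  // normal HB acquire for the region
}

void LocksetMutexUnlock(ThreadState *thr, SyncVar *s) {
  Release(s->clock, thr->clock);  // publish as a normal unlock would
  thr->clock = thr->saved_clock;  // drop the HB edges gained at lock()
}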

vitalybuka resigned from this revision. Aug 17 2021, 4:44 PM

Leaving to @dvyukov