This is an archive of the discontinued LLVM Phabricator instance.

[tsan] Do not instrument reads/writes to instruction profile counters.
ClosedPublic

Authored by zaks.anna on Mar 14 2016, 3:48 PM.

Details

Summary

We have known races on profile counters, which can be reproduced by enabling -fsanitize=thread and -fprofile-instr-generate simultaneously on a multi-threaded program. This patch avoids reporting those races by not instrumenting the reads and writes coming from the instruction profiler.
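
For context, a minimal multi-threaded repro of the situation the summary describes could look like the following (illustrative only; the file name and exact TSan report are not part of this review):

// counter_race.cpp -- build with both kinds of instrumentation, e.g.:
//   clang++ -fprofile-instr-generate -fsanitize=thread counter_race.cpp
// Before this patch, TSan could report a data race on the __profc_* counters
// that both threads increment inside work().
#include <thread>

void work() {
  for (int i = 0; i < 1000; ++i) {
    // each iteration bumps work()'s profile counters
  }
}

int main() {
  std::thread t1(work), t2(work);
  t1.join();
  t2.join();
  return 0;
}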

Diff Detail

Event Timeline

zaks.anna updated this revision to Diff 50666.Mar 14 2016, 3:48 PM
zaks.anna retitled this revision from to [tsan] Do not instrument reads/writes to instruction profile counters..
zaks.anna updated this object.
zaks.anna added reviewers: vsk, kcc, dvyukov, kubamracek.
zaks.anna added a subscriber: llvm-commits.
vsk added inline comments.Mar 14 2016, 4:09 PM
lib/ProfileData/InstrProf.cpp
257 ↗(On Diff #50666)

I just realized that this isn't exhaustive. We can end up with loads/stores with non-GEP pointer operands. Consider the following example:

% cc -O3 -fprofile-instr-generate -S -emit-llvm -x c - -o -
void foo() {}

int main() {
  for (int I = 0; I < 10; ++I) foo();
}

This produces:

%1 = load <2 x i64>, <2 x i64>* bitcast ([2 x i64]* @__profc_main to <2 x i64>*), align 8
%.promoted3 = load i64, i64* getelementptr inbounds ([1 x i64], [1 x i64]* @__profc_foo, i64 0, i64 0), align 8
%2 = add i64 %.promoted3, 10
%3 = add <2 x i64> %1, <i64 1, i64 10>
store <2 x i64> %3, <2 x i64>* bitcast ([2 x i64]* @__profc_main to <2 x i64>*), align 8
store i64 %2, i64* getelementptr inbounds ([1 x i64], [1 x i64]* @__profc_foo, i64 0, i64 0), align 8

So I think we should add a check like this:

if (auto *BCI = dyn_cast<BitCastInst>(Addr))
  Addr = BCI->getOperand(0);
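
(For illustration: in the IR above the bitcast appears as a constant expression operand rather than a BitCastInst, so a sketch that peels off both forms might look like this; it is not part of the proposed diff.)

// Strip an instruction-level bitcast or a ConstantExpr bitcast wrapping
// the counter global before checking where the address points.
if (auto *BCI = dyn_cast<BitCastInst>(Addr))
  Addr = BCI->getOperand(0);
else if (auto *CE = dyn_cast<ConstantExpr>(Addr))
  if (CE->getOpcode() == Instruction::BitCast)
    Addr = CE->getOperand(0);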
265 ↗(On Diff #50666)

This is fine, but we can avoid constructing the Triple object if you're ok with doing StringRef(GV->getSection()).endswith(getInstrProfCountersSectionName(/*AddSegment=*/false)).

davidxl added inline comments.
include/llvm/ProfileData/InstrProf.h
239 ↗(On Diff #50666)

Document the right behavior here -- Return true when ..., otherwise return false.

lib/ProfileData/InstrProf.cpp
252 ↗(On Diff #50666)

If I read the code correctly, the logic seems to be the opposite of what it is supposed to mean -- it should return true when the variable is in the profile counter section, not false.

257 ↗(On Diff #50666)

I suppose there are common utility helper functions to do the stripping -- e.g. getBaseAddr or something like that?

zaks.anna added inline comments.Mar 14 2016, 5:09 PM
include/llvm/ProfileData/InstrProf.h
239 ↗(On Diff #50666)

Thanks for catching this!

lib/ProfileData/InstrProf.cpp
257 ↗(On Diff #50666)

Transforms/Utils would be a good place for this but I cannot find anything like it there.

A function that strips everything up to the base pointer would be more involved; we'd need to loop/recurse until we hit something that's not a GEP constant, a GEP instruction, a BitCast, and so on.

So far, the intention here is to just pattern match what is known to be added by the profiler pass. Hopefully, this would be sufficiently robust.

265 ↗(On Diff #50666)

It's probably cleaner/more readable with the Triple. Are you worried about speed?

davidxl added inline comments.Mar 14 2016, 5:22 PM
lib/ProfileData/InstrProf.cpp
257 ↗(On Diff #50666)

Value::stripInBoundsOffsets() might be the right interface.
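
(For illustration, that interface would fold the stripping into a single call; a sketch, not the reviewed code:)

// stripInBoundsOffsets() walks through pointer casts and inbounds GEPs,
// yielding the underlying base pointer -- here, the counter global if any.
auto *GV = dyn_cast<GlobalVariable>(Addr->stripInBoundsOffsets());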

dvyukov edited edge metadata.Mar 15 2016, 2:23 AM

Why is this race benign? Does llvm define the behavior of data races?

vsk edited edge metadata.Mar 15 2016, 10:09 AM

Racy updates of profile counters pose real problems. However, it doesn't make sense for TSan to diagnose these races because there is no reasonable way for the user to fix the issue (short of turning off profiling instrumentation).

Yeah, but it makes sense for us to fix the races instead of shutting up tsan :)

How about we mark that load as "Monotonic"?

Is the profiler team OK with that? My understanding is that you did not want anything that causes run time overhead on x86 and ARM. In any case, that would allow us to unblock TSan and give you time to decide what you want to do.

What is the right way to teach tsan to be quiet about benign races?

David

dvyukov's point is that there are no "benign races" based on LLVM semantics. People might argue that their race is benign because it will result in atomic reads/writes on the architectures they care about. However, in that case, they should annotate their loads and stores as atomic explicitly, which would lower to a no-op on the architectures they care about but would also be modeled correctly at the LLVM level. (TSan's pass reasons about LLVM semantics.) This is why I suggested using "Monotonic". (http://llvm.org/docs/Atomics.html#id10)
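
(As a rough illustration of what marking the counter update "Monotonic" could mean when the profiler lowers the increment -- Builder, Int64Ty, and CounterAddr are assumed to be in scope; this is a sketch, not the profiler's actual lowering code:)

// Emit the counter increment as relaxed (Monotonic) atomic load/store
// instead of plain load/store. For naturally aligned pointer-sized
// counters this is expected to lower to ordinary loads and stores on
// x86-64 and 64-bit ARM.
LoadInst *Count = Builder.CreateLoad(Int64Ty, CounterAddr);
Count->setAtomic(AtomicOrdering::Monotonic);
Value *Inc = Builder.CreateAdd(Count, Builder.getInt64(1));
StoreInst *Store = Builder.CreateStore(Inc, CounterAddr);
Store->setAtomic(AtomicOrdering::Monotonic);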

What is the right way to teach tsan to be quiet about benign races?

There is no such thing as a benign data race in llvm, so there is nothing to teach tsan about.
http://llvm.org/docs/Atomics.html#optimization-outside-atomic

No -- a tsan warning should not be the motivation for this kind of change.

You seem to assume that marking the profiler memory accesses as monotonically atomic will degrade performance. Why? If that's the case then there is a bug in the llvm optimizer, and we need to fix it.
There is no difference between the compiler-emitted instrumentation and similar hand-written instrumentation. And we definitely don't want users to write racy code. But we also don't want users to be penalized for writing correct code.

A related question -- why is it useful to turn on tsan for a profile-instrumented build? What additional user errors does it try to catch?

There is no point, provided that the instrumentation is correct.
FWIW, it has already caught several bugs in both llvm and gcc codegen (e.g. incorrectly widened stores). So it can make sense to keep it turned on.

A related question -- why is it useful to turn on tsan for a profile-instrumented build? What additional user errors does it try to catch?

Developers might be interested in turning on both profiling and TSan in their CI or when running tests. There is no reason to disallow this and require them to re-run the tests separately.

It is not about this assumption, but more about racy counter updates not being a problem practically (other than that the precision of the profile data might be slightly affected in practice).

"Racy counter updates not being a problem practically" assumes atomicity of the loads and stores. My understanding is that we assume that the loads and stores are atomic and on top of that we do not care about the precision of the profile data.

You seem to assume that marking the profiler memory accesses as monotonically atomic will degrade performance. Why? If that's the case then there is a bug in the llvm optimizer, and we need to fix it.

It might -- at least we should measure it. Note that an instrumented run is already pretty slow, so we need to be careful not to regress it.

Which architecture are you worried about? If these are pointer-sized reads and writes, which are atomic on x86 and ARM, the monotonic loads and stores will turn into no-ops.

Do they turn on all sanitizers + profiling or just certain combinations? I am just trying to understand the use case here.

The question is if we should detect the profiler and sanitizer being turned on at the same time and error out in the driver or if we should allow it. I do not see a reason to disallow this. To answer your other question, most of the sanitizers cannot be turned on simultaneously.

It is not about this assumption, but more about racy counter updates not being a problem practically

Choosing incorrect code, all else being equal, does not look like the right approach to me.

It might -- at least we should measure it. Note that an instrumented run is already pretty slow, so we need to be careful not to regress it.

Again, if that's the case, that's a very real bug in llvm codegen that needs to be fixed. Patching all uses of atomic operations in the world to work around that bug is not going to scale (and is just wrong).

kcc edited edge metadata.Mar 15 2016, 11:43 AM
kcc added a subscriber: kcc.

Just for the context, here is an older discussion on the same subject:
https://groups.google.com/forum/#!topic/llvm-dev/cDqYgnxNEhY

We are probably talking about different things here. My main concern is the additional constraints imposed on the optimizer that can penalize optimizations (no-op or not on a target).

I see. Are there specific benchmarks that I could use for testing this impact?

Since it is known that certain combinations of instrumentations do not work together, why not document the behavior and let the user filter out the false warnings?

These are not corner-case combinations. We encourage users to run their tests with code coverage. At the same time, we advise using sanitizers for bug finding. It is very likely that people will want to have both in their CI or when running tests.

Back to your proposal to use a Monotonic load. It does not fix the actual race (to fix the race, an atomic fetch_add instruction is needed), nor can it silence tsan.

It does silence TSan, which is what I mainly care about. TSan performs its instrumentation at the LLVM IR level, where it sees the atomic load/store.

Having said that, I think it is reasonable to introduce an option to enable atomic profile counter updates. Tsan can then be safely combined with that.

Are you saying that you are OK with having this option but it should not be turned on by default unless we can show that the overhead of using Monotonic is minimal?

So the use case includes configs like:

  • coverage + tsan
  • coverage + asan
  • coverage + ...san ?

Correct.

In the small case I tried, tsan replaces the monotonic load with a __tsan_atomic_load32(..) call -- I still saw the runtime warning.

Could you send me the test case? In the small test case I tried, I did not see a runtime warning. :) Also, have you made both the load and the store monotonic?

We are probably talking about different things here. My main concern is the additional constraints imposed on the optimizer that can penalize optimizations (no-op or not on a target). For instance, memory dependence analysis treats Monotonic loads as 'read+write' (ModRef). This can certainly affect DSE, etc. On x86, the pseudo instruction generated for a Monotonic load is ACQUIRE_MOV with the Volatile bit set. It might have some impact too.

The first part of this is llvm optimizer bugs that need to be fixed.
The volatile is both an llvm bug and a fix for a bug in the current profiler implementation. We don't want profile counters to be cached in registers indefinitely. Atomics guarantee eventual visibility of stores to other threads, and that's exactly what we want here. Volatile is a crude way to achieve that. Ideally, the optimizer eliminates duplicate stores when it can prove that there is a finite number of them, but does not try to eliminate potentially infinite sequences of stores.

There is _absolutely_ no difference between this compiler-emitted instrumentation and user code. If the optimizer is bad, then the user needs to choose between correct and fast code. And that's definitely not what we want long term.

Back to your proposal to use Monotonic load. It does not fix the actual race

Your definition of a race disagrees with the C/C++ standards and llvm rules. The proposal does fix the races and eliminates undefined behavior. It is legal to do atomic_store(p, atomic_load(p) + 1) if that's what you want (the compiler ought to emit a single ADD instruction for that).
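
(In C++11 terms, the pattern described above is a pair of relaxed atomic accesses; a small illustration, not code from the patch:)

#include <atomic>

void bump(std::atomic<long> *p) {
  // Relaxed load followed by relaxed store: race-free per the memory model,
  // and on x86 a compiler may fold it into a single (non-locked) add to memory.
  p->store(p->load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);
}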

I don't agree with you about the 'incorrect code' part.

Check out http://llvm.org/docs/Atomics.html#optimization-outside-atomic
"The basic 'load' and 'store' allow a variety of optimizations, but can lead to undefined results in a concurrent environment"

Anna reported that there is in fact some degradation if we use monotonic atomics. I am fine if we ignore the counters during tsan instrumentation, but we should also file bugs on the optimizer (there must be no performance difference). This whole discussion has little to do with the profiler; we are discussing that llvm does not permit users to have fast and correct code at the same time.

I don't think that switching to atomic RMWs under tsan is a good idea: having precise counters only in the tsan build does not make sense. That could be a separate option (if the user wants precise counters), but it is orthogonal to this discussion.

Fine with that too if that is a must, but what is wrong with modifying the test configuration to avoid this known limitation?
According to Anna, the following configs are tested:

I was not talking about any specific test configurations. This is what compiler/tools users could choose to use. We (as compiler vendors) have no control over this.

  • coverage + tsan
  • coverage + asan

...
If the motivation is to save build time (by merging instrumentations), how about just

  • coverage + asan
  • tsan
  • ...

This is also faster. There does not seem to be any value in repeatedly enabling coverage.

We do not really control what the users of these tools choose to use in their workflow / CI. It is possible they want to use coverage and TSan and not care about ASan. Our choice is to either return an error when both coverage and tsan are requested or just to support this configuration. I think it would be very surprising to users to see that they cannot use coverage with TSan. Also, I do not see a reason why it should be disallowed.

All right. I am fine with the direction the original patch is going (ignore counter vars).

thanks,

David

Yes -- the C++ memory model does not allow speculative store motion. However, in that example, the compiler can do the following to remove the memory access of x in the loop:

I don't see how it relates. Races still lead to undefined behavior.

DSE won't happen when monotonic is used (as below) -- I am not sure if that is a bug.

DSE is permitted by C/C++ standards in this case. So, yes, this is a bug.

I think this is already assuming that the original program is race free (i.e., plain loads/stores are atomic). Otherwise the behavior of the original program would also be undefined.

Indeed. That's why behavior of the current profiling counters is undefined.

Do you have a reference?

The C/C++ standards do not impose any constraints on the thread scheduler. So if we have:

atomic_store(p, 1);
atomic_store(p, 2);

Elimination of the first store does not affect the behavior of the issuing thread (it's dead for the issuing thread). And for other threads we pretend that the thread scheduler never gives them a chance to observe value 1 (either the stores happen very quickly so other threads don't observe the first one, or we have some kind of cooperative scheduler that does not schedule threads in between these stores, or something along these lines).

Besides, if 'monotonic' is considered a complete no-op (on x86), such that the optimizer should not have any constraints imposed on it, why does Tsan behave differently here (on x86)?

Because monotonic-atomic-store is still an atomic store, and atomic stores do not race with other atomic operations.

I am not sure what 'pretend' means here. Is it possible that value '1' is still visible to other threads, or not?

Well, if we eliminate it, then it won't be visible.

If we don't eliminate it, then value 1 may or may not be visible to other threads depending on the scheduler (among other things). There are no requirements for the scheduler, so an implementation in which value 1 is never visible to other threads is legal.

Sorry, it may seem obvious to you, but here is the puzzle: suppose the two cases I mentioned produce *identical* code in the final binary -- it either has a race or not, but not both. However, Tsan says there is a race when it is produced from one form of IR but no race from the other?

Consider that we have two programs: one overflows int, the other overflows unsigned. Both programs can potentially produce the same machine code and the same result. But the first one still contains a bug that leads to undefined behavior of the program, while the second does not.
The same situation happens with atomic and plain memory accesses. The fact that you get correct machine code on a particular version of a particular compiler proves nothing in the general case. There are well-defined semantics at the language level, and that's what is important.

zaks.anna updated this revision to Diff 51330.Mar 22 2016, 1:17 PM
zaks.anna edited edge metadata.

I've addressed the code review comments and moved the check into ThreadSanitizer.cpp. This localizes the code and makes it easier to add suppressions for other generated IR in the future.
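
(Roughly, the shape of such a check inside the TSan pass could be the following; this sketch just combines the helpers discussed above and may differ from the committed code:)

#include "llvm/IR/GlobalVariable.h"
#include "llvm/ProfileData/InstrProf.h"
using namespace llvm;

// Skip instrumentation for addresses that resolve to a global placed in
// the profile counters section.
static bool addrPointsToProfileCounters(Value *Addr) {
  // Peel off the constant GEPs/bitcasts the profiler wraps around the
  // counter global.
  Addr = Addr->stripInBoundsOffsets();
  auto *GV = dyn_cast<GlobalVariable>(Addr);
  if (!GV || !GV->hasSection())
    return false;
  return StringRef(GV->getSection())
      .endswith(getInstrProfCountersSectionName(/*AddSegment=*/false));
}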

dvyukov accepted this revision.Mar 22 2016, 1:25 PM
dvyukov edited edge metadata.

LGTM

This revision is now accepted and ready to land.Mar 22 2016, 1:25 PM
zaks.anna closed this revision.Mar 29 2016, 4:26 PM

In r264805