This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] libomp: eliminate pause from atomic CAS loops
ClosedPublic

Authored by AndreyChurbanov on Feb 19 2021, 12:21 PM.

Details

Summary

Pause is good in spin-loops waiting for something.
Atomic CAS loops do not wait for anything,
each CAS failure means some other thread progressed.

So pause only causes unnecessary slowdown.

Diff Detail

Event Timeline

AndreyChurbanov requested review of this revision.Feb 19 2021, 12:21 PM

I think this is architecture specific.

If a CAS failed spuriously, then immediately retry is good. If it failed because another core wrote to the cache line, then we have established that said cache line is somewhat contended, in which case pause() may let the other threads progress faster.

This might be difficult to microbenchmark.

I think this is architecture specific.

If a CAS failed spuriously, then immediately retry is good. If it failed because another core wrote to the cache line, then we have established that said cache line is somewhat contended, in which case pause() may let the other threads progress faster.

If one thread succeeded then ALL other threads competing on the cache line would fail. So they will not progress faster executing pause instruction. Only gain can be for succeeding thread, that means it executes nothing but atomic in a loop, and that is not a real code example...

Actually the atomic functions are never called by clang, so it is kind of cleanup change in any case.

Experiments calling an internal atomic function directly in a loop show either no difference on some processors (where pause instruction is fast enough), or significant speedup (where pause instruction is slow).

One clear gain of this patch is reduced size of the runtime - I see 4096 bytes smaller size of the binary if the patch applied.

If you still think the patch is not good, I am fine to abandon it. But to me it is pretty harmless, and as I mentioned, it reduces the size of the runtime.

I think this is architecture specific.

If a CAS failed spuriously, then immediately retry is good. If it failed because another core wrote to the cache line, then we have established that said cache line is somewhat contended, in which case pause() may let the other threads progress faster.

If one thread succeeded then ALL other threads competing on the cache line would fail. So they will not progress faster executing pause instruction. Only gain can be for succeeding thread, that means it executes nothing but atomic in a loop, and that is not a real code example...

Actually the atomic functions are never called by clang, so it is kind of cleanup change in any case.

Experiments calling an internal atomic function directly in a loop show either no difference on some processors (where pause instruction is fast enough), or significant speedup (where pause instruction is slow).

One clear gain of this patch is reduced size of the runtime - I see 4096 bytes smaller size of the binary if the patch applied.

If you still think the patch is not good, I am fine to abandon it. But to me it is pretty harmless, and as I mentioned, it reduces the size of the runtime.

I think we should keep this patch. These atomic functions are just trying to mimic what a hardware instruction would accomplish. In line with Andrey's reasoning, if a CAS fails here then there is nothing to wait for since the other thread that caused the fail is done with the atomic. i.e., the critical section is the atomic itself. Hence, the "spin wait" is already over and another retry should take place immediately.

Someone measuring the performance on a few architectures soundly beats my speculation. No objection here, thanks!

This revision is now accepted and ready to land.Mar 8 2021, 11:54 AM