This is an archive of the discontinued LLVM Phabricator instance.

Use hyperbarrier by default on all architectures
Closed, Public

Authored by Hahnfeld on Nov 22 2017, 7:42 AM.

Details

Summary

All architectures except x86_64 used the linear barrier implementation
by default, which does not scale well to larger numbers of threads.
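
Roughly, the previous compile-time default looked like the following sketch (the macro name is illustrative, not the runtime's actual code; the runtime picks one of its bp_*_bar gather/release pattern values per architecture):

  #if KMP_ARCH_X86_64
  #define DEFAULT_BARRIER_PATTERN bp_hyper_bar   /* hypercube, branch factor 4 */
  #else
  #define DEFAULT_BARRIER_PATTERN bp_linear_bar  /* every worker reports to master */
  #endif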

Improvements in PARALLEL overhead (EPCC) with this patch on a Power8
system (2 sockets x 10 cores x 8 threads, OMP_PLACES=cores); a small
microbenchmark sketch for reproducing this kind of measurement follows
the numbers:

 20 threads:  4.55us -> 3.49us
 40 threads:  8.84us -> 4.06us
 80 threads: 19.18us -> 4.74us
160 threads: 54.22us -> 6.73us
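
For reference, here is a minimal PARALLEL-overhead microbenchmark in the spirit of EPCC (a simplified sketch of my own, not the EPCC code). Assuming the runtime's KMP_PLAIN_BARRIER_PATTERN / KMP_FORKJOIN_BARRIER_PATTERN environment variables (values of the form "gather,release", e.g. "linear,linear" or "hyper,hyper"), the two patterns can be compared by running the same binary with different settings:

  #include <omp.h>
  #include <stdio.h>

  #define REPS 100000

  /* A tiny bit of work so the compiler cannot optimize the region away. */
  static void delay(void) {
    volatile double x = 0.0;
    for (int i = 0; i < 16; ++i)
      x += i;
  }

  int main(void) {
    /* Reference: the per-iteration work executed serially. */
    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; ++r)
      delay();
    double ref = (omp_get_wtime() - t0) / REPS;

    /* Test: the same work inside a parallel region per iteration; the
       difference to the reference is the fork/join + barrier overhead. */
    double t1 = omp_get_wtime();
    for (int r = 0; r < REPS; ++r) {
  #pragma omp parallel
      delay();
    }
    double par = (omp_get_wtime() - t1) / REPS;

    printf("threads=%d  PARALLEL overhead = %.2f us\n",
           omp_get_max_threads(), (par - ref) * 1e6);
    return 0;
  }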

Diff Detail

Repository
rOMP OpenMP

Event Timeline

Hahnfeld created this revision. Nov 22 2017, 7:42 AM

The idea was that 32-bit machines will probably have a small number of cores (2, or 4, or ...), and then the hyper barrier can have a bigger overhead. Can you check whether 2 or 4 threads work faster with the hyper barrier compared to the linear one? If not, then maybe the condition could be fixed in a different way, e.g. adding the Power arch to x86_64 and leaving the linear barrier for 32-bit archs.

BTW, the comments "hyper2: C78980" could safely be removed, I think. This is some very old info that says nothing nowadays (at least to me :).

The idea was that 32-bit machines will probably have a small number of cores (2, or 4, or ...), and then the hyper barrier can have a bigger overhead. Can you check whether 2 or 4 threads work faster with the hyper barrier compared to the linear one? If not, then maybe the condition could be fixed in a different way, e.g. adding the Power arch to x86_64 and leaving the linear barrier for 32-bit archs.

I might be seeing a slightly better average with the linear barrier for 2 threads (1 percent?), but with a higher standard deviation, so I'm not really sure about this.
The hyper barrier clearly wins for 4 threads by about 5 percent, and naturally for all higher thread counts.
(Tested on the same Power system.)

So in theory, the hyper barrier collapses to a linear barrier for all thread counts less than 5 because we have a branch factor of 4, right? Obviously with a higher overhead because of the more complex code, but the synchronization pattern (which thread waits for which child) remains the same...
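
To illustrate what I mean, here is a toy model of the gather tree (just a sketch to reason about the pattern, not the runtime's implementation): with branch factor 4, a thread's parent is its id with the lowest non-zero base-4 digit cleared, so for fewer than 5 threads every worker reports directly to thread 0, exactly like the linear barrier.

  /* Toy model of the hyper-barrier gather tree with branch factor 4;
     illustrative only, not the runtime's code. */
  #include <stdio.h>

  #define BRANCH 4

  static int parent_of(int tid) {
    if (tid == 0)
      return -1;                          /* the master has no parent */
    int level = 1;
    while (tid % (level * BRANCH) == 0)   /* skip base-4 digits that are 0 */
      level *= BRANCH;
    return tid - tid % (level * BRANCH);  /* clear the lowest non-zero digit */
  }

  int main(void) {
    for (int n = 2; n <= 8; ++n) {
      printf("nthreads=%d:", n);
      for (int t = 1; t < n; ++t)
        printf(" %d->%d", t, parent_of(t));
      /* For n <= 4 every worker reports directly to thread 0, i.e. the
         same dependency edges as the linear barrier. */
      printf("\n");
    }
    return 0;
  }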

BTW, the comments "hyper2: C78980" could safely be removed, I think. This is some very old info that says nothing nowadays (at least to me :).

Ok, will do once we have agreed on the general direction of this. (I always thought these were references to an internal bug tracker? There are more references in kmp_atomic.cpp.)

AndreyChurbanov accepted this revision. Nov 24 2017, 8:04 AM

LGTM

If some architecture sees a slowdown because of this change (unlikely), we can restore the linear barrier default for it.

This revision is now accepted and ready to land. Nov 24 2017, 8:04 AM

Adding D40722 as a dependency because the improved barrier made it more likely to hit the problem fixed there.

This revision was automatically updated to reflect the committed changes.
pawosm01 added a subscriber: pawosm01. Edited Nov 7 2018, 5:37 AM

I need to contact whoever introduced the KMP_REVERSE_HYPER_BAR implementation (unfortunately, it was added here in a huge collective commit years ago, on Tue Oct 7 16:25:50 2014).
I have a piece of proprietary code that stopped failing on AArch64 after I reverted D40358. Digging deeper, I found out that it's not a problem with hyper barriers per se: when I undefined KMP_REVERSE_HYPER_BAR, the problem went away. What I also found is that the problem can be reproduced on x86_64 too, only less frequently, roughly once per run of

  for i in `seq 1 1000`; do ...

As I'd have to spend time figuring out the logic behind it and verifying its validity, can someone familiar with this code have a look at it in the meantime and confirm that the reverse implementation is sane?

Paul,

Can you share more info on the problem on x86_64? What kind of failure do you see? What OS and hardware are you on? Can you share a reproducer?

The code you are talking about is quite ancient. I doubt anybody can guarantee its correctness, but we haven't seen any problems in many years of testing. I cannot say anything about AArch64, but if we could reproduce the problem on x86_64, we'd be happy to investigate it.

Thanks,
Andrey

Not much I'm able to say. The error is a malformed integer value that leads to an incorrect computation result; it shows up more often on AArch64 and less often on x86_64, both running under Linux. The code does something unusual, so I'm not surprised no one has faced this before. Also keep in mind that I would never have found it if hyper barriers hadn't been introduced for all architectures.