This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Fix hidden helper + affinity assignment
ClosedPublic

Authored by jlpeyton on May 4 2021, 9:03 PM.

Details

Summary

When KMP_AFFINITY is set, each thread's gtid value is used as an index into the place list to determine the thread's placement. With hidden helpers enabled, this gtid value is shifted down, leading to unexpectedly shifted thread placement. This patch restores the previous behavior by adjusting the mask index to take the number of hidden helper threads into account.

Hidden helper threads are given the full initial mask and do not participate in any of the affinity mechanisms besides the initial affinity assignment.
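
As a rough illustration of the adjustment (a minimal C++ sketch; the function and parameter names are invented and do not match the runtime's actual code):

// Minimal sketch of the mask-index adjustment described in the summary;
// all names here are invented for illustration.
static int affinity_mask_index(int gtid, int num_hidden_helper_threads) {
  // Hidden helpers occupy gtids 1..N, shifting regular workers' gtids up
  // by N. Subtract that shift so workers index the KMP_AFFINITY place
  // list as if the helpers did not exist.
  if (gtid > num_hidden_helper_threads)
    return gtid - num_hidden_helper_threads;
  return gtid; // the initial thread (gtid 0) keeps index 0
}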

Event Timeline

jlpeyton created this revision.May 4 2021, 9:03 PM
jlpeyton requested review of this revision.May 4 2021, 9:03 PM

Do we set any affinity for the hidden helper threads, or are they free floating?

@ronlieb Does this patch fix the performance issue you experienced for the SPEC CPU benchmark?

Testing results: 128 cores used.

In my production tree:
Current helpers=false num_helpers=0
Success 619.lbm_s base refspeed ratio=77.36, runtime=67.711483

Helpers=false num_helpers=8 (more like upstream): slowdown
Success 619.lbm_s base refspeed ratio=63.64, runtime=82.301507

Helpers=false num_helpers=8 patch applied: slowdown slightly improved, but not recovered sufficiently
Success 619.lbm_s base refspeed ratio=66.47, runtime=78.807917

Helpers=false num_helpers=0 patch applied: performance recovered
Success 619.lbm_s base refspeed ratio=77.31, runtime=67.752051

@ronlieb Thanks for testing! When I saw this patch, I thought that broken thread binding might be the cause of the observed performance issue.

@tianshilei1992 this patch is just another case confirming my statement in D77609:

I think the fundamental issue of this patch is that it broke the implicit assumption that entries in __kmp_threads are handed out contiguously.

I wonder whether it would improve the situation if we moved the hidden helper threads to __kmp_threads[-8:-1], i.e., below the initial thread.

Hi Joachim, you're welcome; happy to try again if you come up with another patch.
FYI: our benchmark run depends on setting this env var:
export GOMP_CPU_AFFINITY=0-127

hbae removed a reviewer: jlpeyton.
hbae added a comment.May 5 2021, 6:16 AM

Ignore my last action (my mistake).

Thanks for helping with the issue! And thanks @ronlieb for testing.

@tianshilei1992 this patch is just another case confirming my statement in D77609:

I think the fundamental issue of this patch is that it broke the implicit assumption that entries in __kmp_threads are handed out contiguously.

I wonder whether it would improve the situation if we moved the hidden helper threads to __kmp_threads[-8:-1], i.e., below the initial thread.

Yeah, that would be the ideal way to do it. However, it doesn't work, because negative gtids have special uses, e.g. for locks. We cannot simply move those special gtids to a very "large" negative value (a very small one, mathematically) because we don't know the maximum number of helper threads. Since helper threads are not part of the spec, we could arguably set a maximum number and move those special gtids out of that range, but the drawback is: if helper threads (or a similar concept) are adopted into the spec in the future, and the spec doesn't set a maximum number (which I think is very possible), then we would need to "redesign" again. That's why I chose to use the first few slots for the helper threads, but I didn't expect that to break the affinity configuration. :-)
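
For context, the runtime already reserves a small band of negative gtids as sentinels, roughly as below (names and values reproduced from memory of kmp.h; treat them as illustrative rather than authoritative):

// Reserved negative gtid sentinels, approximately as in kmp.h:
#define KMP_GTID_DNE      (-2) // thread does not exist
#define KMP_GTID_SHUTDOWN (-3) // thread is shutting down
#define KMP_GTID_MONITOR  (-4) // monitor thread
#define KMP_GTID_UNKNOWN  (-5) // gtid not (yet) known
// Parking hidden helpers at __kmp_threads[-8:-1] would collide with this
// sentinel band, and the spec gives no upper bound on how many helper
// threads a future revision might require, so the band cannot simply be
// moved "far enough" down.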

I cannot reproduce the regression with this patch. On a 48-core (96-hyperthread) machine with the settings

GOMP_CPU_AFFINITY=0-95
OMP_NUM_THREADS=96

I got the following results:

Current library default:
Success 619.lbm_s base refspeed ratio=67.88, runtime=77.167402
Patch applied:
Success 619.lbm_s base refspeed ratio=87.54, runtime=59.837863
Helpers=false num_helpers=0:
Success 619.lbm_s base refspeed ratio=87.28, runtime=60.015925
Helpers=false num_helpers=0 patch applied:
Success 619.lbm_s base refspeed ratio=87.88, runtime=59.602862

@ronlieb, could you please provide the error output of the library with the setting

KMP_AFFINITY=verbose

for your runs with/without the patch applied (the slightly improved and the slow ones)?

Since the outputs will likely be big, it might be better to attach them to the corresponding bug, https://bugs.llvm.org/show_bug.cgi?id=49673, in order not to pollute this review with huge testing logs.

Do we set any affinity for the hidden helper threads, or are they free floating?

The hidden helpers do get their affinity set as if they were normal worker threads.

e.g., with this patch and if KMP_AFFINITY=compact (and two hardware threads per core), then (see the model sketch after this list):
regular gtid 0 is pinned to the first core
regular gtid 9 is pinned to the first core
regular gtid 10 is pinned to the second core
regular gtid 11 is pinned to the second core
regular gtid 12 is pinned to the third core
...
hidden helper gtid 1 is pinned to the first core
hidden helper gtid 2 is pinned to the second core
hidden helper gtid 3 is pinned to the second core
hidden helper gtid 4 is pinned to the third core
...
hidden helper gtid 8 is pinned to the fifth core
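
The mapping above can be reproduced with a tiny model of the compact placement (a self-contained C++ sketch, purely illustrative; it is not the runtime's code):

#include <cstdio>

// Model of KMP_AFFINITY=compact with two hardware threads per core:
// mask index i lands on core i/2. Regular workers subtract the helper
// count from their gtid before indexing the place list; hidden helpers
// (gtids 1..8 here) use their raw gtid.
static int core_for_mask_index(int idx) { return idx / 2; }

int main() {
  const int num_helpers = 8;
  std::printf("regular gtid 0 -> core %d\n", core_for_mask_index(0));
  for (int gtid = 9; gtid <= 12; ++gtid) // regular workers
    std::printf("regular gtid %d -> core %d\n", gtid,
                core_for_mask_index(gtid - num_helpers));
  for (int gtid = 1; gtid <= num_helpers; ++gtid) // hidden helpers
    std::printf("hidden helper gtid %d -> core %d\n", gtid,
                core_for_mask_index(gtid));
  return 0;
}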

Is there a consensus on whether we want them free-floating or not? I assume we do, but want to make sure.

Is there a consensus on whether we want them free-floating or not? I assume we do, but want to make sure.

That seems to be a reasonable default for now.

Sorry, I missed the request for the verbose output; is that still needed from me?

jlpeyton updated this revision to Diff 344202.May 10 2021, 2:31 PM
jlpeyton edited the summary of this revision.

With the latest patch:

Added full mask assignment for hidden helper threads.

Took away the check for hidden helpers being enabled in the adjust_gtid function, since the gtid needs to be adjusted regardless of whether hidden helpers are enabled or not.

Stopped hidden helpers from printing their thread affinity, unless it is a debug build, to help avoid confusion when using KMP_AFFINITY=verbose.
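
A minimal sketch of the helper-thread side of this update (all names are invented; KMP_DEBUG is assumed to be the runtime's debug-build macro):

#include <cstdio>

// Hypothetical stand-in for binding a thread to the full initial mask;
// the real runtime manipulates kmp_affin_mask_t objects instead.
static void bind_to_full_initial_mask(int gtid) {
  std::printf("hidden helper gtid %d: full initial mask\n", gtid);
}

static void set_hidden_helper_affinity(int gtid) {
  // Helpers receive the whole initial mask, so they never consume (or
  // shift) entries of the KMP_AFFINITY place list.
  bind_to_full_initial_mask(gtid);
#ifdef KMP_DEBUG // assumption: the runtime's debug-build macro
  // Helper affinity is reported only in debug builds, keeping
  // KMP_AFFINITY=verbose output free of hidden-helper noise.
  std::printf("hidden helper gtid %d: affinity printed\n", gtid);
#endif
}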

Sorry, I missed the request for the verbose output; is that still needed from me?

When @AndreyChurbanov did the testing, this removed the slowdown completely. Assuming this is not the case for you, the output would probably help.

@AndreyChurbanov, this seems to improve the upstream repo, right? If so, we should probably go ahead and land this even while we figure out what @ronlieb is seeing.

I am testing your new patch.
I have no objection if you land what you have.
Up to you.

ronlieb accepted this revision.May 10 2021, 4:41 PM

I can confirm 619.lbm looks better with the patch; performance recovered.
Thanks!

This revision is now accepted and ready to land.May 10 2021, 4:41 PM

Do we set any affinity for the hidden helper threads, or are they free floating?

The hidden helpers do get their affinity set as if they were normal worker threads.

e.g., with this patch and if KMP_AFFINITY=compact (and two hardware threads per core), then
regular gtid 0 is pinned to the first core
regular gtid 9 is pinned to the first core
regular gtid 10 is pinned to the second core
regular gtid 11 is pinned to the second core
regular gtid 12 is pinned to the third core
...
hidden helper gtid 1 is pinned to the first core
hidden helper gtid 2 is pinned to the second core
hidden helper gtid 3 is pinned to the second core
hidden helper gtid 4 is pinned to the third core
...
hidden helper gtid 8 is pinned to the fifth core

Is there a consensus on whether we want them free-floating or not? I assume we do, but want to make sure.

I don't think that such binding of the hidden threads is optimal. On our dual-socket system with sub-NUMA clustering, this would bind all helper threads to the same sub-NUMA domain.
I'd suggest a second env var to control the binding of the helper threads (with some reasonable default), e.g. along the lines sketched below.

This is nothing that should block this patch; it can be implemented in a follow-up patch.
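
Purely as a hypothetical illustration of such a control (this variable name and value are invented here and did not exist in the runtime at the time of this review):

# Hypothetical env var: give hidden helpers their own binding policy,
# independent of the workers' KMP_AFFINITY/GOMP_CPU_AFFINITY setting.
export KMP_HIDDEN_HELPER_AFFINITY=scatter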

This revision was landed with ongoing or failed builds.May 11 2021, 6:55 AM
This revision was automatically updated to reflect the committed changes.