Page MenuHomePhabricator

[OpenMP] Fixed a crash in hidden helper thread
ClosedPublic

Authored by tianshilei1992 on Mar 17 2021, 7:07 PM.

Details

Summary

It is reported that after enabling hidden helper thread, the program
can hit the assertion new_gtid < __kmp_threads_capacity sometimes. The root
cause is explained as follows. Let's say the default __kmp_threads_capacity is
N. If hidden helper thread is enabled, __kmp_threads_capacity will be offset
to N+8 by default. If the number of threads we need exceeds N+8, e.g. via
num_threads clause, we need to expand __kmp_threads. In
__kmp_expand_threads, the expansion starts from __kmp_threads_capacity, and
repeatedly doubling it until the new capacity meets the requirement. Let's
assume the new requirement is Y. If Y happens to meet the constraint
(N+8)*2^X=Y where X is the number of iterations, the new capacity is not
enough because we have 8 slots for hidden helper threads.

Here is an example.

#include <vector>

int main(int argc, char *argv[]) {
  constexpr const size_t N = 1344;
  std::vector<int> data(N);

#pragma omp parallel for
  for (unsigned i = 0; i < N; ++i) {
    data[i] = i;
  }

#pragma omp parallel for num_threads(N)
  for (unsigned i = 0; i < N; ++i) {
    data[i] += i;
  }

  return 0;
}

My CPU is 20C40T, then __kmp_threads_capacity is 160. After offset,
__kmp_threads_capacity becomes 168. 1344 = (160+8)*2^3, then the assertions
hit.

Diff Detail

Event Timeline

tianshilei1992 created this revision.Mar 17 2021, 7:07 PM
tianshilei1992 requested review of this revision.Mar 17 2021, 7:07 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 17 2021, 7:07 PM

avoid potential integer overflow

protze.joachim requested changes to this revision.Mar 18 2021, 2:39 AM

The test case should really try to exceed the current capacity by 1.

I don't think, this patch really solves the fundamental issue resulting in the assertion.

openmp/runtime/src/kmp_runtime.cpp
3556–3564

I don't think, this is the right fix for the problem.
__kmp_threads_capacity is the size of the __kmp_threads array. If a call to __kmp_expand_threads asks to expand the array by 1, you don't need to expand by additional hidden threads (as they are not placed at the end). The hidden threads were already part of the __kmp_threads_capacity before expansion.

3624

This might not call __kmp_expand_threads(1), if not all __kmp_hidden_helper_threads_num are created before this code is reached and __kmp_all_nth - __kmp_created_hidden_helper_threads + __kmp_hidden_helper_threads_num >= capacity.

3655

Please don't forget about this fix. I don't care whether you fix it here or in a follow-up patch.

3658–3660

This for-loop implicitly assumes, that it will find an empty space in [0:__kmp_threads_capacity), i.e. implicitly assumes gtid < __kmp_threads_capacity , as stated in the assertion below - but you skip the first few index in the range. Line 3620 fails to provide this guarantee for your modified numbering scheme.

openmp/runtime/test/tasking/hidden_helper_task/num_threads.cpp
12 ↗(On Diff #331442)

omp_get_num_threads() will always return 1 in serial context.

This revision now requires changes to proceed.Mar 18 2021, 2:39 AM
protze.joachim added inline comments.Mar 18 2021, 4:33 AM
openmp/runtime/src/kmp_runtime.cpp
4327

This assertion triggers for your test application, when the runtime is built in debug mode.

openmp/runtime/test/tasking/hidden_helper_task/num_threads.cpp
25 ↗(On Diff #331442)

The assertion in kmp_runtime.cpp:4323 triggers also for num_threads(__kmp_threads_capacity+1).

In that case, __kmp_expand_threads is never called, so your patch would have no effect at all.

To really fix the issue, all cases which try to calculate the required number of threads and eventually would call __kmp_expand_threads need to consider the extra space for the hidden threads, so that they

  • first realize that space is missing and
  • second request enough space for the hidden threads.

With this consideration, __kmp_expand_threads does not need any change.

Reproducer for the assertion in Line 3660:

#include <omp.h>
#include <vector>
#include <thread>
#include <chrono>

void dummy_root(){
  int nthreads = omp_get_max_threads();
  std::this_thread::sleep_for(std::chrono::milliseconds(1000));
}


int main(int argc, char *argv[]) {
  const int N = 4 * omp_get_num_procs();
  std::vector<int> data(N);
  std::thread root(dummy_root);
#pragma omp parallel for num_threads(N)
  for (unsigned i = 0; i < N; ++i) {
    data[i] += i;
  }

  root.join();
  return 0;
}
tianshilei1992 added inline comments.Mar 18 2021, 6:01 AM
openmp/runtime/src/kmp_runtime.cpp
3556–3564

Even w/o hidden helper thread, expansion by 1 will not result in increment by 1 because the newCapacity always doubles. Say originally it is 32, and we ask for expansion by 1, newCapacity will be 64 instead of 33. Therefore, whether we add extra space for hidden helper thread doesn’t waste too much memory here.

3655

Sure. I’ll include in this patch.

3658–3660

Because we already take the number of hidden helper thread into account when setting __kmp_threads_capacity, this assumption holds, right?

openmp/runtime/test/tasking/hidden_helper_task/num_threads.cpp
12 ↗(On Diff #331442)

This logic is from code to set the capacity, specifically in the comment. It uses $OMP_NUM_THREADS originally so here I changed to omp_get_num_threads().

protze.joachim added inline comments.Mar 18 2021, 7:11 AM
openmp/runtime/src/kmp_runtime.cpp
3658–3660

No, as I showed with my reproducer, your patched code will never be reached in specific cases.
Also in the case, I annotated in your test-case: when the application asks for N = __kmp_threads_capacity threads, your patch is never reached, but the assertion still triggers (an prevents access beyond allocated memory!).

openmp/runtime/test/tasking/hidden_helper_task/num_threads.cpp
12 ↗(On Diff #331442)

Right, omp_get_max_threads() is the function, which gives you $OMP_NUM_THREADS. omp_get_num_threads() does something different.

Please don't rely on the names of the functions, but check the OpenMP spec.

tianshilei1992 added inline comments.Mar 18 2021, 8:44 AM
openmp/runtime/src/kmp_runtime.cpp
3658–3660

I got your point. That is a nice catch! Thanks. I'll come up with a good way to fix it.

[AMD Official Use Only - Internal Distribution Only]

Slightly related:
In Wednesdays multi company meeting, we concluded that the helper task patch should be reverted from llvm 12 while we continued to actively work issues in trunk.

Who is taking care of that ? or whom should we notify ?

Ron

I didn't recall we have that conclusion. My memory told me the patch will be reverted if we can't fix issues before the release. No?

We should pull this from the 12 release. Lots of effort at the last minute to stop a complicated patch asserting, after it has been patched several times already, is unlikely to yield a stable release.

tianshilei1992 marked 5 inline comments as done.Mar 18 2021, 9:59 AM
tianshilei1992 added inline comments.
openmp/runtime/src/kmp_runtime.cpp
4327

Can you try again?

[AMD Official Use Only - Internal Distribution Only]

I totally agree with reverting this from 12.
Lets help our llvm release engineers produce a quality release.
Knowingly leaving in a patch that has this much contention an churn is not going to lead to a quality release.

Also, that is what I thought we concluded in the meeting.

Ron

I filed https://bugs.llvm.org/show_bug.cgi?id=49631 to make release managers aware that we have a problem here.

Ron,

Prefer if you  remove this from the mail " [AMD Official Use Only - Internal Distribution Only]"

Thanks
Ravi

[AMD Public Use]

Removed in both places , sorry about that, darn mailer configuration.

Shilei,

How much time do  you think you  need to resolve or conclude to revert or disable with macros in 12.0

Some would like to stabilize their performance numbers and would like to do it as early as possible.
Thanks
Ravi

tianshilei1992 added a comment.EditedMar 18 2021, 11:10 AM

Shilei,

How much time do  you think you  need to resolve or conclude to revert or disable with macros in 12.0

Some would like to stabilize their performance numbers and would like to do it as early as possible.
Thanks
Ravi

For the assertion problem, I expect this patch to fix that, and hopefully people can give it a shot. For the performance regression, I didn't observe it at least for now with HPC2021. I'll contact Ron for his reproducer.

[AMD Public Use]

Shilei
I offered you a spec cpu 619.lbm reproducer for the performance issue.
takes 2 minutes or less to compile and run.
Do you want that?

Ron

[AMD Public Use]

Shilei
I offered you a spec cpu 619.lbm reproducer for the performance issue.
takes 2 minutes or less to compile and run.
Do you want that?

Ron

Yes, please.

[AMD Public Use]

Awesome, I will send it along in private email due to spec confidentiality rules.
Ie. I cannot attach the source here.

Look for something shortly.
Please ask me for any help you might need on the reproducer.

Ron

In Wednesdays multi company meeting, we concluded that the helper task patch should be reverted from llvm 12 while we continued to actively work issues in trunk.

Who is taking care of that ? or whom should we notify ?

I also don't recall that conclusion.

We should pull this from the 12 release. Lots of effort at the last minute to stop a complicated patch asserting, after it has been patched several times already, is unlikely to yield a stable release.

Pulling this is not necessarily easy either, I haven't checked though.
Have you or @ronlieb tried this solution, especially if the environment
variable now works to disable all the side-effects?


Let me be direct for a second so we don't end up here again in a few months:
The patch was on phab for ~1 year, nobody cared, this is a very common phenomena.
It also has been merged for weeks. I get the fact that we want a stable release
but showing up last minute just saying we need to pull stuff is *not* helpful
from an overall perspective. I say this especially because the number of people/
organizations that develop and upstream complex features is very limited. If you
want to benefit from such efforts you should be prepared to help, IMHO. That does
mean to do some testing and reviewing *before* the last release candidate is due.
Not to say this was not tested, but the capabilities are arguably different here.

[AMD Public Use]

The environment variable LIBOMP_USE_HIDDEN_HELPER_TASK=0 does not solve the issues I have seen.
Nor did it resolve the issue in the simple test case Joachim posted...
Which I tried both with and without LIBOMP_USE_HIDDEN_HELPER_TASK=0
Joachim might be able to confirm same?

Ron

LIBOMP_NUM_HIDDEN_HELPER_THREADS=0 avoids the segfault/assertion for the two test cases I attached to the bugzilla issue. This kind of makes sense, as 0 hidden threads cannot create a hole in the __kmp_threads array.

If you still see performance regression (I could not reproduce this with lbm built from SPEC CPU 2006, for which I explicitly turned on the contained OpenMP code), I guess the code adds some additional synchronization, which was not there before.

tianshilei1992 added a comment.EditedMar 18 2021, 12:38 PM

I did some experiments with different versions of lbm:

  • HPC2021: didn't observe performance regression (three variants: with hht, with hht but disable it via env, and w/o hht by reverting the change)
  • Ron's reproducer: observed performance regression if running with numactl --localalloc --physcpubind=0-xxx. In this case, disabling it via env can help. If running w/o numactl, almost no performance difference. (unclear the tiny difference is noise or not)

All run 10 times.

__kmp_hidden_helper_initialize() always initializes all hidden threads at once. Right? In this case, you modifications make sense.

I tested the patch applied to the release branch. It fixes both of my reproducers.
Please add my reproducers as test cases.

Also remove the unnecessary code of your first shot.

openmp/runtime/src/kmp_runtime.cpp
3556–3564

This change is unnecessary.

Please add my reproducers as test cases.

I just saw, that you fused the reproducers into a single test. I'd prefer to have them as separate tests. This will help to easier spot the source of future failures. The individual parallel regions might also change the capacity, so that the individual issues are not triggered.

tianshilei1992 marked 2 inline comments as done.Mar 18 2021, 1:30 PM

__kmp_hidden_helper_initialize() always initializes all hidden threads at once. Right? In this case, you modifications make sense.

Right.

Let me be direct for a second so we don't end up here again in a few months:
The patch was on phab for ~1 year, nobody cared, this is a very common phenomena.
It also has been merged for weeks. I get the fact that we want a stable release
but showing up last minute just saying we need to pull stuff is *not* helpful
from an overall perspective. I say this especially because the number of people/
organizations that develop and upstream complex features is very limited. If you
want to benefit from such efforts you should be prepared to help, IMHO. That does
mean to do some testing and reviewing *before* the last release candidate is due.
Not to say this was not tested, but the capabilities are arguably different here.

Well, sort of. We reported segfaults on it pretty much when it landed,
https://reviews.llvm.org/D77609 has a comment from Jan 18 this year. It landed after:

The test still doesn't work ideally on amdgpu, but it no longer crashes, and
some of the print statements within the target region are seen.

which given it's host only code was a fairly clear sign that things weren't right.

Granted Ron & I didn't review or read the change (as far as I know), but then we're
mostly fighting to get amdgpu working and didn't anticipate a host-only change
breaking gpu offloading.

In the spirit of direct, I didn't care about this change until it landed and broke stuff.

protze.joachim accepted this revision.Mar 18 2021, 2:05 PM

LGTM after the latest update.

This revision is now accepted and ready to land.Mar 18 2021, 2:05 PM
This revision was automatically updated to reflect the committed changes.

In the spirit of direct, I didn't care about this change until it landed and broke stuff.

As far as I can tell, we merged Jan 25, Ron reported an issue March 15. In addition to the
review time, it was upstream for 8 weeks before you reported it broke stuff. Given that delay
I would not throw rocks at people claiming they did not do any testing. It's not like we don't
try to setup LLVM/OpenMP CI and such.

[AMD Public Use]

We have a continuous integration process that takes essentially trunk changes and moves them a month's worth at a time, into our production testing branch. This branch has 100's of hours of testing.
We recently moved from Dec 8 to Jan 26 commits, and did so about 2 weeks ago, and that is when we started to see the problems. Would we like to test the larger batches of changes sooner? Yes of course. We reported the problem fairly quickly after we saw the assert issue.

There are other companies who wait until LLVM releases and then move to integrating and testing that release branch source. These companies have yet to start testing llvm12, so they have not seen the patch in question.

Hope that provides a bit of clarity into the time lags that we see and will see moving from commit into product releases.

Ron

We reported the problem fairly quickly after we saw the assert issue.

I appreciate your report. Seriously. However, no one would like to tell us how to reproduce the bug. Even now this patch has already been merged, I still didn't get any reproducer (in any form) from whom reported the issue at the very beginning. I can get that we're approaching release, and we want a stable product. However, if nobody provides steps to reproduce bugs, and just asks to revert patch, we will probably NEVER have new features.

The AMD AOCC Compiler team reported to me this morning that they are able to reproduce the SPEC CPU performance regressions when the patch is present.

They are able to recover the lost performance when they set the two environment variables using SPEC confg file rules

preENV_LIBOMP_USE_HIDDEN_HELPER_TASK=OFF
preENV_LIBOMP_NUM_HIDDEN_HELPER_THREADS=0

which for a non speccpu program would be simply

export  LIBOMP_USE_HIDDEN_HELPER_TASK=OFF
export  LIBOMP_NUM_HIDDEN_HELPER_THREADS=0