This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][OMPT] thread_num determination during execution of nested serialized parallel regions
ClosedPublic

Authored by vladaindjic on Sep 29 2021, 5:04 AM.

Details

Summary

__ompt_get_task_info_internal function is adapted to support thread_num
determination during the execution of multiple nested serialized
parallel regions enclosed by a regular parallel region.

Consider the following program that contains parallel region R1 executed
by two threads. Let the worker thread T of region R1 executes serialized
parallel regions R2 that encloses another serialized parallel region R3.
Note that the thread T is the master thread of both R2 and R3 regions.

Assume that __ompt_get_task_info_internal function is called with the
argument “ancestor_level == 1” during the execution of region R3.
The function should determine the “thread_num” of the thread T inside
the team of region R2, whose implicit task is at level 1 inside the
hierarchy of active tasks. Since the thread T is the master thread of
region R2, one should expected that “thread_num” takes a value 0.
After the while loop finishes, the following stands: “lwt != NULL”,
“prev_lwt == NULL”, “prev_team” represents the team information about
the innermost serialized parallel region R3. This results in executing
the assignment “thread_num = prev_team->t.t_master_tid”. Note that
“prev_team->t.t_master_tid” was initialized at the moment of
R2’s creation and represents the “thread_num” of the thread T inside
the region R1 which encloses R2. Since the thread T is the worker thread
of the region R1, “the thread_num” takes value 1, which is a contradiction.

This patch proposes to use “lwt” instead of “prev_lwt” when determining
the “thread_num”. If “lwt” exists, the task at the requested level belongs
to the serialized parallel region. Since the serialized parallel region
is executed by one thread only, the “thread_num” takes value 0.

Similarly, assume that __ompt_get_task_info_internal function is called
with the argument “ancestor_level == 2” during the execution of region R3.
The function should determine the “thread_num” of the thread T inside the
team of region R1. Since the thread is the worker inside the region R1,
one should expected that “thread_num” takes value 1. After the loop finishes,
the following stands: “lwt == NULL”, “prev_lwt != NULL”, “prev_team” represents
the team information about the innermost serialized parallel region R3.
This leads to execution of the assignment “thread_num = 0”, which causes
a contradiction.

Ignoring the “prev_lwt” leads to executing the assignment
“thread_num = prev_team->t.t_master_tid” instead. From the previous explanation,
it is obvious that “thread_num” takes value 1.

Note that the “prev_lwt” variable is marked as unnecessary and thus removed.

This patch introduces the test case which represents the OpenMP program
described earlier in the summary.

Diff Detail

Event Timeline

vladaindjic created this revision.Sep 29 2021, 5:04 AM
vladaindjic requested review of this revision.Sep 29 2021, 5:04 AM
Herald added a project: Restricted Project. · View Herald Transcript

Overall looks good to me besides one inline comment.
I successfully tested with gcc/11, icc/19 and latest clang.

openmp/runtime/test/ompt/parallel/nested_lwt_thread_num.c
88

There is no guaranteed ordering of implicit task begin on primary thread and the output on the worker thread.
Adding a barrier at the beginning of the parallel region should make the test more robust.

[OpenMP][OMPT] Barrier has been added inside nested_lwt_thread_num.c test case.

As @protze.joachim proposed, the barrier has been added at the beginning of the
regular (the outermost) parallel region.

This revision is now accepted and ready to land.Oct 25 2021, 4:52 AM
vladaindjic marked an inline comment as done.Oct 25 2021, 5:11 AM

@protze.joachim Thank you for the revision. Could you please land this patch for me?

openmp/runtime/test/ompt/parallel/nested_lwt_thread_num.c
88

@protze.joachim Thanks for the comment. As you proposed, the explicit barrier has been added at the beginning of the parallel region. Does the test case look complete now?