This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Implement task record and replay mechanism
ClosedPublic

Authored by yuchenle on Mar 22 2023, 9:29 AM.

Details

Summary

This patch implements the "task record and replay" mechanism. The idea is to be able to store tasks and their dependencies in the runtime so that we do not pay the cost of task creation and dependency resolution for future executions. The objective is to improve fine-grained task performance, both for those from "omp task" and "taskloop".

The entry point of the recording phase is kmpc_start_record_task, and the end of record is triggered by kmpc_end_record_task.

Tasks encapsulated between a record start and a record end are saved, meaning that the runtime stores their dependencies and structures, referred to as TDG, in order to replay them in subsequent executions. In these TDG replays, we start the execution by scheduling all root tasks (tasks that do not have input dependencies), and there will be no involvement of a hash table to track the dependencies, yet tasks do not need to be created again.

At the beginning of kmpc_start_record_task, we must check if a TDG has already been recorded. If yes, the function returns 0 and starts to replay the TDG by calling kmp_exec_tdg; if not, we start to record, and the function returns 1.

An integer uniquely identifies TDGs. Currently, this identifier needs to be incremented manually in the source code. Still, depending on how this feature would eventually be used in the library, the caller function must do it; also, the caller function needs to implement a mechanism to skip the associated region, according to the return value of __kmpc_start_record_task.

Diff Detail

Event Timeline

yuchenle created this revision.Mar 22 2023, 9:29 AM
yuchenle created this object with edit policy "Administrators".
Herald added a project: Restricted Project. · View Herald TranscriptMar 22 2023, 9:29 AM
yuchenle requested review of this revision.Mar 22 2023, 9:29 AM
yuchenle changed the edit policy from "Administrators" to "All Users".Mar 22 2023, 9:30 AM

Adding some initial comments

openmp/runtime/src/kmp.h
2485

It may be better to expose these as env variables that can be configured by the user

2497

This should probably use doxygen comments ///

2498–2506

Remember to run clang-format on the changes through the git hook

2513

Be consistent with the TDG in upper case

// Flags related to a TDG
openmp/runtime/src/kmp_taskdeps.cpp
224

I believe KMP_ASSERT instead of printf is better. Others, please advice.

293

This should be if (task) { with a leading space before the bracket. But clang-format can help with these

openmp/runtime/src/kmp_taskdeps.h
98

Remove commented code that's not used.

105

You may be able to use the macros in kmp_debug.h for these messages

openmp/runtime/src/kmp_tasking.cpp
3470

Not needed

yuchenle updated this revision to Diff 507427.Mar 22 2023, 10:41 AM

Changes according to some of @josemonsalve2 's comments

Please upload the patch with full context.

openmp/runtime/src/kmp.h
2489

Use a (inline) function. No need for macros.

rneveu added a subscriber: rneveu.Mar 23 2023, 6:50 AM
yuchenle updated this revision to Diff 508024.Mar 24 2023, 3:34 AM
yuchenle marked 4 inline comments as done.

changed TDG_RECORD macro to inlined function
applied clang-format where it was missed
changed TDG tests, so that they now use clangxx instead of clang, and deleted unnecessary "include" in them

josemonsalve2 added inline comments.Mar 24 2023, 5:04 PM
openmp/runtime/src/kmp_tasking.cpp
1221–1223

It may be better to use and instead of nested if
if (is_taskgraph && __kmp_track_children_task(taskdata) && taskdata->td_taskgroup)

yuchenle updated this revision to Diff 508516.Mar 27 2023, 1:03 AM
yuchenle marked 5 inline comments as done.

Deleted include <new> in tasking.cpp which was not used
Deleted global variable kmp_max_nesting which is not used in this commit
Merge nested if as suggested
Simplified initialization of is_taskgraph in

__kmp_task_finish

I think it's better to guard the entire related code with macro.

openmp/runtime/src/kmp_global.cpp
573

add an extra line to the end

openmp/runtime/src/kmp_tasking.cpp
19

unrelated change

I think it's better to guard the entire related code with macro.

What do you mean by this Shilei? To disable this feature?

I think it's better to guard the entire related code with macro.

What do you mean by this Shilei? To disable this feature?

Like a opt-in feature.

I think it's better to guard the entire related code with macro.

What do you mean by this Shilei? To disable this feature?

Like a opt-in feature.

Why do you think so? What's the harm of leaving it enable as this is just an API function? Are you saying this to save space in the structs?

I think it's better to guard the entire related code with macro.

What do you mean by this Shilei? To disable this feature?

Like a opt-in feature.

Why do you think so? What's the harm of leaving it enable as this is just an API function? Are you saying this to save space in the structs?

It's not just an API function. It contains many runtime checks which can potentially compromise the performance for users that don't need the feature. I'd prefer to take it similar to OMPT.

This comment was removed by josemonsalve2.

I think it's better to guard the entire related code with macro.

What do you mean by this Shilei? To disable this feature?

Like a opt-in feature.

Why do you think so? What's the harm of leaving it enable as this is just an API function? Are you saying this to save space in the structs?

It's not just an API function. It contains many runtime checks which can potentially compromise the performance for users that don't need the feature. I'd prefer to take it similar to OMPT.

Got it. It can produce a performance degradation in regular tasks, even when not used. I think that's a fair idea.

@yuchenle have you measure the overhead of this in regular tasks when no recording is used? This is specially important for you guys that use fine grain tasking

yuchenle updated this revision to Diff 509278.Mar 29 2023, 3:09 AM
yuchenle marked an inline comment as done.

NUM_TDG_LIMIT is now an env variable KMP_MAX_TDGS.
Reverted a few changes unrelated to task record & replay.

yuchenle marked 3 inline comments as done.Apr 3 2023, 2:14 AM

I think it's better to guard the entire related code with macro.

What do you mean by this Shilei? To disable this feature?

Like a opt-in feature.

Why do you think so? What's the harm of leaving it enable as this is just an API function? Are you saying this to save space in the structs?

It's not just an API function. It contains many runtime checks which can potentially compromise the performance for users that don't need the feature. I'd prefer to take it similar to OMPT.

Got it. It can produce a performance degradation in regular tasks, even when not used. I think that's a fair idea.

@yuchenle have you measure the overhead of this in regular tasks when no recording is used? This is specially important for you guys that use fine grain tasking

Sorry for the delay. I was trying to generate some data. Hopefully, I will write some scripts to automate this process in the future, so that anyone (including me) can test this patch's performance impact with ease.
I ran Heat propagation simulation (https://github.com/yuchenle/tdg-benchs/tree/master/heat) on an exclusive node of Marenostrum 4 with different granularities and numbers of threads. So far, according to the results (https://www.dropbox.com/s/jur1qrftmw2epvk/LLVM%20RR%20Perf.xlsx?dl=0) the performance impact is small to unnoticeable.
Though @Munesanz (Adrian Munera) and I agreed on including this patch within a macro to exclude performance concerns.
I will update the patch : )

yuchenle updated this revision to Diff 511100.Apr 5 2023, 7:53 AM

Shielding changes within a macro OMPX_TASKGRAPH. This macro can be defined as 1 or TRUE at CMAKE configuration time via LIBOMP_OMPX_TASKGRAPH. The default value of it is FALSE.

protze.joachim added a subscriber: protze.joachim.EditedApr 5 2023, 8:55 AM

Sorry for the delay. I was trying to generate some data. Hopefully, I will write some scripts to automate this process in the future, so that anyone (including me) can test this patch's performance impact with ease.
I ran Heat propagation simulation (https://github.com/yuchenle/tdg-benchs/tree/master/heat) on an exclusive node of Marenostrum 4 with different granularities and numbers of threads. So far, according to the results (https://www.dropbox.com/s/jur1qrftmw2epvk/LLVM%20RR%20Perf.xlsx?dl=0) the performance impact is small to unnoticeable.

What is the impact to taskbench in the EPCC OpenMP micro benchmark?

Similarly, what is the impact to SPEC OMP 2012 kdtree?

randreshg added inline comments.Apr 12 2023, 7:42 PM
openmp/runtime/test/tasking/omp_record_replay.cpp
22

I think the test will fail if you don't guarantee that this line will be executed atomically.

yuchenle marked an inline comment as done.Apr 13 2023, 6:32 AM
yuchenle added inline comments.
openmp/runtime/test/tasking/omp_record_replay.cpp
22

I think the test will fail if you don't guarantee that this line will be executed atomically.

You are right, if the TDG is asynchronous. However, in the current implementation, TDG execution is synchronous, using taskgroup. Plus, there is only one node in the TDG so there is no contention. Only one thread can access to "num_exec" at any given time : )

yuchenle marked an inline comment as done.Apr 24 2023, 2:12 AM

Sorry for the delay. I was trying to generate some data. Hopefully, I will write some scripts to automate this process in the future, so that anyone (including me) can test this patch's performance impact with ease.
I ran Heat propagation simulation (https://github.com/yuchenle/tdg-benchs/tree/master/heat) on an exclusive node of Marenostrum 4 with different granularities and numbers of threads. So far, according to the results (https://www.dropbox.com/s/jur1qrftmw2epvk/LLVM%20RR%20Perf.xlsx?dl=0) the performance impact is small to unnoticeable.

What is the impact to taskbench in the EPCC OpenMP micro benchmark?

Similarly, what is the impact to SPEC OMP 2012 kdtree?

Hi, I had time to generate some numbers with EPCC microbenchmarks. The results are reported in the same dropbox excel spreadsheet, tab "EPCC chart".
vanilla libomp time is a build with OMPX_TASKGRAPH=FALSE and RR is built with OMPX_TASKGRAPH=1, with the branches introduced in this patch.
The testbed is the same as my previous comment, a node of Manorestrum 4.

baodi added a subscriber: baodi.Apr 24 2023, 12:55 PM
openmp/runtime/test/tasking/omp_record_replay_deps.cpp
2

I think you will need to guard these newly added test cases such that they will only be executed if LIBOMP_OMPX_TASKGRAPH is enabled; otherwise it might cause unexpected failures.

yuchenle updated this revision to Diff 518646.May 2 2023, 12:30 AM

Tests added for record & replay mechanism now require LIBOMP_OMPX_TASKGRAPH=TRUE or =1 CMAKE variable when building. These tests are not run by default to avoid failures, as suggested by @tianshilei1992 .

This revision is now accepted and ready to land.May 12 2023, 10:19 AM