This is an archive of the discontinued LLVM Phabricator instance.

Fix a race in shutdown when tasking is used
ClosedPublic

Authored by tlwilmar on Jan 5 2017, 1:22 PM.

Details

Summary

Jonas Hahnfeld reported a bug in shutdown in the presence of tasking.

This change fixes a race in shutdown code when threads are being reaped. Threads spinning in fork barrier and searching for tasks to steal may identify other threads as potential victims to steal from. The other threads may have already been reaped.

The fix creates a simple flag on the threads that lets them indicate that they are in a reapable state when shutdown is happening. The shutdown code then forces any threads out of the fork barrier and then waits until all the threads are reapable, before reaping any of them.

Diff Detail

Repository
rL LLVM

Event Timeline

tlwilmar updated this revision to Diff 83290.Jan 5 2017, 1:22 PM
tlwilmar retitled this revision from to Fix a race in shutdown when tasking is used.
tlwilmar updated this object.
tlwilmar set the repository for this revision to rL LLVM.
tlwilmar added a subscriber: openmp-commits.
Hahnfeld edited edge metadata.Jan 6 2017, 12:30 AM

Thanks, this patch seems to solve the problem!

runtime/src/kmp_runtime.cpp
4007–4012 ↗(On Diff #83290)

Do we need this here or would it be enough to have the flag completely handled in __kmp_execute_tasks_template?

I don't know whether that would create a race on th_reap_state. If __kmp_execute_tasks_template is not guaranteed to be called at least once before a barrier finishes, why aren't there problems with multiple parallel regions? Each thread will have th.th_reap_state = KMP_SAFE_TO_REAP at the end of the first parallel region...

runtime/src/kmp_wait_release.h
215–224 ↗(On Diff #83290)

Resetting to th.th_reap_state = KMP_SAFE_TO_REAP could then be done at the end of __kmp_execute_tasks_template

Hi Jonas,
There are probably numerous ways of doing this. I answered your comments with why I did it this way.
Thanks!
Terry

runtime/src/kmp_runtime.cpp
4007–4012 ↗(On Diff #83290)

kmp_execute_tasks_template may not be called by all threads, and it may be called multiple times by individual threads, so it's often premature to set the flag to SAFE inside. kmp_initialize_info will be called to reset th_reap_state for each thread.

runtime/src/kmp_wait_release.h
215–224 ↗(On Diff #83290)

As mentioned above, we want to avoid prematurely setting the thread as safe to reap. Note that the cases in which we set the thread as safe to reap are when 1) no tasks have been encountered by any threads; 2) the task team is no longer active; 3) the current thread's task team is NULL. The case inside of __kmp_execute_tasks_template only amounts to "this thread couldn't find any more tasks after randomly searching for some".

Hahnfeld added inline comments.Jan 7 2017, 8:02 AM
runtime/src/kmp_runtime.cpp
4007–4012 ↗(On Diff #83290)

__kmp_initialize_info is not called if a hot team is reused with either the same or a lower number of threads.

int i;
for (i = 0; i < 2; i++) {
	#pragma omp parallel num_threads(2)
	{
		#pragma omp single nowait
		#pragma omp task
		{ printf("Executed by thread #%d!\n", omp_get_thread_num()); }
	}
}

with $ KMP_F_DEBUG=10 ./crash2 3>&1 1>&2 2>&3 | grep -E __kmp_initialize_info1 (sorry for the pipes!)

__kmp_initialize_info1: T#0:0 this_thread=0x60abc0 curtask=(nil)
__kmp_initialize_info1: T#1:1 this_thread=0x617f00 curtask=(nil)
__kmp_initialize_info1: T#0:0 this_thread=0x60abc0 curtask=0x607980
__kmp_initialize_info1: T#1:1 this_thread=0x617f00 curtask=0x610480
Executed by thread #1!
Executed by thread #1!
Finished parallel regions!
4348 ↗(On Diff #83290)

This function is called when a team is reused, maybe we have to add it here?

runtime/src/kmp_wait_release.h
215–224 ↗(On Diff #83290)

Ah, all right, I forgot that returning from __kmp_execute_tasks_template does not mean that all tasks are finished!

tlwilmar added inline comments.Jan 12 2017, 12:45 PM
runtime/src/kmp_runtime.cpp
4007–4012 ↗(On Diff #83290)

You're right... but... I think that we need to set to NOT SAFE whenever we come out of the spin loop in order to reset before both fork and join barriers. However, that impacts how we free and reap the threads. I'll have to tinker with this a bit.

Hahnfeld requested changes to this revision.Jan 31 2017, 4:06 AM
This revision now requires changes to proceed.Jan 31 2017, 4:06 AM
tlwilmar updated this revision to Diff 86519.Jan 31 2017, 4:06 PM
tlwilmar edited edge metadata.

Reap state does not need to be reset after each barrier. If thread attempts to execute tasks, it will be set to NOT SAFE to reap. It only matters in the spin at the fork barrier after shutdown is triggered. Master thread now waits for ALL threads to reach SAFE state before proceeding to clean anything up.

Hahnfeld accepted this revision.Jan 31 2017, 11:40 PM

To write down how I think this works:

  1. No worker thread can set th_reap_state = KMP_NOT_SAFE_TO_REAP after the master thread has passed the barrier.
  2. So master thread waits for all worker threads to finish before reaping.
  3. If all threads have finished, none of them will try to steal so all can be safely reaped.

If that's the case then LGTM!

runtime/src/kmp_runtime.cpp
5267 ↗(On Diff #86519)

I think you can swap the loop and this if statement which does not depend on the loop iteration? (see also line 5284)

5276 ↗(On Diff #86519)

Please reindent this to make it clear

This revision is now accepted and ready to land.Jan 31 2017, 11:40 PM
tlwilmar updated this revision to Diff 86674.Feb 1 2017, 10:56 AM

Jonas -- made the changes you requested (and removed second 'if' checking tasking mode).

This revision was automatically updated to reflect the committed changes.