This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Wait for kernel prior to memory deallocation
AbandonedPublic

Authored by tianshilei1992 on Jul 22 2020, 7:27 PM.

Details

Summary

In the function target, memory deallocation and target_data_end is called
immediately returning from launching kernel. This might cause a race condition
that the corresponding memory is still being used by the kernel and a potential
issue that when the kernel starts to execute, its required data have already
been deallocated, especially when multiple kernels running concurrently. Since
nevertheless, we will block the thread issuing the target offloading at the end
of the target, we just added a synchronization before memory deallocation
to make sure the correctness.

Diff Detail

Event Timeline

tianshilei1992 created this revision.Jul 22 2020, 7:27 PM
jdoerfert accepted this revision.Jul 22 2020, 7:33 PM

LGTM. Typo in the commit message. Also shorten the commit title, e.g., wait for kernel prior to memory deallocation

This revision is now accepted and ready to land.Jul 22 2020, 7:33 PM

Updated based on comments

tianshilei1992 retitled this revision from [OpenMP] Fixed an issue that memory free might be called inappropriately when the kernel is still running to [OpenMP] Wait for kernel prior to memory deallocation.Jul 22 2020, 7:40 PM
tianshilei1992 edited the summary of this revision. (Show Details)
This revision was automatically updated to reflect the committed changes.
ye-luo added a subscriber: ye-luo.Jul 22 2020, 7:59 PM

Does it mean the D2H will always run synchronously after this change?
Does it also mean that target_data_end should be split into data transfer and data free parts?

tianshilei1992 reopened this revision.EditedJul 22 2020, 8:04 PM

Does it mean the D2H will always run synchronously after this change?
Does it also mean that target_data_end should be split into data transfer and data free parts?

That is good point. There is a critical issue in this patch. D2H is still async but the synchronization is lost.

This revision is now accepted and ready to land.Jul 22 2020, 8:04 PM

Indeed, target_data_begin should be split as well. cudaMalloc blocks the whole device. Alternating cudaMalloc and transfer only makes the whole process further slower. Better to make all the allocation and then start queuing the transfer.

Fixed an issue that target may return before D2H is still in progress

tianshilei1992 requested review of this revision.Jul 22 2020, 9:17 PM
tianshilei1992 edited the summary of this revision. (Show Details)
ye-luo added a comment.EditedJul 22 2020, 9:46 PM

OK. It is less broken now.
target_data_end still does Device.deallocTgtPtr and needs a sync before it.
To fully fix this issue, target_data_end must be spitted.

ye-luo accepted this revision.Jul 22 2020, 9:48 PM
This revision is now accepted and ready to land.Jul 22 2020, 9:48 PM
This comment was removed by tianshilei1992.
tianshilei1992 abandoned this revision.Jul 28 2020, 12:42 PM

Will get it fix in another patch.