In the function target, memory deallocation and target_data_end is called
immediately returning from launching kernel. This might cause a race condition
that the corresponding memory is still being used by the kernel and a potential
issue that when the kernel starts to execute, its required data have already
been deallocated, especially when multiple kernels running concurrently. Since
nevertheless, we will block the thread issuing the target offloading at the end
of the target, we just added a synchronization before memory deallocation
to make sure the correctness.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
LGTM. Typo in the commit message. Also shorten the commit title, e.g., wait for kernel prior to memory deallocation
Does it mean the D2H will always run synchronously after this change?
Does it also mean that target_data_end should be split into data transfer and data free parts?
That is good point. There is a critical issue in this patch. D2H is still async but the synchronization is lost.
Indeed, target_data_begin should be split as well. cudaMalloc blocks the whole device. Alternating cudaMalloc and transfer only makes the whole process further slower. Better to make all the allocation and then start queuing the transfer.
OK. It is less broken now.
target_data_end still does Device.deallocTgtPtr and needs a sync before it.
To fully fix this issue, target_data_end must be spitted.