This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][CUDA] Fix potential program crash caused by double free resources
ClosedPublic

Authored by tianshilei1992 on Mar 18 2022, 9:33 AM.

Details

Summary

As we mentioned in the code comments for function ResourcePoolTy::release,
at some point there could be two identical resources on the two sides of Next
mark. It is usually not an issue, unless the following case:

  1. Some resources are not returned.
  2. We need to iterate the pool and free the element.

That will cause double free, which is the case for event pool. Since we don't release
events hold by the data map, it can happen that the Next mark is not reset, and
we have two identical items in the pool. When the pool is destroyed, we will call
cuEventDestroy twice on the same event. In the best case, we can only observe
CUDA errors. In the worst case, it can cause internal failures in CUDART and further
crash.

This patch fixes the issue by tracking all resources that have been given using
an unordered_set. We don't remove it when a resource is returned. When the pool
is destroyed, we merge the pool (a vector) and the set. In this way, we can make
sure that the set contains all resources allocated from the device. We just need
to iterate the set and free the resource accordingly.

For now, only event pool is set to use it. Stream pool is not because we can make
sure all streams are returned when the plugin is destroyed.

Someone might be wondering, why don't we release all events hold in the data map.
That is because, plugins are determined to be destroyed *before* libomptarget.
If we can somehow make the plugin outlast libomptarget, life will be much
easier.

Diff Detail

Event Timeline

tianshilei1992 created this revision.Mar 18 2022, 9:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 18 2022, 9:33 AM
tianshilei1992 requested review of this revision.Mar 18 2022, 9:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 18 2022, 9:33 AM
tianshilei1992 added inline comments.Mar 18 2022, 9:44 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
302

This is the key part. Unluckily if the resources are not returned, the left one will never be overwritten.

Why don't we need the same for Streams?

ye-luo added a subscriber: ye-luo.Mar 18 2022, 12:32 PM
ye-luo added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
237

How about check in every allocated resource from runtime in ResourcesGiven and then use Resources only for pool management. In this way, ResourcesGiven is fully responsible for ownership.

rebase and fix comments

tianshilei1992 marked an inline comment as done.Mar 20 2022, 8:33 PM

Why don't we need the same for Streams?

Because we always sync the stream at the end of a target task, so it is guaranteed that streams are all released in user's code. However, this is not 100% correct. If we get a "dangling" nowait target task, the stream might not be released before the end of user code. Actually, even with this patch, that could still be a problem because we only have an implicit mandatory task wait when libomp is shut down. However, that function is called via __attribute__((destructor)), which means plugins have already been destroyed. That can cause a problem. I don't have a clear clue on how to solve it.

jdoerfert accepted this revision.Mar 25 2022, 9:35 AM

LG, I think this is clean.

This revision is now accepted and ready to land.Mar 25 2022, 9:35 AM
jdoerfert added inline comments.Mar 25 2022, 9:36 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
22

Nit: not needed.

remove unnecessary header

This revision was landed with ongoing or failed builds.Mar 25 2022, 7:49 PM
This revision was automatically updated to reflect the committed changes.