This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Update cached kernel handle when the function instance changes.
ClosedPublic

Authored by tra on Mar 20 2023, 11:25 AM.

Details

Summary

Fixes clang crash caused by a stale function pointer.
The bug has been present for a pretty long time, but we were lucky not to trigger it until D140663.

Diff Detail

Event Timeline

tra created this revision.Mar 20 2023, 11:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 20 2023, 11:25 AM
tra published this revision for review.Mar 20 2023, 11:29 AM
tra edited the summary of this revision. (Show Details)
tra added reviewers: yaxunl, dcastagna.
Herald added a project: Restricted Project. · View Herald TranscriptMar 20 2023, 11:29 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Thank you Artem!

clang/lib/CodeGen/CGCUDANV.cpp
1201

The different F is because sometimes at first we use an F with incomplete types as a key and later we use a different F (even if it's coming from the same function) with the instantiated types (usually after we encounter a statement where the size of the template parameter is necessary, like a function call).
Another option could be to always instantiate the complete type for F when we call GetOrCreateLLVMFunction (IIRC) if the function has a global attribute.

If you prefer this solution, the patch LGTM.

tra added inline comments.Mar 20 2023, 2:12 PM
clang/lib/CodeGen/CGCUDANV.cpp
1201

Instantiating a complete function when we normally would not sounds a bit too invasive and may buy us unintended side effects. E.g. I can see it potentially causing code explosion in some cases, where we previously could get by with just a function decl.

Fixing the cache locally maintains the global status quo which seems to be about the right balance between improving the situation, the effort involved, and the potential risk of the fix.

The way we track kernels could potentially be improved (I think it would be better to track them explicitly vs relying on observing the functions flying by), but that should be a separate patch. Right now I just want to fix the regression w/o affecting anything else.

tra updated this revision to Diff 506719.Mar 20 2023, 2:14 PM

Fixed a typo in the test.

Do you know whether the issue happens to HIP too? Thanks.

tra added a comment.Mar 20 2023, 3:23 PM

Do you know whether the issue happens to HIP too? Thanks.

Possibly, but I'm unable to reproduce the crash: https://godbolt.org/z/Gsx9sWG4v

The value cached by KernelHandles in HIP is the the manually created 'Var' and not the function itself. I don't think it ever changes outside of this function.

However, KernelStubs would still point to the original F, and that may potentially be left dangling. Perhaps I should always update both handles and stubs if F changes.

tra added a comment.Mar 20 2023, 4:16 PM

However, KernelStubs would still point to the original F, and that may potentially be left dangling. Perhaps I should always update both handles and stubs if F changes.

Yes, HIP is susceptibel to having the dangling pointer to F. What saves us is that KernelStubs happens to be used only *before* the cached functions are reinstantiated, so we never use those dangling pointers in case of HIP.
OK, looks like I'll need to fix HIP, too.

tra updated this revision to Diff 507091.Mar 21 2023, 12:46 PM

Improved kernel handle cache invalidation.

tra updated this revision to Diff 507093.Mar 21 2023, 12:52 PM

Deallocate replaced Var.

tra updated this revision to Diff 507094.Mar 21 2023, 12:55 PM

Cosmetic cleanup

tra updated this revision to Diff 507101.Mar 21 2023, 1:13 PM

Simplify invalidation logic. No need to replace HIP's handle variable.

tra added a comment.Mar 21 2023, 1:13 PM

I've updated the patch with a more robust version of the handle/stub invalidation.

yaxunl accepted this revision.Mar 21 2023, 2:45 PM

LGTM. Thanks!

This revision is now accepted and ready to land.Mar 21 2023, 2:45 PM
This revision was landed with ongoing or failed builds.Mar 21 2023, 3:36 PM
This revision was automatically updated to reflect the committed changes.