Given this CUDA program,
template<typename Lambda> __global__ void run_this(Lambda lambda) { lambda(); }

// Minimal stand-ins for std::remove_reference and std::move.
template<typename T> struct remove_reference { using type = T; };
template<typename T> struct remove_reference<T&> { using type = T; };

template<typename T>
constexpr typename remove_reference<T>::type&& move(T&& t) {
  return static_cast<typename remove_reference<T>::type&&>(t);
}

int main() {
  // The moved no-op lambda before the kernel launch is what triggers the
  // name mismatch described below.
  auto foo = move([](){});
  run_this<<<1, 1, 1>>>([]() __device__ { printf("Hello World\n"); });
  return 0;
}
the assertion at the top of CGNVCUDARuntime::emitDeviceStub will fail. In release builds the effect is simply a cudaErrorInvalidDeviceFunction error at run time. The reason is that the mangled names of the device stub and the actual device-side function differ: the stub is called _Z8run_thisIZ4mainE3$_1EvT_, while the device function is _Z8run_thisIZ4mainE3$_0EvT_. The difference comes down to the anonymous struct ID that is maintained and assigned by the MangleContext. It appears that the MangleContext used for the latter never has getAnonymousStructId called for the moved no-op lambda, so the kernel's lambda ends up with an ID of 0.
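To illustrate the mechanism, here is a small standalone sketch (an approximation, not actual Clang code) of how such a per-context anonymous ID counter behaves: the context that produces the stub name has already numbered the moved no-op lambda, so the kernel lambda becomes $_1, while a second context that is only ever asked about the kernel lambda numbers it $_0.

#include <cstdio>
#include <map>

// Illustrative stand-in for the per-MangleContext counter behind
// getAnonymousStructId: each context numbers otherwise-unnamed entities
// in the order it first encounters them.
struct AnonIdSource {
  std::map<const void *, unsigned> Ids;
  unsigned getAnonymousId(const void *Entity) {
    // A previously unseen entity gets the next free ID; later queries
    // return the ID assigned on first contact.
    return Ids.emplace(Entity, static_cast<unsigned>(Ids.size())).first->second;
  }
};

int main() {
  int fooLambda, kernelLambda; // stand-ins for the two closure types in main()

  AnonIdSource stubContext;    // sees both lambdas in source order
  stubContext.getAnonymousId(&fooLambda);                              // -> 0 ($_0)
  unsigned stubId = stubContext.getAnonymousId(&kernelLambda);         // -> 1 ($_1)

  AnonIdSource deviceNameContext; // only ever asked about the kernel lambda
  unsigned deviceId = deviceNameContext.getAnonymousId(&kernelLambda); // -> 0 ($_0)

  std::printf("stub ID: %u, device-side ID: %u\n", stubId, deviceId);
  return 0;
}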
My proposed solution is to simply share the MangleContext used by the CGNVCUDARuntime and CGCXXABI code generators. For this I've added a new ASTContext::getSharedMangleContext function that memoizes created manglers for the given target ABI. From looking at MangleContext, it doesn't seem to me like sharing one instance could cause any issues, but then again, I really don't know much about Clang's internals.
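For concreteness, a rough sketch of what such an accessor could look like (the member name and caching strategy here are assumptions for illustration; the actual patch memoizes per target ABI rather than holding a single instance):

// Hypothetical sketch only, not the actual patch: every code generator that
// asks the ASTContext for a mangler gets the same MangleContext, so the
// anonymous struct / lambda numbering stays consistent across them.
MangleContext *ASTContext::getSharedMangleContext() {
  // SharedMangleContext would be a new std::unique_ptr<MangleContext> member
  // of ASTContext (member name assumed for illustration).
  if (!SharedMangleContext)
    SharedMangleContext.reset(createMangleContext());
  return SharedMangleContext.get();
}

CGNVCUDARuntime and the CGCXXABI would then both obtain their mangler through this accessor instead of each creating their own, so both lambdas are numbered by the same counter.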
Of course, an alternative solution would be to make sure that getAnonymousStructId is always called for both lambdas (and in the correct order), but again, I don't really know why that is not happening in the first place.