This is an archive of the discontinued LLVM Phabricator instance.

[HIP] Diagnose compiling kernel without offload arch
Needs Review · Public

Authored by yaxunl on Apr 15 2021, 5:11 AM.

Details

Reviewers
tra
aaron.ballman
Summary

AMDGPU does not have a common processor (GPU arch). A HIP kernel
must be compiled for a specific processor in order to be launched
on that processor.

However, we cannot simply diagnose a missing --offload-arch in the
clang driver, since a valid HIP program may contain no kernels; such
a program can be compiled without specifying an offload arch and
executed on machines without an AMD GPU.

Therefore, only HIP programs containing kernels should be diagnosed
when compiled without an offload arch.

This patch changes the clang driver so that when no offload arch is
specified for HIP, no target CPU is passed to clang -cc1. If the HIP
program contains a kernel, the frontend diagnoses it as a fatal error,
so that the diagnostic is emitted only once. This way, HIP programs
without kernels may still be compiled without an offload arch, whereas
HIP programs with kernels are rejected when no offload arch is given.
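For illustration, a minimal TU like the following (a hypothetical sketch, not taken from the patch; the kernel name and the gfx906 arch are arbitrary examples) would be rejected under this patch unless an explicit --offload-arch is given, because it defines a kernel:

```cuda
// kernel.hip -- defines a __global__ kernel, so under this patch,
// compiling it without an explicit offload arch
// (e.g. --offload-arch=gfx906) is diagnosed as a fatal error.
#include <hip/hip_runtime.h>

__global__ void vector_set(int *out, int value) {
  out[blockIdx.x * blockDim.x + threadIdx.x] = value;
}
```

A TU containing only host code, by contrast, would still compile without any offload arch.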

Diff Detail

Event Timeline

yaxunl created this revision. Apr 15 2021, 5:11 AM
yaxunl requested review of this revision. Apr 15 2021, 5:11 AM

Drive-by comment on the diagnostic wording.

clang/include/clang/Basic/DiagnosticSemaKinds.td
8260
yaxunl updated this revision to Diff 337883. Apr 15 2021, 1:22 PM
yaxunl marked an inline comment as done.

Revised the error message per Aaron's comments.

tra added a comment. Apr 15 2021, 2:37 PM

Enforcing explicit GPU target makes sense.

However, I think that singling out __global__ functions as the trigger is not sufficient for the intended purpose.

If we can't generate a usable GPU-side binary, then we should produce an error whenever we need to generate *anything* during GPU-side compilation.
Using __global__ as a proxy would not catch some use cases and may produce false positives in others.

E.g. what if I have a TU which only has a __device__ int var = 42; combined with host-side code to memcpy to/from it? It would still be valid, if not very useful, code, but it would still suffer from the runtime being unable to load it on a GPU unless that variable is in a GPU binary compiled for a valid target.
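That scenario might look like the following (a hypothetical sketch; the function name is made up, and hipMemcpyFromSymbol/HIP_SYMBOL are standard HIP runtime APIs used here by way of example):

```cuda
#include <hip/hip_runtime.h>

// Device-side variable only -- no __global__ kernel anywhere in this TU.
__device__ int var = 42;

int read_device_var() {
  int host_val = 0;
  // Host-side copy from the device variable. Loading this TU's GPU
  // binary still fails at runtime if it was not compiled for a valid
  // target, even though the TU defines no kernel.
  hipMemcpyFromSymbol(&host_val, HIP_SYMBOL(var), sizeof(int));
  return host_val;
}
```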

__device__ functions in TUs compiled with -fgpu-rdc would have a similar problem. They would eventually be linked into a GPU binary, which will be useless if it's not compiled for the correct GPU. Granted, __device__ functions eventually need to be called from a kernel, so we would error out on a __global__ *somewhere*, but that misses the problem when such a TU never reaches the linking stage (e.g. maybe the user wants to link it at runtime).
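The -fgpu-rdc case could be sketched like this (hypothetical file name; the compile command shown in the comment is an illustrative invocation, not from the patch):

```cuda
// util.hip -- contains only a __device__ function, no kernels.
// Compiled separately with relocatable device code, e.g.:
//   clang++ -x hip -fgpu-rdc -c util.hip
// The resulting object is only usable if it was built for the GPU it
// will eventually be linked against and loaded on -- yet with a
// __global__-only check, this TU would compile without complaint.
__device__ int square(int x) { return x * x; }
```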

clang/include/clang/Basic/DiagnosticSemaKinds.td
8260

How about compiling a file with __device__ functions with -fgpu-rdc? If a kernel with no-arch is an error, then this should be an error, too.

clang/lib/Sema/SemaDeclAttr.cpp
4431

Will this fire if we have an uninstantiated kernel template?

clang/test/SemaCUDA/kernel-no-gpu.cu
7

We'll need a few more test cases.

E.g. these should be fine.

template <typename T> __global__ void kernel(T arg) {}
template <typename T> __global__ void kernel(T arg);