This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Introduce attribute to mark kernels without a language mode
ClosedPublic

Authored by jhuber6 on Dec 16 2022, 9:59 AM.

Details

Summary

We may want to be able to mark certain regions as kernels even without
being in an accepted CUDA or OpenCL language mode. This patch introduces
a new attribute limited to nvptx targets called nvptx_kernel which
will perform the same metadata action as the existing CUDA ones. This
closely mimics the behaviour of the amdgpu_kernel attribute. This
allows for making executable NVPTX device images without using an
existing offloading language model.

I was unsure how to do this, I could potentially re-use all the CUDA
attributes and just replace the CUDA language requirement with an
NVPTX architecture requirement. Also I don't know if I should add more
than just this attribute.

Diff Detail

Event Timeline

jhuber6 created this revision.Dec 16 2022, 9:59 AM
jhuber6 requested review of this revision.Dec 16 2022, 9:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 16 2022, 9:59 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
tra added inline comments.Dec 16 2022, 10:26 AM
clang/lib/CodeGen/TargetInfo.cpp
7362

How does AMDGPU track kernels? It may be a good opportunity to stop using metadata for this if we can use a better suited mechanism. E.g. a function attribute or a calling convention.

clang/lib/Sema/SemaDeclAttr.cpp
4872

I'm tempted to addAttr(CUDAGlobal) here, effectively making nvptx_kernel a target-specific alias for it, so we're guaranteed that they both are handled exactly the same everywhere.
On the other hand, it all may be moot -- without CUDA compilation mode, CUDAGlobal handling will be different in this compilation mode.

Can CUDAGlobal itself be allowed to be used as a target-specific attribute for NVPTX in C++ mode?

I think, if possible, we should ideally have only one attribute doing the job, even if it may have somewhat different use cases in CUDA vs C++ compilation modes.

jhuber6 added inline comments.Dec 16 2022, 10:31 AM
clang/lib/CodeGen/TargetInfo.cpp
7362

AMDGPU uses a calling convention, which is probably a better option. I don't know how this still gets reduced in the back-end though.

clang/lib/Sema/SemaDeclAttr.cpp
4872

Yeah that's what I was thinking. Right now we only parse and check all the CUDA attributes in the CUDA language mode. I could change it to allow them whenever we're compiling for the NVPTX architecture instead. I don't think for the vast majority it would have any significant effect.

tra added inline comments.Dec 16 2022, 10:36 AM
clang/lib/CodeGen/TargetInfo.cpp
7362

OK. Switching from metadata to a new calling convention would be nice, but it is likely a bit more complicated and can be handled separately if/when we decide to do it. It's not needed for your purposes.

clang/lib/Sema/SemaDeclAttr.cpp
4872

Let's give it a try.

jhuber6 updated this revision to Diff 483640.Dec 16 2022, 12:25 PM

Changing to use the same CUDA global attributes. This requires a few extra checks for whether or not we were in CUDA mode since previously it just assume any time we saw one of these globals we were in that mode. I added a different spelling as well just for consistency.

tra added a comment.Dec 16 2022, 12:54 PM

LGTM.

General question -- what happens now that the global and launch_bounds are target-specific as opposed to language-specific, if they happen to be used in a C++ compilation targeting x86? I assume they will still be ignored, right?

clang/include/clang/Basic/Attr.td
1199

Nice.

This reminded me that we have a project compiling CUDA, but targeting SPIR-V instead of NVPTX. It looks like this will likely break them. The project is out-of-tree, but I'd still need to figure out how to keep them working. I guess it would be easy enough to expand TargetNVPTX to TargetNVPTXOrSpirV. I'm mostly concerned about logistics of making it happen without disruption.

jhuber6 added inline comments.Dec 16 2022, 12:57 PM
clang/include/clang/Basic/Attr.td
1199

This might've broken more stuff after looking into it, I forgot that AMDGPU still uses the same CUDA attributes, and the host portion of CUDA also checks these. It would be nice if there was a way to say "CUDA" or "NVPTX", wondering if that's possible in the tablegen here.

I wonder whether we could not factorize some code/attribute/logic with AMDGPU or SYCL.
Is the use case to have for example CUDA+HIP+SYCL in the same TU and thus there is a need for different attributes

I wonder whether we could not factorize some code/attribute/logic with AMDGPU or SYCL.
Is the use case to have for example CUDA+HIP+SYCL in the same TU and thus there is a need for different attributes

It would probably be good to have the high level concept of a "kernel" be factored out since this is common between all the offloading languages. The actual implementation it gets lowered to would still need to be distinct since this usually gets turned into some magic bits stashed in the executable for the runtime to read. The use-case for this patch is simply to allow people to compile pure C/C++ code to the NVPTX architecture, but still be able to mark the necessary metadata for kernels and globals.

I've recently thought if we could just apply the same logic used for shared objects with GPU images, that is globals without hidden visibility would be considered __global__ and ones with hidden visibility would be considered __device__ in CUDA terms. I think the only thing preventing us from thinking of a kernel call as a dynamic symbol load is probably the launch parameters. But this is purely theoretical, I don't think we need to worry about moving away from offloading languages or anything.

There are already SYCL specific attributes: https://reviews.llvm.org/D60455

There are already SYCL specific attributes: https://reviews.llvm.org/D60455

We could potentially merge these all into some generic attribute since they all do the same thing on a conceptual level. The unique thing about the existing amdgpu_kernel and corresponding nvptx_kernel is that they don't rely on the language options like SYCL or CUDA. Though, semantically those are definitely involved because the kernel itself is only meaningful to whatever runtime is going to load it (e.g. CUDA or HSA) but we can probably consider that separately to the compilation itself and just think of these as calling conventions.

tschuett added a comment.EditedDec 18 2022, 10:17 AM

There are already SYCL specific attributes: https://reviews.llvm.org/D60455

We could potentially merge these all into some generic attribute since they all do the same thing on a conceptual level. The unique thing about the existing amdgpu_kernel and corresponding nvptx_kernel is that they don't rely on the language options like SYCL or CUDA. Though, semantically those are definitely involved because the kernel itself is only meaningful to whatever runtime is going to load it (e.g. CUDA or HSA) but we can probably consider that separately to the compilation itself and just think of these as calling conventions.

But then you are maybe mixing two concepts. kernel is source code or AST feature. nvptx or AMDGPU are command line flags.

CUDA, Sycl, nvtx, and AMDGPU are modes or calling conventions?

But then you are maybe mixing two concepts. kernel is source code or AST feature. nvptx or AMDGPU are command line flags.

CUDA, Sycl, nvtx, and AMDGPU are modes or calling conventions?

The way I understand it, the architecture determines the actual ISA for the code and the kernel metadata operates like a calling convention for whatever "OS" will be executing it. For example, for the triple amdgcn-amd-amdhsa we generate code for the amdgcn architecture and emit kernels such that the hsa runtime can call them. Similarly, for nvptx64-nvidia-cuda we emit code for nvptx64 and our kernels use the calling convention such that the cuda runtime can call them. I think the main question of this patch is if we can separate the cuda runtime from the CUDA language. That is, we don't need to be using the CUDA language to emit functions that the cuda runtime can call. So this is more or less thinking of these kernel calls as a calling convention for a runtime or operating system rather than as a language feature.

Calling convention is the right model here. Kernels are functions with a different calling convention to the 'normal' functions in a very literal sense. The calling convention modelling in clang is different to attribute handling and changing nvptx to it is probably invasive, though it seems to me it could be done incrementally.

I wouldn't suggest adding a nvptx_kernel calling convention to clang though, rather we could repurpose the amdgpu one to be gpu_kernel. Possibly spelled nvptx_kernel for the user but represented within clang as gpu_kernel.

Related, I think there's a spirv or opencl kernel representation in llvm for amdgpu, I would be interested in collapsing those and the openmp or hip annotation to a single thing if possible.

That's all medium term cleanup ideas, current patch looks good to me.

shangwuyao added inline comments.
clang/include/clang/Basic/Attr.td
1199

What's the plan here for keeping the SPIR-V and AMDGPU working? Would it work if we simply get rid of the TargetSpecificAttr<TargetNVPTX>?

jhuber6 added inline comments.Dec 19 2022, 8:54 AM
clang/include/clang/Basic/Attr.td
1199

Yeah, it would I'll need to update the patch. The best solution would be if there were a way to say "TargetNVPTX or LangOpts.CUDA". Not sure if that's possible in Tablegen. The previous diff I had worked fine, but we should definitely try to avoid rework.

Precommit CI found failures that look relevant to the patch.

We may want to be able to mark certain regions as kernels even without being in an accepted CUDA or OpenCL language mode.

Can you explain this a bit more? Under what circumstances would you want to do this?

Precommit CI found failures that look relevant to the patch.

We may want to be able to mark certain regions as kernels even without being in an accepted CUDA or OpenCL language mode.

Can you explain this a bit more? Under what circumstances would you want to do this?

Yeah, I need to work on this some more. A previous version worked fine but it duplicated some logic, I'm not sure if there's a good way to re-use the existing kernel logic without breaking some of the assumptions. The desire was to be able to emit a kernel that can be called externally via cross-compilation. E.g. clang foo.c --target=nvptx64-nvidia-cuda. The intended use-case was for testing experimental libc implementations using integration tests.

@tra would it be possible to go to the earlier version that simply duplicated a slight amount of logic to introduce the new and separate attribute nvptx_kernel? Overloading CUDA's device attribute is problematic because it's used and checked in several different contexts. I'd like to be able to simplify this code https://github.com/llvm/llvm-project/blob/main/libc/startup/gpu/nvptx/start.cpp.

jhuber6 updated this revision to Diff 508170.Mar 24 2023, 11:18 AM

Updating to simply add an entirely new attribute again. The existing
CUDAGlobal attribute does what we want, but it's also highly coupled with the
CUDA language. This made it pretty much impossible to find a way to re-use it
without breaking existing functionality. The amount of code duplicated is
minimal and this is required to be able to emit a callable kernel targeting
NVPTX directly. I'd like to use this for my ongoing GPU libc project so I'd
appreciate someone looking at this again.

tra accepted this revision.Mar 24 2023, 11:27 AM
This revision is now accepted and ready to land.Mar 24 2023, 11:27 AM
This revision was landed with ongoing or failed builds.Mar 24 2023, 12:42 PM
This revision was automatically updated to reflect the committed changes.