I am removing my objection with the understanding that we will either replace or enhance amdgpu-arch with the cross-architecture tool offload-arch as described in my comments above. Also, this patch changes amd toolchains to use amdgpu-arch. So it will not interfere with how I expect the cross-architecture tool to be used to greatly simplify openmp command line arguments.
Thu, Apr 15
Wed, Apr 14
Dependence on hsa is not necessary. The amdgpu and nvidia drivers both use PCI codes available in /sys . We should use architecture independent methods as much as possible.
I have two serious concerns with this tool . 1. It does not provide the infrastructure to identify runtime capabilities to satisfy requirements of a compiled image. 2. It is not architecturally neutral or extensible to other architectures. I have a different proposed tool called offload-arch . Here is the help text for offload-arch.
Jul 29 2020
This is all excellent feedback. Thank you.
I don't understand what I see on the godbolt link. So far, we have only tested with clang. We will test with gcc to understand the fail.
I will make the change to use numeric values for _DEVICE_ARCH and change "UNKNOWN" to some integer (maybe -1). The value _DEVICE_GPU is intended to be generational within a specific _DEVICE_ARCH.
To be clear, this is primarily for users or library writers to implement device specific or host-only code. This is not something that should be automatic. Users or library authors will opt-in with their own include. So maybe it does not belong in clang/lib/Headers.
As noted in the header comment, I expect the community to help keep this up to date as offloading technology evolves.
Jun 12 2020
LGTM, only one new auto in an existing code with 11 previous autos. Looks like consistent usage.
May 14 2020
Mar 30 2020
This was discussed on llvm-dev three years ago. Here is the thread.
Feb 6 2020
I will rework this patch to
- Try dlopen on relative library name first to for LD_LIBRARY_PATH search. If that fails, I will load using full path name.
- Reorganize the names arrays into a single array to avoid a counter.
Jan 29 2020
Aug 23 2018
I have a longer comment on header files, but let me first understand this patch.
Aug 22 2018
I like the idea of using an automatic include as a cc1 option (-include). However, I would prefer a more general automatic include for OpenMP, not just for math functions (clang_cuda_device_functions.h). Clang cuda automatically includes clang_cuda_runtime_wrapper.h. It includes other files as needed like clang_cuda_device_functions.h. Lets hypothetically call my proposed automatic include for OpenMP , clang_openmp_runtime_wrapper.h.
Jun 25 2018
Why not provide a specific list of --hip-device-lib= for VDI builds? I am not sure about defining functions inside headers instead of using a hip bc lib.
May 7 2018
I agree that George's RMW proposed code is correct. This was my first attempt at an RMW code. Maybe we should implement atomicMax as a device function in architecture-specific (e.g sm_30) device library. This way the code in loop.cu can remain just a call to atomicMax. Such an implementation would need an overloaded atomicMax.
Apr 4 2018
So , will the deviceRTLs/nvptx change? Instead of extern shared, what will it use for those data structures?
Apr 2 2018
Maybe my search is missing something, but the only place I see CUDARelocatableDeviceCode is in lib/Sema/SemaDeclAttr.cpp to allow for extern shared. How could this be causing slowness? I would think forcing extern to be global would be slower.
Feb 5 2018
Here my replys to the inline comments. Everything should be fixed in the next revision.
Feb 1 2018
Sorry, all my great inline comments got lost somehow. I am a newbie to Phabricator. I will try to reconstruct my comments.
Thanks to everyone for the reviews. I hope I replied to all inline comments. Since I sent this to Sam to post, we discovered a major shortcoming. As tra points out, there is a lot of cuda headers in the cuda sdk that are processed. We are able to override asm() expansions with #undef and redefine as an equivalent amdgpu component so the compiler never sees the asm(). I am sure we will need to add more redefines as we broaden our testing. But that is not the big problem. We would like to be able to run cudaclang for AMD GPUs without an install of cuda. Of course you must always install cuda if any of your targeted GPUs are NVidia GPUs. To run cudaclang without cuda when only non-NVidia gpus are specified, we need an open set of headers and we must replace the fatbin tools used in the toolchain. The later can be addressed by using the libomptarget methods for embedding multiple target GPU objects. The former is going to take a lot of work. I am going to be sending an updated patch that has the stubs for the open headers noted in clang_cuda_runtime_wrapper.h. They will be included with the CC1 flag -DUSE_OPEN_HEADERS__. This will be generated by the cuda driver when it finds no cuda installation and all target GPUs are not NVidia.
Dec 19 2016
Justin, the commonality between nvptx and amdgcn LLVM IR is exactly why I would like isGPU(). I actually do want to assume that "isGPU" <--> "isNVPTX || isAMDGCN".
I can email you a bigger patch from our development tree. I would rather not post it in public yet because it still needs some work. Here are two examples from this patch.
Thank you Justin, Yes, I plan to use this extensively in clang for common OpenMP code generation. But I don't have those patches ready yet.
isGPU() may also be used for compilation of cuda code to LLVM IR as alternative to isNVPTX(). I will discuss with google authors first.
I formatted to 80 lines. Thank you for your patience with a new contributor.
Formatted to 80 characters.