User Details
- User Since: Aug 22 2016, 7:51 AM
Apr 24 2023
It's ugly, but to avoid requiring end-users to set LD_LIBRARY_PATH (they may need LD_LIBRARY_PATH for their own applications), we will modify the compiler installation with these bash commands, where _INSTALL_DIR is the installation directory.
Apr 21 2023
Apr 19 2023
Apr 5 2023
Mar 23 2023
This looks good. Please merge. I found it very useful, especially in the context of the other temp files generated after LLVM linking and optimization in the offload driver. For example, just listing the temp files with ls -ltr provides a rough visual ordering of the compilation flow.
Oct 27 2022
4th update to original D136631 review. Changes:
- Remove the unnecessary thread sync following the LDS reduction.
- Replace the bool LDS __is_last_team with an LDS copy of the team counter value returned from the atomic increment; this removes a few more scalar registers (see the sketch after this list). Performance and resource utilization have noticeably improved across all updates to this review.
- Rename the parameter inival to "rnv" (Reduction Null Value) for consistency.
- Improve the header docs and the comment block for the main function _xteamr_reduction.
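For reference, the counter-based last-team check mentioned above follows the usual pattern sketched below. This is only a minimal CUDA sketch with made-up names (teams_done, partials, sum_sketch), not the actual _xteamr_reduction interface; it assumes the kernel is launched with gridDim.x == num_teams and teams_done zeroed beforehand.

```
// Minimal sketch of counter-based "last team" detection (illustrative names).
__device__ unsigned int teams_done = 0; // must be zeroed before launch

__global__ void sum_sketch(const double *in, size_t n, double *partials,
                           double *out, unsigned int num_teams) {
  __shared__ unsigned int prev_done; // LDS copy of the counter value

  // Each team serially reduces its slice (kept simple for the sketch).
  if (threadIdx.x == 0) {
    double s = 0.0;
    for (size_t i = blockIdx.x; i < n; i += num_teams)
      s += in[i];
    partials[blockIdx.x] = s;
    __threadfence();                                    // publish the partial
    prev_done = atomicInc(&teams_done, num_teams - 1);  // count this team
  }
  __syncthreads();

  // The team that saw num_teams - 1 earlier arrivals is the last one; it
  // folds the partials. No separate bool "is_last_team" flag is needed.
  if (prev_done == num_teams - 1 && threadIdx.x == 0) {
    double total = 0.0;
    for (unsigned int t = 0; t < num_teams; ++t)
      total += partials[t];
    *out = total;
  }
}
```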
Oct 26 2022
- Removed duplicate shfl variants that were identical between amdgcn and nvptx, and simplified the nvptx variant of shfl_xor_int (the underlying warp-shuffle pattern is sketched after this list).
- Improved performance and lowered register utilization by deriving the reduction constants from k and passing numteams from codegen to the xteamr function. This changes the interface to the xteamr functions and removes the dependencies on DeviceRTL mapping functions.
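The warp-level primitive behind shfl_xor_int is the standard butterfly reduction; a minimal CUDA sketch of that pattern (not the DeviceRTL code itself, and assuming a full 32-lane warp):

```
// Butterfly (XOR) reduction across one full 32-lane warp: after the loop,
// every lane holds the warp-wide sum. Sketch of the pattern shfl_xor_int wraps.
__device__ int warp_sum(int val) {
  for (int lane_mask = 16; lane_mask > 0; lane_mask >>= 1)
    val += __shfl_xor_sync(0xffffffffu, val, lane_mask);
  return val;
}
```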
Oct 25 2022
- Fix the lit test xteamr/test_xteamr.cpp
Oct 24 2022
- Fix for make check-openmp
Oct 19 2022
- Change the name from ompx_get_team_procs(devid) to ompx_get_device_num_units(devid). Internal names have also been changed.
Oct 17 2022
Is the name ompx_get_device_num_units(devid) acceptable?
See answer to inline comment.
I think I get your rationale now. A thread executes on a place's proc; in fact, many threads can execute on (map to) a place's proc. We need the same sort of "mapping" for a team or thread group. We did not call a place a thread_proc.
As the name is now, a team_proc is a place where a team can run. That can be a CU, an SM, a VE, a host socket, a host NUMA domain, a node, or something else. The code in this review covers all existing (in trunk) offload plugins that can execute a team.
"Team can technically map to lots of things and soon they will. What about "thread groups"? Or even "sockets"?
Also, there is some duplication I pointed out below."
- move NumberOfTeamProcs into DeviceData per comment from jdoerfert
- Merge branch 'main' into arcpatch-D135162
- Remove change to openmp/libomptarget/plugins/exports so this revision will merge with current trunk
Ignore this; I was just trying to fix https://reviews.llvm.org/D135162 and accidentally created a new revision.
The change to openmp/libomptarget/plugins/exports needs to be deleted. Upstream changed this file to include all functions beginning with __tgt_rtl.
Oct 5 2022
Oct 4 2022
No longer using this approach
Dec 2 2021
this is ok as is
Dec 1 2021
I forgot to add the "request changes" action.
We want amdgcn to remain on the old deviceRTL until we have verified it. I made inline comments on how this could be done.
Aug 26 2021
Jul 22 2021
This looks good. Thanks for the cleanup
May 18 2021
Apr 15 2021
I am removing my objection with the understanding that we will either replace or enhance amdgpu-arch with the cross-architecture tool offload-arch, as described in my comments above. Also, this patch changes the AMD toolchains to use amdgpu-arch, so it will not interfere with how I expect the cross-architecture tool to be used to greatly simplify OpenMP command-line arguments.
Apr 14 2021
Dependence on HSA is not necessary. The amdgpu and nvidia drivers both make PCI codes available in /sys. We should use architecture-independent methods as much as possible.
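For illustration, here is a minimal host-side sketch of that architecture-independent approach; the only assumptions are the /sys/bus/pci/devices layout and the PCI vendor ids 0x1002 (AMD) and 0x10de (NVIDIA):

```
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

// Host-side sketch: walk /sys/bus/pci/devices and report devices whose
// vendor id matches AMD (0x1002) or NVIDIA (0x10de). No HSA or CUDA needed.
int main() {
  namespace fs = std::filesystem;
  for (const auto &dev : fs::directory_iterator("/sys/bus/pci/devices")) {
    std::ifstream vendor_file(dev.path() / "vendor");
    std::string vendor;
    if (!(vendor_file >> vendor))
      continue;
    if (vendor == "0x1002" || vendor == "0x10de")
      std::cout << dev.path().filename().string() << " vendor " << vendor << '\n';
  }
  return 0;
}
```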
I have two serious concerns with this tool. 1. It does not provide the infrastructure to identify the runtime capabilities needed to satisfy the requirements of a compiled image. 2. It is not architecturally neutral or extensible to other architectures. I have a different proposed tool called offload-arch. Here is the help text for offload-arch.
Jul 29 2020
This is all excellent feedback. Thank you.
I don't understand what I see on the godbolt link. So far, we have only tested with clang. We will test with gcc to understand the failure.
I will make the change to use numeric values for _DEVICE_ARCH and change "UNKNOWN" to some integer (maybe -1). The value _DEVICE_GPU is intended to be generational within a specific _DEVICE_ARCH.
To be clear, this is primarily for users or library writers to implement device-specific or host-only code. This is not something that should be automatic. Users or library authors will opt in with their own include. So maybe it does not belong in clang/lib/Headers.
As noted in the header comment, I expect the community to help keep this up to date as offloading technology evolves.
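As a rough illustration of the opt-in usage described above, a library writer might guard code like the sketch below. The macro names come from the patch, but the include name and the numeric values compared against are placeholders, not what the header actually defines.

```
/* Hypothetical opt-in guards. _DEVICE_ARCH and _DEVICE_GPU are the macros from
   the patch; "my_offload_macros.h" and the values tested are placeholders. */
#include "my_offload_macros.h"

#if defined(_DEVICE_ARCH) && (_DEVICE_ARCH != -1) /* -1 = UNKNOWN, per above */
  /* Some device compilation: refine by the generational _DEVICE_GPU value. */
  #if defined(_DEVICE_GPU) && (_DEVICE_GPU >= 3)
    #define LIB_USE_FAST_DEVICE_PATH 1
  #endif
#endif

#ifndef LIB_USE_FAST_DEVICE_PATH
  #define LIB_USE_FAST_DEVICE_PATH 0 /* host-only or older-device fallback */
#endif
```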
Jun 12 2020
LGTM; only one new auto in existing code that already has 11 autos. Looks like consistent usage.
May 14 2020
Mar 30 2020
This was discussed on llvm-dev three years ago. Here is the thread.
Feb 6 2020
I will rework this patch to:
- Try dlopen on the relative library name first, so the LD_LIBRARY_PATH search is used. If that fails, load using the full path name (see the sketch after this list).
- Reorganize the name arrays into a single array to avoid a counter.
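A minimal sketch of that lookup order; the library name (libexample.so) and install_dir are placeholders, not the real runtime library or path:

```
#include <dlfcn.h>
#include <stdio.h>

// Sketch of the planned lookup order: a bare name first, so the dynamic
// loader honors LD_LIBRARY_PATH, then the absolute path as a fallback.
static void *open_runtime_lib(const char *install_dir) {
  void *handle = dlopen("libexample.so", RTLD_LAZY);
  if (!handle) {
    char full_path[4096];
    snprintf(full_path, sizeof(full_path), "%s/lib/libexample.so", install_dir);
    handle = dlopen(full_path, RTLD_LAZY);
  }
  return handle; // NULL if both attempts failed; caller can report dlerror()
}
```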
Jan 29 2020
Aug 23 2018
I have a longer comment on header files, but let me first understand this patch.
Aug 22 2018
I like the idea of using an automatic include as a cc1 option (-include). However, I would prefer a more general automatic include for OpenMP, not just for math functions (clang_cuda_device_functions.h). Clang CUDA automatically includes clang_cuda_runtime_wrapper.h, which in turn includes other files such as clang_cuda_device_functions.h as needed. Let's hypothetically call my proposed automatic include for OpenMP clang_openmp_runtime_wrapper.h.
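A rough sketch of what such a wrapper could look like; the guard macros and the exact include spelling are assumptions, and only the idea of conditionally reusing the existing CUDA device-functions header is the point:

```
/* Hypothetical clang_openmp_runtime_wrapper.h -- illustrative only. */
#ifndef __CLANG_OPENMP_RUNTIME_WRAPPER_H__
#define __CLANG_OPENMP_RUNTIME_WRAPPER_H__

#if defined(_OPENMP) && (defined(__AMDGCN__) || defined(__NVPTX__))
/* Device pass of an OpenMP offload compile: reuse the CUDA device-side
   math implementations rather than duplicating them for OpenMP. */
#include <__clang_cuda_device_functions.h>
#endif

#endif /* __CLANG_OPENMP_RUNTIME_WRAPPER_H__ */
```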
Jun 25 2018
Why not provide a specific list of --hip-device-lib= for VDI builds? I am not sure about defining functions inside headers instead of using a hip bc lib.
May 7 2018
I agree that George's proposed RMW code is correct. This was my first attempt at RMW code. Maybe we should implement atomicMax as a device function in an architecture-specific (e.g. sm_30) device library. That way, the code in loop.cu can remain just a call to atomicMax. Such an implementation would need an overloaded atomicMax.
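For example, an overload for 64-bit values could be a CAS loop like the sketch below (sm_30 has no native unsigned long long atomicMax; a real library would guard this with __CUDA_ARCH__ checks and match the types actually used in loop.cu):

```
// Sketch of an overloaded atomicMax for 64-bit values on targets (e.g. sm_30)
// without the native overload; retries with atomicCAS until the maximum sticks.
__device__ unsigned long long atomicMax(unsigned long long *addr,
                                        unsigned long long val) {
  unsigned long long old = *addr, assumed;
  while (old < val) {
    assumed = old;
    old = atomicCAS(addr, assumed, val); // returns the value seen at addr
    if (old == assumed)
      break; // our maximum was installed
  }
  return old;
}
```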
Apr 4 2018
So, will deviceRTLs/nvptx change? Instead of extern shared, what will it use for those data structures?
Apr 2 2018
Maybe my search is missing something, but the only place I see CUDARelocatableDeviceCode is in lib/Sema/SemaDeclAttr.cpp to allow for extern shared. How could this be causing slowness? I would think forcing extern to be global would be slower.
Feb 5 2018
Here are my replies to the inline comments. Everything should be fixed in the next revision.
Feb 1 2018
Sorry, all my great inline comments got lost somehow. I am a newbie to Phabricator. I will try to reconstruct my comments.
Thanks to everyone for the reviews. I hope I replied to all inline comments. Since I sent this to Sam to post, we discovered a major shortcoming. As tra points out, there are a lot of CUDA headers in the CUDA SDK that are processed. We are able to override asm() expansions with #undef and redefine them as an equivalent amdgpu component so the compiler never sees the asm(). I am sure we will need to add more redefines as we broaden our testing, but that is not the big problem.

We would like to be able to run cudaclang for AMD GPUs without an install of CUDA. Of course, you must always install CUDA if any of your targeted GPUs are NVIDIA GPUs. To run cudaclang without CUDA when only non-NVIDIA GPUs are specified, we need an open set of headers, and we must replace the fatbin tools used in the toolchain. The latter can be addressed by using the libomptarget methods for embedding multiple target GPU objects. The former is going to take a lot of work.

I am going to be sending an updated patch that has the stubs for the open headers noted in clang_cuda_runtime_wrapper.h. They will be included with the CC1 flag -DUSE_OPEN_HEADERS__. This will be generated by the CUDA driver when it finds no CUDA installation and all target GPUs are not NVIDIA.
Dec 19 2016
Justin, the commonality between nvptx and amdgcn LLVM IR is exactly why I would like isGPU(). I actually do want to assume that "isGPU" <--> "isNVPTX || isAMDGCN".
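In free-function form, the check being proposed is roughly the sketch below (the real patch would add it as a member predicate on llvm::Triple alongside isNVPTX()):

```
#include "llvm/ADT/Triple.h"

// Sketch of the proposed predicate: today a "GPU" triple is exactly an NVPTX
// or an AMDGCN triple, which is the equivalence described above.
static bool isGPU(const llvm::Triple &T) {
  llvm::Triple::ArchType Arch = T.getArch();
  return Arch == llvm::Triple::nvptx || Arch == llvm::Triple::nvptx64 ||
         Arch == llvm::Triple::amdgcn;
}
```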
I can email you a bigger patch from our development tree. I would rather not post it in public yet because it still needs some work. Here are two examples from this patch.
Thank you Justin, Yes, I plan to use this extensively in clang for common OpenMP code generation. But I don't have those patches ready yet.
isGPU() may also be used when compiling CUDA code to LLVM IR as an alternative to isNVPTX(). I will discuss this with the Google authors first.
I formatted to 80 columns. Thank you for your patience with a new contributor.
Formatted to 80 characters.