It is lowercase openmp, not OPENMP.
CMake Error at runtimes/CMakeLists.txt:34 (message): LLVM_ENABLE_RUNTIMES requests OPENMP but directory not found: /home/yeluo/opt/llvm-clang/llvm-project/llvm/runtimes/../../OPENMP
This patch caused a severe regression in Clang 11.
https://bugs.llvm.org/show_bug.cgi?id=48177
In D77609#2386308, @jdoerfert wrote: In D77609#2386307, @tianshilei1992 wrote: In D77609#2362867, @jdoerfert wrote: Some variables are only defined when the MACRO is enabled. I have changed some code to make it more readable and less complex.
As I said before, I don't see the point in omitting declarations. It just increases our testing surface for no real benefit. If you don't use this but have two more functions and a few declarations, all of which you don't use, you really don't pay a price in the big scheme of things.
What is the right way to detect whether a CMake variable or macro is defined?
In C/C++ (#ifdef) or in CMake (idk)?
The point is that our test cases are not run by CMake, so they cannot detect whether we define any variable.
Then make USE_UNSHACKLED_TASK the default and remove all the uses that elide declarations and definitions.
Getting this even when compiling without offload. You can use the reproducer from the original bug report.
clang++: /home/yeluo/opt/llvm-clang/llvm-project/llvm/include/llvm/ADT/APInt.h:1151: bool llvm::APInt::operator==(const llvm::APInt &) const: Assertion `BitWidth == RHS.BitWidth && "Comparison requires equal bit widths"' failed. PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script. Stack dump: 0. Program arguments: /home/packages/llvm/master-patched/bin/clang++ -DADD_ -DH5_USE_16_API -DHAVE_CONFIG_H -Drestrict=__restrict__ -I/home/yeluo/opt/miniqmc/src -I/home/yeluo/opt/miniqmc/build_clang_offlaod_nowait/src -fopenmp -fomit-frame-pointer -fstrict-aliasing -D__forceinline=inline -march=native -O3 -DNDEBUG -ffast-math -std=c++11 -o CMakeFiles/qmcutil.dir/Utilities/tinyxml/tinyxml2.cpp.o -c /home/yeluo/opt/miniqmc/src/Utilities/tinyxml/tinyxml2.cpp 1. <eof> parser at end of file 2. Per-module optimization passes 3. Running pass 'CallGraph Pass Manager' on module '/home/yeluo/opt/miniqmc/src/Utilities/tinyxml/tinyxml2.cpp'. 4. Running pass 'Combine redundant instructions' on function '@_ZN8tinyxml27XMLUtil10IsNameCharEh' #0 0x0000000001ecc523 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/packages/llvm/master-patched/bin/clang+++0x1ecc523) #1 0x0000000001eca25e llvm::sys::RunSignalHandlers() (/home/packages/llvm/master-patched/bin/clang+++0x1eca25e) #2 0x0000000001ecb8cd llvm::sys::CleanupOnSignal(unsigned long) (/home/packages/llvm/master-patched/bin/clang+++0x1ecb8cd) #3 0x0000000001e513b3 (anonymous namespace)::CrashRecoveryContextImpl::HandleCrash(int, unsigned long) (/home/packages/llvm/master-patched/bin/clang+++0x1e513b3) #4 0x0000000001e514ee CrashRecoverySignalHandler(int) (/home/packages/llvm/master-patched/bin/clang+++0x1e514ee) #5 0x00007f18f56923c0 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x153c0) #6 0x00007f18f512718b raise /build/glibc-ZN95T4/glibc-2.31/signal/../sysdeps/unix/sysv/linux/raise.c:51:1 #7 0x00007f18f5106859 abort /build/glibc-ZN95T4/glibc-2.31/stdlib/abort.c:81:7 #8 0x00007f18f5106729 get_sysdep_segment_value /build/glibc-ZN95T4/glibc-2.31/intl/loadmsgcat.c:509:8 #9 0x00007f18f5106729 _nl_load_domain /build/glibc-ZN95T4/glibc-2.31/intl/loadmsgcat.c:970:34 #10 0x00007f18f5117f36 (/lib/x86_64-linux-gnu/libc.so.6+0x36f36) #11 0x00000000019f4c00 llvm::InstCombinerImpl::foldOrOfICmps(llvm::ICmpInst*, llvm::ICmpInst*, llvm::BinaryOperator&) (/home/packages/llvm/master-patched/bin/clang+++0x19f4c00) #12 0x00000000019fb023 llvm::InstCombinerImpl::visitOr(llvm::BinaryOperator&) (/home/packages/llvm/master-patched/bin/clang+++0x19fb023) #13 0x00000000019d354c llvm::InstCombinerImpl::run() (/home/packages/llvm/master-patched/bin/clang+++0x19d354c) #14 0x00000000019d5788 combineInstructionsOverFunction(llvm::Function&, llvm::InstCombineWorklist&, llvm::AAResults*, llvm::AssumptionCache&, llvm::TargetLibraryInfo&, llvm::TargetTransformInfo&, llvm::DominatorTree&, llvm::OptimizationRemarkEmitter&, llvm::BlockFrequencyInfo*, llvm::ProfileSummaryInfo*, unsigned int, llvm::LoopInfo*) (/home/packages/llvm/master-patched/bin/clang+++0x19d5788) #15 0x00000000019d70b1 llvm::InstructionCombiningPass::runOnFunction(llvm::Function&) (/home/packages/llvm/master-patched/bin/clang+++0x19d70b1) #16 0x00000000017c7a68 llvm::FPPassManager::runOnFunction(llvm::Function&) (/home/packages/llvm/master-patched/bin/clang+++0x17c7a68) #17 0x00000000010d0033 (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) (/home/packages/llvm/master-patched/bin/clang+++0x10d0033) #18 
0x00000000017c8117 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/home/packages/llvm/master-patched/bin/clang+++0x17c8117) #19 0x00000000020fed4a clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::DataLayout const&, llvm::Module*, clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >) (/home/packages/llvm/master-patched/bin/clang+++0x20fed4a) #20 0x0000000002d29c9c clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) (/home/packages/llvm/master-patched/bin/clang+++0x2d29c9c) #21 0x00000000037e77e3 clang::ParseAST(clang::Sema&, bool, bool) (/home/packages/llvm/master-patched/bin/clang+++0x37e77e3) #22 0x00000000026dc383 clang::FrontendAction::Execute() (/home/packages/llvm/master-patched/bin/clang+++0x26dc383) #23 0x000000000266e4f2 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/home/packages/llvm/master-patched/bin/clang+++0x266e4f2) #24 0x0000000002789bb2 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/home/packages/llvm/master-patched/bin/clang+++0x2789bb2) #25 0x0000000000a4568c cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/home/packages/llvm/master-patched/bin/clang+++0xa4568c) #26 0x0000000000a437ec ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&) (/home/packages/llvm/master-patched/bin/clang+++0xa437ec) #27 0x0000000002523de2 void llvm::function_ref<void ()>::callback_fn<clang::driver::CC1Command::Execute(llvm::ArrayRef<llvm::Optional<llvm::StringRef> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool*) const::$_1>(long) (/home/packages/llvm/master-patched/bin/clang+++0x2523de2) #28 0x0000000001e512c7 llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) (/home/packages/llvm/master-patched/bin/clang+++0x1e512c7) #29 0x00000000025234f7 clang::driver::CC1Command::Execute(llvm::ArrayRef<llvm::Optional<llvm::StringRef> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool*) const (/home/packages/llvm/master-patched/bin/clang+++0x25234f7) #30 0x00000000024efd28 clang::driver::Compilation::ExecuteCommand(clang::driver::Command const&, clang::driver::Command const*&) const (/home/packages/llvm/master-patched/bin/clang+++0x24efd28) #31 0x00000000024f0247 clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*> >&) const (/home/packages/llvm/master-patched/bin/clang+++0x24f0247) #32 0x0000000002509758 clang::driver::Driver::ExecuteCompilation(clang::driver::Compilation&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*> >&) (/home/packages/llvm/master-patched/bin/clang+++0x2509758) #33 0x0000000000a43158 main (/home/packages/llvm/master-patched/bin/clang+++0xa43158) #34 0x00007f18f51080b3 __libc_start_main /build/glibc-ZN95T4/glibc-2.31/csu/../csu/libc-start.c:342:3 #35 0x0000000000a404de _start (/home/packages/llvm/master-patched/bin/clang+++0xa404de) clang-12: error: clang frontend command failed with exit code 134 (use -v to see invocation) clang version 12.0.0 (https://github.com/llvm/llvm-project.git ca73dcd8a9ed9cc3ca1c1cc97ab893747791a681) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: /home/packages/llvm/master-patched/bin clang-12: note: diagnostic msg: ********************
In D89041#2345115, @jhuber6 wrote: In D89041#2345055, @ldionne wrote: In D89041#2345002, @ye-luo wrote: Fails at make install.
CMake Error at projects/libcxx/include/cmake_install.cmake:753 (file):
file INSTALL cannot find "/scratch/opt/llvm-clang/build_mirror_offload_nightly/projects/libcxx/__config_site": No such file or directory.
This change is surprisingly tricky to land. Thanks for the heads up.
Should be fixed in
commit b5aa67446e01bd277727b05710a42e69ac41e74b
Author: Louis Dionne <ldionne@apple.com>
Date: Wed Oct 21 12:53:24 2020 -0400
[libc++] Fix the installation of libc++ headers since the __config_site change
I'm still getting build errors from an existing build that worked earlier today and links against a previously installed libc++.
/home2/3n4/clang/bin/../include/c++/v1/__config:13:10: fatal error: '__config_site' file not found
#include <__config_site>
When I try a fresh build using an install script,
In file included from /home2/3n4/llvm/trunk/llvm-project/compiler-rt/lib/fuzzer/FuzzerCrossOver.cpp:11:
/home2/3n4/llvm/trunk/llvm-project/compiler-rt/lib/fuzzer/FuzzerDefs.h:14:10: fatal error: cassert: No such file or directory
#include <cassert>
Fails at make install.
In D77609#2336223, @tianshilei1992 wrote: Enabled unshackled thread by default
LGTM
In D88929#2315640, @JonChesterfield wrote: An alternative approach is to build the deviceRTL for multiple CUDA versions and then pick whichever one is the best fit when compiling application code. That has advantages when building the deviceRTL libraries on a different machine to the one that intends to use it.
CMake isn't my thing, but I see that my trunk build only has libomptarget-nvptx-sm_35.bc when the local card is an sm_50. The downstream AMD toolchain builds lots of this library; my install dir has fifteen of them (including sm_50).
In D88929#2315538, @jhuber6 wrote: In D88929#2315519, @ye-luo wrote: Probably not messing with enable_language(CUDA) at the moment; just add cuda_select_nvcc_arch_flags(CUDA_ARCH_FLAGS) to `openmp/libomptarget/cmake/Modules/LibomptargetGetDependencies.cmake`?
That only controls loading the library. Since this is where we set all the CUDA options, I think it's fine to call it here.
In D88929#2315513, @jhuber6 wrote: In D88929#2315451, @ye-luo wrote: I just realized that this patch affects clang and libomptarget.
I cannot comment on clang. Regarding libomptarget, could you explain why the detection is not put together with the other CUDA stuff in openmp/libomptarget/cmake/Modules/LibomptargetGetDependencies.cmake?
If we're sticking with using FindCUDA, it's definitely redundant here since it was already called by the time we get here. The support for the CUDA language would use the same method but have enable_language(CUDA) somewhere instead of find_package(CUDA).
I just realized that this patch affects clang and libomptarget.
I cannot comment on clang. Regarding libomptarget, could you explain why the detection is not put together with the other CUDA stuff in openmp/libomptarget/cmake/Modules/LibomptargetGetDependencies.cmake?
CMake 3.18 introduces CMAKE_CUDA_ARCHITECTURES. Does 3.18 support detection? If we know the new way works since 3.18, I think handling both with an if-else makes sense.
The link I posted indicates that the standalone feature has been available since 3.12. It is better to avoid deprecated features when introducing new CMake lines, even though some existing lines may still rely on deprecated CMake.
FindCUDA has been deprecated.
Please explore the following feature without directly calling FindCUDA.
https://gitlab.kitware.com/cmake/cmake/-/merge_requests/1856
The minimal reproducer and full app work now.
Should be good to go now.
After a bit more experimentation, the return status of cuGetErrorString can be more than just CUDA_SUCCESS and CUDA_ERROR_INVALID_VALUE.
In this particular case, when CUDA has been deinitialized, the error code can no longer be translated by cuGetErrorString.
So now errStr is only printed when cuGetErrorString returns CUDA_SUCCESS.
Treat CUDA_ERROR_INVALID_VALUE differently from a generic != CUDA_SUCCESS result.
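For illustration, a minimal sketch of the reporting pattern described above (my own example, not the actual plugin code; the helper name and messages are hypothetical):

#include <cuda.h>
#include <stdio.h>

// Hypothetical helper: report a CUDA driver API error code.
// Only print the translated string when cuGetErrorString itself succeeds;
// otherwise fall back to the raw numeric code.
static void reportCUDAError(CUresult Err) {
  const char *ErrStr = NULL;
  CUresult StrErr = cuGetErrorString(Err, &ErrStr);
  if (StrErr == CUDA_SUCCESS) {
    fprintf(stderr, "CUDA error: %s\n", ErrStr);
  } else if (StrErr == CUDA_ERROR_INVALID_VALUE) {
    // The code is unknown to the driver, e.g. when CUDA is already deinitialized.
    fprintf(stderr, "Unrecognized CUDA error code %d\n", (int)Err);
  } else {
    fprintf(stderr, "cuGetErrorString failed (%d) for error code %d\n",
            (int)StrErr, (int)Err);
  }
}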
Hold a second. I'm exploring a bit more in the error message.
The root cause is a known issue and I put up a bug report to track the status.
https://bugs.llvm.org/show_bug.cgi?id=47595
Anyway, this patch should be sufficient for users at the moment.
In D78075#2272474, @tianshilei1992 wrote: In D78075#2272398, @ye-luo wrote: However, an OpenMP task has a problem: it must be within a parallel region; otherwise the task will be executed immediately. As a result, if we directly wrap it in a regular task, the nowait target outside of a parallel region is still a synchronous version.
The spec says an implicit task can be generated by an implicit parallel region, which can be the whole OpenMP program. For this reason, the need for an explicit parallel region is a limitation of the LLVM OpenMP runtime, right?
Can I have an option to run the nowait region as a regular task instead of an unshackled task? Then I can use "parallel" and well-established ways to control thread affinity.
According to the spec, an implicit parallel region is an inactive parallel region that is not generated from a parallel construct. And based on the definition of active parallel region, which is a parallel region that is executed by a team consisting of more than one thread, an inactive parallel region only has one thread. Since we only have one thread, if we encounter a task, executing it immediately does make sense as we don't have another thread to execute it.
If I remember correctly, you may yield the thread inside a target region after enqueuing kernels and transfers. So even with one thread, there is a chance to run other tasks without finishing this target. Isn't that possible?
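For reference, a minimal sketch of the pattern being discussed (my own illustration, not code from the patch): with regular tasks, a nowait target only overlaps with host work when it is enclosed in an active parallel region, e.g. the usual parallel/single idiom:

#include <stdio.h>

int main(void) {
  int a[1024] = {0};

  // An active parallel region (more than one thread) gives the deferred
  // target task another thread to overlap with the host work below.
  #pragma omp parallel num_threads(2)
  #pragma omp single
  {
    #pragma omp target map(tofrom: a[0:1024]) nowait
    for (int i = 0; i < 1024; ++i)
      a[i] = i;

    printf("host keeps working\n"); // host work that may overlap

    #pragma omp taskwait // wait for the deferred target task
  }
  printf("a[10] = %d\n", a[10]);
  return 0;
}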
However, an OpenMP task has a problem: it must be within a parallel region; otherwise the task will be executed immediately. As a result, if we directly wrap it in a regular task, the nowait target outside of a parallel region is still a synchronous version.
The changes I requested have been added, so I am removing my block. Other review comments still need to be addressed.
In D87165#2257164, @jhuber6 wrote: Added additional comments. Should I add them to the Doxygen notes at the top?
In D86804#2248214, @jdoerfert wrote: In D86804#2248201, @ye-luo wrote: In D86804#2247785, @jdoerfert wrote: In D86804#2247697, @ye-luo wrote: It seems that the functions are marked static, so they should be OK. However, including the whole Debug.h in a plugin .cpp makes it feel OK to use any function/macro from the header file. But actually only part of the macros are for the plugin; some are only for libomptarget.
I'm not sure we want to make a distinction; the point is to move to a unified debug/message model. You can choose not only the level of information but also the kind of output (text, json, ...). The messages will then be tied to the webpage via enums, which allows all plugins to emit the same message for the same thing with the same link to more information. There will certainly be things that are only used in libomptarget or the plugins, but I don't see how that is any worse than duplicating the parts that are used by both.
I didn't mean to duplicate anything. Instead, you need multiple header files: one for the common parts, one for libomptarget, and one for the plugins. The latter two both include the first one. Later, when you expand the OFFLOAD_XXX signals, they can be added to the common file. The return signal is generated by the plugins and captured by libomptarget. Some users may want to see only the messages captured universally by libomptarget. Some users still want to see the native error message. So the libomptarget and plugin side error handling still needs to be separated.
I fail to see why this machinery is necessary to emit only messages from one place and not the other. I am not against a hierarchy of headers per se, but right now, and maybe also later, there seems to be little point.
I mean, we need to introduce a new env variable, actually two, that allow separate control. Once we have that we can argue about separation.
Alternatively, I would have suggested to define the "location" prior to including the debug header, e.g.:
#define DEBUG_LOCATION "PLUGIN"
#include "Debug.h"
which we verify like:
#if DEBUG_LOCATION != "PLUGIN" and DEBUG_LOCATION != "OMPTARGET"
#error ...
#endif
At the end of the day I want to simplify things. A single location for all our debug needs sounds simpler than 1 + #plugins to me, even if we don't use all functionality at each location. If separation does not allow anything we cannot reasonably do in a single location, I doubt it provides a benefit.
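As an aside, a minimal sketch of how such a guard could be spelled in practice (the preprocessor cannot compare string literals, so integer tokens are used here; all names are hypothetical, not the actual ones in Debug.h):

// Hypothetical location tokens; the real Debug.h may use different names.
#define DEBUG_LOCATION_OMPTARGET 1
#define DEBUG_LOCATION_PLUGIN 2

// Each translation unit picks its location before including the header.
#define DEBUG_LOCATION DEBUG_LOCATION_PLUGIN

// Inside the (hypothetical) Debug.h:
#if !defined(DEBUG_LOCATION) ||                                                \
    (DEBUG_LOCATION != DEBUG_LOCATION_OMPTARGET &&                            \
     DEBUG_LOCATION != DEBUG_LOCATION_PLUGIN)
#error "DEBUG_LOCATION must be OMPTARGET or PLUGIN before including Debug.h"
#endif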
In D86804#2247785, @jdoerfert wrote: In D86804#2247697, @ye-luo wrote: It seems that the functions are marked static, so they should be OK. However, including the whole Debug.h in a plugin .cpp makes it feel OK to use any function/macro from the header file. But actually only part of the macros are for the plugin; some are only for libomptarget.
I'm not sure we want to make a distinction; the point is to move to a unified debug/message model. You can choose not only the level of information but also the kind of output (text, json, ...). The messages will then be tied to the webpage via enums, which allows all plugins to emit the same message for the same thing with the same link to more information. There will certainly be things that are only used in libomptarget or the plugins, but I don't see how that is any worse than duplicating the parts that are used by both.
It seems that the functions are marked static, so they should be OK. However, including the whole Debug.h in a plugin .cpp makes it feel OK to use any function/macro from the header file. But actually only part of the macros are for the plugin; some are only for libomptarget.
I don't feel right having Debug.h shared by libomptarget and the plugins, especially when Debug.h contains not just macros but also functions.
Please document the flags in the patch summary.
I would prefer PrivateArgumentManagerTy to be moved into its own files.
The rest looks good to me.
Down the road, we may need a way to allocate host pinned memory via the plugin for the host buffer to maximize transfer performance.
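For context, a minimal sketch of the idea (my own illustration, not plugin code; the helper name is hypothetical): a page-locked host staging buffer allocated through the CUDA driver API lets cuMemcpyHtoDAsync run truly asynchronously and faster than copies from pageable memory:

#include <cuda.h>
#include <string.h>

// Hypothetical helper: stage Size bytes in pinned (page-locked) host memory
// and start an asynchronous host-to-device copy on Stream.
static CUresult copyToDevicePinned(CUdeviceptr Dst, const void *Src,
                                   size_t Size, CUstream Stream) {
  void *Pinned = NULL;
  CUresult Err = cuMemAllocHost(&Pinned, Size); // page-locked host memory
  if (Err != CUDA_SUCCESS)
    return Err;
  memcpy(Pinned, Src, Size);
  Err = cuMemcpyHtoDAsync(Dst, Pinned, Size, Stream);
  // A real implementation would free Pinned (cuMemFreeHost) only after the
  // copy completes, e.g. via an event or stream callback; omitted here.
  return Err;
}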
In D81054#2229637, @JonChesterfield wrote: As a heads up, I'm told this breaks amdgpu tests. @ronlieb is looking at the merge from upstream; I don't have any more details at this time. The basic idea of wrapping device alloc seems likely to be sound for all targets, so I'd guess we've run into a bug in this patch.
Only minor things.
In D86307#2229059, @jdoerfert wrote: In D86307#2229032, @ye-luo wrote: Why just "small" ones? Why not all of them?
In addition to the last paragraph of the new commit message, we also have to copy the data to the right place on the host. That is not free as the size grows.
Why just "small" ones? Why not all of them?
LGTM
LGTM
In addition,
What is the current status of this patch?
@lildmh could you update this patch? I'd like to test it against
https://bugs.llvm.org/show_bug.cgi?id=47122
I am blocking the patch temporarily because of my earlier questions.
I'm not sure if I'm asking for too much here.
LGTM
Thanks for fixing the bug. It should be good for the moment.
When I think about the existence of recursive mappers, we may still have more synchronization than needed. I think recursing over the whole targetDataBegin/targetDataEnd is a convenient but sub-optimal choice.
Recursion should only be done on the map/mapper analysis. I am just leaving my thoughts here; it needs a discussion beyond this patch.
LGTM.
LGTM. Please mention renaming variables in the summary.
LGTM. My applications run as expected now. PR46824, PR46012, PR46868 all work fine.
Only minor documentation issues.
In D84767#2180280, @ye-luo wrote: This patch
GPU activities: 96.99% 350.05ms 10 35.005ms 1.5680us 350.00ms [CUDA memcpy HtoD]
before the July 21 change
GPU activities: 95.33% 20.317ms 4 5.0793ms 1.6000us 20.305ms [CUDA memcpy HtoD]
Still more transfers than there should be.
This patch
GPU activities: 96.99% 350.05ms 10 35.005ms 1.5680us 350.00ms [CUDA memcpy HtoD]
before the July 21 change
GPU activities: 95.33% 20.317ms 4 5.0793ms 1.6000us 20.305ms [CUDA memcpy HtoD]
Still more transfers than there should be.
LGTM
OK. Leave the unrelated renaming to the future.
Only one minor issue. Your initial, more elaborate patch made me think you had replaced all the lock/unlock calls. After splitting, the change becomes very clean.
Should be easy to address my comments and let us get this merged ASAP.
In D84778#2179627, @tianshilei1992 wrote: I don't think it deserves three patches. The goal is to refactor the target function, and this patch does just that one thing. According to the bi-weekly meeting, the renaming could go with other related changes.
In D84778#2179502, @ye-luo wrote: In addition, should we update target_data_update as well?
I didn't touch target_data_update. Basically, I only took care of the related code.
Please check the reproducer in https://bugs.llvm.org/show_bug.cgi?id=46868 with LIBOMPTARGET_DEBUG=1.
The reference counting on the base pointer variable has side effects; it is not cleaned up when these variables leave their scope.
This patch needs to be split into three.
The order of 1 and 2 can be flexible.
In D84182#2173578, @grokos wrote: After this patch was committed, I tried to run the following example:
#include <stdio.h>

int *yptr;

int main() {
  int y[10];
  y[1] = 1;
  yptr = &y[0];
  printf("&yptr = %p\n", &yptr);
  printf("&y[0] = %p\n", &y[0]);
#pragma omp target data map(to: yptr[0:5])
#pragma omp target
  {
    printf("y = %d\n", yptr[1]);
    yptr[1] = 10;
    printf("y = %d\n", yptr[1]);
  }
  printf("y = %d\n", yptr[1]);
  return 0;
}
The arguments clang generates are:
1) base = &y[0], begin = &yptr, size = 8, type = TARGET_PARAM | TO
2) base = &yptr, begin = &y[0], size = 8, type = PTR_AND_OBJ | TO
The second argument is correct; the first argument doesn't make much sense. I believe it should have its base set to &yptr, not &y[0].
y[0] is not the base of anything; it is only the pointee object.