gtbercea (Gheorghe-Teodor Bercea)
User

Projects

User does not belong to any projects.

User Details

User Since
Dec 29 2016, 12:44 AM (76 w, 5 d)

Recent Activity

Tue, Jun 12

gtbercea updated the diff for D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain.

Added separate test.

Tue, Jun 12, 8:22 AM

Mon, Jun 11

gtbercea added inline comments to D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain.
Mon, Jun 11, 12:52 PM

Fri, Jun 8

gtbercea added a comment to D47849: [OpenMP][Clang][NVPTX] Enable math functions called in an OpenMP NVPTX target device region to be resolved as device-native function calls.

I just stumbled upon a very interesting situation.

Fri, Jun 8, 2:52 PM

Thu, Jun 7

gtbercea added a comment to D47849: [OpenMP][Clang][NVPTX] Enable math functions called in an OpenMP NVPTX target device region to be resolved as device-native function calls.

IMO this goes in the right direction: we should use the fast implementation in libdevice. If LLVM doesn't lower these calls in the NVPTX backend, I think it's OK to use header wrappers as CUDA already does.

Two questions:

  1. Can you explain where this is important for "correctness"? Yesterday I compiled a code using sqrt and it seems to spit out the correct results. Maybe that's relevant for other functions?
  2. Incidentally I ran into a closely related problem: I can't #include <math.h> in translation units compiled for offloading, Clang complains about inline assembly for x86 (see below). Does that work for you?

    In file included from /usr/include/math.h:413:
    /usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm
    asm ("pmovmskb %1, %0" : "=r" (m) : "x" (x));
    ^
    /usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm
    asm ("pmovmskb %1, %0" : "=r" (m) : "x" (x));
    ^
    2 errors generated.
Thu, Jun 7, 7:19 AM
gtbercea added inline comments to D47849: [OpenMP][Clang][NVPTX] Enable math functions called in an OpenMP NVPTX target device region to be resolved as device-native function calls.
Thu, Jun 7, 7:17 AM

Wed, Jun 6

gtbercea created D47849: [OpenMP][Clang][NVPTX] Enable math functions called in an OpenMP NVPTX target device region to be resolved as device-native function calls.
Wed, Jun 6, 3:45 PM
gtbercea added a comment to D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain.

@tra Thank you for your comments and help with the patch.

Wed, Jun 6, 7:06 AM

Mon, Jun 4

gtbercea updated the summary of D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain.
Mon, Jun 4, 10:06 AM
gtbercea updated the summary of D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain.
Mon, Jun 4, 10:02 AM
gtbercea added a comment to D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain.

Hmm, maybe the scope is much larger: I just tried linking an executable that references a declare target function in a shared library. My assumption was that this already works, given that libomptarget's registration functions can be called multiple times. Am I doing something wrong?

I believe this is a limitation coming from the CUDA toolchain. Not even nvcc supports this case: https://stackoverflow.com/questions/35897002/cuda-nvcc-building-chain-of-libraries

You are absolutely right, thanks for the link. Maybe we should also document somewhere that we don't support that either for OpenMP offloading to NVPTX?

I think this basically renders my approach useless, as I meant to compile each device object file for offloading targets directly to a shared library. Those could have been put together at runtime by just loading (and registering) them in the right order. That way we would have been able to keep clang-offload-bundler in its current target-agnostic form and wouldn't have needed to appease proprietary tools such as nvlink.

With that knowledge I see no other way than what this patch proposes. (I still don't particularly like it because it requires each toolchain to implement their own magic.) Sorry for the delay and my disagreement based on wrong assumptions that I wasn't able to verify as soon as I'd have liked to.

Mon, Jun 4, 8:03 AM

Fri, Jun 1

gtbercea added a comment to D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain.

Hmm, maybe the scope is much larger: I just tried linking an executable that references a declare target function in a shared library. My assumption was that this already works, given that libomptarget's registration functions can be called multiple times. Am I doing something wrong?

Fri, Jun 1, 1:05 PM
gtbercea added a comment to D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain.

I disagree in this context because this patch currently means that static archives will only work with NVPTX, and there is no clear path to "fix" things for other offloading targets. I'll try to work on my proposal over the next few days (sorry, very busy week...); maybe I can put together a prototype of my idea.

Fri, Jun 1, 6:47 AM

Thu, May 31

gtbercea added a comment to D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain.

The error is related to lack of device linking, just like you explained two paragraphs down. This is the error I get:

main.o: In function `__cuda_module_ctor':
main.cu:(.text+0x674): undefined reference to `__cudaRegisterLinkedBinary__nv_c5b75865'
Thu, May 31, 5:18 PM
gtbercea added a comment to D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain.

Assuming we do proceed with the back-to-CUDA approach, one thing I'd consider would be using clang's -fcuda-include-gpubinary option, which CUDA uses to include GPU code into the host object. You may be able to use it to avoid compiling and partially linking .fatbin and host .o.

Thu, May 31, 2:57 PM

Tue, May 29

gtbercea added a comment to D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain.

Just to clarify one thing in my last comment:

When I say that we didn't aim at having clang compatible with other compilers, I mean the OpenMP offloading descriptors, where all the variables and offloading entry points are. Of course we want to allow the resulting binaries to be compatible with linkers taking inputs of other compilers, so that you can have, e.g., OpenMP and CUDA supported in the same executable, even though working independently.

Tue, May 29, 11:10 AM
gtbercea added inline comments to D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain.
Tue, May 29, 11:08 AM
gtbercea added inline comments to D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain.
Tue, May 29, 9:02 AM
gtbercea added inline comments to D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain.
Tue, May 29, 9:01 AM

Fri, May 25

gtbercea updated the diff for D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain.
Fri, May 25, 3:04 PM
gtbercea created D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain.
Fri, May 25, 2:33 PM

Thu, May 24

gtbercea updated the diff for D46842: [OpenMP][libomptarget] Make bitcode library building depend on clang and llvm-linker being available.

Update patch on top of latest changes.

Thu, May 24, 7:57 AM

May 18 2018

gtbercea added a comment to D46842: [OpenMP][libomptarget] Make bitcode library building depend on clang and llvm-linker being available.

If older versions of the compiler re-enable inlining in the future, the user is free to point the exposed CMake flags to an older version of the compiler. If the user doesn't do anything, then they are guaranteed a compiler that will inline correctly. Any other default will not guarantee inlining.

May 18 2018, 8:37 AM
gtbercea added a comment to D46842: [OpenMP][libomptarget] Make bitcode library building depend on clang and llvm-linker being available.

As long as cmake allows a part of the build to depend on another part of the build I think there is no problem with this. It just so happens that we have a part of the build depending on the clang compiler. It could have depended on the building of any other random file. I don't see this as something un-cmake-like.

May 18 2018, 8:33 AM

May 17 2018

gtbercea added a comment to D46842: [OpenMP][libomptarget] Make bitcode library building depend on clang and llvm-linker being available.

I am not at all convinced. I think there are absolutely no good arguments against allowing the just-built compiler to build the BCLIB (this is simply a default, which the user can override at any time). The drawbacks you mention have no detrimental impact on the quality of life of the compiler developer. Developers improving other parts of the compiler who do not care about OpenMP and device offloading can just disable the building of the BCLIB, because it is optional after all. The BCLIB is required for performance, not correctness, so it can be disabled at any time. Even when BCLIB building is turned on, there is no perceptible overhead: compared to the time it takes to compile a cpp file somewhere in the source code, the BCLIB builds instantly.

May 17 2018, 2:40 PM

May 16 2018

gtbercea accepted D46901: [libomptarget-nvptx] Test bitcode compiler flags and enable by default.

@gtbercea I'd like to keep that discussion in D46842. Short answer: It's working for me.

May 16 2018, 8:32 AM
gtbercea added inline comments to D46901: [libomptarget-nvptx] Test bitcode compiler flags and enable by default.
May 16 2018, 6:17 AM
gtbercea accepted D46930: [libomptarget-nvptx-bc] Pass found CUDA installations.

LGTM

May 16 2018, 6:10 AM

May 15 2018

gtbercea added a comment to D46842: [OpenMP][libomptarget] Make bitcode library building depend on clang and llvm-linker being available.

  1. The just-built compiler doesn't exist by definition when CMake is invoked. This will prevent all kinds of dynamic flag checking (does ${LIBOMPTARGET_NVPTX_SELECTED_CUDA_COMPILER} support -fflag?). This is how I propose to reliably implement a solution for D44992.

May 15 2018, 7:05 AM

May 14 2018

gtbercea added a comment to D44992: [OpenMP] enable bc file compilation using the latest clang.

Thanks for the response. Can you point me to the final solution for this or re-explain it? From the comments I'm not sure I can distill the solution you want.

May 14 2018, 3:02 PM
gtbercea added a comment to D46842: [OpenMP][libomptarget] Make bitcode library building depend on clang and llvm-linker being available.

There is a very good reason why the current compiler is being used for compiling the bc-lib. Any incompatibility in LLVM version between the bclib compiler and the just-built compiler will result in an error when inlining. So the latest build is always the safest default to use; it can never fail.

May 14 2018, 2:12 PM
gtbercea added a comment to D46842: [OpenMP][libomptarget] Make bitcode library building depend on clang and llvm-linker being available.

I agree with Jonas, the initial version of the nvptx RTL that we tried to upstream was somewhat like what you try to do here and the consensus was that we do not want to have these dependencies.

If you want to use the just-built clang to compile the BC lib, then the proper (albeit longer) solution is to build clang without the BC lib, then re-run cmake with LIBOMPTARGET_NVPTX_ENABLE_BCLIB enabled and LIBOMPTARGET_NVPTX_SELECTED_CUDA_COMPILER pointing to the just-built clang and re-run ninja.

May 14 2018, 2:01 PM
gtbercea created D46842: [OpenMP][libomptarget] Make bitcode library building depend on clang and llvm-linker being available.
May 14 2018, 12:57 PM
gtbercea added a comment to D44992: [OpenMP] enable bc file compilation using the latest clang.

Has this been implemented elsewhere already? Last I tried, this flag was still needed here in order to build the bitcode library.

May 14 2018, 12:44 PM
gtbercea updated the diff for D46840: [OpenMP][libomptarget] Add function for checking SPMD mode.

Use int8_t instead of bool type.

May 14 2018, 12:24 PM
gtbercea created D46840: [OpenMP][libomptarget] Add function for checking SPMD mode.
May 14 2018, 12:16 PM

Apr 3 2018

gtbercea added a comment to D44992: [OpenMP] enable bc file compilation using the latest clang.

Post-commit because your commit didn't trigger an email (please subscribe to openmp-commits!).

IMO this is wrong and should be reverted. What should be done instead is detect whether the compiler supports that flag because it was only added recently. Older compilers (pre 4.0?) are able to build bclib without that flag. In both cases, the build system should enable the bclib by default because it's sensible to do.

Regarding performance: Relocatable code is possibly slower, but we only use that flag to produce bitcode which is inlined by the compiler. However, I'm still working on that feature (see D42922) and I can't say for sure that we won't end up emitting different IR once the support matures. So this definitely needs to be considered. That's also the reason why I didn't submit a patch yet.

Apr 3 2018, 10:45 AM

Mar 21 2018

gtbercea created D44754: [OpenMP][libomptarget] Initialize global memory stack only once..
Mar 21 2018, 1:09 PM
gtbercea abandoned D44541: [OpenMP][Clang] Move device global stack init before master-workers split.

This leads to the use of statically allocated shared data before its initialization in runtime structures by the master thread in the kernel_init() function. A new patch is available with worker-side and master-side initialization.

Mar 21 2018, 11:55 AM
gtbercea created D44749: [OpenMP][Clang] Add call to global data sharing stack initialization on the workers side.
Mar 21 2018, 11:54 AM

Mar 20 2018

gtbercea updated the diff for D44487: [OpenMP][libomptarget] Enable globalization for workers.

Address comments.

Mar 20 2018, 11:31 AM
gtbercea updated the diff for D44541: [OpenMP][Clang] Move device global stack init before master-workers split.

Fix test.

Mar 20 2018, 10:32 AM

Mar 19 2018

gtbercea updated the diff for D44487: [OpenMP][libomptarget] Enable globalization for workers.

Fix comment.

Mar 19 2018, 3:57 PM
gtbercea updated the diff for D44487: [OpenMP][libomptarget] Enable globalization for workers.

Always pop current frame from slot.

Mar 19 2018, 3:57 PM
gtbercea updated the diff for D44487: [OpenMP][libomptarget] Enable globalization for workers.

Share FrameP with all threads in the same warp.

Mar 19 2018, 12:33 PM
gtbercea abandoned D44588: [OpenMP][Clang] Pass global thread ID to outlined function.

After some internal discussion with @ABataev he is going to replace the manual computation of the thread ID with a call to the runtime in a new patch.

Mar 19 2018, 7:45 AM

Mar 16 2018

gtbercea added a dependent revision for D44588: [OpenMP][Clang] Pass global thread ID to outlined function: D44541: [OpenMP][Clang] Move device global stack init before master-workers split.
Mar 16 2018, 3:05 PM
gtbercea added a dependency for D44541: [OpenMP][Clang] Move device global stack init before master-workers split: D44588: [OpenMP][Clang] Pass global thread ID to outlined function.
Mar 16 2018, 3:05 PM
gtbercea created D44588: [OpenMP][Clang] Pass global thread ID to outlined function.
Mar 16 2018, 3:01 PM
gtbercea updated the diff for D44487: [OpenMP][libomptarget] Enable globalization for workers.

Choose correct default size depending on slot type: worker or master.

Mar 16 2018, 9:00 AM

Mar 15 2018

gtbercea added a dependent revision for D44537: [OpenMP][libomptarget] Fix master warp check: D44541: [OpenMP][Clang] Move device global stack init before master-workers split.
Mar 15 2018, 2:12 PM
gtbercea added a dependency for D44541: [OpenMP][Clang] Move device global stack init before master-workers split: D44537: [OpenMP][libomptarget] Fix master warp check.
Mar 15 2018, 2:12 PM
gtbercea created D44541: [OpenMP][Clang] Move device global stack init before master-workers split.
Mar 15 2018, 2:11 PM
gtbercea updated the diff for D44537: [OpenMP][libomptarget] Fix master warp check.

Adjust comments.

Mar 15 2018, 1:45 PM
gtbercea updated the diff for D44537: [OpenMP][libomptarget] Fix master warp check.

Insert guards in init function.

Mar 15 2018, 1:38 PM
gtbercea added a dependency for D44537: [OpenMP][libomptarget] Fix master warp check: D44487: [OpenMP][libomptarget] Enable globalization for workers.
Mar 15 2018, 1:19 PM
gtbercea added a dependent revision for D44487: [OpenMP][libomptarget] Enable globalization for workers: D44537: [OpenMP][libomptarget] Fix master warp check.
Mar 15 2018, 1:19 PM
gtbercea created D44537: [OpenMP][libomptarget] Fix master warp check.
Mar 15 2018, 1:19 PM
gtbercea updated the diff for D44487: [OpenMP][libomptarget] Enable globalization for workers.

Rebase patch.

Mar 15 2018, 8:35 AM
gtbercea updated the diff for D44486: [OpenMP][libomptarget] Enable usage of shared memory slots.

Rebase patch.

Mar 15 2018, 8:33 AM
gtbercea updated the diff for D44470: [OpenMP][libomptarget] Enable multiple frames per global memory slot.

Rebase patch.

Mar 15 2018, 8:32 AM
gtbercea updated the diff for D44487: [OpenMP][libomptarget] Enable globalization for workers.

Rebase patch.

Mar 15 2018, 8:28 AM
gtbercea updated the diff for D44486: [OpenMP][libomptarget] Enable usage of shared memory slots.

Rebase patch.

Mar 15 2018, 8:27 AM
gtbercea added inline comments to D44470: [OpenMP][libomptarget] Enable multiple frames per global memory slot.
Mar 15 2018, 8:25 AM
gtbercea updated the diff for D44470: [OpenMP][libomptarget] Enable multiple frames per global memory slot.

New algorithm for adding data into an existing slot.

Mar 15 2018, 8:25 AM

Mar 14 2018

gtbercea added inline comments to D44470: [OpenMP][libomptarget] Enable multiple frames per global memory slot.
Mar 14 2018, 4:08 PM
gtbercea added inline comments to D44487: [OpenMP][libomptarget] Enable globalization for workers.
Mar 14 2018, 1:58 PM
gtbercea added dependencies for D44487: [OpenMP][libomptarget] Enable globalization for workers: D44470: [OpenMP][libomptarget] Enable multiple frames per global memory slot, D44486: [OpenMP][libomptarget] Enable usage of shared memory slots.
Mar 14 2018, 11:45 AM
gtbercea added a dependent revision for D44486: [OpenMP][libomptarget] Enable usage of shared memory slots: D44487: [OpenMP][libomptarget] Enable globalization for workers.
Mar 14 2018, 11:45 AM
gtbercea added a dependent revision for D44470: [OpenMP][libomptarget] Enable multiple frames per global memory slot: D44487: [OpenMP][libomptarget] Enable globalization for workers.
Mar 14 2018, 11:45 AM
gtbercea added a dependency for D44486: [OpenMP][libomptarget] Enable usage of shared memory slots: D44470: [OpenMP][libomptarget] Enable multiple frames per global memory slot.
Mar 14 2018, 11:44 AM
gtbercea added a dependent revision for D44470: [OpenMP][libomptarget] Enable multiple frames per global memory slot: D44486: [OpenMP][libomptarget] Enable usage of shared memory slots.
Mar 14 2018, 11:44 AM
gtbercea created D44487: [OpenMP][libomptarget] Enable globalization for workers.
Mar 14 2018, 11:43 AM
gtbercea created D44486: [OpenMP][libomptarget] Enable usage of shared memory slots.
Mar 14 2018, 11:39 AM
gtbercea updated the diff for D44470: [OpenMP][libomptarget] Enable multiple frames per global memory slot.

Add init of tail pointer.

Mar 14 2018, 8:20 AM
gtbercea created D44470: [OpenMP][libomptarget] Enable multiple frames per global memory slot.
Mar 14 2018, 8:04 AM

Mar 13 2018

gtbercea updated the diff for D43197: [OpenMP] Add flag for linking runtime bitcode library.
Mar 13 2018, 4:20 PM
gtbercea updated the diff for D43197: [OpenMP] Add flag for linking runtime bitcode library.
Mar 13 2018, 4:16 PM
gtbercea updated the diff for D43197: [OpenMP] Add flag for linking runtime bitcode library.
Mar 13 2018, 2:42 PM
gtbercea updated the diff for D43197: [OpenMP] Add flag for linking runtime bitcode library.

Update patch manually.

Mar 13 2018, 2:33 PM
gtbercea updated the diff for D43197: [OpenMP] Add flag for linking runtime bitcode library.

Test.

Mar 13 2018, 2:31 PM
gtbercea updated the diff for D43197: [OpenMP] Add flag for linking runtime bitcode library.

Add bclib.

Mar 13 2018, 2:31 PM
gtbercea updated the diff for D43197: [OpenMP] Add flag for linking runtime bitcode library.

Improve test robustness for the case when CUDA libdevice cannot be found.
Check that the warning is not emitted when the bc lib is found.

Mar 13 2018, 2:27 PM
gtbercea reopened D43197: [OpenMP] Add flag for linking runtime bitcode library.
Mar 13 2018, 2:21 PM
gtbercea updated the diff for D43197: [OpenMP] Add flag for linking runtime bitcode library.

Address comments.

Mar 13 2018, 11:59 AM

Mar 12 2018

gtbercea added inline comments to D44260: [OpenMP][libomptarget] Add global memory data sharing support for master-worker sharing..
Mar 12 2018, 7:57 AM
gtbercea updated the diff for D43197: [OpenMP] Add flag for linking runtime bitcode library.

Add input file.

Mar 12 2018, 7:22 AM
gtbercea updated the diff for D43197: [OpenMP] Add flag for linking runtime bitcode library.

Fixes.

Mar 12 2018, 7:22 AM
gtbercea updated the diff for D43197: [OpenMP] Add flag for linking runtime bitcode library.

Rename folder. Fix test.

Mar 12 2018, 7:18 AM
gtbercea updated the diff for D43197: [OpenMP] Add flag for linking runtime bitcode library.

Change name of folder.

Mar 12 2018, 7:13 AM

Mar 9 2018

gtbercea added inline comments to D43197: [OpenMP] Add flag for linking runtime bitcode library.
Mar 9 2018, 8:52 AM
gtbercea updated the diff for D43197: [OpenMP] Add flag for linking runtime bitcode library.

Fix test.

Mar 9 2018, 8:52 AM
gtbercea added inline comments to D43197: [OpenMP] Add flag for linking runtime bitcode library.
Mar 9 2018, 8:03 AM
gtbercea updated the diff for D43197: [OpenMP] Add flag for linking runtime bitcode library.

Revert to c_str().

Mar 9 2018, 8:03 AM
gtbercea added inline comments to D43197: [OpenMP] Add flag for linking runtime bitcode library.
Mar 9 2018, 7:52 AM
gtbercea updated the diff for D43197: [OpenMP] Add flag for linking runtime bitcode library.

Change test.

Mar 9 2018, 7:49 AM
gtbercea added inline comments to D43197: [OpenMP] Add flag for linking runtime bitcode library.
Mar 9 2018, 7:19 AM

Mar 8 2018

gtbercea added inline comments to D44260: [OpenMP][libomptarget] Add global memory data sharing support for master-worker sharing..
Mar 8 2018, 11:17 AM
gtbercea updated the diff for D44260: [OpenMP][libomptarget] Add global memory data sharing support for master-worker sharing..

Address comments.

Mar 8 2018, 11:16 AM
gtbercea added a reviewer for D43197: [OpenMP] Add flag for linking runtime bitcode library: hfinkel.
Mar 8 2018, 10:55 AM