Page MenuHomePhabricator

[OpenMP] Initial implementation of OpenMP offloading library - libomptarget plugins.
ClosedPublic

Authored by grokos on Nov 2 2015, 12:06 PM.

Details

Summary

This patch is a partition of the original patch posted in http://reviews.llvm.org/D14031.

This patch implements the library device plugins that are used to interface with a given device. Currently a generic 64-bit target and a CUDA enables GPU are the only available plugins. The interface of each plugin is documented in http://goo.gl/L1rnKJ.

Depends on http://reviews.llvm.org/D14031.

Diff Detail

Repository
rL LLVM

Event Timeline

sfantao updated this revision to Diff 38954.Nov 2 2015, 12:06 PM
sfantao retitled this revision from to [OpenMP] Initial implementation of OpenMP offloading library - libomptarget plugins..
sfantao updated this object.
sfantao added reviewers: ABataev, hfinkel, jcownie, jlpeyton.
sfantao updated this revision to Diff 39736.Nov 9 2015, 12:19 PM

Rebase and add install rule.

sfantao updated this revision to Diff 41338.Nov 28 2015, 4:10 PM

Add __tgt_rtl_is_valid_binary to the plugins' interface so that they can inform the target agnostic library that an image is valid without having to load it completely.

Hahnfeld added inline comments.Feb 1 2016, 2:03 AM
libomptarget/plugins/cuda/src/rtl.cpp
14–25 ↗(On Diff #41338)

Some of the C header files were not needed on my system (stdio.h, stdlib.h, string.h)

110–113 ↗(On Diff #41338)

You could use a range-based loop

427–430 ↗(On Diff #41338)

This memory is leaked - can this be done with some C++-constructs (stringstream for example)?

566–573 ↗(On Diff #41338)

Doesn't this mean that a default thread_limit = 0 or a too high one will be updated for every kernel? (The variable is signed, is a negative value possible and what does it mean?)

593–600 ↗(On Diff #41338)

So this launch is asynchronous and it gets synced when fetching data from the device.
But what would happen if there is a target region immediately following another one that is not yet synced? And can you later on figure out which kernel to wait for?

libomptarget/plugins/generic-elf-64bit/src/rtl.cpp
14–26 ↗(On Diff #41338)

See other comment about header files

110–112 ↗(On Diff #41338)
  1. DynLibs isn't used anywhere else?
  2. If it is needed: I think this doesn't work as expected (from begin to begin) - then maybe rewrite as range-based loop?
243–245 ↗(On Diff #41338)

So if it's possible, why not do it?

jprice added a subscriber: jprice.Mar 4 2016, 5:23 AM
mkuron added a subscriber: mkuron.Mar 19 2016, 1:14 AM
jleidel added a subscriber: jleidel.Apr 5 2016, 5:25 AM
grokos commandeered this revision.Dec 1 2016, 2:40 PM
grokos added a reviewer: sfantao.
grokos added a subscriber: grokos.
grokos added inline comments.
libomptarget/plugins/cuda/src/rtl.cpp
14–25 ↗(On Diff #41338)

True, I've removed the unnecessary headers.

110–113 ↗(On Diff #41338)

Done!

427–430 ↗(On Diff #41338)

Fixed in new diff.

566–573 ↗(On Diff #41338)

The GPU symbol KernelInfo->ThreadLimit is not used anymore by our new codegen scheme and I have removed it from the new diff.

To address the second question, the variable is indeed signed but in reality it is never assigned a negative value, that would have no meaning. We chose signed because that's what libomp uses as well, e.g. omp_set_max_threads() takes a signed int as argument. Generally the OpenMP standard uses signed ints pretty much everywhere even if negative values do not make any sense.

593–600 ↗(On Diff #41338)

All calls are inserted into the default CUDA stream. Kernels and memcopies in the default stream are executed in order, i.e. if we enqueue two kernels back to back, the latter will be executed after the former has completed. So in this version of the plugin (which does not support yet asynchronous executions and memcopies), there is an implicit sync between kernel launches.

libomptarget/plugins/generic-elf-64bit/src/rtl.cpp
14–26 ↗(On Diff #41338)

Done.

110–112 ↗(On Diff #41338)

Right, this is a bug. DynLibs is meant to be used to keep track of what libraries have been loaded so that we can free the resources allocated for them at the end. I have fixed the loop range in the new diff.

243–245 ↗(On Diff #41338)

It is possible to avoid the temp file, but whatever method we use to do it will not be portable. Let's stick with this implementation.

grokos updated this revision to Diff 79986.Dec 1 2016, 2:42 PM

Revision which addresses the comments from last review and adds support for new functionality (new codegen scheme, execution mode symbol, checks for the validity of num_teams and num_threads used to exexute the kernel).

grokos added a reviewer: jhen.Dec 1 2016, 2:43 PM
Hahnfeld added inline comments.Dec 5 2016, 12:51 AM
libomptarget/CMakeLists.txt
107 ↗(On Diff #79986)

Only this line should be in the diff

libomptarget/plugins/CMakeLists.txt
40 ↗(On Diff #79986)

This should also take LLVM_LIBDIR_SUFFIX into account

libomptarget/plugins/cuda/CMakeLists.txt
22 ↗(On Diff #79986)

LIBOMPTARGET_CMAKE_BUILD_TYPE to match against lowercase?

31 ↗(On Diff #79986)

LLVM_LIBDIR_SUFFIX

libomptarget/plugins/cuda/src/rtl.cpp
593–600 ↗(On Diff #41338)

Okay, so I now understand that this works technically. However, this may not be very intuitive for the user:

#pragma omp target data map(...)
{
  double start = omp_get_wtime();
  #pragma omp target
  {
    // ...
  }
  // no data transfer back
  double runtime = omp_get_wtime() - start;
}

runtime will then only measure the time needed to launch the kernel although the target region is synchronous. (I fully understand that there will be a sync before data is copied back, but that's not my point)

25–31 ↗(On Diff #79986)

Can TARGET_NAME be hardcoded to CUDA?

46–52 ↗(On Diff #79986)

This is never used

197–213 ↗(On Diff #79986)

These are not in the standard, should this really start with OMP_?

496 ↗(On Diff #79986)

Unneccessary cast to CUdeviceptr?

684 ↗(On Diff #79986)

Is that going to be used in the future? If so, please add a comment

libomptarget/plugins/generic-elf-64bit/src/rtl.cpp
50–56 ↗(On Diff #79986)

This is never used

129–170 ↗(On Diff #79986)

This looks the same as in the CUDA plugin. Can all this be refactored into a function?

grokos updated this revision to Diff 86002.Jan 26 2017, 7:36 PM
grokos marked 9 inline comments as done.

Addressed previous comments, ready for another review.

libomptarget/CMakeLists.txt
107 ↗(On Diff #79986)

Fixed.

libomptarget/plugins/CMakeLists.txt
40 ↗(On Diff #79986)

I'm now using LIBOMPTARGET_LIBDIR_SUFFIX as defined in the root CMakeLists.txt.

libomptarget/plugins/cuda/CMakeLists.txt
22 ↗(On Diff #79986)

Done.

31 ↗(On Diff #79986)

Same as above.

libomptarget/plugins/cuda/src/rtl.cpp
593–600 ↗(On Diff #41338)

Right, in fact we've already had complaints about spurious time measurements. I have added a cudaDeviceSynchronize after the kernel launch.

25–31 ↗(On Diff #79986)

TARGET_NAME is set via cmake. We could hadcode it, but the current scheme gives the flexibility to set the name at will (e.g. add version info, etc.) without tampering with the source code.

46–52 ↗(On Diff #79986)

Leftover from initial code, I've removed it in the new diff.

197–213 ↗(On Diff #79986)

The next revision of the standard will define env vars for setting the desired number of teams and team limit. We will then update the plugin with the final names. For now, let's leave those temporary names as they are already used by our buildbot for certain tests.

496 ↗(On Diff #79986)

Fixed.

684 ↗(On Diff #79986)

No, for the time being we have no plans of using dynamically allocated shared memory. In order to eliminate confusion I've removed this variable in the new diff and I use a hard-coded 0 when invoking the kernel launch.

libomptarget/plugins/generic-elf-64bit/src/rtl.cpp
50–56 ↗(On Diff #79986)

Same as above.

129–170 ↗(On Diff #79986)

True, it's the very same code but it just happens to be so. Another plugin for a different architecture may implement this interface function in a different way. Plugins are meant to be developed independently from each other and their code should not reply upon the existence of some common functionality / refactored code.

Some minor last comments

libomptarget/plugins/cuda/CMakeLists.txt
22 ↗(On Diff #79986)

I think this should then be lowercase debug to exactly match what's done in the parent directory

libomptarget/plugins/cuda/src/rtl.cpp
25–31 ↗(On Diff #79986)

All right. But please change the fallback to say CUDA

libomptarget/plugins/generic-elf-64bit/src/rtl.cpp
129–170 ↗(On Diff #79986)

Maybe we can put a function like elf_check_machine(__tgt_device_image *image, uint16_t target) into a file plugins/common/elf.c and compile it into both plugins? The problem with duplicated code is that you always forget to sync them if you change something.

grokos updated this revision to Diff 86368.Jan 30 2017, 4:48 PM
grokos marked 3 inline comments as done.

Addressed the 3 comments from last review.

libomptarget/plugins/cuda/CMakeLists.txt
22 ↗(On Diff #79986)

Good catch! I've fixed it.

libomptarget/plugins/cuda/src/rtl.cpp
25–31 ↗(On Diff #79986)

Done.

libomptarget/plugins/generic-elf-64bit/src/rtl.cpp
129–170 ↗(On Diff #79986)

OK, I created a new elf_common.c under plugins/common defining this function. The file is inlined into both plugins (to inherit the definition of macro DP for each plugin).

Hahnfeld accepted this revision.Jan 30 2017, 11:12 PM

LGTM. Thanks for all the work that goes into this!

This revision is now accepted and ready to land.Jan 30 2017, 11:12 PM
This revision was automatically updated to reflect the committed changes.