This is an archive of the discontinued LLVM Phabricator instance.

Differential D74145

[OpenMP][Offloading] Added support for multiple streams so that multiple kernels can be executed concurrently
Closed · Public

Authored by tianshilei1992 on Feb 6 2020, 10:43 AM.

Details

Reviewers

jdoerfert
JonChesterfield
grokos
ronlieb

Commits

rGa5153dbc368e: [OpenMP][Offloading] Added support for multiple streams so that multiple…

Summary

It will initialize a number of streams for each device at first. The number can be configured via environment variable LIBOMPTARGET_NUM_STREAMS. For each kernel submission, a stream will be selected in a round-robin manner.
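The round-robin selection described in the summary can be sketched roughly as below. This is only an illustration, not the code from the patch: `StreamTy` is a hypothetical stand-in for `CUstream` so the sketch builds without CUDA, and `StreamPool` and `parseNumStreams` are made-up names. The counter is unsigned so that wraparound after many submissions stays well defined.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <vector>

// Hypothetical stand-in for CUstream so the sketch builds without CUDA.
using StreamTy = int;

// Hypothetical helper: parse the value of LIBOMPTARGET_NUM_STREAMS.
// Returns Default when the variable is unset or not a positive integer.
unsigned parseNumStreams(const char *Env, unsigned Default) {
  if (!Env)
    return Default;
  char *End = nullptr;
  const long V = std::strtol(Env, &End, 10);
  return (End != Env && V > 0) ? static_cast<unsigned>(V) : Default;
}

class StreamPool {
  std::vector<std::vector<StreamTy>> Streams;                 // [device][slot]
  std::vector<std::unique_ptr<std::atomic<unsigned>>> NextId; // one per device

public:
  StreamPool(int NumDevices, unsigned NumStreams)
      : Streams(NumDevices), NextId(NumDevices) {
    for (int D = 0; D < NumDevices; ++D) {
      Streams[D].resize(NumStreams);
      // Fake handles; a real plugin would create device streams here.
      for (unsigned S = 0; S < NumStreams; ++S)
        Streams[D][S] = D * 1000 + static_cast<int>(S);
      NextId[D].reset(new std::atomic<unsigned>(0));
    }
  }

  // Pick the next stream of a device in a round-robin manner.
  StreamTy &next(int DeviceId) {
    assert(DeviceId >= 0 &&
           static_cast<std::size_t>(DeviceId) < NextId.size() &&
           "Unexpected device id!");
    const unsigned Id = NextId[DeviceId]->fetch_add(1);
    return Streams[DeviceId][Id % Streams[DeviceId].size()];
  }
};
```

A caller would construct the pool once, for example with `parseNumStreams(std::getenv("LIBOMPTARGET_NUM_STREAMS"), 256)` as the stream count, and call `next(DeviceId)` for each kernel submission.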

Event Timeline

tianshilei1992 created this revision.Feb 6 2020, 10:43 AM

Herald added subscribers: openmp-commits, jfb, guansong. Feb 6 2020, 10:43 AM

tianshilei1992 updated this revision to Diff 242948.Feb 6 2020, 10:44 AM

Thanks! Two comments below.

@ye-luo once the memory transfers are attached to a stream you should be able to offload synchronously from multiple threads at the same time. Could you pull the patch and test it?

openmp/libomptarget/plugins/cuda/src/rtl.cpp
95	Make it `uint` please.
532	We need the async versions on both the HtoD and the DtoH sides to use the streams. After the async call we have to wait for the stream right away, which keeps the transfer synchronous but confined to a specific stream.

jdoerfert added reviewers: JonChesterfield, grokos.Feb 7 2020, 12:16 AM

@jdoerfert I can try it on a test program. miniQMC is choked by the linker at the moment. Is the "map" thread-safe now?

openmp/libomptarget/plugins/cuda/src/rtl.cpp
532	Along these lines, the H2D transfer, the kernel and the D2H transfer can ideally be scheduled as a single entity in the tasking runtime and use the same stream if they come from the same OpenMP pragma line.

Caught another issue.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
865	This synchronization should be replaced with stream wait.

In D74145#1863379, @ye-luo wrote:

@jdoerfert I can try it on a test program. miniQMC is choked by the linker at the moment. Is the "map" thread-safe now?

Map should be thread safe, yes.

tianshilei1992 marked an inline comment as done.Feb 7 2020, 8:43 AM

tianshilei1992 added inline comments.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
95	Right, in case of integer overflow, my bad...

Adding Ron to the list as he's maintaining the amdgcn equivalent to this

Are cuda streams available on all versions of cuda that the rest of openmp works from? I'm not sure when they were introduced.

A couple of minor comments inline. This seems to be a fairly straightforward wrapper over the cuda functionality.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
95	vector of pointers to atomic_int is interesting. What's the advantage over vector<atomic_int>? It might be worth putting a few asserts in the code to the effect that resizing the vector after the initial construction will break access from other threads.
262	If we do need the pointer wrapper, this should be make_unique

tianshilei1992 marked 2 inline comments as done.Feb 7 2020, 10:01 AM

tianshilei1992 added inline comments.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
95	`atomic_int` is not copyable. And all these pointers are initialized after the vector is resized, so we might not need to consider that.
262	`make_unique` only works since C++14.

tianshilei1992 added inline comments.Feb 7 2020, 10:27 AM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
865	Are you referring to `cudaStreamWaitEvent`?

jdoerfert added inline comments.Feb 7 2020, 10:31 AM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
262	Do we have llvm::make_unique? But maybe not necessarily good to use it here anyway. @jon ok to stick with this for now?

ye-luo added inline comments.Feb 7 2020, 10:52 AM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
865	I mean cuStreamSynchronize

JonChesterfield added inline comments.Feb 7 2020, 11:27 AM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
95	Sure, but you're not copying the element anywhere. Sadly I think we would need to provide the size of the vector up front. reserve() calls copy constructors (which I didn't expect) and they're deleted for atomic_int. I'm not sure the cuda api will permit that. Which leads to the suggestion:

std::unique_ptr<std::atomic_int[]> NextStreamId;
// ...
NextStreamId = std::make_unique<std::atomic_int[]>(NumberOfDevices);

This elides the NumberOfDevices heap allocations and the associated indirection on every access and makes it somewhat more obvious that we can't call various vector api functions. It has the disadvantage that the integers will now definitely be in the same cache line, whereas previously there was a chance that the allocator would put them on different cache lines. Overall I'm fine with either structure.
262	llvm::make_unique was removed by D66259, as we're now assuming C++14. They're semantically identical in this context so it doesn't matter much.
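The single-allocation layout suggested in the comment on line 95 above might look roughly like the following. It is a sketch under assumptions, not code from the patch: `StreamCounters` and `fetchNext` are hypothetical names, and an explicit `new[]` is used instead of `std::make_unique` so that it also builds as C++11.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical holder: one contiguous array of counters, one per device.
struct StreamCounters {
  std::unique_ptr<std::atomic<unsigned>[]> Next;
  std::size_t Size;

  explicit StreamCounters(std::size_t NumDevices)
      : Next(new std::atomic<unsigned>[NumDevices]), Size(NumDevices) {
    // Store zero explicitly rather than relying on how new[] initialized
    // the elements.
    for (std::size_t I = 0; I < Size; ++I)
      Next[I].store(0);
  }

  // Return the current counter of a device and advance it by one.
  unsigned fetchNext(std::size_t Device) {
    assert(Device < Size && "Unexpected device id!");
    return Next[Device].fetch_add(1);
  }
};
```

Compared with a vector of `unique_ptr`, this uses one heap allocation and no extra indirection per access, at the cost of placing neighbouring counters in the same cache line, as the comment points out.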

tianshilei1992 added inline comments.Feb 7 2020, 11:43 AM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
262	Do you mean that we can assume -std=c++14 is always true?
865	Oh, I got you. Good one: that avoids blocking a thread on other threads' work even though its own offloading has finished.

JonChesterfield added inline comments.Feb 7 2020, 12:02 PM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
262	Other files in LLVM won't build with c++11 any more so >=14 seems a safe bet.

tianshilei1992 marked an inline comment as done.Feb 7 2020, 12:21 PM

tianshilei1992 added inline comments.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
262	That is cool! Thanks for the information. Will update this part correspondingly.

tianshilei1992 marked an inline comment as not done.Feb 7 2020, 12:48 PM

tianshilei1992 added inline comments.

openmp/libomptarget/plugins/cuda/src/rtl.cpp

262

Well, I just tried with make_unique but it turns out we're still using C++11 actually.

FAILED: libomptarget/plugins/cuda/CMakeFiles/omptarget.rtl.cuda.dir/src/rtl.cpp.o
/home/shiltian/.local/bin/clang++  -DOMPTARGET_DEBUG -DTARGET_NAME=CUDA -Domptarget_rtl_cuda_EXPORTS -I/home/shiltian/Documents/clion/llvm-project/openmp/libomptarget/include -I/opt/cuda/10.1/include -Wall -Wcast-qual -Wformat-pedantic -Wimplicit-fallthrough -Wsign-compare -Wno-extra -Wno-pedantic -std=gnu++11 -g -fPIC -MD -MT libomptarget/plugins/cuda/CMakeFiles/omptarget.rtl.cuda.dir/src/rtl.cpp.o -MF libomptarget/plugins/cuda/CMakeFiles/omptarget.rtl.cuda.dir/src/rtl.cpp.o.d -o libomptarget/plugins/cuda/CMakeFiles/omptarget.rtl.cuda.dir/src/rtl.cpp.o -c /home/shiltian/Documents/clion/llvm-project/openmp/libomptarget/plugins/cuda/src/rtl.cpp
/home/shiltian/Documents/clion/llvm-project/openmp/libomptarget/plugins/cuda/src/rtl.cpp:259:18: error: no member named 'make_unique' in namespace 'std'
      Ptr = std::make_unique<std::atomic_uint>(0);
            ~~~~~^
/home/shiltian/Documents/clion/llvm-project/openmp/libomptarget/plugins/cuda/src/rtl.cpp:259:46: error: expected '(' for function-style cast or type construction
      Ptr = std::make_unique<std::atomic_uint>(0);
                             ~~~~~~~~~~~~~~~~^
2 errors generated.
ninja: build stopped: subcommand failed.

jdoerfert added inline comments.Feb 7 2020, 1:27 PM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
262	Change the cmake in a separate commit. Llvm is on 14.

tianshilei1992 added inline comments.Feb 7 2020, 1:32 PM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
262	So OpenMP will also switch to C++14 in the near future?

JonChesterfield added inline comments.Feb 7 2020, 1:45 PM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
262	Sounds good to me. Yep, let's change the cmake now.

tianshilei1992 added inline comments.Feb 7 2020, 2:55 PM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
262	https://reviews.llvm.org/D74258

tianshilei1992 updated this revision to Diff 243403.Feb 8 2020, 12:43 PM

We will probably need "version 2" functions soon which take additional information, e.g., the stream to be used. I would suggest to test this as is and merge it before we go there. It should already allow overlap between threads that offload. The "version 2" will only shrink the overhead per thread. That said, we are working on the nowait support so there will be other changes soon anyway.

@ye-luo Do you have a way to test this or do we need to fix the linker issue first?

openmp/libomptarget/plugins/cuda/src/rtl.cpp
249	The hardware will cap the number internally anyway so we should go higher here. Maybe 256?

I did a little experiment to show the performance improvement. Here is the micro benchmark:

#include <math.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void kernel() {
  const int num_threads = 64;

#pragma omp parallel for
  for (int i = 0; i < num_threads; ++i) {
    const size_t N = 1UL << 10;

#pragma omp target teams distribute parallel for
    for (size_t i = 0; i < N; ++i) {
      for (size_t j = 0; j < N / 2; ++j) {
        float x = sqrt(pow(3.14159, j));
      }
    }
  }
}

int main(int argc, char *argv[]) {
  const int N = 1000;

  const clock_t start = clock();

  for (int i = 0; i < N; ++i) {
    kernel();
  }

  const clock_t duration = (clock() - start) * 1000 / CLOCKS_PER_SEC / N;

  printf("Avg time: %ld ms\n", duration);

  return 0;
}

The execution result with multiple streams is:

$ /usr/local/cuda/bin/nvprof --output-profile parallel_offloading_ms.prof -f ./parallel_offloading
==32397== NVPROF is profiling process 32397, command: ./parallel_offloading
Avg time: 1081 ms
==32397== Generated result file: /home/shiltian/Documents/project/multiple_streams/tests/multistreams/parallel_offloading_ms.prof

And the result without multiple streams is:

$ /usr/local/cuda/bin/nvprof --output-profile parallel_offloading.prof -f ./parallel_offloading
==35547== NVPROF is profiling process 35547, command: ./parallel_offloading
Avg time: 5825 ms
==35547== Generated result file: /home/shiltian/Documents/project/multiple_streams/tests/multistreams/parallel_offloading.prof

We can see 1081 ms vs. 5825 ms, approximately a 5.4x speedup.

In D74145#1865837, @jdoerfert wrote:

We will probably need "version 2" functions soon which take additional information, e.g., the stream to be used. I would suggest to test this as is and merge it before we go there. It should already allow overlap between threads that offload. The "version 2" will only shrink the overhead per thread. That said, we are working on the nowait support so there will be other changes soon anyway.

Yes, later we will take the stream previously used for the data transfer into consideration when selecting a stream for the kernel, along with other potential optimizations.

tianshilei1992 added inline comments.Feb 8 2020, 2:13 PM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
249	Sure

tianshilei1992 updated this revision to Diff 243409.Feb 8 2020, 2:24 PM

In D74145#1865837, @jdoerfert wrote:

We will probably need "version 2" functions soon which take additional information, e.g., the stream to be used. I would suggest to test this as is and merge it before we go there. It should already allow overlap between threads that offload. The "version 2" will only shrink the overhead per thread. That said, we are working on the nowait support so there will be other changes soon anyway.

@ye-luo Do you have a way to test this or do we need to fix the linker issue first?

My standalone code can be used to verify multi-stream concurrent execution and whether transfer and execution use the same stream by profiling with nvprof.

In D74145#1865927, @ye-luo wrote:

In D74145#1865837, @jdoerfert wrote:

We will probably need "version 2" functions soon which take additional information, e.g., the stream to be used. I would suggest to test this as is and merge it before we go there. It should already allow overlap between threads that offload. The "version 2" will only shrink the overhead per thread. That said, we are working on the nowait support so there will be other changes soon anyway.

@ye-luo Do you have a way to test this or do we need to fix the linker issue first?

My standalone code can be used to verify multi-stream concurrent execution and whether transfer and execution use the same stream by profiling with nvprof.

What standalone code? Can you run it with this patch? The transfer will use a different stream for now, but that should be OK. "version 2" will use the same stream.

Add a new test case to check that map works correctly.

I'm fine with this, anyone else?

We need to lose std::make_unique before landing as the C++11 => C++14 move has proven contentious. Otherwise LGTM.

In D74145#1866192, @JonChesterfield wrote:

We need to lose std::make_unique before landing as the C++11 => C++14 move has proven contentious. Otherwise LGTM.

I can use a macro here like #if __cplusplus < 201402L.
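A version check of that kind could be sketched as follows. The helper name `makeCounter` is hypothetical and only serves the illustration; the idea is simply to fall back to an explicit `new` when the translation unit is compiled as C++11.

```cpp
#include <atomic>
#include <memory>

// Hypothetical helper: create a counter initialized to zero under either
// language standard.
inline std::unique_ptr<std::atomic_uint> makeCounter() {
#if __cplusplus < 201402L
  // C++11: std::make_unique is not available, so use an explicit new.
  return std::unique_ptr<std::atomic_uint>(new std::atomic_uint(0));
#else
  // C++14 and later.
  return std::make_unique<std::atomic_uint>(0);
#endif
}
```

Both branches produce the same object, so the explicit `new` alone would also work under either standard and avoids the preprocessor check entirely.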

Add a fallback statement in case the library is not compiled with C++14.

In D74145#1866224, @tianshilei1992 wrote:

Add a fallback statement in case the library is not compiled with C++14.

I would prefer not to do this. Let's wait till Monday and for replies on the RFC.

I was thinking of going back to the explicit new, not an ifdef on c++ version. Then we can land this now and optionally revisit once the codebase moves to 14.

I tested the patch. The stream of H2D, D2H and compute behaves asynchronously as expected.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
249	I don't like this choice. The hardware limit is 32, which is preferred. Users can play with the environment variable if they need more. In nvprof, it is impossible to digest 256 streams from OpenMP plus other application streams.

In D74145#1866382, @ye-luo wrote:

I tested the patch. The stream of H2D, D2H and compute behaves asynchronously as expected.

I do accept this pending D74258 and the C++14 RFC. If they go through, the version of this patch that uses C++14 is fine.

We can discuss and modify the stream number afterwards as necessary (assuming we don't find a consensus now).
This patch is strictly positive so we should work from here.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
249	@ye-luo Do you experience a downside to 256 streams? There should not be a performance problem but it should help us to be future and backwards compatible.

This revision is now accepted and ready to land.Feb 9 2020, 10:32 PM

ye-luo added inline comments.Feb 10 2020, 12:34 AM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
249	I don't have strong evidence about performance impact. I thought more streams would cost the driver a bit more to monitor and schedule workload to the hardware.

jdoerfert added inline comments.Feb 10 2020, 8:11 AM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
249	I would expect, or maybe hope, that the driver just does the modulo internally. There is no point in tracking more than the number of hardware streams so why would they. To that end they can just do `hw_stream = user_stream % num_hw_streams`, which would make sense because it is portable (=backwards/future compatible).

tianshilei1992 updated this revision to Diff 243610.Feb 10 2020, 10:03 AM

JonChesterfield added inline comments.Feb 10 2020, 10:50 AM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
182	It looks like DeviceID should be unsigned here

tianshilei1992 added inline comments.Feb 10 2020, 10:58 AM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
182	Well, yes, it should be. But if you take a look at where they're used, for example at line 725, you can see the declaration is `int32_t device_id`.

I'll commit this one and D74258 later.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
182	we make it an unsigned here. I can do that before I commit as well.

grokos added inline comments.Feb 11 2020, 12:01 PM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
182	Well, strictly speaking, device IDs in libomptarget are signed. E.g. the default device has an ID of -1 and the host device has ID -10. On the other hand, such negative values should never reach the plugin, if that ever happens then something is buggy in the base library. So it's really up to you to either keep the signed flavor or switch to unsigned.

jdoerfert added inline comments.Feb 11 2020, 12:54 PM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
182	signed + assertion(id >= 0) ?

tianshilei1992 added inline comments.Feb 11 2020, 1:06 PM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
182	It would be better to put this check in each API call.

jdoerfert added inline comments.Feb 11 2020, 1:14 PM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
182	true, we add them in (a) different commit(s) though. Can you add the check to the assert you have below? (Nit: you can also use `int(NextStreamId.size())` to save some characters)

tianshilei1992 updated this revision to Diff 243977.Feb 11 2020, 1:21 PM

jdoerfert added inline comments.Feb 11 2020, 5:44 PM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
266	No ifdef needed anymore, C++14 is here.

tianshilei1992 updated this revision to Diff 244049.Feb 11 2020, 6:12 PM

The patch looks good to me. One final comment: is LIBOMPTARGET_NUM_STREAMS the most appropriate name for the new env var? Because it targets the CUDA plugin specifically, should we change the name to something like LIBOMPTARGET_CUDA_NUM_STREAMS?

In D74145#1871257, @grokos wrote:

The patch looks good to me. One final comment: is LIBOMPTARGET_NUM_STREAMS the most appropriate name for the new env var? Because it targets the CUDA plugin specifically, should we change the name to something like LIBOMPTARGET_CUDA_NUM_STREAMS?

My idea is that the concept of a stream is widely used on different platforms; they might just use different terminology.

In D74145#1871269, @tianshilei1992 wrote:

In D74145#1871257, @grokos wrote:

The patch looks good to me. One final comment: is LIBOMPTARGET_NUM_STREAMS the most appropriate name for the new env var? Because it targets the CUDA plugin specifically, should we change the name to something like LIBOMPTARGET_CUDA_NUM_STREAMS?

My idea is that the concept of a stream is widely used on different platforms; they might just use different terminology.

Let's not complicate the name and just interpret "STREAMS" as whatever the equivalent of the platform is. That way we can have a single environment variable.

Closed by commit rGa5153dbc368e: [OpenMP][Offloading] Added support for multiple streams so that multiple… (authored by jdoerfert). Feb 11 2020, 8:13 PM

This revision was automatically updated to reflect the committed changes.

grokos mentioned this in D70010: [OpenMP][Offloading] Replaced default stream with an actual per-device unblocking stream in NVPTX implementation.Apr 6 2020, 7:42 PM

The test started to fail recently

openmp/libomptarget/test/offloading/parallel_offloading_map.c
11	With D89523, this line does not compile and the test fails. As I understand D89523, this assignment is non-conforming. Using #define instead should fix the issue.

Herald added subscribers: sstefan1, yaxunl. Nov 2 2020, 1:14 PM

In D74145#2369400, @protze.joachim wrote:

The test started to fail recently

I'll fix this issue. Thanks for your notification!

ggeorgakoudis added a subscriber: ggeorgakoudis.Nov 9 2020, 12:26 PM

Revision Contents

Path	Size
openmp/libomptarget/plugins/cuda/src/rtl.cpp	112 lines
openmp/libomptarget/test/offloading/parallel_offloading_map.c	41 lines

Diff 243977

openmp/libomptarget/plugins/cuda/src/rtl.cpp

//===----RTLs/cuda/src/rtl.cpp - Target RTLs Implementation ------- C++ -*-===//		//===----RTLs/cuda/src/rtl.cpp - Target RTLs Implementation ------- C++ -*-===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// RTL for CUDA machine		// RTL for CUDA machine
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

		#include <atomic>
#include <cassert>		#include <cassert>
#include <cstddef>		#include <cstddef>
#include <cuda.h>		#include <cuda.h>
#include <list>		#include <list>
		#include <memory>
#include <string>		#include <string>
#include <vector>		#include <vector>

#include "omptargetplugin.h"		#include "omptargetplugin.h"

#ifndef TARGET_NAME		#ifndef TARGET_NAME
#define TARGET_NAME CUDA		#define TARGET_NAME CUDA
#endif		#endif
[... 60 lines not shown ...]

/// List that contains all the kernels.		/// List that contains all the kernels.
/// FIXME: we may need this to be per device and per library.		/// FIXME: we may need this to be per device and per library.
std::list<KernelTy> KernelsList;		std::list<KernelTy> KernelsList;

/// Class containing all the device information.		/// Class containing all the device information.
class RTLDeviceInfoTy {		class RTLDeviceInfoTy {
std::vector<std::list<FuncOrGblEntryTy>> FuncGblEntries;		std::vector<std::list<FuncOrGblEntryTy>> FuncGblEntries;
		std::vector<std::unique_ptr<std::atomic_uint>> NextStreamId;

public:		public:
int NumberOfDevices;		int NumberOfDevices;
std::vector<CUmodule> Modules;		std::vector<CUmodule> Modules;
std::vector<CUcontext> Contexts;		std::vector<CUcontext> Contexts;
		std::vector<std::vector<CUstream>> Streams;

// Device properties		// Device properties
std::vector<int> ThreadsPerBlock;		std::vector<int> ThreadsPerBlock;
std::vector<int> BlocksPerGrid;		std::vector<int> BlocksPerGrid;
std::vector<int> WarpSize;		std::vector<int> WarpSize;

// OpenMP properties		// OpenMP properties
std::vector<int> NumTeams;		std::vector<int> NumTeams;
std::vector<int> NumThreads;		std::vector<int> NumThreads;

// OpenMP Environment properties		// OpenMP Environment properties
int EnvNumTeams;		int EnvNumTeams;
int EnvTeamLimit;		int EnvTeamLimit;
		int EnvNumStreams;

// OpenMP Requires Flags		// OpenMP Requires Flags
int64_t RequiresFlags;		int64_t RequiresFlags;

//static int EnvNumThreads;		//static int EnvNumThreads;
static const int HardTeamLimit = 1<<16; // 64k		static const int HardTeamLimit = 1<<16; // 64k
static const int HardThreadLimit = 1024;		static const int HardThreadLimit = 1024;
static const int DefaultNumTeams = 128;		static const int DefaultNumTeams = 128;
[... 49 lines not shown; context: void clearOffloadEntriesTable(int32_t device_id) { ...]
assert(device_id < (int32_t)FuncGblEntries.size() &&		assert(device_id < (int32_t)FuncGblEntries.size() &&
"Unexpected device id!");		"Unexpected device id!");
FuncGblEntries[device_id].emplace_back();		FuncGblEntries[device_id].emplace_back();
FuncOrGblEntryTy &E = FuncGblEntries[device_id].back();		FuncOrGblEntryTy &E = FuncGblEntries[device_id].back();
E.Entries.clear();		E.Entries.clear();
E.Table.EntriesBegin = E.Table.EntriesEnd = 0;		E.Table.EntriesBegin = E.Table.EntriesEnd = 0;
}		}

		// Get the next stream on a given device in a round robin manner
		CUstream &getNextStream(const int DeviceId) {
		assert(DeviceId >= 0 &&
		static_cast<size_t>(DeviceId) < NextStreamId.size() &&
		"Unexpected device id!");
		const unsigned int Id = NextStreamId[DeviceId]->fetch_add(1);
		return Streams[DeviceId][Id % EnvNumStreams];
		}

RTLDeviceInfoTy() {		RTLDeviceInfoTy() {
#ifdef OMPTARGET_DEBUG		#ifdef OMPTARGET_DEBUG
if (char *envStr = getenv("LIBOMPTARGET_DEBUG")) {		if (char *envStr = getenv("LIBOMPTARGET_DEBUG")) {
DebugLevel = std::stoi(envStr);		DebugLevel = std::stoi(envStr);
}		}
#endif // OMPTARGET_DEBUG		#endif // OMPTARGET_DEBUG

DP("Start initializing CUDA\n");		DP("Start initializing CUDA\n");
[... 16 lines not shown ...]

if (NumberOfDevices == 0) {		if (NumberOfDevices == 0) {
DP("There are no devices supporting CUDA.\n");		DP("There are no devices supporting CUDA.\n");
return;		return;
}		}

FuncGblEntries.resize(NumberOfDevices);		FuncGblEntries.resize(NumberOfDevices);
Contexts.resize(NumberOfDevices);		Contexts.resize(NumberOfDevices);
		Streams.resize(NumberOfDevices);
		NextStreamId.resize(NumberOfDevices);
ThreadsPerBlock.resize(NumberOfDevices);		ThreadsPerBlock.resize(NumberOfDevices);
BlocksPerGrid.resize(NumberOfDevices);		BlocksPerGrid.resize(NumberOfDevices);
WarpSize.resize(NumberOfDevices);		WarpSize.resize(NumberOfDevices);
NumTeams.resize(NumberOfDevices);		NumTeams.resize(NumberOfDevices);
NumThreads.resize(NumberOfDevices);		NumThreads.resize(NumberOfDevices);

// Get environment variables regarding teams		// Get environment variables regarding teams
char *envStr = getenv("OMP_TEAM_LIMIT");		char *envStr = getenv("OMP_TEAM_LIMIT");
if (envStr) {		if (envStr) {
// OMP_TEAM_LIMIT has been set		// OMP_TEAM_LIMIT has been set
EnvTeamLimit = std::stoi(envStr);		EnvTeamLimit = std::stoi(envStr);
DP("Parsed OMP_TEAM_LIMIT=%d\n", EnvTeamLimit);		DP("Parsed OMP_TEAM_LIMIT=%d\n", EnvTeamLimit);
} else {		} else {
EnvTeamLimit = -1;		EnvTeamLimit = -1;
}		}
envStr = getenv("OMP_NUM_TEAMS");		envStr = getenv("OMP_NUM_TEAMS");
if (envStr) {		if (envStr) {
// OMP_NUM_TEAMS has been set		// OMP_NUM_TEAMS has been set
EnvNumTeams = std::stoi(envStr);		EnvNumTeams = std::stoi(envStr);
DP("Parsed OMP_NUM_TEAMS=%d\n", EnvNumTeams);		DP("Parsed OMP_NUM_TEAMS=%d\n", EnvNumTeams);
} else {		} else {
EnvNumTeams = -1;		EnvNumTeams = -1;
}		}

		// By default let's create 256 streams per device
		EnvNumStreams = 256;
		envStr = getenv("LIBOMPTARGET_NUM_STREAMS");
		if (envStr) {
		EnvNumStreams = std::stoi(envStr);
		}

		// Initialize streams for each device
		for (std::vector<CUstream> &S : Streams) {
		S.resize(EnvNumStreams);
		}

		// Initialize the next stream id
		for (std::unique_ptr<std::atomic_uint> &Ptr : NextStreamId) {
		#if __cplusplus < 201402L
		JonChesterfield: If we do need the pointer wrapper, this should be make_unique
		tianshilei1992 (author): `make_unique` only works since C++14.
		jdoerfert: Do we have llvm::make_unique? But maybe not necessarily good to use it here anyway. @jon ok to stick with this for now?
		JonChesterfield: llvm::make_unique was removed by D66259, as we're now assuming C++14. They're semantically identical in this context so it doesn't matter much.
		tianshilei1992 (author): Do you mean that we can assume -std=c++14 is always true?
		JonChesterfield: Other files in LLVM won't build with C++11 any more, so >=14 seems a safe bet.
		tianshilei1992 (author): That is cool! Thanks for the information. Will update this part correspondingly.
		tianshilei1992 (author): Well, I just tried with `make_unique` but it turns out we're still using C++11 actually.
			FAILED: libomptarget/plugins/cuda/CMakeFiles/omptarget.rtl.cuda.dir/src/rtl.cpp.o
			/home/shiltian/.local/bin/clang++ -DOMPTARGET_DEBUG -DTARGET_NAME=CUDA -Domptarget_rtl_cuda_EXPORTS -I/home/shiltian/Documents/clion/llvm-project/openmp/libomptarget/include -I/opt/cuda/10.1/include -Wall -Wcast-qual -Wformat-pedantic -Wimplicit-fallthrough -Wsign-compare -Wno-extra -Wno-pedantic -std=gnu++11 -g -fPIC -MD -MT libomptarget/plugins/cuda/CMakeFiles/omptarget.rtl.cuda.dir/src/rtl.cpp.o -MF libomptarget/plugins/cuda/CMakeFiles/omptarget.rtl.cuda.dir/src/rtl.cpp.o.d -o libomptarget/plugins/cuda/CMakeFiles/omptarget.rtl.cuda.dir/src/rtl.cpp.o -c /home/shiltian/Documents/clion/llvm-project/openmp/libomptarget/plugins/cuda/src/rtl.cpp
			/home/shiltian/Documents/clion/llvm-project/openmp/libomptarget/plugins/cuda/src/rtl.cpp:259:18: error: no member named 'make_unique' in namespace 'std'
			Ptr = std::make_unique<std::atomic_uint>(0);
			~~~~~^
			/home/shiltian/Documents/clion/llvm-project/openmp/libomptarget/plugins/cuda/src/rtl.cpp:259:46: error: expected '(' for function-style cast or type construction
			Ptr = std::make_unique<std::atomic_uint>(0);
			~~~~~~~~~~~~~~~~^
			2 errors generated.
			ninja: build stopped: subcommand failed.
		jdoerfert: Change the cmake in a separate commit. LLVM is on 14.
		tianshilei1992 (author): So OpenMP will also switch to C++14 in the near future?
		JonChesterfield: Sounds good to me. Yep, let's change the cmake now.
		tianshilei1992 (author): https://reviews.llvm.org/D74258
		Ptr = std::unique_ptr<std::atomic_uint>(new std::atomic_uint(0));
		#else
		Ptr = std::make_unique<std::atomic_uint>(0);
		#endif
		jdoerfert: No ifdef needed anymore, C++14 is here.
		}
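With C++14 available, the loop above reduces to a single `std::make_unique` call per device. The sketch below shows that form in isolation. `makeCounters` is a hypothetical helper, and the remark about why a pointer wrapper is used at all is an assumption: `std::atomic` is neither copyable nor movable, which rules out resizing a plain `std::vector<std::atomic_uint>`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical helper: one heap-allocated counter per device. The vector
// stores pointers because std::atomic itself cannot be moved on reallocation.
std::vector<std::unique_ptr<std::atomic_uint>> makeCounters(int NumDevices) {
  assert(NumDevices >= 0 && "device count cannot be negative");
  std::vector<std::unique_ptr<std::atomic_uint>> Counters(
      static_cast<std::size_t>(NumDevices));
  for (std::unique_ptr<std::atomic_uint> &Ptr : Counters)
    Ptr = std::make_unique<std::atomic_uint>(0); // requires C++14 or later
  return Counters;
}
```

As noted in the thread, `std::make_unique<T>(0)` and `std::unique_ptr<T>(new T(0))` are semantically identical here, so the change is purely about dropping the `#if`.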

// Default state.		// Default state.
RequiresFlags = OMP_REQ_UNDEFINED;		RequiresFlags = OMP_REQ_UNDEFINED;
}		}

~RTLDeviceInfoTy() {		~RTLDeviceInfoTy() {
// Close modules		// Close modules
for (auto &module : Modules)		for (auto &module : Modules)
if (module) {		if (module) {
CUresult err = cuModuleUnload(module);		CUresult err = cuModuleUnload(module);
if (err != CUDA_SUCCESS) {		if (err != CUDA_SUCCESS) {
DP("Error when unloading CUDA module\n");		DP("Error when unloading CUDA module\n");
CUDA_ERR_STRING(err);		CUDA_ERR_STRING(err);
}		}
}		}

		// Destroy streams before contexts
		for (int I = 0; I < NumberOfDevices; ++I) {
		CUresult err = cuCtxSetCurrent(Contexts[I]);
		if (err != CUDA_SUCCESS) {
		DP("Error when setting current CUDA context\n");
		CUDA_ERR_STRING(err);
		}

		for (auto &S : Streams[I])
		if (S) {
		err = cuStreamDestroy(S);
		if (err != CUDA_SUCCESS) {
		DP("Error when destroying CUDA stream\n");
		CUDA_ERR_STRING(err);
		}
		}
		}

// Destroy contexts		// Destroy contexts
for (auto &ctx : Contexts)		for (auto &ctx : Contexts)
if (ctx) {		if (ctx) {
CUresult err = cuCtxDestroy(ctx);		CUresult err = cuCtxDestroy(ctx);
if (err != CUDA_SUCCESS) {		if (err != CUDA_SUCCESS) {
DP("Error when destroying CUDA context\n");		DP("Error when destroying CUDA context\n");
CUDA_ERR_STRING(err);		CUDA_ERR_STRING(err);
}		}
[34 lines not shown]	int32_t __tgt_rtl_init_device(int32_t device_id) {
err = cuCtxCreate(&DeviceInfo.Contexts[device_id], CU_CTX_SCHED_BLOCKING_SYNC,		err = cuCtxCreate(&DeviceInfo.Contexts[device_id], CU_CTX_SCHED_BLOCKING_SYNC,
cuDevice);		cuDevice);
if (err != CUDA_SUCCESS) {		if (err != CUDA_SUCCESS) {
DP("Error when creating a CUDA context\n");		DP("Error when creating a CUDA context\n");
CUDA_ERR_STRING(err);		CUDA_ERR_STRING(err);
return OFFLOAD_FAIL;		return OFFLOAD_FAIL;
}		}

		err = cuCtxSetCurrent(DeviceInfo.Contexts[device_id]);
		if (err != CUDA_SUCCESS) {
		DP("Error when setting current CUDA context\n");
		CUDA_ERR_STRING(err);
		}

		for (CUstream &Stream : DeviceInfo.Streams[device_id]) {
		err = cuStreamCreate(&Stream, CU_STREAM_NON_BLOCKING);
		if (err != CUDA_SUCCESS) {
		DP("Error when creating CUDA stream\n");
		CUDA_ERR_STRING(err);
		}
		}

// Query attributes to determine number of threads/block and blocks/grid.		// Query attributes to determine number of threads/block and blocks/grid.
int maxGridDimX;		int maxGridDimX;
err = cuDeviceGetAttribute(&maxGridDimX, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X,		err = cuDeviceGetAttribute(&maxGridDimX, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X,
cuDevice);		cuDevice);
if (err != CUDA_SUCCESS) {		if (err != CUDA_SUCCESS) {
DP("Error getting max grid dimension, use default\n");		DP("Error getting max grid dimension, use default\n");
DeviceInfo.BlocksPerGrid[device_id] = RTLDeviceInfoTy::DefaultNumTeams;		DeviceInfo.BlocksPerGrid[device_id] = RTLDeviceInfoTy::DefaultNumTeams;
} else if (maxGridDimX <= RTLDeviceInfoTy::HardTeamLimit) {		} else if (maxGridDimX <= RTLDeviceInfoTy::HardTeamLimit) {
[150 lines not shown]	if (e->size) {
// check for to and link variables types:		// check for to and link variables types:
// (DeviceInfo.RequiresFlags & OMP_REQ_UNIFIED_SHARED_MEMORY &&		// (DeviceInfo.RequiresFlags & OMP_REQ_UNIFIED_SHARED_MEMORY &&
// (e->flags & OMP_DECLARE_TARGET_LINK ||		// (e->flags & OMP_DECLARE_TARGET_LINK ||
// e->flags == OMP_DECLARE_TARGET_TO))		// e->flags == OMP_DECLARE_TARGET_TO))
if (DeviceInfo.RequiresFlags & OMP_REQ_UNIFIED_SHARED_MEMORY) {		if (DeviceInfo.RequiresFlags & OMP_REQ_UNIFIED_SHARED_MEMORY) {
// If unified memory is present any target link or to variables		// If unified memory is present any target link or to variables
// can access host addresses directly. There is no longer a		// can access host addresses directly. There is no longer a
// need for device copies.		// need for device copies.
cuMemcpyHtoD(cuptr, e->addr, sizeof(void *));		cuMemcpyHtoD(cuptr, e->addr, sizeof(void *));
		jdoerfert: We need the async versions at the HtoD and at the DtoH sides to use the streams. After the async call we directly have to wait for the stream to make it synchronous, but on a specific stream.
		ye-luo: In this direction, the H2D, kernel and D2H optimally can be scheduled as a whole entity in the tasking runtime and use the same stream if they are on the same OpenMP pragma line.
DP("Copy linked variable host address (" DPxMOD ")"		DP("Copy linked variable host address (" DPxMOD ")"
"to device address (" DPxMOD ")\n",		"to device address (" DPxMOD ")\n",
DPxPTR(((void*)e->addr)), DPxPTR(cuptr));		DPxPTR(((void*)e->addr)), DPxPTR(cuptr));
}		}

DeviceInfo.addOffloadEntry(device_id, entry);		DeviceInfo.addOffloadEntry(device_id, entry);

continue;		continue;
[130 lines not shown]	int32_t __tgt_rtl_data_submit(int32_t device_id, void *tgt_ptr, void *hst_ptr,
// Set the context we are using.		// Set the context we are using.
CUresult err = cuCtxSetCurrent(DeviceInfo.Contexts[device_id]);		CUresult err = cuCtxSetCurrent(DeviceInfo.Contexts[device_id]);
if (err != CUDA_SUCCESS) {		if (err != CUDA_SUCCESS) {
DP("Error when setting CUDA context\n");		DP("Error when setting CUDA context\n");
CUDA_ERR_STRING(err);		CUDA_ERR_STRING(err);
return OFFLOAD_FAIL;		return OFFLOAD_FAIL;
}		}

err = cuMemcpyHtoD((CUdeviceptr)tgt_ptr, hst_ptr, size);		CUstream &Stream = DeviceInfo.getNextStream(device_id);

		err = cuMemcpyHtoDAsync((CUdeviceptr)tgt_ptr, hst_ptr, size, Stream);
if (err != CUDA_SUCCESS) {		if (err != CUDA_SUCCESS) {
DP("Error when copying data from host to device. Pointers: host = " DPxMOD		DP("Error when copying data from host to device. Pointers: host = " DPxMOD
", device = " DPxMOD ", size = %" PRId64 "\n", DPxPTR(hst_ptr),		", device = " DPxMOD ", size = %" PRId64 "\n",
DPxPTR(tgt_ptr), size);		DPxPTR(hst_ptr), DPxPTR(tgt_ptr), size);
		CUDA_ERR_STRING(err);
		return OFFLOAD_FAIL;
		}

		err = cuStreamSynchronize(Stream);
		if (err != CUDA_SUCCESS) {
		DP("Error when synchronizing async data transfer from host to device. "
		"Pointers: host = " DPxMOD ", device = " DPxMOD ", size = %" PRId64 "\n",
		DPxPTR(hst_ptr), DPxPTR(tgt_ptr), size);
CUDA_ERR_STRING(err);		CUDA_ERR_STRING(err);
return OFFLOAD_FAIL;		return OFFLOAD_FAIL;
}		}

return OFFLOAD_SUCCESS;		return OFFLOAD_SUCCESS;
}		}

int32_t __tgt_rtl_data_retrieve(int32_t device_id, void *hst_ptr, void *tgt_ptr,		int32_t __tgt_rtl_data_retrieve(int32_t device_id, void *hst_ptr, void *tgt_ptr,
int64_t size) {		int64_t size) {
// Set the context we are using.		// Set the context we are using.
CUresult err = cuCtxSetCurrent(DeviceInfo.Contexts[device_id]);		CUresult err = cuCtxSetCurrent(DeviceInfo.Contexts[device_id]);
if (err != CUDA_SUCCESS) {		if (err != CUDA_SUCCESS) {
DP("Error when setting CUDA context\n");		DP("Error when setting CUDA context\n");
CUDA_ERR_STRING(err);		CUDA_ERR_STRING(err);
return OFFLOAD_FAIL;		return OFFLOAD_FAIL;
}		}

err = cuMemcpyDtoH(hst_ptr, (CUdeviceptr)tgt_ptr, size);		CUstream &Stream = DeviceInfo.getNextStream(device_id);

		err = cuMemcpyDtoHAsync(hst_ptr, (CUdeviceptr)tgt_ptr, size, Stream);
if (err != CUDA_SUCCESS) {		if (err != CUDA_SUCCESS) {
DP("Error when copying data from device to host. Pointers: host = " DPxMOD		DP("Error when copying data from device to host. Pointers: host = " DPxMOD
", device = " DPxMOD ", size = %" PRId64 "\n", DPxPTR(hst_ptr),		", device = " DPxMOD ", size = %" PRId64 "\n",
DPxPTR(tgt_ptr), size);		DPxPTR(hst_ptr), DPxPTR(tgt_ptr), size);
CUDA_ERR_STRING(err);		CUDA_ERR_STRING(err);
return OFFLOAD_FAIL;		return OFFLOAD_FAIL;
}		}

		err = cuStreamSynchronize(Stream);
		if (err != CUDA_SUCCESS) {
		DP("Error when synchronizing async data transfer from device to host. "
		"Pointers: host = " DPxMOD ", device = " DPxMOD ", size = %" PRId64 "\n",
		DPxPTR(hst_ptr), DPxPTR(tgt_ptr), size);
		CUDA_ERR_STRING(err);
		return OFFLOAD_FAIL;
		}

return OFFLOAD_SUCCESS;		return OFFLOAD_SUCCESS;
}		}

int32_t __tgt_rtl_data_delete(int32_t device_id, void *tgt_ptr) {		int32_t __tgt_rtl_data_delete(int32_t device_id, void *tgt_ptr) {
// Set the context we are using.		// Set the context we are using.
CUresult err = cuCtxSetCurrent(DeviceInfo.Contexts[device_id]);		CUresult err = cuCtxSetCurrent(DeviceInfo.Contexts[device_id]);
if (err != CUDA_SUCCESS) {		if (err != CUDA_SUCCESS) {
DP("Error when setting CUDA context\n");		DP("Error when setting CUDA context\n");
[103 lines not shown]	if (team_num <= 0) {
cudaBlocksPerGrid = team_num;		cudaBlocksPerGrid = team_num;
DP("Using requested number of teams %d\n", team_num);		DP("Using requested number of teams %d\n", team_num);
}		}

// Run on the device.		// Run on the device.
DP("Launch kernel with %d blocks and %d threads\n", cudaBlocksPerGrid,		DP("Launch kernel with %d blocks and %d threads\n", cudaBlocksPerGrid,
cudaThreadsPerBlock);		cudaThreadsPerBlock);

		CUstream &Stream = DeviceInfo.getNextStream(device_id);

err = cuLaunchKernel(KernelInfo->Func, cudaBlocksPerGrid, 1, 1,		err = cuLaunchKernel(KernelInfo->Func, cudaBlocksPerGrid, 1, 1,
cudaThreadsPerBlock, 1, 1, 0 /*bytes of shared memory*/, 0, &args[0], 0);		cudaThreadsPerBlock, 1, 1, 0 /*bytes of shared memory*/,
		Stream, &args[0], 0);
if (err != CUDA_SUCCESS) {		if (err != CUDA_SUCCESS) {
DP("Device kernel launch failed!\n");		DP("Device kernel launch failed!\n");
CUDA_ERR_STRING(err);		CUDA_ERR_STRING(err);
return OFFLOAD_FAIL;		return OFFLOAD_FAIL;
}		}

DP("Launch of entry point at " DPxMOD " successful!\n",		DP("Launch of entry point at " DPxMOD " successful!\n",
DPxPTR(tgt_entry_ptr));		DPxPTR(tgt_entry_ptr));

CUresult sync_err = cuCtxSynchronize();		CUresult sync_err = cuStreamSynchronize(Stream);
		ye-luo: This synchronization should be replaced with stream wait.
		tianshilei1992 (author): Are you referring to `cudaStreamWaitEvent`?
		ye-luo: I mean cuStreamSynchronize
		tianshilei1992 (author): Oh, I got you. Good one, it avoids blocking other threads whose offloading has already finished.
if (sync_err != CUDA_SUCCESS) {		if (sync_err != CUDA_SUCCESS) {
DP("Kernel execution error at " DPxMOD "!\n", DPxPTR(tgt_entry_ptr));		DP("Kernel execution error at " DPxMOD "!\n", DPxPTR(tgt_entry_ptr));
CUDA_ERR_STRING(sync_err);		CUDA_ERR_STRING(sync_err);
return OFFLOAD_FAIL;		return OFFLOAD_FAIL;
} else {		} else {
DP("Kernel execution at " DPxMOD " successful!\n", DPxPTR(tgt_entry_ptr));		DP("Kernel execution at " DPxMOD " successful!\n", DPxPTR(tgt_entry_ptr));
}		}

[15 lines not shown]

openmp/libomptarget/test/offloading/parallel_offloading_map.c

This file was added.

				// RUN: %libomptarget-compilexx-run-and-check-aarch64-unknown-linux-gnu
				// RUN: %libomptarget-compilexx-run-and-check-powerpc64-ibm-linux-gnu
				// RUN: %libomptarget-compilexx-run-and-check-powerpc64le-ibm-linux-gnu
				// RUN: %libomptarget-compilexx-run-and-check-x86_64-pc-linux-gnu
				#include <assert.h>
				#include <stdio.h>

				int main(int argc, char *argv[]) {
				const int num_threads = 64, N = 128;
				int array[num_threads] = {0};

				protze.joachim: With D89523, this line does not compile and the test fails. As I understand D89523, this assignment is non-conforming. Using #define instead should fix the issue.
				#pragma omp parallel for
				for (int i = 0; i < num_threads; ++i) {
				int tmp[N];

				for (int j = 0; j < N; ++j) {
				tmp[j] = i;
				}

				#pragma omp target teams distribute parallel for map(tofrom : tmp)
				for (int j = 0; j < N; ++j) {
				tmp[j] += j;
				}

				for (int j = 0; j < N; ++j) {
				array[i] += tmp[j];
				}
				}

				// Verify
				for (int i = 0; i < num_threads; ++i) {
				const int ref = (0 + N - 1) * N / 2 + i * N;
				assert(array[i] == ref);
				}

				printf("PASS\n");

				return 0;
				}

				// CHECK: PASS

Revision Contents

Diff 243977
