Page MenuHomePhabricator

[OpenMP][Offloading] Added support for multiple streams so that multiple kernels can be executed concurrently
ClosedPublic

Authored by tianshilei1992 on Feb 6 2020, 10:43 AM.

Details

Summary

It will initialize a number of streams for each device at first. The number can be configured via environment variable LIBOMPTARGET_NUM_STREAMS. For each kernel submission, a stream will be selected in a round-robin manner.

Diff Detail

Event Timeline

tianshilei1992 created this revision.Feb 6 2020, 10:43 AM

Thanks! Two comments below.

@ye-luo once the memory transfers are attached to a stream you should be able to offload synchronously from multiple threads at the same time. Could you pull the patch and test it?

openmp/libomptarget/plugins/cuda/src/rtl.cpp
95

Make it uint please.

528

We need the async versions at the HtoD and at the DtoH sides to use the streams. After the async call we directly have to wait for the stream to make it synchronous but on as specific stream.

ye-luo added a comment.Feb 7 2020, 1:33 AM

@jdoerfert I can try it on a test program. miniQMC is choked by the linker at the moment. Is the "map" thread-safe now?

openmp/libomptarget/plugins/cuda/src/rtl.cpp
528

In this direction, the H2D, kernel and D2H optimally can be scheduled as a whole entity in the tasking runtime and use the same stream if they are on the same OpenMP pragma line.

ye-luo added a comment.Feb 7 2020, 4:21 AM

Caught another issue.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
861

This synchronization should be replaced with stream wait.

@jdoerfert I can try it on a test program. miniQMC is choked by the linker at the moment. Is the "map" thread-safe now?

Map should be thread safe, yes.

tianshilei1992 marked an inline comment as done.Feb 7 2020, 8:43 AM
tianshilei1992 added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
95

Right, in case of integer overflow, my bad...

Adding Ron to the list as he's maintaining the amdgcn equivalent to this

Are cuda streams available on all versions of cuda that the rest of openmp works from? I'm not sure when they were introduced.

A couple of minor comments inline. This seems to be a fairly straightforward wrapper over the cuda functionality.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
95

vector of pointers to atomic_int is interesting. What's the advantage over vector<atomic_int>?

It might be worth putting a few asserts in the code to the effect that resizing the vector after the initial construction will break access from other threads.

262

If we do need the pointer wrapper, this should be make_unique

tianshilei1992 marked 2 inline comments as done.Feb 7 2020, 10:01 AM
tianshilei1992 added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
95

atomic_int is not copyable. And the initialization of all these pointers are after the resize of vector, so we might not need to consider that.

262

make_unique only works since C++14.

tianshilei1992 added inline comments.Feb 7 2020, 10:27 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
861

Are you referring to cudaStreamWaitEvent?

jdoerfert added inline comments.Feb 7 2020, 10:31 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
262

Do we have llvm::make_unique? But maybe not necessarily good to use it here anyway. @jon ok to stick with this for now?

ye-luo added inline comments.Feb 7 2020, 10:52 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
861

I mean cuStreamSynchronize

openmp/libomptarget/plugins/cuda/src/rtl.cpp
95

Sure, but you're not copying the element anywhere.

Sadly I think we would need to provide the size of the vector up front. reserve() calls copy constructors (which I didn't expect) and they're deleted for atomic_int. I'm not sure the cuda api will permit that.

Which leads to the suggestion:

std::unique_ptr<std::atomic_int[]>> NextStreamId;
// ...
NextStreamId = std::make_unique<std::atomic_int[]>(NumberOfDevices);

This elides the NumberOfDevices heap allocations and the associated indirection on every access and makes it somewhat more obvious that we can't call various vector api functions.

It has the disadvantage that the integers will now definitely be in the same cache line, whereas previously there was a chance that the allocator would put them on different cache lines.

Overall I'm fine with either structure.

262

llvm::make_unique was removed by D66259, as we're now assuming C++14. They're semantically identical in this context so it doesn't matter much.

tianshilei1992 added inline comments.Feb 7 2020, 11:43 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
262

Do you mean that we can assume -std=c++14 is always true?

861

Oh, I got you. Good one, in case of blocking other threads, although the offloading have finished.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
262

Other files in LLVM won't build with c++11 any more so >=14 seems a safe bet.

tianshilei1992 marked an inline comment as done.Feb 7 2020, 12:21 PM
tianshilei1992 added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
262

That is cool! Thanks for the information. Will update this part correspondingly.

tianshilei1992 marked an inline comment as not done.Feb 7 2020, 12:48 PM
tianshilei1992 added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
262

Well, I just tried with make_unique but it turns out we're still using C++11 actually.

FAILED: libomptarget/plugins/cuda/CMakeFiles/omptarget.rtl.cuda.dir/src/rtl.cpp.o
/home/shiltian/.local/bin/clang++  -DOMPTARGET_DEBUG -DTARGET_NAME=CUDA -Domptarget_rtl_cuda_EXPORTS -I/home/shiltian/Documents/clion/llvm-project/openmp/libomptarget/include -I/opt/cuda/10.1/include -Wall -Wcast-qual -Wformat-pedantic -Wimplicit-fallthrough -Wsign-compare -Wno-extra -Wno-pedantic -std=gnu++11 -g -fPIC -MD -MT libomptarget/plugins/cuda/CMakeFiles/omptarget.rtl.cuda.dir/src/rtl.cpp.o -MF libomptarget/plugins/cuda/CMakeFiles/omptarget.rtl.cuda.dir/src/rtl.cpp.o.d -o libomptarget/plugins/cuda/CMakeFiles/omptarget.rtl.cuda.dir/src/rtl.cpp.o -c /home/shiltian/Documents/clion/llvm-project/openmp/libomptarget/plugins/cuda/src/rtl.cpp
/home/shiltian/Documents/clion/llvm-project/openmp/libomptarget/plugins/cuda/src/rtl.cpp:259:18: error: no member named 'make_unique' in namespace 'std'
      Ptr = std::make_unique<std::atomic_uint>(0);
            ~~~~~^
/home/shiltian/Documents/clion/llvm-project/openmp/libomptarget/plugins/cuda/src/rtl.cpp:259:46: error: expected '(' for function-style cast or type construction
      Ptr = std::make_unique<std::atomic_uint>(0);
                             ~~~~~~~~~~~~~~~~^
2 errors generated.
ninja: build stopped: subcommand failed.
jdoerfert added inline comments.Feb 7 2020, 1:27 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
262

Change the cmake in a separate commit. Llvm is on 14.

tianshilei1992 added inline comments.Feb 7 2020, 1:32 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
262

So OpenMP will also switch to C++ 14 in a near future?

JonChesterfield added inline comments.Feb 7 2020, 1:45 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
262

Sounds good to me. Yep, let's change the cmake now.

tianshilei1992 added inline comments.Feb 7 2020, 2:55 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
262

We will probably need "version 2" functions soon which take additional information, e.g., the stream to be used. I would suggest to test this as is and merge it before we go there. It should already allow overlap between threads that offload. The "version 2" will only shrink the overhead per thread. That said, we are working on the nowait support so there will be other changes soon anyway.

@ye-luo Do you have a way to test this or do we need to fix the linker issue first?

openmp/libomptarget/plugins/cuda/src/rtl.cpp
249

The hardware will cap the number internally anyway so we should go higher here. Maybe 256?

tianshilei1992 added a comment.EditedFeb 8 2020, 1:43 PM

I did a little experiment to show the performance improvement. Here is the micro benchmark:

#include <math.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void kernel() {
  const int num_threads = 64;

#pragma omp parallel for
  for (int i = 0; i < num_threads; ++i) {
    const size_t N = 1UL << 10;

#pragma omp target teams distribute parallel for
    for (size_t i = 0; i < N; ++i) {
      for (size_t j = 0; j < N / 2; ++j) {
        float x = sqrt(pow(3.14159, j));
      }
    }
  }
}

int main(int argc, char *argv[]) {
  const int N = 1000;

  const clock_t start = clock();

  for (int i = 0; i < N; ++i) {
    kernel();
  }

  const clock_t duration = (clock() - start) * 1000 / CLOCKS_PER_SEC / N;

  printf("Avg time: %ld ms\n", duration);

  return 0;
}

The execution result with multiple stream is:

$ /usr/local/cuda/bin/nvprof --output-profile parallel_offloading_ms.prof -f ./parallel_offloading
==32397== NVPROF is profiling process 32397, command: ./parallel_offloading
Avg time: 1081 ms
==32397== Generated result file: /home/shiltian/Documents/project/multiple_streams/tests/multistreams/parallel_offloading_ms.prof

And the result w/o multiple stream is:

$ /usr/local/cuda/bin/nvprof --output-profile parallel_offloading.prof -f ./parallel_offloading
==35547== NVPROF is profiling process 35547, command: ./parallel_offloading
Avg time: 5825 ms
==35547== Generated result file: /home/shiltian/Documents/project/multiple_streams/tests/multistreams/parallel_offloading.prof

We can see that 1081 vs 5825 ms, approximately 5.4x speedup.

We will probably need "version 2" functions soon which take additional information, e.g., the stream to be used. I would suggest to test this as is and merge it before we go there. It should already allow overlap between threads that offload. The "version 2" will only shrink the overhead per thread. That said, we are working on the nowait support so there will be other changes soon anyway.

Yes, later we will take stream it previous used for data transfer into consideration when selecting stream for kernel, and other potential optimization.

tianshilei1992 added inline comments.Feb 8 2020, 2:13 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
249

Sure

ye-luo added a comment.Feb 8 2020, 5:46 PM

We will probably need "version 2" functions soon which take additional information, e.g., the stream to be used. I would suggest to test this as is and merge it before we go there. It should already allow overlap between threads that offload. The "version 2" will only shrink the overhead per thread. That said, we are working on the nowait support so there will be other changes soon anyway.

@ye-luo Do you have a way to test this or do we need to fix the linker issue first?

My standalone code can be used to verify multi-stream concurrent execution and whether transfer and execution use the same stream by profiling with nvprof.

We will probably need "version 2" functions soon which take additional information, e.g., the stream to be used. I would suggest to test this as is and merge it before we go there. It should already allow overlap between threads that offload. The "version 2" will only shrink the overhead per thread. That said, we are working on the nowait support so there will be other changes soon anyway.

@ye-luo Do you have a way to test this or do we need to fix the linker issue first?

My standalone code can be used to verify multi-stream concurrent execution and whether transfer and execution use the same stream by profiling with nvprof.

What standalone code? Can you run it with this patch? The transfer will use a different stream for now but it should be OK for now. "version 2" will do the same stream.

Add a new test case to check map is working correct.

I'm fine with this, anyone else?

We need to lose std::make_unique before landing as the C++11 = > C++14 move has proven contentious. Otherwise LGTM.

tianshilei1992 added a comment.EditedFeb 9 2020, 11:35 AM

We need to lose std::make_unique before landing as the C++11 = > C++14 move has proven contentious. Otherwise LGTM.

I can use a macro here like #if __cplusplus < 201402L.

Add a backup statement in case that the library is not compiled with C++14

Add a backup statement in case that the library is not compiled with C++14

I would prefer not to do this. Let's wait till Monday and replies on the RFC.

I was thinking of going back to the explicit new, not an ifdef on c++ version. Then we can land this now and optionally revisit once the codebase moves to 14.

ye-luo added a comment.Feb 9 2020, 8:58 PM

I tested the patch. The stream of H2D, D2H and compute behaves asynchronously as expected.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
249

I don't like this choice. The hardware limit is 32 which is preferred. Users can play with environment variable if they need more.
On the nvprof, it is impossible to digest 256 streams from OpenMP plus other application streams.

jdoerfert accepted this revision.Feb 9 2020, 10:32 PM

I tested the patch. The stream of H2D, D2H and compute behaves asynchronously as expected.

I do accept this pending D74258 and the C++14 RFC. If they go through the version of this patch that uses C++14 is fine.

We can discuss and modify the stream number afterwards as necessary (assuming we don't find a consensus now).
This patch is strictly positive so we should work from here.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
249

@ye-luo Do you experience a downside to 256 streams?

There should not be a performance problem but it should help us to be future and backwards compatible.

This revision is now accepted and ready to land.Feb 9 2020, 10:32 PM
ye-luo added inline comments.Feb 10 2020, 12:34 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
249

I don't have strong evidence about performance impact. I though more streams should cost the driver a bit more to monitor and schedule workload to the hardware.

jdoerfert added inline comments.Feb 10 2020, 8:11 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
249

I would expect, or maybe hope, that the driver just does the modulo internally. There is no point in tracking more than the number of hardware streams so why would they. To that end they can just do hw_stream = user_stream % num_hw_streams, which would make sense because it is portable (=backwards/future compatible).

openmp/libomptarget/plugins/cuda/src/rtl.cpp
182

It looks like DeviceID should be unsigned here

openmp/libomptarget/plugins/cuda/src/rtl.cpp
182

Well, yes, it should be. But if you take a look at what they're used, for example at line 725, you can see the declaration is int32_t device_id.

I'll commit this one and D74258 later.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
182

we make it an unsigned here. I can do that before I commit as well.

grokos added inline comments.Feb 11 2020, 12:01 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
182

Well, strictly speaking, device IDs in libomptarget are signed. E.g. the default device has an ID of -1 and the host device has ID -10. On the other hand, such negative values should never reach the plugin, if that ever happens then something is buggy in the base library. So it's really up to you to either keep the signed flavor or switch to unsigned.

jdoerfert added inline comments.Feb 11 2020, 12:54 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
182

signed + assertion(id >= 0) ?

tianshilei1992 added inline comments.Feb 11 2020, 1:06 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
182

It would be better we put this check in each API call.

jdoerfert added inline comments.Feb 11 2020, 1:14 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
182

true, we add them in (a) different commit(s) though. Can you add the check to the assert you have below?
(Nit: you can also use int(NextStreamId.size()) to save some characters)

jdoerfert added inline comments.Feb 11 2020, 5:44 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
266

No ifdef needed anymore, C++14 is here.

The patch looks good to me. One final comment: is LIBOMPTARGET_NUM_STREAMS the most appropriate name for the new env var? Because it targets the CUDA plugin specifically, should we change the name to something like LIBOMPTARGET_CUDA_NUM_STREAMS?

The patch looks good to me. One final comment: is LIBOMPTARGET_NUM_STREAMS the most appropriate name for the new env var? Because it targets the CUDA plugin specifically, should we change the name to something like LIBOMPTARGET_CUDA_NUM_STREAMS?

My idea is the concept of stream is widely used in different platforms. They might use different terminology.

The patch looks good to me. One final comment: is LIBOMPTARGET_NUM_STREAMS the most appropriate name for the new env var? Because it targets the CUDA plugin specifically, should we change the name to something like LIBOMPTARGET_CUDA_NUM_STREAMS?

My idea is the concept of stream is widely used in different platforms. They might use different terminology.

Let's not complicate the name and just interpret "STREAMS" as whatever the equivalent of the platform is. That way we can have a single environment variable.

This revision was automatically updated to reflect the committed changes.