The patch initializes a number of streams for each device up front. The number can be configured via the environment variable LIBOMPTARGET_NUM_STREAMS. For each kernel submission, a stream is selected in a round-robin manner.
Thanks! Two comments below.
@ye-luo once the memory transfers are attached to a stream you should be able to offload synchronously from multiple threads at the same time. Could you pull the patch and test it?
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
95 | Make it uint please. | |
525 | We need the async versions on the HtoD and DtoH sides to use the streams. After the async call we directly have to wait for the stream to make it synchronous, but on a specific stream. |
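The pattern described above — an async copy on a stream followed immediately by a stream synchronization — would look roughly like this with the CUDA driver API (a sketch only, not compiled here; error handling is abbreviated and `DevPtr`, `HostPtr`, `Size`, and `Stream` are assumed to be set up elsewhere):

```cpp
// Sketch: submit the copy asynchronously on the chosen stream, then
// block until that stream drains. The transfer is synchronous from the
// caller's point of view, but it is ordered against one specific
// stream rather than the whole device, so other streams keep running.
CUresult Err = cuMemcpyHtoDAsync(DevPtr, HostPtr, Size, Stream);
if (Err == CUDA_SUCCESS)
  Err = cuStreamSynchronize(Stream); // wait only on this stream
```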
@jdoerfert I can try it on a test program. miniQMC is choked by the linker at the moment. Is the "map" thread-safe now?
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
525 | Along these lines, the H2D, kernel, and D2H can optimally be scheduled as a whole entity in the tasking runtime and use the same stream if they originate from the same OpenMP pragma line. |
Caught another issue.
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
832 | This synchronization should be replaced with a stream wait. |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
95 | Right, in case of integer overflow, my bad... |
Are CUDA streams available on all versions of CUDA that the rest of OpenMP works with? I'm not sure when they were introduced.
A couple of minor comments inline. This seems to be a fairly straightforward wrapper over the CUDA functionality.
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
95 | A vector of pointers to `atomic_int` is interesting. What's the advantage over `std::vector<std::atomic_int>`? It might be worth putting a few asserts in the code to the effect that resizing the vector after the initial construction will break access from other threads. | |
259 | If we do need the pointer wrapper, this should be make_unique |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
832 | Are you referring to cudaStreamWaitEvent? |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
259 | Do we have llvm::make_unique? Though it may not necessarily be good to use it here anyway. @jon, OK to stick with this for now? |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
832 | I mean cuStreamSynchronize |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
95 | Sure, but you're not copying the element anywhere. Sadly, I think we would need to provide the size of the vector up front: reserve() calls copy constructors (which I didn't expect) and they're deleted for atomic_int. I'm not sure the CUDA API will permit that. Which leads to the suggestion: `std::unique_ptr<std::atomic_int[]> NextStreamId; /* ... */ NextStreamId = std::make_unique<std::atomic_int[]>(NumberOfDevices);` This elides the NumberOfDevices heap allocations and the associated indirection on every access, and makes it somewhat more obvious that we can't call various vector API functions. It has the disadvantage that the integers will now definitely be in the same cache line, whereas previously there was a chance that the allocator would put them on different cache lines. Overall I'm fine with either structure. | |
259 | llvm::make_unique was removed by D66259, as we're now assuming C++14. They're semantically identical in this context so it doesn't matter much. |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
259 | Other files in LLVM won't build with C++11 any more, so >=14 seems a safe bet. |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
259 | That is cool! Thanks for the information. Will update this part accordingly. |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
259 | Well, I just tried with make_unique, but it turns out we're actually still using C++11: |

```
FAILED: libomptarget/plugins/cuda/CMakeFiles/omptarget.rtl.cuda.dir/src/rtl.cpp.o
/home/shiltian/.local/bin/clang++ -DOMPTARGET_DEBUG -DTARGET_NAME=CUDA -Domptarget_rtl_cuda_EXPORTS -I/home/shiltian/Documents/clion/llvm-project/openmp/libomptarget/include -I/opt/cuda/10.1/include -Wall -Wcast-qual -Wformat-pedantic -Wimplicit-fallthrough -Wsign-compare -Wno-extra -Wno-pedantic -std=gnu++11 -g -fPIC -MD -MT libomptarget/plugins/cuda/CMakeFiles/omptarget.rtl.cuda.dir/src/rtl.cpp.o -MF libomptarget/plugins/cuda/CMakeFiles/omptarget.rtl.cuda.dir/src/rtl.cpp.o.d -o libomptarget/plugins/cuda/CMakeFiles/omptarget.rtl.cuda.dir/src/rtl.cpp.o -c /home/shiltian/Documents/clion/llvm-project/openmp/libomptarget/plugins/cuda/src/rtl.cpp
/home/shiltian/Documents/clion/llvm-project/openmp/libomptarget/plugins/cuda/src/rtl.cpp:259:18: error: no member named 'make_unique' in namespace 'std'
  Ptr = std::make_unique<std::atomic_uint>(0);
        ~~~~~^
/home/shiltian/Documents/clion/llvm-project/openmp/libomptarget/plugins/cuda/src/rtl.cpp:259:46: error: expected '(' for function-style cast or type construction
  Ptr = std::make_unique<std::atomic_uint>(0);
        ~~~~~~~~~~~~~~~~^
2 errors generated.
ninja: build stopped: subcommand failed.
```
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
259 | Change the CMake in a separate commit. LLVM is on C++14. |
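Hypothetically, the CMake change being discussed would amount to something like the following; the exact variables and their location in the OpenMP runtime's CMakeLists.txt are assumptions here, not confirmed by this thread:

```cmake
# Build the OpenMP runtimes with C++14 instead of gnu++11 so that
# std::make_unique (and the rest of the C++14 library) is available.
set(CMAKE_CXX_STANDARD 14)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
```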
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
259 | So OpenMP will also switch to C++14 in the near future? |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
259 | Sounds good to me. Yep, let's change the CMake now. |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
259 | We will probably need "version 2" functions soon which take additional information, e.g., the stream to be used. I suggest testing this as is and merging it before we go there. It should already allow overlap between threads that offload; the "version 2" functions will only shrink the per-thread overhead. That said, we are working on the nowait support, so there will be other changes soon anyway. |
@ye-luo Do you have a way to test this or do we need to fix the linker issue first?
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
246 | The hardware will cap the number internally anyway so we should go higher here. Maybe 256? |
I did a little experiment to show the performance improvement. Here is the micro benchmark:
```c
#include <math.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void kernel() {
  const int num_threads = 64;
#pragma omp parallel for
  for (int i = 0; i < num_threads; ++i) {
    const size_t N = 1UL << 10;
#pragma omp target teams distribute parallel for
    for (size_t i = 0; i < N; ++i) {
      for (size_t j = 0; j < N / 2; ++j) {
        float x = sqrt(pow(3.14159, j));
      }
    }
  }
}

int main(int argc, char *argv[]) {
  const int N = 1000;
  const clock_t start = clock();
  for (int i = 0; i < N; ++i) {
    kernel();
  }
  const clock_t duration = (clock() - start) * 1000 / CLOCKS_PER_SEC / N;
  printf("Avg time: %ld ms\n", duration);
  return 0;
}
```
The execution result with multiple streams is:
```
$ /usr/local/cuda/bin/nvprof --output-profile parallel_offloading_ms.prof -f ./parallel_offloading
==32397== NVPROF is profiling process 32397, command: ./parallel_offloading
Avg time: 1081 ms
==32397== Generated result file: /home/shiltian/Documents/project/multiple_streams/tests/multistreams/parallel_offloading_ms.prof
```
And the result without multiple streams is:
```
$ /usr/local/cuda/bin/nvprof --output-profile parallel_offloading.prof -f ./parallel_offloading
==35547== NVPROF is profiling process 35547, command: ./parallel_offloading
Avg time: 5825 ms
==35547== Generated result file: /home/shiltian/Documents/project/multiple_streams/tests/multistreams/parallel_offloading.prof
```
That is 1081 ms vs. 5825 ms, approximately a 5.4x speedup.
Yes, later we will take the stream previously used for a data transfer into consideration when selecting a stream for the kernel, along with other potential optimizations.
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
246 | Sure |
My standalone code can be used to verify multi-stream concurrent execution, and whether transfer and execution use the same stream, by profiling with nvprof.
What standalone code? Can you run it with this patch? The transfers will use a different stream, but that should be OK for now; "version 2" will use the same stream.
We need to lose std::make_unique before landing, as the C++11 => C++14 move has proven contentious. Otherwise LGTM.
I was thinking of going back to the explicit `new`, not an ifdef on the C++ version. Then we can land this now and optionally revisit once the codebase moves to C++14.
I tested the patch. The streams for H2D, D2H, and compute behave asynchronously as expected.
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
246 | I don't like this choice. The hardware limit is 32, which is preferred. Users can play with the environment variable if they need more. |
I do accept this pending D74258 and the C++14 RFC. If they go through, the version of this patch that uses C++14 is fine.
We can discuss and modify the stream number afterwards as necessary (assuming we don't find a consensus now).
This patch is strictly positive so we should work from here.
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
246 | @ye-luo Do you experience a downside to 256 streams? There should not be a performance problem, and it should help us be forward and backward compatible. |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
246 | I don't have strong evidence of a performance impact. I thought more streams would cost the driver a bit more to monitor and schedule workloads onto the hardware. |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
246 | I would expect, or maybe hope, that the driver just does the modulo internally. There is no point in tracking more than the number of hardware streams, so why would they? To that end they can just do `hw_stream = user_stream % num_hw_streams`, which would make sense because it is portable (= backward/forward compatible). |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
182 | It looks like DeviceID should be unsigned here |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
182 | Well, yes, it should be. But if you take a look at how they're used, for example at line 725, you can see the declaration is `int32_t device_id`. |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
182 | Well, strictly speaking, device IDs in libomptarget are signed; e.g., the default device has an ID of -1 and the host device has ID -10. On the other hand, such negative values should never reach the plugin; if that ever happens, something is buggy in the base library. So it's really up to you whether to keep the signed flavor or switch to unsigned. |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
182 | Signed + an assertion (`id >= 0`)? |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
182 | It would be better to put this check in each API call. |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
182 | True, but we'll add them in (a) separate commit(s). Can you add the check to the assert you have below? |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
263 | No ifdef needed anymore, C++14 is here. |
The patch looks good to me. One final comment: is LIBOMPTARGET_NUM_STREAMS the most appropriate name for the new env var? Because it targets the CUDA plugin specifically, should we change the name to something like LIBOMPTARGET_CUDA_NUM_STREAMS?
My thinking is that the concept of a stream is widely used across platforms, though they might use different terminology.
Let's not complicate the name and just interpret "STREAMS" as whatever the platform's equivalent is. That way we can have a single environment variable.