In this patch, the default stream is replaced with an actual per-device stream for better performance, since the default stream carries a number of constraints (see https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf). Later I'll enable multiple streams to improve concurrency.
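The direction of the change can be illustrated with a minimal sketch (names are hypothetical and only for illustration; the actual patch edits the plugin's device data structures in rtl.cpp): instead of passing the default stream (`0`) to every launch, each device holds a stream created at initialization.

```cpp
#include <cuda.h> // CUDA driver API, as used by the plugin

// Hypothetical per-device record; the real plugin keeps richer state.
struct DeviceInfo {
  CUcontext Context;
  CUstream Stream; // per-device stream replacing the default stream 0
};

CUresult initDeviceStream(DeviceInfo &D) {
  CUresult Err = cuCtxSetCurrent(D.Context);
  if (Err != CUDA_SUCCESS)
    return Err;
  // CU_STREAM_NON_BLOCKING: work on this stream does not implicitly
  // synchronize with the legacy default stream.
  return cuStreamCreate(&D.Stream, CU_STREAM_NON_BLOCKING);
}

// Kernel launches then target the per-device stream instead of stream 0:
//   cuLaunchKernel(Fn, GridX, GridY, GridZ, BlockX, BlockY, BlockZ,
//                  SharedMemBytes, D.Stream, Args, nullptr);
```

This sketch requires the CUDA toolkit to build; it is not a verbatim excerpt of the patch.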
Details
Diff Detail
- Repository: rL LLVM
Event Timeline
openmp/libomptarget/plugins/cuda/src/rtl.cpp:260
Sure, but just to keep it aligned with the existing code. Do I need to update the existing code as well?
Also, the main question: how does this affect the existing execution model? What if we have a target region inside a parallel region? Will they be executed asynchronously? We need tests for this if we don't have them already.
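A hedged sketch of the kind of test being asked for (hypothetical, not from the patch): several host threads each launch their own target region, so any serialization or interleaving between concurrent target regions becomes observable.

```cpp
#include <omp.h>
#include <stdio.h>

int main() {
  int results[4] = {0, 0, 0, 0};

  // Each host thread offloads its own target region. With a single
  // per-device stream they still execute in order on the device;
  // true concurrency would need multiple streams.
  #pragma omp parallel num_threads(4)
  {
    int tid = omp_get_thread_num();
    #pragma omp target map(tofrom: results[tid:1])
    { results[tid] = tid + 1; }
  }

  for (int i = 0; i < 4; ++i)
    printf("results[%d] = %d\n", i, results[i]);
  return 0;
}
```

Building this requires an OpenMP offloading toolchain (e.g. clang with `-fopenmp -fopenmp-targets=nvptx64`).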
openmp/libomptarget/plugins/cuda/src/rtl.cpp:260
In a separate patch, please.
According to https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf, a non-default stream can improve performance. This is the first step toward the multiple-stream support I plan to implement later.
AFAIK, no. Currently we still have only one stream per device; it's just no longer the default stream. Kernels in a stream are executed in order, so asynchronous execution requires multiple streams. I'll check whether the existing test cases cover this, and will write one if not.
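The planned follow-up could look roughly like this (a hypothetical sketch; `StreamPool` and its round-robin policy are illustrative assumptions, not the eventual implementation): a small pool of non-default streams per device, handed out per target region so that independent kernels can overlap.

```cpp
#include <cuda.h>
#include <atomic>
#include <vector>

// Hypothetical round-robin stream pool for one device.
struct StreamPool {
  std::vector<CUstream> Streams;
  std::atomic<unsigned> Next{0};

  CUresult init(unsigned Count) {
    Streams.resize(Count);
    for (CUstream &S : Streams) {
      CUresult Err = cuStreamCreate(&S, CU_STREAM_NON_BLOCKING);
      if (Err != CUDA_SUCCESS)
        return Err;
    }
    return CUDA_SUCCESS;
  }

  // Two target regions that get different streams may run concurrently;
  // work queued on the same stream still executes in order.
  CUstream get() { return Streams[Next++ % Streams.size()]; }
};
```

As with the patch itself, this builds only against the CUDA driver API and is not part of the review under discussion.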
Can you please write a commit message explaining the change and the plan?
openmp/libomptarget/plugins/cuda/src/rtl.cpp:260
You can go either way, but I would just keep it as the surrounding code for now. @JonChesterfield @grokos What do you think? Should we run once over the plugin and adjust the coding style now, or keep it consistent for the time being?
openmp/libomptarget/plugins/cuda/src/rtl.cpp:260
I would say let's keep it consistent. Later on we can adjust the code style for the whole library, but until then I prefer consistency over mixed styles.