In this patch, the default stream is replaced with an actual per-device stream for better performance, because the default stream comes with a number of constraints (see https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf). Later I'll enable multiple streams to improve concurrency.
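For reference, a minimal sketch of what "one non-default stream per device" could look like with the CUDA driver API (which the plugin uses). This is an illustration only, not the patch's code; the names `DeviceStreams` and `initStreams` are hypothetical:

```cpp
// Hedged sketch: one non-default stream per device via the CUDA driver API.
// Requires a CUDA-capable device; error checking omitted for brevity.
#include <cuda.h>
#include <vector>

static std::vector<CUstream> DeviceStreams; // hypothetical per-device table

void initStreams(int NumDevices) {
  cuInit(0);
  DeviceStreams.resize(NumDevices);
  for (int D = 0; D < NumDevices; ++D) {
    CUcontext Ctx;
    cuDevicePrimaryCtxRetain(&Ctx, D); // one primary context per device
    cuCtxSetCurrent(Ctx);
    // CU_STREAM_NON_BLOCKING: this stream does not implicitly synchronize
    // with the legacy default stream, avoiding its serialization constraints.
    cuStreamCreate(&DeviceStreams[D], CU_STREAM_NON_BLOCKING);
  }
}
```

Work for device `D` would then be enqueued on `DeviceStreams[D]` instead of the default stream.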
Also, the main question: how does this affect the existing execution model? If we have a target region inside a parallel region, will they be executed asynchronously? We need tests for this if we don't already have them.
In a separate patch, please
According to https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf, a non-default stream can improve performance. This is the first step toward the multiple-stream support I plan to implement later.
AFAIK, no. Currently we still have only one stream per device; it's just not the default stream. Kernels in a single stream are executed in order, so asynchronous execution requires multiple streams. I'll check whether the existing test cases cover this, and will write one if not.
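To illustrate the ordering point (a sketch, not the plugin's code; `KernelA`, `KernelB`, `Stream0`, `Stream1`, and `Args` are placeholder names):

```cpp
// Within one stream, launches serialize in submission order:
cuLaunchKernel(KernelA, /*grid*/1,1,1, /*block*/32,1,1,
               /*sharedMem*/0, Stream0, Args, nullptr);
cuLaunchKernel(KernelB, 1,1,1, 32,1,1,
               0, Stream0, Args, nullptr); // starts only after KernelA finishes

// Asynchronous execution relative to KernelA needs a separate stream:
cuLaunchKernel(KernelB, 1,1,1, 32,1,1,
               0, Stream1, Args, nullptr); // may overlap with KernelA
```

So with the current single per-device stream, two target regions launched from a parallel region still run one after another on the device.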
Can you please write a commit message explaining the change and the plan?
You can go either way, but I would keep it consistent with the surrounding code for now.
I would say let's keep it consistent. Later on, we can adjust the code style for the whole library, but until then I prefer consistency over mixed styles.