In previous patch, in order to optimize performance, we only synchronize once
for each target region. The syncrhonization is via stream synchronization.
However, in the extreme situation, the performce might be bad. Consider the
following case: There is a task that requires transferring huge amount of data
(call many times of data transferring function). It is scheduled to the first
stream. And then we have 255 very light tasks scheduled to the remaining 255
streams (by default we have 256 streams). They can be finished before we do
synchronization at the end of the first task. Next, we get another very huge
task. It will be scheduled again to the first stream. Now the first task
finishes its kernel launch and call stream synchronization. Right now, the
stream already contains two kernels, and the synchronization will wait until the
two kernels finish instead of just the first one for the first task.
In this patch, we introduce stream pool. After each synchronization, the stream
will be returned back to the pool to make sure that for each synchronization,
only expected operations are waited.