This patch lazily initializes queues, streams, and events, since initializing
them incurs a cost even when they are never used.
To further benefit from this, AMDGPU/HSA queue management is moved into the
AMDGPUStreamManager of an AMDGPUDevice. Streams may now use different HSA queues
during their lifetime and identify busy queues.
When a stream is requested from the resource manager, the manager searches for
an idle queue and tries to assign it. During the search the manager may
initialize more queues, up to the configured maximum (default: 4).
If no idle queue is found, the manager falls back to round-robin selection.
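
The selection policy described above can be sketched roughly as follows. This is a simplified, single-threaded illustration and not the code in this patch: `SketchQueue`, `SketchQueueManager`, `assign`, `release`, and `numInitialized` are hypothetical names, and `init()` merely stands in for wherever the real HSA queue would be created.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an HSA queue wrapper (not the plugin's class).
struct SketchQueue {
  int Users = 0;            // Number of streams currently using this queue.
  bool Initialized = false; // Set on first use (lazy initialization).
  void init() { Initialized = true; } // Real code would create the queue here.
  bool isBusy() const { return Users > 0; }
};

// Hypothetical manager illustrating: idle search, lazy init, round robin.
class SketchQueueManager {
  std::vector<SketchQueue> Queues; // Slots exist up front, queues do not.
  std::size_t NextRR = 0;          // Next slot for the round-robin fallback.

public:
  explicit SketchQueueManager(std::size_t MaxQueues = 4) : Queues(MaxQueues) {
    assert(MaxQueues > 0 && "need at least one queue slot");
  }

  SketchQueue &assign() {
    // Queues are initialized in slot order, so this loop sees every
    // initialized queue before it reaches the first uninitialized slot.
    for (SketchQueue &Q : Queues) {
      if (!Q.Initialized)
        Q.init(); // No idle queue so far: lazily create one more.
      if (!Q.isBusy()) {
        ++Q.Users;
        return Q;
      }
    }
    // Every queue is initialized and busy: fall back to round robin.
    SketchQueue &Q = Queues[NextRR];
    NextRR = (NextRR + 1) % Queues.size();
    ++Q.Users;
    return Q;
  }

  void release(SketchQueue &Q) { --Q.Users; }

  std::size_t numInitialized() const {
    std::size_t N = 0;
    for (const SketchQueue &Q : Queues)
      N += Q.Initialized ? 1 : 0;
    return N;
  }
};
```

With a maximum of two queues, the first two `assign()` calls each initialize a new queue, and further calls alternate between the two until one of them is released and becomes idle again.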
With contributions from Johannes Doerfert <johannes@jdoerfert.de>
Depends on D156245
Do we need memory_order_seq_cst for these three atomic operations, or would a more relaxed memory order be enough?
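
To make the question concrete, here is a minimal sketch of what a relaxed variant could look like. `SketchBusyCounter` and its members are hypothetical and do not reflect the actual fields in the patch. Relaxed ordering would only be appropriate under the assumption that the counter is a pure scheduling hint and that no other data is published through it; if a reader relies on the counter to observe other writes, acquire/release (or the default seq_cst) would be needed instead.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical per-queue usage counter. Assumes the value is only a
// heuristic hint for queue selection and synchronizes nothing else.
struct SketchBusyCounter {
  std::atomic<uint32_t> Users{0};

  // Each operation is still atomic; relaxed only drops ordering
  // guarantees with respect to other memory accesses.
  void addUser() { Users.fetch_add(1, std::memory_order_relaxed); }
  void removeUser() { Users.fetch_sub(1, std::memory_order_relaxed); }
  bool isBusy() const { return Users.load(std::memory_order_relaxed) > 0; }
};
```

Under that assumption, a stale read costs at most a suboptimal queue choice, which the round-robin fallback already tolerates.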