This patch lets us configure the port count to match the parallelism
the specific card can actually use. For AMDGPU we need to cover the
maximum amount of hardware parallelism to avoid deadlocks. For NVPTX we
don't have this problem because of the friendlier scheduler, so we use
the number of warps active on an SM times the number of SMs as a good
guess. Note that the maximum port count is currently smaller than these
numbers. That will be improved in the future.
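
As a rough illustration of the sizing rule described above, here is a minimal sketch. It is not the code from this patch: `DeviceInfo`, `desired_port_count`, `clamped_port_count`, and every field name are hypothetical, and the cap is passed in as a parameter because the actual maximum is not stated here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical device description; none of these names come from the patch.
struct DeviceInfo {
  uint32_t compute_units;         // CUs on AMDGPU, SMs on NVPTX
  uint32_t max_waves_per_unit;    // hardware limit on resident waves per unit
  uint32_t active_waves_per_unit; // waves/warps expected to be active per unit
  bool needs_max_parallelism;     // true when under-sizing risks deadlock
};

// Port count the device would like: one port per wave that may be in flight.
// Targets that can deadlock (AMDGPU in the description) use the hardware
// maximum; others (NVPTX) use the expected number of active warps.
uint64_t desired_port_count(const DeviceInfo &d) {
  uint64_t per_unit = d.needs_max_parallelism ? d.max_waves_per_unit
                                              : d.active_waves_per_unit;
  return static_cast<uint64_t>(d.compute_units) * per_unit;
}

// The desired count limited by whatever maximum the implementation supports.
uint64_t clamped_port_count(const DeviceInfo &d, uint64_t max_ports) {
  assert(max_ports > 0 && "at least one port is required");
  return std::min(desired_port_count(d), max_ports);
}
```

The sketch assumes each wave holds at most one port at a time. If a wave could open one port per thread, the per-unit term would have to be scaled by the wave size.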
This sizing is valid if a wave opens at most one port at a time. I think an argument could be made that a wave could try to open one port per thread. That is likely something we should add to the libc tests.
Async also compromises the sizing, though I currently think we should limit OpenMP to synchronous calls.