With this patch we:
- pick more sensible defaults for the number of teams, inspired by the old plugin, and configured via LIBOMPTARGET_AMDGPU_TEAMS_PER_CU.
- check the input signal of a kernel launch late, after the queue lock was taken, to avoid a barrier packet more often.
- copy the kernel arguments in one swoop into the appropriate memory.
- manually specialize the callbacks to avoid potential indirect calls.
nit