[libomptarget][amdgpu] Fix truncation error for partial wavefront

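The partial barrier implementation involves one wavefront resetting the
barrier and the other N-1 waiting, so the wavefront count N must include
a trailing partial wavefront instead of truncating it away. This change
future-proofs against launching with a number of threads that is not a
multiple of the wavefront size.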
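
As a minimal sketch of the arithmetic involved, assuming the wavefront
count is derived from the thread count by integer division: the names
below (kWavefrontSize, wavesTruncated, wavesRoundedUp) and the wavefront
size of 64 are hypothetical and chosen for illustration only, not taken
from the libomptarget source.

```cpp
#include <cstdint>

// Hypothetical constant: a wavefront size of 64 is assumed for illustration.
constexpr uint32_t kWavefrontSize = 64;

// Truncating division: a trailing partial wavefront is not counted.
constexpr uint32_t wavesTruncated(uint32_t numThreads) {
  return numThreads / kWavefrontSize;
}

// Round-up division: a trailing partial wavefront counts as one wavefront.
// Assumes numThreads is small enough that the addition does not overflow.
constexpr uint32_t wavesRoundedUp(uint32_t numThreads) {
  return (numThreads + kWavefrontSize - 1) / kWavefrontSize;
}
```

With 96 threads, for example, the truncating form yields one wavefront
while the round-up form yields two; the two forms agree whenever the
thread count is an exact multiple of the wavefront size.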