If the workgroup size is known to be no greater than the wavefront size,
the s_barrier instruction is not needed, since all threads are guaranteed
to reach the same point at the same time.
Note that the fence, which is translated into s_waitcnt, still remains,
since it is a separate IR instruction.
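
As a rough illustration of the rule described above, the decision can be
sketched as follows. This is not the code from the patch: the function
names needsHardwareBarrier and lowerWorkgroupBarrier are hypothetical,
and instructions are modelled as bare mnemonic strings without operands.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not an LLVM API): a hardware barrier is only
// required when the workgroup may span more than one wavefront.
bool needsHardwareBarrier(unsigned maxWorkgroupSize, unsigned wavefrontSize) {
  return maxWorkgroupSize > wavefrontSize;
}

// Hypothetical model of the emitted sequence for a fence followed by a
// workgroup barrier. Mnemonics only; real instructions carry operands.
std::vector<std::string> lowerWorkgroupBarrier(unsigned maxWorkgroupSize,
                                               unsigned wavefrontSize) {
  std::vector<std::string> insts;
  // The fence is a separate IR instruction, so its s_waitcnt is always kept.
  insts.push_back("s_waitcnt");
  // The barrier itself is emitted only when it can have an effect.
  if (needsHardwareBarrier(maxWorkgroupSize, wavefrontSize))
    insts.push_back("s_barrier");
  return insts;
}
```

With a wavefront size of 64, a workgroup of at most 64 threads yields only
the s_waitcnt, while a workgroup of 256 threads yields s_waitcnt followed
by s_barrier.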
See also D31728.