Implement this optimization in SIInsertWaitcnts, where we already have
information about whether there might be outstanding VMEM store
instructions. This has the following advantages:
- Correctly handles atomics-with-return.
- Correctly handles call instructions.
- Should be faster because it does not require running a separate pass.

global_atomic_csub_u32 ... glc is an atomic-with-return, which uses
VMcnt. The hardware waits until VMcnt==0 before sending the
MSG_DEALLOC_VGPRS message, so there is no point in sending it.
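
As a rough illustration of the reasoning above, the decision can be
sketched as a small predicate. This is a simplified model, not the code
in SIInsertWaitcnts: the PendingMemOps struct, its field names, and
shouldSendDeallocVGPRs are hypothetical, and the rule that outstanding
VMcnt operations suppress the message is an assumption drawn from the
atomic-with-return note.

```cpp
// Hypothetical summary of what may still be in flight at the end of
// the program. The real pass tracks this with its own data structures.
struct PendingMemOps {
  bool VMcntOps;   // loads or atomics-with-return, counted by VMcnt
  bool VMEMStores; // VMEM stores that may still be outstanding
};

// Returns true if sending MSG_DEALLOC_VGPRS could be worthwhile.
bool shouldSendDeallocVGPRs(const PendingMemOps &P) {
  // With no outstanding stores the wave can end immediately, so
  // releasing VGPRs early gains nothing.
  if (!P.VMEMStores)
    return false;
  // Assumption: the hardware waits for VMcnt==0 before sending the
  // message, so pending VMcnt operations make it pointless.
  if (P.VMcntOps)
    return false;
  return true;
}
```

In this model only the "stores outstanding, nothing pending on VMcnt"
case returns true. A call instruction would have to be modelled
conservatively, since the callee's memory operations are not visible.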