A preexisting waitcnt may not update the scoreboard if the instruction
being examined needed to wait on fewer counters than were encoded in
the old waitcnt instruction. Fixing this eliminates some redundant
waitcnt instructions.
These changes also enable combining consecutive waitcnt into a single
S_WAITCNT or S_WAITCNT_VSCNT instruction.
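
As a rough illustration of what combining consecutive waitcnt means, here is a minimal sketch. It assumes a hypothetical `Wait` struct and `combineWaits` helper that are not the pass's actual types, and it models only the counter values, not the instruction encoding. A smaller count is a stricter wait, so the merged wait takes the minimum of each counter.

```cpp
#include <algorithm>

// Hypothetical stand-in for a decoded wait; not the actual LLVM type.
// A smaller count is a stricter wait; ~0u means "no wait required".
struct Wait {
  unsigned VmCnt = ~0u;
  unsigned ExpCnt = ~0u;
  unsigned LgkmCnt = ~0u;
  unsigned VsCnt = ~0u;
};

// Merge two back-to-back waits into one: for each counter, the
// strictest (smallest) requested value wins.
inline Wait combineWaits(const Wait &A, const Wait &B) {
  Wait R;
  R.VmCnt = std::min(A.VmCnt, B.VmCnt);
  R.ExpCnt = std::min(A.ExpCnt, B.ExpCnt);
  R.LgkmCnt = std::min(A.LgkmCnt, B.LgkmCnt);
  R.VsCnt = std::min(A.VsCnt, B.VsCnt);
  return R;
}
```

In this sketch, a counter that neither wait constrains stays at `~0u`, so the merged wait never becomes stricter than the two originals together require.
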
Additionally, move the code that inserts waitcnt at function entry
earlier. This allows combining more waitcnt in some tests.
This condition was already pretty obscure and inverting it has made it even more obscure. Instead, how about: