This change starts hoisting waitcnt out of loops that use a value loaded outside the loop and that also contain a VMEM load defining a value used outside the loop.
Example:

    v0 = load(...)
    loop {
      ...
      use(v0)
      v1 = load(...)
      ...
      use(v1)
      v2 = load(...)
    }
    use(v2)
Previously, we would not hoist the waitcnt to the preheader of a loop that contained any def/use pair, on any subregister, where the value was both defined and used wholly within the loop. Restricting the optimization to loops that load values but never use them seems somewhat arbitrary, though I may be missing something. There is a concern about increased compile time with this change, but it is essentially what was done before for FLAT/GLOBAL instructions.
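To make the difference concrete, here is a minimal sketch of the two rules applied to the loop from the example above. It is a toy model, not the pass itself: `Inst`, `LoopSummary`, `shouldHoistOld`, `shouldHoistNew`, and `exampleLoop` are hypothetical names, registers are plain strings with no subregister tracking, and the new rule is only my reading of the description.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy instruction: a VMEM load defining Reg, or a plain use of Reg.
struct Inst {
  bool IsVMemLoad;
  std::string Reg;
};

// Hypothetical loop summary; none of these names come from the patch.
struct LoopSummary {
  std::set<std::string> LoadedBefore; // loads still pending at the preheader
  std::vector<Inst> Body;             // instructions inside the loop
  std::set<std::string> UsedAfter;    // registers read after the loop exits
};

// True if the loop body reads a register whose load was issued before the loop.
bool usesValueLoadedOutside(const LoopSummary &L) {
  return std::any_of(L.Body.begin(), L.Body.end(), [&](const Inst &I) {
    return !I.IsVMemLoad && L.LoadedBefore.count(I.Reg) != 0;
  });
}

// Old rule (as described): refuse to hoist if any register is both loaded
// and used inside the loop.
bool shouldHoistOld(const LoopSummary &L) {
  std::set<std::string> Loaded, Used;
  for (const Inst &I : L.Body)
    (I.IsVMemLoad ? Loaded : Used).insert(I.Reg);
  if (Loaded.empty())
    return false;
  for (const std::string &R : Loaded)
    if (Used.count(R) != 0)
      return false;
  return usesValueLoadedOutside(L);
}

// New rule (assumed reading): hoist if the loop uses a value loaded outside
// and some in-loop VMEM load defines a value that is used after the loop.
bool shouldHoistNew(const LoopSummary &L) {
  bool DefinesLiveOut =
      std::any_of(L.Body.begin(), L.Body.end(), [&](const Inst &I) {
        return I.IsVMemLoad && L.UsedAfter.count(I.Reg) != 0;
      });
  return DefinesLiveOut && usesValueLoadedOutside(L);
}

// The loop from the example: v0 loaded before, v1 loaded and used inside,
// v2 loaded inside and used after.
LoopSummary exampleLoop() {
  return {{"v0"},
          {{false, "v0"}, {true, "v1"}, {false, "v1"}, {true, "v2"}},
          {"v2"}};
}
```

In this model the old rule rejects the example loop because v1 is both loaded and used inside it, while the new rule accepts it because v0 is loaded outside and v2 is used after the loop.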
A more thorough approach would try to estimate the minimum number of cycles gained or lost by hoisting the waitcnt, but that would increase compile time further.
How does this set distinguish between subregister and full-register defs?