Reduce the statefulness of the algorithm in two ways:
- Split generateWaitcntInstBefore more clearly into two phases: the first determines the required wait, if any, without changing the ScoreBrackets; the second actually inserts the wait and updates the brackets.
- Communicate pre-existing s_waitcnt instructions using an argument to generateWaitcntInstBefore instead of through the ScoreBrackets.
To simplify these changes, a Waitcnt structure is introduced which carries
the counts of an s_waitcnt instruction in decoded form.
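
As a rough illustration of the two points above, here is a minimal sketch of a decoded wait structure and a two-phase split. It is not the code in this patch: the field names, the ~0u "no wait" convention, ToyBrackets, determineWait and applyWait are all hypothetical, and only the LGKM counter is modelled.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical decoded form of an s_waitcnt instruction.
// Assumption: ~0u in a field means "no wait required on this counter".
struct Waitcnt {
  unsigned VmCnt = ~0u;
  unsigned ExpCnt = ~0u;
  unsigned LgkmCnt = ~0u;

  bool hasWait() const {
    return VmCnt != ~0u || ExpCnt != ~0u || LgkmCnt != ~0u;
  }

  // Merge two waits: per counter, the stricter (smaller) value wins.
  Waitcnt combined(const Waitcnt &Other) const {
    Waitcnt Result;
    Result.VmCnt = std::min(VmCnt, Other.VmCnt);
    Result.ExpCnt = std::min(ExpCnt, Other.ExpCnt);
    Result.LgkmCnt = std::min(LgkmCnt, Other.LgkmCnt);
    return Result;
  }
};

// Toy stand-in for the ScoreBrackets: tracks only the number of
// outstanding LGKM operations.
struct ToyBrackets {
  unsigned PendingLgkm = 0;
};

// Phase 1: determine the required wait. The brackets are taken by const
// reference, so this phase cannot modify them. A pre-existing s_waitcnt is
// passed in as an argument (may be null) rather than through the brackets.
Waitcnt determineWait(const ToyBrackets &Brackets, unsigned MaxOutstandingLgkm,
                      const Waitcnt *OldWaitcnt) {
  Waitcnt Wait;
  if (Brackets.PendingLgkm > MaxOutstandingLgkm)
    Wait.LgkmCnt = MaxOutstandingLgkm;
  if (OldWaitcnt)
    Wait = Wait.combined(*OldWaitcnt);
  return Wait;
}

// Phase 2: "insert" the wait and update the brackets to reflect it.
void applyWait(ToyBrackets &Brackets, const Waitcnt &Wait) {
  if (Wait.LgkmCnt != ~0u)
    Brackets.PendingLgkm = std::min(Brackets.PendingLgkm, Wait.LgkmCnt);
}
```

With two LGKM operations pending and at most one allowed to remain, determineWait returns a wait with LgkmCnt == 1 and leaves the brackets untouched; only applyWait lowers PendingLgkm.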
There are some functional changes:
- The FIXME for the VCCZ bug workaround was implemented: we only wait for SMEM instructions as required instead of waiting on all counters.
- There are some cases where we previously merged some waitcnt instructions together non-locally due to the somewhat odd OldWaitcnt tracking, e.g. we would produce code like this:
      ds_read_b32 v0, ...
      ds_read_b32 v1, ...
      s_waitcnt lgkmcnt(0)  <-- this is a merged wait for both uses
      use(v0)
      more code
      use(v1)
In these cases we will now always first emit a wait for lgkmcnt(1), and then later for lgkmcnt(0). This should basically always be a win, although theoretically there could be cases where it's very slightly worse due to the increased code size. The worst code size regressions in my shader-db are:
WORST REGRESSIONS - Code Size
Before   After   Delta   Percentage
  1724    1736      12       0.70 %   shaders/private/f1-2015/1334.shader_test
  2276    2284       8       0.35 %   shaders/private/f1-2015/1306.shader_test
  4632    4640       8       0.17 %   shaders/private/ue4_elemental/62.shader_test
  2376    2384       8       0.34 %   shaders/private/f1-2015/1308.shader_test
  3284    3292       8       0.24 %   shaders/private/talos_principle/1955.shader_test
... so I'm not particularly worried about the rather theoretical downside.
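
For reference, the lgkmcnt(1)-then-lgkmcnt(0) sequence in the ds_read example follows from simple counting. The sketch below assumes the loads tracked by the counter complete in issue order, so that waiting for lgkmcnt(N) guarantees everything except the most recent N operations has finished. The helper requiredLgkmCnt is hypothetical and is not part of the pass.

```cpp
#include <cassert>

// Hypothetical helper: with NumIssued loads outstanding (assumed to complete
// in issue order), return the largest counter value that still guarantees
// the load at LoadIndex (0-based issue order) has completed.
unsigned requiredLgkmCnt(unsigned NumIssued, unsigned LoadIndex) {
  assert(LoadIndex < NumIssued && "load must be one of the issued ones");
  return NumIssued - 1 - LoadIndex;
}
```

For the two ds_read_b32 instructions above, use(v0) only needs requiredLgkmCnt(2, 0), i.e. lgkmcnt(1), while use(v1) needs requiredLgkmCnt(2, 1), i.e. lgkmcnt(0). The weaker first wait lets use(v0) and the code after it proceed while the second load is still in flight.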