This allows the DPP combiner to kick in more often. For example, the
exclusive scan generated by the atomic optimizer for a divergent atomic
add used to look like this:
  v_mov_b32_e32 v3, v1
  v_mov_b32_e32 v5, v1
  v_mov_b32_e32 v6, v1
  v_mov_b32_dpp v3, v2 wave_shr:1 row_mask:0xf bank_mask:0xf
  s_nop 1
  v_add_u32_dpp v4, v3, v3 row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
  v_mov_b32_dpp v5, v3 row_shr:2 row_mask:0xf bank_mask:0xf
  v_mov_b32_dpp v6, v3 row_shr:3 row_mask:0xf bank_mask:0xf
  v_add3_u32 v3, v4, v5, v6
  v_mov_b32_e32 v4, v1
  s_nop 1
  v_mov_b32_dpp v4, v3 row_shr:4 row_mask:0xf bank_mask:0xe
  v_add_u32_e32 v3, v3, v4
  v_mov_b32_e32 v4, v1
  s_nop 1
  v_mov_b32_dpp v4, v3 row_shr:8 row_mask:0xf bank_mask:0xc
  v_add_u32_e32 v3, v3, v4
  v_mov_b32_e32 v4, v1
  s_nop 1
  v_mov_b32_dpp v4, v3 row_bcast:15 row_mask:0xa bank_mask:0xf
  v_add_u32_e32 v3, v3, v4
  s_nop 1
  v_mov_b32_dpp v1, v3 row_bcast:31 row_mask:0xc bank_mask:0xf
  v_add_u32_e32 v1, v3, v1
  v_add_u32_e32 v1, v2, v1
  v_readlane_b32 s0, v1, 63
But now most of the DPP movs are combined into adds:
  v_mov_b32_e32 v3, v1
  v_mov_b32_e32 v5, v1
  s_nop 0
  v_mov_b32_dpp v3, v2 wave_shr:1 row_mask:0xf bank_mask:0xf
  s_nop 1
  v_add_u32_dpp v4, v3, v3 row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
  v_mov_b32_dpp v5, v3 row_shr:2 row_mask:0xf bank_mask:0xf
  v_mov_b32_dpp v1, v3 row_shr:3 row_mask:0xf bank_mask:0xf
  v_add3_u32 v1, v4, v5, v1
  s_nop 1
  v_add_u32_dpp v1, v1, v1 row_shr:4 row_mask:0xf bank_mask:0xe
  s_nop 1
  v_add_u32_dpp v1, v1, v1 row_shr:8 row_mask:0xf bank_mask:0xc
  s_nop 1
  v_add_u32_dpp v1, v1, v1 row_bcast:15 row_mask:0xa bank_mask:0xf
  s_nop 1
  v_add_u32_dpp v1, v1, v1 row_bcast:31 row_mask:0xc bank_mask:0xf
  v_add_u32_e32 v1, v2, v1
  v_readlane_b32 s0, v1, 63
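For reference, the kind of source that produces such a scan is an atomic
add whose value operand varies per lane. A minimal CUDA/HIP-style sketch
(illustrative only, not part of this change; the kernel name and arguments
are made up) would be:

  // Hypothetical kernel for illustration: the value operand differs per
  // lane, so the atomic optimizer rewrites the atomicAdd into a wavefront
  // exclusive scan (the DPP sequence above) plus a single atomic.
  __global__ void divergent_add(int *out, const int *in) {
    atomicAdd(out, in[threadIdx.x]);
  }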
Also fix some typos in comments and debug output.