Match the new DAG behavior and use v_perm_b32 when available. Also
does better on SI/CI by expanding 16-bit swaps. Also fix
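
For reviewers less familiar with the byte-permute approach, here is a rough Python model of how a v_perm_b32-style selector could express 32-bit and 16-bit byte swaps, alongside a generic shift-based 16-bit swap of the kind a target without that instruction might expand to. This is a sketch under assumptions, not the code in this patch: the selector encoding (values 0-3 pick bytes of the second source, 4-7 bytes of the first, 0x0c yields zero) is my reading of the ISA, and the function names and the selector constants 0x00010203 and 0x0c0c0001 are illustrative.

```python
def v_perm_b32_model(s0, s1, sel):
    """Simplified, assumed model of a byte permute over the 64-bit value {s0, s1}.

    Only selector values 0-7 (pick a source byte) and 0x0c (constant zero)
    are modeled; other encodings are rejected.
    """
    src = ((s0 & 0xFFFFFFFF) << 32) | (s1 & 0xFFFFFFFF)
    out = 0
    for i in range(4):
        code = (sel >> (8 * i)) & 0xFF
        if code <= 7:
            byte = (src >> (8 * code)) & 0xFF
        elif code == 0x0C:
            byte = 0x00
        else:
            raise ValueError(f"selector value {code:#x} is not modeled")
        out |= byte << (8 * i)
    return out


def bswap32_perm(x):
    # Hypothetical selector: result bytes 0..3 come from source bytes 3..0.
    return v_perm_b32_model(0, x, 0x00010203)


def bswap16_perm(x):
    # Hypothetical selector: swap the low two bytes, zero-fill the upper two.
    return v_perm_b32_model(0, x, 0x0C0C0001)


def bswap16_expanded(x):
    # Generic shift/or expansion of a 16-bit swap held in a 32-bit register.
    x &= 0xFFFF
    return ((x << 8) | (x >> 8)) & 0xFFFF
```

In this model the 16-bit permute zero-fills the upper half of the result on its own, whereas the shift-based form needs explicit masking to get the same result when the upper bits of the input are not known to be zero.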
Why do we get masking both before and after the operation (s_and and s_bfe)? It seems like only one or the other should be required, depending on whether the upper bits of the register are undefined or defined to be zero.
We inserted a zext to satisfy the readfirstlane type constraint. We don't really have any optimizations that would take care of that yet, and currently the readfirstlane would still be in the way even if we did.