- Disable "ctlz speculation", which inserts a branch on every ctlz(x) that has defined behavior at x == 0, in order to check whether x is, in fact, zero.
- Add DAG patterns that avoid re-truncating or re-expanding the result of the 16- and 64-bit clz instructions.
PTX has mov.b32 %dest, {%src1, %src2}, which packs two 16-bit registers into one 32-bit register.
Instead of explicit conversion + subtracting 16, perhaps we could do something like this:
I'm not sure whether it makes any difference in SASS, though.
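To make the comparison concrete, here is a rough host-side C++ sketch of the two ways a 16-bit count-leading-zeros could be built on top of a 32-bit one. It is not the snippet from the comment above and not code from the patch. The function names are hypothetical, and the packed variant assumes that the first element of the mov.b32 vector operand lands in the low half of the destination, so that the 16-bit input ends up in the upper half with 0xffff below it.

```cpp
#include <cstdint>

// Portable 32-bit count-leading-zeros that returns 32 for x == 0.
// (Hypothetical helper; stands in for a 32-bit clz instruction.)
inline unsigned clz32(std::uint32_t x) {
    unsigned n = 0;
    for (std::uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1) {
        ++n;
    }
    return n;
}

// Variant 1: zero-extend to 32 bits, count, then subtract the 16 extra zeros.
inline unsigned clz16_widen_sub(std::uint16_t x) {
    return clz32(static_cast<std::uint32_t>(x)) - 16u;
}

// Variant 2: place x in the upper half and all-ones in the lower half.
// The ones stop the count at 16 when x == 0, so no subtraction is needed.
// Assumption: this mirrors packing {0xffff, x} into one 32-bit register.
inline unsigned clz16_packed(std::uint16_t x) {
    std::uint32_t packed = (static_cast<std::uint32_t>(x) << 16) | 0xffffu;
    return clz32(packed);
}

// Exhaustively compare the two variants over every 16-bit input.
inline bool variants_agree() {
    for (std::uint32_t v = 0; v <= 0xffffu; ++v) {
        auto x = static_cast<std::uint16_t>(v);
        if (clz16_widen_sub(x) != clz16_packed(x)) {
            return false;
        }
    }
    return true;
}
```

Both variants return 16 for an input of zero, so either keeps the defined-at-zero behavior. Whether the packed form saves anything after ptxas is a separate question that this sketch does not answer.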