If the add/sub is not single use, it will need to be materialized
later, in which case using the BMI instruction is a de-optimization in
terms of code-size and throughput.
e.g.:
  // Good
  leal -1(%rdi), %eax
  andl %eax, %eax
  xorl %eax, %esi
  ...
  // Unnecessary BMI (lower throughput, larger code size)
  leal -1(%rdi), %eax
  blsr %edi, %eax
  xorl %eax, %esi
  ...
Note that this may sometimes cause extra mov instructions to be emitted,
because the BMI instructions take a single source and only write their
destination, so they do not clobber X. A better approach may be to avoid
BMI for (and/xor X, (add/sub 0/-1, X)) only if this is the last use of X
but NOT the last use of (add/sub 0/-1, X).
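For reference, the patterns named above can be written as scalar expressions. This is only an illustrative sketch: the function names are hypothetical and are not identifiers from this file, and the pairing with BLSR/BLSMSK/BLSI is based on the documented behaviour of those instructions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper names, used only to illustrate the DAG patterns.

// and X, (add X, -1): clears the lowest set bit (what BLSR computes).
constexpr std::uint32_t blsr_pattern(std::uint32_t x) { return x & (x - 1u); }

// xor X, (add X, -1): mask up to and including the lowest set bit (BLSMSK).
constexpr std::uint32_t blsmsk_pattern(std::uint32_t x) { return x ^ (x - 1u); }

// and X, (sub 0, X): isolates the lowest set bit (BLSI).
constexpr std::uint32_t blsi_pattern(std::uint32_t x) { return x & (0u - x); }

// The multi-use case: x - 1 is needed on its own as well, so it has to be
// materialized in a register regardless of whether BLSR is selected.
inline std::uint32_t multi_use(std::uint32_t x) {
  const std::uint32_t dec = x - 1u;      // second use keeps this value live
  const std::uint32_t cleared = x & dec; // plain AND can reuse 'dec' directly
  return cleared + dec;
}

static_assert(blsr_pattern(12u) == 8u, "12 = 0b1100 -> 0b1000");
static_assert(blsmsk_pattern(12u) == 7u, "12 = 0b1100 -> 0b0111");
static_assert(blsi_pattern(12u) == 4u, "12 = 0b1100 -> 0b0100");
```

In multi_use the decrement is already available, so a plain AND on it costs nothing extra, which is the situation the comment describes as not worth a BMI instruction.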
Maybe we should have