David and I took a look at how different compilers lower "move -1 into eax" when optimizing for size.
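Clang emits "movl $-1, %eax" (5 bytes), GCC and MSVC use "orl $-1, %eax" (3 bytes), and ICC uses "pushl $-1; popl %eax" (3 bytes). A fourth alternative would be "xorl %eax, %eax; decl %eax" (3 bytes in 32-bit mode; the one-byte DEC encoding is not available in 64-bit mode, where this sequence takes 4 bytes).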
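For reference, the byte counts above can be checked against the 32-bit encodings. The following is a small sketch, not taken from the patch: the byte sequences are hand-assembled for 32-bit mode, and the names ENCODINGS and encoding_sizes are made up for illustration.

```python
# Hand-assembled 32-bit encodings of each way to set %eax to -1.
# These are illustrative; they are not produced by any of the compilers here.
ENCODINGS = {
    "movl $-1, %eax": bytes([0xB8, 0xFF, 0xFF, 0xFF, 0xFF]),   # B8 + imm32
    "orl $-1, %eax": bytes([0x83, 0xC8, 0xFF]),                # 83 /1 + imm8
    "pushl $-1; popl %eax": bytes([0x6A, 0xFF, 0x58]),         # 6A imm8; 58
    "xorl %eax, %eax; decl %eax": bytes([0x31, 0xC0, 0x48]),   # 31 C0; 48
}


def encoding_sizes():
    """Return a mapping from assembly text to encoded size in bytes."""
    return {asm: len(code) for asm, code in ENCODINGS.items()}


if __name__ == "__main__":
    for asm, size in encoding_sizes().items():
        print(f"{asm:30} {size} bytes")
```

Note that in 64-bit mode the byte 0x48 is a REX prefix, so "decl %eax" has to use the two-byte FF C8 form there.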
A problem with the OR approach is that there's a dependency on the previous value in %eax. The DEC approach avoids that, but maybe DEC is slow on some micro-architectures?
ICC's PUSH/POP approach avoids the dependency problem and has the nice property that it works with all 8-bit immediates under sign extension. However, potentially touching memory seems scary, and IACA says it has a latency of 6 cycles. Is it really that slow, or is this because there's something about the stack that IACA doesn't model? I tried to micro-benchmark the difference between MOV and PUSH/POP on my machine, and the difference was in the noise.
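To make the "all 8-bit immediates" property concrete, here is a rough sketch of which values the 3-byte PUSH/POP sequence covers. It assumes 32-bit mode and the "push imm8" form (opcode 6A), which sign-extends its operand; fits_imm8 and push_pop_eax are hypothetical helper names, not anything in LLVM or ICC.

```python
def fits_imm8(value):
    """True if value is representable as a sign-extended 8-bit immediate."""
    return -128 <= value <= 127


def push_pop_eax(value):
    """Encode 'pushl $value; popl %eax' for 32-bit mode.

    Returns the 3-byte sequence, or None when the value needs a 32-bit
    immediate and the short form does not apply.
    """
    if not fits_imm8(value):
        return None
    # 6A ib = push imm8 (sign-extended), 58 = pop %eax
    return bytes([0x6A, value & 0xFF, 0x58])
```

For example, push_pop_eax(-1) gives the bytes 6A FF 58, while push_pop_eax(128) returns None because 128 does not survive sign extension from 8 bits.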
Since ICC emits this code, it would be great if someone from Intel could comment on the size/speed trade-off here.
I'm attaching my attempt at implementing this in LLVM. Please let me know what you think. Suggestions for more reviewers are also welcome.
What is the advantage of defining new pseudo-ops for mov 1 and mov -1 vs. simply using MOV32ri and expanding it post-RA where you could consider all the relevant factors such as OptForSize, OptForMinSize, the immediate value, whether the destination register requires REX, etc.?
I suppose the pseudos convey your intent to later expand to xor+inc/dec, but that hardly seems to justify the added complexity.
The new pseudos are consistent with what's being done with MOV32r0, so my question applies equally to the existing MOV32r0 code.