David and I looked at how different compilers lower "move -1 into eax" when optimizing for size.
Clang emits "movl $-1, %eax" (5 bytes), GCC and MSVC use "orl $-1, %eax" (3 bytes), and ICC uses "pushl $-1; popl %eax" (3 bytes). A fourth alternative would be "xorl %eax, %eax; decl %eax" (3 bytes).
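The byte counts above can be checked against the 32-bit encodings. The sketch below is only illustrative: the opcode bytes are hand-assembled here rather than taken from assembler output, and the names ENCODINGS and sizes are made up for this example. Note that the 1-byte "decl %eax" form (0x48) exists only in 32-bit mode; in 64-bit mode that byte is a REX prefix, so DEC needs the 2-byte form and the XOR/DEC sequence grows to 4 bytes.

```python
# Hand-assembled 32-bit x86 encodings for each way of setting %eax to -1.
# These bytes are an illustration, not the output of any particular assembler.
ENCODINGS = {
    "movl $-1, %eax": bytes([0xB8, 0xFF, 0xFF, 0xFF, 0xFF]),   # B8+rd imm32
    "orl $-1, %eax": bytes([0x83, 0xC8, 0xFF]),                # 83 /1 ib
    "pushl $-1; popl %eax": bytes([0x6A, 0xFF, 0x58]),         # 6A ib ; 58+rd
    "xorl %eax, %eax; decl %eax": bytes([0x31, 0xC0, 0x48]),   # 31 /r ; 48+rd
}


def sizes():
    """Return the encoded length in bytes of each instruction sequence."""
    return {asm: len(code) for asm, code in ENCODINGS.items()}


if __name__ == "__main__":
    for asm, n in sizes().items():
        print(f"{n} bytes  {asm}")
```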
A problem with the OR approach is that there's a dependency on the previous value in %eax. The XOR/DEC approach avoids that, since the XOR zeroing idiom doesn't depend on the old value, but maybe DEC is slow on some micro-architectures?
ICC's PUSH/POP approach avoids the dependency problem and has the nice property that it works with all 8-bit immediates under sign extension. However, potentially touching memory seems scary, and IACA says it has a latency of 6 cycles. Is it really that slow, or is this because there's something about the stack that IACA doesn't model? I tried to micro-benchmark the difference between MOV and PUSH/POP on my machine, and the difference was in the noise.
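To illustrate the "all 8-bit immediates" point: PUSH with a sign-extended 8-bit immediate covers every constant from -128 to 127, and the sequence stays 3 bytes for any of the eight 32-bit general-purpose registers. The following is a minimal sketch of that encoding in 32-bit mode. The helper names fits_imm8 and encode_push_pop are hypothetical and are not taken from the attached patch.

```python
# x86 register numbers, as used in the 58+rd encoding of POP.
EAX, ECX, EDX, EBX = 0, 1, 2, 3


def fits_imm8(value: int) -> bool:
    """True if value can be encoded as a sign-extended 8-bit immediate."""
    return -128 <= value <= 127


def encode_push_pop(value: int, reg: int = EAX) -> bytes:
    """Encode 'pushl $value; popl %reg' (hypothetical helper, 32-bit mode)."""
    if not fits_imm8(value):
        raise ValueError(f"{value} does not fit in a signed 8-bit immediate")
    if not 0 <= reg <= 7:
        raise ValueError(f"{reg} is not a valid 32-bit register number")
    # 6A ib = PUSH imm8 (sign-extended); 58+rd = POP r32.
    return bytes([0x6A, value & 0xFF, 0x58 + reg])


if __name__ == "__main__":
    print(encode_push_pop(-1).hex())
    print(encode_push_pop(100, ECX).hex())
```

Constants outside that range would need the 5-byte PUSH imm32 form, at which point the sequence is larger than a plain MOV.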
Since ICC emits this code, it would be great if someone from Intel could comment on the size/speed trade-off here.
I'm attaching my attempt at implementing this in LLVM. Please let me know what you think. Suggestions for more reviewers are also welcome.