This is an archive of the discontinued LLVM Phabricator instance.

X86: Emit smaller code for moving 8-bit immediates
ClosedPublic

Authored by hans on Nov 24 2015, 3:30 PM.

Details

Summary

David and I took a look at how different compilers lower "move -1 into eax" when optimizing for size.

Clang will emit "movl $-1, %eax" (5 bytes), GCC and MSVC use "orl $-1, %eax" (3 bytes), ICC uses "pushl $-1; popl %eax" (3 bytes). A fourth alternative would be "xor %eax, %eax; decl %eax" (3 bytes).
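
For reference, the encodings behind those byte counts (standard 32-bit x86 encodings, sketched here for comparison):

movl  $-1, %eax        # b8 ff ff ff ff   (5 bytes)
orl   $-1, %eax        # 83 c8 ff         (3 bytes; reads the old %eax)
pushl $-1              # 6a ff
popl  %eax             # 58               (3 bytes total; touches the stack)
xorl  %eax, %eax       # 31 c0
decl  %eax             # 48               (3 bytes total; clobbers EFLAGS)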

A problem with the OR approach is that there's a dependency on the previous value in %eax. The DEC approach avoids that, but maybe DEC is slow on some micro-architectures?

ICC's PUSH/POP approach avoids the dependency problem and has the nice property that it works with all 8-bit immediates under sign extension. However, potentially touching memory seems scary, and IACA says it has a latency of 6 cycles. Is it really that slow, or is this because there's something about the stack that IACA doesn't model? I tried to micro-benchmark the difference between MOV and PUSH/POP on my machine, and the difference was in the noise.

Since ICC emits this code, it would be great if someone from Intel could comment on the size/speed trade-off here.

I'm attaching my attempt at implementing this in LLVM. Please let me know what you think. Suggestions for more reviewers are also welcome.

Diff Detail

Event Timeline

hans updated this revision to Diff 41090.Nov 24 2015, 3:30 PM
hans retitled this revision from to X86: Emit smaller code for moving 8-bit immediates.
hans updated this object.
hans added reviewers: majnemer, mkuper, DavidKreitzer.
hans added a subscriber: llvm-commits.
joerg added a subscriber: joerg.Nov 24 2015, 3:42 PM

Thanks for working on this, it should add some breathing room for the FreeBSD and NetBSD boot blocks. This is part of https://llvm.org/bugs/show_bug.cgi?id=8784.

About the patch, this needs CFI instrumentation if we ever want to go towards instruction-precise .eh_frame.

mkuper edited edge metadata.Nov 25 2015, 12:19 AM

Thanks, Hans!

I think Dave is better qualified to answer the performance question than I am.

Thanks for working on this, it should add some breathing room for the FreeBSD and NetBSD boot blocks. This is part of https://llvm.org/bugs/show_bug.cgi?id=8784.

About the patch, this needs CFI instrumentation if we ever want to go towards instruction-precise .eh_frame.

The end result of the discussion of r251904 was that we're moving in that direction. See D14948.

lib/Target/X86/X86InstrInfo.cpp
5334

I know this is all over the file, but it's kind of odd.
Any particular reason you can think of that all these static functions that receive a *this parameter aren't private members instead?

test/CodeGen/X86/movtopush.ll
117 ↗(On Diff #41090)

Ha. This looks positively silly. Probably the right thing to do (this is what ICC does here as well), but, nonetheless, silly. :-)

hans updated this revision to Diff 41173.Nov 25 2015, 1:01 PM
hans edited edge metadata.

Addressing comments.

hans added a comment.Nov 25 2015, 1:02 PM

About the patch, this needs CFI instrumentation if we ever want to go towards instruction-precise .eh_frame.

The end result of the discussion of r251904 was that we're moving in that direction. See D14948.

Thanks for pointing this out. I've uploaded a new patch that emits CFI, but this is not something I'm really familiar with, so please take a look at it.

lib/Target/X86/X86InstrInfo.cpp
5334

I'll make it a member instead.

test/CodeGen/X86/movtopush.ll
117 ↗(On Diff #41090)

If you want silly, check out this one :-)

long long f() { return -2; }

built with -Os -m32, ICC (and Clang with my patch) emit:

pushl   $-2
popl    %eax
pushl   $-1
popl    %edx
retl

Maybe we could use CDQ instead there?
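
Something like this, perhaps (a sketch, not what the patch currently emits):

pushl $-2
popl  %eax             # %eax = -2
cltd                   # CDQ: sign-extend %eax into %edx, giving %edx = -1
retl

That would drop the second push/pop pair and save two bytes, at the cost of making the %edx write depend on the %eax write.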

hans updated this revision to Diff 41193.Nov 25 2015, 3:41 PM

Try to emit the right stack adjustment for 32- and 64-bit.

joerg added a comment.Nov 26 2015, 9:04 AM

FYI, before:

bootxx_ffsv2 size 7180, 500 free

after:

bootxx_ffsv2 size 7132, 548 free

The code LGTM, modulo minor comments.
Regarding the code sequence itself, I still suggest you wait for input from Dave.

lib/Target/X86/X86InstrCompiler.td
271

It would probably be a good idea to add a schedule, but I'm not sure what it ought to be.
I'm fine with leaving it as a TODO.

lib/Target/X86/X86InstrInfo.cpp
5312

This isn't quite right - usePreciseUnwindInfo() is meant to indicate that we should be using precise CFI *if* we use CFI. It doesn't say anything about whether we should be using CFI at all. The way to determine whether CFI is needed is something like this (taken from FrameLowering):

bool IsWin64Prologue = MF.getTarget().getMCAsmInfo()->usesWindowsCFI();
bool NeedsDwarfCFI =
    !IsWin64Prologue && (MMI.hasDebugInfo() || Fn->needsUnwindTableEntry());

What you want here is "!TFL->hasFP(MF) && NeedsDwarfCFI && MF.getMMI().usePreciseUnwindInfo()"
This probably ought to be better encapsulated somewhere, but that's a different patch.

test/CodeGen/X86/movtopush.ll
117–118 ↗(On Diff #41193)

Not sure. CDQ creates a dependency between the two register writes, as opposed to two independent moves.
But using two push/pop pairs probably does as well. And CDQ *is* smaller...

hans updated this revision to Diff 41414.Nov 30 2015, 11:21 AM
hans marked 2 inline comments as done.

Addressing Michael's comments.

silvas added a subscriber: silvas.Nov 30 2015, 6:51 PM

A problem with the OR approach is that there's a dependency on the previous value in %eax. The DEC approach avoids that, but maybe DEC is slow on some micro-architectures?

DEC is what we use in the backend for counted loops, so I wouldn't worry.
(i.e. we lower

for (int i = 0; i < n; i++)
  bar();

into a loop like

1:
...
decl %ebx
jnz 1b

)

For every x86 microarchitecture I'm familiar with, on paper xor+dec seems preferable to push+pop. I would avoid doing push+pop unless we can get some insight into what ICC is shooting for / exploiting here. E.g. push+pop on AMD Jaguar creates micro-ops on the load unit, which is a bottleneck point:
(microbenchmarked the 64-bit analog to confirm:
http://reviews.llvm.org/F1123937
http://reviews.llvm.org/F1123938
)

My guess is that ICC is preferring the push+pop because it doesn't touch flags and so can be easily inserted anywhere.
Note that most x86 nowadays have sideband stack tracking in the instruction decoder to break dependencies between push/pop instructions; e.g. push $-1; pop %rax; push $-1; pop %rdx; won't have a dependency between the two pairs of instructions.

Still, unless Intel uarches are doing something funky to optimize this in the decoder, the store-forwarding latency will still make this sequence quite poor compared to xor+dec (as IACA points out).

hans added a comment.Nov 30 2015, 7:30 PM

A problem with the OR approach is that there's a dependency on the previous value in %eax. The DEC approach avoids that, but maybe DEC is slow on some micro-architectures?

DEC is what we use in the backend for counted loops, so I wouldn't worry.
(i.e. we lower

for (int i = 0; i < n; i++)
  bar();

into a loop like

1:
...
decl %ebx
jnz 1b

)

For every x86 microarchitecture I'm familiar with, on paper xor+dec seems preferable to push+pop. I would avoid doing push+pop unless we can get some insight into what ICC is shooting for / exploiting here. E.g. push+pop on AMD Jaguar creates micro-ops on the load unit, which is a bottleneck point:
(microbenchmarked the 64-bit analog to confirm:
http://reviews.llvm.org/F1123937
http://reviews.llvm.org/F1123938
)

Thanks for looking into this! How does the "or -1" approach compare in your benchmark?

In D14971#299120, @hans wrote:

Thanks for looking into this! How does the "or -1" approach compare in your benchmark?

It is the same execution cost as the MOV version, so the result will be the same.

I confirmed with the benchmark: http://reviews.llvm.org/P536
(this also includes 2 more milieus that stress the ALU in different ways; on Jaguar at least, the ALU throughput for "fast" ops is high enough compared to the decode and retire throughput that neither one ends up actually bottlenecking on the ALU (at most, the bottleneck is a "tie" between the ALU and the retire))

This is where we hit an issue with micro-benchmarking. In this microbenchmark, the only difference between the "mov -1" and the "or -1" is the loop-carried dependency, which isn't an issue since its loop in the dependence graph is just a single cycle (itself, a lone single-cycle OR operation).
The cost of the false dependency could become higher if the register used by the "or -1" version was the destination of a previous high-latency operation, thus preventing the processor from getting work done "in the shadow" of the high-latency operation. E.g. a code sequence like:

void foo(int x) {
  if (bar(x))
    return;
  .... do a bunch of work dependent on materializing -1 ....
}
bool bar(int x) {
  reg_t* Bucket = computeAddressOfBucket(x);
  reg_t Reg = *Bucket; // Load from a random-ish address, say, not in cache.
  if (Reg != 0)
    return false;
  ... other work ...
}

If we branch-predict into the return false of bar (and through the if (bar(x)), which is easy), then we end up in ".... do a bunch of work dependent on materializing -1 ....". If we end up materializing -1 into the same register as the high-latency Reg register in function bar, then using "or -1" to materialize it is potentially very detrimental to performance. If only the processor knew that we didn't actually care about the previous value of the register when materializing -1, we might be able to execute dozens of instructions (limited by the processor's retirement buffer: 64 COPs for Jaguar, 200-ish micro-ops for the latest big Intel cores) while waiting for the load from Bucket. If the branch predictions are right, then those dozens of instructions that we managed to get through while waiting for the load are effectively already "done"* once the data arrives from memory. Note that in this situation (correct speculation), all of the stuff that is executed "in the shadow" of the high-latency operation is actually on the critical path (it would have had to be executed anyway), so this speculative execution is potentially chopping about min(latency of the high-latency operation, size of the reorder buffer) off the critical path, which can be huge.

To be honest, I don't have measurements on real-world code about how frequently this kind of unfortunate register aliasing happens and how detrimental it is, but it is probably worth avoiding (especially since we have multiple alternatives). Off the top of my head, the only way I can think of indirectly measuring this would be to hack the compiler to insert a bunch of "xor reg, reg" to clear any live-outs of functions that came from memory ops and see the performance effect. Would probably also need a control version that generates the same "xor reg, reg" instructions, but against different registers (e.g. all the same register, which isn't one of the ones we actually care about clearing) to control for the extra decode/icache cost.

*(except for the retirement cost: 2 COPs per cycle on Jaguar, 4 fused micro-ops per cycle on Sandy Bridge / Haswell; the retirement happens in parallel with the rest of the core's operation, though, so execution is able to immediately make forward progress)

mkuper added inline comments.Dec 6 2015, 5:12 AM
lib/Target/X86/X86InstrInfo.cpp
5312

I've removed usePreciseUnwindInfo() in r254874 (it ended up being dead code, since it always returned true); just remove it from the check. Sorry for the churn.

hans updated this revision to Diff 42082.Dec 7 2015, 11:19 AM

Rebase, and remove usePreciseUnwindInfo().

hans added a comment.Dec 7 2015, 11:20 AM

DavidKreitzer, have you had time to look at this, and do you have any comments on the performance trade-offs between the push/pop approach and the other alternatives?

DavidKreitzer edited edge metadata.Dec 8 2015, 7:40 AM

Hi Hans,

I'd have responded sooner, but I was travelling last week and am still catching up.

Bottom Line: Using xor/inc & xor/dec for constant loads of 1 & -1 is preferred over using push/pop when optimizing for size. The fact that ICC fails to special-case 1 & -1 is a bit of an oversight. On modern Intel Core processors, the xor/inc & xor/dec idioms will be faster than push/pop. inc/dec are slower than add/sub on some older mainstream processors (e.g. Pentium 4) and also on modern smaller core processors like Silvermont due to the partial flag update. But the xor/inc, xor/dec idioms should still perform no worse than the push/pop sequence.
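
For reference, the 32-bit idioms in question, with their standard encodings sketched in:

xorl %eax, %eax        # 31 c0
incl %eax              # 40               (3 bytes total, materializes 1)
xorl %eax, %eax        # 31 c0
decl %eax              # 48               (3 bytes total, materializes -1)

Both match push/pop in size while staying off the stack.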

I also echo Sean's advice to avoid "or -1" in most cases. The one exception (which could be a TODO) is that "or -1" is the best option for machines that are not affected by the false dependence, i.e. strictly in-order machines such as older Atoms. If we know for sure that the code will only be run on such a machine, we could take advantage of this fact.

-Dave

Did you consider only using push/pop under minsize? It is a bit of a judgment call whether push/pop is appropriate for optsize since there is a definite performance cost. ICC does it, but then again, ICC doesn't have a minsize option currently.

hans added a comment.Dec 8 2015, 4:43 PM

Bottom Line: Using xor/inc & xor/dec for constant loads of 1 & -1 is preferred over using push/pop when optimizing for size. The fact that ICC fails to special-case 1 & -1 is a bit of an oversight. On modern Intel Core processors, the xor/inc & xor/dec idioms will be faster than push/pop. inc/dec are slower than add/sub on some older mainstream processors (e.g. Pentium 4) and also on modern smaller core processors like Silvermont due to the partial flag update. But the xor/inc, xor/dec idioms should still perform no worse than the push/pop sequence.

I also echo Sean's advice to avoid "or -1" in most cases. The one exception (which could be a TODO) is that "or -1" is the best option for machines that are not affected by the false dependence, i.e. strictly in-order machines such as older Atoms. If we know for sure that the code will only be run on such a machine, we could take advantage of this fact.

Thanks for the information! Sounds like we're on the same page here. I'll upload a new patch that uses xor inc/dec.

I still think the push/pop trick is pretty neat, but it might be more suitable for minsize, as you say.

hans updated this revision to Diff 42249.Dec 8 2015, 4:43 PM
hans edited edge metadata.

Use the xor inc/dec technique instead.

Let me know what you think.

hans updated this revision to Diff 42330.Dec 9 2015, 1:28 PM

Follow-up from the previous patch:

joerg pointed out on IRC that we should definitely do push-pop for minsize (I agree), and maybe also at optsize for all values that are not -1 and 1 (I don't have a strong opinion yet).

majnemer wondered if the clobbering of EFLAGS was properly tracked. I believe it was: the pattern expands to instructions that declare this dependence; the pattern shouldn't (can't) declare that itself. He also expressed concern that we're increasing the pressure on EFLAGS. That's true, but I don't think it's a big problem in practice.

A serious problem with the previous patch is that it broke re-materialization. To handle that, I think we have to use pseudo instructions. My new patch does that.

I also realized that the size trade-offs are more complex in 64-bit mode. For example, xor-inc takes 4 bytes there, but 6 bytes if REX-prefixed. That means it's not profitable for registers that need REX (except 64-bit -1). I could use the GR32_NOREX register class on the pseudo instructions, but (I think) that means we'd no longer be able to materialize these constants in REX registers. We could also make the pseudo expand to different things depending on register class, but then it's getting iffy.

And, in 64-bit mode, 32-bit "or -1" and 64-bit push-pop are still only 3 bytes (4 bytes with REX), so shorter than xor-inc -- maybe we should use one of them instead?
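
Concretely, the 64-bit-mode encodings behind these counts (sketched for comparison):

xorl  %eax, %eax       # 31 c0
incl  %eax             # ff c0            4 bytes (incl has no 1-byte form in 64-bit mode)
xorl  %r8d, %r8d       # 45 31 c0
incl  %r8d             # 41 ff c0         6 bytes with REX
orl   $-1, %eax        # 83 c8 ff         3 bytes
pushq $-1              # 6a ff
popq  %rax             # 58               3 bytes (popq %r8 would be 41 58, 4 bytes total)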

To get this patch moving, I would like to punt on the 64-bit issue and just do the 32-bit 1 and -1. What do you think?

DavidKreitzer added inline comments.Dec 11 2015, 7:21 AM
lib/Target/X86/X86InstrCompiler.td
270

What is the advantage of defining new pseudo-ops for mov 1 and mov -1 vs. simply using MOV32ri and expanding it post-RA where you could consider all the relevant factors such as OptForSize, OptForMinSize, the immediate value, whether the destination register requires REX, etc.?

I suppose the pseudos convey your intent to later expand to xor+inc/dec, but that hardly seems to justify the added complexity.

The new pseudos are consistent with what's being done with MOV32r0, so my question applies equally to the existing MOV32r0 code.

lib/Target/X86/X86InstrInfo.cpp
2497

The immediate value needs to be adjusted for MOV32r1/MOV32r_1.

emaste added a subscriber: emaste.Dec 11 2015, 7:22 AM
hans marked an inline comment as done.Dec 11 2015, 10:53 AM
hans added inline comments.
lib/Target/X86/X86InstrCompiler.td
270

The reason I was doing it this way is that I was copying MOV32r0 :-)

(That one was added back in r25872, changed to be expanded by X86MCInstLower in r95433, and then by expandPostRAPseudo in r198254.)

I'm still very new to the backend, so I might be getting this wrong, but one reason I can think of for having a separate instruction for MOV32r0 is to specify that it clobbers EFLAGS, whereas MOV32ri does not.

If we expanded MOV32ri in expandPostRAPseudo instead, there would be no guarantee that the instruction isn't in a place where EFLAGS is live. We could check for that situation and not use xor, but I think that would be a pessimization.
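
To illustrate with a hypothetical sequence: nothing would stop a MOV32ri from landing somewhere like

cmpl  $0, %ecx
movl  $-1, %eax        # EFLAGS live across this mov
jne   1f

where rewriting the mov into xor+dec would clobber the flags the jne needs.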

Does that make sense?

hans updated this revision to Diff 42544.Dec 11 2015, 10:54 AM

Get the value right when re-materializing with mov.

DavidKreitzer added inline comments.Dec 11 2015, 3:33 PM
lib/Target/X86/X86InstrCompiler.td
270

I certainly can't argue with the logic that what you are adding is consistent with MOV32r0. And from that perspective, I think your patch is fine the way it is. I just wonder if using MOV32ri universally and having a late expansion that chooses between the many possible implementations is a better long-term direction.

I understand your point about the EFLAGS clobber, but note that the pessimism works both ways. The MOV32r0 prevents transformations that would make EFLAGS live across it, e.g. to remove a redundant compare. We'd probably prefer to remove the redundant compare & use "mov $0, %eax" to generate the 0 than leave the redundant compare & use "xor %eax, %eax". (Of course, you could specifically check for MOV32r0 and convert it to MOV32ri when the transformation happens, just like X86InstrInfo::reMaterialize is doing, but you could do the same sort of thing with the MOV32ri representation.)
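
A sketch of that trade-off, with hypothetical surrounding code. With the pseudo:

cmpl  $0, %ecx         # sets EFLAGS
xorl  %eax, %eax       # MOV32r0 clobbers EFLAGS...
cmpl  $0, %ecx         # ...so the (otherwise redundant) compare must be repeated
jne   1f

versus what we would prefer:

cmpl  $0, %ecx
movl  $0, %eax         # leaves EFLAGS alone; no second compare needed
jne   1f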

lib/Target/X86/X86InstrInfo.cpp
2475

I like this improvement, thanks!

hans added inline comments.Dec 14 2015, 5:03 PM
lib/Target/X86/X86InstrCompiler.td
270

I don't really like the idea of having a pseudo instruction expand in (too many) different ways, but I also feel I don't have enough experience in the backend to argue my case very well :-)

Already, I would have preferred to just do this with patterns, but I think we need the pseudo instruction for rematerialization to work.

But having a generic MOV32ri that expands to different things seems like it's almost working against the code generator somehow. For example, how can it get scheduled effectively?

Also, checking whether EFLAGS is live wouldn't be very efficient. For example, X86InstrInfo::isSafeToClobberEFLAGS only considers a few instructions in each direction for efficiency. I would much prefer to have EFLAGS taken into account when choosing to use the instruction.

I still prefer the current patch. We could later add another pseudo for the push/pop lowering with minsize, and then try to figure out the best way to handle 64-bit.

DavidKreitzer added inline comments.Dec 15 2015, 6:34 AM
lib/Target/X86/X86InstrCompiler.td
270

That works for me. Your concern about EFLAGS is definitely justified. And this is small enough in scope that we can easily make changes, if necessary.

This revision was automatically updated to reflect the committed changes.