Download Raw Diff

Details

Reviewers

majnemer
mkuper
DavidKreitzer

Commits

rG08d5905bac16: [X86] Smaller code for materializing 32-bit 1 and -1 constants
rL255656: [X86] Smaller code for materializing 32-bit 1 and -1 constants

Summary

Me and David took a look at how different compilers lower "move -1 into eax" when optimizing for size.

Clang will emit "movl $-1, %eax" (5 bytes), GCC and MSVC use "orl $-1, %eax" (3 bytes), ICC uses "pushl $-1; popl %eax" (3 bytes). A fourth alternative would be "xor %eax, %eax; decl %eax" (3 bytes).

A problem with the OR approach is that there's a dependency on the previous value in %eax. The DEC approach avoids that, but maybe DEC is slow on some micro-architectures?

ICC's PUSH/POP approach avoids the dependency problem and has the nice property that it works with all 8-bit immediates under sign extension. However, potentially touching memory seems scary, and IACA says it has a latency of 6 cycles. Is it really that slow, or is this because there's something about the stack that IACA doesn't model? I tried to micro-benchmark the difference between MOV and PUSH/POP on my machine, and the difference was in the noise.

Since ICC emits this code, it would be great if someone from Intel could comment about the size/speed trade-of here.

I'm attaching my attempt at implementing this in LLVM. Please let me know what you think. Suggestions for more reviewers is also welcome.

Diff Detail

Repository: rL LLVM

Event Timeline

hans updated this revision to Diff 41090.Nov 24 2015, 3:30 PM

hans retitled this revision from to X86: Emit smaller code for moving 8-bit immediates.

hans updated this object.

hans added reviewers: majnemer, mkuper, DavidKreitzer.

hans added a subscriber: llvm-commits.

Thanks for working on this, it should add some breathing room for the FreeBSD and NetBSD boot blocks. This is part of https://llvm.org/bugs/show_bug.cgi?id=8784.

About the patch, this should be CFI instrumentation if we ever want to go towards instruction precise .eh_frame.

Thanks, Hans!

I think Dave is better qualified to answer the performance question than I am.

In D14971#296239, @joerg wrote:

Thanks for working on this, it should add some breathing room for the FreeBSD and NetBSD boot blocks. This is part of https://llvm.org/bugs/show_bug.cgi?id=8784.

About the patch, this should be CFI instrumentation if we ever want to go towards instruction precise .eh_frame.

The end result of the discussion of r251904 was that we're moving in that direction. See D14948.

lib/Target/X86/X86InstrInfo.cpp
5289 ↗	(On Diff #41090)	I know this is all over the file, but it's kind of odd. Any particular reason you can think of that all these static functions that receive a *this parameter aren't private members instead?
test/CodeGen/X86/movtopush.ll
117 ↗	(On Diff #41090)	Ha. This looks positively silly. Probably the right thing to do (this is what ICC does here as well), but, nonetheless, silly. :-)

Addressing comments.

In D14971#296355, @mkuper wrote:

In D14971#296239, @joerg wrote:

About the patch, this should be CFI instrumentation if we ever want to go towards instruction precise .eh_frame.

The end result of the discussion of r251904 was that we're moving in that direction. See D14948.

Thanks for pointing this out. I've uploaded a new patch that emits CFI, but this is not something I'm really familiar so please take a look at it.

lib/Target/X86/X86InstrInfo.cpp
5289 ↗	(On Diff #41090)	I'll make it a member instead.
test/CodeGen/X86/movtopush.ll
117 ↗	(On Diff #41090)	If you want silly, check out this one :-) long long f() { return -2; } built with -Os -m32, ICC (and Clang with my patch) emit: pushl $-2 popl %eax pushl $-1 popl %edx retl Maybe we could use CDQ instead there?

Try to emit the right stack adjustment for 32- and 64-bit.

FYI, before:

bootxx_ffsv2 size 7180, 500 free

after:

bootxx_ffsv2 size 7132, 548 free

The code LGTM, modulo minor comments.
Regarding the code sequence itself, I still suggest you wait for input from Dave.

lib/Target/X86/X86InstrCompiler.td
271 ↗	(On Diff #41193)	It would probably be a good idea to add a schedule, but I'm not sure what it ought to be. I'm fine with leaving it as a TODO.
lib/Target/X86/X86InstrInfo.cpp
5260 ↗	(On Diff #41193)	This isn't quite right - usePreciseUnwindInfo() is meant to indicate that we should be using precise CFI if we use CFI. It doesn't say anything about whether we should be using CFI at all. The way to determine whether CFI is needed is something like this (taken from FrameLowering): bool IsWin64Prologue = MF.getTarget().getMCAsmInfo()->usesWindowsCFI(); bool NeedsDwarfCFI = !IsWin64Prologue && (MMI.hasDebugInfo() \|\| Fn->needsUnwindTableEntry()); What you want here is "!TFL->hasFP(MF) && NeedsDwarfCFI && MF.getMMI().usePreciseUnwindInfo()" This probably ought to be better encapsulated somewhere, but that's a different patch.
test/CodeGen/X86/movtopush.ll
117–118 ↗	(On Diff #41193)	Not sure. CDQ creates a dependency between the two register writes. as opposed to two independent moves. But using two push/pop pairs probably does as well. And CDQ is smaller...

Addressing Michael's comments.

A problem with the OR approach is that there's a dependency on the previous value in %eax. The DEC approach avoids that, but maybe DEC is slow on some micro-architectures?

DEC is what we use in the backend for counted loops, so I wouldn't worry.
(i.e. we lower

for (int i = 0; i < n; i++)
  bar();

into a loop like

1:
...
decl %ebx
jnz 1b

)

For every x86 microarchitecture I'm familiar with, on paper xor+dec seems preferable to push+pop. I would avoid doing push+pop unless we can get some insight into what ICC is shooting for / exploiting here. E.g. push+pop on AMD Jaguar creates microops on the load unit which is a bottleneck point:
(microbenchmarked the 64-bit analog to confirm:
http://reviews.llvm.org/F1123937
http://reviews.llvm.org/F1123938
)

My guess is that ICC is preferring the push+pop because it doesn't touch flags and can so can be easily inserted anywhere.
Note that most x86 nowadays have sideband stack tracking in the instruction decoder to break dependencies between push/pop instructions; e.g. push $-1; pop %rax; push $-1; pop %rdx; won't have a dependency between the two pairs of instructions.

Still, unless intel uarches are doing something funky optimizing this in the decoder, the store forwarding latency will still make this sequence quite poor compared to xor+dec (as IACA points out).

In D14971#299111, @silvas wrote:
A problem with the OR approach is that there's a dependency on the previous value in %eax. The DEC approach avoids that, but maybe DEC is slow on some micro-architectures?

DEC is what we use in the backend for counted loops, so I wouldn't worry.
(i.e. we lower
for (int i = 0; i < n; i++)
  bar();
into a loop like
1:
...
decl %ebx
jnz 1b
)

For every x86 microarchitecture I'm familiar with, on paper xor+dec seems preferable to push+pop. I would avoid doing push+pop unless we can get some insight into what ICC is shooting for / exploiting here. E.g. push+pop on AMD Jaguar creates microops on the load unit which is a bottleneck point:
(microbenchmarked the 64-bit analog to confirm:
http://reviews.llvm.org/F1123937
http://reviews.llvm.org/F1123938
)

Thanks for looking into this! How does the "or -1" approach compare in your benchmark?

In D14971#299120, @hans wrote:

Thanks for looking into this! How does the "or -1" approach compare in your benchmark?

It is the same execution cost as the MOV version so the result will be the same.

I confirmed with the benchmark: http://reviews.llvm.org/P536
(this also includes 2 more milieu's that stress the ALU in different ways; on Jaguar at least, the ALU throughput for "fast" ops is high enough compared to the decode and retire throughput that neither one ends up actually bottlenecking on the ALU (at most, the bottleneck is a "tie" between the ALU and the retire))

This is where we hit an issue with micro-benchmarking. In this microbenchmark the only difference between the "mov -1" and the "or -1" is the loop-carried dependency which isn't an issue since its loop in the dependence graph is just a single cycle (itself, a lone single-cycle OR operation).
The cost of the false dependency could become higher if the register used by the "or -1" version was the destination of a previous high-latency operation, thus preventing the processor from getting work done "in the shadow" of the high-latency operation. E.g. a code sequence like:

void foo(int x) {
  if (bar(x))
    return;
  .... do a bunch of work dependent on materializing -1 ....
}
bool bar(int x) {
  reg_t* Bucket = computeAddressOfBucket(x);
  reg_t Reg = *Bucket; // Load from a random-ish address, say, not in cache.
  if (Reg != 0)
    return false;
  ... other work ...
}

If we branch predict into the return false of bar (and through the if (bar(x)) which is easy) then we end up in ".... do a bunch of work dependent on materializing -1 ....". If we end up materializing -1 into the same register as the high latency Reg register in function bar, then using "or -1" to materialize it is potentially very detrimental to performance. If only the processor knew that we didn't actually care about the previous value of the register when materializing -1, we may be able to execute dozens of instructions (limited by the processor's retirement buffer (64 COP's for Jaguar, 200-ish micro-ops for the latest big Intel cores)) while waiting for the load from Bucket. If the branch predictions are right, then those dozens of instructions that we managed to get through while waiting for the load are effectively already "done"* once the data arrives from memory. Note that in this situation (correct speculation), all of the stuff that is executed "in the shadow" of the high-latency operation is actually on the critical path (would have had to be executed anyway), so this speculative execution is potentially chopping about min(latency of the high-latency operation, size of the reorder buffer) off the critical path, which can be huge.

To be honest, I don't have measurements on real-world code about how frequently this kind of unfortunate register aliasing happens and how detrimental it is, but it is probably worth avoiding (especially since we have multiple alternatives). Off the top of my head, the only way I can think of indirectly measuring this would be to hack the compiler to insert a bunch of "xor reg, reg" to clear any live-outs of functions that came from memory ops and see the performance effect. Would probably also need a control version that generates the same "xor reg, reg" instructions, but against different registers (e.g. all the same register, which isn't one of the ones we actually care about clearing) to control for the extra decode/icache cost.

*(except for the retirement cost, 2 COP's per cycle in Jaguar, Sandy Bridge / Haswell do 4 fused micro-ops per cycle; the retirement happens in parallel with the rest of the core's operation though, so execution is able to immediately make forward progress)

mkuper added inline comments.Dec 6 2015, 5:12 AM

lib/Target/X86/X86InstrInfo.cpp
5260 ↗	(On Diff #41414)	I've removed usePreciseUnwindInfo() in r254874 (It ended up being dead code, since it always returns true), just remove it from the check. Sorry for the churn.

Rebase, and remove usePreciseUnwindInfo().

DavidKreitzer, have you had time to look at this, and do you have any comments on the performance trade-offs between the push/pop approach and the other alternatives?

Hi Hans,

I'd have responded sooner, but I was travelling last week and am still catching up.

Bottom Line: Using xor/inc & xor/dec for constant loads of 1 & -1 is preferred over using push/pop when optimizing for size. The fact that ICC fails to special-case 1 & -1 is a bit of an oversight. On modern Intel Core processors, the xor/inc & xor/dec idioms will be faster than push/pop. inc/dec are slower than add/sub on some older mainstream processors (e.g. Pentium 4) and also on modern smaller core processors like Silvermont due to the partial flag update. But the xor/inc, xor/dec idioms should still perform no worse than the push/pop sequence.

I also echo Sean's advice to avoid "or -1" in most cases. The one exception (which could be a TODO) is that "or -1" is the best option for machines that are not affected by the false dependence, i.e. strictly in-order machines such as older Atoms. If we know for sure that the code will only be run on such a machine, we could take advantage of this fact.

-Dave

Did you consider only using push/pop under minsize? It is a bit of a judgment call whether push/pop is appropriate for optsize since there is a definite performance cost. ICC does it, but then again, ICC doesn't have a minsize option currently.

In D14971#304892, @DavidKreitzer wrote:

Bottom Line: Using xor/inc & xor/dec for constant loads of 1 & -1 is preferred over using push/pop when optimizing for size. The fact that ICC fails to special-case 1 & -1 is a bit of an oversight. On modern Intel Core processors, the xor/inc & xor/dec idioms will be faster than push/pop. inc/dec are slower than add/sub on some older mainstream processors (e.g. Pentium 4) and also on modern smaller core processors like Silvermont due to the partial flag update. But the xor/inc, xor/dec idioms should still perform no worse than the push/pop sequence.

I also echo Sean's advice to avoid "or -1" in most cases. The one exception (which could be a TODO) is that "or -1" is the best option for machines that are not affected by the false dependence, i.e. strictly in-order machines such as older Atoms. If we know for sure that the code will only be run on such a machine, we could take advantage of this fact.

Thanks for the information! Sounds like we're on the same page here. I'll upload a new patch that uses xor inc/dec.

I still think the push/pop trick is pretty neat, but it might be more suitable for minsize, as you say.

Use the xor inc/dec technique instead.

Let me know what you think.

Follow-up from the previous patch:

joerg pointed out on IRC that we should definitely do push-pop for minsize (I agree), and maybe also at optsize for all values that are not -1 and 1 (I don't have a strong opinion yet).

majnemer wondered if the clobbering of EFLAGS was properly tracked. I believe it was: the pattern expands to instructions that declare this dependence; the pattern shouldn't (can't) declare that itself. He also expressed concern that we're increasing the pressure on the EFLAGS. That's true, but I don't think it's a big problem in practice.

A serious problem with the previous patch is that it broke re-materialization. To handle that, I think we have to use pseudo instructions. My new patch does that.

I also realized that the size trade-offs are more complex in 64-bit mode. For example, xor-inc takes 4 bytes there, but 6 bytes if REX-prefixed. That means it's not profitable for registers that need REX (except 64-bit -1). I could use the GR32_NOREX register class on the pseudo instructions, but (I think) that means we'd no longer be able to materialize these constants in REX registers. We could also make the pseudo expand to different things depending on register class, but then it's getting iffy.

And, in 64-bit mode, 32-bit "or -1" and 64-bit push-pop are still only 3 bytes (4 bytes with REX), so shorter than xor-inc -- maybe we should use one of them instead?

To get this patch moving, I would like to punt on the 64-bit issue and just do the 32-bit 1 and -1. What do you think?

DavidKreitzer added inline comments.Dec 11 2015, 7:21 AM

lib/Target/X86/X86InstrCompiler.td
270 ↗	(On Diff #42330)	What is the advantage of defining new pseudo-ops for mov 1 and mov -1 vs. simply using MOV32ri and expanding it post-RA where you could consider all the relevant factors such as OptForSize, OptForMinSize, the immediate value, whether the destination register requires REX, etc.? I suppose the pseudos convey your intent to later expand to xor+inc/dec, but that hardly seems to justify the added complexity. The new pseudos are consistent with what's being done with MOV32r0, so my question applies equally to the existing MOV32r0 code.
lib/Target/X86/X86InstrInfo.cpp
2483 ↗	(On Diff #42330)	The immediate value needs to be adjusted for MOV32r1/MOV32r_1.

emaste added a subscriber: emaste.Dec 11 2015, 7:22 AM

hans marked an inline comment as done.Dec 11 2015, 10:53 AM

hans added inline comments.

lib/Target/X86/X86InstrCompiler.td
270 ↗	(On Diff #42330)	The reason I was doing it this way is that I was copying MOV32r0 :-) (That one was added back in r25872, changed to be expanded by X86MCInstLower in r95433, and then to expand by expandPostRAPseudo in r198254.) I'm still very new to the backend, so I might be getting this wrong, but one reason I can think of for having a separate instruction for MOV32r0 is to specify that it clobbers EFLAGS, whereas MOV32ri does not. If we expanded MOV32ri in expandPostRAPseudo instead, there would be no guarantee that the instruction isn't in a place where EFLAGS is live. We could check for that situation and not use xor, but I think that would be a pessimization. Does that make sense?

Get the value right when re-materializing with mov.

DavidKreitzer added inline comments.Dec 11 2015, 3:33 PM

lib/Target/X86/X86InstrCompiler.td
270 ↗	(On Diff #42544)	I certainly can't argue with the logic that what you are adding is consistent with MOV32r0. And from that perspective, I think your patch is fine the way it is. I just wonder if using MOV32ri universally and having a late expansion that chooses between the many possible implementations is a better long term direction. I understand your point about the EFLAGS clobber, but note that the pessimism works both ways. The MOV32r0 prevents transformations that would make EFLAGS live across it, e.g. to remove a redundant compare. We'd probably prefer to remove the redundant compare & use "mov $0, %eax" to generate the 0 than leave the redundant compare & use "xor %eax, %eax". (Of course, you could specifically check for MOV32r0 and convert it to MOV32ri when the transformation happens just like X86InstrInfo::rematerialize is doing, but you could do the same sort of thing with the MOV32ri representation.)
lib/Target/X86/X86InstrInfo.cpp
2475 ↗	(On Diff #42544)	I like this improvement, thanks!

hans added inline comments.Dec 14 2015, 5:03 PM

lib/Target/X86/X86InstrCompiler.td
270 ↗	(On Diff #42544)	I don't really like the idea of having a pseudo instruction expand in (too many) different ways, but I also feel I don't have enough experience in the backend to argue my case very well :-) Already, I would have preferred to just do this with patterns, but I think we need the pseudo instruction for rematerialization to work. But to have a generic MOV32ri that would expand to different things seems like it's almost working against the code generator somehow. For example, how can it get scheduled effectively? Also, checking whether EFLAGS is live wouldn't be very efficient. For example, X86InstrInfo::isSafeToClobberEFLAGS only considers a few instructions in each direction for efficiency. I would much prefer to have EFLAGS taken into account when choosing to use the instruction. I still prefer the current patch. We could later add another pseudo for the push/pop lowering with minsize, and then try to figure out the best way to handle 64-bit.

DavidKreitzer added inline comments.Dec 15 2015, 6:34 AM

lib/Target/X86/X86InstrCompiler.td
270 ↗	(On Diff #42544)	That works for me. Your concern about EFLAGS is definitely justified. And this is small enough in scope that we can easily make changes, if necessary.

Closed by commit rL255656: [X86] Smaller code for materializing 32-bit 1 and -1 constants (authored by hans). · Explain WhyDec 15 2015, 9:13 AM

This revision was automatically updated to reflect the committed changes.

jevinskie added a subscriber: jevinskie.Dec 16 2015, 3:04 PM

Diff 42866

llvm/trunk/lib/Target/X86/X86InstrCompiler.td

	Show First 20 Lines • Show All 256 Lines • ▼ Show 20 Lines
	// Other widths can also make use of the 32-bit xor, which may have a smaller			// Other widths can also make use of the 32-bit xor, which may have a smaller
	// encoding and avoid partial register updates.			// encoding and avoid partial register updates.
	def : Pat<(i8 0), (EXTRACT_SUBREG (MOV32r0), sub_8bit)>;			def : Pat<(i8 0), (EXTRACT_SUBREG (MOV32r0), sub_8bit)>;
	def : Pat<(i16 0), (EXTRACT_SUBREG (MOV32r0), sub_16bit)>;			def : Pat<(i16 0), (EXTRACT_SUBREG (MOV32r0), sub_16bit)>;
	def : Pat<(i64 0), (SUBREG_TO_REG (i64 0), (MOV32r0), sub_32bit)> {			def : Pat<(i64 0), (SUBREG_TO_REG (i64 0), (MOV32r0), sub_32bit)> {
	let AddedComplexity = 20;			let AddedComplexity = 20;
	}			}

				let Predicates = [OptForSize, NotSlowIncDec, Not64BitMode],
				AddedComplexity = 1 in {
				// Pseudo instructions for materializing 1 and -1 using XOR+INC/DEC,
				// which only require 3 bytes compared to MOV32ri which requires 5.
				let Defs = [EFLAGS], isReMaterializable = 1, isPseudo = 1 in {
				def MOV32r1 : I<0, Pseudo, (outs GR32:$dst), (ins), "",
				[(set GR32:$dst, 1)]>;
				def MOV32r_1 : I<0, Pseudo, (outs GR32:$dst), (ins), "",
				[(set GR32:$dst, -1)]>;
				}

				// MOV16ri is 4 bytes, so the instructions above are smaller.
				def : Pat<(i16 1), (EXTRACT_SUBREG (MOV32r1), sub_16bit)>;
				def : Pat<(i16 -1), (EXTRACT_SUBREG (MOV32r_1), sub_16bit)>;
				}

	// Materialize i64 constant where top 32-bits are zero. This could theoretically			// Materialize i64 constant where top 32-bits are zero. This could theoretically
	// use MOV32ri with a SUBREG_TO_REG to represent the zero-extension, however			// use MOV32ri with a SUBREG_TO_REG to represent the zero-extension, however
	// that would make it more difficult to rematerialize.			// that would make it more difficult to rematerialize.
	let AddedComplexity = 1, isReMaterializable = 1, isAsCheapAsAMove = 1,			let AddedComplexity = 1, isReMaterializable = 1, isAsCheapAsAMove = 1,
	isCodeGenOnly = 1, hasSideEffects = 0 in			isCodeGenOnly = 1, hasSideEffects = 0 in
	def MOV32ri64 : Ii32<0xb8, AddRegFrm, (outs GR32:$dst), (ins i64i32imm:$src),			def MOV32ri64 : Ii32<0xb8, AddRegFrm, (outs GR32:$dst), (ins i64i32imm:$src),
	"", [], IIC_ALU_NONMEM>, Sched<[WriteALU]>;			"", [], IIC_ALU_NONMEM>, Sched<[WriteALU]>;

	▲ Show 20 Lines • Show All 1,577 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/X86/X86InstrInfo.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 2,466 Lines • ▼ Show 20 Lines	bool X86InstrInfo::isSafeToClobberEFLAGS(MachineBasicBlock &MBB,
return false;		return false;
}		}

void X86InstrInfo::reMaterialize(MachineBasicBlock &MBB,		void X86InstrInfo::reMaterialize(MachineBasicBlock &MBB,
MachineBasicBlock::iterator I,		MachineBasicBlock::iterator I,
unsigned DestReg, unsigned SubIdx,		unsigned DestReg, unsigned SubIdx,
const MachineInstr *Orig,		const MachineInstr *Orig,
const TargetRegisterInfo &TRI) const {		const TargetRegisterInfo &TRI) const {
// MOV32r0 is implemented with a xor which clobbers condition code.		bool ClobbersEFLAGS = false;
// Re-materialize it as movri instructions to avoid side effects.		for (const MachineOperand &MO : Orig->operands()) {
unsigned Opc = Orig->getOpcode();		if (MO.isReg() && MO.isDef() && MO.getReg() == X86::EFLAGS) {
if (Opc == X86::MOV32r0 && !isSafeToClobberEFLAGS(MBB, I)) {		ClobbersEFLAGS = true;
		break;
		}
		}

		if (ClobbersEFLAGS && !isSafeToClobberEFLAGS(MBB, I)) {
		// The instruction clobbers EFLAGS. Re-materialize as MOV32ri to avoid side
		// effects.
		int Value;
		switch (Orig->getOpcode()) {
		case X86::MOV32r0: Value = 0; break;
		case X86::MOV32r1: Value = 1; break;
		case X86::MOV32r_1: Value = -1; break;
		default:
		llvm_unreachable("Unexpected instruction!");
		}

DebugLoc DL = Orig->getDebugLoc();		DebugLoc DL = Orig->getDebugLoc();
BuildMI(MBB, I, DL, get(X86::MOV32ri)).addOperand(Orig->getOperand(0))		BuildMI(MBB, I, DL, get(X86::MOV32ri)).addOperand(Orig->getOperand(0))
.addImm(0);		.addImm(Value);
} else {		} else {
MachineInstr *MI = MBB.getParent()->CloneMachineInstr(Orig);		MachineInstr *MI = MBB.getParent()->CloneMachineInstr(Orig);
MBB.insert(I, MI);		MBB.insert(I, MI);
}		}

MachineInstr *NewMI = std::prev(I);		MachineInstr *NewMI = std::prev(I);
NewMI->substituteRegister(Orig->getOperand(0).getReg(), DestReg, SubIdx, TRI);		NewMI->substituteRegister(Orig->getOperand(0).getReg(), DestReg, SubIdx, TRI);
}		}
▲ Show 20 Lines • Show All 2,767 Lines • ▼ Show 20 Lines	static bool Expand2AddrUndef(MachineInstrBuilder &MIB,
// implicit operands.		// implicit operands.
MIB.addReg(Reg, RegState::Undef).addReg(Reg, RegState::Undef);		MIB.addReg(Reg, RegState::Undef).addReg(Reg, RegState::Undef);
// But we don't trust that.		// But we don't trust that.
assert(MIB->getOperand(1).getReg() == Reg &&		assert(MIB->getOperand(1).getReg() == Reg &&
MIB->getOperand(2).getReg() == Reg && "Misplaced operand");		MIB->getOperand(2).getReg() == Reg && "Misplaced operand");
return true;		return true;
}		}

		static bool expandMOV32r1(MachineInstrBuilder &MIB, const TargetInstrInfo &TII,
		bool MinusOne) {
		MachineBasicBlock &MBB = *MIB->getParent();
		DebugLoc DL = MIB->getDebugLoc();
		unsigned Reg = MIB->getOperand(0).getReg();

		// Insert the XOR.
		BuildMI(MBB, MIB.getInstr(), DL, TII.get(X86::XOR32rr), Reg)
		.addReg(Reg, RegState::Undef)
		.addReg(Reg, RegState::Undef);

		// Turn the pseudo into an INC or DEC.
		MIB->setDesc(TII.get(MinusOne ? X86::DEC32r : X86::INC32r));
		MIB.addReg(Reg);

		return true;
		}

// LoadStackGuard has so far only been implemented for 64-bit MachO. Different		// LoadStackGuard has so far only been implemented for 64-bit MachO. Different
// code sequence is needed for other targets.		// code sequence is needed for other targets.
static void expandLoadStackGuard(MachineInstrBuilder &MIB,		static void expandLoadStackGuard(MachineInstrBuilder &MIB,
const TargetInstrInfo &TII) {		const TargetInstrInfo &TII) {
MachineBasicBlock &MBB = *MIB->getParent();		MachineBasicBlock &MBB = *MIB->getParent();
DebugLoc DL = MIB->getDebugLoc();		DebugLoc DL = MIB->getDebugLoc();
unsigned Reg = MIB->getOperand(0).getReg();		unsigned Reg = MIB->getOperand(0).getReg();
const GlobalValue *GV =		const GlobalValue *GV =
Show All 12 Lines
}		}

bool X86InstrInfo::expandPostRAPseudo(MachineBasicBlock::iterator MI) const {		bool X86InstrInfo::expandPostRAPseudo(MachineBasicBlock::iterator MI) const {
bool HasAVX = Subtarget.hasAVX();		bool HasAVX = Subtarget.hasAVX();
MachineInstrBuilder MIB(*MI->getParent()->getParent(), MI);		MachineInstrBuilder MIB(*MI->getParent()->getParent(), MI);
switch (MI->getOpcode()) {		switch (MI->getOpcode()) {
case X86::MOV32r0:		case X86::MOV32r0:
return Expand2AddrUndef(MIB, get(X86::XOR32rr));		return Expand2AddrUndef(MIB, get(X86::XOR32rr));
		case X86::MOV32r1:
		return expandMOV32r1(MIB, this, /MinusOne=*/ false);
		case X86::MOV32r_1:
		return expandMOV32r1(MIB, this, /MinusOne=*/ true);
case X86::SETB_C8r:		case X86::SETB_C8r:
return Expand2AddrUndef(MIB, get(X86::SBB8rr));		return Expand2AddrUndef(MIB, get(X86::SBB8rr));
case X86::SETB_C16r:		case X86::SETB_C16r:
return Expand2AddrUndef(MIB, get(X86::SBB16rr));		return Expand2AddrUndef(MIB, get(X86::SBB16rr));
case X86::SETB_C32r:		case X86::SETB_C32r:
return Expand2AddrUndef(MIB, get(X86::SBB32rr));		return Expand2AddrUndef(MIB, get(X86::SBB32rr));
case X86::SETB_C64r:		case X86::SETB_C64r:
return Expand2AddrUndef(MIB, get(X86::SBB64rr));		return Expand2AddrUndef(MIB, get(X86::SBB64rr));
▲ Show 20 Lines • Show All 1,964 Lines • Show Last 20 Lines

llvm/trunk/test/CodeGen/X86/materialize-one.ll

				; RUN: llc -mtriple=i686-unknown-linux-gnu -mattr=+cmov %s -o - \| FileCheck %s --check-prefix=CHECK32
				; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+cmov %s -o - \| FileCheck %s --check-prefix=CHECK64

				define i32 @one32() optsize {
				entry:
				ret i32 1

				; CHECK32-LABEL: one32
				; CHECK32: xorl %eax, %eax
				; CHECK32-NEXT: incl %eax
				; CHECK32-NEXT: ret

				; FIXME: Figure out the best approach in 64-bit mode.
				; CHECK64-LABEL: one32
				; CHECK64: movl $1, %eax
				; CHECK64-NEXT: retq
				}

				define i32 @minus_one32() optsize {
				entry:
				ret i32 -1

				; CHECK32-LABEL: minus_one32
				; CHECK32: xorl %eax, %eax
				; CHECK32-NEXT: decl %eax
				; CHECK32-NEXT: ret
				}

				define i16 @one16() optsize {
				entry:
				ret i16 1

				; CHECK32-LABEL: one16
				; CHECK32: xorl %eax, %eax
				; CHECK32-NEXT: incl %eax
				; CHECK32-NEXT: retl
				}

				define i16 @minus_one16() optsize {
				entry:
				ret i16 -1

				; CHECK32-LABEL: minus_one16
				; CHECK32: xorl %eax, %eax
				; CHECK32-NEXT: decl %eax
				; CHECK32-NEXT: retl
				}

				define i32 @test_rematerialization() optsize {
				entry:
				; Materialize -1 (thiscall forces it into %ecx).
				tail call x86_thiscallcc void @f(i32 -1)

				; Clobber all registers except %esp, leaving nowhere to store the -1 besides
				; spilling it to the stack.
				tail call void asm sideeffect "", "~{eax},~{ebx},~{ecx},~{edx},~{edi},~{esi},~{ebp},~{dirflag},~{fpsr},~{flags}"()

				; -1 should be re-materialized here instead of getting spilled above.
				ret i32 -1

				; CHECK32-LABEL: test_rematerialization
				; CHECK32: xorl %ecx, %ecx
				; CHECK32-NEXT: decl %ecx
				; CHECK32: calll
				; CHECK32: xorl %eax, %eax
				; CHECK32-NEXT: decl %eax
				; CHECK32-NOT: %eax
				; CHECK32: retl
				}

				define i32 @test_rematerialization2(i32 %x) optsize {
				entry:
				; Materialize -1 (thiscall forces it into %ecx).
				tail call x86_thiscallcc void @f(i32 -1)

				; Clobber all registers except %esp, leaving nowhere to store the -1 besides
				; spilling it to the stack.
				tail call void asm sideeffect "", "~{eax},~{ebx},~{ecx},~{edx},~{edi},~{esi},~{ebp},~{dirflag},~{fpsr},~{flags}"()

				; Define eflags.
				%a = icmp ne i32 %x, 123
				%b = zext i1 %a to i32
				; Cause -1 to be rematerialized right in front of the cmov, which needs eflags.
				; It must therefore not use the xor-dec lowering.
				%c = select i1 %a, i32 %b, i32 -1
				ret i32 %c

				; CHECK32-LABEL: test_rematerialization2
				; CHECK32: xorl %ecx, %ecx
				; CHECK32-NEXT: decl %ecx
				; CHECK32: calll
				; CHECK32: cmpl
				; CHECK32: setne
				; CHECK32-NOT: xorl
				; CHECK32: movl $-1
				; CHECK32: cmov
				; CHECK32: retl
				}

				declare x86_thiscallcc void @f(i32)

This is an archive of the discontinued LLVM Phabricator instance.

X86: Emit smaller code for moving 8-bit immediates
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 42866

llvm/trunk/lib/Target/X86/X86InstrCompiler.td

llvm/trunk/lib/Target/X86/X86InstrInfo.cpp

llvm/trunk/test/CodeGen/X86/materialize-one.ll

This is an archive of the discontinued LLVM Phabricator instance.

X86: Emit smaller code for moving 8-bit immediatesClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 42866

llvm/trunk/lib/Target/X86/X86InstrCompiler.td

llvm/trunk/lib/Target/X86/X86InstrInfo.cpp

llvm/trunk/test/CodeGen/X86/materialize-one.ll

X86: Emit smaller code for moving 8-bit immediates
ClosedPublic