This is an archive of the discontinued LLVM Phabricator instance.

Improve EmitLoweredSelect for contiguous pseudo CMOV instructions.
ClosedPublic

Authored by kbsmith1 on Jul 22 2015, 2:00 PM.

Download Raw Diff

Details

Reviewers

spatel
ab
mkuper

Commits

rG868dc6544453: [X86] Improve EmitLoweredSelect for contiguous CMOV pseudo instructions.
rL244202: [X86] Improve EmitLoweredSelect for contiguous CMOV pseudo instructions.

Summary

This change improves EmitLoweredSelect so that multiple contiguous pseudo CMOV
instructions with the same (or exactly opposite) conditions get lowered using a single
new basic-block. This eliminates a unnecessary extra basic-blocks (and CFG merge points)
when contiguous CMOVs are being lowered.

Diff Detail

Event Timeline

kbsmith1 updated this revision to Diff 30393.Jul 22 2015, 2:00 PM

kbsmith1 retitled this revision from to Improve EmitLoweredSelect for contiguous pseudo CMOV instructions..

kbsmith1 updated this object.

kbsmith1 added reviewers: mkuper, spatel, ab.

kbsmith1 added a subscriber: llvm-commits.Jul 29 2015, 2:36 PM

Ping on this review.

A few nits here and there.
Testing all types is nice, but isn't IMO the most interesting part: the PHI trickery is!

lib/Target/X86/X86ISelLowering.cpp
19785	of -> if
19788	What about something like the IMO more descriptive "isCMOVPseudo" ?
19790–19810	Can you sort these? I see the pseudo lowering switch is equally unordered, can you fix that too?
19814	return false here?
20040	Since this isn't shared between the loops, can you define it in the for(;;) instead?
20042–20045	IMO the field names don't add much information, as they're always accompanied with the same-name variable. What about a pair instead of this struct?
20089–20092	That "all" bothers me ;) -> "Now remove the CMOV(s)." ?
test/CodeGen/X86/pseudo_cmov_lower.ll
2–3	Why the explicit cpu? Also, IIRC we always use the CMOV pseudos for SSE selects, so can you do x86_64 tests for those as well? I don't expect interesting differences (except much more readable tests), so you can probably only use i386 for non-SSE types.
8	I think these could use more explicit tests, for the copies and PHIs.
218–223	These probably require AVX512. You can ignore them, I think.

This revision now requires changes to proceed.Jul 30 2015, 10:56 AM

Ahmed, thank you for the review. Please see my responses to your comments, and I will be uploading an improved version in just a bit.

lib/Target/X86/X86ISelLowering.cpp
19785	OK.
19788	OK.
19790–19810	OK.
19814	OK.
20040	OK.
20042–20045	Let me try a pair and see whether the code looks cleaner.
20089–20092	OK.
test/CodeGen/X86/pseudo_cmov_lower.ll
2–3	This specific test is meant to test the 32 bit CPU case, as that is where most of the CMOV pseudos occur. For this specific test, the few FP things in here need to use 32 bit cpu to test RFP register type CMOV pseudos. pseudo_cmov_lower1.ll tests the SSE/SSE2 type pseudo CMOVs (for 32 bit cpu) by using -mcpu=pentium, which I thought would be nice to explicitly say that this is a CPU without CMOV support. I also wanted to make sure that the test worked properly when a CPU with CMOV support was used, but when cmov was turned off explicitly. I can add a x86_64 test based on pseudo_cmov_lower1.ll, or perhaps just as another run line in that test. I don't understand why x86_64 will make the tests any more readable, can you elaborate on that?
8	Can you explain more what you are looking for here? I cannot really test for copies and phis in the resulting output assembly code, so maybe I just don't quite understand your comment.
218–223	Thanks for that info.

This addresses all of Ahmed's comments in the source code. Still working on addressing comments in the test cases.

This adds a new test, pseudo_cmov_lower2.ll which tests the PHI operand rewrite portion of the change-set. With this change, I think all of Ahmed's comments have been addressed.

Thanks for the update! A couple answers inline.

lib/Target/X86/X86ISelLowering.cpp
20103–20104	Merge both lines into RegRewriteTable[DestReg] = std::make_pair(Op1Reg, Op2Reg) ?
test/CodeGen/X86/pseudo_cmov_lower.ll
3–4	This specific test is meant to test the 32 bit CPU case, as that is where most of the CMOV pseudos occur. For this specific test, the few FP things in here need to use 32 bit cpu to test RFP register type CMOV pseudos. Makes sense pseudo_cmov_lower1.ll tests the SSE/SSE2 type pseudo CMOVs (for 32 bit cpu) by using -mcpu=pentium, which I thought would be nice to explicitly say that this is a CPU without CMOV support. I also wanted to make sure that the test worked properly when a CPU with CMOV support was used, but when cmov was turned off explicitly. That sounds like testing -mattr and CPU features, which should be done elsewhere, no? Here, if all you want is to assert that we're compiling for a target without CMOV, -mattr=-cmov does exactly that (but could be left out with just i386). IMO, explicit CPUs should only be used when the CPU actually matters (e.g. when testing scheduling models?). I can add a x86_64 test based on pseudo_cmov_lower1.ll, or perhaps just as another run line in that test. Great! I don't understand why x86_64 will make the tests any more readable, can you elaborate on that? Only because of the less noisy ABI, really. For tests like foo4/foo5, with very explicit checks.
9	Right now the tests check for a specific number of branches. I'm also interested in the flow of data and register assignments, leading to the final return value. Basically, the various MOVs in each block, and the final SUB. Consider, for instance, a bug where one of the PHIs has inverted operands (or even the feature where opposite CCs lead to inverted PHI operands): we should test for that. By the way, I was testing this out when I noticed that this: define i32 @foo(i32 %v0, i32 %v1, i32 %v2, i32 %v3, i32 %v4) nounwind { entry: %cmp = icmp slt i32 %v0, 0 ;; <- %v0 disables the opt, %v1 enables it %v3.v4 = select i1 %cmp, i32 %v3, i32 %v4 %v1.v2 = select i1 %cmp, i32 %v1, i32 %v2 %sub = sub i32 %v1.v2, %v3.v4 ret i32 %sub } seems to not trigger the optimization (I get JS twice). Shouldn't it?

Thanks for the additional comments. See some explanations inline. I'll see if I can improve a few of the tests to test for operand orders, but that isn't generally doable on all the tests due to the complexities of how downstream affects the code.

lib/Target/X86/X86ISelLowering.cpp
20103–20104	OK.
test/CodeGen/X86/pseudo_cmov_lower.ll
3–4	OK, I'll just use i386-linux-gnu rather than the -mcpu options.
9	Regarding the first part. It is very difficult to test that the PHIs operands are in the correct order (well, really that they come from the proper BB. The code seen effectively after the transform looks like: BB1: cmp jns BB3: BB2: // empty BB3: phi op1(from BB1), op2(from BB2) phi op11(from BB1), op12(from BB2) The actual assembly generated for the movs is fairly tricky, and very dependent on the whims of register allocation. I didn't put more specific tests for operands in, because that will tend to make the tests very brittle as downstream passes are changed. In fact depending on the moves that neede to be inserted, down stream can introduce an else block effectively by making BB2 jmp to BB3, and retargeting the JNS to the else block. So, order of the operands in the later instructions generated is very dependent on down stream passes, and not on this change itself. Now to the second part of the comment. Yes, it looks like this should have hit the optimization, but it doesn't for an interesting reason. select's of two memory operands ends up being lowered into a select of two LEAs that represent the address of the two memory operands, and then a single dereference of the selected pointer. This means you don't have to speculate a load, or increase the number of loads n the program. The actual IR for the example in question when you get to EmitLoweredSelect is this: (gdb) print BB->dump() BB#0: derived from LLVM BB %entry CMP32mi8 <fi#-1>, 1, %noreg, 0, %noreg, 0, %EFLAGS<imp-def>; mem:LD4[FixedStack-1](align=16) %vreg0<def> = LEA32r <fi#-4>, 1, %noreg, 0, %noreg; GR32:%vreg0 %vreg1<def> = LEA32r <fi#-5>, 1, %noreg, 0, %noreg; GR32:%vreg1 %vreg2<def> = CMOV_GR32 %vreg1<kill>, %vreg0<kill>, 15, %EFLAGS<imp-use> ;GR32:%vreg2,%vreg1,%vreg0 %vreg3<def> = LEA32r <fi#-2>, 1, %noreg, 0, %noreg; GR32:%vreg3 %vreg4<def> = LEA32r <fi#-3>, 1, %noreg, 0, %noreg; GR32:%vreg4 %vreg5<def> = CMOV_GR32 %vreg4<kill>, %vreg3<kill>, 15, %EFLAGS<imp-use>; GR32:%vreg5,%vreg4,%vreg3 %vreg6<def> = MOV32rm %vreg5<kill>, 1, %noreg, 0, %noreg; mem:LD4[<unknown>] GR32:%vreg6,%vreg5 %vreg7<def,tied1> = SUB32rm %vreg6<tied0>, %vreg2<kill>, 1, %noreg, 0, %noreg, %EFLAGS<imp-def,dead>; mem:LD4[<unknown>] GR32:%vreg7,%vreg6,%vreg2 %EAX<def> = COPY %vreg7; GR32:%vreg7 RETL %EAX $1 = void As you can see the pseudo CMOVs are not contiguous, and thus this new code doesn't apply.

ab added inline comments.Jul 31 2015, 12:53 PM

test/CodeGen/X86/pseudo_cmov_lower.ll
9	The actual assembly generated for the movs is fairly tricky, and very dependent on the whims of register allocation. I didn't put more specific tests for operands in, because that will tend to make the tests very brittle as downstream passes are changed. In fact depending on the moves that neede to be inserted, down stream can introduce an else block effectively by making BB2 jmp to BB3, and retargeting the JNS to the else block. So, order of the operands in the later instructions generated is very dependent on down stream passes, and not on this change itself. I realize that, and this is a shortcoming of our test infrastructure that the MI serialization effort intends to address. Still, right now, in my opinion, overly explicit tests are better than no tests. So, my (radical) advice is: use the utils/update_llc_test_checks.py script, and tweak it locally to remove the SP-offset scrubs (l46). You'll probably want to get rid of foo9, and simplify foo8 (just focus on two types rather than the combination of all?). Also, I think the reuse of %v2 might make the output harder to follow than they would be if you had v1.v2 and v3.v4, but that's just a gut feeling. When someone changes anything downstream that affects this, they can just run the script again, and have a nice diff of what changed. What do you think? If you can come up with a cleaner way, that'd be great! But I'm uncomfortable with no testing at all :/ Now to the second part of the comment. Yes, it looks like this should have hit the optimization, but it doesn't for an interesting reason. [...] Thanks for investigating the example!

Updated code to fix the one remaining suggestion. Added a test foo3 to pseudo_cmov_lower2.ll which does the checks that Ahmed requested for exact operand order. This should be pretty stable because of x86_64 register passing conventions really kind of force specific patterns down stream.

Updated pseudo_cmov_lower2.ll test routine foo3, and added routine foo4. foo3 now is a simpler pattern to check for, and foo4 tests correctness of transformation with the "opposite condition" code in EmitLoweredSelect gets used.

Ahmed, I think I have addressed all the comments and improvements that you asked for. Is this now good to go from your point of view?

I'd still prefer tests with only -NEXT from the label to the ret, but this LGTM as is, thanks!

-Ahmed

lib/Target/X86/X86ISelLowering.cpp
19858	rewrting -> rewriting
test/CodeGen/X86/pseudo_cmov_lower.ll
219–224	If this test isn't actually testing CMOV_VNI1, should it be removed?

This revision is now accepted and ready to land.Aug 5 2015, 10:40 AM

Fixed "rewrting" spelling. Added comment on CMOV_V*I1 test to indicate this test is to be
updated if a way is found to generate these pseudo CMOV opcodes.

ab added inline comments.Aug 5 2015, 11:02 AM

test/CodeGen/X86/pseudo_cmov_lower.ll
219–224	Does enabling, say, avx512f work?

Michael,

Could you commit these changes for me since I don't yet have commit permission to the llvm svn repository?

Thank you,
Kevin

Added inline comment.

test/CodeGen/X86/pseudo_cmov_lower.ll
219–224	I will look into that. If it does, then I will remove this specific test, and add such a necessary test back as a separate test file in a separate change set.

Michael, Can you commit these changes for me since I don't yet have commit permission in LLVM svn?

Closed by commit rL244202: [X86] Improve EmitLoweredSelect for contiguous CMOV pseudo instructions. (authored by mkuper). · Explain WhyAug 6 2015, 1:46 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path

Size

lib/

Target/

X86/

X86ISelLowering.cpp

192 lines

test/

CodeGen/

X86/

pseudo_cmov_lower.ll

267 lines

pseudo_cmov_lower1.ll

39 lines

pseudo_cmov_lower2.ll

100 lines

Diff 31374

lib/Target/X86/X86ISelLowering.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 19,776 Lines • ▼ Show 20 Lines	static bool checkAndUpdateEFLAGSKill(MachineBasicBlock::iterator SelectItr,
}		}

// We found a def, or hit the end of the basic block and EFLAGS wasn't live		// We found a def, or hit the end of the basic block and EFLAGS wasn't live
// out. SelectMI should have a kill flag on EFLAGS.		// out. SelectMI should have a kill flag on EFLAGS.
SelectItr->addRegisterKilled(X86::EFLAGS, TRI);		SelectItr->addRegisterKilled(X86::EFLAGS, TRI);
return true;		return true;
}		}

		// Return true if it is OK for this CMOV pseudo-opcode to be cascaded
		abUnsubmitted Done Reply Inline Actions of -> if ab: of -> if
		kbsmith1AuthorUnsubmitted Done Reply Inline Actions OK. kbsmith1: OK.
		// together with other CMOV pseudo-opcodes into a single basic-block with
		// conditional jump around it.
		static bool isCMOVPseudo(MachineInstr *MI) {
		abUnsubmitted Done Reply Inline Actions What about something like the IMO more descriptive "isCMOVPseudo" ? ab: What about something like the IMO more descriptive "isCMOVPseudo" ?
		kbsmith1AuthorUnsubmitted Done Reply Inline Actions OK. kbsmith1: OK.
		switch (MI->getOpcode()) {
		case X86::CMOV_FR32:
		case X86::CMOV_FR64:
		case X86::CMOV_GR8:
		case X86::CMOV_GR16:
		case X86::CMOV_GR32:
		case X86::CMOV_RFP32:
		case X86::CMOV_RFP64:
		case X86::CMOV_RFP80:
		case X86::CMOV_V2F64:
		case X86::CMOV_V2I64:
		case X86::CMOV_V4F32:
		case X86::CMOV_V4F64:
		case X86::CMOV_V4I64:
		case X86::CMOV_V16F32:
		case X86::CMOV_V8F32:
		case X86::CMOV_V8F64:
		case X86::CMOV_V8I64:
		case X86::CMOV_V8I1:
		case X86::CMOV_V16I1:
		case X86::CMOV_V32I1:
		case X86::CMOV_V64I1:
		abUnsubmitted Done Reply Inline Actions Can you sort these? I see the pseudo lowering switch is equally unordered, can you fix that too? ab: Can you sort these? I see the pseudo lowering switch is equally unordered, can you fix that…
		kbsmith1AuthorUnsubmitted Not Done Reply Inline Actions OK. kbsmith1: OK.
		return true;

		default:
		return false;
		abUnsubmitted Done Reply Inline Actions return false here? ab: return false here?
		kbsmith1AuthorUnsubmitted Not Done Reply Inline Actions OK. kbsmith1: OK.
		}
		}

MachineBasicBlock *		MachineBasicBlock *
X86TargetLowering::EmitLoweredSelect(MachineInstr *MI,		X86TargetLowering::EmitLoweredSelect(MachineInstr *MI,
MachineBasicBlock *BB) const {		MachineBasicBlock *BB) const {
const TargetInstrInfo *TII = Subtarget->getInstrInfo();		const TargetInstrInfo *TII = Subtarget->getInstrInfo();
DebugLoc DL = MI->getDebugLoc();		DebugLoc DL = MI->getDebugLoc();

// To "insert" a SELECT_CC instruction, we actually have to insert the		// To "insert" a SELECT_CC instruction, we actually have to insert the
// diamond control-flow pattern. The incoming instruction knows the		// diamond control-flow pattern. The incoming instruction knows the
// destination vreg to set, the condition code register to branch on, the		// destination vreg to set, the condition code register to branch on, the
// true/false values to select between, and a branch opcode to use.		// true/false values to select between, and a branch opcode to use.
const BasicBlock *LLVM_BB = BB->getBasicBlock();		const BasicBlock *LLVM_BB = BB->getBasicBlock();
MachineFunction::iterator It = BB;		MachineFunction::iterator It = BB;
++It;		++It;

// thisMBB:		// thisMBB:
// ...		// ...
// TrueVal = ...		// TrueVal = ...
// cmpTY ccX, r1, r2		// cmpTY ccX, r1, r2
// bCC copy1MBB		// bCC copy1MBB
// fallthrough --> copy0MBB		// fallthrough --> copy0MBB
MachineBasicBlock *thisMBB = BB;		MachineBasicBlock *thisMBB = BB;
MachineFunction *F = BB->getParent();		MachineFunction *F = BB->getParent();

// We also lower double CMOVs:		// This code lowers all pseudo-CMOV instructions. Generally it lowers these
		// as described above, by inserting a BB, and then making a PHI at the join
		// point to select the true and false operands of the CMOV in the PHI.
		//
		// The code also handles two different cases of multiple CMOV opcodes
		// in a row.
		//
		// Case 1:
		// In this case, there are multiple CMOVs in a row, all which are based on
		// the same condition setting (or the exact opposite condition setting).
		// In this case we can lower all the CMOVs using a single inserted BB, and
		// then make a number of PHIs at the join point to model the CMOVs. The only
		// trickiness here, is that in a case like:
		//
		// t2 = CMOV cond1 t1, f1
		// t3 = CMOV cond1 t2, f2
		//
		// when rewriting this into PHIs, we have to perform some renaming on the
		abUnsubmitted Done Reply Inline Actions rewrting -> rewriting ab: rewrting -> rewriting
		// temps since you cannot have a PHI operand refer to a PHI result earlier
		// in the same block. The "simple" but wrong lowering would be:
		//
		// t2 = PHI t1(BB1), f1(BB2)
		// t3 = PHI t2(BB1), f2(BB2)
		//
		// but clearly t2 is not defined in BB1, so that is incorrect. The proper
		// renaming is to note that on the path through BB1, t2 is really just a
		// copy of t1, and do that renaming, properly generating:
		//
		// t2 = PHI t1(BB1), f1(BB2)
		// t3 = PHI t1(BB1), f2(BB2)
		//
		// Case 2, we lower cascaded CMOVs such as
		//
// (CMOV (CMOV F, T, cc1), T, cc2)		// (CMOV (CMOV F, T, cc1), T, cc2)
		//
// to two successives branches. For that, we look for another CMOV as the		// to two successives branches. For that, we look for another CMOV as the
// following instruction.		// following instruction.
//		//
// Without this, we would add a PHI between the two jumps, which ends up		// Without this, we would add a PHI between the two jumps, which ends up
// creating a few copies all around. For instance, for		// creating a few copies all around. For instance, for
//		//
// (sitofp (zext (fcmp une)))		// (sitofp (zext (fcmp une)))
//		//
▲ Show 20 Lines • Show All 49 Lines • ▼ Show 20 Lines	X86TargetLowering::EmitLoweredSelect(MachineInstr *MI,
// ucomiss %xmm1, %xmm0		// ucomiss %xmm1, %xmm0
// movss <1.0f>, %xmm0		// movss <1.0f>, %xmm0
// jne .LBB5_4		// jne .LBB5_4
// jp .LBB5_4		// jp .LBB5_4
// xorps %xmm0, %xmm0		// xorps %xmm0, %xmm0
// .LBB5_4:		// .LBB5_4:
// retq		// retq
//		//
MachineInstr *NextCMOV = nullptr;		MachineInstr *CascadedCMOV = nullptr;
		MachineInstr *LastCMOV = MI;
		X86::CondCode CC = X86::CondCode(MI->getOperand(3).getImm());
		X86::CondCode OppCC = X86::GetOppositeBranchCondition(CC);
MachineBasicBlock::iterator NextMIIt =		MachineBasicBlock::iterator NextMIIt =
std::next(MachineBasicBlock::iterator(MI));		std::next(MachineBasicBlock::iterator(MI));
if (NextMIIt != BB->end() && NextMIIt->getOpcode() == MI->getOpcode() &&
		// Check for case 1, where there are multiple CMOVs with the same condition
		// first. Of the two cases of multiple CMOV lowerings, case 1 reduces the
		// number of jumps the most.

		if (isCMOVPseudo(MI)) {
		// See if we have a string of CMOVS with the same condition.
		while (NextMIIt != BB->end() &&
		isCMOVPseudo(NextMIIt) &&
		(NextMIIt->getOperand(3).getImm() == CC \|\|
		NextMIIt->getOperand(3).getImm() == OppCC)) {
		LastCMOV = &*NextMIIt;
		++NextMIIt;
		}
		}

		// This checks for case 2, but only do this if we didn't already find
		// case 1, as indicated by LastCMOV == MI.
		if (LastCMOV == MI &&
		NextMIIt != BB->end() && NextMIIt->getOpcode() == MI->getOpcode() &&
NextMIIt->getOperand(2).getReg() == MI->getOperand(2).getReg() &&		NextMIIt->getOperand(2).getReg() == MI->getOperand(2).getReg() &&
NextMIIt->getOperand(1).getReg() == MI->getOperand(0).getReg())		NextMIIt->getOperand(1).getReg() == MI->getOperand(0).getReg()) {
NextCMOV = &*NextMIIt;		CascadedCMOV = &*NextMIIt;
		}

MachineBasicBlock *jcc1MBB = nullptr;		MachineBasicBlock *jcc1MBB = nullptr;

// If we have a double CMOV, we lower it to two successive branches to		// If we have a cascaded CMOV, we lower it to two successive branches to
// the same block. EFLAGS is used by both, so mark it as live in the second.		// the same block. EFLAGS is used by both, so mark it as live in the second.
if (NextCMOV) {		if (CascadedCMOV) {
jcc1MBB = F->CreateMachineBasicBlock(LLVM_BB);		jcc1MBB = F->CreateMachineBasicBlock(LLVM_BB);
F->insert(It, jcc1MBB);		F->insert(It, jcc1MBB);
jcc1MBB->addLiveIn(X86::EFLAGS);		jcc1MBB->addLiveIn(X86::EFLAGS);
}		}

MachineBasicBlock *copy0MBB = F->CreateMachineBasicBlock(LLVM_BB);		MachineBasicBlock *copy0MBB = F->CreateMachineBasicBlock(LLVM_BB);
MachineBasicBlock *sinkMBB = F->CreateMachineBasicBlock(LLVM_BB);		MachineBasicBlock *sinkMBB = F->CreateMachineBasicBlock(LLVM_BB);
F->insert(It, copy0MBB);		F->insert(It, copy0MBB);
F->insert(It, sinkMBB);		F->insert(It, sinkMBB);

// If the EFLAGS register isn't dead in the terminator, then claim that it's		// If the EFLAGS register isn't dead in the terminator, then claim that it's
// live into the sink and copy blocks.		// live into the sink and copy blocks.
const TargetRegisterInfo *TRI = Subtarget->getRegisterInfo();		const TargetRegisterInfo *TRI = Subtarget->getRegisterInfo();

MachineInstr *LastEFLAGSUser = NextCMOV ? NextCMOV : MI;		MachineInstr *LastEFLAGSUser = CascadedCMOV ? CascadedCMOV : LastCMOV;
if (!LastEFLAGSUser->killsRegister(X86::EFLAGS) &&		if (!LastEFLAGSUser->killsRegister(X86::EFLAGS) &&
!checkAndUpdateEFLAGSKill(LastEFLAGSUser, BB, TRI)) {		!checkAndUpdateEFLAGSKill(LastEFLAGSUser, BB, TRI)) {
copy0MBB->addLiveIn(X86::EFLAGS);		copy0MBB->addLiveIn(X86::EFLAGS);
sinkMBB->addLiveIn(X86::EFLAGS);		sinkMBB->addLiveIn(X86::EFLAGS);
}		}

// Transfer the remainder of BB and its successor edges to sinkMBB.		// Transfer the remainder of BB and its successor edges to sinkMBB.
sinkMBB->splice(sinkMBB->begin(), BB,		sinkMBB->splice(sinkMBB->begin(), BB,
std::next(MachineBasicBlock::iterator(MI)), BB->end());		std::next(MachineBasicBlock::iterator(LastCMOV)), BB->end());
sinkMBB->transferSuccessorsAndUpdatePHIs(BB);		sinkMBB->transferSuccessorsAndUpdatePHIs(BB);

// Add the true and fallthrough blocks as its successors.		// Add the true and fallthrough blocks as its successors.
if (NextCMOV) {		if (CascadedCMOV) {
// The fallthrough block may be jcc1MBB, if we have a double CMOV.		// The fallthrough block may be jcc1MBB, if we have a cascaded CMOV.
BB->addSuccessor(jcc1MBB);		BB->addSuccessor(jcc1MBB);

// In that case, jcc1MBB will itself fallthrough the copy0MBB, and		// In that case, jcc1MBB will itself fallthrough the copy0MBB, and
// jump to the sinkMBB.		// jump to the sinkMBB.
jcc1MBB->addSuccessor(copy0MBB);		jcc1MBB->addSuccessor(copy0MBB);
jcc1MBB->addSuccessor(sinkMBB);		jcc1MBB->addSuccessor(sinkMBB);
} else {		} else {
BB->addSuccessor(copy0MBB);		BB->addSuccessor(copy0MBB);
}		}

// The true block target of the first (or only) branch is always sinkMBB.		// The true block target of the first (or only) branch is always sinkMBB.
BB->addSuccessor(sinkMBB);		BB->addSuccessor(sinkMBB);

// Create the conditional branch instruction.		// Create the conditional branch instruction.
unsigned Opc =		unsigned Opc = X86::GetCondBranchFromCond(CC);
X86::GetCondBranchFromCond((X86::CondCode)MI->getOperand(3).getImm());
BuildMI(BB, DL, TII->get(Opc)).addMBB(sinkMBB);		BuildMI(BB, DL, TII->get(Opc)).addMBB(sinkMBB);

if (NextCMOV) {		if (CascadedCMOV) {
unsigned Opc2 = X86::GetCondBranchFromCond(		unsigned Opc2 = X86::GetCondBranchFromCond(
(X86::CondCode)NextCMOV->getOperand(3).getImm());		(X86::CondCode)CascadedCMOV->getOperand(3).getImm());
BuildMI(jcc1MBB, DL, TII->get(Opc2)).addMBB(sinkMBB);		BuildMI(jcc1MBB, DL, TII->get(Opc2)).addMBB(sinkMBB);
}		}

// copy0MBB:		// copy0MBB:
// %FalseValue = ...		// %FalseValue = ...
// # fallthrough to sinkMBB		// # fallthrough to sinkMBB
copy0MBB->addSuccessor(sinkMBB);		copy0MBB->addSuccessor(sinkMBB);

// sinkMBB:		// sinkMBB:
// %Result = phi [ %FalseValue, copy0MBB ], [ %TrueValue, thisMBB ]		// %Result = phi [ %FalseValue, copy0MBB ], [ %TrueValue, thisMBB ]
// ...		// ...
MachineInstrBuilder MIB =		MachineBasicBlock::iterator MIItBegin = MachineBasicBlock::iterator(MI);
BuildMI(*sinkMBB, sinkMBB->begin(), DL, TII->get(X86::PHI),		MachineBasicBlock::iterator MIItEnd =
MI->getOperand(0).getReg())		std::next(MachineBasicBlock::iterator(LastCMOV));
.addReg(MI->getOperand(1).getReg()).addMBB(copy0MBB)		MachineBasicBlock::iterator SinkInsertionPoint = sinkMBB->begin();
		abUnsubmitted Done Reply Inline Actions Since this isn't shared between the loops, can you define it in the for(;;) instead? ab: Since this isn't shared between the loops, can you define it in the for(;;) instead?
		kbsmith1AuthorUnsubmitted Not Done Reply Inline Actions OK. kbsmith1: OK.
.addReg(MI->getOperand(2).getReg()).addMBB(thisMBB);		DenseMap<unsigned, std::pair<unsigned, unsigned>> RegRewriteTable;
		MachineInstrBuilder MIB;

		// As we are creating the PHIs, we have to be careful if there is more than
		// one. Later CMOVs may reference the results of earlier CMOVs, but later
		abUnsubmitted Done Reply Inline Actions IMO the field names don't add much information, as they're always accompanied with the same-name variable. What about a pair instead of this struct? ab: IMO the field names don't add much information, as they're always accompanied with the same…
		kbsmith1AuthorUnsubmitted Not Done Reply Inline Actions Let me try a pair and see whether the code looks cleaner. kbsmith1: Let me try a pair and see whether the code looks cleaner.
		// PHIs have to reference the individual true/false inputs from earlier PHIs.
		// That also means that PHI construction must work forward from earlier to
		// later, and that the code must maintain a mapping from earlier PHI's
		// destination registers, and the registers that went into the PHI.

		for (MachineBasicBlock::iterator MIIt = MIItBegin; MIIt != MIItEnd; ++MIIt) {
		unsigned DestReg = MIIt->getOperand(0).getReg();
		unsigned Op1Reg = MIIt->getOperand(1).getReg();
		unsigned Op2Reg = MIIt->getOperand(2).getReg();

		// If this CMOV we are generating is the opposite condition from
		// the jump we generated, then we have to swap the operands for the
		// PHI that is going to be generated.
		if (MIIt->getOperand(3).getImm() == OppCC)
		std::swap(Op1Reg, Op2Reg);

// If we have a double CMOV, the second Jcc provides the same incoming		if (RegRewriteTable.find(Op1Reg) != RegRewriteTable.end())
		Op1Reg = RegRewriteTable[Op1Reg].first;

		if (RegRewriteTable.find(Op2Reg) != RegRewriteTable.end())
		Op2Reg = RegRewriteTable[Op2Reg].second;

		MIB = BuildMI(*sinkMBB, SinkInsertionPoint, DL,
		TII->get(X86::PHI), DestReg)
		.addReg(Op1Reg).addMBB(copy0MBB)
		.addReg(Op2Reg).addMBB(thisMBB);

		// Add this PHI to the rewrite table.
		RegRewriteTable[DestReg] = std::make_pair(Op1Reg, Op2Reg);
		}

		// If we have a cascaded CMOV, the second Jcc provides the same incoming
// value as the first Jcc (the True operand of the SELECT_CC/CMOV nodes).		// value as the first Jcc (the True operand of the SELECT_CC/CMOV nodes).
if (NextCMOV) {		if (CascadedCMOV) {
MIB.addReg(MI->getOperand(2).getReg()).addMBB(jcc1MBB);		MIB.addReg(MI->getOperand(2).getReg()).addMBB(jcc1MBB);
// Copy the PHI result to the register defined by the second CMOV.		// Copy the PHI result to the register defined by the second CMOV.
BuildMI(*sinkMBB, std::next(MachineBasicBlock::iterator(MIB.getInstr())),		BuildMI(*sinkMBB, std::next(MachineBasicBlock::iterator(MIB.getInstr())),
DL, TII->get(TargetOpcode::COPY), NextCMOV->getOperand(0).getReg())		DL, TII->get(TargetOpcode::COPY),
		CascadedCMOV->getOperand(0).getReg())
.addReg(MI->getOperand(0).getReg());		.addReg(MI->getOperand(0).getReg());
NextCMOV->eraseFromParent();		CascadedCMOV->eraseFromParent();
}		}

MI->eraseFromParent(); // The pseudo instruction is gone now.		// Now remove the CMOV(s).
		for (MachineBasicBlock::iterator MIIt = MIItBegin; MIIt != MIItEnd; )
		(MIIt++)->eraseFromParent();

		abUnsubmitted Done Reply Inline Actions That "all" bothers me ;) -> "Now remove the CMOV(s)." ? ab: That "all" bothers me ;) -> "Now remove the CMOV(s)." ?
		kbsmith1AuthorUnsubmitted Not Done Reply Inline Actions OK. kbsmith1: OK.
return sinkMBB;		return sinkMBB;
}		}

MachineBasicBlock *		MachineBasicBlock *
X86TargetLowering::EmitLoweredSegAlloca(MachineInstr *MI,		X86TargetLowering::EmitLoweredSegAlloca(MachineInstr *MI,
MachineBasicBlock *BB) const {		MachineBasicBlock *BB) const {
MachineFunction *MF = BB->getParent();		MachineFunction *MF = BB->getParent();
const TargetInstrInfo *TII = Subtarget->getInstrInfo();		const TargetInstrInfo *TII = Subtarget->getInstrInfo();
DebugLoc DL = MI->getDebugLoc();		DebugLoc DL = MI->getDebugLoc();
const BasicBlock *LLVM_BB = BB->getBasicBlock();		const BasicBlock *LLVM_BB = BB->getBasicBlock();

assert(MF->shouldSplitStack());		assert(MF->shouldSplitStack());
		abUnsubmitted Not Done Reply Inline Actions Merge both lines into RegRewriteTable[DestReg] = std::make_pair(Op1Reg, Op2Reg) ? ab: Merge both lines into RegRewriteTable[DestReg] = std::make_pair(Op1Reg, Op2Reg) ?
		kbsmith1AuthorUnsubmitted Not Done Reply Inline Actions OK. kbsmith1: OK.

const bool Is64Bit = Subtarget->is64Bit();		const bool Is64Bit = Subtarget->is64Bit();
const bool IsLP64 = Subtarget->isTarget64BitLP64();		const bool IsLP64 = Subtarget->isTarget64BitLP64();

const unsigned TlsReg = Is64Bit ? X86::FS : X86::GS;		const unsigned TlsReg = Is64Bit ? X86::FS : X86::GS;
const unsigned TlsOffset = IsLP64 ? 0x70 : Is64Bit ? 0x40 : 0x30;		const unsigned TlsOffset = IsLP64 ? 0x70 : Is64Bit ? 0x40 : 0x30;

// BB:		// BB:
▲ Show 20 Lines • Show All 507 Lines • ▼ Show 20 Lines	X86TargetLowering::EmitInstrWithCustomInserter(MachineInstr *MI,
case X86::WIN_ALLOCA:		case X86::WIN_ALLOCA:
return EmitLoweredWinAlloca(MI, BB);		return EmitLoweredWinAlloca(MI, BB);
case X86::SEG_ALLOCA_32:		case X86::SEG_ALLOCA_32:
case X86::SEG_ALLOCA_64:		case X86::SEG_ALLOCA_64:
return EmitLoweredSegAlloca(MI, BB);		return EmitLoweredSegAlloca(MI, BB);
case X86::TLSCall_32:		case X86::TLSCall_32:
case X86::TLSCall_64:		case X86::TLSCall_64:
return EmitLoweredTLSCall(MI, BB);		return EmitLoweredTLSCall(MI, BB);
case X86::CMOV_GR8:
case X86::CMOV_FR32:		case X86::CMOV_FR32:
case X86::CMOV_FR64:		case X86::CMOV_FR64:
case X86::CMOV_V4F32:		case X86::CMOV_GR8:
		case X86::CMOV_GR16:
		case X86::CMOV_GR32:
		case X86::CMOV_RFP32:
		case X86::CMOV_RFP64:
		case X86::CMOV_RFP80:
case X86::CMOV_V2F64:		case X86::CMOV_V2F64:
case X86::CMOV_V2I64:		case X86::CMOV_V2I64:
case X86::CMOV_V8F32:		case X86::CMOV_V4F32:
case X86::CMOV_V4F64:		case X86::CMOV_V4F64:
case X86::CMOV_V4I64:		case X86::CMOV_V4I64:
case X86::CMOV_V16F32:		case X86::CMOV_V16F32:
		case X86::CMOV_V8F32:
case X86::CMOV_V8F64:		case X86::CMOV_V8F64:
case X86::CMOV_V8I64:		case X86::CMOV_V8I64:
case X86::CMOV_GR16:
case X86::CMOV_GR32:
case X86::CMOV_RFP32:
case X86::CMOV_RFP64:
case X86::CMOV_RFP80:
case X86::CMOV_V8I1:		case X86::CMOV_V8I1:
case X86::CMOV_V16I1:		case X86::CMOV_V16I1:
case X86::CMOV_V32I1:		case X86::CMOV_V32I1:
case X86::CMOV_V64I1:		case X86::CMOV_V64I1:
return EmitLoweredSelect(MI, BB);		return EmitLoweredSelect(MI, BB);

case X86::FP32_TO_INT16_IN_MEM:		case X86::FP32_TO_INT16_IN_MEM:
case X86::FP32_TO_INT32_IN_MEM:		case X86::FP32_TO_INT32_IN_MEM:
▲ Show 20 Lines • Show All 5,705 Lines • Show Last 20 Lines

test/CodeGen/X86/pseudo_cmov_lower.ll

				; RUN: llc < %s -mtriple=i386-linux-gnu -o - \| FileCheck %s

				; This test checks that only a single js gets generated in the final code
				abUnsubmitted Not Done Reply Inline Actions Why the explicit cpu? Also, IIRC we always use the CMOV pseudos for SSE selects, so can you do x86_64 tests for those as well? I don't expect interesting differences (except much more readable tests), so you can probably only use i386 for non-SSE types. ab: Why the explicit cpu? Also, IIRC we always use the CMOV pseudos for SSE selects, so can you do…
				kbsmith1AuthorUnsubmitted Not Done Reply Inline Actions This specific test is meant to test the 32 bit CPU case, as that is where most of the CMOV pseudos occur. For this specific test, the few FP things in here need to use 32 bit cpu to test RFP register type CMOV pseudos. pseudo_cmov_lower1.ll tests the SSE/SSE2 type pseudo CMOVs (for 32 bit cpu) by using -mcpu=pentium, which I thought would be nice to explicitly say that this is a CPU without CMOV support. I also wanted to make sure that the test worked properly when a CPU with CMOV support was used, but when cmov was turned off explicitly. I can add a x86_64 test based on pseudo_cmov_lower1.ll, or perhaps just as another run line in that test. I don't understand why x86_64 will make the tests any more readable, can you elaborate on that? kbsmith1: This specific test is meant to test the 32 bit CPU case, as that is where most of the CMOV…
				; for lowering the CMOV pseudos that get created for this IR.
				abUnsubmitted Done Reply Inline Actions This specific test is meant to test the 32 bit CPU case, as that is where most of the CMOV pseudos occur. For this specific test, the few FP things in here need to use 32 bit cpu to test RFP register type CMOV pseudos. Makes sense pseudo_cmov_lower1.ll tests the SSE/SSE2 type pseudo CMOVs (for 32 bit cpu) by using -mcpu=pentium, which I thought would be nice to explicitly say that this is a CPU without CMOV support. I also wanted to make sure that the test worked properly when a CPU with CMOV support was used, but when cmov was turned off explicitly. That sounds like testing -mattr and CPU features, which should be done elsewhere, no? Here, if all you want is to assert that we're compiling for a target without CMOV, -mattr=-cmov does exactly that (but could be left out with just i386). IMO, explicit CPUs should only be used when the CPU actually matters (e.g. when testing scheduling models?). I can add a x86_64 test based on pseudo_cmov_lower1.ll, or perhaps just as another run line in that test. Great! I don't understand why x86_64 will make the tests any more readable, can you elaborate on that? Only because of the less noisy ABI, really. For tests like foo4/foo5, with very explicit checks. ab: > This specific test is meant to test the 32 bit CPU case, as that is where most of the CMOV…
				kbsmith1AuthorUnsubmitted Not Done Reply Inline Actions OK, I'll just use i386-linux-gnu rather than the -mcpu options. kbsmith1: OK, I'll just use i386-linux-gnu rather than the -mcpu options.
				; CHECK-LABEL: foo1:
				; CHECK: js
				; CHECK-NOT: js
				define i32 @foo1(i32 %v1, i32 %v2, i32 %v3) nounwind {
				abUnsubmitted Done Reply Inline Actions I think these could use more explicit tests, for the copies and PHIs. ab: I think these could use more explicit tests, for the copies and PHIs.
				kbsmith1AuthorUnsubmitted Not Done Reply Inline Actions Can you explain more what you are looking for here? I cannot really test for copies and phis in the resulting output assembly code, so maybe I just don't quite understand your comment. kbsmith1: Can you explain more what you are looking for here? I cannot really test for copies and phis in…
				entry:
				abUnsubmitted Not Done Reply Inline Actions Right now the tests check for a specific number of branches. I'm also interested in the flow of data and register assignments, leading to the final return value. Basically, the various MOVs in each block, and the final SUB. Consider, for instance, a bug where one of the PHIs has inverted operands (or even the feature where opposite CCs lead to inverted PHI operands): we should test for that. By the way, I was testing this out when I noticed that this: define i32 @foo(i32 %v0, i32 %v1, i32 %v2, i32 %v3, i32 %v4) nounwind { entry: %cmp = icmp slt i32 %v0, 0 ;; <- %v0 disables the opt, %v1 enables it %v3.v4 = select i1 %cmp, i32 %v3, i32 %v4 %v1.v2 = select i1 %cmp, i32 %v1, i32 %v2 %sub = sub i32 %v1.v2, %v3.v4 ret i32 %sub } seems to not trigger the optimization (I get JS twice). Shouldn't it? ab: Right now the tests check for a specific number of branches. I'm also interested in the flow…
				kbsmith1AuthorUnsubmitted Not Done Reply Inline Actions Regarding the first part. It is very difficult to test that the PHIs operands are in the correct order (well, really that they come from the proper BB. The code seen effectively after the transform looks like: BB1: cmp jns BB3: BB2: // empty BB3: phi op1(from BB1), op2(from BB2) phi op11(from BB1), op12(from BB2) The actual assembly generated for the movs is fairly tricky, and very dependent on the whims of register allocation. I didn't put more specific tests for operands in, because that will tend to make the tests very brittle as downstream passes are changed. In fact depending on the moves that neede to be inserted, down stream can introduce an else block effectively by making BB2 jmp to BB3, and retargeting the JNS to the else block. So, order of the operands in the later instructions generated is very dependent on down stream passes, and not on this change itself. Now to the second part of the comment. Yes, it looks like this should have hit the optimization, but it doesn't for an interesting reason. select's of two memory operands ends up being lowered into a select of two LEAs that represent the address of the two memory operands, and then a single dereference of the selected pointer. This means you don't have to speculate a load, or increase the number of loads n the program. The actual IR for the example in question when you get to EmitLoweredSelect is this: (gdb) print BB->dump() BB#0: derived from LLVM BB %entry CMP32mi8 <fi#-1>, 1, %noreg, 0, %noreg, 0, %EFLAGS<imp-def>; mem:LD4[FixedStack-1](align=16) %vreg0<def> = LEA32r <fi#-4>, 1, %noreg, 0, %noreg; GR32:%vreg0 %vreg1<def> = LEA32r <fi#-5>, 1, %noreg, 0, %noreg; GR32:%vreg1 %vreg2<def> = CMOV_GR32 %vreg1<kill>, %vreg0<kill>, 15, %EFLAGS<imp-use> ;GR32:%vreg2,%vreg1,%vreg0 %vreg3<def> = LEA32r <fi#-2>, 1, %noreg, 0, %noreg; GR32:%vreg3 %vreg4<def> = LEA32r <fi#-3>, 1, %noreg, 0, %noreg; GR32:%vreg4 %vreg5<def> = CMOV_GR32 %vreg4<kill>, %vreg3<kill>, 15, %EFLAGS<imp-use>; GR32:%vreg5,%vreg4,%vreg3 %vreg6<def> = MOV32rm %vreg5<kill>, 1, %noreg, 0, %noreg; mem:LD4[<unknown>] GR32:%vreg6,%vreg5 %vreg7<def,tied1> = SUB32rm %vreg6<tied0>, %vreg2<kill>, 1, %noreg, 0, %noreg, %EFLAGS<imp-def,dead>; mem:LD4[<unknown>] GR32:%vreg7,%vreg6,%vreg2 %EAX<def> = COPY %vreg7; GR32:%vreg7 RETL %EAX $1 = void As you can see the pseudo CMOVs are not contiguous, and thus this new code doesn't apply. kbsmith1: Regarding the first part. It is very difficult to test that the PHIs operands are in the…
				abUnsubmitted Done Reply Inline Actions The actual assembly generated for the movs is fairly tricky, and very dependent on the whims of register allocation. I didn't put more specific tests for operands in, because that will tend to make the tests very brittle as downstream passes are changed. In fact depending on the moves that neede to be inserted, down stream can introduce an else block effectively by making BB2 jmp to BB3, and retargeting the JNS to the else block. So, order of the operands in the later instructions generated is very dependent on down stream passes, and not on this change itself. I realize that, and this is a shortcoming of our test infrastructure that the MI serialization effort intends to address. Still, right now, in my opinion, overly explicit tests are better than no tests. So, my (radical) advice is: use the utils/update_llc_test_checks.py script, and tweak it locally to remove the SP-offset scrubs (l46). You'll probably want to get rid of foo9, and simplify foo8 (just focus on two types rather than the combination of all?). Also, I think the reuse of %v2 might make the output harder to follow than they would be if you had v1.v2 and v3.v4, but that's just a gut feeling. When someone changes anything downstream that affects this, they can just run the script again, and have a nice diff of what changed. What do you think? If you can come up with a cleaner way, that'd be great! But I'm uncomfortable with no testing at all :/ Now to the second part of the comment. Yes, it looks like this should have hit the optimization, but it doesn't for an interesting reason. [...] Thanks for investigating the example! ab: > The actual assembly generated for the movs is fairly tricky, and very dependent on the whims…
				%cmp = icmp slt i32 %v1, 0
				%v2.v3 = select i1 %cmp, i32 %v2, i32 %v3
				%v1.v2 = select i1 %cmp, i32 %v1, i32 %v2
				%sub = sub i32 %v1.v2, %v2.v3
				ret i32 %sub
				}

				; This test checks that only a single js gets generated in the final code
				; for lowering the CMOV pseudos that get created for this IR. This makes
				; sure the code for the lowering for opposite conditions gets tested.
				; CHECK-LABEL: foo11:
				; CHECK: js
				; CHECK-NOT: js
				; CHECK-NOT: jns
				define i32 @foo11(i32 %v1, i32 %v2, i32 %v3) nounwind {
				entry:
				%cmp1 = icmp slt i32 %v1, 0
				%v2.v3 = select i1 %cmp1, i32 %v2, i32 %v3
				%cmp2 = icmp sge i32 %v1, 0
				%v1.v2 = select i1 %cmp2, i32 %v1, i32 %v2
				%sub = sub i32 %v1.v2, %v2.v3
				ret i32 %sub
				}

				; This test checks that only a single js gets generated in the final code
				; for lowering the CMOV pseudos that get created for this IR.
				; CHECK-LABEL: foo2:
				; CHECK: js
				; CHECK-NOT: js
				define i32 @foo2(i8 %v1, i8 %v2, i8 %v3) nounwind {
				entry:
				%cmp = icmp slt i8 %v1, 0
				%v2.v3 = select i1 %cmp, i8 %v2, i8 %v3
				%v1.v2 = select i1 %cmp, i8 %v1, i8 %v2
				%t1 = sext i8 %v2.v3 to i32
				%t2 = sext i8 %v1.v2 to i32
				%sub = sub i32 %t1, %t2
				ret i32 %sub
				}

				; This test checks that only a single js gets generated in the final code
				; for lowering the CMOV pseudos that get created for this IR.
				; CHECK-LABEL: foo3:
				; CHECK: js
				; CHECK-NOT: js
				define i32 @foo3(i16 %v1, i16 %v2, i16 %v3) nounwind {
				entry:
				%cmp = icmp slt i16 %v1, 0
				%v2.v3 = select i1 %cmp, i16 %v2, i16 %v3
				%v1.v2 = select i1 %cmp, i16 %v1, i16 %v2
				%t1 = sext i16 %v2.v3 to i32
				%t2 = sext i16 %v1.v2 to i32
				%sub = sub i32 %t1, %t2
				ret i32 %sub
				}

				; This test checks that only a single js gets generated in the final code
				; for lowering the CMOV pseudos that get created for this IR.
				; CHECK-LABEL: foo4:
				; CHECK: js
				; CHECK-NOT: js
				define float @foo4(i32 %v1, float %v2, float %v3, float %v4) nounwind {
				entry:
				%cmp = icmp slt i32 %v1, 0
				%t1 = select i1 %cmp, float %v2, float %v3
				%t2 = select i1 %cmp, float %v3, float %v4
				%sub = fsub float %t1, %t2
				ret float %sub
				}

				; This test checks that only a single je gets generated in the final code
				; for lowering the CMOV pseudos that get created for this IR.
				; CHECK-LABEL: foo5:
				; CHECK: je
				; CHECK-NOT: je
				define double @foo5(i32 %v1, double %v2, double %v3, double %v4) nounwind {
				entry:
				%cmp = icmp eq i32 %v1, 0
				%t1 = select i1 %cmp, double %v2, double %v3
				%t2 = select i1 %cmp, double %v3, double %v4
				%sub = fsub double %t1, %t2
				ret double %sub
				}

				; This test checks that only a single je gets generated in the final code
				; for lowering the CMOV pseudos that get created for this IR.
				; CHECK-LABEL: foo6:
				; CHECK: je
				; CHECK-NOT: je
				define <4 x float> @foo6(i32 %v1, <4 x float> %v2, <4 x float> %v3, <4 x float> %v4) nounwind {
				entry:
				%cmp = icmp eq i32 %v1, 0
				%t1 = select i1 %cmp, <4 x float> %v2, <4 x float> %v3
				%t2 = select i1 %cmp, <4 x float> %v3, <4 x float> %v4
				%sub = fsub <4 x float> %t1, %t2
				ret <4 x float> %sub
				}

				; This test checks that only a single je gets generated in the final code
				; for lowering the CMOV pseudos that get created for this IR.
				; CHECK-LABEL: foo7:
				; CHECK: je
				; CHECK-NOT: je
				define <2 x double> @foo7(i32 %v1, <2 x double> %v2, <2 x double> %v3, <2 x double> %v4) nounwind {
				entry:
				%cmp = icmp eq i32 %v1, 0
				%t1 = select i1 %cmp, <2 x double> %v2, <2 x double> %v3
				%t2 = select i1 %cmp, <2 x double> %v3, <2 x double> %v4
				%sub = fsub <2 x double> %t1, %t2
				ret <2 x double> %sub
				}

				; This test checks that only a single ja gets generated in the final code
				; for lowering the CMOV pseudos that get created for this IR. This combines
				; all the supported types together into one long string of selects based
				; on the same condition.
				; CHECK-LABEL: foo8:
				; CHECK: ja
				; CHECK-NOT: ja
				define void @foo8(i32 %v1,
				i8 %v2, i8 %v3,
				i16 %v12, i16 %v13,
				i32 %v22, i32 %v23,
				float %v32, float %v33,
				double %v42, double %v43,
				<4 x float> %v52, <4 x float> %v53,
				<2 x double> %v62, <2 x double> %v63,
				<8 x float> %v72, <8 x float> %v73,
				<4 x double> %v82, <4 x double> %v83,
				<16 x float> %v92, <16 x float> %v93,
				<8 x double> %v102, <8 x double> %v103,
				i8 * %dst) nounwind {
				entry:
				%add.ptr11 = getelementptr inbounds i8, i8* %dst, i32 2
				%a11 = bitcast i8* %add.ptr11 to i16*

				%add.ptr21 = getelementptr inbounds i8, i8* %dst, i32 4
				%a21 = bitcast i8* %add.ptr21 to i32*

				%add.ptr31 = getelementptr inbounds i8, i8* %dst, i32 8
				%a31 = bitcast i8* %add.ptr31 to float*

				%add.ptr41 = getelementptr inbounds i8, i8* %dst, i32 16
				%a41 = bitcast i8* %add.ptr41 to double*

				%add.ptr51 = getelementptr inbounds i8, i8* %dst, i32 32
				%a51 = bitcast i8* %add.ptr51 to <4 x float>*

				%add.ptr61 = getelementptr inbounds i8, i8* %dst, i32 48
				%a61 = bitcast i8* %add.ptr61 to <2 x double>*

				%add.ptr71 = getelementptr inbounds i8, i8* %dst, i32 64
				%a71 = bitcast i8* %add.ptr71 to <8 x float>*

				%add.ptr81 = getelementptr inbounds i8, i8* %dst, i32 128
				%a81 = bitcast i8* %add.ptr81 to <4 x double>*

				%add.ptr91 = getelementptr inbounds i8, i8* %dst, i32 64
				%a91 = bitcast i8* %add.ptr91 to <16 x float>*

				%add.ptr101 = getelementptr inbounds i8, i8* %dst, i32 128
				%a101 = bitcast i8* %add.ptr101 to <8 x double>*

				; These operations are necessary, because select of two single use loads
				; ends up getting optimized into a select of two leas, followed by a
				; single load of the selected address.
				%t13 = xor i16 %v13, 11
				%t23 = xor i32 %v23, 1234
				%t33 = fadd float %v33, %v32
				%t43 = fadd double %v43, %v42
				%t53 = fadd <4 x float> %v53, %v52
				%t63 = fadd <2 x double> %v63, %v62
				%t73 = fsub <8 x float> %v73, %v72
				%t83 = fsub <4 x double> %v83, %v82
				%t93 = fsub <16 x float> %v93, %v92
				%t103 = fsub <8 x double> %v103, %v102

				%cmp = icmp ugt i32 %v1, 31
				%t11 = select i1 %cmp, i16 %v12, i16 %t13
				%t21 = select i1 %cmp, i32 %v22, i32 %t23
				%t31 = select i1 %cmp, float %v32, float %t33
				%t41 = select i1 %cmp, double %v42, double %t43
				%t51 = select i1 %cmp, <4 x float> %v52, <4 x float> %t53
				%t61 = select i1 %cmp, <2 x double> %v62, <2 x double> %t63
				%t71 = select i1 %cmp, <8 x float> %v72, <8 x float> %t73
				%t81 = select i1 %cmp, <4 x double> %v82, <4 x double> %t83
				%t91 = select i1 %cmp, <16 x float> %v92, <16 x float> %t93
				%t101 = select i1 %cmp, <8 x double> %v102, <8 x double> %t103

				store i16 %t11, i16* %a11, align 2
				store i32 %t21, i32* %a21, align 4
				store float %t31, float* %a31, align 4
				store double %t41, double* %a41, align 8
				store <4 x float> %t51, <4 x float>* %a51, align 16
				store <2 x double> %t61, <2 x double>* %a61, align 16
				store <8 x float> %t71, <8 x float>* %a71, align 32
				store <4 x double> %t81, <4 x double>* %a81, align 32
				store <16 x float> %t91, <16 x float>* %a91, align 32
				store <8 x double> %t101, <8 x double>* %a101, align 32

				ret void
				}

				; This test checks that only a single ja gets generated in the final code
				; for lowering the CMOV pseudos that get created for this IR.
				; on the same condition.
				; Contrary to my expectations, this doesn't exercise the code for
				; CMOV_V8I1, CMOV_V16I1, CMOV_V32I1, or CMOV_V64I1. Instead the selects all
				; get lowered into vector length number of selects, which all eventually turn
				; into a huge number of CMOV_GR8, which are all contiguous, so the optimization
				; kicks in as long as CMOV_GR8 is supported. I couldn't find a way to get
				; CMOV_VI1 pseudo-opcodes to get generated. If a way exists to get CMOV_V1
				; pseudo-opcodes to be generated, this test should be replaced with one that
				; tests those opcodes.
				abUnsubmitted Not Done Reply Inline Actions These probably require AVX512. You can ignore them, I think. ab: These probably require AVX512. You can ignore them, I think.
				kbsmith1AuthorUnsubmitted Not Done Reply Inline Actions Thanks for that info. kbsmith1: Thanks for that info.
				;
				abUnsubmitted Not Done Reply Inline Actions If this test isn't actually testing CMOV_VNI1, should it be removed? ab: If this test isn't actually testing CMOV_VNI1, should it be removed?
				abUnsubmitted Not Done Reply Inline Actions Does enabling, say, avx512f work? ab: Does enabling, say, avx512f work?
				kbsmith1AuthorUnsubmitted Not Done Reply Inline Actions I will look into that. If it does, then I will remove this specific test, and add such a necessary test back as a separate test file in a separate change set. kbsmith1: I will look into that. If it does, then I will remove this specific test, and add such a…
				; CHECK-LABEL: foo9:
				; CHECK: ja
				; CHECK-NOT: ja
				define void @foo9(i32 %v1,
				<8 x i1> %v12, <8 x i1> %v13,
				<16 x i1> %v22, <16 x i1> %v23,
				<32 x i1> %v32, <32 x i1> %v33,
				<64 x i1> %v42, <64 x i1> %v43,
				i8 * %dst) nounwind {
				entry:
				%add.ptr11 = getelementptr inbounds i8, i8* %dst, i32 0
				%a11 = bitcast i8* %add.ptr11 to <8 x i1>*

				%add.ptr21 = getelementptr inbounds i8, i8* %dst, i32 4
				%a21 = bitcast i8* %add.ptr21 to <16 x i1>*

				%add.ptr31 = getelementptr inbounds i8, i8* %dst, i32 8
				%a31 = bitcast i8* %add.ptr31 to <32 x i1>*

				%add.ptr41 = getelementptr inbounds i8, i8* %dst, i32 16
				%a41 = bitcast i8* %add.ptr41 to <64 x i1>*

				; These operations are necessary, because select of two single use loads
				; ends up getting optimized into a select of two leas, followed by a
				; single load of the selected address.
				%t13 = xor <8 x i1> %v13, %v12
				%t23 = xor <16 x i1> %v23, %v22
				%t33 = xor <32 x i1> %v33, %v32
				%t43 = xor <64 x i1> %v43, %v42

				%cmp = icmp ugt i32 %v1, 31
				%t11 = select i1 %cmp, <8 x i1> %v12, <8 x i1> %t13
				%t21 = select i1 %cmp, <16 x i1> %v22, <16 x i1> %t23
				%t31 = select i1 %cmp, <32 x i1> %v32, <32 x i1> %t33
				%t41 = select i1 %cmp, <64 x i1> %v42, <64 x i1> %t43

				store <8 x i1> %t11, <8 x i1>* %a11, align 16
				store <16 x i1> %t21, <16 x i1>* %a21, align 4
				store <32 x i1> %t31, <32 x i1>* %a31, align 8
				store <64 x i1> %t41, <64 x i1>* %a41, align 16

				ret void
				}

test/CodeGen/X86/pseudo_cmov_lower1.ll

				; RUN: llc < %s -mtriple=i386-linux-gnu -mattr=+sse2 -o - \| FileCheck %s
				; RUN: llc < %s -mtriple=x86_64-linux-gnu -o - \| FileCheck %s

				; This test checks that only a single jae gets generated in the final code
				; for lowering the CMOV pseudos that get created for this IR.
				; CHECK-LABEL: foo1:
				; CHECK: jae
				; CHECK-NOT: jae
				define double @foo1(float %p1, double %p2, double %p3) nounwind {
				entry:
				%c1 = fcmp oge float %p1, 0.000000e+00
				%d0 = fadd double %p2, 1.25e0
				%d1 = fadd double %p3, 1.25e0
				%d2 = select i1 %c1, double %d0, double %d1
				%d3 = select i1 %c1, double %d0, double %p2
				%d4 = select i1 %c1, double %p3, double %d1
				%d5 = fsub double %d2, %d3
				%d6 = fadd double %d5, %d4
				ret double %d6
				}

				; This test checks that only a single jae gets generated in the final code
				; for lowering the CMOV pseudos that get created for this IR.
				; CHECK-LABEL: foo2:
				; CHECK: jae
				; CHECK-NOT: jae
				define float @foo2(float %p1, float %p2, float %p3) nounwind {
				entry:
				%c1 = fcmp oge float %p1, 0.000000e+00
				%d0 = fadd float %p2, 1.25e0
				%d1 = fadd float %p3, 1.25e0
				%d2 = select i1 %c1, float %d0, float %d1
				%d3 = select i1 %c1, float %d1, float %p2
				%d4 = select i1 %c1, float %d0, float %p3
				%d5 = fsub float %d2, %d3
				%d6 = fadd float %d5, %d4
				ret float %d6
				}

test/CodeGen/X86/pseudo_cmov_lower2.ll

				; RUN: llc < %s -mtriple=x86_64-linux-gnu -o - \| FileCheck %s

				; This test checks that only a single jae gets generated in the final code
				; for lowering the CMOV pseudos that get created for this IR. The tricky part
				; of this test is that it tests the special PHI operand rewriting code in
				; X86TargetLowering::EmitLoweredSelect.
				;
				; CHECK-LABEL: foo1:
				; CHECK: jae
				; CHECK-NOT: jae
				define double @foo1(float %p1, double %p2, double %p3) nounwind {
				entry:
				%c1 = fcmp oge float %p1, 0.000000e+00
				%d0 = fadd double %p2, 1.25e0
				%d1 = fadd double %p3, 1.25e0
				%d2 = select i1 %c1, double %d0, double %d1
				%d3 = select i1 %c1, double %d2, double %p2
				%d4 = select i1 %c1, double %d3, double %p3
				%d5 = fsub double %d2, %d3
				%d6 = fadd double %d5, %d4
				ret double %d6
				}

				; This test checks that only a single jae gets generated in the final code
				; for lowering the CMOV pseudos that get created for this IR. The tricky part
				; of this test is that it tests the special PHI operand rewriting code in
				; X86TargetLowering::EmitLoweredSelect.
				;
				; CHECK-LABEL: foo2:
				; CHECK: jae
				; CHECK-NOT: jae
				define double @foo2(float %p1, double %p2, double %p3) nounwind {
				entry:
				%c1 = fcmp oge float %p1, 0.000000e+00
				%d0 = fadd double %p2, 1.25e0
				%d1 = fadd double %p3, 1.25e0
				%d2 = select i1 %c1, double %d0, double %d1
				%d3 = select i1 %c1, double %p2, double %d2
				%d4 = select i1 %c1, double %p3, double %d3
				%d5 = fsub double %d2, %d3
				%d6 = fadd double %d5, %d4
				ret double %d6
				}

				; This test checks that only a single js gets generated in the final code
				; for lowering the CMOV pseudos that get created for this IR. The tricky part
				; of this test is that it tests the special PHI operand rewriting code in
				; X86TargetLowering::EmitLoweredSelect. It also tests to make sure all
				; the operands of the resulting instructions are from the proper places.
				;
				; CHECK-LABEL: foo3:
				; CHECK: js
				; CHECK-NOT: js
				; CHECK-LABEL: # BB#1:
				; CHECK-DAG: movapd %xmm2, %xmm1
				; CHECK-DAG: movapd %xmm2, %xmm0
				; CHECK-LABEL:.LBB2_2:
				; CHECK: divsd %xmm1, %xmm0
				; CHECK: ret
				define double @foo3(i32 %p1, double %p2, double %p3,
				double %p4, double %p5) nounwind {
				entry:
				%c1 = icmp slt i32 %p1, 0
				%d2 = select i1 %c1, double %p2, double %p3
				%d3 = select i1 %c1, double %p3, double %p4
				%d4 = select i1 %c1, double %d2, double %d3
				%d5 = fdiv double %d4, %d3
				ret double %d5
				}

				; This test checks that only a single js gets generated in the final code
				; for lowering the CMOV pseudos that get created for this IR. The tricky part
				; of this test is that it tests the special PHI operand rewriting code in
				; X86TargetLowering::EmitLoweredSelect. It also tests to make sure all
				; the operands of the resulting instructions are from the proper places
				; when the "opposite condition" handling code in the compiler is used.
				; This should be the same code as foo3 above, because we use the opposite
				; condition code in the second two selects, but we also swap the operands
				; of the selects to give the same actual computation.
				;
				; CHECK-LABEL: foo4:
				; CHECK: js
				; CHECK-NOT: js
				; CHECK-LABEL: # BB#1:
				; CHECK-DAG: movapd %xmm2, %xmm1
				; CHECK-DAG: movapd %xmm2, %xmm0
				; CHECK-LABEL:.LBB3_2:
				; CHECK: divsd %xmm1, %xmm0
				; CHECK: ret
				define double @foo4(i32 %p1, double %p2, double %p3,
				double %p4, double %p5) nounwind {
				entry:
				%c1 = icmp slt i32 %p1, 0
				%d2 = select i1 %c1, double %p2, double %p3
				%c2 = icmp sge i32 %p1, 0
				%d3 = select i1 %c2, double %p4, double %p3
				%d4 = select i1 %c2, double %d3, double %d2
				%d5 = fdiv double %d4, %d3
				ret double %d5
				}