This is an archive of the discontinued LLVM Phabricator instance.

Differential D13988

[X86][SSE] Add general memory folding for (V)INSERTPS instruction
ClosedPublic

Authored by RKSimon on Oct 22 2015, 10:27 AM.

Details

Reviewers

spatel
qcolombet
filcab
andreadb
rob.lougher

Commits

rG7e6606f4f1ee: [X86][SSE] Add general memory folding for (V)INSERTPS instruction
rL252074: [X86][SSE] Add general memory folding for (V)INSERTPS instruction

Summary

This patch improves the memory folding of the inserted float element for the (V)INSERTPS instruction.

The existing implementation occurs in the DAGCombiner and relies on the narrowing of a whole vector load into a scalar load (into a vector) to then allow folding to occur later on. Not only has this proven problematic for debug builds, but it also prevents other memory folds (notably stack reloads) from happening.

This patch removes the old implementation and moves the folding code to the X86 foldMemoryOperand handler. A new private 'special case' function - foldMemoryOperandSpecial - has been added to deal with memory folding of instructions that can't just use the lookup tables (insertps is the first of several that could be done).

It also tweaks the memory operand folding code with an additional pointer offset that allows existing memory addresses to be modified, in this case to convert the vector address to the explicit address of the scalar element that will be inserted.
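As a rough illustration of the pointer-offset idea (not the patch's code): an x86 memory reference is carried as five operands (base, scale, index, displacement, segment), and the extra offset is added to the displacement. The `MemRef` struct and `withOffset` helper below are hypothetical stand-ins, and the sketch assumes a plain integer displacement, whereas the patch also handles a global-address displacement.

```cpp
#include <cstdint>

// Hypothetical model of the five x86 address operands, in operand order:
// base register, scale, index register, displacement, segment register.
struct MemRef {
  unsigned Base;
  unsigned Scale;
  unsigned Index;
  int64_t Disp;
  unsigned Segment;
};

// Return a copy of Addr that points PtrOffset bytes further on.
// Only the displacement (operand index 3) changes.
MemRef withOffset(MemRef Addr, int PtrOffset) {
  Addr.Disp += PtrOffset;
  return Addr;
}
```

For a vector at displacement 16, `withOffset(Vec, 3 * 4)` would give displacement 28, the address of the fourth float, with the other four operands untouched.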

Unlike the previous implementation, we now set the insertion source index to zero. The index is ignored by the (V)INSERTPSrm version, so this is mainly beneficial in that shuffle decodes don't show a pointer offset.
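The immediate rewrite can be sketched in isolation as follows. This is a minimal standalone version of the arithmetic in the patch's INSERTPS case; the `FoldedInsertPS` struct and the `foldInsertPSImm` name are made up for illustration and do not exist in LLVM.

```cpp
#include <cassert>

// Result of rewriting a register-form INSERTPS immediate for the memory form.
struct FoldedInsertPS {
  int PtrOffset;   // bytes to add to the address of the source vector
  unsigned NewImm; // immediate with the source index (countS) cleared
};

// INSERTPS imm8 layout: bits 7:6 = source element (countS),
// bits 5:4 = destination element (countD), bits 3:0 = zero mask.
FoldedInsertPS foldInsertPSImm(unsigned Imm) {
  assert(Imm <= 0xFF && "INSERTPS takes an 8-bit immediate");
  unsigned ZMask = Imm & 15;
  unsigned DstIdx = (Imm >> 4) & 3;
  unsigned SrcIdx = (Imm >> 6) & 3;
  // Each float is 4 bytes, so the source element lives at SrcIdx * 4.
  return {static_cast<int>(SrcIdx * 4), (DstIdx << 4) | ZMask};
}
```

With the immediates used in the updated tests, 96 becomes offset 4 with immediate 32, and 192 becomes offset 12 with immediate 0, which lines up with the `insertps $32, 4(...)` and `insertps $0, 12(...)` check lines in sse41.ll.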

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon updated this revision to Diff 38145. Oct 22 2015, 10:27 AM

RKSimon retitled this revision to [X86][SSE] Add general memory folding for (V)INSERTPS instruction.

RKSimon updated this object.

RKSimon added reviewers: qcolombet, filcab, andreadb, spatel.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added a subscriber: llvm-commits.

LGTM, but I'm not the most well versed in the MachineInstr stuff.
Thanks for making the filecheck stuff match more of the instruction, too!

lib/Target/X86/X86ISelLowering.cpp
26640	Please split the clang-format (or manual format) changes from the rest.
lib/Target/X86/X86InstrInfo.cpp
4923	Please split the clang-format changes from the rest.
4951	I would s/Special/Custom/, but that's just me bikeshedding :-)
lib/Target/X86/X86InstrInfo.h
515	Can you double-check the doxygen that gets generated? IIRC, it stops at the first '.'. Unless there's a special case for "e.g.", it's probably best to replace it with a more colloquial "like", or "for example:".

gbedwell added a subscriber: gbedwell. Nov 4 2015, 4:39 AM

rob.lougher added a subscriber: rob.lougher. Nov 4 2015, 4:39 AM

Refreshed patch based on Filipe's comments

Hi Simon,

We discovered a bug internally caused by the non-zeroing of the countS bits in the folding of the insertps load. Although countS bits are ignored when loading from memory on insertps, we need to explicitly set them to 0 as another optimization may later "unfold" the load. This is demonstrated by the following testcase (the checks are based on the RUN lines from the sse41.ll file).

`define <4 x float> @foo(<4 x float>* %v0, <4 x float>* %v1) {
; X32-LABEL: foo:
; X32: BB#0:
; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
; X32-NEXT: movl {{[0-9]+}}(%esp), %ecx
; X32-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; X32-NEXT: movaps (%eax), %xmm0
; X32-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1,2],xmm1[0]
; X32-NEXT: addps %xmm1, %xmm0
; X32-NEXT: retl
;
; X64-LABEL: foo:
; X64: BB#0:
; X64-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; X64-NEXT: movaps (%rdi), %xmm0
; X64-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1,2],xmm1[0]
; X64-NEXT: addps %xmm1, %xmm0
; X64-NEXT: retq

%a = getelementptr inbounds <4 x float>, <4 x float>* %v1, i64 0, i64 1
%b = load float, float* %a, align 4
%c = insertelement <4 x float> undef, float %b, i32 0
%d = load <4 x float>, <4 x float>* %v1, align 16
%e = load <4 x float>, <4 x float>* %v0, align 16
%f = shufflevector <4 x float> %e, <4 x float> %d, <4 x i32> <i32 0, i32 1, i32 2, i32 5>
%g = fadd <4 x float> %c, %f
ret <4 x float> %g

}
`
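To see why stale countS bits matter, here is a small simulation of the two instruction forms. It assumes the usual INSERTPS semantics: the register form selects Src[countS], while the memory form reads a single float and ignores countS. The `unfoldInsertps` helper is a hypothetical model of a later pass turning the folded load back into a register operand; it is not LLVM code.

```cpp
#include <array>

using V4 = std::array<float, 4>;

// Register form: take Src[countS], write it to Dst[countD], then apply zmask.
V4 insertpsReg(V4 Dst, const V4 &Src, unsigned Imm) {
  Dst[(Imm >> 4) & 3] = Src[(Imm >> 6) & 3];
  for (unsigned i = 0; i != 4; ++i)
    if (Imm & (1u << i))
      Dst[i] = 0.0f;
  return Dst;
}

// Memory form: countS is ignored and the single float at the address is used.
V4 insertpsMem(V4 Dst, float Mem, unsigned Imm) {
  return insertpsReg(Dst, V4{Mem, 0.0f, 0.0f, 0.0f}, Imm & 0x3Fu);
}

// Hypothetical "unfold": reload the scalar into lane 0 of a register (as a
// movss from memory does) and reuse the immediate unchanged in register form.
V4 unfoldInsertps(V4 Dst, float Mem, unsigned Imm) {
  return insertpsReg(Dst, V4{Mem, 0.0f, 0.0f, 0.0f}, Imm);
}
```

For an insert of source element 1 into destination element 3 (immediate 0x70), the memory form puts the loaded float in lane 3, while unfolding with the same immediate selects lane 1 of the reloaded register, which holds zero. Clearing countS first (immediate 0x30) makes the two forms agree.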
Another minor comment is that your change will do general memory load folding in addition to stack folding, but you've only got tests for stack folding.

In D13988#281097, @rob.lougher wrote:

Another minor comment is that your change will do general memory load folding in addition to stack folding, but you've only got tests for stack folding.

The changes in test/CodeGen/X86/avx.ll and test/CodeGen/X86/sse41.ll cover general folded loads.

I'll add the test case to the patch shortly.

Added Rob's additional test case

Hi Simon,

Thanks for adding the test. This looks good to me.

Rob.

This revision is now accepted and ready to land. Nov 4 2015, 7:44 AM

Closed by commit rL252074: [X86][SSE] Add general memory folding for (V)INSERTPS instruction (authored by RKSimon). Nov 4 2015, 12:50 PM

This revision was automatically updated to reflect the committed changes.

RKSimon mentioned this in rL253606: [X86] Use existing MachineInstrBuilder::addDisp to create offseted pointer. NFC. Nov 19 2015, 1:53 PM

RKSimon mentioned this in D14867: [MachineInstrBuilder] Support for adding a ConstantPoolIndex MO with an additional offset. Nov 20 2015, 3:36 AM

RKSimon mentioned this in rL253795: [MachineInstrBuilder] Support for adding a ConstantPoolIndex MO with an…. Nov 21 2015, 1:45 PM

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.cpp (revision 252046)	51 lines
lib/Target/X86/X86InstrInfo.h (revision 252046)	8 lines
lib/Target/X86/X86InstrInfo.cpp (revision 252046)	78 lines
test/CodeGen/X86/avx.ll (revision 252046)	6 lines
test/CodeGen/X86/sse41.ll (revision 252046)	16 lines
test/CodeGen/X86/stack-folding-fp-avx1.ll (revision 252046)	10 lines
test/CodeGen/X86/stack-folding-fp-sse42.ll (revision 252046)	10 lines

Diff 39199

lib/Target/X86/X86ISelLowering.cpp

Show First 20 Lines • Show All 26,168 Lines • ▼ Show 20 Lines	if (IsSEXT0 && IsVZero1) {
"Unexpected condition code!");		"Unexpected condition code!");
return LHS.getOperand(0);		return LHS.getOperand(0);
}		}
}		}

return SDValue();		return SDValue();
}		}

static SDValue NarrowVectorLoadToElement(LoadSDNode *Load, unsigned Index,
SelectionDAG &DAG) {
SDLoc dl(Load);
MVT VT = Load->getSimpleValueType(0);
MVT EVT = VT.getVectorElementType();
SDValue Addr = Load->getOperand(1);
SDValue NewAddr = DAG.getNode(
ISD::ADD, dl, Addr.getSimpleValueType(), Addr,
DAG.getConstant(Index * EVT.getStoreSize(), dl,
Addr.getSimpleValueType()));

SDValue NewLoad =
DAG.getLoad(EVT, dl, Load->getChain(), NewAddr,
DAG.getMachineFunction().getMachineMemOperand(
Load->getMemOperand(), 0, EVT.getStoreSize()));
return NewLoad;
}

static SDValue PerformINSERTPSCombine(SDNode *N, SelectionDAG &DAG,
const X86Subtarget *Subtarget) {
SDLoc dl(N);
MVT VT = N->getOperand(1)->getSimpleValueType(0);
assert((VT == MVT::v4f32 || VT == MVT::v4i32) &&
"X86insertps is only defined for v4x32");

SDValue Ld = N->getOperand(1);
if (MayFoldLoad(Ld)) {
// Extract the countS bits from the immediate so we can get the proper
// address when narrowing the vector load to a specific element.
// When the second source op is a memory address, insertps doesn't use
// countS and just gets an f32 from that address.
unsigned DestIndex =
cast<ConstantSDNode>(N->getOperand(2))->getZExtValue() >> 6;

Ld = NarrowVectorLoadToElement(cast<LoadSDNode>(Ld), DestIndex, DAG);

// Create this as a scalar to vector to match the instruction pattern.
SDValue LoadScalarToVector = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VT, Ld);
// countS bits are ignored when loading from memory on insertps, which
// means we don't need to explicitly set them to 0.
return DAG.getNode(X86ISD::INSERTPS, dl, VT, N->getOperand(0),
LoadScalarToVector, N->getOperand(2));
}
return SDValue();
}

static SDValue PerformBLENDICombine(SDNode *N, SelectionDAG &DAG) {		static SDValue PerformBLENDICombine(SDNode *N, SelectionDAG &DAG) {
SDValue V0 = N->getOperand(0);		SDValue V0 = N->getOperand(0);
SDValue V1 = N->getOperand(1);		SDValue V1 = N->getOperand(1);
SDLoc DL(N);		SDLoc DL(N);
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);

// Canonicalize a v2f64 blend with a mask of 2 by swapping the vector		// Canonicalize a v2f64 blend with a mask of 2 by swapping the vector
// operands and changing the mask to 1. This saves us a bunch of		// operands and changing the mask to 1. This saves us a bunch of
▲ Show 20 Lines • Show All 447 Lines • ▼ Show 20 Lines	SDValue X86TargetLowering::PerformDAGCombine(SDNode *N,
case X86ISD::PSHUFB:		case X86ISD::PSHUFB:
case X86ISD::PSHUFD:		case X86ISD::PSHUFD:
case X86ISD::PSHUFHW:		case X86ISD::PSHUFHW:
case X86ISD::PSHUFLW:		case X86ISD::PSHUFLW:
case X86ISD::MOVSS:		case X86ISD::MOVSS:
case X86ISD::MOVSD:		case X86ISD::MOVSD:
case X86ISD::VPERMILPI:		case X86ISD::VPERMILPI:
case X86ISD::VPERM2X128:		case X86ISD::VPERM2X128:
case ISD::VECTOR_SHUFFLE: return PerformShuffleCombine(N, DAG, DCI,Subtarget);		case ISD::VECTOR_SHUFFLE: return PerformShuffleCombine(N, DAG, DCI,Subtarget);
		filcab: Please split the clang-format (or manual format) changes from the rest.
case ISD::FMA: return PerformFMACombine(N, DAG, Subtarget);		case ISD::FMA: return PerformFMACombine(N, DAG, Subtarget);
case X86ISD::INSERTPS: {
if (getTargetMachine().getOptLevel() > CodeGenOpt::None)
return PerformINSERTPSCombine(N, DAG, Subtarget);
break;
}
case X86ISD::BLENDI: return PerformBLENDICombine(N, DAG);		case X86ISD::BLENDI: return PerformBLENDICombine(N, DAG);
}		}

return SDValue();		return SDValue();
}		}

/// isTypeDesirableForOp - Return true if the target has native support for		/// isTypeDesirableForOp - Return true if the target has native support for
/// the specified value type and it is 'desirable' to use the type for the		/// the specified value type and it is 'desirable' to use the type for the
▲ Show 20 Lines • Show All 818 Lines • Show Last 20 Lines

lib/Target/X86/X86InstrInfo.h

Show First 20 Lines • Show All 506 Lines • ▼ Show 20 Lines	MachineInstr commuteInstructionImpl(MachineInstr MI, bool NewMI,
unsigned CommuteOpIdx2) const override;		unsigned CommuteOpIdx2) const override;

private:		private:
MachineInstr * convertToThreeAddressWithLEA(unsigned MIOpc,		MachineInstr * convertToThreeAddressWithLEA(unsigned MIOpc,
MachineFunction::iterator &MFI,		MachineFunction::iterator &MFI,
MachineBasicBlock::iterator &MBBI,		MachineBasicBlock::iterator &MBBI,
LiveVariables *LV) const;		LiveVariables *LV) const;

		/// Handles memory folding for special case instructions, for instance those
		filcab: Can you double-check the doxygen that gets generated? IIRC, it stops at the first '.'. Unless there's a special case for "e.g.", it's probably best to replace it with a more colloquial "like", or "for example:".
		/// requiring custom manipulation of the address.
		MachineInstr *foldMemoryOperandCustom(MachineFunction &MF, MachineInstr *MI,
		unsigned OpNum,
		ArrayRef<MachineOperand> MOs,
		MachineBasicBlock::iterator InsertPt,
		unsigned Size, unsigned Align) const;

/// isFrameOperand - Return true and the FrameIndex if the specified		/// isFrameOperand - Return true and the FrameIndex if the specified
/// operand and follow operands form a reference to the stack frame.		/// operand and follow operands form a reference to the stack frame.
bool isFrameOperand(const MachineInstr *MI, unsigned int Op,		bool isFrameOperand(const MachineInstr *MI, unsigned int Op,
int &FrameIndex) const;		int &FrameIndex) const;
};		};

} // End llvm namespace		} // End llvm namespace

#endif		#endif

lib/Target/X86/X86InstrInfo.cpp

Show First 20 Lines • Show All 4,846 Lines • ▼ Show 20 Lines	bool X86InstrInfo::expandPostRAPseudo(MachineBasicBlock::iterator MI) const {
case X86::KSET1Q: return Expand2AddrUndef(MIB, get(X86::KXNORQrr));		case X86::KSET1Q: return Expand2AddrUndef(MIB, get(X86::KXNORQrr));
case TargetOpcode::LOAD_STACK_GUARD:		case TargetOpcode::LOAD_STACK_GUARD:
expandLoadStackGuard(MIB, *this);		expandLoadStackGuard(MIB, *this);
return true;		return true;
}		}
return false;		return false;
}		}

static void addOperands(MachineInstrBuilder &MIB, ArrayRef<MachineOperand> MOs) {		static void addOperands(MachineInstrBuilder &MIB, ArrayRef<MachineOperand> MOs,
		int PtrOffset = 0) {
unsigned NumAddrOps = MOs.size();		unsigned NumAddrOps = MOs.size();

		if (NumAddrOps < 4) {
		// FrameIndex only - add an immediate offset (whether its zero or not).
for (unsigned i = 0; i != NumAddrOps; ++i)		for (unsigned i = 0; i != NumAddrOps; ++i)
MIB.addOperand(MOs[i]);		MIB.addOperand(MOs[i]);
if (NumAddrOps < 4) // FrameIndex only		addOffset(MIB, PtrOffset);
addOffset(MIB, 0);		} else {
		// General Memory Addressing - we need to add any offset to an existing
		// offset.
		assert(MOs.size() == 5 && "Unexpected memory operand list length");
		for (unsigned i = 0; i != NumAddrOps; ++i) {
		const MachineOperand &MO = MOs[i];
		if (i == 3 && PtrOffset != 0) {
		assert((MO.isImm() || MO.isGlobal()) &&
		"Unexpected memory operand type");
		if (MO.isImm()) {
		MIB.addImm(MO.getImm() + PtrOffset);
		} else {
		MIB.addGlobalAddress(MO.getGlobal(), MO.getOffset() + PtrOffset,
		MO.getTargetFlags());
		}
		} else {
		MIB.addOperand(MO);
		}
		}
		}
}		}

static MachineInstr *FuseTwoAddrInst(MachineFunction &MF, unsigned Opcode,		static MachineInstr *FuseTwoAddrInst(MachineFunction &MF, unsigned Opcode,
ArrayRef<MachineOperand> MOs,		ArrayRef<MachineOperand> MOs,
MachineBasicBlock::iterator InsertPt,		MachineBasicBlock::iterator InsertPt,
MachineInstr *MI,		MachineInstr *MI,
const TargetInstrInfo &TII) {		const TargetInstrInfo &TII) {
// Create the base instruction with the memory operand as the first part.		// Create the base instruction with the memory operand as the first part.
Show All 18 Lines	static MachineInstr *FuseTwoAddrInst(MachineFunction &MF, unsigned Opcode,
MBB->insert(InsertPt, NewMI);		MBB->insert(InsertPt, NewMI);

return MIB;		return MIB;
}		}

static MachineInstr *FuseInst(MachineFunction &MF, unsigned Opcode,		static MachineInstr *FuseInst(MachineFunction &MF, unsigned Opcode,
unsigned OpNo, ArrayRef<MachineOperand> MOs,		unsigned OpNo, ArrayRef<MachineOperand> MOs,
MachineBasicBlock::iterator InsertPt,		MachineBasicBlock::iterator InsertPt,
MachineInstr *MI, const TargetInstrInfo &TII) {		MachineInstr *MI, const TargetInstrInfo &TII,
		int PtrOffset = 0) {
// Omit the implicit operands, something BuildMI can't do.		// Omit the implicit operands, something BuildMI can't do.
MachineInstr *NewMI = MF.CreateMachineInstr(TII.get(Opcode),		MachineInstr *NewMI = MF.CreateMachineInstr(TII.get(Opcode),
MI->getDebugLoc(), true);		MI->getDebugLoc(), true);
MachineInstrBuilder MIB(MF, NewMI);		MachineInstrBuilder MIB(MF, NewMI);
		filcab: Please split the clang-format changes from the rest.

for (unsigned i = 0, e = MI->getNumOperands(); i != e; ++i) {		for (unsigned i = 0, e = MI->getNumOperands(); i != e; ++i) {
MachineOperand &MO = MI->getOperand(i);		MachineOperand &MO = MI->getOperand(i);
if (i == OpNo) {		if (i == OpNo) {
assert(MO.isReg() && "Expected to fold into reg operand!");		assert(MO.isReg() && "Expected to fold into reg operand!");
addOperands(MIB, MOs);		addOperands(MIB, MOs, PtrOffset);
} else {		} else {
MIB.addOperand(MO);		MIB.addOperand(MO);
}		}
}		}

MachineBasicBlock *MBB = InsertPt->getParent();		MachineBasicBlock *MBB = InsertPt->getParent();
MBB->insert(InsertPt, NewMI);		MBB->insert(InsertPt, NewMI);

return MIB;		return MIB;
}		}

static MachineInstr *MakeM0Inst(const TargetInstrInfo &TII, unsigned Opcode,		static MachineInstr *MakeM0Inst(const TargetInstrInfo &TII, unsigned Opcode,
ArrayRef<MachineOperand> MOs,		ArrayRef<MachineOperand> MOs,
MachineBasicBlock::iterator InsertPt,		MachineBasicBlock::iterator InsertPt,
MachineInstr *MI) {		MachineInstr *MI) {
MachineInstrBuilder MIB = BuildMI(*InsertPt->getParent(), InsertPt,		MachineInstrBuilder MIB = BuildMI(*InsertPt->getParent(), InsertPt,
MI->getDebugLoc(), TII.get(Opcode));		MI->getDebugLoc(), TII.get(Opcode));
addOperands(MIB, MOs);		addOperands(MIB, MOs);
return MIB.addImm(0);		return MIB.addImm(0);
}		}

		MachineInstr *X86InstrInfo::foldMemoryOperandCustom(
		filcab: I would s/Special/Custom/, but that's just me bikeshedding :-)
		MachineFunction &MF, MachineInstr *MI, unsigned OpNum,
		ArrayRef<MachineOperand> MOs, MachineBasicBlock::iterator InsertPt,
		unsigned Size, unsigned Align) const {
		switch (MI->getOpcode()) {
		case X86::INSERTPSrr:
		case X86::VINSERTPSrr:
		// Attempt to convert the load of inserted vector into a fold load
		// of a single float.
		if (OpNum == 2) {
		unsigned Imm = MI->getOperand(MI->getNumOperands() - 1).getImm();
		unsigned ZMask = Imm & 15;
		unsigned DstIdx = (Imm >> 4) & 3;
		unsigned SrcIdx = (Imm >> 6) & 3;

		unsigned RCSize = getRegClass(MI->getDesc(), OpNum, &RI, MF)->getSize();
		if (Size <= RCSize && 4 <= Align) {
		int PtrOffset = SrcIdx * 4;
		unsigned NewImm = (DstIdx << 4) | ZMask;
		unsigned NewOpCode =
		(MI->getOpcode() == X86::VINSERTPSrr ? X86::VINSERTPSrm
		: X86::INSERTPSrm);
		MachineInstr *NewMI =
		FuseInst(MF, NewOpCode, OpNum, MOs, InsertPt, MI, *this, PtrOffset);
		NewMI->getOperand(NewMI->getNumOperands() - 1).setImm(NewImm);
		return NewMI;
		}
		}
		break;
		};

		return nullptr;
		}

MachineInstr *X86InstrInfo::foldMemoryOperandImpl(		MachineInstr *X86InstrInfo::foldMemoryOperandImpl(
MachineFunction &MF, MachineInstr *MI, unsigned OpNum,		MachineFunction &MF, MachineInstr *MI, unsigned OpNum,
ArrayRef<MachineOperand> MOs, MachineBasicBlock::iterator InsertPt,		ArrayRef<MachineOperand> MOs, MachineBasicBlock::iterator InsertPt,
unsigned Size, unsigned Align, bool AllowCommute) const {		unsigned Size, unsigned Align, bool AllowCommute) const {
const DenseMap<unsigned,		const DenseMap<unsigned,
std::pair<unsigned,unsigned> > *OpcodeTablePtr = nullptr;		std::pair<unsigned,unsigned> > *OpcodeTablePtr = nullptr;
bool isCallRegIndirect = Subtarget.callRegIndirect();		bool isCallRegIndirect = Subtarget.callRegIndirect();
bool isTwoAddrFold = false;		bool isTwoAddrFold = false;
Show All 13 Lines	MachineInstr *X86InstrInfo::foldMemoryOperandImpl(

// FIXME: AsmPrinter doesn't know how to handle		// FIXME: AsmPrinter doesn't know how to handle
// X86II::MO_GOT_ABSOLUTE_ADDRESS after folding.		// X86II::MO_GOT_ABSOLUTE_ADDRESS after folding.
if (MI->getOpcode() == X86::ADD32ri &&		if (MI->getOpcode() == X86::ADD32ri &&
MI->getOperand(2).getTargetFlags() == X86II::MO_GOT_ABSOLUTE_ADDRESS)		MI->getOperand(2).getTargetFlags() == X86II::MO_GOT_ABSOLUTE_ADDRESS)
return nullptr;		return nullptr;

MachineInstr *NewMI = nullptr;		MachineInstr *NewMI = nullptr;

		// Attempt to fold any custom cases we have.
		if (NewMI =
		foldMemoryOperandCustom(MF, MI, OpNum, MOs, InsertPt, Size, Align))
		return NewMI;

// Folding a memory location into the two-address part of a two-address		// Folding a memory location into the two-address part of a two-address
// instruction is different than folding it other places. It requires		// instruction is different than folding it other places. It requires
// replacing the two registers with the memory location.		// replacing the two registers with the memory location.
if (isTwoAddr && NumOps >= 2 && OpNum < 2 &&		if (isTwoAddr && NumOps >= 2 && OpNum < 2 &&
MI->getOperand(0).isReg() &&		MI->getOperand(0).isReg() &&
MI->getOperand(1).isReg() &&		MI->getOperand(1).isReg() &&
MI->getOperand(0).getReg() == MI->getOperand(1).getReg()) {		MI->getOperand(0).getReg() == MI->getOperand(1).getReg()) {
OpcodeTablePtr = &RegOp2MemOpTable2Addr;		OpcodeTablePtr = &RegOp2MemOpTable2Addr;
▲ Show 20 Lines • Show All 1,755 Lines • Show Last 20 Lines

test/CodeGen/X86/avx.ll

	Show All 26 Lines

	declare <4 x float> @llvm.x86.sse41.insertps(<4 x float>, <4 x float>, i32) nounwind readnone			declare <4 x float> @llvm.x86.sse41.insertps(<4 x float>, <4 x float>, i32) nounwind readnone

	define <4 x float> @insertps_from_vector_load(<4 x float> %a, <4 x float>* nocapture readonly %pb) {			define <4 x float> @insertps_from_vector_load(<4 x float> %a, <4 x float>* nocapture readonly %pb) {
	; CHECK-LABEL: insertps_from_vector_load:			; CHECK-LABEL: insertps_from_vector_load:
	; On X32, account for the argument's move to registers			; On X32, account for the argument's move to registers
	; X32: movl 4(%esp), %eax			; X32: movl 4(%esp), %eax
	; CHECK-NOT: mov			; CHECK-NOT: mov
	; CHECK: insertps $48			; CHECK: vinsertps $48, (%{{...}}), {{.*#+}} xmm0 = xmm0[0,1,2],mem[0]
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%1 = load <4 x float>, <4 x float>* %pb, align 16			%1 = load <4 x float>, <4 x float>* %pb, align 16
	%2 = tail call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %a, <4 x float> %1, i32 48)			%2 = tail call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %a, <4 x float> %1, i32 48)
	ret <4 x float> %2			ret <4 x float> %2
	}			}

	;; Use a non-zero CountS for insertps			;; Use a non-zero CountS for insertps
	define <4 x float> @insertps_from_vector_load_offset(<4 x float> %a, <4 x float>* nocapture readonly %pb) {			define <4 x float> @insertps_from_vector_load_offset(<4 x float> %a, <4 x float>* nocapture readonly %pb) {
	; CHECK-LABEL: insertps_from_vector_load_offset:			; CHECK-LABEL: insertps_from_vector_load_offset:
	; On X32, account for the argument's move to registers			; On X32, account for the argument's move to registers
	; X32: movl 4(%esp), %eax			; X32: movl 4(%esp), %eax
	; CHECK-NOT: mov			; CHECK-NOT: mov
	;; Try to match a bit more of the instr, since we need the load's offset.			;; Try to match a bit more of the instr, since we need the load's offset.
	; CHECK: insertps $96, 4(%{{...}}), %			; CHECK: vinsertps $32, 4(%{{...}}), {{.*#+}} xmm0 = xmm0[0,1],mem[0],xmm0[3]
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%1 = load <4 x float>, <4 x float>* %pb, align 16			%1 = load <4 x float>, <4 x float>* %pb, align 16
	%2 = tail call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %a, <4 x float> %1, i32 96)			%2 = tail call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %a, <4 x float> %1, i32 96)
	ret <4 x float> %2			ret <4 x float> %2
	}			}

	define <4 x float> @insertps_from_vector_load_offset_2(<4 x float> %a, <4 x float>* nocapture readonly %pb, i64 %index) {			define <4 x float> @insertps_from_vector_load_offset_2(<4 x float> %a, <4 x float>* nocapture readonly %pb, i64 %index) {
	; CHECK-LABEL: insertps_from_vector_load_offset_2:			; CHECK-LABEL: insertps_from_vector_load_offset_2:
	; On X32, account for the argument's move to registers			; On X32, account for the argument's move to registers
	; X32: movl 4(%esp), %eax			; X32: movl 4(%esp), %eax
	; X32: movl 8(%esp), %ecx			; X32: movl 8(%esp), %ecx
	; CHECK-NOT: mov			; CHECK-NOT: mov
	;; Try to match a bit more of the instr, since we need the load's offset.			;; Try to match a bit more of the instr, since we need the load's offset.
	; CHECK: vinsertps $192, 12(%{{...}},%{{...}}), %			; CHECK: vinsertps $0, 12(%{{...}},%{{...}}), {{.*#+}} xmm0 = mem[0],xmm0[1,2,3]
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%1 = getelementptr inbounds <4 x float>, <4 x float>* %pb, i64 %index			%1 = getelementptr inbounds <4 x float>, <4 x float>* %pb, i64 %index
	%2 = load <4 x float>, <4 x float>* %1, align 16			%2 = load <4 x float>, <4 x float>* %1, align 16
	%3 = tail call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %a, <4 x float> %2, i32 192)			%3 = tail call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %a, <4 x float> %2, i32 192)
	ret <4 x float> %3			ret <4 x float> %3
	}			}

	define <4 x float> @insertps_from_broadcast_loadf32(<4 x float> %a, float* nocapture readonly %fb, i64 %index) {			define <4 x float> @insertps_from_broadcast_loadf32(<4 x float> %a, float* nocapture readonly %fb, i64 %index) {
	▲ Show 20 Lines • Show All 65 Lines • Show Last 20 Lines

test/CodeGen/X86/sse41.ll

Show First 20 Lines • Show All 788 Lines • ▼ Show 20 Lines	; X64-NEXT: retq
ret <8 x i16> %ret		ret <8 x i16> %ret
}		}

; On X32, account for the argument's move to registers		; On X32, account for the argument's move to registers
define <4 x float> @insertps_from_vector_load(<4 x float> %a, <4 x float>* nocapture readonly %pb) {		define <4 x float> @insertps_from_vector_load(<4 x float> %a, <4 x float>* nocapture readonly %pb) {
; X32-LABEL: insertps_from_vector_load:		; X32-LABEL: insertps_from_vector_load:
; X32: ## BB#0:		; X32: ## BB#0:
; X32-NEXT: movl {{[0-9]+}}(%esp), %eax		; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
; X32-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1,2],mem[0]		; X32-NEXT: insertps $48, (%{{...}}), {{.*#+}} xmm0 = xmm0[0,1,2],mem[0]
; X32-NEXT: retl		; X32-NEXT: retl
;		;
; X64-LABEL: insertps_from_vector_load:		; X64-LABEL: insertps_from_vector_load:
; X64: ## BB#0:		; X64: ## BB#0:
; X64-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1,2],mem[0]		; X64-NEXT: insertps $48, (%{{...}}), {{.*#+}} xmm0 = xmm0[0,1,2],mem[0]
; X64-NEXT: retq		; X64-NEXT: retq
%1 = load <4 x float>, <4 x float>* %pb, align 16		%1 = load <4 x float>, <4 x float>* %pb, align 16
%2 = tail call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %a, <4 x float> %1, i32 48)		%2 = tail call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %a, <4 x float> %1, i32 48)
ret <4 x float> %2		ret <4 x float> %2
}		}

;; Use a non-zero CountS for insertps		;; Use a non-zero CountS for insertps
;; Try to match a bit more of the instr, since we need the load's offset.		;; Try to match a bit more of the instr, since we need the load's offset.
define <4 x float> @insertps_from_vector_load_offset(<4 x float> %a, <4 x float>* nocapture readonly %pb) {		define <4 x float> @insertps_from_vector_load_offset(<4 x float> %a, <4 x float>* nocapture readonly %pb) {
; X32-LABEL: insertps_from_vector_load_offset:		; X32-LABEL: insertps_from_vector_load_offset:
; X32: ## BB#0:		; X32: ## BB#0:
; X32-NEXT: movl {{[0-9]+}}(%esp), %eax		; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
; X32-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1],mem[1],xmm0[3]		; X32-NEXT: insertps $32, 4(%{{...}}), {{.*#+}} xmm0 = xmm0[0,1],mem[0],xmm0[3]
; X32-NEXT: retl		; X32-NEXT: retl
;		;
; X64-LABEL: insertps_from_vector_load_offset:		; X64-LABEL: insertps_from_vector_load_offset:
; X64: ## BB#0:		; X64: ## BB#0:
; X64-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1],mem[1],xmm0[3]		; X64-NEXT: insertps $32, 4(%{{...}}), {{.*#+}} xmm0 = xmm0[0,1],mem[0],xmm0[3]
; X64-NEXT: retq		; X64-NEXT: retq
%1 = load <4 x float>, <4 x float>* %pb, align 16		%1 = load <4 x float>, <4 x float>* %pb, align 16
%2 = tail call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %a, <4 x float> %1, i32 96)		%2 = tail call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %a, <4 x float> %1, i32 96)
ret <4 x float> %2		ret <4 x float> %2
}		}

;; Try to match a bit more of the instr, since we need the load's offset.		;; Try to match a bit more of the instr, since we need the load's offset.
define <4 x float> @insertps_from_vector_load_offset_2(<4 x float> %a, <4 x float>* nocapture readonly %pb, i64 %index) {		define <4 x float> @insertps_from_vector_load_offset_2(<4 x float> %a, <4 x float>* nocapture readonly %pb, i64 %index) {
; X32-LABEL: insertps_from_vector_load_offset_2:		; X32-LABEL: insertps_from_vector_load_offset_2:
; X32: ## BB#0:		; X32: ## BB#0:
; X32-NEXT: movl {{[0-9]+}}(%esp), %eax		; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
; X32-NEXT: movl {{[0-9]+}}(%esp), %ecx		; X32-NEXT: movl {{[0-9]+}}(%esp), %ecx
; X32-NEXT: shll $4, %ecx		; X32-NEXT: shll $4, %ecx
; X32-NEXT: insertps {{.*#+}} xmm0 = mem[3],xmm0[1,2,3]		; X32-NEXT: insertps $0, 12(%{{...}},%{{...}}), {{.*#+}} xmm0 = mem[0],xmm0[1,2,3]
; X32-NEXT: retl		; X32-NEXT: retl
;		;
; X64-LABEL: insertps_from_vector_load_offset_2:		; X64-LABEL: insertps_from_vector_load_offset_2:
; X64: ## BB#0:		; X64: ## BB#0:
; X64-NEXT: shlq $4, %rsi		; X64-NEXT: shlq $4, %rsi
; X64-NEXT: insertps {{.*#+}} xmm0 = mem[3],xmm0[1,2,3]		; X64-NEXT: insertps $0, 12(%{{...}},%{{...}}), {{.*#+}} xmm0 = mem[0],xmm0[1,2,3]
; X64-NEXT: retq		; X64-NEXT: retq
%1 = getelementptr inbounds <4 x float>, <4 x float>* %pb, i64 %index		%1 = getelementptr inbounds <4 x float>, <4 x float>* %pb, i64 %index
%2 = load <4 x float>, <4 x float>* %1, align 16		%2 = load <4 x float>, <4 x float>* %1, align 16
%3 = tail call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %a, <4 x float> %2, i32 192)		%3 = tail call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %a, <4 x float> %2, i32 192)
ret <4 x float> %3		ret <4 x float> %3
}		}

define <4 x float> @insertps_from_broadcast_loadf32(<4 x float> %a, float* nocapture readonly %fb, i64 %index) {		define <4 x float> @insertps_from_broadcast_loadf32(<4 x float> %a, float* nocapture readonly %fb, i64 %index) {
▲ Show 20 Lines • Show All 114 Lines • ▼ Show 20 Lines
}		}

; Test for a bug in X86ISelLowering.cpp:getINSERTPS where we were using		; Test for a bug in X86ISelLowering.cpp:getINSERTPS where we were using
; the destination index to change the load, instead of the source index.		; the destination index to change the load, instead of the source index.
define <4 x float> @pr20087(<4 x float> %a, <4 x float> *%ptr) {		define <4 x float> @pr20087(<4 x float> %a, <4 x float> *%ptr) {
; X32-LABEL: pr20087:		; X32-LABEL: pr20087:
; X32: ## BB#0:		; X32: ## BB#0:
; X32-NEXT: movl {{[0-9]+}}(%esp), %eax		; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
; X32-NEXT: insertps {{.*#+}} xmm0 = xmm0[0],zero,xmm0[2],mem[2]		; X32-NEXT: insertps {{.*#+}} xmm0 = xmm0[0],zero,xmm0[2],mem[0]
; X32-NEXT: retl		; X32-NEXT: retl
;		;
; X64-LABEL: pr20087:		; X64-LABEL: pr20087:
; X64: ## BB#0:		; X64: ## BB#0:
; X64-NEXT: insertps {{.*#+}} xmm0 = xmm0[0],zero,xmm0[2],mem[2]		; X64-NEXT: insertps {{.*#+}} xmm0 = xmm0[0],zero,xmm0[2],mem[0]
; X64-NEXT: retq		; X64-NEXT: retq
%load = load <4 x float> , <4 x float> *%ptr		%load = load <4 x float> , <4 x float> *%ptr
%ret = shufflevector <4 x float> %load, <4 x float> %a, <4 x i32> <i32 4, i32 undef, i32 6, i32 2>		%ret = shufflevector <4 x float> %load, <4 x float> %a, <4 x i32> <i32 4, i32 undef, i32 6, i32 2>
ret <4 x float> %ret		ret <4 x float> %ret
}		}

; Edge case for insertps where we end up with a shuffle with mask=<0, 7, -1, -1>		; Edge case for insertps where we end up with a shuffle with mask=<0, 7, -1, -1>
define void @insertps_pr20411(<4 x i32> %shuffle109, <4 x i32> %shuffle116, i32* noalias nocapture %RET) #1 {		define void @insertps_pr20411(<4 x i32> %shuffle109, <4 x i32> %shuffle116, i32* noalias nocapture %RET) #1 {
▲ Show 20 Lines • Show All 195 Lines • Show Last 20 Lines

test/CodeGen/X86/stack-folding-fp-avx1.ll

	Show First 20 Lines • Show All 940 Lines • ▼ Show 20 Lines
	define <8 x float> @stack_fold_insertf128(<4 x float> %a0, <4 x float> %a1) {			define <8 x float> @stack_fold_insertf128(<4 x float> %a0, <4 x float> %a1) {
	;CHECK-LABEL: stack_fold_insertf128			;CHECK-LABEL: stack_fold_insertf128
	;CHECK: vinsertf128 $1, {{-?[0-9]*}}(%rsp), {{%ymm[0-9][0-9]*}}, {{%ymm[0-9][0-9]*}} {{.*#+}} 16-byte Folded Reload			;CHECK: vinsertf128 $1, {{-?[0-9]*}}(%rsp), {{%ymm[0-9][0-9]*}}, {{%ymm[0-9][0-9]*}} {{.*#+}} 16-byte Folded Reload
	%1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()			%1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()
	%2 = shufflevector <4 x float> %a0, <4 x float> %a1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>			%2 = shufflevector <4 x float> %a0, <4 x float> %a1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
	ret <8 x float> %2			ret <8 x float> %2
	}			}

	; TODO stack_fold_insertps			define <4 x float> @stack_fold_insertps(<4 x float> %a0, <4 x float> %a1) {
				;CHECK-LABEL: stack_fold_insertps
				;CHECK: vinsertps $17, {{-?[0-9]*}}(%rsp), {{%xmm[0-9][0-9]*}}, {{%xmm[0-9][0-9]*}} {{.*#+}} 16-byte Folded Reload
				;CHECK-NEXT: {{.*#+}} xmm0 = zero,mem[0],xmm0[2,3]
				%1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()
				%2 = call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %a0, <4 x float> %a1, i8 209)
				ret <4 x float> %2
				}
				declare <4 x float> @llvm.x86.sse41.insertps(<4 x float>, <4 x float>, i8) nounwind readnone

	define <2 x double> @stack_fold_maxpd(<2 x double> %a0, <2 x double> %a1) {			define <2 x double> @stack_fold_maxpd(<2 x double> %a0, <2 x double> %a1) {
	;CHECK-LABEL: stack_fold_maxpd			;CHECK-LABEL: stack_fold_maxpd
	;CHECK: vmaxpd {{-?[0-9]*}}(%rsp), {{%xmm[0-9][0-9]*}}, {{%xmm[0-9][0-9]*}} {{.*#+}} 16-byte Folded Reload			;CHECK: vmaxpd {{-?[0-9]*}}(%rsp), {{%xmm[0-9][0-9]*}}, {{%xmm[0-9][0-9]*}} {{.*#+}} 16-byte Folded Reload
	%1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()			%1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()
	%2 = call <2 x double> @llvm.x86.sse2.max.pd(<2 x double> %a0, <2 x double> %a1)			%2 = call <2 x double> @llvm.x86.sse2.max.pd(<2 x double> %a0, <2 x double> %a1)
	ret <2 x double> %2			ret <2 x double> %2
	}			}

test/CodeGen/X86/stack-folding-fp-sse42.ll

define <4 x float> @stack_fold_hsubps(<4 x float> %a0, <4 x float> %a1) {		define <4 x float> @stack_fold_hsubps(<4 x float> %a0, <4 x float> %a1) {
;CHECK-LABEL: stack_fold_hsubps		;CHECK-LABEL: stack_fold_hsubps
;CHECK: hsubps {{-?[0-9]*}}(%rsp), {{%xmm[0-9][0-9]*}} {{.*#+}} 16-byte Folded Reload		;CHECK: hsubps {{-?[0-9]*}}(%rsp), {{%xmm[0-9][0-9]*}} {{.*#+}} 16-byte Folded Reload
%1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()		%1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()
%2 = call <4 x float> @llvm.x86.sse3.hsub.ps(<4 x float> %a0, <4 x float> %a1)		%2 = call <4 x float> @llvm.x86.sse3.hsub.ps(<4 x float> %a0, <4 x float> %a1)
ret <4 x float> %2		ret <4 x float> %2
}		}
declare <4 x float> @llvm.x86.sse3.hsub.ps(<4 x float>, <4 x float>) nounwind readnone		declare <4 x float> @llvm.x86.sse3.hsub.ps(<4 x float>, <4 x float>) nounwind readnone

; TODO stack_fold_insertps		define <4 x float> @stack_fold_insertps(<4 x float> %a0, <4 x float> %a1) {
		;CHECK-LABEL: stack_fold_insertps
		;CHECK: insertps $17, {{-?[0-9]*}}(%rsp), {{%xmm[0-9][0-9]*}} {{.*#+}} 16-byte Folded Reload
		;CHECK-NEXT: {{.*#+}} xmm0 = zero,mem[0],xmm0[2,3]
		%1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()
		%2 = call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %a0, <4 x float> %a1, i8 209)
		ret <4 x float> %2
		}
		declare <4 x float> @llvm.x86.sse41.insertps(<4 x float>, <4 x float>, i8) nounwind readnone
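
The new stack_fold_insertps tests call the intrinsic with `i8 209` but check for `$17` in the folded instruction. The sketch below models why that is expected, following the summary: the fold clears the source index in the immediate and moves it into a pointer offset instead. This is a scalar model for illustration only, not the foldMemoryOperandSpecial code from the patch; `insertpsReg`, `insertpsMem`, `foldInsertPS` and `foldPreservesResult` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec4 = std::array<float, 4>;

// Register form: element imm[7:6] of Src goes to lane imm[5:4] of Dst,
// then every lane whose bit is set in imm[3:0] is zeroed.
Vec4 insertpsReg(Vec4 Dst, const Vec4 &Src, uint8_t Imm) {
  Dst[(Imm >> 4) & 3] = Src[(Imm >> 6) & 3];
  for (unsigned Lane = 0; Lane < 4; ++Lane)
    if (Imm & (1u << Lane))
      Dst[Lane] = 0.0f;
  return Dst;
}

// Memory form: a single float is loaded from Ptr; imm[7:6] is ignored.
Vec4 insertpsMem(Vec4 Dst, const float *Ptr, uint8_t Imm) {
  Dst[(Imm >> 4) & 3] = *Ptr;
  for (unsigned Lane = 0; Lane < 4; ++Lane)
    if (Imm & (1u << Lane))
      Dst[Lane] = 0.0f;
  return Dst;
}

// Result of folding a vector operand into the memory form.
struct FoldedInsertPS {
  uint8_t Imm;         // immediate with the source index cleared
  unsigned ByteOffset; // added to the vector address to reach the scalar
};

// Move the source index out of the immediate and into the address.
FoldedInsertPS foldInsertPS(uint8_t Imm) {
  unsigned SrcIdx = (Imm >> 6) & 3;
  return {static_cast<uint8_t>(Imm & 0x3F),
          SrcIdx * static_cast<unsigned>(sizeof(float))};
}

// Check that the folded memory form matches the register form.
bool foldPreservesResult(const Vec4 &Dst, const Vec4 &Src, uint8_t Imm) {
  FoldedInsertPS F = foldInsertPS(Imm);
  const float *Ptr = Src.data() + F.ByteOffset / sizeof(float);
  return insertpsReg(Dst, Src, Imm) == insertpsMem(Dst, Ptr, F.Imm);
}
```

With 209 (0b11010001) the source index is 3, the destination lane is 1 and the zero mask is 0b0001. Folding therefore yields immediate 17 with a 12-byte offset, and the result has the shape `zero,mem[0],xmm0[2,3]` that the CHECK-NEXT line expects.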

define <2 x double> @stack_fold_maxpd(<2 x double> %a0, <2 x double> %a1) {		define <2 x double> @stack_fold_maxpd(<2 x double> %a0, <2 x double> %a1) {
;CHECK-LABEL: stack_fold_maxpd		;CHECK-LABEL: stack_fold_maxpd
;CHECK: maxpd {{-?[0-9]*}}(%rsp), {{%xmm[0-9][0-9]*}} {{.*#+}} 16-byte Folded Reload		;CHECK: maxpd {{-?[0-9]*}}(%rsp), {{%xmm[0-9][0-9]*}} {{.*#+}} 16-byte Folded Reload
%1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()		%1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()
%2 = call <2 x double> @llvm.x86.sse2.max.pd(<2 x double> %a0, <2 x double> %a1)		%2 = call <2 x double> @llvm.x86.sse2.max.pd(<2 x double> %a0, <2 x double> %a1)
ret <2 x double> %2		ret <2 x double> %2
}		}