This is an archive of the discontinued LLVM Phabricator instance.

[x86] enhance mayFoldLoad to check alignment
ClosedPublic

Authored by spatel on Oct 26 2021, 8:08 AM.


Details

Reviewers

craig.topper
pengfei
RKSimon
lebedev.ri

Commits

rG6c0a2c2804c0: [x86] enhance mayFoldLoad to check alignment

Summary

As noted in D112464, a pre-AVX target can't fold an under-aligned vector load into another op, so we shouldn't report that as a load folding candidate. I only found one caller where this would make a difference -- combineCommutableSHUFP() -- so that's where I added a test to show the (minor) regression.
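
The rule in the summary can be sketched outside of LLVM as a small standalone model. `SubtargetModel`, `LoadModel`, and `mayFoldLoadModel` are hypothetical names invented for this illustration and are not LLVM types; their fields only mimic the queries the patch uses (`hasAVX()`, `hasSSEUnalignedMem()`, `getValueSizeInBits(0)`, `getAlignment()`). The `hasSSEUnalignedMem()` clause reflects the version that landed after review.

```cpp
// Hypothetical stand-in for the two X86Subtarget feature queries used here.
struct SubtargetModel {
  bool HasAVX;
  bool HasSSEUnalignedMem;
};

// Hypothetical stand-in for the load properties the predicate inspects.
struct LoadModel {
  bool IsNormalLoad;    // models ISD::isNormalLoad()
  bool HasOneUse;       // models SDValue::hasOneUse()
  unsigned SizeInBits;  // models getValueSizeInBits(0)
  unsigned AlignBytes;  // models getAlignment()
};

// Follows the control flow of the patched mayFoldLoad().
constexpr bool mayFoldLoadModel(const LoadModel &Ld, const SubtargetModel &ST,
                                bool AssumeSingleUse = false) {
  if (!AssumeSingleUse && !Ld.HasOneUse)
    return false;
  if (!Ld.IsNormalLoad)
    return false;
  // A 128-bit load with less than 16-byte alignment is only a folding
  // candidate if the target has AVX or tolerates unaligned SSE operands.
  if (!ST.HasAVX && !ST.HasSSEUnalignedMem && Ld.SizeInBits == 128 &&
      Ld.AlignBytes < 16)
    return false;
  return true;
}

constexpr LoadModel UnderAligned{true, true, 128, 4};
constexpr LoadModel Aligned{true, true, 128, 16};
constexpr SubtargetModel PlainSSE{false, false};
constexpr SubtargetModel SSEUnaligned{false, true};
constexpr SubtargetModel AVX{true, false};

static_assert(!mayFoldLoadModel(UnderAligned, PlainSSE), "rejected pre-AVX");
static_assert(mayFoldLoadModel(Aligned, PlainSSE), "aligned load still folds");
static_assert(mayFoldLoadModel(UnderAligned, SSEUnaligned), "unaligned-mem target");
static_assert(mayFoldLoadModel(UnderAligned, AVX), "AVX target");
```

Only the first case changes relative to the old `MayFoldLoad`, which looked at the use count and load kind alone.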

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision. Oct 26 2021, 8:08 AM

Herald added subscribers: hiraditya, mcrosier. Oct 26 2021, 8:08 AM

spatel requested review of this revision. Oct 26 2021, 8:08 AM

Herald added a project: Restricted Project. Oct 26 2021, 8:08 AM

Harbormaster completed remote builds in B130709: Diff 382320. Oct 26 2021, 8:09 AM

spatel mentioned this in D112464: [x86] limit vector increment fold to allow load folding. Oct 26 2021, 8:12 AM

craig.topper added inline comments. Oct 26 2021, 8:19 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
5050	What about Subtarget->hasSSEUnalignedMem()?

spatel marked an inline comment as done. Oct 26 2021, 8:47 AM

spatel added inline comments.

llvm/lib/Target/X86/X86ISelLowering.cpp
5050	Ah, forgot about that possibility. I'll add the clause along with another RUN line to test it.

Patch updated:
Added clause for hasSSEUnalignedMem() and adjusted test file with extra RUN line for that.

Harbormaster completed remote builds in B130727: Diff 382351. Oct 26 2021, 8:51 AM

LGTM

This revision is now accepted and ready to land. Oct 26 2021, 2:22 PM

This revision was landed with ongoing or failed builds. Oct 27 2021, 4:54 AM

Closed by commit rG6c0a2c2804c0: [x86] enhance mayFoldLoad to check alignment (authored by spatel).

This revision was automatically updated to reflect the committed changes.

spatel added a commit: rG6c0a2c2804c0: [x86] enhance mayFoldLoad to check alignment.

Revision Contents

Path	Size
llvm/lib/Target/X86/X86ISelLowering.cpp	64 lines
llvm/test/CodeGen/X86/oddshuffles.ll	66 lines
llvm/test/CodeGen/X86/vec_insert-5.ll	18 lines

Diff 382612

llvm/lib/Target/X86/X86ISelLowering.cpp

Show First 20 Lines • Show All 5,033 Lines • ▼ Show 20 Lines	X86TargetLowering::createFastISel(FunctionLoweringInfo &funcInfo,
const TargetLibraryInfo *libInfo) const {		const TargetLibraryInfo *libInfo) const {
return X86::createFastISel(funcInfo, libInfo);		return X86::createFastISel(funcInfo, libInfo);
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Other Lowering Hooks		// Other Lowering Hooks
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

static bool MayFoldLoad(SDValue Op, bool AssumeSingleUse = false) {		static bool mayFoldLoad(SDValue Op, const X86Subtarget &Subtarget,
return (AssumeSingleUse || Op.hasOneUse()) && ISD::isNormalLoad(Op.getNode());		bool AssumeSingleUse = false) {
		if (!AssumeSingleUse && !Op.hasOneUse())
		return false;
		if (!ISD::isNormalLoad(Op.getNode()))
		return false;

		// If this is an unaligned vector, make sure the target supports folding it.
		auto *Ld = cast<LoadSDNode>(Op.getNode());
		if (!Subtarget.hasAVX() && !Subtarget.hasSSEUnalignedMem() &&
		Ld->getValueSizeInBits(0) == 128 && Ld->getAlignment() < 16)
		return false;

		// TODO: If this is a non-temporal load and the target has an instruction
		// for it, it should not be folded. See "useNonTemporalLoad()".

		return true;
}		}

static bool MayFoldLoadIntoBroadcastFromMem(SDValue Op, MVT EltVT,		static bool mayFoldLoadIntoBroadcastFromMem(SDValue Op, MVT EltVT,
		const X86Subtarget &Subtarget,
bool AssumeSingleUse = false) {		bool AssumeSingleUse = false) {
if (!MayFoldLoad(Op, AssumeSingleUse))		assert(Subtarget.hasAVX() && "Expected AVX for broadcast from memory");
		if (!mayFoldLoad(Op, Subtarget, AssumeSingleUse))
return false;		return false;

// We can not replace a wide volatile load with a broadcast-from-memory,		// We can not replace a wide volatile load with a broadcast-from-memory,
// because that would narrow the load, which isn't legal for volatiles.		// because that would narrow the load, which isn't legal for volatiles.
auto *Ld = cast<LoadSDNode>(Op.getNode());		auto *Ld = cast<LoadSDNode>(Op.getNode());
return !Ld->isVolatile() ||		return !Ld->isVolatile() ||
Ld->getValueSizeInBits(0) == EltVT.getScalarSizeInBits();		Ld->getValueSizeInBits(0) == EltVT.getScalarSizeInBits();
}		}
▲ Show 20 Lines • Show All 3,934 Lines • ▼ Show 20 Lines	for (unsigned SubElems = 1; SubElems < NumElems; SubElems *= 2) {
if (SDValue RepeatLoad = EltsFromConsecutiveLoads(		if (SDValue RepeatLoad = EltsFromConsecutiveLoads(
RepeatVT, RepeatedLoads, DL, DAG, Subtarget, IsAfterLegalize)) {		RepeatVT, RepeatedLoads, DL, DAG, Subtarget, IsAfterLegalize)) {
SDValue Broadcast = RepeatLoad;		SDValue Broadcast = RepeatLoad;
if (RepeatSize > ScalarSize) {		if (RepeatSize > ScalarSize) {
while (Broadcast.getValueSizeInBits() < VT.getSizeInBits())		while (Broadcast.getValueSizeInBits() < VT.getSizeInBits())
Broadcast = concatSubVectors(Broadcast, Broadcast, DAG, DL);		Broadcast = concatSubVectors(Broadcast, Broadcast, DAG, DL);
} else {		} else {
if (!Subtarget.hasAVX2() &&		if (!Subtarget.hasAVX2() &&
!MayFoldLoadIntoBroadcastFromMem(		!mayFoldLoadIntoBroadcastFromMem(
RepeatLoad, RepeatVT.getScalarType().getSimpleVT(),		RepeatLoad, RepeatVT.getScalarType().getSimpleVT(),
		Subtarget,
/*AssumeSingleUse=*/true))		/*AssumeSingleUse=*/true))
return SDValue();		return SDValue();
Broadcast =		Broadcast =
DAG.getNode(X86ISD::VBROADCAST, DL, BroadcastVT, RepeatLoad);		DAG.getNode(X86ISD::VBROADCAST, DL, BroadcastVT, RepeatLoad);
}		}
return DAG.getBitcast(VT, Broadcast);		return DAG.getBitcast(VT, Broadcast);
}		}
}		}
▲ Show 20 Lines • Show All 3,713 Lines • ▼ Show 20 Lines	static SDValue lowerShuffleAsDecomposedShuffleMerge(

// If we effectively only demand the 0'th element of \p Input, and not only		// If we effectively only demand the 0'th element of \p Input, and not only
// as 0'th element, then broadcast said input,		// as 0'th element, then broadcast said input,
// and change \p InputMask to be a no-op (identity) mask.		// and change \p InputMask to be a no-op (identity) mask.
auto canonicalizeBroadcastableInput = [DL, VT, &Subtarget,		auto canonicalizeBroadcastableInput = [DL, VT, &Subtarget,
&DAG](SDValue &Input,		&DAG](SDValue &Input,
MutableArrayRef<int> InputMask) {		MutableArrayRef<int> InputMask) {
unsigned EltSizeInBits = Input.getScalarValueSizeInBits();		unsigned EltSizeInBits = Input.getScalarValueSizeInBits();
if (!Subtarget.hasAVX2() &&		if (!Subtarget.hasAVX2() && (!Subtarget.hasAVX() || EltSizeInBits < 32 ||
(!Subtarget.hasAVX() || EltSizeInBits < 32 || !MayFoldLoad(Input)))		!mayFoldLoad(Input, Subtarget)))
return;		return;
if (isNoopShuffleMask(InputMask))		if (isNoopShuffleMask(InputMask))
return;		return;
assert(isBroadcastShuffleMask(InputMask) &&		assert(isBroadcastShuffleMask(InputMask) &&
"Expected to demand only the 0'th element.");		"Expected to demand only the 0'th element.");
Input = DAG.getNode(X86ISD::VBROADCAST, DL, VT, Input);		Input = DAG.getNode(X86ISD::VBROADCAST, DL, VT, Input);
for (auto I : enumerate(InputMask)) {		for (auto I : enumerate(InputMask)) {
int &InputMaskElt = I.value();		int &InputMaskElt = I.value();
▲ Show 20 Lines • Show All 3,668 Lines • ▼ Show 20 Lines	static SDValue lowerV2X128Shuffle(const SDLoc &DL, MVT VT, SDValue V1,
const APInt &Zeroable,		const APInt &Zeroable,
const X86Subtarget &Subtarget,		const X86Subtarget &Subtarget,
SelectionDAG &DAG) {		SelectionDAG &DAG) {
if (V2.isUndef()) {		if (V2.isUndef()) {
// Attempt to match VBROADCAST*128 subvector broadcast load.		// Attempt to match VBROADCAST*128 subvector broadcast load.
bool SplatLo = isShuffleEquivalent(Mask, {0, 1, 0, 1}, V1);		bool SplatLo = isShuffleEquivalent(Mask, {0, 1, 0, 1}, V1);
bool SplatHi = isShuffleEquivalent(Mask, {2, 3, 2, 3}, V1);		bool SplatHi = isShuffleEquivalent(Mask, {2, 3, 2, 3}, V1);
if ((SplatLo || SplatHi) && !Subtarget.hasAVX512() && V1.hasOneUse() &&		if ((SplatLo || SplatHi) && !Subtarget.hasAVX512() && V1.hasOneUse() &&
MayFoldLoad(peekThroughOneUseBitcasts(V1))) {		mayFoldLoad(peekThroughOneUseBitcasts(V1), Subtarget)) {
auto *Ld = cast<LoadSDNode>(peekThroughOneUseBitcasts(V1));		auto *Ld = cast<LoadSDNode>(peekThroughOneUseBitcasts(V1));
if (!Ld->isNonTemporal()) {		if (!Ld->isNonTemporal()) {
MVT MemVT = VT.getHalfNumVectorElementsVT();		MVT MemVT = VT.getHalfNumVectorElementsVT();
unsigned Ofs = SplatLo ? 0 : MemVT.getStoreSize();		unsigned Ofs = SplatLo ? 0 : MemVT.getStoreSize();
SDVTList Tys = DAG.getVTList(VT, MVT::Other);		SDVTList Tys = DAG.getVTList(VT, MVT::Other);
SDValue Ptr = DAG.getMemBasePlusOffset(Ld->getBasePtr(),		SDValue Ptr = DAG.getMemBasePlusOffset(Ld->getBasePtr(),
TypeSize::Fixed(Ofs), DL);		TypeSize::Fixed(Ofs), DL);
SDValue Ops[] = {Ld->getChain(), Ptr};		SDValue Ops[] = {Ld->getChain(), Ptr};
▲ Show 20 Lines • Show All 2,983 Lines • ▼ Show 20 Lines	if (VT.is256BitVector() || VT.is512BitVector()) {
assert(isPowerOf2_32(NumEltsIn128) &&		assert(isPowerOf2_32(NumEltsIn128) &&
"Vectors will always have power-of-two number of elements.");		"Vectors will always have power-of-two number of elements.");

// If we are not inserting into the low 128-bit vector chunk,		// If we are not inserting into the low 128-bit vector chunk,
// then prefer the broadcast+blend sequence.		// then prefer the broadcast+blend sequence.
// FIXME: relax the profitability check iff all N1 uses are insertions.		// FIXME: relax the profitability check iff all N1 uses are insertions.
if (!VT.is128BitVector() && IdxVal >= NumEltsIn128 &&		if (!VT.is128BitVector() && IdxVal >= NumEltsIn128 &&
((Subtarget.hasAVX2() && EltSizeInBits != 8) ||		((Subtarget.hasAVX2() && EltSizeInBits != 8) ||
(Subtarget.hasAVX() && (EltSizeInBits >= 32) && MayFoldLoad(N1)))) {		(Subtarget.hasAVX() && (EltSizeInBits >= 32) &&
		mayFoldLoad(N1, Subtarget)))) {
SDValue N1SplatVec = DAG.getSplatBuildVector(VT, dl, N1);		SDValue N1SplatVec = DAG.getSplatBuildVector(VT, dl, N1);
SmallVector<int, 8> BlendMask;		SmallVector<int, 8> BlendMask;
for (unsigned i = 0; i != NumElts; ++i)		for (unsigned i = 0; i != NumElts; ++i)
BlendMask.push_back(i == IdxVal ? i + NumElts : i);		BlendMask.push_back(i == IdxVal ? i + NumElts : i);
return DAG.getVectorShuffle(VT, dl, N0, N1SplatVec, BlendMask);		return DAG.getVectorShuffle(VT, dl, N0, N1SplatVec, BlendMask);
}		}

// Get the desired 128-bit vector chunk.		// Get the desired 128-bit vector chunk.
▲ Show 20 Lines • Show All 56 Lines • ▼ Show 20 Lines	if (EltVT == MVT::f32) {
// these bits. For example (insert (extract, 3), 2) could be matched by		// these bits. For example (insert (extract, 3), 2) could be matched by
// putting the '3' into bits [7:6] of X86ISD::INSERTPS.		// putting the '3' into bits [7:6] of X86ISD::INSERTPS.
// Bits [5:4] of the constant are the destination select. This is the		// Bits [5:4] of the constant are the destination select. This is the
// value of the incoming immediate.		// value of the incoming immediate.
// Bits [3:0] of the constant are the zero mask. The DAG Combiner may		// Bits [3:0] of the constant are the zero mask. The DAG Combiner may
// combine either bitwise AND or insert of float 0.0 to set these bits.		// combine either bitwise AND or insert of float 0.0 to set these bits.

bool MinSize = DAG.getMachineFunction().getFunction().hasMinSize();		bool MinSize = DAG.getMachineFunction().getFunction().hasMinSize();
if (IdxVal == 0 && (!MinSize || !MayFoldLoad(N1))) {		if (IdxVal == 0 && (!MinSize || !mayFoldLoad(N1, Subtarget))) {
// If this is an insertion of 32-bits into the low 32-bits of		// If this is an insertion of 32-bits into the low 32-bits of
// a vector, we prefer to generate a blend with immediate rather		// a vector, we prefer to generate a blend with immediate rather
// than an insertps. Blends are simpler operations in hardware and so		// than an insertps. Blends are simpler operations in hardware and so
// will always have equal or better performance than insertps.		// will always have equal or better performance than insertps.
// But if optimizing for size and there's a load folding opportunity,		// But if optimizing for size and there's a load folding opportunity,
// generate insertps because blendps does not have a 32-bit memory		// generate insertps because blendps does not have a 32-bit memory
// operand form.		// operand form.
N1 = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v4f32, N1);		N1 = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v4f32, N1);
▲ Show 20 Lines • Show All 5,123 Lines • ▼ Show 20 Lines	SDValue X86TargetLowering::LowerSELECT(SDValue Op, SelectionDAG &DAG) const {

// Or finally, promote i8 cmovs if we have CMOV,		// Or finally, promote i8 cmovs if we have CMOV,
// or i16 cmovs if it won't prevent folding a load.		// or i16 cmovs if it won't prevent folding a load.
// FIXME: we should not limit promotion of i8 case to only when the CMOV is		// FIXME: we should not limit promotion of i8 case to only when the CMOV is
// legal, but EmitLoweredSelect() can not deal with these extensions		// legal, but EmitLoweredSelect() can not deal with these extensions
// being inserted between two CMOV's. (in i16 case too TBN)		// being inserted between two CMOV's. (in i16 case too TBN)
// https://bugs.llvm.org/show_bug.cgi?id=40974		// https://bugs.llvm.org/show_bug.cgi?id=40974
if ((Op.getValueType() == MVT::i8 && Subtarget.hasCMov()) ||		if ((Op.getValueType() == MVT::i8 && Subtarget.hasCMov()) ||
(Op.getValueType() == MVT::i16 && !MayFoldLoad(Op1) &&		(Op.getValueType() == MVT::i16 && !mayFoldLoad(Op1, Subtarget) &&
!MayFoldLoad(Op2))) {		!mayFoldLoad(Op2, Subtarget))) {
Op1 = DAG.getNode(ISD::ANY_EXTEND, DL, MVT::i32, Op1);		Op1 = DAG.getNode(ISD::ANY_EXTEND, DL, MVT::i32, Op1);
Op2 = DAG.getNode(ISD::ANY_EXTEND, DL, MVT::i32, Op2);		Op2 = DAG.getNode(ISD::ANY_EXTEND, DL, MVT::i32, Op2);
SDValue Ops[] = { Op2, Op1, CC, Cond };		SDValue Ops[] = { Op2, Op1, CC, Cond };
SDValue Cmov = DAG.getNode(X86ISD::CMOV, DL, MVT::i32, Ops);		SDValue Cmov = DAG.getNode(X86ISD::CMOV, DL, MVT::i32, Ops);
return DAG.getNode(ISD::TRUNCATE, DL, Op.getValueType(), Cmov);		return DAG.getNode(ISD::TRUNCATE, DL, Op.getValueType(), Cmov);
}		}

// X86ISD::CMOV means set the result (which is operand 1) to the RHS if		// X86ISD::CMOV means set the result (which is operand 1) to the RHS if
▲ Show 20 Lines • Show All 12,330 Lines • ▼ Show 20 Lines	if (UnaryShuffle) {
// Attempt to match against broadcast-from-vector.		// Attempt to match against broadcast-from-vector.
// Limit AVX1 to cases where we're loading+broadcasting a scalar element.		// Limit AVX1 to cases where we're loading+broadcasting a scalar element.
if ((Subtarget.hasAVX2() ||		if ((Subtarget.hasAVX2() ||
(Subtarget.hasAVX() && 32 <= MaskEltSizeInBits)) &&		(Subtarget.hasAVX() && 32 <= MaskEltSizeInBits)) &&
(!IsMaskedShuffle || NumRootElts == NumMaskElts)) {		(!IsMaskedShuffle || NumRootElts == NumMaskElts)) {
if (isUndefOrEqual(Mask, 0)) {		if (isUndefOrEqual(Mask, 0)) {
if (V1.getValueType() == MaskVT &&		if (V1.getValueType() == MaskVT &&
V1.getOpcode() == ISD::SCALAR_TO_VECTOR &&		V1.getOpcode() == ISD::SCALAR_TO_VECTOR &&
MayFoldLoad(V1.getOperand(0))) {		mayFoldLoad(V1.getOperand(0), Subtarget)) {
if (Depth == 0 && Root.getOpcode() == X86ISD::VBROADCAST)		if (Depth == 0 && Root.getOpcode() == X86ISD::VBROADCAST)
return SDValue(); // Nothing to do!		return SDValue(); // Nothing to do!
Res = V1.getOperand(0);		Res = V1.getOperand(0);
Res = DAG.getNode(X86ISD::VBROADCAST, DL, MaskVT, Res);		Res = DAG.getNode(X86ISD::VBROADCAST, DL, MaskVT, Res);
return DAG.getBitcast(RootVT, Res);		return DAG.getBitcast(RootVT, Res);
}		}
if (Subtarget.hasAVX2()) {		if (Subtarget.hasAVX2()) {
if (Depth == 0 && Root.getOpcode() == X86ISD::VBROADCAST)		if (Depth == 0 && Root.getOpcode() == X86ISD::VBROADCAST)
▲ Show 20 Lines • Show All 1,424 Lines • ▼ Show 20 Lines	static SDValue combineCommutableSHUFP(SDValue N, MVT VT, const SDLoc &DL,

// SHUFP(LHS, RHS) -> SHUFP(RHS, LHS) iff LHS is foldable + RHS is not.		// SHUFP(LHS, RHS) -> SHUFP(RHS, LHS) iff LHS is foldable + RHS is not.
auto commuteSHUFP = [&VT, &DL, &DAG](SDValue Parent, SDValue V) {		auto commuteSHUFP = [&VT, &DL, &DAG](SDValue Parent, SDValue V) {
if (V.getOpcode() != X86ISD::SHUFP || !Parent->isOnlyUserOf(V.getNode()))		if (V.getOpcode() != X86ISD::SHUFP || !Parent->isOnlyUserOf(V.getNode()))
return SDValue();		return SDValue();
SDValue N0 = V.getOperand(0);		SDValue N0 = V.getOperand(0);
SDValue N1 = V.getOperand(1);		SDValue N1 = V.getOperand(1);
unsigned Imm = V.getConstantOperandVal(2);		unsigned Imm = V.getConstantOperandVal(2);
if (!MayFoldLoad(peekThroughOneUseBitcasts(N0)) ||		const X86Subtarget &Subtarget =
MayFoldLoad(peekThroughOneUseBitcasts(N1)))		static_cast<const X86Subtarget &>(DAG.getSubtarget());
		if (!mayFoldLoad(peekThroughOneUseBitcasts(N0), Subtarget) ||
		mayFoldLoad(peekThroughOneUseBitcasts(N1), Subtarget))
return SDValue();		return SDValue();
Imm = ((Imm & 0x0F) << 4) | ((Imm & 0xF0) >> 4);		Imm = ((Imm & 0x0F) << 4) | ((Imm & 0xF0) >> 4);
return DAG.getNode(X86ISD::SHUFP, DL, VT, N1, N0,		return DAG.getNode(X86ISD::SHUFP, DL, VT, N1, N0,
DAG.getTargetConstant(Imm, DL, MVT::i8));		DAG.getTargetConstant(Imm, DL, MVT::i8));
};		};

switch (N.getOpcode()) {		switch (N.getOpcode()) {
case X86ISD::VPERMILPI:		case X86ISD::VPERMILPI:
▲ Show 20 Lines • Show All 13,219 Lines • ▼ Show 20 Lines	if (auto *Ld = dyn_cast<LoadSDNode>(Op0)) {
extractSubVector(BcastLd, 0, DAG, DL, Op0.getValueSizeInBits()));		extractSubVector(BcastLd, 0, DAG, DL, Op0.getValueSizeInBits()));
DAG.ReplaceAllUsesOfValueWith(SDValue(Ld, 1), BcastLd.getValue(1));		DAG.ReplaceAllUsesOfValueWith(SDValue(Ld, 1), BcastLd.getValue(1));
return BcastLd;		return BcastLd;
}		}
}		}

// concat_vectors(movddup(x),movddup(x)) -> broadcast(x)		// concat_vectors(movddup(x),movddup(x)) -> broadcast(x)
if (Op0.getOpcode() == X86ISD::MOVDDUP && VT == MVT::v4f64 &&		if (Op0.getOpcode() == X86ISD::MOVDDUP && VT == MVT::v4f64 &&
(Subtarget.hasAVX2() || MayFoldLoadIntoBroadcastFromMem(		(Subtarget.hasAVX2() ||
Op0.getOperand(0), VT.getScalarType())))		mayFoldLoadIntoBroadcastFromMem(Op0.getOperand(0), VT.getScalarType(),
		Subtarget)))
return DAG.getNode(X86ISD::VBROADCAST, DL, VT,		return DAG.getNode(X86ISD::VBROADCAST, DL, VT,
DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f64,		DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f64,
Op0.getOperand(0),		Op0.getOperand(0),
DAG.getIntPtrConstant(0, DL)));		DAG.getIntPtrConstant(0, DL)));

// concat_vectors(scalar_to_vector(x),scalar_to_vector(x)) -> broadcast(x)		// concat_vectors(scalar_to_vector(x),scalar_to_vector(x)) -> broadcast(x)
if (Op0.getOpcode() == ISD::SCALAR_TO_VECTOR &&		if (Op0.getOpcode() == ISD::SCALAR_TO_VECTOR &&
(Subtarget.hasAVX2() ||		(Subtarget.hasAVX2() ||
(EltSizeInBits >= 32 && MayFoldLoad(Op0.getOperand(0)))) &&		(EltSizeInBits >= 32 && mayFoldLoad(Op0.getOperand(0), Subtarget))) &&
Op0.getOperand(0).getValueType() == VT.getScalarType())		Op0.getOperand(0).getValueType() == VT.getScalarType())
return DAG.getNode(X86ISD::VBROADCAST, DL, VT, Op0.getOperand(0));		return DAG.getNode(X86ISD::VBROADCAST, DL, VT, Op0.getOperand(0));

// concat_vectors(extract_subvector(broadcast(x)),		// concat_vectors(extract_subvector(broadcast(x)),
// extract_subvector(broadcast(x))) -> broadcast(x)		// extract_subvector(broadcast(x))) -> broadcast(x)
if (Op0.getOpcode() == ISD::EXTRACT_SUBVECTOR &&		if (Op0.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
Op0.getOperand(0).getValueType() == VT) {		Op0.getOperand(0).getValueType() == VT) {
if (Op0.getOperand(0).getOpcode() == X86ISD::VBROADCAST ||		if (Op0.getOperand(0).getOpcode() == X86ISD::VBROADCAST ||
▲ Show 20 Lines • Show All 1,315 Lines • ▼ Show 20 Lines	bool X86TargetLowering::IsDesirableToPromoteOp(SDValue Op, EVT &PVT) const {
case ISD::ZERO_EXTEND:		case ISD::ZERO_EXTEND:
case ISD::ANY_EXTEND:		case ISD::ANY_EXTEND:
break;		break;
case ISD::SHL:		case ISD::SHL:
case ISD::SRA:		case ISD::SRA:
case ISD::SRL: {		case ISD::SRL: {
SDValue N0 = Op.getOperand(0);		SDValue N0 = Op.getOperand(0);
// Look out for (store (shl (load), x)).		// Look out for (store (shl (load), x)).
if (MayFoldLoad(N0) && IsFoldableRMW(N0, Op))		if (mayFoldLoad(N0, Subtarget) && IsFoldableRMW(N0, Op))
return false;		return false;
break;		break;
}		}
case ISD::ADD:		case ISD::ADD:
case ISD::MUL:		case ISD::MUL:
case ISD::AND:		case ISD::AND:
case ISD::OR:		case ISD::OR:
case ISD::XOR:		case ISD::XOR:
Commute = true;		Commute = true;
LLVM_FALLTHROUGH;		LLVM_FALLTHROUGH;
case ISD::SUB: {		case ISD::SUB: {
SDValue N0 = Op.getOperand(0);		SDValue N0 = Op.getOperand(0);
SDValue N1 = Op.getOperand(1);		SDValue N1 = Op.getOperand(1);
// Avoid disabling potential load folding opportunities.		// Avoid disabling potential load folding opportunities.
if (MayFoldLoad(N1) &&		if (mayFoldLoad(N1, Subtarget) &&
(!Commute || !isa<ConstantSDNode>(N0) ||		(!Commute || !isa<ConstantSDNode>(N0) ||
(Op.getOpcode() != ISD::MUL && IsFoldableRMW(N1, Op))))		(Op.getOpcode() != ISD::MUL && IsFoldableRMW(N1, Op))))
return false;		return false;
if (MayFoldLoad(N0) &&		if (mayFoldLoad(N0, Subtarget) &&
((Commute && !isa<ConstantSDNode>(N1)) ||		((Commute && !isa<ConstantSDNode>(N1)) ||
(Op.getOpcode() != ISD::MUL && IsFoldableRMW(N0, Op))))		(Op.getOpcode() != ISD::MUL && IsFoldableRMW(N0, Op))))
return false;		return false;
if (IsFoldableAtomicRMW(N0, Op) ||		if (IsFoldableAtomicRMW(N0, Op) ||
(Commute && IsFoldableAtomicRMW(N1, Op)))		(Commute && IsFoldableAtomicRMW(N1, Op)))
return false;		return false;
}		}
}		}
▲ Show 20 Lines • Show All 1,167 Lines • Show Last 20 Lines
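
The immediate rewrite in commuteSHUFP can be checked with a small standalone model of SHUFP on four lanes. `Vec4`, `shufp`, `swapNibbles`, and `halvesSwapped` are hypothetical helpers written for this illustration, not LLVM APIs. Assuming the usual SHUFPS lane selection (two lanes picked from the first operand, then two from the second), swapping the operands together with the immediate's nibbles yields the same lanes with the low and high halves exchanged, so the commuted node is not a drop-in replacement on its own. How the parent shuffle compensates is outside the lines shown in this hunk.

```cpp
#include <array>
#include <cstdint>

using Vec4 = std::array<int, 4>;

// Reference model of a 4-lane SHUFP: lanes 0-1 are selected from A by
// immediate bits [1:0] and [3:2]; lanes 2-3 are selected from B by bits
// [5:4] and [7:6].
constexpr Vec4 shufp(const Vec4 &A, const Vec4 &B, std::uint8_t Imm) {
  return Vec4{A[Imm & 3], A[(Imm >> 2) & 3], B[(Imm >> 4) & 3],
              B[(Imm >> 6) & 3]};
}

// Same expression as in commuteSHUFP: exchange the two 4-bit halves.
constexpr std::uint8_t swapNibbles(std::uint8_t Imm) {
  return static_cast<std::uint8_t>(((Imm & 0x0F) << 4) | ((Imm & 0xF0) >> 4));
}

// True if Y is X with its low and high 64-bit halves exchanged.
constexpr bool halvesSwapped(const Vec4 &X, const Vec4 &Y) {
  return X[0] == Y[2] && X[1] == Y[3] && X[2] == Y[0] && X[3] == Y[1];
}

constexpr Vec4 LHS{10, 11, 12, 13};
constexpr Vec4 RHS{20, 21, 22, 23};
constexpr std::uint8_t Imm = 0x31; // selects [1,0] from A and [3,0] from B

static_assert(swapNibbles(Imm) == 0x13, "nibbles exchanged");
static_assert(halvesSwapped(shufp(LHS, RHS, Imm),
                            shufp(RHS, LHS, swapNibbles(Imm))),
              "commuted SHUFP yields the same lanes with halves swapped");
```

The sample immediate 0x31 corresponds to the `[1,0],[3,0]` mask in the t4_under_aligned checks of vec_insert-5.ll, and its nibble-swapped form 0x13 to `[3,0],[1,0]`.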

llvm/test/CodeGen/X86/oddshuffles.ll

Show First 20 Lines • Show All 1,392 Lines • ▼ Show 20 Lines	; XOP-NEXT: retq
%interleaved = shufflevector <16 x i16> %t1, <16 x i16> %t2, <24 x i32> <i32 0, i32 8, i32 16, i32 1, i32 9, i32 17, i32 2, i32 10, i32 18, i32 3, i32 11, i32 19, i32 4, i32 12, i32 20, i32 5, i32 13, i32 21, i32 6, i32 14, i32 22, i32 7, i32 15, i32 23>		%interleaved = shufflevector <16 x i16> %t1, <16 x i16> %t2, <24 x i32> <i32 0, i32 8, i32 16, i32 1, i32 9, i32 17, i32 2, i32 10, i32 18, i32 3, i32 11, i32 19, i32 4, i32 12, i32 20, i32 5, i32 13, i32 21, i32 6, i32 14, i32 22, i32 7, i32 15, i32 23>
store <24 x i16> %interleaved, <24 x i16>* %p, align 4		store <24 x i16> %interleaved, <24 x i16>* %p, align 4
ret void		ret void
}		}

define void @interleave_24i32_out(<24 x i32>* %p, <8 x i32>* %q1, <8 x i32>* %q2, <8 x i32>* %q3) nounwind {		define void @interleave_24i32_out(<24 x i32>* %p, <8 x i32>* %q1, <8 x i32>* %q2, <8 x i32>* %q3) nounwind {
; SSE2-LABEL: interleave_24i32_out:		; SSE2-LABEL: interleave_24i32_out:
; SSE2: # %bb.0:		; SSE2: # %bb.0:
		; SSE2-NEXT: movdqu 64(%rdi), %xmm9
; SSE2-NEXT: movups 80(%rdi), %xmm8		; SSE2-NEXT: movups 80(%rdi), %xmm8
; SSE2-NEXT: movups 64(%rdi), %xmm3		; SSE2-NEXT: movdqu (%rdi), %xmm0
; SSE2-NEXT: movdqu (%rdi), %xmm1		; SSE2-NEXT: movdqu 16(%rdi), %xmm10
; SSE2-NEXT: movups 16(%rdi), %xmm5		; SSE2-NEXT: movups 32(%rdi), %xmm5
; SSE2-NEXT: movups 32(%rdi), %xmm10		; SSE2-NEXT: movdqu 48(%rdi), %xmm3
; SSE2-NEXT: movdqu 48(%rdi), %xmm2		; SSE2-NEXT: movaps %xmm5, %xmm6
; SSE2-NEXT: movdqa %xmm1, %xmm11		; SSE2-NEXT: pshufd {{.*#+}} xmm7 = xmm0[2,3,2,3]
; SSE2-NEXT: movaps %xmm10, %xmm7		; SSE2-NEXT: pshufd {{.*#+}} xmm4 = xmm10[1,1,1,1]
; SSE2-NEXT: shufps {{.*#+}} xmm7 = xmm7[2,1],xmm5[3,3]		; SSE2-NEXT: punpckldq {{.*#+}} xmm7 = xmm7[0],xmm4[0],xmm7[1],xmm4[1]
; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[2,3,2,3]		; SSE2-NEXT: shufps {{.*#+}} xmm7 = xmm7[0,1],xmm5[0,3]
; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[1,0],xmm5[0,0]		; SSE2-NEXT: shufps {{.*#+}} xmm5 = xmm5[1,1],xmm10[2,3]
; SSE2-NEXT: pshufd {{.*#+}} xmm9 = xmm5[1,1,1,1]		; SSE2-NEXT: movdqa %xmm0, %xmm4
; SSE2-NEXT: shufps {{.*#+}} xmm5 = xmm5[2,3],xmm10[1,1]		; SSE2-NEXT: shufps {{.*#+}} xmm4 = xmm4[0,3],xmm5[2,0]
; SSE2-NEXT: shufps {{.*#+}} xmm11 = xmm11[0,3],xmm5[0,2]		; SSE2-NEXT: movaps %xmm8, %xmm5
; SSE2-NEXT: movdqa %xmm2, %xmm5		; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm3[2,3,2,3]
; SSE2-NEXT: movaps %xmm8, %xmm4		; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm9[1,1,1,1]
; SSE2-NEXT: shufps {{.*#+}} xmm4 = xmm4[2,1],xmm3[3,3]		; SSE2-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1]
; SSE2-NEXT: pshufd {{.*#+}} xmm6 = xmm2[2,3,2,3]		; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,1],xmm8[0,3]
; SSE2-NEXT: shufps {{.*#+}} xmm2 = xmm2[1,0],xmm3[0,0]		; SSE2-NEXT: shufps {{.*#+}} xmm8 = xmm8[1,1],xmm9[2,3]
; SSE2-NEXT: pshufd {{.*#+}} xmm12 = xmm3[1,1,1,1]		; SSE2-NEXT: movdqa %xmm3, %xmm2
; SSE2-NEXT: shufps {{.*#+}} xmm3 = xmm3[2,3],xmm8[1,1]		; SSE2-NEXT: shufps {{.*#+}} xmm2 = xmm2[0,3],xmm8[2,0]
; SSE2-NEXT: shufps {{.*#+}} xmm5 = xmm5[0,3],xmm3[0,2]		; SSE2-NEXT: shufps {{.*#+}} xmm5 = xmm5[2,1],xmm9[3,3]
; SSE2-NEXT: shufps {{.*#+}} xmm2 = xmm2[0,2],xmm4[2,0]		; SSE2-NEXT: shufps {{.*#+}} xmm3 = xmm3[1,0],xmm9[0,0]
; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,2],xmm7[2,0]		; SSE2-NEXT: shufps {{.*#+}} xmm3 = xmm3[0,2],xmm5[2,0]
; SSE2-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm9[0],xmm0[1],xmm9[1]		; SSE2-NEXT: shufps {{.*#+}} xmm6 = xmm6[2,1],xmm10[3,3]
; SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,1],xmm10[0,3]		; SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,0],xmm10[0,0]
; SSE2-NEXT: punpckldq {{.*#+}} xmm6 = xmm6[0],xmm12[0],xmm6[1],xmm12[1]		; SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2],xmm6[2,0]
; SSE2-NEXT: shufps {{.*#+}} xmm6 = xmm6[0,1],xmm8[0,3]		; SSE2-NEXT: movups %xmm2, 16(%rsi)
; SSE2-NEXT: movups %xmm5, 16(%rsi)		; SSE2-NEXT: movups %xmm4, (%rsi)
; SSE2-NEXT: movups %xmm11, (%rsi)		; SSE2-NEXT: movups %xmm3, 16(%rdx)
; SSE2-NEXT: movups %xmm2, 16(%rdx)		; SSE2-NEXT: movups %xmm0, (%rdx)
; SSE2-NEXT: movups %xmm1, (%rdx)		; SSE2-NEXT: movups %xmm1, 16(%rcx)
; SSE2-NEXT: movups %xmm6, 16(%rcx)		; SSE2-NEXT: movups %xmm7, (%rcx)
; SSE2-NEXT: movups %xmm0, (%rcx)
; SSE2-NEXT: retq		; SSE2-NEXT: retq
;		;
; SSE42-LABEL: interleave_24i32_out:		; SSE42-LABEL: interleave_24i32_out:
; SSE42: # %bb.0:		; SSE42: # %bb.0:
; SSE42-NEXT: movups 80(%rdi), %xmm8		; SSE42-NEXT: movups 80(%rdi), %xmm8
; SSE42-NEXT: movdqu 64(%rdi), %xmm9		; SSE42-NEXT: movdqu 64(%rdi), %xmm9
; SSE42-NEXT: movdqu (%rdi), %xmm3		; SSE42-NEXT: movdqu (%rdi), %xmm3
; SSE42-NEXT: movdqu 16(%rdi), %xmm2		; SSE42-NEXT: movdqu 16(%rdi), %xmm2
▲ Show 20 Lines • Show All 1,118 Lines • Show Last 20 Lines

llvm/test/CodeGen/X86/vec_insert-5.ll

Show First 20 Lines • Show All 91 Lines • ▼ Show 20 Lines	; X64-NEXT: retq
%tmp2 = shufflevector <4 x float> zeroinitializer, <4 x float> %tmp1, <4 x i32> < i32 7, i32 0, i32 0, i32 0 >		%tmp2 = shufflevector <4 x float> zeroinitializer, <4 x float> %tmp1, <4 x i32> < i32 7, i32 0, i32 0, i32 0 >
ret <4 x float> %tmp2		ret <4 x float> %tmp2
}		}

define <4 x float> @t4_under_aligned(<4 x float>* %P) nounwind {		define <4 x float> @t4_under_aligned(<4 x float>* %P) nounwind {
; X32-LABEL: t4_under_aligned:		; X32-LABEL: t4_under_aligned:
; X32: # %bb.0:		; X32: # %bb.0:
; X32-NEXT: movl {{[0-9]+}}(%esp), %eax		; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
; X32-NEXT: movups (%eax), %xmm1		; X32-NEXT: movups (%eax), %xmm0
; X32-NEXT: xorps %xmm2, %xmm2		; X32-NEXT: xorps %xmm1, %xmm1
; X32-NEXT: xorps %xmm0, %xmm0		; X32-NEXT: shufps {{.*#+}} xmm0 = xmm0[3,0],xmm1[1,0]
; X32-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,0],xmm1[3,0]		; X32-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[2,3]
; X32-NEXT: shufps {{.*#+}} xmm0 = xmm0[2,0],xmm2[2,3]
; X32-NEXT: retl		; X32-NEXT: retl
;		;
; ALIGN-LABEL: t4_under_aligned:		; ALIGN-LABEL: t4_under_aligned:
; ALIGN: # %bb.0:		; ALIGN: # %bb.0:
; ALIGN-NEXT: movups (%rdi), %xmm1		; ALIGN-NEXT: movups (%rdi), %xmm0
; ALIGN-NEXT: xorps %xmm2, %xmm2		; ALIGN-NEXT: xorps %xmm1, %xmm1
; ALIGN-NEXT: xorps %xmm0, %xmm0		; ALIGN-NEXT: shufps {{.*#+}} xmm0 = xmm0[3,0],xmm1[1,0]
; ALIGN-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,0],xmm1[3,0]		; ALIGN-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[2,3]
; ALIGN-NEXT: shufps {{.*#+}} xmm0 = xmm0[2,0],xmm2[2,3]
; ALIGN-NEXT: retq		; ALIGN-NEXT: retq
;		;
; UNALIGN-LABEL: t4_under_aligned:		; UNALIGN-LABEL: t4_under_aligned:
; UNALIGN: # %bb.0:		; UNALIGN: # %bb.0:
; UNALIGN-NEXT: xorps %xmm1, %xmm1		; UNALIGN-NEXT: xorps %xmm1, %xmm1
; UNALIGN-NEXT: xorps %xmm0, %xmm0		; UNALIGN-NEXT: xorps %xmm0, %xmm0
; UNALIGN-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,0],mem[3,0]		; UNALIGN-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,0],mem[3,0]
; UNALIGN-NEXT: shufps {{.*#+}} xmm0 = xmm0[2,0],xmm1[2,3]		; UNALIGN-NEXT: shufps {{.*#+}} xmm0 = xmm0[2,0],xmm1[2,3]
▲ Show 20 Lines • Show All 75 Lines • Show Last 20 Lines