This is an archive of the discontinued LLVM Phabricator instance.

Differential D106280

[X86][AVX] scalar_to_vector(load_scalar()) -> load_vector() for fast dereferencable loads
Abandoned, Public

Authored by RKSimon on Jul 19 2021, 7:43 AM.

Details

Reviewers

craig.topper
pengfei
spatel

Summary

As reported on PR51075, we fail to make use of dereferenceable 128-bit vector loads for float2 loads that are then widened for float4 operations, which prevents a useful load-fold.

We already do a similar fold for insert_subvector patterns of 128-bit loads with 256-bit dereferenceable pointers.
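
As a rough illustration of the summary, here is a minimal scalar C++ model of the two load shapes. It is not LLVM code: the names `Vec4`, `loadFloat2Zext` and `loadFloat4` are hypothetical, and the sketch assumes the pointer addresses at least 16 readable bytes, which is what `dereferenceable(16)` promises in the IR.

```cpp
#include <array>
#include <cassert>
#include <cstring>

using Vec4 = std::array<float, 4>;

// Narrow load: read 8 bytes and leave the upper two lanes zero,
// modelling the vmovsd form seen in the AVX test checks.
Vec4 loadFloat2Zext(const float *P) {
  assert(P && "expects a valid pointer");
  Vec4 V{}; // all lanes start at zero
  std::memcpy(V.data(), P, 2 * sizeof(float));
  return V;
}

// Wide load: read all 16 bytes. Only valid when the caller guarantees
// 16 dereferenceable bytes at P (an assumption of this sketch).
Vec4 loadFloat4(const float *P) {
  assert(P && "expects a valid pointer");
  Vec4 V;
  std::memcpy(V.data(), P, sizeof(V));
  return V;
}
```

Lanes 0 and 1 are identical for both functions; only the upper lanes differ (zero versus whatever follows in memory), which is why the wide form is only a valid replacement when the upper lanes are undefined in the DAG.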

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	2,880 ms	x64 debian > libarcher.barrier::barrier.c
	2,700 ms	x64 debian > libarcher.critical::critical.c
	2,730 ms	x64 debian > libarcher.critical::lock.c
	3,210 ms	x64 debian > libarcher.races::critical-unrelated.c
	3,120 ms	x64 debian > libarcher.races::lock-nested-unrelated.c
		17 tests failed in total.

Event Timeline

RKSimon created this revision. Jul 19 2021, 7:43 AM

Herald added a subscriber: hiraditya. Jul 19 2021, 7:43 AM

RKSimon requested review of this revision. Jul 19 2021, 7:43 AM

Herald added a project: Restricted Project. Jul 19 2021, 7:43 AM

RKSimon mentioned this in D105390: [X86] Lower insertions into upper half of an 256-bit vector as broadcast+blend (PR50971). Jul 19 2021, 7:46 AM

lebedev.ri added a subscriber: lebedev.ri. Jul 19 2021, 7:49 AM

lebedev.ri added inline comments.

llvm/lib/Target/X86/X86ISelLowering.cpp
8611	Please can you precommit this case change

Harbormaster completed remote builds in B114862: Diff 359790. Jul 19 2021, 8:47 AM

RKSimon mentioned this in rG142e60f40b50: [X86] Fix case of IsAfterLegalize argument. NFC. Jul 19 2021, 9:16 AM

rebase after rG142e60f40b50

efriedma added a subscriber: efriedma. Jul 19 2021, 9:33 AM

efriedma added inline comments.

llvm/test/CodeGen/X86/load-partial-dot-product.ll
183	Even if we're allowed to do this, it doesn't seem wise; having zero in the high bits of the register is better than random junk. Can we mark up the loads somehow?

RKSimon added inline comments. Jul 19 2021, 9:43 AM

llvm/test/CodeGen/X86/load-partial-dot-product.ll
183	Isn't that what the dereferenceable(16) tag is doing?

Harbormaster completed remote builds in B114883: Diff 359823. Jul 19 2021, 10:50 AM

pengfei added inline comments. Jul 19 2021, 5:46 PM

llvm/test/CodeGen/X86/load-partial-dot-product.ll
183	I have the same doubt. `dereferenceable(16)` tells us that the memory for the high bits is available. But shouldn't we always prefer loading fewer bytes for performance?

I'll have a think about possible alternatives for now

Please note that this patch is very partial.
The actual assembly diff should be as follows:
https://godbolt.org/z/W47nvzc4e

I think it is clear from it that the wide load is unquestionably better.

lebedev.ri added inline comments.

llvm/test/CodeGen/X86/load-partial-dot-product.ll
183	You are comparing apples to oranges here. The problem here is that `vinsertps` is (obviously) redundant and should go away. Then it's obviously better - we have one less memory access.

pengfei added inline comments. Jul 20 2021, 6:09 AM

llvm/test/CodeGen/X86/load-partial-dot-product.ll
183	I see it now. It makes sense if it wants to turn `vmovsd {{.*#+}} xmm0 = mem[0],zero` followed by `vinsertps {{.*#+}} xmm0 = xmm0[0,1],mem[0],xmm0[3]` into `vmovups (%rdi), %xmm0`
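
The float3 case discussed in this thread can be modelled in plain C++ to show exactly which lanes agree. This is only a sketch: `Vec4`, `loadFloat3Narrow` and `loadFloat3Wide` are hypothetical names, and the wide variant assumes 16 readable bytes at the pointer.

```cpp
#include <array>
#include <cassert>
#include <cstring>

using Vec4 = std::array<float, 4>;

// Two-instruction form: vmovsd fills lanes 0-1 and zeroes lanes 2-3,
// then vinsertps overwrites lane 2 with the float at byte offset 8.
Vec4 loadFloat3Narrow(const float *P) {
  assert(P && "expects a valid pointer");
  Vec4 V{};
  std::memcpy(V.data(), P, 2 * sizeof(float)); // models vmovsd
  V[2] = P[2];                                  // models vinsertps
  return V;                                     // lane 3 stays zero
}

// Single 16-byte load, modelling vmovups. Lane 3 now holds whatever
// follows the float3 in memory.
Vec4 loadFloat3Wide(const float *P) {
  assert(P && "expects a valid pointer");
  Vec4 V;
  std::memcpy(V.data(), P, sizeof(V));
  return V;
}
```

Lanes 0 to 2 match in both versions. Lane 3 is the only difference, which is the "zero versus junk in the high bits" point raised earlier in the review.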

Took a stab at VectorCombine side of the puzzle: D106399

RKSimon mentioned this in D106399: [VectorCombine] Widening of partial vector loads. Jul 21 2021, 4:59 AM

Should this wait for D106447?

In D106280#2925306, @RKSimon wrote:

Should this wait for D106447?

No, probably not.

RKSimon abandoned this revision. Jan 25 2022, 5:07 AM

Revision Contents

	Path	Size
	llvm/lib/Target/X86/X86ISelLowering.cpp	53 lines
	llvm/test/CodeGen/X86/load-partial-dot-product.ll	14 lines
	llvm/test/CodeGen/X86/load-partial.ll	2 lines
	llvm/test/CodeGen/X86/masked_gather.ll	6 lines

Diff 359790

llvm/lib/Target/X86/X86ISelLowering.cpp

	[8,471 lines not shown]
	/// Given the initializing elements 'Elts' of a vector of type 'VT', see if the			/// Given the initializing elements 'Elts' of a vector of type 'VT', see if the
	/// elements can be replaced by a single large load which has the same value as			/// elements can be replaced by a single large load which has the same value as
	/// a build_vector or insert_subvector whose loaded operands are 'Elts'.			/// a build_vector or insert_subvector whose loaded operands are 'Elts'.
	///			///
	/// Example: <load i32 a, load i32 a+4, zero, undef> -> zextload a			/// Example: <load i32 a, load i32 a+4, zero, undef> -> zextload a
	static SDValue EltsFromConsecutiveLoads(EVT VT, ArrayRef<SDValue> Elts,			static SDValue EltsFromConsecutiveLoads(EVT VT, ArrayRef<SDValue> Elts,
	const SDLoc &DL, SelectionDAG &DAG,			const SDLoc &DL, SelectionDAG &DAG,
	const X86Subtarget &Subtarget,			const X86Subtarget &Subtarget,
	bool isAfterLegalize) {			bool IsAfterLegalize,
				bool VectorLoadsOnly = false) {
	if ((VT.getScalarSizeInBits() % 8) != 0)			if ((VT.getScalarSizeInBits() % 8) != 0)
	return SDValue();			return SDValue();

	unsigned NumElems = Elts.size();			unsigned NumElems = Elts.size();

	int LastLoadedElt = -1;			int LastLoadedElt = -1;
	APInt LoadMask = APInt::getNullValue(NumElems);			APInt LoadMask = APInt::getNullValue(NumElems);
	APInt ZeroMask = APInt::getNullValue(NumElems);			APInt ZeroMask = APInt::getNullValue(NumElems);
	[113 lines not shown]
	// LOAD - all consecutive load/undefs (must start/end with a load or be			// LOAD - all consecutive load/undefs (must start/end with a load or be
	// entirely dereferenceable). If we have found an entire vector of loads and			// entirely dereferenceable). If we have found an entire vector of loads and
	// undefs, then return a large load of the entire vector width starting at the			// undefs, then return a large load of the entire vector width starting at the
	// base pointer. If the vector contains zeros, then attempt to shuffle those			// base pointer. If the vector contains zeros, then attempt to shuffle those
	// elements.			// elements.
	if (FirstLoadedElt == 0 &&			if (FirstLoadedElt == 0 &&
	(NumLoadedElts == (int)NumElems || IsDereferenceable) &&			(NumLoadedElts == (int)NumElems || IsDereferenceable) &&
	(IsConsecutiveLoad || IsConsecutiveLoadWithZeros)) {			(IsConsecutiveLoad || IsConsecutiveLoadWithZeros)) {
	if (isAfterLegalize && !TLI.isOperationLegal(ISD::LOAD, VT))			if (IsAfterLegalize && !TLI.isOperationLegal(ISD::LOAD, VT))
				lebedev.ri: Please can you precommit this case change
	return SDValue();			return SDValue();

	// Don't create 256-bit non-temporal aligned loads without AVX2 as these			// Don't create 256-bit non-temporal aligned loads without AVX2 as these
	// will lower to regular temporal loads and use the cache.			// will lower to regular temporal loads and use the cache.
	if (LDBase->isNonTemporal() && LDBase->getAlignment() >= 32 &&			if (LDBase->isNonTemporal() && LDBase->getAlignment() >= 32 &&
	VT.is256BitVector() && !Subtarget.hasInt256())			VT.is256BitVector() && !Subtarget.hasInt256())
	return SDValue();			return SDValue();

	if (NumElems == 1)			if (NumElems == 1)
	return DAG.getBitcast(VT, Elts[FirstLoadedElt]);			return DAG.getBitcast(VT, Elts[FirstLoadedElt]);

	if (!ZeroMask)			if (!ZeroMask)
	return CreateLoad(VT, LDBase);			return CreateLoad(VT, LDBase);

	// IsConsecutiveLoadWithZeros - we need to create a shuffle of the loaded			// IsConsecutiveLoadWithZeros - we need to create a shuffle of the loaded
	// vector and a zero vector to clear out the zero elements.			// vector and a zero vector to clear out the zero elements.
	if (!isAfterLegalize && VT.isVector()) {			if (!VectorLoadsOnly && !IsAfterLegalize && VT.isVector()) {
	unsigned NumMaskElts = VT.getVectorNumElements();			unsigned NumMaskElts = VT.getVectorNumElements();
	if ((NumMaskElts % NumElems) == 0) {			if ((NumMaskElts % NumElems) == 0) {
	unsigned Scale = NumMaskElts / NumElems;			unsigned Scale = NumMaskElts / NumElems;
	SmallVector<int, 4> ClearMask(NumMaskElts, -1);			SmallVector<int, 4> ClearMask(NumMaskElts, -1);
	for (unsigned i = 0; i < NumElems; ++i) {			for (unsigned i = 0; i < NumElems; ++i) {
	if (UndefMask[i])			if (UndefMask[i])
	continue;			continue;
	int Offset = ZeroMask[i] ? NumMaskElts : 0;			int Offset = ZeroMask[i] ? NumMaskElts : 0;
	for (unsigned j = 0; j != Scale; ++j)			for (unsigned j = 0; j != Scale; ++j)
	ClearMask[(i * Scale) + j] = (i * Scale) + j + Offset;			ClearMask[(i * Scale) + j] = (i * Scale) + j + Offset;
	}			}
	SDValue V = CreateLoad(VT, LDBase);			SDValue V = CreateLoad(VT, LDBase);
	SDValue Z = VT.isInteger() ? DAG.getConstant(0, DL, VT)			SDValue Z = VT.isInteger() ? DAG.getConstant(0, DL, VT)
	: DAG.getConstantFP(0.0, DL, VT);			: DAG.getConstantFP(0.0, DL, VT);
	return DAG.getVectorShuffle(VT, DL, V, Z, ClearMask);			return DAG.getVectorShuffle(VT, DL, V, Z, ClearMask);
	}			}
	}			}
	}			}

				if (VectorLoadsOnly)
				return SDValue();

	// If the upper half of a ymm/zmm load is undef then just load the lower half.			// If the upper half of a ymm/zmm load is undef then just load the lower half.
	if (VT.is256BitVector() || VT.is512BitVector()) {			if (VT.is256BitVector() || VT.is512BitVector()) {
	unsigned HalfNumElems = NumElems / 2;			unsigned HalfNumElems = NumElems / 2;
	if (UndefMask.extractBits(HalfNumElems, HalfNumElems).isAllOnesValue()) {			if (UndefMask.extractBits(HalfNumElems, HalfNumElems).isAllOnesValue()) {
	EVT HalfVT =			EVT HalfVT =
	EVT::getVectorVT(*DAG.getContext(), VT.getScalarType(), HalfNumElems);			EVT::getVectorVT(*DAG.getContext(), VT.getScalarType(), HalfNumElems);
	SDValue HalfLD =			SDValue HalfLD =
	EltsFromConsecutiveLoads(HalfVT, Elts.drop_back(HalfNumElems), DL,			EltsFromConsecutiveLoads(HalfVT, Elts.drop_back(HalfNumElems), DL,
	DAG, Subtarget, isAfterLegalize);			DAG, Subtarget, IsAfterLegalize);
	if (HalfLD)			if (HalfLD)
	return DAG.getNode(ISD::INSERT_SUBVECTOR, DL, VT, DAG.getUNDEF(VT),			return DAG.getNode(ISD::INSERT_SUBVECTOR, DL, VT, DAG.getUNDEF(VT),
	HalfLD, DAG.getIntPtrConstant(0, DL));			HalfLD, DAG.getIntPtrConstant(0, DL));
	}			}
	}			}

	// VZEXT_LOAD - consecutive 32/64-bit load/undefs followed by zeros/undefs.			// VZEXT_LOAD - consecutive 32/64-bit load/undefs followed by zeros/undefs.
	if (IsConsecutiveLoad && FirstLoadedElt == 0 &&			if (IsConsecutiveLoad && FirstLoadedElt == 0 &&
	[59 lines not shown]
	if (RepeatSize > ScalarSize)			if (RepeatSize > ScalarSize)
	RepeatVT = EVT::getVectorVT(*DAG.getContext(), RepeatVT,			RepeatVT = EVT::getVectorVT(*DAG.getContext(), RepeatVT,
	RepeatSize / ScalarSize);			RepeatSize / ScalarSize);
	EVT BroadcastVT =			EVT BroadcastVT =
	EVT::getVectorVT(*DAG.getContext(), RepeatVT.getScalarType(),			EVT::getVectorVT(*DAG.getContext(), RepeatVT.getScalarType(),
	VT.getSizeInBits() / ScalarSize);			VT.getSizeInBits() / ScalarSize);
	if (TLI.isTypeLegal(BroadcastVT)) {			if (TLI.isTypeLegal(BroadcastVT)) {
	if (SDValue RepeatLoad = EltsFromConsecutiveLoads(			if (SDValue RepeatLoad = EltsFromConsecutiveLoads(
	RepeatVT, RepeatedLoads, DL, DAG, Subtarget, isAfterLegalize)) {			RepeatVT, RepeatedLoads, DL, DAG, Subtarget, IsAfterLegalize)) {
	SDValue Broadcast = RepeatLoad;			SDValue Broadcast = RepeatLoad;
	if (RepeatSize > ScalarSize) {			if (RepeatSize > ScalarSize) {
	while (Broadcast.getValueSizeInBits() < VT.getSizeInBits())			while (Broadcast.getValueSizeInBits() < VT.getSizeInBits())
	Broadcast = concatSubVectors(Broadcast, Broadcast, DAG, DL);			Broadcast = concatSubVectors(Broadcast, Broadcast, DAG, DL);
	} else {			} else {
	Broadcast =			Broadcast =
	DAG.getNode(X86ISD::VBROADCAST, DL, BroadcastVT, RepeatLoad);			DAG.getNode(X86ISD::VBROADCAST, DL, BroadcastVT, RepeatLoad);
	}			}
	return DAG.getBitcast(VT, Broadcast);			return DAG.getBitcast(VT, Broadcast);
	}			}
	}			}
	}			}
	}			}

	return SDValue();			return SDValue();
	}			}

	// Combine a vector ops (shuffles etc.) that is equal to build_vector load1,			// Combine a vector ops (shuffles etc.) that is equal to build_vector load1,
	// load2, load3, load4, <0, 1, 2, 3> into a vector load if the load addresses			// load2, load3, load4, <0, 1, 2, 3> into a vector load if the load addresses
	// are consecutive, non-overlapping, and in the right order.			// are consecutive, non-overlapping, and in the right order.
	static SDValue combineToConsecutiveLoads(EVT VT, SDValue Op, const SDLoc &DL,			static SDValue combineToConsecutiveLoads(EVT VT, SDValue Op, const SDLoc &DL,
	SelectionDAG &DAG,			SelectionDAG &DAG,
	const X86Subtarget &Subtarget,			const X86Subtarget &Subtarget,
	bool isAfterLegalize) {			bool IsAfterLegalize,
				bool VectorLoadsOnly = false) {
	SmallVector<SDValue, 64> Elts;			SmallVector<SDValue, 64> Elts;
	for (unsigned i = 0, e = VT.getVectorNumElements(); i != e; ++i) {			for (unsigned i = 0, e = VT.getVectorNumElements(); i != e; ++i) {
	if (SDValue Elt = getShuffleScalarElt(Op, i, DAG, 0)) {			if (SDValue Elt = getShuffleScalarElt(Op, i, DAG, 0)) {
	Elts.push_back(Elt);			Elts.push_back(Elt);
	continue;			continue;
	}			}
	return SDValue();			return SDValue();
	}			}
	assert(Elts.size() == VT.getVectorNumElements());			assert(Elts.size() == VT.getVectorNumElements());
	return EltsFromConsecutiveLoads(VT, Elts, DL, DAG, Subtarget,			return EltsFromConsecutiveLoads(VT, Elts, DL, DAG, Subtarget,
				Lint (pre-merge checks): clang-format: please reformat the code
				- return EltsFromConsecutiveLoads(VT, Elts, DL, DAG, Subtarget,
				- IsAfterLegalize, VectorLoadsOnly);
				+ return EltsFromConsecutiveLoads(VT, Elts, DL, DAG, Subtarget, IsAfterLegalize,
				+ VectorLoadsOnly);
	isAfterLegalize);			IsAfterLegalize, VectorLoadsOnly);
	}			}

	static Constant *getConstantVector(MVT VT, const APInt &SplatValue,			static Constant *getConstantVector(MVT VT, const APInt &SplatValue,
	unsigned SplatBitSize, LLVMContext &C) {			unsigned SplatBitSize, LLVMContext &C) {
	unsigned ScalarSize = VT.getScalarSizeInBits();			unsigned ScalarSize = VT.getScalarSizeInBits();
	unsigned NumElm = SplatBitSize / ScalarSize;			unsigned NumElm = SplatBitSize / ScalarSize;

	SmallVector<Constant *, 32> ConstantVec;			SmallVector<Constant *, 32> ConstantVec;
	[19,982 lines not shown]
	SDLoc dl(N);			SDLoc dl(N);
	EVT VT = N->getValueType(0);			EVT VT = N->getValueType(0);
	const TargetLowering &TLI = DAG.getTargetLoweringInfo();			const TargetLowering &TLI = DAG.getTargetLoweringInfo();
	if (TLI.isTypeLegal(VT))			if (TLI.isTypeLegal(VT))
	if (SDValue AddSub = combineShuffleToAddSubOrFMAddSub(N, Subtarget, DAG))			if (SDValue AddSub = combineShuffleToAddSubOrFMAddSub(N, Subtarget, DAG))
	return AddSub;			return AddSub;

	// Attempt to combine into a vector load/broadcast.			// Attempt to combine into a vector load/broadcast.
	if (SDValue LD = combineToConsecutiveLoads(VT, SDValue(N, 0), dl, DAG,			if (SDValue LD = combineToConsecutiveLoads(
	Subtarget, true))			VT, SDValue(N, 0), dl, DAG, Subtarget, /*IsAfterLegalize*/ true))
	return LD;			return LD;

	// For AVX2, we sometimes want to combine			// For AVX2, we sometimes want to combine
	// (vector_shuffle <mask> (concat_vectors t1, undef)			// (vector_shuffle <mask> (concat_vectors t1, undef)
	// (concat_vectors t2, undef))			// (concat_vectors t2, undef))
	// Into:			// Into:
	// (vector_shuffle <mask> (concat_vectors t1, t2), undef)			// (vector_shuffle <mask> (concat_vectors t1, t2), undef)
	// Since the latter can be efficiently lowered with VPERMD/VPERMQ			// Since the latter can be efficiently lowered with VPERMD/VPERMQ
	[12,133 lines not shown]
	SDValue Ext =			SDValue Ext =
	extractSubVector(InVec.getOperand(0), IdxVal, DAG, DL, SizeInBits);			extractSubVector(InVec.getOperand(0), IdxVal, DAG, DL, SizeInBits);
	return DAG.getNode(InOpcode, DL, VT, Ext, InVec.getOperand(1));			return DAG.getNode(InOpcode, DL, VT, Ext, InVec.getOperand(1));
	}			}

	return SDValue();			return SDValue();
	}			}

	static SDValue combineScalarToVector(SDNode *N, SelectionDAG &DAG) {			static SDValue combineScalarToVector(SDNode *N, SelectionDAG &DAG,
				TargetLowering::DAGCombinerInfo &DCI,
				const X86Subtarget &Subtarget) {
	EVT VT = N->getValueType(0);			EVT VT = N->getValueType(0);
	SDValue Src = N->getOperand(0);			SDValue Src = N->getOperand(0);
	SDLoc DL(N);			SDLoc DL(N);

	// If this is a scalar to vector to v1i1 from an AND with 1, bypass the and.			// If this is a scalar to vector to v1i1 from an AND with 1, bypass the and.
	// This occurs frequently in our masked scalar intrinsic code and our			// This occurs frequently in our masked scalar intrinsic code and our
	// floating point select lowering with AVX512.			// floating point select lowering with AVX512.
	// TODO: SimplifyDemandedBits instead?			// TODO: SimplifyDemandedBits instead?
	[33 lines not shown]
	DAG.getAnyExtOrTrunc(ExtSrc, DL, MVT::i32)));			DAG.getAnyExtOrTrunc(ExtSrc, DL, MVT::i32)));
	}			}

	// Combine (v2i64 (scalar_to_vector (i64 (bitconvert (mmx))))) to MOVQ2DQ.			// Combine (v2i64 (scalar_to_vector (i64 (bitconvert (mmx))))) to MOVQ2DQ.
	if (VT == MVT::v2i64 && Src.getOpcode() == ISD::BITCAST &&			if (VT == MVT::v2i64 && Src.getOpcode() == ISD::BITCAST &&
	Src.getOperand(0).getValueType() == MVT::x86mmx)			Src.getOperand(0).getValueType() == MVT::x86mmx)
	return DAG.getNode(X86ISD::MOVQ2DQ, DL, VT, Src.getOperand(0));			return DAG.getNode(X86ISD::MOVQ2DQ, DL, VT, Src.getOperand(0));

	// See if we're broadcasting the scalar value, in which case just reuse that.			// Ensure we don't have any implicit truncation.
	// Ensure the same SDValue from the SDNode use is being used.			if (VT.getScalarType() == Src.getValueType()) {
	if (VT.getScalarType() == Src.getValueType())			// See if we're broadcasting the scalar value, in which case just reuse
	for (SDNode *User : Src->uses())			// that. Ensure the same SDValue from the SDNode use is being used.
				for (SDNode *User : Src->uses()) {
	if (User->getOpcode() == X86ISD::VBROADCAST &&			if (User->getOpcode() == X86ISD::VBROADCAST &&
	Src == User->getOperand(0)) {			Src == User->getOperand(0)) {
	unsigned SizeInBits = VT.getFixedSizeInBits();			unsigned SizeInBits = VT.getFixedSizeInBits();
	unsigned BroadcastSizeInBits =			unsigned BroadcastSizeInBits =
	User->getValueSizeInBits(0).getFixedSize();			User->getValueSizeInBits(0).getFixedSize();
	if (BroadcastSizeInBits == SizeInBits)			if (BroadcastSizeInBits == SizeInBits)
	return SDValue(User, 0);			return SDValue(User, 0);
	if (BroadcastSizeInBits > SizeInBits)			if (BroadcastSizeInBits > SizeInBits)
	return extractSubVector(SDValue(User, 0), 0, DAG, DL, SizeInBits);			return extractSubVector(SDValue(User, 0), 0, DAG, DL, SizeInBits);
	// TODO: Handle BroadcastSizeInBits < SizeInBits when we have test			// TODO: Handle BroadcastSizeInBits < SizeInBits when we have test
	// coverage.			// coverage.
	}			}
				}

				// Attempt to combine into a vector load.
				if (auto *Ld = dyn_cast<LoadSDNode>(peekThroughBitcasts(Src))) {
				bool Fast;
				const X86TargetLowering *TLI = Subtarget.getTargetLowering();
				if (N->isOnlyUserOf(Src.getNode()) &&
				TLI->allowsMemoryAccess(*DAG.getContext(), DAG.getDataLayout(), VT,
				*Ld->getMemOperand(), &Fast) && Fast)
				Lint (pre-merge checks): clang-format: please reformat the code
				- *Ld->getMemOperand(), &Fast) && Fast)
				+ *Ld->getMemOperand(), &Fast) &&
				+ Fast)
				if (SDValue LD = combineToConsecutiveLoads(
				VT, SDValue(N, 0), DL, DAG, Subtarget, DCI.isAfterLegalizeDAG(),
				/*VectorLoadsOnly*/ true))
				return LD;
				}
				}

	return SDValue();			return SDValue();
	}			}

	// Simplify PMULDQ and PMULUDQ operations.			// Simplify PMULDQ and PMULUDQ operations.
	static SDValue combinePMULDQ(SDNode *N, SelectionDAG &DAG,			static SDValue combinePMULDQ(SDNode *N, SelectionDAG &DAG,
	TargetLowering::DAGCombinerInfo &DCI,			TargetLowering::DAGCombinerInfo &DCI,
	const X86Subtarget &Subtarget) {			const X86Subtarget &Subtarget) {
	[325 lines not shown]
	}			}

	SDValue X86TargetLowering::PerformDAGCombine(SDNode *N,			SDValue X86TargetLowering::PerformDAGCombine(SDNode *N,
	DAGCombinerInfo &DCI) const {			DAGCombinerInfo &DCI) const {
	SelectionDAG &DAG = DCI.DAG;			SelectionDAG &DAG = DCI.DAG;
	switch (N->getOpcode()) {			switch (N->getOpcode()) {
	default: break;			default: break;
	case ISD::SCALAR_TO_VECTOR:			case ISD::SCALAR_TO_VECTOR:
	return combineScalarToVector(N, DAG);			return combineScalarToVector(N, DAG, DCI, Subtarget);
	case ISD::EXTRACT_VECTOR_ELT:			case ISD::EXTRACT_VECTOR_ELT:
	case X86ISD::PEXTRW:			case X86ISD::PEXTRW:
	case X86ISD::PEXTRB:			case X86ISD::PEXTRB:
	return combineExtractVectorElt(N, DAG, DCI, Subtarget);			return combineExtractVectorElt(N, DAG, DCI, Subtarget);
	case ISD::CONCAT_VECTORS:			case ISD::CONCAT_VECTORS:
	return combineConcatVectors(N, DAG, DCI, Subtarget);			return combineConcatVectors(N, DAG, DCI, Subtarget);
	case ISD::INSERT_SUBVECTOR:			case ISD::INSERT_SUBVECTOR:
	return combineInsertSubvector(N, DAG, DCI, Subtarget);			return combineInsertSubvector(N, DAG, DCI, Subtarget);
	[1,433 lines not shown]
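
The zero-clearing shuffle mask that `EltsFromConsecutiveLoads` builds in the diff above can be sketched outside of LLVM as follows. This is a standalone approximation, not the LLVM code: `buildClearMask` is a hypothetical name, and `std::vector<bool>` stands in for the `APInt` masks (`ZeroMask`, `UndefMask`) used in the real function.

```cpp
#include <cassert>
#include <vector>

// Builds a two-input shuffle mask. Index i selects lane i of the loaded
// vector, index NumMaskElts + i selects lane i of an all-zero vector,
// and -1 marks a lane whose value is undefined.
std::vector<int> buildClearMask(const std::vector<bool> &ZeroElts,
                                const std::vector<bool> &UndefElts,
                                unsigned NumMaskElts) {
  unsigned NumElems = static_cast<unsigned>(ZeroElts.size());
  assert(NumElems != 0 && UndefElts.size() == NumElems &&
         NumMaskElts % NumElems == 0 && "mask must be a multiple of Elts");
  unsigned Scale = NumMaskElts / NumElems; // mask lanes per source element
  std::vector<int> ClearMask(NumMaskElts, -1);
  for (unsigned I = 0; I < NumElems; ++I) {
    if (UndefElts[I])
      continue; // leave undefined lanes as -1
    int Offset = ZeroElts[I] ? static_cast<int>(NumMaskElts) : 0;
    for (unsigned J = 0; J != Scale; ++J)
      ClearMask[I * Scale + J] = static_cast<int>(I * Scale + J) + Offset;
  }
  return ClearMask;
}
```

With four elements where element 2 is zero and element 3 is undefined, the mask is `{0, 1, 6, -1}`: lanes 0 and 1 come from the load, lane 2 from the zero vector, and lane 3 is left undefined. The patch skips this path entirely when `VectorLoadsOnly` is set.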

llvm/test/CodeGen/X86/load-partial-dot-product.ll

	[172 lines not shown]
	; SSE41-NEXT: movshdup {{.*#+}} xmm0 = xmm1[1,1,3,3]			; SSE41-NEXT: movshdup {{.*#+}} xmm0 = xmm1[1,1,3,3]
	; SSE41-NEXT: addss %xmm1, %xmm0			; SSE41-NEXT: addss %xmm1, %xmm0
	; SSE41-NEXT: movhlps {{.*#+}} xmm1 = xmm1[1,1]			; SSE41-NEXT: movhlps {{.*#+}} xmm1 = xmm1[1,1]
	; SSE41-NEXT: addss %xmm1, %xmm0			; SSE41-NEXT: addss %xmm1, %xmm0
	; SSE41-NEXT: retq			; SSE41-NEXT: retq
	;			;
	; AVX-LABEL: dot3_float3:			; AVX-LABEL: dot3_float3:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero			; AVX-NEXT: vmovups (%rdi), %xmm0
	; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0,1],mem[0],xmm0[3]			; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0,1],mem[0],xmm0[3]
	; AVX-NEXT: vmovsd {{.*#+}} xmm1 = mem[0],zero			; AVX-NEXT: vmovups (%rsi), %xmm1
				efriedma: Even if we're allowed to do this, it doesn't seem wise; having zero in the high bits of the register is better than random junk. Can we mark up the loads somehow?
				RKSimon (author): Isn't that what the dereferenceable(16) tag is doing?
				pengfei: I have the same doubt. `dereferenceable(16)` tells us that the memory for the high bits is available. But shouldn't we always prefer loading fewer bytes for performance?
				lebedev.ri: You are comparing apples to oranges here. The problem here is that `vinsertps` is (obviously) redundant and should go away. Then it's obviously better - we have one less memory access.
				pengfei: I see it now. It makes sense if it wants to turn `vmovsd {{.*#+}} xmm0 = mem[0],zero` followed by `vinsertps {{.*#+}} xmm0 = xmm0[0,1],mem[0],xmm0[3]` into `vmovups (%rdi), %xmm0`
	; AVX-NEXT: vinsertps {{.*#+}} xmm1 = xmm1[0,1],mem[0],xmm1[3]			; AVX-NEXT: vinsertps {{.*#+}} xmm1 = xmm1[0,1],mem[0],xmm1[3]
	; AVX-NEXT: vmulps %xmm1, %xmm0, %xmm0			; AVX-NEXT: vmulps %xmm1, %xmm0, %xmm0
	; AVX-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]			; AVX-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
	; AVX-NEXT: vpermilpd {{.*#+}} xmm2 = xmm0[1,0]			; AVX-NEXT: vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
	; AVX-NEXT: vaddss %xmm1, %xmm0, %xmm0			; AVX-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; AVX-NEXT: vaddss %xmm2, %xmm0, %xmm0			; AVX-NEXT: vaddss %xmm2, %xmm0, %xmm0
	; AVX-NEXT: retq			; AVX-NEXT: retq
	%bcx012 = bitcast float* %a0 to <3 x float>*			%bcx012 = bitcast float* %a0 to <3 x float>*
	[44 lines not shown]
	; SSE41-NEXT: mulss 8(%rsi), %xmm2			; SSE41-NEXT: mulss 8(%rsi), %xmm2
	; SSE41-NEXT: movshdup {{.*#+}} xmm0 = xmm1[1,1,3,3]			; SSE41-NEXT: movshdup {{.*#+}} xmm0 = xmm1[1,1,3,3]
	; SSE41-NEXT: addss %xmm1, %xmm0			; SSE41-NEXT: addss %xmm1, %xmm0
	; SSE41-NEXT: addss %xmm2, %xmm0			; SSE41-NEXT: addss %xmm2, %xmm0
	; SSE41-NEXT: retq			; SSE41-NEXT: retq
	;			;
	; AVX-LABEL: dot3_float2_float:			; AVX-LABEL: dot3_float2_float:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero			; AVX-NEXT: vmovups (%rdi), %xmm0
	; AVX-NEXT: vmovsd {{.*#+}} xmm1 = mem[0],zero
	; AVX-NEXT: vmulps %xmm1, %xmm0, %xmm0
	; AVX-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero			; AVX-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; AVX-NEXT: vmulps (%rsi), %xmm0, %xmm0
	; AVX-NEXT: vmulss 8(%rsi), %xmm1, %xmm1			; AVX-NEXT: vmulss 8(%rsi), %xmm1, %xmm1
	; AVX-NEXT: vmovshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]			; AVX-NEXT: vmovshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
	; AVX-NEXT: vaddss %xmm2, %xmm0, %xmm0			; AVX-NEXT: vaddss %xmm2, %xmm0, %xmm0
	; AVX-NEXT: vaddss %xmm1, %xmm0, %xmm0			; AVX-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; AVX-NEXT: retq			; AVX-NEXT: retq
	%bcx01 = bitcast float* %a0 to <2 x float>*			%bcx01 = bitcast float* %a0 to <2 x float>*
	%bcy01 = bitcast float* %a1 to <2 x float>*			%bcy01 = bitcast float* %a1 to <2 x float>*
	%x01 = load <2 x float>, <2 x float>* %bcx01, align 4			%x01 = load <2 x float>, <2 x float>* %bcx01, align 4
	[153 lines not shown]
	; SSE41-NEXT: movsd {{.*#+}} xmm1 = mem[0],zero			; SSE41-NEXT: movsd {{.*#+}} xmm1 = mem[0],zero
	; SSE41-NEXT: mulps %xmm0, %xmm1			; SSE41-NEXT: mulps %xmm0, %xmm1
	; SSE41-NEXT: movshdup {{.*#+}} xmm0 = xmm1[1,1,3,3]			; SSE41-NEXT: movshdup {{.*#+}} xmm0 = xmm1[1,1,3,3]
	; SSE41-NEXT: addss %xmm1, %xmm0			; SSE41-NEXT: addss %xmm1, %xmm0
	; SSE41-NEXT: retq			; SSE41-NEXT: retq
	;			;
	; AVX-LABEL: dot2_float2:			; AVX-LABEL: dot2_float2:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero			; AVX-NEXT: vmovups (%rdi), %xmm0
	; AVX-NEXT: vmovsd {{.*#+}} xmm1 = mem[0],zero			; AVX-NEXT: vmulps (%rsi), %xmm0, %xmm0
	; AVX-NEXT: vmulps %xmm1, %xmm0, %xmm0
	; AVX-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]			; AVX-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
	; AVX-NEXT: vaddss %xmm1, %xmm0, %xmm0			; AVX-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; AVX-NEXT: retq			; AVX-NEXT: retq
	%bcx01 = bitcast float* %a0 to <2 x float>*			%bcx01 = bitcast float* %a0 to <2 x float>*
	%bcy01 = bitcast float* %a1 to <2 x float>*			%bcy01 = bitcast float* %a1 to <2 x float>*
	%x01 = load <2 x float>, <2 x float>* %bcx01, align 4			%x01 = load <2 x float>, <2 x float>* %bcx01, align 4
	%y01 = load <2 x float>, <2 x float>* %bcy01, align 4			%y01 = load <2 x float>, <2 x float>* %bcy01, align 4
	%mul01 = fmul <2 x float> %x01, %y01			%mul01 = fmul <2 x float> %x01, %y01
	%mul0 = extractelement <2 x float> %mul01, i32 0			%mul0 = extractelement <2 x float> %mul01, i32 0
	%mul1 = extractelement <2 x float> %mul01, i32 1			%mul1 = extractelement <2 x float> %mul01, i32 1
	%dot01 = fadd float %mul0, %mul1			%dot01 = fadd float %mul0, %mul1
	ret float %dot01			ret float %dot01
	}			}
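
For readers less used to the IR, the `dot2_float2` test above corresponds roughly to the C++ below. It is a sketch with hypothetical names (`Vec4`, `mul4`, `dot2`) that models the four-lane `vmulps` followed by a reduction over lanes 0 and 1 only.

```cpp
#include <array>

using Vec4 = std::array<float, 4>;

// Lane-wise multiply across all four lanes, as vmulps would do.
Vec4 mul4(const Vec4 &A, const Vec4 &B) {
  Vec4 R;
  for (int I = 0; I != 4; ++I)
    R[I] = A[I] * B[I];
  return R;
}

// Two-element dot product: only lanes 0 and 1 of the product are summed,
// so the contents of lanes 2 and 3 never reach the result.
float dot2(const Vec4 &X, const Vec4 &Y) {
  Vec4 M = mul4(X, Y);
  return M[0] + M[1];
}
```

The returned value is the same whether the upper lanes hold zeros or arbitrary finite data, which is the sense in which the wide load preserves the result. The sketch says nothing about any performance effect of the upper-lane contents.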

llvm/test/CodeGen/X86/load-partial.ll

	[133 lines not shown]
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero			; SSE-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
	; SSE-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero			; SSE-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
	; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,1],xmm1[0,0]			; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,1],xmm1[0,0]
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: load_float4_float3_as_float2_float_0122:			; AVX-LABEL: load_float4_float3_as_float2_float_0122:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero			; AVX-NEXT: vmovups (%rdi), %xmm0
	; AVX-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero			; AVX-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
	; AVX-NEXT: vshufps {{.*#+}} xmm0 = xmm0[0,1],xmm1[0,0]			; AVX-NEXT: vshufps {{.*#+}} xmm0 = xmm0[0,1],xmm1[0,0]
	; AVX-NEXT: retq			; AVX-NEXT: retq
	%2 = bitcast <4 x float>* %0 to <2 x float>*			%2 = bitcast <4 x float>* %0 to <2 x float>*
	%3 = load <2 x float>, <2 x float>* %2, align 4			%3 = load <2 x float>, <2 x float>* %2, align 4
	%4 = extractelement <2 x float> %3, i32 0			%4 = extractelement <2 x float> %3, i32 0
	%5 = insertelement <4 x float> undef, float %4, i32 0			%5 = insertelement <4 x float> undef, float %4, i32 0
	%6 = extractelement <2 x float> %3, i32 1			%6 = extractelement <2 x float> %3, i32 1
	[242 lines not shown]

llvm/test/CodeGen/X86/masked_gather.ll

	[1,142 lines not shown]
	; SSE-NEXT: pcmpeqd %xmm0, %xmm1			; SSE-NEXT: pcmpeqd %xmm0, %xmm1
	; SSE-NEXT: pcmpeqd %xmm2, %xmm0			; SSE-NEXT: pcmpeqd %xmm2, %xmm0
	; SSE-NEXT: packssdw %xmm1, %xmm0			; SSE-NEXT: packssdw %xmm1, %xmm0
	; SSE-NEXT: packsswb %xmm0, %xmm0			; SSE-NEXT: packsswb %xmm0, %xmm0
	; SSE-NEXT: pmovmskb %xmm0, %eax			; SSE-NEXT: pmovmskb %xmm0, %eax
	; SSE-NEXT: testb $1, %al			; SSE-NEXT: testb $1, %al
	; SSE-NEXT: je .LBB4_1			; SSE-NEXT: je .LBB4_1
	; SSE-NEXT: # %bb.2: # %cond.load			; SSE-NEXT: # %bb.2: # %cond.load
	; SSE-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero			; SSE-NEXT: movdqu c+12(%rip), %xmm0
	; SSE-NEXT: testb $2, %al			; SSE-NEXT: testb $2, %al
	; SSE-NEXT: jne .LBB4_4			; SSE-NEXT: jne .LBB4_4
	; SSE-NEXT: jmp .LBB4_5			; SSE-NEXT: jmp .LBB4_5
	; SSE-NEXT: .LBB4_1:			; SSE-NEXT: .LBB4_1:
	; SSE-NEXT: # implicit-def: $xmm0			; SSE-NEXT: # implicit-def: $xmm0
	; SSE-NEXT: testb $2, %al			; SSE-NEXT: testb $2, %al
	; SSE-NEXT: je .LBB4_5			; SSE-NEXT: je .LBB4_5
	; SSE-NEXT: .LBB4_4: # %cond.load1			; SSE-NEXT: .LBB4_4: # %cond.load1
	[290 lines not shown]
	; AVX1-NEXT: .LBB4_48: # %else102			; AVX1-NEXT: .LBB4_48: # %else102
	; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm2			; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm2
	; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm3			; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm3
	; AVX1-NEXT: vpaddd %xmm2, %xmm3, %xmm2			; AVX1-NEXT: vpaddd %xmm2, %xmm3, %xmm2
	; AVX1-NEXT: vpaddd %xmm0, %xmm1, %xmm0			; AVX1-NEXT: vpaddd %xmm0, %xmm1, %xmm0
	; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	; AVX1-NEXT: .LBB4_1: # %cond.load			; AVX1-NEXT: .LBB4_1: # %cond.load
	; AVX1-NEXT: vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero			; AVX1-NEXT: vmovdqu c+12(%rip), %xmm1
	; AVX1-NEXT: testb $2, %al			; AVX1-NEXT: testb $2, %al
	; AVX1-NEXT: je .LBB4_4			; AVX1-NEXT: je .LBB4_4
	; AVX1-NEXT: .LBB4_3: # %cond.load1			; AVX1-NEXT: .LBB4_3: # %cond.load1
	; AVX1-NEXT: vpinsrd $1, c+12(%rip), %xmm1, %xmm3			; AVX1-NEXT: vpinsrd $1, c+12(%rip), %xmm1, %xmm3
	; AVX1-NEXT: vblendps {{.*#+}} ymm1 = ymm3[0,1,2,3],ymm1[4,5,6,7]			; AVX1-NEXT: vblendps {{.*#+}} ymm1 = ymm3[0,1,2,3],ymm1[4,5,6,7]
	; AVX1-NEXT: testb $4, %al			; AVX1-NEXT: testb $4, %al
	; AVX1-NEXT: je .LBB4_6			; AVX1-NEXT: je .LBB4_6
	; AVX1-NEXT: .LBB4_5: # %cond.load4			; AVX1-NEXT: .LBB4_5: # %cond.load4
	[185 lines not shown]
	; AVX2-NEXT: # %bb.47: # %cond.load99			; AVX2-NEXT: # %bb.47: # %cond.load99
	; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm2			; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm2
	; AVX2-NEXT: vpinsrd $3, c+28(%rip), %xmm2, %xmm2			; AVX2-NEXT: vpinsrd $3, c+28(%rip), %xmm2, %xmm2
	; AVX2-NEXT: vinserti128 $1, %xmm2, %ymm0, %ymm0			; AVX2-NEXT: vinserti128 $1, %xmm2, %ymm0, %ymm0
	; AVX2-NEXT: .LBB4_48: # %else102			; AVX2-NEXT: .LBB4_48: # %else102
	; AVX2-NEXT: vpaddd %ymm0, %ymm1, %ymm0			; AVX2-NEXT: vpaddd %ymm0, %ymm1, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	; AVX2-NEXT: .LBB4_1: # %cond.load			; AVX2-NEXT: .LBB4_1: # %cond.load
	; AVX2-NEXT: vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero			; AVX2-NEXT: vmovdqu c+12(%rip), %xmm1
	; AVX2-NEXT: testb $2, %al			; AVX2-NEXT: testb $2, %al
	; AVX2-NEXT: je .LBB4_4			; AVX2-NEXT: je .LBB4_4
	; AVX2-NEXT: .LBB4_3: # %cond.load1			; AVX2-NEXT: .LBB4_3: # %cond.load1
	; AVX2-NEXT: vpinsrd $1, c+12(%rip), %xmm1, %xmm2			; AVX2-NEXT: vpinsrd $1, c+12(%rip), %xmm1, %xmm2
	; AVX2-NEXT: vpblendd {{.*#+}} ymm1 = ymm2[0,1,2,3],ymm1[4,5,6,7]			; AVX2-NEXT: vpblendd {{.*#+}} ymm1 = ymm2[0,1,2,3],ymm1[4,5,6,7]
	; AVX2-NEXT: testb $4, %al			; AVX2-NEXT: testb $4, %al
	; AVX2-NEXT: je .LBB4_6			; AVX2-NEXT: je .LBB4_6
	; AVX2-NEXT: .LBB4_5: # %cond.load4			; AVX2-NEXT: .LBB4_5: # %cond.load4
	[143 lines not shown]