This is an archive of the discontinued LLVM Phabricator instance.

merge vector stores into wider vector stores and fix AArch64 misaligned access TLI hook (PR21711)
ClosedPublic

Authored by spatel on Sep 4 2015, 8:36 AM.


Details

Reviewers

jyknight
qcolombet
t.p.northover
ab
ahatanak
jmolloy
hfinkel

Commits

rGbbbf9a1a34af: merge vector stores into wider vector stores and fix AArch64 misaligned access…
rL248622: merge vector stores into wider vector stores and fix AArch64 misaligned…

Summary

This is a redo of D7208 ( r227242 - http://llvm.org/viewvc/llvm-project?view=revision&revision=227242 ).

The patch was reverted because an AArch64 target could infinite loop after the change in DAGCombiner to merge vector stores. That happened because AArch64's allowsMisalignedMemoryAccesses() wasn't telling the truth. It reported all unaligned memory accesses as fast, but then split some 128-bit unaligned accesses up in performSTORECombine() because they are slow.

This patch attempts to fix the problem in allowsMisalignedMemoryAccesses() while preserving existing lowering behavior for AArch64.
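The failure mode in the summary can be illustrated with a toy model. This is not LLVM code: `Target`, `runCombines`, and the step budget are hypothetical names made up for this sketch. It only assumes what the summary says, namely that one combine merges stores when the hook reports the wide access as fast, while a second combine splits the wide store again because the target considers it slow.

```cpp
#include <cassert>

// Toy model, not LLVM code. A pair of adjacent 64-bit stores is either
// kept separate or merged into a single 128-bit store.
struct Target {
  bool hookSaysFast128;     // what the misaligned-access hook reports
  bool splitsUnaligned128;  // what the target's own store combine does
};

// Returns the number of rewrites until a fixed point is reached,
// or -1 if the step budget runs out (the "infinite loop" case).
int runCombines(const Target &T, int MaxSteps) {
  bool merged = false;  // start with two separate 64-bit stores
  int steps = 0;
  while (steps < MaxSteps) {
    if (!merged && T.hookSaysFast128) {    // generic combine: merge
      merged = true;
      ++steps;
      continue;
    }
    if (merged && T.splitsUnaligned128) {  // target combine: split again
      merged = false;
      ++steps;
      continue;
    }
    return steps;  // neither combine fires: fixed point
  }
  return -1;
}
```

With the hook and the split logic disagreeing (`{true, true}`), the two rewrites undo each other until the budget is exhausted. Once the hook reports the access as slow (`{false, true}`), nothing is merged and the loop settles immediately.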

Diff Detail

Event Timeline

spatel updated this revision to Diff 34033.Sep 4 2015, 8:36 AM

spatel retitled this revision from to merge vector stores into wider vector stores and fix AArch64 misaligned access TLI hook (PR21711).

spatel updated this object.

spatel added reviewers: qcolombet, hfinkel, ahatanak, jyknight, t.p.northover.

spatel added a subscriber: llvm-commits.

Herald added subscribers: rengolin, aemerson.Sep 4 2015, 8:36 AM

Ping.

arsenm added a subscriber: arsenm.Sep 11 2015, 12:37 PM

arsenm added inline comments.

lib/Target/AArch64/AArch64ISelLowering.cpp
804	Pedantry: using VT.getStoreSize() would be preferable
test/CodeGen/AArch64/merge-store.ll
30	Can you add some more test cases with different combination sized vectors? I would like to see a testcase that might try to combine multiple 3x vectors

spatel added inline comments.Sep 11 2015, 2:30 PM

lib/Target/AArch64/AArch64ISelLowering.cpp
804	Ok. Note: I was misled by the 'store' in the name at first since we're handling both loads and stores here, but I see now that that function is not really about stores specifically.
test/CodeGen/AArch64/merge-store.ll
30	I've been trying to find a way to do this, but all attempts so far have been thwarted because things like a v3f32 are not a simple type, so MergeConsecutiveStores doesn't get very far. Do you have a specific pattern that you're thinking of? Keep in mind that this patch limits vector merging only to extracted elements of a vector, so we're not even handling loads yet.

arsenm added inline comments.Sep 11 2015, 2:35 PM

test/CodeGen/AArch64/merge-store.ll
30	I was thinking in case you had something like load v3i32, load v3i32 and those might be legalized into v2i32, i32, v2i32, i32 so you would have mixed scalar and vector loads to worry about. I forgot that MergeConsecutiveStores currently only runs before types are legal, so this won't really do anything for now, although I would eventually like to run it again later and to handle 3 vectors

spatel added inline comments.Sep 11 2015, 2:36 PM

test/CodeGen/AArch64/merge-store.ll
30	Also note that we don't handle merging different sized types in getStoreMergeAndAliasCandidates(). I filed PR24654 ( https://llvm.org/bugs/show_bug.cgi?id=24654 ) related to that.

arsenm added inline comments.Sep 11 2015, 2:40 PM

test/CodeGen/AArch64/merge-store.ll
30	D12698 may help with that specific problem

Patch updated based on Matt's feedback:

Use getStoreSize() rather than getSizeInBits().
I haven't found any interesting new test cases where this will fire because we're still quite limited in what we try to merge, so no new test cases vs. previous revision of this patch.

Ping * 2.

arsenm added inline comments.Sep 20 2015, 11:50 AM

lib/Target/AArch64/AArch64ISelLowering.cpp
808–811	Which extensions do you mean? I've been looking for a way to specify alignment of vector loads from C, but nothing I've tried seems to work. However, using the existence of this to justify reporting a different alignment as fast seems suspect.

spatel added inline comments.Sep 20 2015, 12:22 PM

lib/Target/AArch64/AArch64ISelLowering.cpp
808–811	I agree that this looks hacky (along with the comment about optimizing for benchmarks), but the comments and code are copied directly from the existing performSTORECombine() (see around line 8476). I don't want to alter the existing AArch64 logic for this patch (other than to fix the obviously broken allowsMisalignedMemoryAccesses() implementation to allow the vector merging in DAGCombiner).
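For readers following the condition under discussion, here is a standalone restatement of the `*Fast` computation from the diff. It is a sketch rather than the committed code: `unalignedAccessIsFast` is a hypothetical free function, its plain parameters stand in for `Subtarget->isCyclone()`, the value type, and `Align`, and the size is expressed in bytes on the assumption that the `getStoreSize()` suggestion is applied.

```cpp
#include <cassert>

// Hypothetical standalone version of the predicate in the diff.
// An unaligned access is reported as slow only when all of these hold:
//   - the subtarget is Cyclone,
//   - the access is 16 bytes (128 bits) wide,
//   - the alignment was not deliberately underspecified as 1 or 2,
//   - the type is not v2i64 (produced by memcpy lowering).
bool unalignedAccessIsFast(bool IsCyclone, unsigned StoreSizeBytes,
                           unsigned Align, bool IsV2i64) {
  return !IsCyclone || StoreSizeBytes != 16 || Align <= 2 || IsV2i64;
}
```

Under this reading, merging two 8-byte-aligned `<2 x float>` stores into one 16-byte store is not considered fast on Cyclone, which is what the CYCLONE check lines in merge-store.ll expect.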

Ping * 3.

scanon added a subscriber: scanon.Sep 25 2015, 10:45 AM

LGTM

lib/Target/AArch64/AArch64ISelLowering.cpp

808–811

I guess this refers to something like:

typedef int __attribute__((ext_vector_type(4))) v4i32;
typedef v4i32 __attribute__((aligned(2))) v4i32_a2;

v4i32 foo(v4i32 *p) {
  v4i32_a2 *p2 = p;
  return *p2;
}

Interestingly, this generates a naturally aligned load:

typedef int __attribute__((ext_vector_type(4))) v4i32;
typedef v4i32 __attribute__((aligned(2))) v4i32_a2;

v4i32 foo(v4i32 *p) {
  return *(v4i32_a2 *)p;
}

This revision is now accepted and ready to land.Sep 25 2015, 11:44 AM

arsenm added inline comments.Sep 25 2015, 11:51 AM

lib/Target/AArch64/AArch64ISelLowering.cpp
808–811	Looks like a bug to me

scanon added inline comments.Sep 25 2015, 1:01 PM

lib/Target/AArch64/AArch64ISelLowering.cpp
808–811	That's expected, because p has 16B alignment; that alignment doesn't go away just because you cast the pointer to a less-aligned type. To get an unaligned load, you do something like:

typedef int __attribute__((ext_vector_type(4))) v4i32;
typedef int __attribute__((ext_vector_type(4), aligned(4))) v4i32_a4;

v4i32 foo(const int *p) {
  return *(const v4i32_a4 *)p;
}
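A compilable variant of the under-aligned typedef approach is sketched below. It makes two substitutions that are not in the thread: it uses `vector_size(16)` instead of clang's `ext_vector_type(4)` so that GCC also accepts it, and it adds `may_alias` to both typedefs (as the x86 intrinsics headers do for their unaligned vector types) so that loading through an `int` pointer does not run into strict-aliasing rules. `loadUnaligned` is a hypothetical name.

```cpp
#include <cassert>

// Naturally aligned 4 x i32 vector (16-byte alignment).
typedef int v4i32 __attribute__((vector_size(16), may_alias));
// Same vector type, but declared with only 4-byte alignment.
typedef int v4i32_a4 __attribute__((vector_size(16), aligned(4), may_alias));

// The pointee is only known to be int-aligned, so the load goes through
// the under-aligned typedef and the compiler must not assume 16 bytes.
v4i32 loadUnaligned(const int *p) {
  return *reinterpret_cast<const v4i32_a4 *>(p);
}
```

The point of starting from `const int *` is that no 16-byte alignment is ever attached to the pointer, so there is nothing for the cast to "forget".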

scanon added inline comments.Sep 25 2015, 1:13 PM

test/CodeGen/X86/MergeConsecutiveStores.ll
482	Combining these stores is not an unambiguous win. With 16B alignment (as here), you're introducing a cacheline-crossing penalty where there otherwise would not be one; even with unknown alignment, Intel's optimization manual recommends using two 16B stores, rather than a 32B store (11.6.2). Does anyone from Intel want to comment on the considerations here?

Please see the discussion in D12154.

I don't think we take cacheline-crossing penalties into account anywhere in the compiler. Ie, we produce unaligned accesses for all x86 targets when we can merge smaller ops together to reduce the instruction count.

Note that we do have CPU attributes (eg, FeatureSlowUAMem32) that change this behavior; see unaligned-32-byte-memops.ll for examples of how that works.
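The cacheline-crossing concern raised above comes down to simple address arithmetic. The sketch below assumes 64-byte cache lines, which is typical for x86 but is an assumption here, and `crossesCacheLine` is a hypothetical helper, not something in LLVM.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if the byte range [Addr, Addr + Size) touches more than
// one cache line. Line defaults to 64 bytes (an assumption).
bool crossesCacheLine(std::uint64_t Addr, std::uint64_t Size,
                      std::uint64_t Line = 64) {
  return Size != 0 && (Addr / Line) != ((Addr + Size - 1) / Line);
}
```

A 16-byte store at a 16-byte-aligned address can never straddle a 64-byte line, but a 32-byte store at the same address can. For example, if the base pointer in the X86 test happened to be 64-byte aligned, the merged 32-byte store at offset 48 would cover bytes 48 through 79 and cross a line, while the original 16-byte store at offset 48 would not.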

Closed by commit rL248622: merge vector stores into wider vector stores and fix AArch64 misaligned… (authored by spatel).Sep 25 2015, 2:51 PM

This revision was automatically updated to reflect the committed changes.

ab added inline comments.Sep 25 2015, 4:54 PM

lib/Target/AArch64/AArch64ISelLowering.cpp
808–811	> That's expected, because p has 16B alignment; that alignment doesn't go away just because you cast the pointer to a less-aligned type.

You're right, but can't we argue the opposite? If I want to make the alignment go away, isn't __attribute__((aligned)) the best (or at least a very obvious) tool at my disposal? In any case, I guess we should most importantly match gcc's behavior, and from what I see this changed since ~r246705, possibly as unintentional fallout of the recent alignment tracking improvements: filed PR24944 to look into this.

Revision Contents

Path

Size

lib/

CodeGen/

SelectionDAG/

DAGCombiner.cpp

35 lines

Target/

AArch64/

AArch64ISelLowering.cpp

26 lines

test/

CodeGen/

AArch64/

merge-store.ll

30 lines

X86/

MergeConsecutiveStores.ll

6 lines

Diff 34033

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

bool DAGCombiner::MergeConsecutiveStores(StoreSDNode* St) {
int64_t ElementSizeBytes = MemVT.getSizeInBits() / 8;		int64_t ElementSizeBytes = MemVT.getSizeInBits() / 8;
bool NoVectors = DAG.getMachineFunction().getFunction()->hasFnAttribute(		bool NoVectors = DAG.getMachineFunction().getFunction()->hasFnAttribute(
Attribute::NoImplicitFloat);		Attribute::NoImplicitFloat);

// This function cannot currently deal with non-byte-sized memory sizes.		// This function cannot currently deal with non-byte-sized memory sizes.
if (ElementSizeBytes * 8 != MemVT.getSizeInBits())		if (ElementSizeBytes * 8 != MemVT.getSizeInBits())
return false;		return false;

// Don't merge vectors into wider inputs.		if (!MemVT.isSimple())
if (MemVT.isVector() || !MemVT.isSimple())
return false;		return false;

// Perform an early exit check. Do not bother looking at stored values that		// Perform an early exit check. Do not bother looking at stored values that
// are not constants, loads, or extracted vector elements.		// are not constants, loads, or extracted vector elements.
SDValue StoredVal = St->getValue();		SDValue StoredVal = St->getValue();
bool IsLoadSrc = isa<LoadSDNode>(StoredVal);		bool IsLoadSrc = isa<LoadSDNode>(StoredVal);
bool IsConstantSrc = isa<ConstantSDNode>(StoredVal) ||		bool IsConstantSrc = isa<ConstantSDNode>(StoredVal) ||
isa<ConstantFPSDNode>(StoredVal);		isa<ConstantFPSDNode>(StoredVal);
bool IsExtractVecEltSrc = (StoredVal.getOpcode() == ISD::EXTRACT_VECTOR_ELT);		bool IsExtractVecSrc = (StoredVal.getOpcode() == ISD::EXTRACT_VECTOR_ELT ||
		StoredVal.getOpcode() == ISD::EXTRACT_SUBVECTOR);

if (!IsConstantSrc && !IsLoadSrc && !IsExtractVecEltSrc)		if (!IsConstantSrc && !IsLoadSrc && !IsExtractVecSrc)
		return false;

		// Don't merge vectors into wider vectors if the source data comes from loads.
		// TODO: This restriction can be lifted by using logic similar to the
		// ExtractVecSrc case.
		if (MemVT.isVector() && IsLoadSrc)
return false;		return false;

// Only look at ends of store sequences.		// Only look at ends of store sequences.
SDValue Chain = SDValue(St, 0);		SDValue Chain = SDValue(St, 0);
if (Chain->hasOneUse() && Chain->use_begin()->getOpcode() == ISD::STORE)		if (Chain->hasOneUse() && Chain->use_begin()->getOpcode() == ISD::STORE)
return false;		return false;

// Save the LoadSDNodes that we find in the chain.		// Save the LoadSDNodes that we find in the chain.
if (IsConstantSrc) {
unsigned NumElem = UseVector ? LastLegalVectorType : LastLegalType;		unsigned NumElem = UseVector ? LastLegalVectorType : LastLegalType;

return MergeStoresOfConstantsOrVecElts(StoreNodes, MemVT, NumElem,		return MergeStoresOfConstantsOrVecElts(StoreNodes, MemVT, NumElem,
true, UseVector);		true, UseVector);
}		}

// When extracting multiple vector elements, try to store them		// When extracting multiple vector elements, try to store them
// in one vector store rather than a sequence of scalar stores.		// in one vector store rather than a sequence of scalar stores.
if (IsExtractVecEltSrc) {		if (IsExtractVecSrc) {
unsigned NumElem = 0;		unsigned NumStoresToMerge = 0;
		bool IsVec = MemVT.isVector();
for (unsigned i = 0; i < LastConsecutiveStore + 1; ++i) {		for (unsigned i = 0; i < LastConsecutiveStore + 1; ++i) {
StoreSDNode *St = cast<StoreSDNode>(StoreNodes[i].MemNode);		StoreSDNode *St = cast<StoreSDNode>(StoreNodes[i].MemNode);
SDValue StoredVal = St->getValue();		unsigned StoreValOpcode = St->getValue().getOpcode();
// This restriction could be loosened.		// This restriction could be loosened.
// Bail out if any stored values are not elements extracted from a vector.		// Bail out if any stored values are not elements extracted from a vector.
// It should be possible to handle mixed sources, but load sources need		// It should be possible to handle mixed sources, but load sources need
// more careful handling (see the block of code below that handles		// more careful handling (see the block of code below that handles
// consecutive loads).		// consecutive loads).
if (StoredVal.getOpcode() != ISD::EXTRACT_VECTOR_ELT)		if (StoreValOpcode != ISD::EXTRACT_VECTOR_ELT &&
		StoreValOpcode != ISD::EXTRACT_SUBVECTOR)
return false;		return false;

// Find a legal type for the vector store.		// Find a legal type for the vector store.
EVT Ty = EVT::getVectorVT(Context, MemVT, i+1);		unsigned Elts = i + 1;
		if (IsVec) {
		// When merging vector stores, get the total number of elements.
		Elts *= MemVT.getVectorNumElements();
		}
		EVT Ty = EVT::getVectorVT(*DAG.getContext(), MemVT.getScalarType(), Elts);
bool IsFast;		bool IsFast;
if (TLI.isTypeLegal(Ty) &&		if (TLI.isTypeLegal(Ty) &&
TLI.allowsMemoryAccess(Context, DL, Ty, FirstStoreAS,		TLI.allowsMemoryAccess(Context, DL, Ty, FirstStoreAS,
FirstStoreAlign, &IsFast) && IsFast)		FirstStoreAlign, &IsFast) && IsFast)
NumElem = i + 1;		NumStoresToMerge = i + 1;
}		}

return MergeStoresOfConstantsOrVecElts(StoreNodes, MemVT, NumElem,		return MergeStoresOfConstantsOrVecElts(StoreNodes, MemVT, NumStoresToMerge,
false, true);		false, true);
}		}

// Below we handle the case of multiple consecutive stores that		// Below we handle the case of multiple consecutive stores that
// come from multiple consecutive loads. We merge them into a single		// come from multiple consecutive loads. We merge them into a single
// wide load and a single wide store.		// wide load and a single wide store.

// Look for load nodes which are used by the stored values.		// Look for load nodes which are used by the stored values.

lib/Target/AArch64/AArch64ISelLowering.cpp

}		}

bool AArch64TargetLowering::allowsMisalignedMemoryAccesses(EVT VT,		bool AArch64TargetLowering::allowsMisalignedMemoryAccesses(EVT VT,
unsigned AddrSpace,		unsigned AddrSpace,
unsigned Align,		unsigned Align,
bool *Fast) const {		bool *Fast) const {
if (Subtarget->requiresStrictAlign())		if (Subtarget->requiresStrictAlign())
return false;		return false;
// FIXME: True for Cyclone, but not necessary others.
if (Fast)		// FIXME: This is mostly true for Cyclone, but not necessarily others.
*Fast = true;		if (Fast) {
		// FIXME: Define an attribute for slow unaligned accesses instead of
		// relying on the CPU type as a proxy.
		// On Cyclone, unaligned 128-bit stores are slow.
		*Fast = !Subtarget->isCyclone() || VT.getSizeInBits() != 128 ||
		// See comments in performSTORECombine() for more details about
		// these conditions.

		// Code that uses clang vector extensions can mark that it
		// wants unaligned accesses to be treated as fast by
		// underspecifying alignment to be 1 or 2.
		Align <= 2 ||

		// Disregard v2i64. Memcpy lowering produces those and splitting
		// them regresses performance on micro-benchmarks and olden/bh.
		VT == MVT::v2i64;
		}
return true;		return true;
}		}

FastISel *		FastISel *
AArch64TargetLowering::createFastISel(FunctionLoweringInfo &funcInfo,		AArch64TargetLowering::createFastISel(FunctionLoweringInfo &funcInfo,
const TargetLibraryInfo *libInfo) const {		const TargetLibraryInfo *libInfo) const {
return AArch64::createFastISel(funcInfo, libInfo);		return AArch64::createFastISel(funcInfo, libInfo);
}		}
static SDValue performSTORECombine(SDNode *N,
const AArch64Subtarget *Subtarget) {		const AArch64Subtarget *Subtarget) {
if (!DCI.isBeforeLegalize())		if (!DCI.isBeforeLegalize())
return SDValue();		return SDValue();

StoreSDNode *S = cast<StoreSDNode>(N);		StoreSDNode *S = cast<StoreSDNode>(N);
if (S->isVolatile())		if (S->isVolatile())
return SDValue();		return SDValue();

		// FIXME: The logic for deciding if an unaligned store should be split should
		// be included in TLI.allowsMisalignedMemoryAccesses(), and there should be
		// a call to that function here.

// Cyclone has bad performance on unaligned 16B stores when crossing line and		// Cyclone has bad performance on unaligned 16B stores when crossing line and
// page boundaries. We want to split such stores.		// page boundaries. We want to split such stores.
if (!Subtarget->isCyclone())		if (!Subtarget->isCyclone())
return SDValue();		return SDValue();

// Don't split at -Oz.		// Don't split at -Oz.
if (DAG.getMachineFunction().getFunction()->optForMinSize())		if (DAG.getMachineFunction().getFunction()->optForMinSize())
return SDValue();		return SDValue();

test/CodeGen/AArch64/merge-store.ll

	; RUN: llc -march aarch64 %s -o - | FileCheck %s			; RUN: llc -march aarch64 %s -o - | FileCheck %s
				; RUN: llc < %s -mtriple=aarch64-unknown-unknown -mcpu=cyclone | FileCheck %s --check-prefix=CYCLONE

	@g0 = external global <3 x float>, align 16			@g0 = external global <3 x float>, align 16
	@g1 = external global <3 x float>, align 4			@g1 = external global <3 x float>, align 4

	; CHECK: ldr s[[R0:[0-9]+]], {{\[}}[[R1:x[0-9]+]]{{\]}}, #4			; CHECK: ldr s[[R0:[0-9]+]], {{\[}}[[R1:x[0-9]+]]{{\]}}, #4
	; CHECK: ld1{{\.?s?}} { v[[R0]]{{\.?s?}} }[1], {{\[}}[[R1]]{{\]}}			; CHECK: ld1{{\.?s?}} { v[[R0]]{{\.?s?}} }[1], {{\[}}[[R1]]{{\]}}
	; CHECK: str d[[R0]]			; CHECK: str d[[R0]]

	define void @blam() {			define void @blam() {
	%tmp4 = getelementptr inbounds <3 x float>, <3 x float>* @g1, i64 0, i64 0			%tmp4 = getelementptr inbounds <3 x float>, <3 x float>* @g1, i64 0, i64 0
	%tmp5 = load <3 x float>, <3 x float>* @g0, align 16			%tmp5 = load <3 x float>, <3 x float>* @g0, align 16
	%tmp6 = extractelement <3 x float> %tmp5, i64 0			%tmp6 = extractelement <3 x float> %tmp5, i64 0
	store float %tmp6, float* %tmp4			store float %tmp6, float* %tmp4
	%tmp7 = getelementptr inbounds float, float* %tmp4, i64 1			%tmp7 = getelementptr inbounds float, float* %tmp4, i64 1
	%tmp8 = load <3 x float>, <3 x float>* @g0, align 16			%tmp8 = load <3 x float>, <3 x float>* @g0, align 16
	%tmp9 = extractelement <3 x float> %tmp8, i64 1			%tmp9 = extractelement <3 x float> %tmp8, i64 1
	store float %tmp9, float* %tmp7			store float %tmp9, float* %tmp7
	ret void;			ret void;
	}			}


				; PR21711 - Merge vector stores into wider vector stores.

				; On Cyclone, the stores should not get merged into a 16-byte store because
				; unaligned 16-byte stores are slow. This test would infinite loop when
				; the fastness of unaligned accesses was not specified correctly.

				define void @merge_vec_extract_stores(<4 x float> %v1, <2 x float>* %ptr) {
				%idx0 = getelementptr inbounds <2 x float>, <2 x float>* %ptr, i64 3
				%idx1 = getelementptr inbounds <2 x float>, <2 x float>* %ptr, i64 4

				%shuffle0 = shufflevector <4 x float> %v1, <4 x float> undef, <2 x i32> <i32 0, i32 1>
				%shuffle1 = shufflevector <4 x float> %v1, <4 x float> undef, <2 x i32> <i32 2, i32 3>

				store <2 x float> %shuffle0, <2 x float>* %idx0, align 8
				store <2 x float> %shuffle1, <2 x float>* %idx1, align 8
				ret void

				; CHECK-LABEL: merge_vec_extract_stores
				; CHECK: stur q0, [x0, #24]
				; CHECK-NEXT: ret

				; CYCLONE-LABEL: merge_vec_extract_stores
				; CYCLONE: ext v1.16b, v0.16b, v0.16b, #8
				; CYCLONE-NEXT: str d0, [x0, #24]
				; CYCLONE-NEXT: str d1, [x0, #32]
				; CYCLONE-NEXT: ret
				}

test/CodeGen/X86/MergeConsecutiveStores.ll

define void @merge_vec_extract_stores(<8 x float> %v1, <8 x float> %v2, <4 x float>* %ptr) {
%shuffle3 = shufflevector <8 x float> %v2, <8 x float> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>		%shuffle3 = shufflevector <8 x float> %v2, <8 x float> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
store <4 x float> %shuffle0, <4 x float>* %idx0, align 16		store <4 x float> %shuffle0, <4 x float>* %idx0, align 16
store <4 x float> %shuffle1, <4 x float>* %idx1, align 16		store <4 x float> %shuffle1, <4 x float>* %idx1, align 16
store <4 x float> %shuffle2, <4 x float>* %idx2, align 16		store <4 x float> %shuffle2, <4 x float>* %idx2, align 16
store <4 x float> %shuffle3, <4 x float>* %idx3, align 16		store <4 x float> %shuffle3, <4 x float>* %idx3, align 16
ret void		ret void

; CHECK-LABEL: merge_vec_extract_stores		; CHECK-LABEL: merge_vec_extract_stores
; CHECK: vmovaps %xmm0, 48(%rdi)		; CHECK: vmovups %ymm0, 48(%rdi)
; CHECK-NEXT: vextractf128 $1, %ymm0, 64(%rdi)		; CHECK-NEXT: vmovups %ymm1, 80(%rdi)
; CHECK-NEXT: vmovaps %xmm1, 80(%rdi)
; CHECK-NEXT: vextractf128 $1, %ymm1, 96(%rdi)
; CHECK-NEXT: vzeroupper		; CHECK-NEXT: vzeroupper
; CHECK-NEXT: retq		; CHECK-NEXT: retq
}		}

; Merging vector stores when sourced from vector loads is not currently handled.		; Merging vector stores when sourced from vector loads is not currently handled.
define void @merge_vec_stores_from_loads(<4 x float>* %v, <4 x float>* %ptr) {		define void @merge_vec_stores_from_loads(<4 x float>* %v, <4 x float>* %ptr) {
%load_idx0 = getelementptr inbounds <4 x float>, <4 x float>* %v, i64 0		%load_idx0 = getelementptr inbounds <4 x float>, <4 x float>* %v, i64 0
%load_idx1 = getelementptr inbounds <4 x float>, <4 x float>* %v, i64 1		%load_idx1 = getelementptr inbounds <4 x float>, <4 x float>* %v, i64 1
