This is an archive of the discontinued LLVM Phabricator instance.

[x86] fix allowsMisalignedMemoryAccess() implementation
ClosedPublic

Authored by spatel on Jun 23 2015, 10:33 AM.

Details

Reviewers

jyknight
qcolombet
hfinkel

Commits

rGed502905f7c2: [x86] fix allowsMisalignedMemoryAccess() implementation
rL245075: [x86] fix allowsMisalignedMemoryAccess() implementation

Summary

The ultimate motivation for this patch is to fix the part of PR21711 ( https://llvm.org/bugs/show_bug.cgi?id=21711#c12 ) that is still not working. To get there, I'd like to use TLI.allowsMemoryAccess() in DAGCombiner's MergeConsecutiveStores(). This will require fixing bugs in x86, AArch64 (see post-commit thread for r227242) and possibly other targets.

This patch fixes the x86 implementation of allowsMisalignedMemoryAccess() to correctly return the 'Fast' output parameter for 32-byte accesses. To test that, an existing load merging optimization is changed to use the TLI hook. This exposes a shortcoming in the current logic and results in the regression test update. Changing other direct users of the isUnalignedMem32Slow() x86 CPU attribute would be a follow-on patch.
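The size-dependent 'Fast' reporting described above can be sketched outside of LLVM as a small model. This is only an illustration: `SubtargetModel` and `allowsMisalignedAccessModel` are hypothetical names, and the two booleans merely stand in for the isUnalignedMemAccessFast() and isUnalignedMem32Slow() subtarget attributes.

```cpp
#include <cassert>

// Hypothetical stand-in for the two x86 subtarget attributes discussed above.
struct SubtargetModel {
  bool UnalignedMemAccessFast; // models isUnalignedMemAccessFast()
  bool UnalignedMem32Slow;     // models isUnalignedMem32Slow()
};

// Models the patched hook: a misaligned access is always legal, and only the
// speed report depends on the access size.
inline bool allowsMisalignedAccessModel(const SubtargetModel &ST,
                                        unsigned SizeInBits, bool *Fast) {
  if (Fast) {
    if (SizeInBits == 256)
      *Fast = !ST.UnalignedMem32Slow;     // 32-byte accesses have their own flag
    else
      *Fast = ST.UnalignedMemAccessFast;  // everything else uses the old flag
  }
  return true;
}
```

With a configuration where the first flag is set and 32-byte accesses are marked slow, a 256-bit query reports legal but not fast, while a 128-bit query still reports fast.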

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 28252. Jun 23 2015, 10:33 AM

spatel retitled this revision to [x86] fix allowsMisalignedMemoryAccess() implementation.

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: qcolombet, jyknight, hfinkel.

spatel added a subscriber: Unknown Object (MLST).

Herald added a subscriber: aemerson. Jun 23 2015, 10:33 AM

Ping.

qcolombet added inline comments. Jun 30 2015, 5:07 PM

lib/Target/X86/X86ISelLowering.cpp
1802 ↗	(On Diff #28252)	Shouldn't we check the alignment from the data layout?

jyknight added inline comments. Jul 1 2015, 8:58 AM

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
4131 ↗	(On Diff #28252)	It still might actually have an alignment greater than 1 though, right? (I guess it probably doesn't really matter much, though.)
lib/Target/X86/X86ISelLowering.cpp
1802 ↗	(On Diff #28252)	Why does this even have a check for correct alignment at all? The function is "allowsMisalignedMemoryAccesses" -- the assumption being you already know your data isn't aligned. I think it's the caller's responsibility to call DataLayout::getPrefTypeAlignment, isn't it? Perhaps the static "allowableAlignment" helper function in DAGCombiner.cpp should be made more generally available, to make doing so easier.

qcolombet added inline comments. Jul 1 2015, 10:07 AM

lib/Target/X86/X86ISelLowering.cpp
1802 ↗	(On Diff #28252)	My bad, you're right.

spatel added inline comments. Jul 1 2015, 11:56 AM

lib/Target/X86/X86ISelLowering.cpp
1802 ↗	(On Diff #28252)	An audit of the trunk overrides of this function shows that only the SI lowering in the AMDGPU backend actually makes use of the Align param. Based on that implementation, it looks like the intended usage of the param is to specify how bad the misalignment can be (an 8-byte access with only 4-byte alignment is ok on that target). There are a few spots in LegalizeDAG and SelectionDAG that check the DataLayout before calling allowsMisalignedMemoryAccesses(), so making allowableAlignment() more accessible sounds like a good change to me. I'll work on that and then fix this patch up. Thanks!
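The division of labor discussed in this thread (the caller checks the data layout first, and the target hook is asked only about accesses that really are under-aligned) might look roughly like the sketch below. All names are hypothetical, and `ABIAlign` is an assumed stand-in for whatever alignment the DataLayout reports for the type; this is not the helper that was eventually committed.

```cpp
#include <cassert>
#include <functional>

// Hypothetical callback standing in for a target's
// allowsMisalignedMemoryAccesses() override.
using MisalignedHook =
    std::function<bool(unsigned SizeInBytes, unsigned Align, bool *Fast)>;

// Caller-side check: a sufficiently aligned access is accepted without
// consulting the target; otherwise the decision is deferred to the hook.
inline bool allowsMemoryAccessModel(unsigned SizeInBytes, unsigned ABIAlign,
                                    unsigned Align, const MisalignedHook &Hook,
                                    bool *Fast) {
  if (Align >= ABIAlign) {
    if (Fast)
      *Fast = true; // assumption: an aligned access is treated as fast
    return true;
  }
  return Hook(SizeInBytes, Align, Fast);
}
```

A hook that accepts any access with at least 4-byte alignment would then allow an 8-byte access at 4-byte alignment and reject one at 1-byte alignment, which mirrors the "how bad can the misalignment be" reading of the Align parameter.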

spatel mentioned this in D10905: move DAGCombiner's allowableAlignment() helper function into the TLI. Jul 2 2015, 1:21 PM

spatel mentioned this in rL243549: move DAGCombiner's allowableAlignment() helper function into the TLI. Jul 29 2015, 11:24 AM

We now have a decent (if not perfect) TLI.allowsMemoryAccess() after r243549 (D10905). That makes this patch considerably simpler: change a load merging optimization to use the new hook and fix the 'fast' reporting for x86 misaligned 32-byte accesses.

Without the fix in allowsMisalignedMemoryAccesses(), we will infinite loop when targeting SandyBridge because LowerINSERT_SUBVECTOR() creates 32-byte loads from two 16-byte loads while PerformLOADCombine() splits them back into 16-byte loads.
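The ping-pong described above comes from two rewrites consulting different predicates. A toy model, with hypothetical names and no LLVM types, shows why the rewrites only reach a fixed point when the merging step and the splitting step agree on whether a 32-byte access is fast:

```cpp
#include <cassert>

// Toy state: the same 32 bytes are loaded either as two 16-byte loads or as
// one 32-byte load.
enum class LoadShape { TwoBy16, OneBy32 };

// Models the merging rewrite: merge when its predicate says 32 bytes is fast.
inline LoadShape mergeStep(LoadShape S, bool MergeThinks32Fast) {
  return (S == LoadShape::TwoBy16 && MergeThinks32Fast) ? LoadShape::OneBy32
                                                        : S;
}

// Models the splitting rewrite: split when its predicate says 32 bytes is slow.
inline LoadShape splitStep(LoadShape S, bool SplitThinks32Fast) {
  return (S == LoadShape::OneBy32 && !SplitThinks32Fast) ? LoadShape::TwoBy16
                                                         : S;
}

// Returns the number of rewrites applied before nothing changes, or -1 if the
// limit is reached without stabilizing.
inline int rewritesUntilStable(bool MergeThinks32Fast, bool SplitThinks32Fast,
                               int Limit = 100) {
  LoadShape S = LoadShape::TwoBy16;
  int Rewrites = 0;
  while (Rewrites < Limit) {
    LoadShape Next = mergeStep(S, MergeThinks32Fast);
    if (Next == S)
      Next = splitStep(S, SplitThinks32Fast);
    if (Next == S)
      return Rewrites; // fixed point reached
    S = Next;
    ++Rewrites;
  }
  return -1; // the two rewrites keep undoing each other
}
```

When the two flags disagree in the direction described (merge believes "fast", split believes "slow"), the model never stabilizes; when both read the same answer, it settles after at most one rewrite.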

spatel updated this object. Aug 12 2015, 2:42 PM

spatel updated this object.

jyknight added inline comments. Aug 14 2015, 8:25 AM

lib/Target/X86/X86ISelLowering.cpp
1910 ↗	(On Diff #31982)	This (pre-existing code!) seems really wrong. "isUnalignedMemAccessFast" is a very-poorly-named predicate, which only really is intended to indicate whether unaligned SSE 16-byte memory accesses are fast. I believe unaligned access of all other sizes should always be treated as fast on x86. Does it break anything if you fix that too while you're in here?

spatel added inline comments. Aug 14 2015, 8:39 AM

lib/Target/X86/X86ISelLowering.cpp
1910 ↗	(On Diff #31982)	I agree that the name is wrong; I added that FIXME note in x86.td. :) And yes, the logic here looks quite wrong to me. I'm working on a possibly related bug in PR24449. If it's alright with you, I'd like to get this patch in since it's a small independent fix. Then, I'll make sure the 16-byte and under checks are working as intended and put a patch up for review for that.

jyknight accepted this revision. Aug 14 2015, 10:09 AM

jyknight edited edge metadata.

This revision is now accepted and ready to land. Aug 14 2015, 10:09 AM

Closed by commit rL245075: [x86] fix allowsMisalignedMemoryAccess() implementation (authored by spatel). Aug 14 2015, 10:54 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path: llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
Size: 37 lines

Path: llvm/trunk/test/CodeGen/X86/unaligned-32-byte-memops.ll
Size: 6 lines

Diff 32165

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

@@ ... @@ bool X86TargetLowering::isSafeMemOpType(MVT VT) const {
   return true;
 }

 bool
 X86TargetLowering::allowsMisalignedMemoryAccesses(EVT VT,
                                                   unsigned,
                                                   unsigned,
                                                   bool *Fast) const {
-  if (Fast)
-    *Fast = Subtarget->isUnalignedMemAccessFast();
+  if (Fast) {
+    // FIXME: We should be checking 128-bit accesses separately from smaller
+    // accesses.
+    if (VT.getSizeInBits() == 256)
+      *Fast = !Subtarget->isUnalignedMem32Slow();
+    else
+      *Fast = Subtarget->isUnalignedMemAccessFast();
+  }
   return true;
 }

 /// Return the entry encoding for a jump table in the
 /// current function. The returned value is a member of the
 /// MachineJumpTableInfo::JTEntryKind enum.
 unsigned X86TargetLowering::getJumpTableEncoding() const {
   // In GOT pic mode, each entry in the jump table is emitted as a @GOTOFF
@@ ... @@ static SDValue LowerINSERT_SUBVECTOR(SDValue Op, const X86Subtarget *Subtarget,
   MVT SubVecVT = SubVec.getSimpleValueType();

   // Fold two 16-byte subvector loads into one 32-byte load:
   // (insert_subvector (insert_subvector undef, (load addr), 0),
   //                   (load addr + 16), Elts/2)
   // --> load32 addr
   if ((IdxVal == OpVT.getVectorNumElements() / 2) &&
       Vec.getOpcode() == ISD::INSERT_SUBVECTOR &&
-      OpVT.is256BitVector() && SubVecVT.is128BitVector() &&
-      !Subtarget->isUnalignedMem32Slow()) {
-    SDValue SubVec2 = Vec.getOperand(1);
-    if (auto *Idx2 = dyn_cast<ConstantSDNode>(Vec.getOperand(2))) {
-      if (Idx2->getZExtValue() == 0) {
-        SDValue Ops[] = { SubVec2, SubVec };
-        if (SDValue Ld = EltsFromConsecutiveLoads(OpVT, Ops, dl, DAG, false))
-          return Ld;
-      }
-    }
-  }
+      OpVT.is256BitVector() && SubVecVT.is128BitVector()) {
+    auto *Idx2 = dyn_cast<ConstantSDNode>(Vec.getOperand(2));
+    if (Idx2 && Idx2->getZExtValue() == 0) {
+      SDValue SubVec2 = Vec.getOperand(1);
+      // If needed, look through a bitcast to get to the load.
+      if (SubVec2.getNode() && SubVec2.getOpcode() == ISD::BITCAST)
+        SubVec2 = SubVec2.getOperand(0);
+
+      if (auto *FirstLd = dyn_cast<LoadSDNode>(SubVec2)) {
+        bool Fast;
+        unsigned Alignment = FirstLd->getAlignment();
+        unsigned AS = FirstLd->getAddressSpace();
+        const X86TargetLowering *TLI = Subtarget->getTargetLowering();
+        if (TLI->allowsMemoryAccess(*DAG.getContext(), DAG.getDataLayout(),
+                                    OpVT, AS, Alignment, &Fast) && Fast) {
+          SDValue Ops[] = { SubVec2, SubVec };
+          if (SDValue Ld = EltsFromConsecutiveLoads(OpVT, Ops, dl, DAG, false))
+            return Ld;
+        }
+      }
+    }
+  }

   if ((OpVT.is256BitVector() || OpVT.is512BitVector()) &&
       SubVecVT.is128BitVector())
     return Insert128BitVector(Vec, SubVec, IdxVal, DAG, dl);

   if (OpVT.is512BitVector() && SubVecVT.is256BitVector())
     return Insert256BitVector(Vec, SubVec, IdxVal, DAG, dl);

llvm/trunk/test/CodeGen/X86/unaligned-32-byte-memops.ll

@@ ... @@ ; AVX2-NEXT: retq
   %ptr1 = getelementptr inbounds <4 x float>, <4 x float>* %ptr, i64 3
   %ptr2 = getelementptr inbounds <4 x float>, <4 x float>* %ptr, i64 4
   %v1 = load <4 x float>, <4 x float>* %ptr1, align 1
   %v2 = load <4 x float>, <4 x float>* %ptr2, align 1
   %v3 = shufflevector <4 x float> %v1, <4 x float> %v2, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
   ret <8 x float> %v3
 }

+; If the first load is 32-byte aligned, then the loads should be merged in all cases.
+
 define <8 x float> @combine_16_byte_loads_aligned(<4 x float>* %ptr) {
-;; FIXME: The first load is 32-byte aligned, so the second load should get merged.
 ; AVXSLOW-LABEL: combine_16_byte_loads_aligned:
 ; AVXSLOW: # BB#0:
-; AVXSLOW-NEXT: vmovaps 48(%rdi), %xmm0
-; AVXSLOW-NEXT: vinsertf128 $1, 64(%rdi), %ymm0, %ymm0
+; AVXSLOW-NEXT: vmovaps 48(%rdi), %ymm0
 ; AVXSLOW-NEXT: retq
 ;
 ; AVXFAST-LABEL: combine_16_byte_loads_aligned:
 ; AVXFAST: # BB#0:
 ; AVXFAST-NEXT: vmovaps 48(%rdi), %ymm0
 ; AVXFAST-NEXT: retq
 ;
 ; AVX2-LABEL: combine_16_byte_loads_aligned: