This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Fix lowering of misaligned memcpy/memset
Closed · Public

Authored by john.brawn on May 23 2017, 7:40 AM.

Details

Reviewers

arsenm
rengolin
lhames
efriedma

Commits

rG9009d2905deb: [ARM] Fix lowering of misaligned memcpy/memset
rL303990: [ARM] Fix lowering of misaligned memcpy/memset

Summary

Currently getOptimalMemOpType returns i32 for large enough sizes without checking alignment. When misaligned accesses aren't permitted this leads to poor code generation: we generate a word store and later split it into byte stores. As a result we inadvertently exceed the MaxStoresPerMemcpy limit, and for memset we splat the memset value into a word and then immediately split it up again.

Fix this by leaving it up to FindOptimalMemOpLowering to figure out which type to use, but also fix a bug there where it wasn't correctly checking if misaligned memory accesses are allowed.
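To make the miscount concrete, here is a minimal standalone model of the situation the summary describes. It is not LLVM code: `plannedOps`, `actualStores`, `Target`, and the limit of 4 are hypothetical names and values chosen for illustration. It assumes that a store wider than the known alignment is split into alignment-sized pieces when misaligned accesses are not permitted.

```cpp
#include <cassert>

// Hypothetical model, not LLVM code. All widths and sizes are in bytes.
struct Target {
  bool AllowsMisaligned; // false models a strict-alignment target
};

// Illustrative limit only; the real per-target limits are not modeled here.
constexpr unsigned kHypotheticalLimit = 4;

// Number of operations the planner believes it emits when it greedily covers
// Size bytes with stores of at most Width bytes, halving the width for the
// tail. Precondition: Width is a power of two and at least 1.
constexpr unsigned plannedOps(unsigned Size, unsigned Width) {
  unsigned Ops = 0;
  while (Size != 0) {
    while (Width > Size)
      Width /= 2;
    Size -= Width;
    ++Ops;
  }
  return Ops;
}

// Number of stores left after splitting: each store wider than Align becomes
// Width / Align smaller stores when misaligned access is not permitted.
// Precondition: Width and Align are powers of two and at least 1.
constexpr unsigned actualStores(unsigned Size, unsigned Width, unsigned Align,
                                Target T) {
  unsigned Ops = 0;
  while (Size != 0) {
    while (Width > Size)
      Width /= 2;
    Ops += (!T.AllowsMisaligned && Width > Align) ? Width / Align : 1;
    Size -= Width;
  }
  return Ops;
}

// 7 bytes, alignment 1, strict alignment, planner told to use words:
// the plan passes the limit check, but 7 byte stores are really emitted.
static_assert(plannedOps(7, 4) == 3, "plan: word + halfword + byte");
static_assert(plannedOps(7, 4) <= kHypotheticalLimit, "limit check passes");
static_assert(actualStores(7, 4, 1, Target{false}) == 7, "7 byte stores");

// If the planner picks byte width up front, the plan matches reality and the
// limit check can reject the inline expansion.
static_assert(plannedOps(7, 1) == actualStores(7, 1, 1, Target{false}),
              "plan and result agree");
static_assert(plannedOps(7, 1) > kHypotheticalLimit, "limit check fails");
```

In this model the planner only sees the true cost once the width it plans with already respects the alignment. That is in line with the updated CHECK-T1 expectation for t5 in memcpy-inline.ll, where a 7-byte copy with alignment 1 is now expected to call memcpy.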

Diff Detail

Repository: rL LLVM

Event Timeline

john.brawn created this revision. May 23 2017, 7:40 AM

Herald added subscribers: eraman, javed.absar, wdng, aemerson. May 23 2017, 7:40 AM

Please regenerate the tests in a separate commit, to make it clear what this patch is actually changing. Please change the description to make it clear what this is actually fixing; as far as I can tell, we aren't generating "wrong code", just inlining memcpy and memset calls too aggressively.

john.brawn edited the summary of this revision. May 24 2017, 6:22 AM

In D33442#762325, @efriedma wrote:

Please regenerate the tests in a separate commit, to make it clear what this patch is actually changing.

I've moved the test changes that aren't directly due to this code change to D33495.

Please change the description to make it clear what this is actually fixing; as far as I can tell, we aren't generating "wrong code", just inlining memcpy and memset calls too aggressively.

Done

efriedma added inline comments. May 24 2017, 11:06 AM

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
4782	The use of getPointerTy() here seems dubious. On many architectures, you can use getPointerTy as a rough proxy for the largest legal integer type, but that's not universal, and the usage of DstAS doesn't make any sense. Maybe just loop over {i64, i32, i16}.
4796	The use of isTypeLegal() here looks suspicious; i16 is not legal on ARM, but we definitely want to use i16 stores if we can. Do you have any testcases where the known alignment is two?

john.brawn added inline comments. May 25 2017, 6:42 AM

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
4782	I'd thought about doing that but instead went with the simpler change. I'll give the loop approach a try.
4796	This loop is actually finding the largest legal type, i.e. i32 on ARM, so that if VT is larger than that it's used instead. Adding a test sounds like a good idea though.
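The point made in the reply about line 4796 can be sketched in isolation. The model below is an assumption-laden illustration, not the LLVM implementation: `isLegalWidth` is a hypothetical stand-in for `TLI.isTypeLegal()` on a 32-bit ARM-like target, and widths are plain bit counts instead of `MVT` values.

```cpp
#include <cassert>

// Hypothetical stand-in for TLI.isTypeLegal(): in this model i32 is the only
// legal integer width, so i64 and i16 are both reported as not legal.
constexpr bool isLegalWidth(unsigned Bits) { return Bits == 32; }

// Find the largest legal integer width by stepping down from i64, then use it
// only as an upper bound on the chosen width.
constexpr unsigned clampToLargestLegal(unsigned VTBits) {
  unsigned LVT = 64;
  while (!isLegalWidth(LVT))
    LVT /= 2; // models stepping from i64 down to the next smaller integer type
  return VTBits > LVT ? LVT : VTBits;
}

// A 64-bit choice is narrowed to the largest legal width.
static_assert(clampToLargestLegal(64) == 32, "i64 is clamped to i32");
// A 16-bit choice is left alone even though i16 is not legal in this model,
// because the loop only computes an upper bound.
static_assert(clampToLargestLegal(16) == 16, "i16 is not widened or dropped");
static_assert(clampToLargestLegal(8) == 8, "i8 is left alone");
```

Under these assumptions the legality query never rejects a narrow type such as i16; it only caps types wider than the largest legal one.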

Adjust FindOptimalMemOpLowering to use a loop instead of the pointer type, and add a 2-byte aligned test.
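A rough sketch of the loop idea, following the reviewer's suggestion to try i64, i32, and i16 in turn, might look like the following. This is a standalone model under stated assumptions and the committed change may differ in detail: `pickStoreWidth`, `strictAlign`, and `anyAlign` are hypothetical names, widths are in bytes, and the special meaning of a zero alignment is not modeled.

```cpp
#include <cassert>

// Hypothetical per-width queries standing in for
// TLI.allowsMisalignedMemoryAccesses(). The width parameter is unused in
// these two simple policies but shows that the query is made per type.
constexpr bool strictAlign(unsigned /*WidthBytes*/) { return false; }
constexpr bool anyAlign(unsigned /*WidthBytes*/) { return true; }

using MisalignedQuery = bool (*)(unsigned);

// Pick the widest store width (in bytes) that is either covered by the known
// destination alignment or permitted as a misaligned access.
constexpr unsigned pickStoreWidth(unsigned DstAlign,
                                  MisalignedQuery AllowsMisaligned) {
  unsigned Width = 8; // start at i64
  while (Width > 1 && DstAlign < Width && !AllowsMisaligned(Width))
    Width /= 2; // i64 -> i32 -> i16 -> i8
  return Width;
}

// With strict alignment the width follows the known alignment.
static_assert(pickStoreWidth(1, strictAlign) == 1, "align 1 gives i8");
static_assert(pickStoreWidth(2, strictAlign) == 2, "align 2 gives i16");
static_assert(pickStoreWidth(4, strictAlign) == 4, "align 4 gives i32");
static_assert(pickStoreWidth(8, strictAlign) == 8, "align 8 gives i64");
// When misaligned accesses are allowed the widest candidate is kept.
static_assert(pickStoreWidth(1, anyAlign) == 8, "misaligned i64 is allowed");
```

The result would still be capped afterwards by the largest legal integer type, as discussed in the inline comments on line 4796.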

LGTM.

This revision is now accepted and ready to land. May 25 2017, 12:12 PM

Closed by commit rL303990: [ARM] Fix lowering of misaligned memcpy/memset (authored by john.brawn). May 26 2017, 6:59 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/CodeGen/SelectionDAG/SelectionDAG.cpp	5 lines
lib/Target/ARM/ARMISelLowering.cpp	6 lines
test/CodeGen/ARM/memcpy-inline.ll	13 lines
test/CodeGen/ARM/memset-inline.ll	59 lines

Diff 99908

lib/CodeGen/SelectionDAG/SelectionDAG.cpp

static bool FindOptimalMemOpLowering(std::vector<EVT> &MemOps,
// means it's possible to change the alignment of the destination.		// means it's possible to change the alignment of the destination.
// 'MemcpyStrSrc' indicates whether the memcpy source is constant so it does		// 'MemcpyStrSrc' indicates whether the memcpy source is constant so it does
// not need to be loaded.		// not need to be loaded.
EVT VT = TLI.getOptimalMemOpType(Size, DstAlign, SrcAlign,		EVT VT = TLI.getOptimalMemOpType(Size, DstAlign, SrcAlign,
IsMemset, ZeroMemset, MemcpyStrSrc,		IsMemset, ZeroMemset, MemcpyStrSrc,
DAG.getMachineFunction());		DAG.getMachineFunction());

if (VT == MVT::Other) {		if (VT == MVT::Other) {
		EVT PointerVT = TLI.getPointerTy(DAG.getDataLayout(), DstAS);
if (DstAlign >= DAG.getDataLayout().getPointerPrefAlignment(DstAS) ||		if (DstAlign >= DAG.getDataLayout().getPointerPrefAlignment(DstAS) ||
TLI.allowsMisalignedMemoryAccesses(VT, DstAS, DstAlign)) {		TLI.allowsMisalignedMemoryAccesses(PointerVT, DstAS, DstAlign)) {
VT = TLI.getPointerTy(DAG.getDataLayout(), DstAS);		VT = PointerVT;
} else {		} else {
switch (DstAlign & 7) {		switch (DstAlign & 7) {
case 0: VT = MVT::i64; break;		case 0: VT = MVT::i64; break;
case 4: VT = MVT::i32; break;		case 4: VT = MVT::i32; break;
case 2: VT = MVT::i16; break;		case 2: VT = MVT::i16; break;
default: VT = MVT::i8; break;		default: VT = MVT::i8; break;
}		}
}		}

MVT LVT = MVT::i64;		MVT LVT = MVT::i64;
while (!TLI.isTypeLegal(LVT))		while (!TLI.isTypeLegal(LVT))
LVT = (MVT::SimpleValueType)(LVT.SimpleTy - 1);		LVT = (MVT::SimpleValueType)(LVT.SimpleTy - 1);
assert(LVT.isInteger());		assert(LVT.isInteger());

if (VT.bitsGT(LVT))		if (VT.bitsGT(LVT))
VT = LVT;		VT = LVT;
}		}

unsigned NumMemOps = 0;		unsigned NumMemOps = 0;

lib/Target/ARM/ARMISelLowering.cpp

	} else if (Size >= 8 &&			} else if (Size >= 8 &&
	(memOpAlign(SrcAlign, DstAlign, 8) ||			(memOpAlign(SrcAlign, DstAlign, 8) ||
	(allowsMisalignedMemoryAccesses(MVT::f64, 0, 1, &Fast) &&			(allowsMisalignedMemoryAccesses(MVT::f64, 0, 1, &Fast) &&
	Fast))) {			Fast))) {
	return MVT::f64;			return MVT::f64;
	}			}
	}			}

	// Lowering to i32/i16 if the size permits.
	if (Size >= 4)
	return MVT::i32;
	else if (Size >= 2)
	return MVT::i16;

	// Let the target-independent logic figure it out.			// Let the target-independent logic figure it out.
	return MVT::Other;			return MVT::Other;
	}			}

	bool ARMTargetLowering::isZExtFree(SDValue Val, EVT VT2) const {			bool ARMTargetLowering::isZExtFree(SDValue Val, EVT VT2) const {
	if (Val.getOpcode() != ISD::LOAD)			if (Val.getOpcode() != ISD::LOAD)
	return false;			return false;


test/CodeGen/ARM/memcpy-inline.ll

	define void @t1(i8* nocapture %C) nounwind {			define void @t1(i8* nocapture %C) nounwind {
	entry:			entry:
	; CHECK-LABEL: t1:			; CHECK-LABEL: t1:
	; CHECK: movs [[INC:r[0-9]+]], #15			; CHECK: movs [[INC:r[0-9]+]], #15
	; CHECK: vld1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r1], [[INC]]			; CHECK: vld1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r1], [[INC]]
	; CHECK: vst1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r0], [[INC]]			; CHECK: vst1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r0], [[INC]]
	; CHECK: vld1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r1]			; CHECK: vld1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r1]
	; CHECK: vst1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r0]			; CHECK: vst1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r0]
				; CHECK-T1-LABEL: t1:
				; CHECK-T1: bl _memcpy
	tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %C, i8* getelementptr inbounds ([31 x i8], [31 x i8]* @.str1, i64 0, i64 0), i64 31, i32 1, i1 false)			tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %C, i8* getelementptr inbounds ([31 x i8], [31 x i8]* @.str1, i64 0, i64 0), i64 31, i32 1, i1 false)
	ret void			ret void
	}			}

	define void @t2(i8* nocapture %C) nounwind {			define void @t2(i8* nocapture %C) nounwind {
	entry:			entry:
	; CHECK-LABEL: t2:			; CHECK-LABEL: t2:
	; CHECK: vld1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r1]!			; CHECK: vld1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r1]!
	; CHECK: movs [[INC:r[0-9]+]], #32			; CHECK: movs [[INC:r[0-9]+]], #32
	; CHECK: add.w r3, r0, #16			; CHECK: add.w r3, r0, #16
	; CHECK: vst1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r0], [[INC]]			; CHECK: vst1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r0], [[INC]]
	; CHECK: movw [[REG2:r[0-9]+]], #16716			; CHECK: movw [[REG2:r[0-9]+]], #16716
	; CHECK: movt [[REG2:r[0-9]+]], #72			; CHECK: movt [[REG2:r[0-9]+]], #72
	; CHECK: str [[REG2]], [r0]			; CHECK: str [[REG2]], [r0]
	; CHECK: vld1.64 {d{{[0-9]+}}, d{{[0-9]+}}}, [r1]			; CHECK: vld1.64 {d{{[0-9]+}}, d{{[0-9]+}}}, [r1]
	; CHECK: vst1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r3]			; CHECK: vst1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r3]
				; CHECK-T1-LABEL: t2:
				; CHECK-T1: bl _memcpy
	tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %C, i8* getelementptr inbounds ([36 x i8], [36 x i8]* @.str2, i64 0, i64 0), i64 36, i32 1, i1 false)			tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %C, i8* getelementptr inbounds ([36 x i8], [36 x i8]* @.str2, i64 0, i64 0), i64 36, i32 1, i1 false)
	ret void			ret void
	}			}

	define void @t3(i8* nocapture %C) nounwind {			define void @t3(i8* nocapture %C) nounwind {
	entry:			entry:
	; CHECK-LABEL: t3:			; CHECK-LABEL: t3:
	; CHECK: vld1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r1]!			; CHECK: vld1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r1]!
	; CHECK: vst1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r0]!			; CHECK: vst1.8 {d{{[0-9]+}}, d{{[0-9]+}}}, [r0]!
	; CHECK: vldr d{{[0-9]+}}, [r1]			; CHECK: vldr d{{[0-9]+}}, [r1]
	; CHECK: vst1.8 {d{{[0-9]+}}}, [r0]			; CHECK: vst1.8 {d{{[0-9]+}}}, [r0]
				; CHECK-T1-LABEL: t3:
				; CHECK-T1: bl _memcpy
	tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %C, i8* getelementptr inbounds ([24 x i8], [24 x i8]* @.str3, i64 0, i64 0), i64 24, i32 1, i1 false)			tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %C, i8* getelementptr inbounds ([24 x i8], [24 x i8]* @.str3, i64 0, i64 0), i64 24, i32 1, i1 false)
	ret void			ret void
	}			}

	define void @t4(i8* nocapture %C) nounwind {			define void @t4(i8* nocapture %C) nounwind {
	entry:			entry:
	; CHECK-LABEL: t4:			; CHECK-LABEL: t4:
	; CHECK: vld1.64 {[[REG3:d[0-9]+]], [[REG4:d[0-9]+]]}, [r1]			; CHECK: vld1.64 {[[REG3:d[0-9]+]], [[REG4:d[0-9]+]]}, [r1]
	; CHECK: vst1.8 {[[REG3]], [[REG4]]}, [r0]!			; CHECK: vst1.8 {[[REG3]], [[REG4]]}, [r0]!
	; CHECK: strh [[REG5:r[0-9]+]], [r0]			; CHECK: strh [[REG5:r[0-9]+]], [r0]
				; CHECK-T1-LABEL: t4:
				; CHECK-T1: bl _memcpy
	tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %C, i8* getelementptr inbounds ([18 x i8], [18 x i8]* @.str4, i64 0, i64 0), i64 18, i32 1, i1 false)			tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %C, i8* getelementptr inbounds ([18 x i8], [18 x i8]* @.str4, i64 0, i64 0), i64 18, i32 1, i1 false)
	ret void			ret void
	}			}

	define void @t5(i8* nocapture %C) nounwind {			define void @t5(i8* nocapture %C) nounwind {
	entry:			entry:
	; CHECK-LABEL: t5:			; CHECK-LABEL: t5:
	; CHECK: movs [[REG5:r[0-9]+]], #0			; CHECK: movs [[REG5:r[0-9]+]], #0
	; CHECK: strb [[REG5]], [r0, #6]			; CHECK: strb [[REG5]], [r0, #6]
	; CHECK: movw [[REG6:r[0-9]+]], #21587			; CHECK: movw [[REG6:r[0-9]+]], #21587
	; CHECK: strh [[REG6]], [r0, #4]			; CHECK: strh [[REG6]], [r0, #4]
	; CHECK: movw [[REG7:r[0-9]+]], #18500			; CHECK: movw [[REG7:r[0-9]+]], #18500
	; CHECK: movt [[REG7:r[0-9]+]], #22866			; CHECK: movt [[REG7:r[0-9]+]], #22866
	; CHECK: str [[REG7]]			; CHECK: str [[REG7]]
	; CHECK-T1-LABEL: t5:			; CHECK-T1-LABEL: t5:
	; CHECK-T1: movs [[TREG3:r[0-9]]],			; CHECK-T1: bl _memcpy
	; CHECK-T1: strb [[TREG3]],
	; CHECK-T1: movs [[TREG4:r[0-9]]],
	; CHECK-T1: strb [[TREG4]],
	tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %C, i8* getelementptr inbounds ([7 x i8], [7 x i8]* @.str5, i64 0, i64 0), i64 7, i32 1, i1 false)			tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %C, i8* getelementptr inbounds ([7 x i8], [7 x i8]* @.str5, i64 0, i64 0), i64 7, i32 1, i1 false)
	ret void			ret void
	}			}

	define void @t6() nounwind {			define void @t6() nounwind {
	entry:			entry:
	; CHECK-LABEL: t6:			; CHECK-LABEL: t6:
	; CHECK: vldr [[REG9:d[0-9]+]], [r0]			; CHECK: vldr [[REG9:d[0-9]+]], [r0]

test/CodeGen/ARM/memset-inline.ll

	; RUN: llc < %s -mtriple=thumbv7-apple-ios -mcpu=cortex-a8 -pre-RA-sched=source -disable-post-ra | FileCheck %s			; RUN: llc < %s -mtriple=thumbv7-apple-ios -mcpu=cortex-a8 -pre-RA-sched=source -disable-post-ra | FileCheck %s -check-prefix=CHECK-7A
				; RUN: llc < %s -mtriple=thumbv6m -pre-RA-sched=source -disable-post-ra -mattr=+strict-align | FileCheck %s -check-prefix=CHECK-6M

	define void @t1(i8* nocapture %c) nounwind optsize {			define void @t1(i8* nocapture %c) nounwind optsize {
	entry:			entry:
	; CHECK-LABEL: t1:			; CHECK-7A-LABEL: t1:
	; CHECK: movs r1, #0			; CHECK-7A: movs r1, #0
	; CHECK: strd r1, r1, [r0]			; CHECK-7A: strd r1, r1, [r0]
	; CHECK: str r1, [r0, #8]			; CHECK-7A: str r1, [r0, #8]
				; CHECK-6M-LABEL: t1:
				; CHECK-6M: movs r1, #0
				; CHECK-6M: str r1, [r0]
				; CHECK-6M: str r1, [r0, #4]
				; CHECK-6M: str r1, [r0, #8]
	call void @llvm.memset.p0i8.i64(i8* %c, i8 0, i64 12, i32 8, i1 false)			call void @llvm.memset.p0i8.i64(i8* %c, i8 0, i64 12, i32 8, i1 false)
	ret void			ret void
	}			}

	define void @t2() nounwind ssp {			define void @t2() nounwind ssp {
	entry:			entry:
	; CHECK-LABEL: t2:			; CHECK-7A-LABEL: t2:
	; CHECK: vmov.i32 {{q[0-9]+}}, #0x0			; CHECK-7A: vmov.i32 {{q[0-9]+}}, #0x0
	; CHECK: movs r1, #10			; CHECK-7A: movs r1, #10
	; CHECK: vst1.64 {d{{[0-9]+}}, d{{[0-9]+}}}, [r2], r1			; CHECK-7A: vst1.64 {d{{[0-9]+}}, d{{[0-9]+}}}, [r2], r1
	; CHECK: vst1.16 {d{{[0-9]+}}, d{{[0-9]+}}}, [r2]			; CHECK-7A: vst1.16 {d{{[0-9]+}}, d{{[0-9]+}}}, [r2]
				; CHECK-6M-LABEL: t2:
				; CHECK-6M: movs [[REG:r[0-9]+]], #0
				; CHECK-6M: str [[REG]], [sp, #20]
				; CHECK-6M: str [[REG]], [sp, #16]
				; CHECK-6M: str [[REG]], [sp, #12]
				; CHECK-6M: str [[REG]], [sp, #8]
				; CHECK-6M: str [[REG]], [sp, #4]
				; CHECK-6M: str [[REG]], [sp]
	%buf = alloca [26 x i8], align 1			%buf = alloca [26 x i8], align 1
	%0 = getelementptr inbounds [26 x i8], [26 x i8]* %buf, i32 0, i32 0			%0 = getelementptr inbounds [26 x i8], [26 x i8]* %buf, i32 0, i32 0
	call void @llvm.memset.p0i8.i32(i8* %0, i8 0, i32 26, i32 1, i1 false)			call void @llvm.memset.p0i8.i32(i8* %0, i8 0, i32 26, i32 1, i1 false)
	call void @something(i8* %0) nounwind			call void @something(i8* %0) nounwind
	ret void			ret void
	}			}

				define void @t3(i8* %p) {
				entry:
				; CHECK-7A-LABEL: t3:
				; CHECK-7A: muls [[REG:r[0-9]+]],
				; CHECK-7A: str [[REG]],
				; CHECK-6M-LABEL: t3:
				; CHECK-6M: strb [[REG:r[0-9]+]],
				; CHECK-6M: strb [[REG]],
				; CHECK-6M: strb [[REG]],
				; CHECK-6M: strb [[REG]],
				br label %for.body

				for.body:
				%i = phi i32 [ 0, %entry ], [ %inc, %for.body ]
				%0 = trunc i32 %i to i8
				call void @llvm.memset.p0i8.i32(i8* %p, i8 %0, i32 4, i32 1, i1 false)
				call void @something(i8* %p)
				%inc = add nuw nsw i32 %i, 1
				%exitcond = icmp eq i32 %inc, 255
				br i1 %exitcond, label %for.end, label %for.body

				for.end:
				ret void
				}

	declare void @something(i8*) nounwind			declare void @something(i8*) nounwind
	declare void @llvm.memset.p0i8.i32(i8* nocapture, i8, i32, i32, i1) nounwind			declare void @llvm.memset.p0i8.i32(i8* nocapture, i8, i32, i32, i1) nounwind
	declare void @llvm.memset.p0i8.i64(i8* nocapture, i8, i64, i32, i1) nounwind			declare void @llvm.memset.p0i8.i64(i8* nocapture, i8, i64, i32, i1) nounwind
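
The CHECK-7A lines for t3 in memset-inline.ll expect a `muls` followed by a single `str`, while the CHECK-6M lines expect four `strb` instructions of the same register. One way to read the `muls` is as the splat mentioned in the summary: a non-constant memset byte can be replicated into every byte of a word by multiplying it by 0x01010101. The sketch below shows only that arithmetic; `splatByte` is a hypothetical helper name, not an LLVM function.

```cpp
#include <cassert>
#include <cstdint>

// Replicate one byte into all four bytes of a 32-bit word. Each partial
// product occupies its own byte lane, so no carries cross between lanes.
constexpr std::uint32_t splatByte(std::uint8_t V) {
  return static_cast<std::uint32_t>(V) * 0x01010101u;
}

static_assert(splatByte(0x00) == 0x00000000u, "zero stays zero");
static_assert(splatByte(0xAB) == 0xABABABABu, "byte copied to every lane");
static_assert(splatByte(0xFF) == 0xFFFFFFFFu, "largest byte does not overflow");
```

When the word store would only be split back into byte stores, as on the strict-alignment configuration, this multiply is wasted work, which is why the 6M expectations store the unsplatted byte four times instead.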