Details

Reviewers

eli.friedman
davide
javed.absar
sebpop
efriedma

Commits

rGf8425340e429: [AArch64] Fix PR32384: bump up the number of stores per memset and memcpy
rL333429: [AArch64] Fix PR32384: bump up the number of stores per memset and memcpy

Summary

As @eli.friedman suggested in https://bugs.llvm.org/show_bug.cgi?id=32384#c1, this makes the inlining of memset and memcpy more aggressive when compiling for speed. The tuning remains the same when optimizing for size.

The following experiment on an A-72 shows that several benchmarks benefit from this change when SPEC CPU2000 is compiled with -O3, at a low cost in code size.

A positive score is an improvement; a positive text value is an increase in text size:

Benchmark	Score	Text
spec2000/164.gzip	0.01%	-0.01%
spec2000/175.vpr	-0.46%	0.01%
spec2000/176.gcc	-0.28%	0.01%
spec2000/177.mesa	0.75%	0.08%
spec2000/179.art	0.39%	0.00%
spec2000/181.mcf	0.26%	0.00%
spec2000/183.equake	-0.34%	-0.01%
spec2000/186.crafty	0.09%	0.06%
spec2000/188.ammp	2.50%	0.01%
spec2000/197.parser	0.21%	0.00%
spec2000/252.eon	0.50%	1.62%
spec2000/253.perlbmk	1.67%	0.00%
spec2000/254.gap	-0.40%	0.01%
spec2000/255.vortex	-0.24%	0.00%
spec2000/256.bzip2	0.01%	0.00%
spec2000/300.twolf	0.59%	0.00%

Diff Detail

Repository: rL LLVM

Event Timeline

sebpop created this revision. · Mar 30 2018, 9:19 AM

Herald added subscribers: hiraditya, kristof.beyls, javed.absar, rengolin. · Mar 30 2018, 9:19 AM

Increasing this makes sense.

Should we check for hasNEON() here? The generic code doesn't know AArch64 has ldp/stp, so we might want to be a little more aggressive to compensate.

In D45098#1053201, @efriedma wrote:

Should we check for hasNEON() here? The generic code doesn't know AArch64 has ldp/stp, so we might want to be a little more aggressive to compensate.

Do you mean something like this? or something else?

if (Subtarget->hasNEON()) {
  MaxStoresPerMemset = 32;
  MaxStoresPerMemsetOptSize = 8;
  MaxStoresPerMemcpy = 16;
  MaxStoresPerMemcpyOptSize = 4;
  MaxStoresPerMemmove = 16;
  MaxStoresPerMemmoveOptSize = 4;
} else {
  MaxStoresPerMemset = MaxStoresPerMemsetOptSize = 8;
  MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = 4;
  MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = 4;
}

In D45098#1053201, @efriedma wrote:

Should we check for hasNEON() here? The generic code doesn't know AArch64 has ldp/stp, so we might want to be a little more aggressive to compensate.

Makes sense. Though LDP and STP are available to generic registers even when hasNEON() is false. The issue is whether AArch64LoadStoreOpt can form the pairs if the pairs are too far apart.

Typically, depending on the target, blocks of loads and stores should have the loads grouped together, followed by the stores.

Wait, never mind, it shouldn't matter whether we have NEON; we probably want to inline roughly the same number of instructions either way, and integer and vector registers have roughly equivalent ldp/stp instructions.

Looking at the generated code a bit, it looks like we do a really terrible job lowering memcpy; we don't form ldp/stp at all, ever. We should probably fix that before we mess with the threshold here; it could substantially change the codesize/performance impact of this change.

In D45098#1053266, @efriedma wrote:

Looking at the generated code a bit, it looks like we do a really terrible job lowering memcpy; we don't form ldp/stp at all, ever.

Yes, the inline code for memcpy does not look great: I was seeing a mix of ldr and str.

We should probably fix that before we mess with the threshold here; it could substantially change the codesize/performance impact of this change.

Agreed, let's measure the perf of this patch again after we improve the codegen for memcpy.

I just helped the compiler with restrict, and I see pretty good code generated for this example:

#include <string.h>

void fun(char * restrict in, char * restrict out) {
  memcpy(out, in, 100);
}

llvm produces:

	ldp	q0, q1, [x0, #64]
	stp	q0, q1, [x1, #64]
	ldp	q0, q1, [x0, #32]
	stp	q0, q1, [x1, #32]
	ldp	q0, q1, [x0]
	ldr	w8, [x0, #96]
	str	w8, [x1, #96]
	stp	q0, q1, [x1]
	ret

And here is the testcase I was looking at before, which produces the mix of ldr/str:

#include <string.h>

void fun(char *in, char *out) {
  memcpy(out, in, 100);
}

the mi-scheduler is unable to move ldr past str:

	ldr	w8, [x0, #96]
	str	w8, [x1, #96]
	ldr	q0, [x0, #80]
	str	q0, [x1, #80]
	ldr	q0, [x0, #64]
	str	q0, [x1, #64]
	ldr	q0, [x0, #48]
	str	q0, [x1, #48]
	ldr	q0, [x0, #32]
	str	q0, [x1, #32]
	ldr	q0, [x0, #16]
	str	q0, [x1, #16]
	ldr	q0, [x0]
	str	q0, [x1]
	ret

For this to work, the code generator expanding memcpy in getMemcpyLoadsAndStores()
needs to be amended to produce more than one ldr/str at a time.
The target should be able to specify the number of consecutive loads and stores to be produced.
In the case of generic aarch64 that should be 2 such that we can produce a ldp; stp; sequence.
For Exynos processors that should be a much higher number like 8 as it is better to have all loads and all stores scheduled together.
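
As a rough illustration of the proposed knob, the sketch below models the order of loads and stores for a given group size. It is not the code in getMemcpyLoadsAndStores(); `copySchedule` and its `groupSize` parameter are hypothetical names, and each 'L' or 'S' stands for one load or one store.

```cpp
#include <algorithm>
#include <string>

// Hypothetical model of the emission order for a memcpy expanded into
// `numOps` load/store pairs. Up to `groupSize` loads are emitted together,
// followed by the matching stores, and this repeats until all pairs are done.
std::string copySchedule(unsigned numOps, unsigned groupSize) {
  if (groupSize == 0)
    groupSize = 1; // Treat 0 as "one at a time" to avoid an endless loop.
  std::string order;
  for (unsigned done = 0; done < numOps; done += groupSize) {
    unsigned n = std::min(groupSize, numOps - done);
    order.append(n, 'L'); // n consecutive loads
    order.append(n, 'S'); // then the n matching stores
  }
  return order;
}
```

With a group size of 1 the model alternates loads and stores, as in the ldr/str listing above; with a group size of 2 adjacent loads and adjacent stores sit next to each other, which is the shape a later pass would need in order to form ldp and stp.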

Sirish is working on a patch for that.

mcrosier added a subscriber: mcrosier. · Apr 27 2018, 12:00 PM

Herald added a reviewer: javed.absar. · Apr 27 2018, 12:00 PM

xbolva00 added a subscriber: xbolva00. · Apr 27 2018, 12:06 PM

In D45098#1053266, @efriedma wrote:

Looking at the generated code a bit, it looks like we do a really terrible job lowering memcpy; we don't form ldp/stp at all, ever. We should probably fix that before we mess with the threshold here; it could substantially change the codesize/performance impact of this change.

Eli, could you please look at the patch that Sirish has posted https://reviews.llvm.org/D46477 to make the memcpy lowering produce ldp/stp?
Are there other things to be fixed to land this change?

It looks like D46477 has review comments which are waiting to be addressed? But the approach of changing the chains seems reasonable.

Increasing the maximum number of inlined instructions for memcpy and memset from 4 to 16 seems right; it's probably a good performance/size tradeoff. Not sure about memmove; you also have to worry about register pressure for that.
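
To see why the limit matters for the 100-byte memcpy discussed above, here is a minimal sketch of how a store-count limit gates inlining. It is a simplified model, not LLVM's actual lowering code: the names `countMemOps` and `shouldInline` are hypothetical, and the model assumes a greedy split into power-of-two chunks with no overlapping accesses and no alignment constraints.

```cpp
#include <cstdint>

// Hypothetical helper: number of load/store pairs needed to copy `size`
// bytes when the widest single access is `maxWidth` bytes (a power of two).
// Greedy split: use the widest access that still fits, then narrow it.
unsigned countMemOps(uint64_t size, unsigned maxWidth) {
  if (maxWidth == 0)
    maxWidth = 1; // Guard against a zero width, which would never finish.
  unsigned ops = 0;
  unsigned width = maxWidth;
  while (size > 0) {
    while (width > size)
      width /= 2; // Halve until the access fits in the remaining bytes.
    size -= width;
    ++ops;
  }
  return ops;
}

// Hypothetical decision: inline only if the expansion stays within the limit.
bool shouldInline(uint64_t size, unsigned maxWidth, unsigned maxStores) {
  return countMemOps(size, maxWidth) <= maxStores;
}
```

Under this model a 100-byte copy with 16-byte accesses needs 7 pairs (six 16-byte and one 4-byte, matching the ldr/str listing above), so it stays a library call with a limit of 4 and is inlined with a limit of 16.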

Following Eli's recommendation, the patch does not modify memmove.
I will post the updated numbers on top of the improved code generation
for memcpy: https://reviews.llvm.org/rL332482

dmgreen added a subscriber: dmgreen. · May 23 2018, 6:57 AM

The experiment takes the best CPU2000 score out of 3 runs on the A-72 of a Firefly device.
A positive value is a better score.

spec2000-164.gzip	0.03%
spec2000-175.vpr	-0.66%
spec2000-177.mesa	0.28%
spec2000-179.art	-0.96%
spec2000-181.mcf	-0.98%
spec2000-183.equake	-0.86%
spec2000-186.crafty	0.06%
spec2000-188.ammp	-0.62%
spec2000-197.parser	-0.61%
spec2000-252.eon	-0.05%
spec2000-253.perlbmk	0.47%
spec2000-254.gap	-0.68%
spec2000-255.vortex	0.04%
spec2000-256.bzip2	-0.75%
spec2000-300.twolf	2.30%

Those numbers look very different from the ones before. Is r332482 making this less profitable somehow? Or is the change all noise?

In D45098#1111446, @efriedma wrote:

Those numbers look very different from the ones before. Is r332482 making this less profitable somehow? Or is the change all noise?

I'll have to take a closer look at these CPU2000 results, but in proprietary benchmarks this change is still beneficial.

evandro set the repository for this revision to rL LLVM. · May 24 2018, 2:16 PM

evandro commandeered this revision. · May 24 2018, 2:22 PM

evandro edited reviewers, added: sebpop; removed: evandro.

Since @sebpop has just left for a deserved vacation, he asked me to babysit his pending patches.

Update test case to keep it from failing due to this change.

evandro mentioned this in D47349: [AArch64] Limit inlining string functions with strict alignment. · May 24 2018, 4:39 PM

evandro added a parent revision: D47349: [AArch64] Limit inlining string functions with strict alignment.

In D45098#1111593, @evandro wrote:

In D45098#1111446, @efriedma wrote:

Those numbers look very different from the ones before. Is r332482 making this less profitable somehow? Or is the change all noise?

I'll have to take a closer look at these CPU2000 results, but in proprietary benchmarks this change is still beneficial.

It's come down to noise. CPU2000 just lacks enough C++ code, where this change has been shown to be more beneficial. Again, in proprietary benchmarks, this change has yielded an aggregate improvement of around 1% overall and just shy of 5% in a few cases.

efriedma added inline comments. · May 25 2018, 10:31 AM

llvm/test/CodeGen/AArch64/arm64-memset-to-bzero.ll
4 ↗	(On Diff #148487)	This change doesn't make any sense to me; what are you trying to do?

evandro added inline comments. · May 25 2018, 11:26 AM

llvm/test/CodeGen/AArch64/arm64-memset-to-bzero.ll
4 ↗	(On Diff #148487)	This test doesn't seem to expect that any string function is inlined, so I added `-mattr=+strict-align` to prevent this.

efriedma added inline comments. · May 25 2018, 11:44 AM

llvm/test/CodeGen/AArch64/arm64-memset-to-bzero.ll
4 ↗	(On Diff #148487)	Those two aren't related...? I mean, yes, it has the right effect, but that's just a coincidence. Please mark the functions optsize instead.

evandro added inline comments. · May 25 2018, 11:58 AM

llvm/test/CodeGen/AArch64/arm64-memset-to-bzero.ll
4 ↗	(On Diff #148487)	<facepalm>Of course!</facepalm>

evandro removed a parent revision: D47349: [AArch64] Limit inlining string functions with strict alignment. · May 25 2018, 2:14 PM

evandro marked 4 inline comments as done.

Include the solution found in D47349.

Herald added a subscriber: eraman. · May 25 2018, 2:22 PM

efriedma added inline comments. · May 25 2018, 2:30 PM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
586 ↗	(On Diff #148670)	Useless comment.
llvm/lib/Target/AArch64/AArch64ISelLowering.h
243 ↗	(On Diff #148670)	No-op change?

evandro updated this revision to Diff 148673. · May 25 2018, 2:35 PM

evandro marked 2 inline comments as done.

LGTM

This revision is now accepted and ready to land. · May 25 2018, 2:37 PM

Thank you.

Closed by commit rL333429: [AArch64] Fix PR32384: bump up the number of stores per memset and memcpy (authored by evandro). · May 29 2018, 9:02 AM

This revision was automatically updated to reflect the committed changes.

Diff 148923

llvm/trunk/lib/Target/AArch64/AArch64ISelLowering.h

   unsigned getNumInterleavedAccesses(VectorType *VecTy,
                                      const DataLayout &DL) const;

   MachineMemOperand::Flags getMMOFlags(const Instruction &I) const override;

   bool functionArgumentNeedsConsecutiveRegisters(Type *Ty,
                                                  CallingConv::ID CallConv,
                                                  bool isVarArg) const override;
 private:
-  bool isExtFreeImpl(const Instruction *Ext) const override;
-
   /// Keep a pointer to the AArch64Subtarget around so that we can
   /// make the right decision when generating code for different targets.
   const AArch64Subtarget *Subtarget;

+  bool isExtFreeImpl(const Instruction *Ext) const override;
+
   void addTypeForNEON(MVT VT, MVT PromotedBitwiseVT);
   void addDRTypeForNEON(MVT VT);
   void addQRTypeForNEON(MVT VT);

   SDValue LowerFormalArguments(SDValue Chain, CallingConv::ID CallConv,
                                bool isVarArg,
                                const SmallVectorImpl<ISD::InputArg> &Ins,
                                const SDLoc &DL, SelectionDAG &DAG,

llvm/trunk/lib/Target/AArch64/AArch64ISelLowering.cpp

@@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
   setTargetDAGCombine(ISD::VSELECT);

   setTargetDAGCombine(ISD::INTRINSIC_VOID);
   setTargetDAGCombine(ISD::INTRINSIC_W_CHAIN);
   setTargetDAGCombine(ISD::INSERT_VECTOR_ELT);

   setTargetDAGCombine(ISD::GlobalAddress);

-  MaxStoresPerMemset = MaxStoresPerMemsetOptSize = 8;
+  // In case of strict alignment, avoid an excessive number of byte wide stores.
+  MaxStoresPerMemsetOptSize = 8;
+  MaxStoresPerMemset = Subtarget->requiresStrictAlign()
+                       ? MaxStoresPerMemsetOptSize : 32;

   MaxGluedStoresPerMemcpy = 4;
+  MaxStoresPerMemcpyOptSize = 4;
+  MaxStoresPerMemcpy = Subtarget->requiresStrictAlign()
+                       ? MaxStoresPerMemcpyOptSize : 16;

-  MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = 4;
-  MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = 4;
+  MaxStoresPerMemmoveOptSize = MaxStoresPerMemmove = 4;

   setStackPointerRegisterToSaveRestore(AArch64::SP);

   setSchedulingPreference(Sched::Hybrid);

   EnableExtLdPromotion = true;

   // Set required alignment.

llvm/trunk/test/CodeGen/AArch64/arm64-memset-to-bzero.ll

 ; RUN: llc %s -mtriple=arm64-apple-darwin -o - | \
-; RUN: FileCheck --check-prefix=CHECK-DARWIN --check-prefix=CHECK %s
+; RUN: FileCheck --check-prefixes=CHECK,CHECK-DARWIN %s
 ; RUN: llc %s -mtriple=arm64-linux-gnu -o - | \
-; RUN: FileCheck --check-prefix=CHECK-LINUX --check-prefix=CHECK %s
+; RUN: FileCheck --check-prefixes=CHECK,CHECK-LINUX %s
 ; <rdar://problem/14199482> ARM64: Calls to bzero() replaced with calls to memset()

 ; CHECK-LABEL: fct1:
 ; For small size (<= 256), we do not change memset to bzero.
 ; CHECK-DARWIN: {{b|bl}} _memset
 ; CHECK-LINUX: {{b|bl}} memset
-define void @fct1(i8* nocapture %ptr) {
+define void @fct1(i8* nocapture %ptr) minsize {
 entry:
   tail call void @llvm.memset.p0i8.i64(i8* %ptr, i8 0, i64 256, i1 false)
   ret void
 }

 declare void @llvm.memset.p0i8.i64(i8* nocapture, i8, i64, i1)

 ; CHECK-LABEL: fct2:
 ; When the size is bigger than 256, change into bzero.
 ; CHECK-DARWIN: {{b|bl}} _bzero
 ; CHECK-LINUX: {{b|bl}} memset
-define void @fct2(i8* nocapture %ptr) {
+define void @fct2(i8* nocapture %ptr) minsize {
 entry:
   tail call void @llvm.memset.p0i8.i64(i8* %ptr, i8 0, i64 257, i1 false)
   ret void
 }

 ; CHECK-LABEL: fct3:
 ; For unknown size, change to bzero.
 ; CHECK-DARWIN: {{b|bl}} _bzero
 ; CHECK-LINUX: {{b|bl}} memset
-define void @fct3(i8* nocapture %ptr, i32 %unknown) {
+define void @fct3(i8* nocapture %ptr, i32 %unknown) minsize {
 entry:
   %conv = sext i32 %unknown to i64
   tail call void @llvm.memset.p0i8.i64(i8* %ptr, i8 0, i64 %conv, i1 false)
   ret void
 }

 ; CHECK-LABEL: fct4:
 ; Size <= 256, no change.
 ; CHECK-DARWIN: {{b|bl}} _memset
 ; CHECK-LINUX: {{b|bl}} memset
-define void @fct4(i8* %ptr) {
+define void @fct4(i8* %ptr) minsize {
 entry:
   %tmp = tail call i64 @llvm.objectsize.i64(i8* %ptr, i1 false)
   %call = tail call i8* @__memset_chk(i8* %ptr, i32 0, i64 256, i64 %tmp)
   ret void
 }

 declare i8* @__memset_chk(i8*, i32, i64, i64)

 declare i64 @llvm.objectsize.i64(i8*, i1)

 ; CHECK-LABEL: fct5:
 ; Size > 256, change.
 ; CHECK-DARWIN: {{b|bl}} _bzero
 ; CHECK-LINUX: {{b|bl}} memset
-define void @fct5(i8* %ptr) {
+define void @fct5(i8* %ptr) minsize {
 entry:
   %tmp = tail call i64 @llvm.objectsize.i64(i8* %ptr, i1 false)
   %call = tail call i8* @__memset_chk(i8* %ptr, i32 0, i64 257, i64 %tmp)
   ret void
 }

 ; CHECK-LABEL: fct6:
 ; Size = unknown, change.
 ; CHECK-DARWIN: {{b|bl}} _bzero
 ; CHECK-LINUX: {{b|bl}} memset
-define void @fct6(i8* %ptr, i32 %unknown) {
+define void @fct6(i8* %ptr, i32 %unknown) minsize {
 entry:
   %conv = sext i32 %unknown to i64
   %tmp = tail call i64 @llvm.objectsize.i64(i8* %ptr, i1 false)
   %call = tail call i8* @__memset_chk(i8* %ptr, i32 0, i64 %conv, i64 %tmp)
   ret void
 }

 ; Next functions check that memset is not turned into bzero
 ; when the set constant is non-zero, whatever the given size.

 ; CHECK-LABEL: fct7:
 ; memset with something that is not a zero, no change.
 ; CHECK-DARWIN: {{b|bl}} _memset
 ; CHECK-LINUX: {{b|bl}} memset
-define void @fct7(i8* %ptr) {
+define void @fct7(i8* %ptr) minsize {
 entry:
   %tmp = tail call i64 @llvm.objectsize.i64(i8* %ptr, i1 false)
   %call = tail call i8* @__memset_chk(i8* %ptr, i32 1, i64 256, i64 %tmp)
   ret void
 }

 ; CHECK-LABEL: fct8:
 ; memset with something that is not a zero, no change.
 ; CHECK-DARWIN: {{b|bl}} _memset
 ; CHECK-LINUX: {{b|bl}} memset
-define void @fct8(i8* %ptr) {
+define void @fct8(i8* %ptr) minsize {
 entry:
   %tmp = tail call i64 @llvm.objectsize.i64(i8* %ptr, i1 false)
   %call = tail call i8* @__memset_chk(i8* %ptr, i32 1, i64 257, i64 %tmp)
   ret void
 }

 ; CHECK-LABEL: fct9:
 ; memset with something that is not a zero, no change.
 ; CHECK-DARWIN: {{b|bl}} _memset
 ; CHECK-LINUX: {{b|bl}} memset
-define void @fct9(i8* %ptr, i32 %unknown) {
+define void @fct9(i8* %ptr, i32 %unknown) minsize {
 entry:
   %conv = sext i32 %unknown to i64
   %tmp = tail call i64 @llvm.objectsize.i64(i8* %ptr, i1 false)
   %call = tail call i8* @__memset_chk(i8* %ptr, i32 1, i64 %conv, i64 %tmp)
   ret void
 }

llvm/trunk/test/CodeGen/AArch64/arm64-misaligned-memcpy-inline.ll

 ; RUN: llc -mtriple=arm64-apple-ios -mattr=+strict-align < %s | FileCheck %s

-; Small (16-bytes here) unaligned memcpys should stay memcpy calls if
+; Small (16 bytes here) unaligned memcpy() should be a function call if
 ; strict-alignment is turned on.
 define void @t0(i8* %out, i8* %in) {
 ; CHECK-LABEL: t0:
 ; CHECK: orr w2, wzr, #0x10
 ; CHECK-NEXT: bl _memcpy
 entry:
   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %out, i8* %in, i64 16, i1 false)
   ret void
 }

+; Small (16 bytes here) aligned memcpy() should be inlined even if
+; strict-alignment is turned on.
+define void @t1(i8* align 8 %out, i8* align 8 %in) {
+; CHECK-LABEL: t1:
+; CHECK: ldp x{{[0-9]+}}, x{{[0-9]+}}, [x1]
+; CHECK-NEXT: stp x{{[0-9]+}}, x{{[0-9]+}}, [x0]
+entry:
+  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %out, i8* align 8 %in, i64 16, i1 false)
+  ret void
+}
+
+; Tiny (4 bytes here) unaligned memcpy() should be inlined with byte sized
+; loads and stores if strict-alignment is turned on.
+define void @t2(i8* %out, i8* %in) {
+; CHECK-LABEL: t2:
+; CHECK: ldrb w{{[0-9]+}}, [x1, #3]
+; CHECK-NEXT: ldrb w{{[0-9]+}}, [x1, #2]
+; CHECK-NEXT: ldrb w{{[0-9]+}}, [x1, #1]
+; CHECK-NEXT: ldrb w{{[0-9]+}}, [x1]
+; CHECK-NEXT: strb w{{[0-9]+}}, [x0, #3]
+; CHECK-NEXT: strb w{{[0-9]+}}, [x0, #2]
+; CHECK-NEXT: strb w{{[0-9]+}}, [x0, #1]
+; CHECK-NEXT: strb w{{[0-9]+}}, [x0]
+entry:
+  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %out, i8* %in, i64 4, i1 false)
+  ret void
+}
+
 declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture, i8* nocapture readonly, i64, i1)
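
The expectations in t0, t1 and t2 can be reproduced with a small model of the strict-alignment rule. This is only a sketch under stated assumptions: `memcpyLimit`, `widestAccess`, `countAccesses` and `inlinesMemcpy` are hypothetical names, the limits 4 and 16 are the memcpy values set in this patch, the widest access is assumed to be 16 bytes, and with strict alignment an access is assumed to be no wider than the known pointer alignment. The real lowering considers more than this.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical: store limit for memcpy, following the values in the patch
// (4 when strict alignment is required, 16 otherwise, when not optimizing
// for size).
unsigned memcpyLimit(bool strictAlign) { return strictAlign ? 4u : 16u; }

// Hypothetical: widest access in bytes. `align` is the known alignment of
// both pointers, assumed to be a power of two.
unsigned widestAccess(bool strictAlign, unsigned align) {
  const unsigned maxWidth = 16;
  if (align == 0)
    align = 1; // Treat unknown alignment as byte alignment.
  return strictAlign ? std::min(align, maxWidth) : maxWidth;
}

// Greedy count of accesses needed to cover `size` bytes.
unsigned countAccesses(uint64_t size, unsigned width) {
  unsigned n = 0;
  while (size > 0) {
    while (width > size)
      width /= 2; // Narrow the access until it fits the remainder.
    size -= width;
    ++n;
  }
  return n;
}

// Inline only if the expansion fits within the store limit.
bool inlinesMemcpy(uint64_t size, unsigned align, bool strictAlign) {
  return countAccesses(size, widestAccess(strictAlign, align)) <=
         memcpyLimit(strictAlign);
}
```

In this model, t0 (16 bytes, byte-aligned) would need 16 byte-wide stores and stays a call, t1 (16 bytes, 8-byte aligned) needs 2 accesses and is inlined, and t2 (4 bytes, byte-aligned) needs exactly 4 and is inlined.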

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Fix PR32384: bump up the number of stores per memset and memcpy
Closed · Public