This is an archive of the discontinued LLVM Phabricator instance.

Differential D51706

ARM64: improve non-zero memset isel by ~2x
Closed · Public

Authored by jfb on Sep 5 2018, 3:50 PM.

Details

Reviewers

t.p.northover
MatzeB
javed.absar
efriedma

Commits

rG29200611055f: ARM64: improve non-zero memset isel by ~2x
rL341558: ARM64: improve non-zero memset isel by ~2x

Summary

I added a few ARM64 memset codegen tests in r341406 and r341493, and annotated
where the generated code was bad. This patch fixes the majority of the issues by
requesting that a 2xi64 vector be used for memset of 32 bytes and above.

The patch leaves the former request for f128 unchanged, despite f128
materialization being suboptimal: doing otherwise runs into other asserts in
isel and makes this patch too broad.

This patch hides the issue that was present in bzero_40_stack and bzero_72_stack
because the code now generates in a better order which doesn't have the store
offset issue. I'm not aware of that issue appearing elsewhere at the moment.
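
The store offset issue mentioned above relates to how paired stores encode their offset: an `stp` of two q registers takes a 7-bit signed immediate scaled by 16, so the byte offset must be a multiple of 16 between -1024 and 1008. A minimal sketch of that check follows. `fitsStpQOffset` is a hypothetical helper name, and the backend's real pairing logic considers more than the offset.

```cpp
#include <cstdint>

// Hypothetical helper (not an LLVM function): can `offset`, in bytes, be
// encoded as the immediate of `stp qA, qB, [base, #offset]`?
// The immediate is a signed 7-bit value scaled by the 16-byte register size.
constexpr bool fitsStpQOffset(std::int64_t offset) {
  return offset % 16 == 0 && offset >= -64 * 16 && offset <= 63 * 16;
}

// Offsets from the old bzero_40_stack CHECK lines (stur q0 at #8 and #24).
static_assert(!fitsStpQOffset(8), "8 is not a multiple of 16");
static_assert(!fitsStpQOffset(24), "24 is not a multiple of 16");
// Offsets from the new CHECK lines (stp q0, q0 at [sp] and [sp, #32]).
static_assert(fitsStpQOffset(0), "0 is encodable");
static_assert(fitsStpQOffset(32), "32 is encodable");
```

With the buffer now starting at `sp` rather than `sp + 8`, every 16-byte store lands on an encodable offset, which is consistent with the issue no longer showing up in these two tests.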

rdar://problem/44157755

Diff Detail

Repository: rL LLVM

Event Timeline

jfb created this revision. Sep 5 2018, 3:50 PM

Herald added a reviewer: javed.absar. Sep 5 2018, 3:50 PM

Herald added subscribers: llvm-commits, dexonsmith, chrib and 2 others.

Nice change, LGTM!
Maybe wait a couple days before committing in case someone better versed with SelectionDAG wants to chime in.

lib/Target/AArch64/AArch64ISelLowering.cpp
5198–5220 ↗	(On Diff #164119)	unrelated, so maybe split this into a separate commit when this review is approved (very nice change though)

This revision is now accepted and ready to land. Sep 5 2018, 4:06 PM

jfb mentioned this in rL341504: NFC: improve ARM64 isFPImmLegal debug print. Sep 5 2018, 4:39 PM

Rebase

lib/Target/AArch64/AArch64ISelLowering.cpp
5198–5220 ↗	(On Diff #164119)	I committed it in r341504.

Harbormaster completed remote builds in B22287: Diff 164125. Sep 5 2018, 4:42 PM

If I'm understanding correctly, the reason the f128 operations aren't efficient is that float immediate lowering doesn't know how to use movi? That seems like it would be more straightforward to solve by fixing the float immediate lowering, rather than messing with memset lowering.

In D51706#1225382, @efriedma wrote:

If I'm understanding correctly, the reason the f128 operations aren't efficient is that float immediate lowering doesn't know how to use movi? That seems like it would be more straightforward to solve by fixing the float immediate lowering, rather than messing with memset lowering.

I initially looked at this but it wasn't more straightforward. It's also odd to me because we know memset is i8, so v2i64 clearly works, but we don't know that other values will work as f128 or anything else (and we don't have the value to look at here). We also know that ARM64 likes pairs of i64, whereas f128 is inherently something you'd need to load from a constant pool unless you're really lucky.
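
The point that a memset value is always a single byte can be made concrete: replicating that byte gives the 64-bit pattern, and the same byte fills every lane of a 16-byte vector. The sketch below uses the fill value from the tests, `i8 -86` (0xAA, or 170 unsigned). `splatByte` is a hypothetical helper, not an LLVM function.

```cpp
#include <cstdint>

// Hypothetical helper: replicate one byte into all 8 bytes of a 64-bit word.
constexpr std::uint64_t splatByte(std::uint8_t b) {
  return b * 0x0101010101010101ULL;
}

// i8 -86 reinterpreted as an unsigned byte is 170 (0xAA).
static_assert(static_cast<std::uint8_t>(-86) == 170, "-86 as a byte is 0xAA");
static_assert(splatByte(0xAA) == 0xAAAAAAAAAAAAAAAAULL, "byte repeated 8 times");
```

Read as a signed 64-bit value, 0xAAAAAAAAAAAAAAAA is -6148914691236517206, the `mov x8` immediate in the old CHECK lines, while 170 is the `movi v0.16b` immediate in the new ones.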

jfb added inline comments. Sep 5 2018, 4:52 PM

lib/Target/AArch64/AArch64ISelLowering.cpp
8350 ↗	(On Diff #164125)	I'd like to point out how this line doesn't match the above comment: Don't use AdvSIMD to implement 16-byte memset.

LGTM

I initially looked at this but it wasn't more straightforward

Well, even if it isn't simple, I would guess it's useful for f32 and f64 anyway. Maybe not so much for f128, though.

efriedma added inline comments. Sep 5 2018, 5:44 PM

lib/Target/AArch64/AArch64ISelLowering.cpp
8347 ↗	(On Diff #164125)	Err, just realized one minor thing; technically, I guess you're supposed to check hasNEON() for v2i64, not hasFPARMv8(). Not really a big deal, though; they're essentially the same in practice.

can use NEON

jfb marked an inline comment as done. Sep 5 2018, 8:16 PM

jfb added inline comments.

lib/Target/AArch64/AArch64ISelLowering.cpp
8347 ↗	(On Diff #164125)	Interesting! I split it in two; I assume that f128 requires the previous flag.

Closed by commit rL341558: ARM64: improve non-zero memset isel by ~2x (authored by jfb). Sep 6 2018, 9:04 AM

This revision was automatically updated to reflect the committed changes.

jfb marked an inline comment as done.

Revision Contents

Path	Size
llvm/trunk/lib/Target/AArch64/AArch64ISelLowering.cpp	37 lines
llvm/trunk/test/CodeGen/AArch64/arm64-memset-inline.ll	104 lines

Diff 164229

llvm/trunk/lib/Target/AArch64/AArch64ISelLowering.cpp

[8,336 lines not shown]
return ((SrcAlign == 0 || SrcAlign % AlignCheck == 0) &&		return ((SrcAlign == 0 || SrcAlign % AlignCheck == 0) &&
(DstAlign == 0 || DstAlign % AlignCheck == 0));		(DstAlign == 0 || DstAlign % AlignCheck == 0));
}		}

EVT AArch64TargetLowering::getOptimalMemOpType(uint64_t Size, unsigned DstAlign,		EVT AArch64TargetLowering::getOptimalMemOpType(uint64_t Size, unsigned DstAlign,
unsigned SrcAlign, bool IsMemset,		unsigned SrcAlign, bool IsMemset,
bool ZeroMemset,		bool ZeroMemset,
bool MemcpyStrSrc,		bool MemcpyStrSrc,
MachineFunction &MF) const {		MachineFunction &MF) const {
// Don't use AdvSIMD to implement 16-byte memset. It would have taken one
// instruction to materialize the v2i64 zero and one store (with restrictive
// addressing mode). Just do two i64 store of zero-registers.
bool Fast;
const Function &F = MF.getFunction();		const Function &F = MF.getFunction();
if (Subtarget->hasFPARMv8() && !IsMemset && Size >= 16 &&		bool CanImplicitFloat = !F.hasFnAttribute(Attribute::NoImplicitFloat);
!F.hasFnAttribute(Attribute::NoImplicitFloat) &&		bool CanUseNEON = Subtarget->hasNEON() && CanImplicitFloat;
(memOpAlign(SrcAlign, DstAlign, 16) ||		bool CanUseFP = Subtarget->hasFPARMv8() && CanImplicitFloat;
(allowsMisalignedMemoryAccesses(MVT::f128, 0, 1, &Fast) && Fast)))		// Only use AdvSIMD to implement memset of 32-byte and above. It would have
return MVT::f128;		// taken one instruction to materialize the v2i64 zero and one store (with
		// restrictive addressing mode). Just do i64 stores.
		bool IsSmallMemset = IsMemset && Size < 32;
		auto AlignmentIsAcceptable = [&](EVT VT, unsigned AlignCheck) {
		if (memOpAlign(SrcAlign, DstAlign, AlignCheck))
		return true;
		bool Fast;
		return allowsMisalignedMemoryAccesses(VT, 0, 1, &Fast) && Fast;
		};

if (Size >= 8 &&		if (CanUseNEON && IsMemset && !IsSmallMemset &&
(memOpAlign(SrcAlign, DstAlign, 8) ||		AlignmentIsAcceptable(MVT::v2i64, 16))
(allowsMisalignedMemoryAccesses(MVT::i64, 0, 1, &Fast) && Fast)))		return MVT::v2i64;
		if (CanUseFP && !IsSmallMemset && AlignmentIsAcceptable(MVT::f128, 16))
		return MVT::f128;
		if (Size >= 8 && AlignmentIsAcceptable(MVT::i64, 8))
return MVT::i64;		return MVT::i64;
		if (Size >= 4 && AlignmentIsAcceptable(MVT::i32, 4))
if (Size >= 4 &&
(memOpAlign(SrcAlign, DstAlign, 4) ||
(allowsMisalignedMemoryAccesses(MVT::i32, 0, 1, &Fast) && Fast)))
return MVT::i32;		return MVT::i32;

return MVT::Other;		return MVT::Other;
}		}

// 12-bit optionally shifted immediates are legal for adds.		// 12-bit optionally shifted immediates are legal for adds.
bool AArch64TargetLowering::isLegalAddImmediate(int64_t Immed) const {		bool AArch64TargetLowering::isLegalAddImmediate(int64_t Immed) const {
if (Immed == std::numeric_limits<int64_t>::min()) {		if (Immed == std::numeric_limits<int64_t>::min()) {
LLVM_DEBUG(dbgs() << "Illegal add imm " << Immed		LLVM_DEBUG(dbgs() << "Illegal add imm " << Immed
<< ": avoid UB for INT64_MIN\n");		<< ": avoid UB for INT64_MIN\n");
[3,193 lines not shown]
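
The new selection order in the right-hand column can be modeled as a standalone function to make the thresholds easier to follow. This is a simplified sketch, not the LLVM code. The `MemOpType` enum, the `Query` struct, and `pickMemOpType` are hypothetical names, and the single `fastUnaligned` flag stands in for the per-type `allowsMisalignedMemoryAccesses` query.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the MVT value types named in the diff.
enum class MemOpType { V2I64, F128, I64, I32, Other };

// Hypothetical inputs; the real code queries the subtarget and the function.
struct Query {
  std::uint64_t size;
  unsigned dstAlign;     // 0 means there is no constraint to check
  unsigned srcAlign;     // 0 means there is no constraint to check
  bool isMemset;
  bool hasNEON;
  bool hasFP;
  bool noImplicitFloat;
  bool fastUnaligned;    // assumption: applies to every type alike
};

// Same predicate as memOpAlign in the diff context lines.
constexpr bool memOpAlign(unsigned srcAlign, unsigned dstAlign,
                          unsigned alignCheck) {
  return (srcAlign == 0 || srcAlign % alignCheck == 0) &&
         (dstAlign == 0 || dstAlign % alignCheck == 0);
}

// Models the AlignmentIsAcceptable lambda: aligned, or misaligned but fast.
constexpr bool alignmentIsAcceptable(const Query &q, unsigned alignCheck) {
  return memOpAlign(q.srcAlign, q.dstAlign, alignCheck) || q.fastUnaligned;
}

constexpr MemOpType pickMemOpType(const Query &q) {
  const bool canImplicitFloat = !q.noImplicitFloat;
  const bool canUseNEON = q.hasNEON && canImplicitFloat;
  const bool canUseFP = q.hasFP && canImplicitFloat;
  const bool isSmallMemset = q.isMemset && q.size < 32;
  if (canUseNEON && q.isMemset && !isSmallMemset && alignmentIsAcceptable(q, 16))
    return MemOpType::V2I64;
  if (canUseFP && !isSmallMemset && alignmentIsAcceptable(q, 16))
    return MemOpType::F128;
  if (q.size >= 8 && alignmentIsAcceptable(q, 8))
    return MemOpType::I64;
  if (q.size >= 4 && alignmentIsAcceptable(q, 4))
    return MemOpType::I32;
  return MemOpType::Other;
}

// A 32-byte memset of an align-1 buffer, as in memset_32_stack.
static_assert(pickMemOpType(Query{32, 1, 0, true, true, true, false, true}) ==
                  MemOpType::V2I64, "32-byte memset uses the vector type");
// A 26-byte memset is "small" and falls through to i64.
static_assert(pickMemOpType(Query{26, 1, 0, true, true, true, false, true}) ==
                  MemOpType::I64, "small memset uses i64");
```

Splitting the NEON and FP checks mirrors the review comment about `hasNEON()` versus `hasFPARMv8()`: in this model, a target with FP but without NEON still gets the f128 request.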

llvm/trunk/test/CodeGen/AArch64/arm64-memset-inline.ll

	[131 lines not shown]
	; CHECK-NEXT: bl something			; CHECK-NEXT: bl something
	%buf = alloca [32 x i8], align 1			%buf = alloca [32 x i8], align 1
	%cast = bitcast [32 x i8]* %buf to i8*			%cast = bitcast [32 x i8]* %buf to i8*
	call void @llvm.memset.p0i8.i32(i8* %cast, i8 0, i32 32, i1 false)			call void @llvm.memset.p0i8.i32(i8* %cast, i8 0, i32 32, i1 false)
	call void @something(i8* %cast)			call void @something(i8* %cast)
	ret void			ret void
	}			}

	; FIXME These don't pair up because the offset isn't a multiple of 16 bits. x0, however, could be used as a base for a paired store.
	define void @bzero_40_stack() {			define void @bzero_40_stack() {
	; CHECK-LABEL: bzero_40_stack:			; CHECK-LABEL: bzero_40_stack:
	; CHECK: stp xzr, x30, [sp, #40]
	; CHECK: movi v0.2d, #0000000000000000			; CHECK: movi v0.2d, #0000000000000000
	; CHECK-NEXT: add x0, sp, #8			; CHECK-NEXT: mov x0, sp
	; CHECK-NEXT: stur q0, [sp, #24]			; CHECK-NEXT: str xzr, [sp, #32]
	; CHECK-NEXT: stur q0, [sp, #8]			; CHECK-NEXT: stp q0, q0, [sp]
	; CHECK-NEXT: bl something			; CHECK-NEXT: bl something
	%buf = alloca [40 x i8], align 1			%buf = alloca [40 x i8], align 1
	%cast = bitcast [40 x i8]* %buf to i8*			%cast = bitcast [40 x i8]* %buf to i8*
	call void @llvm.memset.p0i8.i32(i8* %cast, i8 0, i32 40, i1 false)			call void @llvm.memset.p0i8.i32(i8* %cast, i8 0, i32 40, i1 false)
	call void @something(i8* %cast)			call void @something(i8* %cast)
	ret void			ret void
	}			}

	define void @bzero_64_stack() {			define void @bzero_64_stack() {
	; CHECK-LABEL: bzero_64_stack:			; CHECK-LABEL: bzero_64_stack:
	; CHECK: movi v0.2d, #0000000000000000			; CHECK: movi v0.2d, #0000000000000000
	; CHECK-NEXT: mov x0, sp			; CHECK-NEXT: mov x0, sp
	; CHECK-NEXT: stp q0, q0, [sp, #32]			; CHECK-NEXT: stp q0, q0, [sp, #32]
	; CHECK-NEXT: stp q0, q0, [sp]			; CHECK-NEXT: stp q0, q0, [sp]
	; CHECK-NEXT: bl something			; CHECK-NEXT: bl something
	%buf = alloca [64 x i8], align 1			%buf = alloca [64 x i8], align 1
	%cast = bitcast [64 x i8]* %buf to i8*			%cast = bitcast [64 x i8]* %buf to i8*
	call void @llvm.memset.p0i8.i32(i8* %cast, i8 0, i32 64, i1 false)			call void @llvm.memset.p0i8.i32(i8* %cast, i8 0, i32 64, i1 false)
	call void @something(i8* %cast)			call void @something(i8* %cast)
	ret void			ret void
	}			}

	; FIXME These don't pair up because the offset isn't a multiple of 16 bits. x0, however, could be used as a base for a paired store.
	define void @bzero_72_stack() {			define void @bzero_72_stack() {
	; CHECK-LABEL: bzero_72_stack:			; CHECK-LABEL: bzero_72_stack:
	; CHECK: stp xzr, x30, [sp, #72]
	; CHECK: movi v0.2d, #0000000000000000			; CHECK: movi v0.2d, #0000000000000000
	; CHECK-NEXT: add x0, sp, #8			; CHECK-NEXT: mov x0, sp
	; CHECK-NEXT: stur q0, [sp, #56]			; CHECK-NEXT: str xzr, [sp, #64]
	; CHECK-NEXT: stur q0, [sp, #40]			; CHECK-NEXT: stp q0, q0, [sp, #32]
	; CHECK-NEXT: stur q0, [sp, #24]			; CHECK-NEXT: stp q0, q0, [sp]
	; CHECK-NEXT: stur q0, [sp, #8]
	; CHECK-NEXT: bl something			; CHECK-NEXT: bl something
	%buf = alloca [72 x i8], align 1			%buf = alloca [72 x i8], align 1
	%cast = bitcast [72 x i8]* %buf to i8*			%cast = bitcast [72 x i8]* %buf to i8*
	call void @llvm.memset.p0i8.i32(i8* %cast, i8 0, i32 72, i1 false)			call void @llvm.memset.p0i8.i32(i8* %cast, i8 0, i32 72, i1 false)
	call void @something(i8* %cast)			call void @something(i8* %cast)
	ret void			ret void
	}			}

	[117 lines not shown]
	; CHECK-NEXT: bl something			; CHECK-NEXT: bl something
	%buf = alloca [26 x i8], align 1			%buf = alloca [26 x i8], align 1
	%cast = bitcast [26 x i8]* %buf to i8*			%cast = bitcast [26 x i8]* %buf to i8*
	call void @llvm.memset.p0i8.i32(i8* %cast, i8 -86, i32 26, i1 false)			call void @llvm.memset.p0i8.i32(i8* %cast, i8 -86, i32 26, i1 false)
	call void @something(i8* %cast)			call void @something(i8* %cast)
	ret void			ret void
	}			}

	; FIXME This could use FP ops.
	define void @memset_32_stack() {			define void @memset_32_stack() {
	; CHECK-LABEL: memset_32_stack:			; CHECK-LABEL: memset_32_stack:
	; CHECK: mov x8, #-6148914691236517206			; CHECK: movi v0.16b, #170
	; CHECK-NEXT: mov x0, sp			; CHECK-NEXT: mov x0, sp
	; CHECK-NEXT: stp x8, x30, [sp, #24]			; CHECK-NEXT: stp q0, q0, [sp]
	; CHECK-NEXT: stp x8, x8, [sp, #8]
	; CHECK-NEXT: str x8, [sp]
	; CHECK-NEXT: bl something			; CHECK-NEXT: bl something
	%buf = alloca [32 x i8], align 1			%buf = alloca [32 x i8], align 1
	%cast = bitcast [32 x i8]* %buf to i8*			%cast = bitcast [32 x i8]* %buf to i8*
	call void @llvm.memset.p0i8.i32(i8* %cast, i8 -86, i32 32, i1 false)			call void @llvm.memset.p0i8.i32(i8* %cast, i8 -86, i32 32, i1 false)
	call void @something(i8* %cast)			call void @something(i8* %cast)
	ret void			ret void
	}			}

	; FIXME This could use FP ops.
	define void @memset_40_stack() {			define void @memset_40_stack() {
	; CHECK-LABEL: memset_40_stack:			; CHECK-LABEL: memset_40_stack:
	; CHECK: mov x8, #-6148914691236517206			; CHECK: mov x8, #-6148914691236517206
	; CHECK-NEXT: add x0, sp, #8			; CHECK-NEXT: movi v0.16b, #170
	; CHECK-NEXT: stp x8, x30, [sp, #40]			; CHECK-NEXT: mov x0, sp
	; CHECK-NEXT: stp x8, x8, [sp, #24]			; CHECK-NEXT: str x8, [sp, #32]
	; CHECK-NEXT: stp x8, x8, [sp, #8]			; CHECK-NEXT: stp q0, q0, [sp]
	; CHECK-NEXT: bl something			; CHECK-NEXT: bl something
	%buf = alloca [40 x i8], align 1			%buf = alloca [40 x i8], align 1
	%cast = bitcast [40 x i8]* %buf to i8*			%cast = bitcast [40 x i8]* %buf to i8*
	call void @llvm.memset.p0i8.i32(i8* %cast, i8 -86, i32 40, i1 false)			call void @llvm.memset.p0i8.i32(i8* %cast, i8 -86, i32 40, i1 false)
	call void @something(i8* %cast)			call void @something(i8* %cast)
	ret void			ret void
	}			}

	; FIXME This could use FP ops.
	define void @memset_64_stack() {			define void @memset_64_stack() {
	; CHECK-LABEL: memset_64_stack:			; CHECK-LABEL: memset_64_stack:
	; CHECK: mov x8, #-6148914691236517206			; CHECK: movi v0.16b, #170
	; CHECK-NEXT: mov x0, sp			; CHECK-NEXT: mov x0, sp
	; CHECK-NEXT: stp x8, x30, [sp, #56]			; CHECK-NEXT: stp q0, q0, [sp, #32]
	; CHECK-NEXT: stp x8, x8, [sp, #40]			; CHECK-NEXT: stp q0, q0, [sp]
	; CHECK-NEXT: stp x8, x8, [sp, #24]
	; CHECK-NEXT: stp x8, x8, [sp, #8]
	; CHECK-NEXT: str x8, [sp]
	; CHECK-NEXT: bl something			; CHECK-NEXT: bl something
	%buf = alloca [64 x i8], align 1			%buf = alloca [64 x i8], align 1
	%cast = bitcast [64 x i8]* %buf to i8*			%cast = bitcast [64 x i8]* %buf to i8*
	call void @llvm.memset.p0i8.i32(i8* %cast, i8 -86, i32 64, i1 false)			call void @llvm.memset.p0i8.i32(i8* %cast, i8 -86, i32 64, i1 false)
	call void @something(i8* %cast)			call void @something(i8* %cast)
	ret void			ret void
	}			}

	; FIXME This could use FP ops.
	define void @memset_72_stack() {			define void @memset_72_stack() {
	; CHECK-LABEL: memset_72_stack:			; CHECK-LABEL: memset_72_stack:
	; CHECK: mov x8, #-6148914691236517206			; CHECK: mov x8, #-6148914691236517206
	; CHECK-NEXT: add x0, sp, #8			; CHECK-NEXT: movi v0.16b, #170
	; CHECK-NEXT: stp x8, x30, [sp, #72]			; CHECK-NEXT: mov x0, sp
	; CHECK-NEXT: stp x8, x8, [sp, #56]			; CHECK-NEXT: str x8, [sp, #64]
	; CHECK-NEXT: stp x8, x8, [sp, #40]			; CHECK-NEXT: stp q0, q0, [sp, #32]
	; CHECK-NEXT: stp x8, x8, [sp, #24]			; CHECK-NEXT: stp q0, q0, [sp]
	; CHECK-NEXT: stp x8, x8, [sp, #8]
	; CHECK-NEXT: bl something			; CHECK-NEXT: bl something
	%buf = alloca [72 x i8], align 1			%buf = alloca [72 x i8], align 1
	%cast = bitcast [72 x i8]* %buf to i8*			%cast = bitcast [72 x i8]* %buf to i8*
	call void @llvm.memset.p0i8.i32(i8* %cast, i8 -86, i32 72, i1 false)			call void @llvm.memset.p0i8.i32(i8* %cast, i8 -86, i32 72, i1 false)
	call void @something(i8* %cast)			call void @something(i8* %cast)
	ret void			ret void
	}			}

	; FIXME This could use FP ops.
	define void @memset_128_stack() {			define void @memset_128_stack() {
	; CHECK-LABEL: memset_128_stack:			; CHECK-LABEL: memset_128_stack:
	; CHECK: mov x8, #-6148914691236517206			; CHECK: movi v0.16b, #170
	; CHECK-NEXT: mov x0, sp			; CHECK-NEXT: mov x0, sp
	; CHECK-NEXT: stp x8, x30, [sp, #120]			; CHECK-NEXT: stp q0, q0, [sp, #96]
	; CHECK-NEXT: stp x8, x8, [sp, #104]			; CHECK-NEXT: stp q0, q0, [sp, #64]
	; CHECK-NEXT: stp x8, x8, [sp, #88]			; CHECK-NEXT: stp q0, q0, [sp, #32]
	; CHECK-NEXT: stp x8, x8, [sp, #72]			; CHECK-NEXT: stp q0, q0, [sp]
	; CHECK-NEXT: stp x8, x8, [sp, #56]
	; CHECK-NEXT: stp x8, x8, [sp, #40]
	; CHECK-NEXT: stp x8, x8, [sp, #24]
	; CHECK-NEXT: stp x8, x8, [sp, #8]
	; CHECK-NEXT: str x8, [sp]
	; CHECK-NEXT: bl something			; CHECK-NEXT: bl something
	%buf = alloca [128 x i8], align 1			%buf = alloca [128 x i8], align 1
	%cast = bitcast [128 x i8]* %buf to i8*			%cast = bitcast [128 x i8]* %buf to i8*
	call void @llvm.memset.p0i8.i32(i8* %cast, i8 -86, i32 128, i1 false)			call void @llvm.memset.p0i8.i32(i8* %cast, i8 -86, i32 128, i1 false)
	call void @something(i8* %cast)			call void @something(i8* %cast)
	ret void			ret void
	}			}

	; FIXME This could use FP ops.
	define void @memset_256_stack() {			define void @memset_256_stack() {
	; CHECK-LABEL: memset_256_stack:			; CHECK-LABEL: memset_256_stack:
	; CHECK: mov x8, #-6148914691236517206			; CHECK: movi v0.16b, #170
	; CHECK-NEXT: mov x0, sp			; CHECK-NEXT: mov x0, sp
	; CHECK-NEXT: stp x8, x8, [sp, #240]			; CHECK-NEXT: stp q0, q0, [sp, #224]
	; CHECK-NEXT: stp x8, x8, [sp, #224]			; CHECK-NEXT: stp q0, q0, [sp, #192]
	; CHECK-NEXT: stp x8, x8, [sp, #208]			; CHECK-NEXT: stp q0, q0, [sp, #160]
	; CHECK-NEXT: stp x8, x8, [sp, #192]			; CHECK-NEXT: stp q0, q0, [sp, #128]
	; CHECK-NEXT: stp x8, x8, [sp, #176]			; CHECK-NEXT: stp q0, q0, [sp, #96]
	; CHECK-NEXT: stp x8, x8, [sp, #160]			; CHECK-NEXT: stp q0, q0, [sp, #64]
	; CHECK-NEXT: stp x8, x8, [sp, #144]			; CHECK-NEXT: stp q0, q0, [sp, #32]
	; CHECK-NEXT: stp x8, x8, [sp, #128]			; CHECK-NEXT: stp q0, q0, [sp]
	; CHECK-NEXT: stp x8, x8, [sp, #112]
	; CHECK-NEXT: stp x8, x8, [sp, #96]
	; CHECK-NEXT: stp x8, x8, [sp, #80]
	; CHECK-NEXT: stp x8, x8, [sp, #64]
	; CHECK-NEXT: stp x8, x8, [sp, #48]
	; CHECK-NEXT: stp x8, x8, [sp, #32]
	; CHECK-NEXT: stp x8, x8, [sp, #16]
	; CHECK-NEXT: stp x8, x8, [sp]
	; CHECK-NEXT: bl something			; CHECK-NEXT: bl something
	%buf = alloca [256 x i8], align 1			%buf = alloca [256 x i8], align 1
	%cast = bitcast [256 x i8]* %buf to i8*			%cast = bitcast [256 x i8]* %buf to i8*
	call void @llvm.memset.p0i8.i32(i8* %cast, i8 -86, i32 256, i1 false)			call void @llvm.memset.p0i8.i32(i8* %cast, i8 -86, i32 256, i1 false)
	call void @something(i8* %cast)			call void @something(i8* %cast)
	ret void			ret void
	}			}

	declare void @something(i8*)			declare void @something(i8*)
	declare void @llvm.memset.p0i8.i32(i8* nocapture, i8, i32, i1) nounwind			declare void @llvm.memset.p0i8.i32(i8* nocapture, i8, i32, i1) nounwind
	declare void @llvm.memset.p0i8.i64(i8* nocapture, i8, i64, i1) nounwind			declare void @llvm.memset.p0i8.i64(i8* nocapture, i8, i64, i1) nounwind
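
The "~2x" in the title follows from the store width: `stp x8, x8` writes 16 bytes, while `stp q0, q0` writes 32. A rough sketch of the count is shown here. `pairedStores` is a hypothetical helper that rounds up and ignores both the instruction that materializes the value and the narrower store a real lowering uses for a tail.

```cpp
#include <cstdint>

// Hypothetical helper: number of paired stores needed to cover `size` bytes
// when each stored register is `regBytes` wide (one pair writes 2 * regBytes).
constexpr std::uint64_t pairedStores(std::uint64_t size, std::uint64_t regBytes) {
  return (size + 2 * regBytes - 1) / (2 * regBytes);
}

// memset_256_stack: 16 `stp x8, x8` before, 8 `stp q0, q0` after.
static_assert(pairedStores(256, 8) == 16, "x-register pairs");
static_assert(pairedStores(256, 16) == 8, "q-register pairs");
```

The smaller tests deviate slightly from this count because the old code folded one store into the `stp x8, x30` that saves the link register, and sizes such as 40 and 72 need one extra `str x8` for the last 8 bytes.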