This is a continuation of the fix from:
http://reviews.llvm.org/D10662
and discussion in:
http://reviews.llvm.org/D12154
Here, we distinguish slow unaligned SSE (128-bit) accesses from slow unaligned scalar (64-bit and under) accesses. Other lowering (e.g., getOptimalMemOpType) assumes that unaligned scalar accesses are always OK, so this changes allowsMisalignedMemoryAccesses() to match that behavior.
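
A minimal standalone sketch of the intended logic (not the actual X86TargetLowering code; the SubtargetInfo struct and the IsUnalignedMem16Slow flag name are illustrative stand-ins for the real subtarget feature):

  #include <cstdio>

  struct SubtargetInfo {
    bool IsUnalignedMem16Slow; // set for cores where unaligned 16-byte
                               // SSE accesses are slow
  };

  // Returns true if a misaligned access of the given width is legal, and
  // reports through *Fast whether it is expected to be fast.
  bool allowsMisalignedMemoryAccesses(const SubtargetInfo &ST,
                                      unsigned SizeInBits, bool *Fast) {
    if (Fast) {
      if (SizeInBits <= 64)
        // Unaligned scalar accesses (64-bit and under) are always treated
        // as fast, matching the assumption in getOptimalMemOpType.
        *Fast = true;
      else if (SizeInBits == 128)
        // Unaligned 128-bit SSE accesses are fast only if the subtarget
        // says so.
        *Fast = !ST.IsUnalignedMem16Slow;
    }
    // x86 supports misaligned accesses of any type.
    return true;
  }

  int main() {
    SubtargetInfo SlowSSE{true};
    bool Fast = false;
    allowsMisalignedMemoryAccesses(SlowSSE, 64, &Fast);
    std::printf("unaligned i64 fast: %d\n", Fast);    // 1
    allowsMisalignedMemoryAccesses(SlowSSE, 128, &Fast);
    std::printf("unaligned v4i32 fast: %d\n", Fast);  // 0
  }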
The test case changes show that we now use an unaligned 8-byte load/store on a 64-bit CPU where we previously settled for unaligned 4-byte ops.
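
For illustration, this is the kind of source pattern those tests correspond to (copy8 and its parameter names are made up for the example): an 8-byte copy with no alignment guarantee, which can now lower to a single unaligned 8-byte op (movq) instead of two 4-byte ops (movl):

  #include <cstring>

  void copy8(char *Dst, const char *Src) {
    // No alignment known for Dst/Src; lowering may now pick one
    // unaligned 8-byte load/store on a 64-bit CPU.
    std::memcpy(Dst, Src, 8);
  }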
The overlapping accesses may be a separate bug.