This is an archive of the discontinued LLVM Phabricator instance.

make fast unaligned memory accesses implicit with SSE4.2 or SSE4a
ClosedPublic

Authored by spatel on Aug 24 2015, 9:06 AM.

Details

Reviewers

qcolombet
RKSimon
chandlerc
silvas
zansari

Commits

rGdeb8f826a582: make fast unaligned memory accesses implicit with SSE4.2 or SSE4a
rL245950: make fast unaligned memory accesses implicit with SSE4.2 or SSE4a

Summary

This is a follow-on from the discussion in http://reviews.llvm.org/D12154.

This change allows memset/memcpy to use SSE or AVX memory accesses for any chip that has generally fast unaligned memory ops.

A motivating use case for this change is a clang invocation that doesn't explicitly set the CPU, but does target a feature that we know only exists on a CPU that supports fast unaligned memops. For example:
$ clang -O1 foo.c -mavx

This resolves a difference in lowering noted in PR24449:
https://llvm.org/bugs/show_bug.cgi?id=24449

Currently, we use different store types depending on whether the example can be lowered as a memset or not.
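To make the memset distinction concrete, here is a minimal sketch of two source forms that zero the same 16 bytes. Both function names are hypothetical, and this is not the test case from PR24449; it only illustrates the kind of code where one form is recognizable as a memset and the other is written as individual stores.

```cpp
#include <cstring>

// Hypothetical stand-in for foo.c: two ways to zero a 16-byte buffer.

// This form is an explicit memset, so the backend's memset lowering
// chooses the store width.
void zero_with_memset(int *p) {
  std::memset(p, 0, 4 * sizeof(int));
}

// This form writes each 4-byte element separately; whether the stores
// are later combined is up to the optimizer.
void zero_with_stores(int *p) {
  p[0] = 0;
  p[1] = 0;
  p[2] = 0;
  p[3] = 0;
}
```

Compiling both with the invocation from the summary (for example `clang -O1 -mavx -S`) and comparing the emitted stores is one way to see whether the two forms are lowered alike.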

Event Timeline

spatel updated this revision to Diff 32957. Aug 24 2015, 9:06 AM

spatel retitled this revision to make fast unaligned memory accesses implicit with SSE4.2 or SSE4a.

spatel updated this object.

spatel added reviewers: zansari, chandlerc, qcolombet, RKSimon, silvas.

spatel added a subscriber: llvm-commits.

LGTM. My only concern was the VIA Nano, which Agner says has particularly slow unaligned memory access, but those chips appear to support only SSE4.1.

This revision is now accepted and ready to land. Aug 24 2015, 12:55 PM

Hi Sanjay,

Just one tiny comment, otherwise lgtm.

Thanks,
Zia.

lib/Target/X86/X86Subtarget.cpp
197	Nehalem/Silvermont

spatel marked an inline comment as done. Aug 25 2015, 9:29 AM

spatel added inline comments.

lib/Target/X86/X86Subtarget.cpp
197	Thanks - updated.

Closed by commit rL245950: make fast unaligned memory accesses implicit with SSE4.2 or SSE4a (authored by spatel). Aug 25 2015, 9:30 AM

This revision was automatically updated to reflect the committed changes.

spatel marked an inline comment as done.

Revision Contents

Path	Size
lib/Target/X86/X86Subtarget.cpp	7 lines
test/CodeGen/X86/slow-unaligned-mem.ll	5 lines

Diff 32957

lib/Target/X86/X86Subtarget.cpp

      if (!FullFS.empty())
        FullFS = "+64bit,+sse2," + FullFS;
      else
        FullFS = "+64bit,+sse2";
    }

    // Parse features string and set the CPU.
    ParseSubtargetFeatures(CPUName, FullFS);

+   // All CPUs that implement SSE4.2 or SSE4A support unaligned accesses of
+   // 16-bytes and under that are reasonably fast. These features were
+   // introduced with Intel's Nehalem and AMD's Family10h micro-architectures
+   // respectively.
+   if (hasSSE42() || hasSSE4A())
+     IsUAMemUnder32Slow = false;
+
    InstrItins = getInstrItineraryForCPU(CPUName);

    // It's important to keep the MCSubtargetInfo feature bits in sync with
    // target data structure which is shared with MC code emitter, etc.
    if (In64BitMode)
      ToggleFeature(X86::Mode64Bit);
    else if (In32BitMode)
      ToggleFeature(X86::Mode32Bit);

Inline comments on the "introduced with Intel's Nehalem" line:
zansari: Nehalem/Silvermont
spatel: Thanks - updated.
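The effect of the added lines can be sketched in isolation with a toy model. `ToySubtarget` and `applyImpliedFastUnaligned` are hypothetical names, not LLVM APIs; only the final `if` mirrors the patch. The flag starts as `true` to model a CPU definition that marks unaligned accesses as slow, which is an assumption about the starting state rather than something shown in this diff.

```cpp
// Toy model of the subtarget flags touched by this diff (names are
// hypothetical; this is not the real X86Subtarget class).
struct ToySubtarget {
  bool HasSSE42 = false;
  bool HasSSE4A = false;
  // Assumed starting state: the selected CPU marks unaligned accesses
  // under 32 bytes as slow.
  bool IsUAMemUnder32Slow = true;
};

// Intended to run after the feature string has been parsed into the
// Has* flags, matching where the patch places its check.
void applyImpliedFastUnaligned(ToySubtarget &ST) {
  // SSE4.2 or SSE4a implies reasonably fast unaligned accesses.
  if (ST.HasSSE42 || ST.HasSSE4A)
    ST.IsUAMemUnder32Slow = false;
}
```

Because the check runs after feature parsing, a feature supplied on its own (as with `-mattr=sse4.2` in the test below) is enough to clear the slow flag without naming a CPU.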

test/CodeGen/X86/slow-unaligned-mem.ll

  ; Other chips with slow unaligned memory accesses

  ; RUN: llc < %s -mtriple=i386-unknown-unknown -mcpu=c3-2 2>&1 | FileCheck %s --check-prefix=SLOW

  ; Verify that the slow/fast unaligned memory attribute is set correctly for each CPU model.
  ; Slow chips use 4-byte stores. Fast chips with SSE or later use something other than 4-byte stores.
  ; Chips that don't have SSE use 4-byte stores either way, so they're not tested.

+ ; Also verify that SSE4.2 or SSE4a imply fast unaligned accesses.
+
+ ; RUN: llc < %s -mtriple=i386-unknown-unknown -mattr=sse4.2 2>&1 | FileCheck %s --check-prefix=FAST
+ ; RUN: llc < %s -mtriple=i386-unknown-unknown -mattr=sse4a 2>&1 | FileCheck %s --check-prefix=FAST
+
  define void @store_zeros(i8* %a) {
  ; SLOW-NOT: not a recognized processor
  ; SLOW-LABEL: store_zeros:
  ; SLOW: # BB#0:
  ; SLOW-NEXT: movl
  ; SLOW-NEXT: movl
  ; SLOW-NEXT: movl
  ; SLOW-NEXT: movl