This is an archive of the discontinued LLVM Phabricator instance.

[DAG] Do MergeConsecutiveStores again before Instruction Selection
ClosedPublic

Authored by niravd on May 30 2017, 6:38 AM.

Details

Reviewers

jyknight
hfinkel
efriedma
rnk
jmolloy
RKSimon

Commits

rGdb77e57ea86d: [DAG] Do MergeConsecutiveStores again before Instruction Selection
rL319036: [DAG] Do MergeConsecutiveStores again before Instruction Selection

Summary

Enable post-legalization store merging by default for all targets, extending it to non-X86 machines, to allow merging stores from lowered intrinsics and calls.

Pre-legalization store merging cannot yet be disabled, as nodes with custom lowering may be lowered during legalization, obscuring some merge candidates.

Diff Detail

Build Status

Buildable 6859
Build 6859: arc lint + arc unit

Event Timeline

niravd created this revision. May 30 2017, 6:38 AM

Herald added a subscriber: javed.absar. May 30 2017, 6:38 AM

niravd added a parent revision: D33518: [AArch64] Fix stores of zero values. May 30 2017, 6:42 AM

5% of time in DAGCombine, or 5% total? 5% total is a lot for an optimization which triggers relatively rarely.

It's about a 5% time increase for the sum of the DAG Combine phases on my bad test cases for store merging (large basic blocks with many stores that are offset from the same base but are not mergeable). It looks closer to 2% of the total. In the majority of cases, the effect is negligible.

A simple caching check on the PersistantID could remove most of the redundant work. I had planned to look at a more efficient version of this pass once the pre-legalization type merge is removed and the nodes are more stable.

Okay, in that case the compile-time sounds fine.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13170	Can you point to an example testcase where the run before legalization matters?
lib/Target/AArch64/AArch64ISelLowering.cpp
9425	How is this related to the other change?
test/CodeGen/X86/bigstructret.ll
24	We should be able to do something more clever here... but I guess it's not important.

niravd added inline comments. Jun 1 2017, 7:08 AM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13170	The most obvious is: merge_vec_element_store and merge_vec_extract_stores in CodeGen/X86/MergeConsecutiveStores.ll have issues related to bitcast simplification. From what I've seen, this is because we do a bitcast op for a vector element with index 0 that we don't do for the others. Related issues also need addressing before killing the pre-legalization merge. These seem simple enough: redundant_stores_merging and overlapping_stores_merging in CodeGen/X86/stores-merging.ll fail to merge because BaseIndexOffset can neither look through X86ISD::Wrapper nodes nor extract the offset from TargetGlobals, preventing otherwise obvious merges. The latter is also true for CodeGen/AMDGPU/merge-stores.ll. Truncated stores are neither generated nor used as valid inputs for store merge. CodeGen/AArch64/merge-store.ll adds extra nodes due to bitcasting.
lib/Target/AArch64/AArch64ISelLowering.cpp
9425	Amongst other things, splitStores replaces vector stores of zeros/scalars with stores of appropriate sizes so that, as machine instructions, we can create paired memory operations, which may be cheaper. This should be run whenever we create a new, larger store. Concrete examples of the effects can be seen in the CodeGen/AArch64/ldst-opt.ll test. This patch nominally depends on D33518, which fixes untested cases caught by the recent store merge optimizations.

efriedma added inline comments. Jun 14 2017, 4:11 PM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13170	> truncated stores are neither generated nor used as valid inputs for store merge.
	Oh, so it completely breaks everything that isn't x86. :) Could you make before/after/both an option, so it's easier to check the impact if we do run into issues? If we really just want to run this after legalization, it's probably a good idea to flip the switch as soon as possible, even if it causes minor regressions.

RKSimon resigned from this revision. Oct 29 2017, 5:14 AM

Resurrecting after long hiatus.

Herald added subscribers: nhaehnle, sdardis. Nov 6 2017, 1:22 PM

The AArch64 test changes look fine.

If the AArch64 tests look good, I think this should be landable. The only remaining test change that is more than a simple merge is in Mips/cconv/vector.ll, where store merging allows the value stored on the stack to be forwarded to a now-matching load, so the stores and loads are excised.

In D33675#918302, @efriedma wrote:

The AArch64 test changes look fine.

Ping. I think we're all set here. Can I get an LGTM?

lgtm

This revision is now accepted and ready to land. Nov 17 2017, 1:38 PM

Closed by commit rL319036: [DAG] Do MergeConsecutiveStores again before Instruction Selection (authored by niravd). Nov 27 2017, 7:31 AM

This revision was automatically updated to reflect the committed changes.

Hi Nirav,

Could you please revert the changes? They affected Arm targets (Thumb2 code).
The following sequence of stores:

MOVS     r0,#0xe5
STRB     r0,[r6,#0x1e5]
MOVS     r0,#0xe4
STRB     r0,[r6,#0x1e4]
MOVS     r0,#0xe6
STRB     r0,[r6,#0x1e6]
MOVS     r0,#0xe7
STRB     r0,[r6,#0x1e7]

is optimised into

MVN      r0,#0x1b
STR      r0,[r6,#0x1e4]

causing incorrect data to be written.

We are working on a reproducer.

Thanks,
Evgeny Astigeevich
The Arm Compiler Optimisation team

A reproducer:

test.ll (11 KB)

$ cat test.ll
...
  %v304 = getelementptr inbounds i8, i8* %v50, i32 508
  store i8 -4, i8* %v304, align 1
  %v305 = getelementptr inbounds i8, i8* %v50, i32 509
  store i8 -3, i8* %v305, align 1
  %v306 = getelementptr inbounds i8, i8* %v50, i32 510
  store i8 -2, i8* %v306, align 1
  %v307 = getelementptr inbounds i8, i8* %v50, i32 511
  store i8 -1, i8* %v307, align 1
...
$ llc -O3 -filetype=asm -o test.s test.ll
$ cat test.s
...
        movs    r1, #251
        strb.w  r1, [r0, #507]
        mvn     r1, #3 <========= HERE the problem: -4, -1, -1, -1 is written instead of -4, -3, -2, -1
        str.w   r1, [r0, #508]
        bx      lr
.Lfunc_end0:
        .size   test, .Lfunc_end0-test
        .cantunwind
        .fnend
...
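The arithmetic behind the two reports can be checked with a small standalone sketch. It is not LLVM code; packLittleEndian and byteAt are hypothetical helper names, and it assumes a little-endian target, as in the reproducer. It computes the 32-bit constant that a correct merge of the four byte stores would need and compares it with the value that `mvn r1, #3` materializes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not LLVM code: pack four byte constants, listed in
// increasing address order, into the 32-bit word a little-endian target
// must store to write exactly those bytes.
constexpr uint32_t packLittleEndian(uint8_t b0, uint8_t b1, uint8_t b2,
                                    uint8_t b3) {
  return static_cast<uint32_t>(b0) | (static_cast<uint32_t>(b1) << 8) |
         (static_cast<uint32_t>(b2) << 16) | (static_cast<uint32_t>(b3) << 24);
}

// Byte written at a given offset (0..3) when `word` is stored little-endian.
constexpr uint8_t byteAt(uint32_t word, unsigned offset) {
  return static_cast<uint8_t>(word >> (8 * offset));
}

// Reproducer: i8 -4, -3, -2, -1 (0xFC, 0xFD, 0xFE, 0xFF) at offsets 508..511.
constexpr uint32_t kExpected = packLittleEndian(0xFC, 0xFD, 0xFE, 0xFF);

// `mvn r1, #3` materializes the bitwise complement of 3.
constexpr uint32_t kEmitted = ~UINT32_C(3);

static_assert(kExpected == 0xFFFEFDFCu, "constant a correct merge would store");
static_assert(kEmitted == 0xFFFFFFFCu, "constant stored by the bad output");

// The bad store writes 0xFC, 0xFF, 0xFF, 0xFF, i.e. -4, -1, -1, -1.
static_assert(byteAt(kEmitted, 0) == 0xFC && byteAt(kEmitted, 1) == 0xFF &&
                  byteAt(kEmitted, 2) == 0xFF && byteAt(kEmitted, 3) == 0xFF,
              "matches the bytes quoted in the report");
```

The same check applies to the Thumb2 sequence in the first report: the bytes 0xe4..0xe7 would need the word 0xE7E6E5E4, while `MVN r0,#0x1b` produces 0xFFFFFFE4. In both cases the emitted word equals the lowest-addressed byte sign-extended to 32 bits; this is only an observation about the numbers and not a diagnosis of the underlying bug.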

These changes caused Clang to crash when it compiled spec2006 403.gcc for AArch64. I am working on a reproducer.

These changes caused failures of AArch64 NEON Emperor tests.

spatel mentioned this in D40790: DAGCombiner bugfix in MergeStoresOfConstantsOrVecElts().Dec 4 2017, 3:33 PM

Revision Contents

Path                                               Size
lib/CodeGen/SelectionDAG/DAGCombiner.cpp           9 lines
lib/Target/AArch64/AArch64ISelLowering.cpp         2 lines
test/CodeGen/AArch64/arm64-complex-ret.ll          3 lines
test/CodeGen/AArch64/arm64-narrow-st-merge.ll      4 lines
test/CodeGen/AArch64/arm64-variadic-aapcs.ll       16 lines
test/CodeGen/AArch64/merge-store-dependency.ll     7 lines
test/CodeGen/AArch64/tailcall-explicit-sret.ll     4 lines
test/CodeGen/AArch64/tailcall-implicit-sret.ll     6 lines
test/CodeGen/X86/MergeConsecutiveStores.ll         9 lines
test/CodeGen/X86/bigstructret.ll                   7 lines
test/CodeGen/X86/bitcast-i256.ll                   3 lines
test/CodeGen/X86/constant-combines.ll              3 lines
test/CodeGen/X86/fold-vector-sext-crash2.ll        12 lines
test/CodeGen/X86/legalize-shl-vec.ll               26 lines
test/CodeGen/X86/merge-consecutive-loads-128.ll    187 lines
test/CodeGen/X86/no-sse2-avg.ll                    18 lines

Diff 100701

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

Show First 20 Lines • Show All 13,157 Lines • ▼ Show 20 Lines	SDValue DAGCombiner::visitSTORE(SDNode *N) {
   if ((Value.getOpcode() == ISD::FP_ROUND || Value.getOpcode() == ISD::TRUNCATE)
       && Value.getNode()->hasOneUse() && ST->isUnindexed() &&
       TLI.isTruncStoreLegal(Value.getOperand(0).getValueType(),
                             ST->getMemoryVT())) {
     return DAG.getTruncStore(Chain, SDLoc(N), Value.getOperand(0),
                              Ptr, ST->getMemoryVT(), ST->getMemOperand());
   }

-  // Only perform this optimization before the types are legal, because we
-  // don't want to perform this optimization on every DAGCombine invocation.
-  if (!LegalTypes) {
+  // FIXME: This pass can be expensive and we should do it only once,
+  // ideally just before Instruction Selection so that we can merge stores from
+  // lowered intrinsics. Currently some LegalizeDAG changes cases of
+  // of MergeStores from happening. For now do the merging twice; before and
+  // after legalization.
+  if (!LegalTypes || (Level == AfterLegalizeDAG)) {
     for (;;) {
       // There can be multiple store sequences on the same chain.
       // Keep trying to merge store sequences until we are unable to do so
       // or until we merge the last store on the chain.
       bool Changed = MergeConsecutiveStores(ST);
       if (!Changed) break;
       // Return N as merge only uses CombineTo and no worklist clean
       // up is necessary.
▲ Show 20 Lines • Show All 3,625 Lines • Show Last 20 Lines
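
The effect of the changed condition in visitSTORE can be sketched outside LLVM. The block below is not LLVM code: the enum copies the CombineLevel names from llvm/CodeGen/DAGCombine.h as they are understood to have been at the time of this review, the legalTypes helper assumes that DAGCombiner sets LegalTypes once the level is AfterLegalizeTypes or later, and mergesBeforePatch and mergesAfterPatch are hypothetical names for the old and new conditions.

```cpp
#include <cassert>

// Assumed copy of the combine levels; the DAG combiner runs once per level.
enum CombineLevel {
  BeforeLegalizeTypes,
  AfterLegalizeTypes,
  AfterLegalizeVectorOps,
  AfterLegalizeDAG
};

// Assumption: LegalTypes is true from AfterLegalizeTypes onwards.
constexpr bool legalTypes(CombineLevel Level) {
  return Level >= AfterLegalizeTypes;
}

// Old condition: if (!LegalTypes)
constexpr bool mergesBeforePatch(CombineLevel Level) {
  return !legalTypes(Level);
}

// New condition: if (!LegalTypes || (Level == AfterLegalizeDAG))
constexpr bool mergesAfterPatch(CombineLevel Level) {
  return !legalTypes(Level) || Level == AfterLegalizeDAG;
}

// Before the patch, merging ran only in the first combine.
static_assert(mergesBeforePatch(BeforeLegalizeTypes), "first combine");
static_assert(!mergesBeforePatch(AfterLegalizeDAG), "not in the last combine");

// After the patch, it runs in the first and the last combine only.
static_assert(mergesAfterPatch(BeforeLegalizeTypes), "first combine");
static_assert(!mergesAfterPatch(AfterLegalizeTypes), "skipped");
static_assert(!mergesAfterPatch(AfterLegalizeVectorOps), "skipped");
static_assert(mergesAfterPatch(AfterLegalizeDAG), "last combine");
```

Under these assumptions MergeConsecutiveStores runs in two of the four combine rounds, which is consistent with the FIXME comment ("do the merging twice") and with the compile-time discussion earlier in the review.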

lib/Target/AArch64/AArch64ISelLowering.cpp

Show First 20 Lines • Show All 9,414 Lines • ▼ Show 20 Lines	if (IndexNotInserted.any())
     return SDValue();

   return splitStoreSplat(DAG, St, SplatVal, NumVecElts);
 }

 static SDValue splitStores(SDNode *N, TargetLowering::DAGCombinerInfo &DCI,
                            SelectionDAG &DAG,
                            const AArch64Subtarget *Subtarget) {
-  if (!DCI.isBeforeLegalize())
-    return SDValue();
-
   StoreSDNode *S = cast<StoreSDNode>(N);
   if (S->isVolatile())
     return SDValue();

   SDValue StVal = S->getValue();
   EVT VT = StVal.getValueType();
   if (!VT.isVector())
     return SDValue();
▲ Show 20 Lines • Show All 1,405 Lines • Show Last 20 Lines

test/CodeGen/AArch64/arm64-complex-ret.ll

 ; RUN: llc -mtriple=arm64-eabi -o - %s | FileCheck %s

 define { i192, i192, i21, i192 } @foo(i192) {
 ; CHECK-LABEL: foo:
-; CHECK: stp xzr, xzr, [x8]
+; CHECK-DAG: stp xzr, xzr, [x8, #8]
+; CHECK-DAG: str xzr, [x8]
   ret { i192, i192, i21, i192 } {i192 0, i192 1, i21 2, i192 3}
 }

test/CodeGen/AArch64/arm64-narrow-st-merge.ll

Show All 13 Lines	entry:
%add = add nsw i32 %n, 1		%add = add nsw i32 %n, 1
%idxprom1 = sext i32 %add to i64		%idxprom1 = sext i32 %add to i64
%arrayidx2 = getelementptr inbounds i16, i16* %P, i64 %idxprom1		%arrayidx2 = getelementptr inbounds i16, i16* %P, i64 %idxprom1
store i16 0, i16* %arrayidx2		store i16 0, i16* %arrayidx2
ret void		ret void
}		}

; CHECK-LABEL: Strh_zero_4		; CHECK-LABEL: Strh_zero_4
; CHECK: stp wzr, wzr		; CHECK: str xzr
; CHECK-STRICT-LABEL: Strh_zero_4		; CHECK-STRICT-LABEL: Strh_zero_4
; CHECK-STRICT: strh wzr		; CHECK-STRICT: strh wzr
; CHECK-STRICT: strh wzr		; CHECK-STRICT: strh wzr
; CHECK-STRICT: strh wzr		; CHECK-STRICT: strh wzr
; CHECK-STRICT: strh wzr		; CHECK-STRICT: strh wzr
define void @Strh_zero_4(i16* nocapture %P, i32 %n) {		define void @Strh_zero_4(i16* nocapture %P, i32 %n) {
entry:		entry:
%idxprom = sext i32 %n to i64		%idxprom = sext i32 %n to i64
▲ Show 20 Lines • Show All 101 Lines • ▼ Show 20 Lines	entry:
%sub1 = add nsw i32 %n, -3		%sub1 = add nsw i32 %n, -3
%idxprom2 = sext i32 %sub1 to i64		%idxprom2 = sext i32 %sub1 to i64
%arrayidx3 = getelementptr inbounds i16, i16* %P, i64 %idxprom2		%arrayidx3 = getelementptr inbounds i16, i16* %P, i64 %idxprom2
store i16 0, i16* %arrayidx3		store i16 0, i16* %arrayidx3
ret void		ret void
}		}

; CHECK-LABEL: Sturh_zero_4		; CHECK-LABEL: Sturh_zero_4
; CHECK: stp wzr, wzr		; CHECK: stur xzr
; CHECK-STRICT-LABEL: Sturh_zero_4		; CHECK-STRICT-LABEL: Sturh_zero_4
; CHECK-STRICT: sturh wzr		; CHECK-STRICT: sturh wzr
; CHECK-STRICT: sturh wzr		; CHECK-STRICT: sturh wzr
; CHECK-STRICT: sturh wzr		; CHECK-STRICT: sturh wzr
; CHECK-STRICT: sturh wzr		; CHECK-STRICT: sturh wzr
define void @Sturh_zero_4(i16* nocapture %P, i32 %n) {		define void @Sturh_zero_4(i16* nocapture %P, i32 %n) {
entry:		entry:
%sub = add nsw i32 %n, -3		%sub = add nsw i32 %n, -3
▲ Show 20 Lines • Show All 61 Lines • Show Last 20 Lines

test/CodeGen/AArch64/arm64-variadic-aapcs.ll

	Show All 26 Lines
	; CHECK: add [[GR_TOPTMP:x[0-9]+]], sp, #[[GR_BASE]]			; CHECK: add [[GR_TOPTMP:x[0-9]+]], sp, #[[GR_BASE]]
	; CHECK: add [[GR_TOP:x[0-9]+]], [[GR_TOPTMP]], #56			; CHECK: add [[GR_TOP:x[0-9]+]], [[GR_TOPTMP]], #56
	; CHECK: str [[GR_TOP]], [x[[VA_LIST]], #8]			; CHECK: str [[GR_TOP]], [x[[VA_LIST]], #8]

	; CHECK: mov [[VR_TOPTMP:x[0-9]+]], sp			; CHECK: mov [[VR_TOPTMP:x[0-9]+]], sp
	; CHECK: add [[VR_TOP:x[0-9]+]], [[VR_TOPTMP]], #128			; CHECK: add [[VR_TOP:x[0-9]+]], [[VR_TOPTMP]], #128
	; CHECK: str [[VR_TOP]], [x[[VA_LIST]], #16]			; CHECK: str [[VR_TOP]], [x[[VA_LIST]], #16]

	; CHECK: mov [[GR_OFFS:w[0-9]+]], #-56			; CHECK: mov [[GRVR:x[0-9]+]], #-545460846720
	; CHECK: str [[GR_OFFS]], [x[[VA_LIST]], #24]			; CHECK: movk [[GRVR]], #65480
				; CHECK: str [[GRVR]], [x[[VA_LIST]], #24]
	; CHECK: orr [[VR_OFFS:w[0-9]+]], wzr, #0xffffff80
	; CHECK: str [[VR_OFFS]], [x[[VA_LIST]], #28]

	%addr = bitcast %va_list* @var to i8*			%addr = bitcast %va_list* @var to i8*
	call void @llvm.va_start(i8* %addr)			call void @llvm.va_start(i8* %addr)

	ret void			ret void
	}			}

	define void @test_fewargs(i32 %n, i32 %n1, i32 %n2, float %m, ...) {			define void @test_fewargs(i32 %n, i32 %n1, i32 %n2, float %m, ...) {
	Show All 17 Lines
	; CHECK: add [[GR_TOPTMP:x[0-9]+]], sp, #[[GR_BASE]]			; CHECK: add [[GR_TOPTMP:x[0-9]+]], sp, #[[GR_BASE]]
	; CHECK: add [[GR_TOP:x[0-9]+]], [[GR_TOPTMP]], #40			; CHECK: add [[GR_TOP:x[0-9]+]], [[GR_TOPTMP]], #40
	; CHECK: str [[GR_TOP]], [x[[VA_LIST]], #8]			; CHECK: str [[GR_TOP]], [x[[VA_LIST]], #8]

	; CHECK: mov [[VR_TOPTMP:x[0-9]+]], sp			; CHECK: mov [[VR_TOPTMP:x[0-9]+]], sp
	; CHECK: add [[VR_TOP:x[0-9]+]], [[VR_TOPTMP]], #112			; CHECK: add [[VR_TOP:x[0-9]+]], [[VR_TOPTMP]], #112
	; CHECK: str [[VR_TOP]], [x[[VA_LIST]], #16]			; CHECK: str [[VR_TOP]], [x[[VA_LIST]], #16]

	; CHECK: mov [[GR_OFFS:w[0-9]+]], #-40			; CHECK: mov [[GRVR_OFFS:x[0-9]+]], #-40
	; CHECK: str [[GR_OFFS]], [x[[VA_LIST]], #24]			; CHECK: movk [[GRVR_OFFS]], #65424, lsl #32
				; CHECK: str [[GRVR_OFFS]], [x[[VA_LIST]], #24]
	; CHECK: mov [[VR_OFFS:w[0-9]+]], #-11
	; CHECK: str [[VR_OFFS]], [x[[VA_LIST]], #28]

	%addr = bitcast %va_list* @var to i8*			%addr = bitcast %va_list* @var to i8*
	call void @llvm.va_start(i8* %addr)			call void @llvm.va_start(i8* %addr)

	ret void			ret void
	}			}

	define void @test_nospare([8 x i64], [8 x float], ...) {			define void @test_nospare([8 x i64], [8 x float], ...) {
	▲ Show 20 Lines • Show All 60 Lines • Show Last 20 Lines
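
The new CHECK lines in arm64-variadic-aapcs.ll replace two 32-bit stores (at offsets #24 and #28 of the va_list) with one 64-bit store at #24. A small standalone sketch can confirm that the new mov/movk sequences encode the same pair of offsets. It is not LLVM code: packOffsets and movk are hypothetical helpers, and a little-endian layout is assumed, so the value at #24 occupies the low half of the merged constant.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the 64-bit value whose little-endian store writes
// `lo` at offset +0 and `hi` at offset +4.
constexpr uint64_t packOffsets(int32_t lo, int32_t hi) {
  return static_cast<uint64_t>(static_cast<uint32_t>(lo)) |
         (static_cast<uint64_t>(static_cast<uint32_t>(hi)) << 32);
}

// Model of AArch64 MOVK: replace one 16-bit field, keep the other bits.
constexpr uint64_t movk(uint64_t reg, uint16_t imm, unsigned shift) {
  return (reg & ~(UINT64_C(0xFFFF) << shift)) |
         (static_cast<uint64_t>(imm) << shift);
}

// First function: mov #-545460846720 ; movk #65480
constexpr uint64_t kFirst =
    movk(static_cast<uint64_t>(INT64_C(-545460846720)), 65480, 0);
// Old checks: #-56 stored at #24, 0xffffff80 (-128) stored at #28.
static_assert(kFirst == packOffsets(-56, -128), "same bytes as the two stores");

// test_fewargs: mov #-40 ; movk #65424, lsl #32
constexpr uint64_t kFewArgs =
    movk(static_cast<uint64_t>(INT64_C(-40)), 65424, 32);
// Low half is -40; the high half decodes to -112.
static_assert(kFewArgs == packOffsets(-40, -112), "same bytes as the two stores");
```

For test_fewargs the high half decodes to -112, while the old CHECK line reads `#-11`; since FileCheck patterns match substrings, that old pattern would also match `#-112`.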

test/CodeGen/AArch64/merge-store-dependency.ll

	; RUN: llc < %s -mcpu cortex-a53 -mtriple=aarch64-eabi | FileCheck %s --check-prefix=A53			; RUN: llc < %s -mcpu cortex-a53 -mtriple=aarch64-eabi | FileCheck %s --check-prefix=A53

	; PR26827 - Merge stores causes wrong dependency.			; PR26827 - Merge stores causes wrong dependency.
	%struct1 = type { %struct1, %struct1, i32, i32, i16, i16, void (i32, i32, i8), i8* }			%struct1 = type { %struct1, %struct1, i32, i32, i16, i16, void (i32, i32, i8), i8* }
	@gv0 = internal unnamed_addr global i32 0, align 4			@gv0 = internal unnamed_addr global i32 0, align 4
	@gv1 = internal unnamed_addr global %struct1** null, align 8			@gv1 = internal unnamed_addr global %struct1** null, align 8

	define void @test(%struct1* %fde, i32 %fd, void (i32, i32, i8) %func, i8* %arg) {			define void @test(%struct1* %fde, i32 %fd, void (i32, i32, i8) %func, i8* %arg) {
	;CHECK-LABEL: test			;CHECK-LABEL: test
	entry:			entry:
	; A53: mov [[DATA:w[0-9]+]], w1			; A53: mov [[DATA:w[0-9]+]], w1
	; A53: str q{{[0-9]+}}, {{.*}}			; A53-DAG: stp xzr, xzr
	; A53: str q{{[0-9]+}}, {{.*}}			; A53-DAG: str q0
	; A53: str [[DATA]], {{.*}}			; A53-DAG: str [[DATA]]

	%0 = bitcast %struct1* %fde to i8*			%0 = bitcast %struct1* %fde to i8*
	tail call void @llvm.memset.p0i8.i64(i8* %0, i8 0, i64 40, i32 8, i1 false)			tail call void @llvm.memset.p0i8.i64(i8* %0, i8 0, i64 40, i32 8, i1 false)
	%state = getelementptr inbounds %struct1, %struct1* %fde, i64 0, i32 4			%state = getelementptr inbounds %struct1, %struct1* %fde, i64 0, i32 4
	store i16 256, i16* %state, align 8			store i16 256, i16* %state, align 8
	%fd1 = getelementptr inbounds %struct1, %struct1* %fde, i64 0, i32 2			%fd1 = getelementptr inbounds %struct1, %struct1* %fde, i64 0, i32 2
	store i32 %fd, i32* %fd1, align 8			store i32 %fd, i32* %fd1, align 8
	%force_eof = getelementptr inbounds %struct1, %struct1* %fde, i64 0, i32 3			%force_eof = getelementptr inbounds %struct1, %struct1* %fde, i64 0, i32 3
	store i32 0, i32* %force_eof, align 4			store i32 0, i32* %force_eof, align 4
	Show All 40 Lines

test/CodeGen/AArch64/tailcall-explicit-sret.ll

Show First 20 Lines • Show All 73 Lines • ▼ Show 20 Lines	define i1024 @test_tailcall_explicit_sret_alloca_returned() #0 {
ret i1024 %r		ret i1024 %r
}		}

; CHECK-LABEL: _test_indirect_tailcall_explicit_sret_nosret_arg:		; CHECK-LABEL: _test_indirect_tailcall_explicit_sret_nosret_arg:
; CHECK-DAG: mov x[[CALLERX8NUM:[0-9]+]], x8		; CHECK-DAG: mov x[[CALLERX8NUM:[0-9]+]], x8
; CHECK-DAG: mov [[FPTR:x[0-9]+]], x0		; CHECK-DAG: mov [[FPTR:x[0-9]+]], x0
; CHECK: mov x0, sp		; CHECK: mov x0, sp
; CHECK-NEXT: blr [[FPTR]]		; CHECK-NEXT: blr [[FPTR]]
; CHECK-NEXT: ldr [[CALLERSRET1:x[0-9]+]], [sp]		; CHECK: ldr [[CALLERSRET1:x[0-9]+]], [sp]
; CHECK: str [[CALLERSRET1:x[0-9]+]], [x[[CALLERX8NUM]]]		; CHECK: str [[CALLERSRET1:x[0-9]+]], [x[[CALLERX8NUM]]]
; CHECK: ret		; CHECK: ret
define void @test_indirect_tailcall_explicit_sret_nosret_arg(i1024* sret %arg, void (i1024) %f) #0 {		define void @test_indirect_tailcall_explicit_sret_nosret_arg(i1024* sret %arg, void (i1024) %f) #0 {
%l = alloca i1024, align 8		%l = alloca i1024, align 8
tail call void %f(i1024* %l)		tail call void %f(i1024* %l)
%r = load i1024, i1024* %l, align 8		%r = load i1024, i1024* %l, align 8
store i1024 %r, i1024* %arg, align 8		store i1024 %r, i1024* %arg, align 8
ret void		ret void
}		}

; CHECK-LABEL: _test_indirect_tailcall_explicit_sret_:		; CHECK-LABEL: _test_indirect_tailcall_explicit_sret_:
; CHECK: mov x[[CALLERX8NUM:[0-9]+]], x8		; CHECK: mov x[[CALLERX8NUM:[0-9]+]], x8
; CHECK: mov x8, sp		; CHECK: mov x8, sp
; CHECK-NEXT: blr x0		; CHECK-NEXT: blr x0
; CHECK-NEXT: ldr [[CALLERSRET1:x[0-9]+]], [sp]		; CHECK: ldr [[CALLERSRET1:x[0-9]+]], [sp]
; CHECK: str [[CALLERSRET1:x[0-9]+]], [x[[CALLERX8NUM]]]		; CHECK: str [[CALLERSRET1:x[0-9]+]], [x[[CALLERX8NUM]]]
; CHECK: ret		; CHECK: ret
define void @test_indirect_tailcall_explicit_sret_(i1024* sret %arg, i1024 ()* %f) #0 {		define void @test_indirect_tailcall_explicit_sret_(i1024* sret %arg, i1024 ()* %f) #0 {
%ret = tail call i1024 %f()		%ret = tail call i1024 %f()
store i1024 %ret, i1024* %arg, align 8		store i1024 %ret, i1024* %arg, align 8
ret void		ret void
}		}

attributes #0 = { nounwind }		attributes #0 = { nounwind }

test/CodeGen/AArch64/tailcall-implicit-sret.ll

	; RUN: llc < %s -mtriple arm64-apple-darwin -aarch64-enable-ldst-opt=false -disable-post-ra -asm-verbose=false | FileCheck %s			; RUN: llc < %s -mtriple arm64-apple-darwin -aarch64-enable-ldst-opt=false -disable-post-ra -asm-verbose=false | FileCheck %s
	; Disable the load/store optimizer to avoid having LDP/STPs and simplify checks.			; Disable the load/store optimizer to avoid having LDP/STPs and simplify checks.

	target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"			target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"

	; Check that we don't try to tail-call with an sret-demoted return.			; Check that we don't try to tail-call with an sret-demoted return.

	declare i1024 @test_sret() #0			declare i1024 @test_sret() #0

	; CHECK-LABEL: _test_call_sret:			; CHECK-LABEL: _test_call_sret:
	; CHECK: mov x[[CALLERX8NUM:[0-9]+]], x8			; CHECK: mov x[[CALLERX8NUM:[0-9]+]], x8
	; CHECK: mov x8, sp			; CHECK: mov x8, sp
	; CHECK-NEXT: bl _test_sret			; CHECK-NEXT: bl _test_sret
	; CHECK-NEXT: ldr [[CALLERSRET1:x[0-9]+]], [sp]			; CHECK: ldr [[CALLERSRET1:x[0-9]+]], [sp]
	; CHECK: str [[CALLERSRET1:x[0-9]+]], [x[[CALLERX8NUM]]]			; CHECK: str [[CALLERSRET1:x[0-9]+]], [x[[CALLERX8NUM]]]
	; CHECK: ret			; CHECK: ret
	define i1024 @test_call_sret() #0 {			define i1024 @test_call_sret() #0 {
	%a = call i1024 @test_sret()			%a = call i1024 @test_sret()
	ret i1024 %a			ret i1024 %a
	}			}

	; CHECK-LABEL: _test_tailcall_sret:			; CHECK-LABEL: _test_tailcall_sret:
	; CHECK: mov x[[CALLERX8NUM:[0-9]+]], x8			; CHECK: mov x[[CALLERX8NUM:[0-9]+]], x8
	; CHECK: mov x8, sp			; CHECK: mov x8, sp
	; CHECK-NEXT: bl _test_sret			; CHECK-NEXT: bl _test_sret
	; CHECK-NEXT: ldr [[CALLERSRET1:x[0-9]+]], [sp]			; CHECK: ldr [[CALLERSRET1:x[0-9]+]], [sp]
	; CHECK: str [[CALLERSRET1:x[0-9]+]], [x[[CALLERX8NUM]]]			; CHECK: str [[CALLERSRET1:x[0-9]+]], [x[[CALLERX8NUM]]]
	; CHECK: ret			; CHECK: ret
	define i1024 @test_tailcall_sret() #0 {			define i1024 @test_tailcall_sret() #0 {
	%a = tail call i1024 @test_sret()			%a = tail call i1024 @test_sret()
	ret i1024 %a			ret i1024 %a
	}			}

	; CHECK-LABEL: _test_indirect_tailcall_sret:			; CHECK-LABEL: _test_indirect_tailcall_sret:
	; CHECK: mov x[[CALLERX8NUM:[0-9]+]], x8			; CHECK: mov x[[CALLERX8NUM:[0-9]+]], x8
	; CHECK: mov x8, sp			; CHECK: mov x8, sp
	; CHECK-NEXT: blr x0			; CHECK-NEXT: blr x0
	; CHECK-NEXT: ldr [[CALLERSRET1:x[0-9]+]], [sp]			; CHECK: ldr [[CALLERSRET1:x[0-9]+]], [sp]
	; CHECK: str [[CALLERSRET1:x[0-9]+]], [x[[CALLERX8NUM]]]			; CHECK: str [[CALLERSRET1:x[0-9]+]], [x[[CALLERX8NUM]]]
	; CHECK: ret			; CHECK: ret
	define i1024 @test_indirect_tailcall_sret(i1024 ()* %f) #0 {			define i1024 @test_indirect_tailcall_sret(i1024 ()* %f) #0 {
	%a = tail call i1024 %f()			%a = tail call i1024 %f()
	ret i1024 %a			ret i1024 %a
	}			}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }

test/CodeGen/X86/MergeConsecutiveStores.ll

	Show First 20 Lines • Show All 552 Lines • ▼ Show 20 Lines
	; CHECK-LABEL: merge_vec_stores_of_constants			; CHECK-LABEL: merge_vec_stores_of_constants
	; CHECK: vxorps			; CHECK: vxorps
	; CHECK-NEXT: vmovaps			; CHECK-NEXT: vmovaps
	; CHECK-NEXT: vmovaps			; CHECK-NEXT: vmovaps
	; CHECK-NEXT: retq			; CHECK-NEXT: retq
	}			}

	; This is a minimized test based on real code that was failing.			; This is a minimized test based on real code that was failing.
	; We could merge stores (and loads) like this...			; This should now be merged.

	define void @merge_vec_element_and_scalar_load([6 x i64]* %array) {			define void @merge_vec_element_and_scalar_load([6 x i64]* %array) {
	%idx0 = getelementptr inbounds [6 x i64], [6 x i64]* %array, i64 0, i64 0			%idx0 = getelementptr inbounds [6 x i64], [6 x i64]* %array, i64 0, i64 0
	%idx1 = getelementptr inbounds [6 x i64], [6 x i64]* %array, i64 0, i64 1			%idx1 = getelementptr inbounds [6 x i64], [6 x i64]* %array, i64 0, i64 1
	%idx4 = getelementptr inbounds [6 x i64], [6 x i64]* %array, i64 0, i64 4			%idx4 = getelementptr inbounds [6 x i64], [6 x i64]* %array, i64 0, i64 4
	%idx5 = getelementptr inbounds [6 x i64], [6 x i64]* %array, i64 0, i64 5			%idx5 = getelementptr inbounds [6 x i64], [6 x i64]* %array, i64 0, i64 5

	%a0 = load i64, i64* %idx0, align 8			%a0 = load i64, i64* %idx0, align 8
	store i64 %a0, i64* %idx4, align 8			store i64 %a0, i64* %idx4, align 8

	%b = bitcast i64* %idx1 to <2 x i64>*			%b = bitcast i64* %idx1 to <2 x i64>*
	%v = load <2 x i64>, <2 x i64>* %b, align 8			%v = load <2 x i64>, <2 x i64>* %b, align 8
	%a1 = extractelement <2 x i64> %v, i32 0			%a1 = extractelement <2 x i64> %v, i32 0
	store i64 %a1, i64* %idx5, align 8			store i64 %a1, i64* %idx5, align 8
	ret void			ret void

	; CHECK-LABEL: merge_vec_element_and_scalar_load			; CHECK-LABEL: merge_vec_element_and_scalar_load
	; CHECK: movq (%rdi), %rax			; CHECK: vmovups (%rdi), %xmm0
	; CHECK-NEXT: movq 8(%rdi), %rcx			; CHECK-NEXT: vmovups %xmm0, 32(%rdi)
	; CHECK-NEXT: movq %rax, 32(%rdi)
	; CHECK-NEXT: movq %rcx, 40(%rdi)
	; CHECK-NEXT: retq			; CHECK-NEXT: retq
	}			}



	; Don't let a non-consecutive store thwart merging of the last two.			; Don't let a non-consecutive store thwart merging of the last two.
	define void @almost_consecutive_stores(i8* %p) {			define void @almost_consecutive_stores(i8* %p) {
	store i8 0, i8* %p			store i8 0, i8* %p
	Show All 13 Lines

test/CodeGen/X86/bigstructret.ll

Show All 13 Lines	entry:
%0 = insertvalue %0 zeroinitializer, i32 12, 0		%0 = insertvalue %0 zeroinitializer, i32 12, 0
%1 = insertvalue %0 %0, i32 24, 1		%1 = insertvalue %0 %0, i32 24, 1
%2 = insertvalue %0 %1, i32 48, 2		%2 = insertvalue %0 %1, i32 48, 2
%3 = insertvalue %0 %2, i32 24601, 3		%3 = insertvalue %0 %2, i32 24601, 3
ret %0 %3		ret %0 %3
}		}

; CHECK: ReturnBigStruct2		; CHECK: ReturnBigStruct2
; CHECK: movl $48, 4(%ecx)		; CHECK-DAG: movl $48, 4(%ecx)
; CHECK: movb $1, 2(%ecx)		; CHECK-DAG: movb $1, 2(%ecx)
; CHECK: movb $1, 1(%ecx)		; CHECK-DAG: movw $256, (%ecx)
; CHECK: movb $0, (%ecx)

define fastcc %1 @ReturnBigStruct2() nounwind readnone {		define fastcc %1 @ReturnBigStruct2() nounwind readnone {
entry:		entry:
%0 = insertvalue %1 zeroinitializer, i1 false, 0		%0 = insertvalue %1 zeroinitializer, i1 false, 0
%1 = insertvalue %1 %0, i1 true, 1		%1 = insertvalue %1 %0, i1 true, 1
%2 = insertvalue %1 %1, i1 true, 2		%2 = insertvalue %1 %1, i1 true, 2
%3 = insertvalue %1 %2, i32 48, 3		%3 = insertvalue %1 %2, i32 48, 3
ret %1 %3		ret %1 %3
}		}

test/CodeGen/X86/bitcast-i256.ll

 ; RUN: llc -mtriple=x86_64-unknown-unknown -mcpu=core-avx-i < %s | FileCheck %s

 define i256 @foo(<8 x i32> %a) {
   %r = bitcast <8 x i32> %a to i256
   ret i256 %r
 ; CHECK: foo
 ; CHECK: vextractf128
-; CHECK: vpextrq
-; CHECK: vpextrq
+; CHECK: vmovups
 ; CHECK: ret
 }

test/CodeGen/X86/constant-combines.ll

	Show All 9 Lines
	; The DAG combiner at one point contained bugs that given enough permutations			; The DAG combiner at one point contained bugs that given enough permutations
	; would incorrectly form an illegal operation for the last of these stores when			; would incorrectly form an illegal operation for the last of these stores when
	; it folded it to a zero too late to legalize the zero store operation. If this			; it folded it to a zero too late to legalize the zero store operation. If this
	; ever starts forming a zero store instead of movss, the test case has stopped			; ever starts forming a zero store instead of movss, the test case has stopped
	; being useful.			; being useful.
	;			;
	; CHECK-LABEL: PR22524:			; CHECK-LABEL: PR22524:
	; CHECK: # BB#0: # %entry			; CHECK: # BB#0: # %entry
	; CHECK-NEXT: movl $0, 4(%rdi)
	; CHECK-NEXT: xorl %eax, %eax			; CHECK-NEXT: xorl %eax, %eax
	; CHECK-NEXT: movd %eax, %xmm0			; CHECK-NEXT: movd %eax, %xmm0
	; CHECK-NEXT: xorps %xmm1, %xmm1			; CHECK-NEXT: xorps %xmm1, %xmm1
	; CHECK-NEXT: mulss %xmm0, %xmm1			; CHECK-NEXT: mulss %xmm0, %xmm1
	; CHECK-NEXT: movl $0, (%rdi)			; CHECK-NEXT: movq $0, (%rdi)
	; CHECK-NEXT: movss %xmm1, 4(%rdi)			; CHECK-NEXT: movss %xmm1, 4(%rdi)
	; CHECK-NEXT: retq			; CHECK-NEXT: retq
	entry:			entry:
	%0 = getelementptr inbounds { float, float }, { float, float }* %arg, i32 0, i32 1			%0 = getelementptr inbounds { float, float }, { float, float }* %arg, i32 0, i32 1
	store float 0.000000e+00, float* %0, align 4			store float 0.000000e+00, float* %0, align 4
	%1 = getelementptr inbounds { float, float }, { float, float }* %arg, i64 0, i32 0			%1 = getelementptr inbounds { float, float }, { float, float }* %arg, i64 0, i32 0
	%2 = bitcast float* %1 to i64*			%2 = bitcast float* %1 to i64*
	%3 = load i64, i64* %2, align 8			%3 = load i64, i64* %2, align 8
	Show All 10 Lines

test/CodeGen/X86/fold-vector-sext-crash2.ll

	Show First 20 Lines • Show All 47 Lines • ▼ Show 20 Lines
	}			}

	define <2 x i256> @test_zext1() {			define <2 x i256> @test_zext1() {
	%Se = zext <2 x i8> <i8 -1, i8 -2> to <2 x i256>			%Se = zext <2 x i8> <i8 -1, i8 -2> to <2 x i256>
	%Shuff = shufflevector <2 x i256> zeroinitializer, <2 x i256> %Se, <2 x i32> <i32 1, i32 3>			%Shuff = shufflevector <2 x i256> zeroinitializer, <2 x i256> %Se, <2 x i32> <i32 1, i32 3>
	ret <2 x i256> %Shuff			ret <2 x i256> %Shuff

	; X64-LABEL: test_zext1			; X64-LABEL: test_zext1
	; X64: movq $0			; X64: xorps %xmm0, %xmm0
	; X64-NEXT: movq $0			; X64: movaps %xmm0
				; X64: movaps %xmm0
				; X64: movaps %xmm0
	; X64-NEXT: movq $0			; X64-NEXT: movq $0
	; X64-NEXT: movq $254			; X64-NEXT: movq $254

	; X32-LABEL: test_zext1			; X32-LABEL: test_zext1
	; X32: movl $0			; X32: movl $0
	; X32-NEXT: movl $0			; X32-NEXT: movl $0
	; X32-NEXT: movl $0			; X32-NEXT: movl $0
	; X32-NEXT: movl $0			; X32-NEXT: movl $0
	; X32-NEXT: movl $0			; X32-NEXT: movl $0
	; X32-NEXT: movl $0			; X32-NEXT: movl $0
	; X32-NEXT: movl $0			; X32-NEXT: movl $0
	; X32-NEXT: movl $254			; X32-NEXT: movl $254
	}			}

	define <2 x i256> @test_zext2() {			define <2 x i256> @test_zext2() {
	%Se = zext <2 x i128> <i128 -1, i128 -2> to <2 x i256>			%Se = zext <2 x i128> <i128 -1, i128 -2> to <2 x i256>
	%Shuff = shufflevector <2 x i256> zeroinitializer, <2 x i256> %Se, <2 x i32> <i32 1, i32 3>			%Shuff = shufflevector <2 x i256> zeroinitializer, <2 x i256> %Se, <2 x i32> <i32 1, i32 3>
	ret <2 x i256> %Shuff			ret <2 x i256> %Shuff

	; X64-LABEL: test_zext2			; X64-LABEL: test_zext2
	; X64: movq $0			; X64: xorps %xmm0, %xmm0
	; X64-NEXT: movq $0			; X64-NEXT: movaps %xmm0
				; X64-NEXT: movaps %xmm0
				; X64-NEXT: movaps %xmm0
	; X64-NEXT: movq $-1			; X64-NEXT: movq $-1
	; X64-NEXT: movq $-2			; X64-NEXT: movq $-2

	; X32-LABEL: test_zext2			; X32-LABEL: test_zext2
	; X32: movl $0			; X32: movl $0
	; X32-NEXT: movl $0			; X32-NEXT: movl $0
	; X32-NEXT: movl $0			; X32-NEXT: movl $0
	; X32-NEXT: movl $0			; X32-NEXT: movl $0
	; X32-NEXT: movl $-1			; X32-NEXT: movl $-1
	; X32-NEXT: movl $-1			; X32-NEXT: movl $-1
	; X32-NEXT: movl $-1			; X32-NEXT: movl $-1
	; X32-NEXT: movl $-2			; X32-NEXT: movl $-2
	}			}

test/CodeGen/X86/legalize-shl-vec.ll

	[20 lines not shown]
	; X32-NEXT: movl $0, 12(%eax)			; X32-NEXT: movl $0, 12(%eax)
	; X32-NEXT: movl $0, 8(%eax)			; X32-NEXT: movl $0, 8(%eax)
	; X32-NEXT: movl $0, 4(%eax)			; X32-NEXT: movl $0, 4(%eax)
	; X32-NEXT: movl $0, (%eax)			; X32-NEXT: movl $0, (%eax)
	; X32-NEXT: retl $4			; X32-NEXT: retl $4
	;			;
	; X64-LABEL: test_shl:			; X64-LABEL: test_shl:
	; X64: # BB#0:			; X64: # BB#0:
	; X64-NEXT: movq $0, 56(%rdi)			; X64-NEXT: xorps %xmm0, %xmm0
	; X64-NEXT: movq $0, 48(%rdi)			; X64-NEXT: movaps %xmm0, 48(%rdi)
	; X64-NEXT: movq $0, 40(%rdi)			; X64-NEXT: movaps %xmm0, 32(%rdi)
	; X64-NEXT: movq $0, 32(%rdi)			; X64-NEXT: movaps %xmm0, 16(%rdi)
	; X64-NEXT: movq $0, 24(%rdi)			; X64-NEXT: movaps %xmm0, (%rdi)
	; X64-NEXT: movq $0, 16(%rdi)
	; X64-NEXT: movq $0, 8(%rdi)
	; X64-NEXT: movq $0, (%rdi)
	; X64-NEXT: movq %rdi, %rax			; X64-NEXT: movq %rdi, %rax
	; X64-NEXT: retq			; X64-NEXT: retq
	%Amt = insertelement <2 x i256> undef, i256 -1, i32 0			%Amt = insertelement <2 x i256> undef, i256 -1, i32 0
	%Out = shl <2 x i256> %In, %Amt			%Out = shl <2 x i256> %In, %Amt
	ret <2 x i256> %Out			ret <2 x i256> %Out
	}			}

	define <2 x i256> @test_srl(<2 x i256> %In) {			define <2 x i256> @test_srl(<2 x i256> %In) {
	[15 lines not shown]
	; X32-NEXT: movl $0, 12(%eax)			; X32-NEXT: movl $0, 12(%eax)
	; X32-NEXT: movl $0, 8(%eax)			; X32-NEXT: movl $0, 8(%eax)
	; X32-NEXT: movl $0, 4(%eax)			; X32-NEXT: movl $0, 4(%eax)
	; X32-NEXT: movl $0, (%eax)			; X32-NEXT: movl $0, (%eax)
	; X32-NEXT: retl $4			; X32-NEXT: retl $4
	;			;
	; X64-LABEL: test_srl:			; X64-LABEL: test_srl:
	; X64: # BB#0:			; X64: # BB#0:
	; X64-NEXT: movq $0, 56(%rdi)			; X64-NEXT: xorps %xmm0, %xmm0
	; X64-NEXT: movq $0, 48(%rdi)			; X64-NEXT: movaps %xmm0, 48(%rdi)
	; X64-NEXT: movq $0, 40(%rdi)			; X64-NEXT: movaps %xmm0, 32(%rdi)
	; X64-NEXT: movq $0, 32(%rdi)			; X64-NEXT: movaps %xmm0, 16(%rdi)
	; X64-NEXT: movq $0, 24(%rdi)			; X64-NEXT: movaps %xmm0, (%rdi)
	; X64-NEXT: movq $0, 16(%rdi)
	; X64-NEXT: movq $0, 8(%rdi)
	; X64-NEXT: movq $0, (%rdi)
	; X64-NEXT: movq %rdi, %rax			; X64-NEXT: movq %rdi, %rax
	; X64-NEXT: retq			; X64-NEXT: retq
	%Amt = insertelement <2 x i256> undef, i256 -1, i32 0			%Amt = insertelement <2 x i256> undef, i256 -1, i32 0
	%Out = lshr <2 x i256> %In, %Amt			%Out = lshr <2 x i256> %In, %Amt
	ret <2 x i256> %Out			ret <2 x i256> %Out
	}			}

	define <2 x i256> @test_sra(<2 x i256> %In) {			define <2 x i256> @test_sra(<2 x i256> %In) {
	[51 lines not shown]

test/CodeGen/X86/merge-consecutive-loads-128.ll

	[524 lines not shown]
	;			;
	; AVX-LABEL: merge_8i16_i16_23u567u9:			; AVX-LABEL: merge_8i16_i16_23u567u9:
	; AVX: # BB#0:			; AVX: # BB#0:
	; AVX-NEXT: vmovups 4(%rdi), %xmm0			; AVX-NEXT: vmovups 4(%rdi), %xmm0
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; X32-SSE1-LABEL: merge_8i16_i16_23u567u9:			; X32-SSE1-LABEL: merge_8i16_i16_23u567u9:
	; X32-SSE1: # BB#0:			; X32-SSE1: # BB#0:
	; X32-SSE1-NEXT: pushl %ebp			; X32-SSE1-NEXT: pushl %edi
	; X32-SSE1-NEXT: .Lcfi6:			; X32-SSE1-NEXT: .Lcfi6:
	; X32-SSE1-NEXT: .cfi_def_cfa_offset 8			; X32-SSE1-NEXT: .cfi_def_cfa_offset 8
	; X32-SSE1-NEXT: pushl %ebx			; X32-SSE1-NEXT: pushl %esi
	; X32-SSE1-NEXT: .Lcfi7:			; X32-SSE1-NEXT: .Lcfi7:
	; X32-SSE1-NEXT: .cfi_def_cfa_offset 12			; X32-SSE1-NEXT: .cfi_def_cfa_offset 12
	; X32-SSE1-NEXT: pushl %edi
	; X32-SSE1-NEXT: .Lcfi8:			; X32-SSE1-NEXT: .Lcfi8:
	; X32-SSE1-NEXT: .cfi_def_cfa_offset 16			; X32-SSE1-NEXT: .cfi_offset %esi, -12
	; X32-SSE1-NEXT: pushl %esi
	; X32-SSE1-NEXT: .Lcfi9:			; X32-SSE1-NEXT: .Lcfi9:
	; X32-SSE1-NEXT: .cfi_def_cfa_offset 20			; X32-SSE1-NEXT: .cfi_offset %edi, -8
	; X32-SSE1-NEXT: .Lcfi10:
	; X32-SSE1-NEXT: .cfi_offset %esi, -20
	; X32-SSE1-NEXT: .Lcfi11:
	; X32-SSE1-NEXT: .cfi_offset %edi, -16
	; X32-SSE1-NEXT: .Lcfi12:
	; X32-SSE1-NEXT: .cfi_offset %ebx, -12
	; X32-SSE1-NEXT: .Lcfi13:
	; X32-SSE1-NEXT: .cfi_offset %ebp, -8
	; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %eax			; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %ecx
	; X32-SSE1-NEXT: movzwl 4(%ecx), %edx			; X32-SSE1-NEXT: movl 4(%ecx), %edx
	; X32-SSE1-NEXT: movzwl 6(%ecx), %esi			; X32-SSE1-NEXT: movl 10(%ecx), %esi
	; X32-SSE1-NEXT: movzwl 10(%ecx), %edi			; X32-SSE1-NEXT: movzwl 14(%ecx), %edi
	; X32-SSE1-NEXT: movzwl 12(%ecx), %ebx
	; X32-SSE1-NEXT: movzwl 14(%ecx), %ebp
	; X32-SSE1-NEXT: movzwl 18(%ecx), %ecx			; X32-SSE1-NEXT: movzwl 18(%ecx), %ecx
	; X32-SSE1-NEXT: movw %bp, 10(%eax)			; X32-SSE1-NEXT: movw %di, 10(%eax)
	; X32-SSE1-NEXT: movw %bx, 8(%eax)
	; X32-SSE1-NEXT: movw %cx, 14(%eax)			; X32-SSE1-NEXT: movw %cx, 14(%eax)
	; X32-SSE1-NEXT: movw %si, 2(%eax)			; X32-SSE1-NEXT: movl %edx, (%eax)
	; X32-SSE1-NEXT: movw %dx, (%eax)			; X32-SSE1-NEXT: movl %esi, 6(%eax)
	; X32-SSE1-NEXT: movw %di, 6(%eax)
	; X32-SSE1-NEXT: popl %esi			; X32-SSE1-NEXT: popl %esi
	; X32-SSE1-NEXT: popl %edi			; X32-SSE1-NEXT: popl %edi
	; X32-SSE1-NEXT: popl %ebx
	; X32-SSE1-NEXT: popl %ebp
	; X32-SSE1-NEXT: retl $4			; X32-SSE1-NEXT: retl $4
	;			;
	; X32-SSE41-LABEL: merge_8i16_i16_23u567u9:			; X32-SSE41-LABEL: merge_8i16_i16_23u567u9:
	; X32-SSE41: # BB#0:			; X32-SSE41: # BB#0:
	; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax			; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X32-SSE41-NEXT: movups 4(%eax), %xmm0			; X32-SSE41-NEXT: movups 4(%eax), %xmm0
	; X32-SSE41-NEXT: retl			; X32-SSE41-NEXT: retl
	%ptr0 = getelementptr inbounds i16, i16* %ptr, i64 2			%ptr0 = getelementptr inbounds i16, i16* %ptr, i64 2
	[27 lines not shown]
	; AVX: # BB#0:			; AVX: # BB#0:
	; AVX-NEXT: vmovss {{.*#+}} xmm0 = mem[0],zero,zero,zero			; AVX-NEXT: vmovss {{.*#+}} xmm0 = mem[0],zero,zero,zero
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; X32-SSE1-LABEL: merge_8i16_i16_34uuuuuu:			; X32-SSE1-LABEL: merge_8i16_i16_34uuuuuu:
	; X32-SSE1: # BB#0:			; X32-SSE1: # BB#0:
	; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %eax			; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %ecx
	; X32-SSE1-NEXT: movzwl 6(%ecx), %edx			; X32-SSE1-NEXT: movl 6(%ecx), %ecx
	; X32-SSE1-NEXT: movzwl 8(%ecx), %ecx			; X32-SSE1-NEXT: movl %ecx, (%eax)
	; X32-SSE1-NEXT: movw %cx, 2(%eax)
	; X32-SSE1-NEXT: movw %dx, (%eax)
	; X32-SSE1-NEXT: retl $4			; X32-SSE1-NEXT: retl $4
	;			;
	; X32-SSE41-LABEL: merge_8i16_i16_34uuuuuu:			; X32-SSE41-LABEL: merge_8i16_i16_34uuuuuu:
	; X32-SSE41: # BB#0:			; X32-SSE41: # BB#0:
	; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax			; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X32-SSE41-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero			; X32-SSE41-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
	; X32-SSE41-NEXT: retl			; X32-SSE41-NEXT: retl
	%ptr0 = getelementptr inbounds i16, i16* %ptr, i64 3			%ptr0 = getelementptr inbounds i16, i16* %ptr, i64 3
	[13 lines not shown]
	;			;
	; AVX-LABEL: merge_8i16_i16_45u7zzzz:			; AVX-LABEL: merge_8i16_i16_45u7zzzz:
	; AVX: # BB#0:			; AVX: # BB#0:
	; AVX-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero			; AVX-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; X32-SSE1-LABEL: merge_8i16_i16_45u7zzzz:			; X32-SSE1-LABEL: merge_8i16_i16_45u7zzzz:
	; X32-SSE1: # BB#0:			; X32-SSE1: # BB#0:
	; X32-SSE1-NEXT: pushl %esi
	; X32-SSE1-NEXT: .Lcfi14:
	; X32-SSE1-NEXT: .cfi_def_cfa_offset 8
	; X32-SSE1-NEXT: .Lcfi15:
	; X32-SSE1-NEXT: .cfi_offset %esi, -8
	; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %eax			; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %ecx
	; X32-SSE1-NEXT: movzwl 8(%ecx), %edx			; X32-SSE1-NEXT: movl 8(%ecx), %edx
	; X32-SSE1-NEXT: movzwl 10(%ecx), %esi
	; X32-SSE1-NEXT: movzwl 14(%ecx), %ecx			; X32-SSE1-NEXT: movzwl 14(%ecx), %ecx
	; X32-SSE1-NEXT: movw %si, 2(%eax)			; X32-SSE1-NEXT: movl %edx, (%eax)
	; X32-SSE1-NEXT: movw %dx, (%eax)
	; X32-SSE1-NEXT: movw %cx, 6(%eax)			; X32-SSE1-NEXT: movw %cx, 6(%eax)
	; X32-SSE1-NEXT: movw $0, 14(%eax)			; X32-SSE1-NEXT: movl $0, 12(%eax)
	; X32-SSE1-NEXT: movw $0, 12(%eax)			; X32-SSE1-NEXT: movl $0, 8(%eax)
	; X32-SSE1-NEXT: movw $0, 10(%eax)
	; X32-SSE1-NEXT: movw $0, 8(%eax)
	; X32-SSE1-NEXT: popl %esi
	; X32-SSE1-NEXT: retl $4			; X32-SSE1-NEXT: retl $4
	;			;
	; X32-SSE41-LABEL: merge_8i16_i16_45u7zzzz:			; X32-SSE41-LABEL: merge_8i16_i16_45u7zzzz:
	; X32-SSE41: # BB#0:			; X32-SSE41: # BB#0:
	; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax			; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X32-SSE41-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero			; X32-SSE41-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
	; X32-SSE41-NEXT: retl			; X32-SSE41-NEXT: retl
	%ptr0 = getelementptr inbounds i16, i16* %ptr, i64 4			%ptr0 = getelementptr inbounds i16, i16* %ptr, i64 4
	[20 lines not shown]
	;			;
	; AVX-LABEL: merge_16i8_i8_01u3456789ABCDuF:			; AVX-LABEL: merge_16i8_i8_01u3456789ABCDuF:
	; AVX: # BB#0:			; AVX: # BB#0:
	; AVX-NEXT: vmovups (%rdi), %xmm0			; AVX-NEXT: vmovups (%rdi), %xmm0
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; X32-SSE1-LABEL: merge_16i8_i8_01u3456789ABCDuF:			; X32-SSE1-LABEL: merge_16i8_i8_01u3456789ABCDuF:
	; X32-SSE1: # BB#0:			; X32-SSE1: # BB#0:
				; X32-SSE1-NEXT: pushl %ebp
				; X32-SSE1-NEXT: .Lcfi10:
				; X32-SSE1-NEXT: .cfi_def_cfa_offset 8
	; X32-SSE1-NEXT: pushl %ebx			; X32-SSE1-NEXT: pushl %ebx
				; X32-SSE1-NEXT: .Lcfi11:
				; X32-SSE1-NEXT: .cfi_def_cfa_offset 12
				; X32-SSE1-NEXT: pushl %edi
				; X32-SSE1-NEXT: .Lcfi12:
				; X32-SSE1-NEXT: .cfi_def_cfa_offset 16
				; X32-SSE1-NEXT: pushl %esi
				; X32-SSE1-NEXT: .Lcfi13:
				; X32-SSE1-NEXT: .cfi_def_cfa_offset 20
				; X32-SSE1-NEXT: .Lcfi14:
				; X32-SSE1-NEXT: .cfi_offset %esi, -20
				; X32-SSE1-NEXT: .Lcfi15:
				; X32-SSE1-NEXT: .cfi_offset %edi, -16
	; X32-SSE1-NEXT: .Lcfi16:			; X32-SSE1-NEXT: .Lcfi16:
	; X32-SSE1-NEXT: .cfi_def_cfa_offset 8			; X32-SSE1-NEXT: .cfi_offset %ebx, -12
	; X32-SSE1-NEXT: subl $12, %esp
	; X32-SSE1-NEXT: .Lcfi17:			; X32-SSE1-NEXT: .Lcfi17:
	; X32-SSE1-NEXT: .cfi_def_cfa_offset 20			; X32-SSE1-NEXT: .cfi_offset %ebp, -8
	; X32-SSE1-NEXT: .Lcfi18:
	; X32-SSE1-NEXT: .cfi_offset %ebx, -8
	; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %eax			; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %ecx
	; X32-SSE1-NEXT: movb (%ecx), %dl			; X32-SSE1-NEXT: movzwl (%ecx), %ebp
	; X32-SSE1-NEXT: movb %dl, {{[0-9]+}}(%esp) # 1-byte Spill			; X32-SSE1-NEXT: movl 3(%ecx), %esi
	; X32-SSE1-NEXT: movb 1(%ecx), %dl			; X32-SSE1-NEXT: movl 7(%ecx), %edi
	; X32-SSE1-NEXT: movb %dl, {{[0-9]+}}(%esp) # 1-byte Spill			; X32-SSE1-NEXT: movzwl 11(%ecx), %ebx
	; X32-SSE1-NEXT: movb 3(%ecx), %dl
	; X32-SSE1-NEXT: movb %dl, {{[0-9]+}}(%esp) # 1-byte Spill
	; X32-SSE1-NEXT: movb 4(%ecx), %dl
	; X32-SSE1-NEXT: movb %dl, {{[0-9]+}}(%esp) # 1-byte Spill
	; X32-SSE1-NEXT: movb 5(%ecx), %dl
	; X32-SSE1-NEXT: movb %dl, {{[0-9]+}}(%esp) # 1-byte Spill
	; X32-SSE1-NEXT: movb 6(%ecx), %dl
	; X32-SSE1-NEXT: movb %dl, {{[0-9]+}}(%esp) # 1-byte Spill
	; X32-SSE1-NEXT: movb 7(%ecx), %dl
	; X32-SSE1-NEXT: movb %dl, {{[0-9]+}}(%esp) # 1-byte Spill
	; X32-SSE1-NEXT: movb 8(%ecx), %dl
	; X32-SSE1-NEXT: movb %dl, {{[0-9]+}}(%esp) # 1-byte Spill
	; X32-SSE1-NEXT: movb 9(%ecx), %dl
	; X32-SSE1-NEXT: movb %dl, {{[0-9]+}}(%esp) # 1-byte Spill
	; X32-SSE1-NEXT: movb 10(%ecx), %bh
	; X32-SSE1-NEXT: movb 11(%ecx), %bl
	; X32-SSE1-NEXT: movb 12(%ecx), %dh
	; X32-SSE1-NEXT: movb 13(%ecx), %dl			; X32-SSE1-NEXT: movb 13(%ecx), %dl
	; X32-SSE1-NEXT: movb 15(%ecx), %cl			; X32-SSE1-NEXT: movb 15(%ecx), %cl
	; X32-SSE1-NEXT: movb %dl, 13(%eax)			; X32-SSE1-NEXT: movb %dl, 13(%eax)
	; X32-SSE1-NEXT: movb %dh, 12(%eax)
	; X32-SSE1-NEXT: movb %cl, 15(%eax)			; X32-SSE1-NEXT: movb %cl, 15(%eax)
	; X32-SSE1-NEXT: movb %bl, 11(%eax)			; X32-SSE1-NEXT: movw %bx, 11(%eax)
	; X32-SSE1-NEXT: movb %bh, 10(%eax)			; X32-SSE1-NEXT: movl %edi, 7(%eax)
	; X32-SSE1-NEXT: movb {{[0-9]+}}(%esp), %cl # 1-byte Reload			; X32-SSE1-NEXT: movw %bp, (%eax)
	; X32-SSE1-NEXT: movb %cl, 9(%eax)			; X32-SSE1-NEXT: movl %esi, 3(%eax)
	; X32-SSE1-NEXT: movb {{[0-9]+}}(%esp), %cl # 1-byte Reload			; X32-SSE1-NEXT: popl %esi
	; X32-SSE1-NEXT: movb %cl, 8(%eax)			; X32-SSE1-NEXT: popl %edi
	; X32-SSE1-NEXT: movb {{[0-9]+}}(%esp), %cl # 1-byte Reload
	; X32-SSE1-NEXT: movb %cl, 7(%eax)
	; X32-SSE1-NEXT: movb {{[0-9]+}}(%esp), %cl # 1-byte Reload
	; X32-SSE1-NEXT: movb %cl, 6(%eax)
	; X32-SSE1-NEXT: movb {{[0-9]+}}(%esp), %cl # 1-byte Reload
	; X32-SSE1-NEXT: movb %cl, 5(%eax)
	; X32-SSE1-NEXT: movb {{[0-9]+}}(%esp), %cl # 1-byte Reload
	; X32-SSE1-NEXT: movb %cl, 4(%eax)
	; X32-SSE1-NEXT: movb {{[0-9]+}}(%esp), %cl # 1-byte Reload
	; X32-SSE1-NEXT: movb %cl, 1(%eax)
	; X32-SSE1-NEXT: movb {{[0-9]+}}(%esp), %cl # 1-byte Reload
	; X32-SSE1-NEXT: movb %cl, (%eax)
	; X32-SSE1-NEXT: movb {{[0-9]+}}(%esp), %cl # 1-byte Reload
	; X32-SSE1-NEXT: movb %cl, 3(%eax)
	; X32-SSE1-NEXT: addl $12, %esp
	; X32-SSE1-NEXT: popl %ebx			; X32-SSE1-NEXT: popl %ebx
				; X32-SSE1-NEXT: popl %ebp
	; X32-SSE1-NEXT: retl $4			; X32-SSE1-NEXT: retl $4
	;			;
	; X32-SSE41-LABEL: merge_16i8_i8_01u3456789ABCDuF:			; X32-SSE41-LABEL: merge_16i8_i8_01u3456789ABCDuF:
	; X32-SSE41: # BB#0:			; X32-SSE41: # BB#0:
	; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax			; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X32-SSE41-NEXT: movups (%eax), %xmm0			; X32-SSE41-NEXT: movups (%eax), %xmm0
	; X32-SSE41-NEXT: retl			; X32-SSE41-NEXT: retl
	%ptr0 = getelementptr inbounds i8, i8* %ptr, i64 0			%ptr0 = getelementptr inbounds i8, i8* %ptr, i64 0
	[51 lines not shown]
	; AVX: # BB#0:			; AVX: # BB#0:
	; AVX-NEXT: vmovss {{.*#+}} xmm0 = mem[0],zero,zero,zero			; AVX-NEXT: vmovss {{.*#+}} xmm0 = mem[0],zero,zero,zero
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; X32-SSE1-LABEL: merge_16i8_i8_01u3uuzzuuuuuzzz:			; X32-SSE1-LABEL: merge_16i8_i8_01u3uuzzuuuuuzzz:
	; X32-SSE1: # BB#0:			; X32-SSE1: # BB#0:
	; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %eax			; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %ecx
	; X32-SSE1-NEXT: movb (%ecx), %dl			; X32-SSE1-NEXT: movzwl (%ecx), %edx
	; X32-SSE1-NEXT: movb 1(%ecx), %dh
	; X32-SSE1-NEXT: movb 3(%ecx), %cl			; X32-SSE1-NEXT: movb 3(%ecx), %cl
	; X32-SSE1-NEXT: movb %dh, 1(%eax)			; X32-SSE1-NEXT: movw %dx, (%eax)
	; X32-SSE1-NEXT: movb %dl, (%eax)
	; X32-SSE1-NEXT: movb %cl, 3(%eax)			; X32-SSE1-NEXT: movb %cl, 3(%eax)
	; X32-SSE1-NEXT: movb $0, 15(%eax)			; X32-SSE1-NEXT: movb $0, 15(%eax)
	; X32-SSE1-NEXT: movb $0, 14(%eax)			; X32-SSE1-NEXT: movw $0, 13(%eax)
	; X32-SSE1-NEXT: movb $0, 13(%eax)			; X32-SSE1-NEXT: movw $0, 6(%eax)
	; X32-SSE1-NEXT: movb $0, 7(%eax)
	; X32-SSE1-NEXT: movb $0, 6(%eax)
	; X32-SSE1-NEXT: retl $4			; X32-SSE1-NEXT: retl $4
	;			;
	; X32-SSE41-LABEL: merge_16i8_i8_01u3uuzzuuuuuzzz:			; X32-SSE41-LABEL: merge_16i8_i8_01u3uuzzuuuuuzzz:
	; X32-SSE41: # BB#0:			; X32-SSE41: # BB#0:
	; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax			; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X32-SSE41-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero			; X32-SSE41-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
	; X32-SSE41-NEXT: retl			; X32-SSE41-NEXT: retl
	%ptr0 = getelementptr inbounds i8, i8* %ptr, i64 0			%ptr0 = getelementptr inbounds i8, i8* %ptr, i64 0
	[21 lines not shown]
	;			;
	; AVX-LABEL: merge_16i8_i8_0123uu67uuuuuzzz:			; AVX-LABEL: merge_16i8_i8_0123uu67uuuuuzzz:
	; AVX: # BB#0:			; AVX: # BB#0:
	; AVX-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero			; AVX-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; X32-SSE1-LABEL: merge_16i8_i8_0123uu67uuuuuzzz:			; X32-SSE1-LABEL: merge_16i8_i8_0123uu67uuuuuzzz:
	; X32-SSE1: # BB#0:			; X32-SSE1: # BB#0:
	; X32-SSE1-NEXT: pushl %ebx
	; X32-SSE1-NEXT: .Lcfi19:
	; X32-SSE1-NEXT: .cfi_def_cfa_offset 8
	; X32-SSE1-NEXT: pushl %eax
	; X32-SSE1-NEXT: .Lcfi20:
	; X32-SSE1-NEXT: .cfi_def_cfa_offset 12
	; X32-SSE1-NEXT: .Lcfi21:
	; X32-SSE1-NEXT: .cfi_offset %ebx, -8
	; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %eax			; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %ecx
	; X32-SSE1-NEXT: movb (%ecx), %dl			; X32-SSE1-NEXT: movl (%ecx), %edx
	; X32-SSE1-NEXT: movb %dl, {{[0-9]+}}(%esp) # 1-byte Spill			; X32-SSE1-NEXT: movzwl 6(%ecx), %ecx
	; X32-SSE1-NEXT: movb 1(%ecx), %dh			; X32-SSE1-NEXT: movw %cx, 6(%eax)
	; X32-SSE1-NEXT: movb 2(%ecx), %bl			; X32-SSE1-NEXT: movl %edx, (%eax)
	; X32-SSE1-NEXT: movb 3(%ecx), %bh
	; X32-SSE1-NEXT: movb 6(%ecx), %dl
	; X32-SSE1-NEXT: movb 7(%ecx), %cl
	; X32-SSE1-NEXT: movb %cl, 7(%eax)
	; X32-SSE1-NEXT: movb %dl, 6(%eax)
	; X32-SSE1-NEXT: movb %bh, 3(%eax)
	; X32-SSE1-NEXT: movb %bl, 2(%eax)
	; X32-SSE1-NEXT: movb %dh, 1(%eax)
	; X32-SSE1-NEXT: movb {{[0-9]+}}(%esp), %cl # 1-byte Reload
	; X32-SSE1-NEXT: movb %cl, (%eax)
	; X32-SSE1-NEXT: movb $0, 15(%eax)			; X32-SSE1-NEXT: movb $0, 15(%eax)
	; X32-SSE1-NEXT: movb $0, 14(%eax)			; X32-SSE1-NEXT: movw $0, 13(%eax)
	; X32-SSE1-NEXT: movb $0, 13(%eax)
	; X32-SSE1-NEXT: addl $4, %esp
	; X32-SSE1-NEXT: popl %ebx
	; X32-SSE1-NEXT: retl $4			; X32-SSE1-NEXT: retl $4
	;			;
	; X32-SSE41-LABEL: merge_16i8_i8_0123uu67uuuuuzzz:			; X32-SSE41-LABEL: merge_16i8_i8_0123uu67uuuuuzzz:
	; X32-SSE41: # BB#0:			; X32-SSE41: # BB#0:
	; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax			; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X32-SSE41-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero			; X32-SSE41-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
	; X32-SSE41-NEXT: retl			; X32-SSE41-NEXT: retl
	%ptr0 = getelementptr inbounds i8, i8* %ptr, i64 0			%ptr0 = getelementptr inbounds i8, i8* %ptr, i64 0
	[78 lines not shown]
	; AVX-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero			; AVX-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
	; AVX-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero			; AVX-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
	; AVX-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]			; AVX-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
	; AVX-NEXT: retq			; AVX-NEXT: retq
	;			;
	; X32-SSE1-LABEL: merge_2i64_i64_12_volatile:			; X32-SSE1-LABEL: merge_2i64_i64_12_volatile:
	; X32-SSE1: # BB#0:			; X32-SSE1: # BB#0:
	; X32-SSE1-NEXT: pushl %edi			; X32-SSE1-NEXT: pushl %edi
	; X32-SSE1-NEXT: .Lcfi22:			; X32-SSE1-NEXT: .Lcfi18:
	; X32-SSE1-NEXT: .cfi_def_cfa_offset 8			; X32-SSE1-NEXT: .cfi_def_cfa_offset 8
	; X32-SSE1-NEXT: pushl %esi			; X32-SSE1-NEXT: pushl %esi
	; X32-SSE1-NEXT: .Lcfi23:			; X32-SSE1-NEXT: .Lcfi19:
	; X32-SSE1-NEXT: .cfi_def_cfa_offset 12			; X32-SSE1-NEXT: .cfi_def_cfa_offset 12
	; X32-SSE1-NEXT: .Lcfi24:			; X32-SSE1-NEXT: .Lcfi20:
	; X32-SSE1-NEXT: .cfi_offset %esi, -12			; X32-SSE1-NEXT: .cfi_offset %esi, -12
	; X32-SSE1-NEXT: .Lcfi25:			; X32-SSE1-NEXT: .Lcfi21:
	; X32-SSE1-NEXT: .cfi_offset %edi, -8			; X32-SSE1-NEXT: .cfi_offset %edi, -8
	; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %eax			; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X32-SSE1-NEXT: movl {{[0-9]+}}(%esp), %ecx
	; X32-SSE1-NEXT: movl 8(%ecx), %edx			; X32-SSE1-NEXT: movl 8(%ecx), %edx
	; X32-SSE1-NEXT: movl 12(%ecx), %esi			; X32-SSE1-NEXT: movl 12(%ecx), %esi
	; X32-SSE1-NEXT: movl 16(%ecx), %edi			; X32-SSE1-NEXT: movl 16(%ecx), %edi
	; X32-SSE1-NEXT: movl 20(%ecx), %ecx			; X32-SSE1-NEXT: movl 20(%ecx), %ecx
	; X32-SSE1-NEXT: movl %ecx, 12(%eax)			; X32-SSE1-NEXT: movl %ecx, 12(%eax)
	[160 lines not shown]

test/CodeGen/X86/no-sse2-avg.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; REQUIRES: asserts			; REQUIRES: asserts
	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-sse2 \| FileCheck %s			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-sse2 \| FileCheck %s

	define <16 x i8> @PR27973() {			define <16 x i8> @PR27973() {
	; CHECK-LABEL: PR27973:			; CHECK-LABEL: PR27973:
	; CHECK: # BB#0:			; CHECK: # BB#0:
	; CHECK-NEXT: movb $0, 15(%rdi)			; CHECK-NEXT: movq $0, 8(%rdi)
	; CHECK-NEXT: movb $0, 14(%rdi)			; CHECK-NEXT: movq $0, (%rdi)
	; CHECK-NEXT: movb $0, 13(%rdi)
	; CHECK-NEXT: movb $0, 12(%rdi)
	; CHECK-NEXT: movb $0, 11(%rdi)
	; CHECK-NEXT: movb $0, 10(%rdi)
	; CHECK-NEXT: movb $0, 9(%rdi)
	; CHECK-NEXT: movb $0, 8(%rdi)
	; CHECK-NEXT: movb $0, 7(%rdi)
	; CHECK-NEXT: movb $0, 6(%rdi)
	; CHECK-NEXT: movb $0, 5(%rdi)
	; CHECK-NEXT: movb $0, 4(%rdi)
	; CHECK-NEXT: movb $0, 3(%rdi)
	; CHECK-NEXT: movb $0, 2(%rdi)
	; CHECK-NEXT: movb $0, 1(%rdi)
	; CHECK-NEXT: movb $0, (%rdi)
	; CHECK-NEXT: movq %rdi, %rax			; CHECK-NEXT: movq %rdi, %rax
	; CHECK-NEXT: retq			; CHECK-NEXT: retq
	;			;
	%t0 = zext <16 x i8> zeroinitializer to <16 x i32>			%t0 = zext <16 x i8> zeroinitializer to <16 x i32>
	%t1 = add nuw nsw <16 x i32> %t0, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>			%t1 = add nuw nsw <16 x i32> %t0, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
	%t2 = lshr <16 x i32> %t1, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>			%t2 = lshr <16 x i32> %t1, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
	%t3 = trunc <16 x i32> %t2 to <16 x i8>			%t3 = trunc <16 x i32> %t2 to <16 x i8>
	ret <16 x i8> %t3			ret <16 x i8> %t3
	}			}
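
Note on the test updates above: the new CHECK lines all follow one pattern, where runs of adjacent narrow constant stores become fewer, wider stores (for example, sixteen `movb $0` stores in PR27973 become two `movq $0` stores). The following is a minimal sketch of that effect only, not the DAGCombiner implementation. The function name `merge_zero_stores` and the `max_width` parameter are hypothetical, and the sketch assumes that every store writes the constant zero, that unaligned wide stores are acceptable, and that no intervening memory operation blocks the merge. The real pass must additionally check chains, aliasing, type legality, and alignment.

```python
def merge_zero_stores(stores, max_width=8):
    """Toy model of merging adjacent zero stores into wider stores.

    stores    -- iterable of (offset, size_in_bytes) pairs, each a store of 0
    max_width -- widest store allowed, in bytes (assumed to be a power of two)

    Returns a list of (offset, width) pairs covering the same bytes.
    """
    # Flatten every store to the individual byte offsets it writes.
    offsets = sorted({off + i for off, size in stores for i in range(size)})

    merged = []
    i = 0
    while i < len(offsets):
        # Find the end of the current run of contiguous byte offsets.
        j = i
        while j + 1 < len(offsets) and offsets[j + 1] == offsets[j] + 1:
            j += 1

        start = offsets[i]
        remaining = j - i + 1
        # Cover the run greedily with the widest power-of-two store that fits.
        while remaining:
            width = max_width
            while width > remaining:
                width //= 2
            merged.append((start, width))
            start += width
            remaining -= width

        i = j + 1
    return merged


# Sixteen single-byte zero stores at offsets 0..15, 8-byte stores allowed.
print(merge_zero_stores([(k, 1) for k in range(16)]))

# Zero bytes at offsets 6, 7, 13, 14, 15 with 4-byte stores allowed.
print(merge_zero_stores([(15, 1), (14, 1), (13, 1), (7, 1), (6, 1)], max_width=4))
```

In this model the first call yields two 8-byte stores at offsets 0 and 8, the same shape as the updated PR27973 checks. The second yields a 2-byte store at 6, a 2-byte store at 13, and a 1-byte store at 15, which resembles the `movw`/`movb` mix in the updated X32-SSE1 checks for merge_16i8_i8_01u3uuzzuuuuuzzz.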
