This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] allow load/store merging if pairs can be rotated into place
Closed, Public

Authored by spatel on Jul 10 2020, 9:20 AM.


Details

Reviewers

efriedma
RKSimon
lebedev.ri
craig.topper

Commits

rG2df46a574387: [DAGCombiner] allow load/store merging if pairs can be rotated into place

Summary

This carves out an exception for a pair of consecutive loads that are reversed from the consecutive order of a pair of stores. All of the existing profitability/legality checks for the memops remain between the 2 altered hunks of code.

This should give us the same x86 base-case asm that gcc gets in PR41098 and PR44895:
https://bugs.llvm.org/show_bug.cgi?id=41098
https://bugs.llvm.org/show_bug.cgi?id=44895
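
As a rough illustration of the exception described above (this is not code from the patch), the sketch below shows in plain C++ why two adjacent loads that are stored back in swapped order are equivalent to one double-width load, a rotate by half the width, and one double-width store. The function names are invented for this example, and `memcpy` stands in for the wide memory accesses.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Rotate a 32-bit value by 16 bits, i.e. exchange its two 16-bit halves.
inline std::uint32_t rotl32_by_half(std::uint32_t v) {
  return (v << 16) | (v >> 16);
}

// Baseline: two narrow loads followed by two narrow stores in swapped order.
inline void swap_halves_narrow(const std::uint16_t *src, std::uint16_t *dst) {
  assert(src != nullptr && dst != nullptr);
  std::uint16_t i0 = src[0];
  std::uint16_t i1 = src[1];
  dst[0] = i1;
  dst[1] = i0;
}

// Merged form: one wide load, one rotate by half the width, one wide store.
// memcpy keeps the wide accesses free of alignment and aliasing assumptions.
inline void swap_halves_merged(const std::uint16_t *src, std::uint16_t *dst) {
  assert(src != nullptr && dst != nullptr);
  std::uint32_t wide;
  std::memcpy(&wide, src, sizeof(wide));
  wide = rotl32_by_half(wide);
  std::memcpy(dst, &wide, sizeof(wide));
}
```

Because the rotate amount is exactly half the width, the result does not depend on the byte order of the host: the two halves trade places either way.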

I think we are missing a potential subsequent conversion to use "movbe" if the target supports that. That might be similar to what AArch64 would use to get "rev16".
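
To spell out why a byte-swapping instruction could apply to the 16-bit case: rotating a 16-bit value by 8 is the same operation as reversing its two bytes. The sketch below checks that identity exhaustively in plain C++. The helper names are made up for this example, and nothing here is taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 16-bit value left by 8 bits.
inline std::uint16_t rotl16_by_8(std::uint16_t v) {
  return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

// Reverse the byte order of a 16-bit value.
inline std::uint16_t byteswap16(std::uint16_t v) {
  return static_cast<std::uint16_t>(((v & 0x00FFu) << 8) | ((v & 0xFF00u) >> 8));
}

// Compare the two operations for every possible 16-bit input.
inline void check_all_16bit_values() {
  for (std::uint32_t i = 0; i <= 0xFFFFu; ++i) {
    std::uint16_t v = static_cast<std::uint16_t>(i);
    assert(rotl16_by_8(v) == byteswap16(v));
  }
}
```

The identity only holds for the 16-bit pair of bytes; for wider pairs the rotate swaps halves without reversing the bytes inside each half.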

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision. Jul 10 2020, 9:20 AM

Herald added a project: Restricted Project. Jul 10 2020, 9:20 AM

Herald added subscribers: ecnelises, hiraditya, kristof.beyls, mcrosier.

Can you create a new PR for AArch64's missed opportunity to use rev16?

In D83567#2146341, @xbolva00 wrote:

Can you create a new PR for AArch64's missed opportunity to use rev16?

https://bugs.llvm.org/show_bug.cgi?id=46694

LGTM - I'm curious whether rotates across more load/stores would be useful or not: https://gcc.godbolt.org/z/196ar9

This revision is now accepted and ready to land. Jul 13 2020, 1:04 AM

In D83567#2146679, @RKSimon wrote:

LGTM - I'm curious whether rotates across more load/stores would be useful or not: https://gcc.godbolt.org/z/196ar9

It's hard to say without a real app to show the usefulness. We're reducing instructions, but creating a longer critical path here, so the profitability of even the basic case will depend on uarch/benchmark.

"reverse_edge4_2()" in the godbolt example is effectively the same as the "rotate64_iterate()" regression test, so we already get that one. But the "rotate32_consecutive()" regression test shows a gap in our merging ability.

Compile-time may be another consideration on how far we should take this. IIRC, the existing merging code could cause noticeable slowdowns for large blocks with lots of memops.
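
For readers comparing the two regression tests named above, here is a hedged C++ sketch of the store orders they exercise. "rotate64_iterate" stores four i16 values in the order i2, i3, i0, i1, which is one 64-bit rotate by 32. "rotate32_consecutive" stores them as i1, i0, i3, i2, which would be two 32-bit rotates by 16; that second form is what the TODO in the test asks about, not something this patch produces. The function names are invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

inline std::uint32_t swap_halves32(std::uint32_t v) { return (v << 16) | (v >> 16); }
inline std::uint64_t swap_halves64(std::uint64_t v) { return (v << 32) | (v >> 32); }

// Store order i2, i3, i0, i1: a single 64-bit rotate by 32 bits.
inline void store_as_rotate64(const std::uint16_t *src, std::uint16_t *dst) {
  assert(src != nullptr && dst != nullptr);
  std::uint64_t wide;
  std::memcpy(&wide, src, sizeof(wide));
  wide = swap_halves64(wide);
  std::memcpy(dst, &wide, sizeof(wide));
}

// Store order i1, i0, i3, i2: two independent 32-bit rotates by 16 bits.
inline void store_as_two_rotate32(const std::uint16_t *src, std::uint16_t *dst) {
  assert(src != nullptr && dst != nullptr);
  for (int pair = 0; pair < 2; ++pair) {
    std::uint32_t wide;
    std::memcpy(&wide, src + 2 * pair, sizeof(wide));
    wide = swap_halves32(wide);
    std::memcpy(dst + 2 * pair, &wide, sizeof(wide));
  }
}
```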

Closed by commit rG2df46a574387: [DAGCombiner] allow load/store merging if pairs can be rotated into place (authored by spatel). Jul 13 2020, 5:58 AM

This revision was automatically updated to reflect the committed changes.

spatel mentioned this in D86420: [DAGCombiner] allow store merging non-i8 truncated ops. Aug 26 2020, 5:17 AM

Revision Contents

Path                                                   Size
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp          39 lines
llvm/test/CodeGen/AArch64/merge-store-dependency.ll    22 lines
llvm/test/CodeGen/X86/stores-merging.ll                61 lines

Diff 277398

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp


@@ 16,535 lines not shown @@ if (LdBasePtr.getBase().getNode()) {
         LdBasePtr = LdPtr;
       }

       // We found a potential memory operand to merge.
       LoadNodes.push_back(MemOpLink(Ld, LdOffset));
     }

     while (NumConsecutiveStores >= 2 && LoadNodes.size() >= 2) {
-      // If we have load/store pair instructions and we only have two values,
-      // don't bother merging.
-      Align RequiredAlignment;
-      if (LoadNodes.size() == 2 && TLI.hasPairedLoad(MemVT, RequiredAlignment) &&
-          StoreNodes[0].MemNode->getAlign() >= RequiredAlignment) {
-        StoreNodes.erase(StoreNodes.begin(), StoreNodes.begin() + 2);
-        LoadNodes.erase(LoadNodes.begin(), LoadNodes.begin() + 2);
-        break;
-      }
+      Align RequiredAlignment;
+      bool NeedRotate = false;
+      if (LoadNodes.size() == 2) {
+        // If we have load/store pair instructions and we only have two values,
+        // don't bother merging.
+        if (TLI.hasPairedLoad(MemVT, RequiredAlignment) &&
+            StoreNodes[0].MemNode->getAlign() >= RequiredAlignment) {
+          StoreNodes.erase(StoreNodes.begin(), StoreNodes.begin() + 2);
+          LoadNodes.erase(LoadNodes.begin(), LoadNodes.begin() + 2);
+          break;
+        }
+        // If the loads are reversed, see if we can rotate the halves into place.
+        int64_t Offset0 = LoadNodes[0].OffsetFromBase;
+        int64_t Offset1 = LoadNodes[1].OffsetFromBase;
+        EVT PairVT = EVT::getIntegerVT(Context, ElementSizeBytes * 8 * 2);
+        if (Offset0 - Offset1 == ElementSizeBytes &&
+            (hasOperation(ISD::ROTL, PairVT) ||
+             hasOperation(ISD::ROTR, PairVT))) {
+          std::swap(LoadNodes[0], LoadNodes[1]);
+          NeedRotate = true;
+        }
+      }
       LSBaseSDNode *FirstInChain = StoreNodes[0].MemNode;
       unsigned FirstStoreAS = FirstInChain->getAddressSpace();
       unsigned FirstStoreAlign = FirstInChain->getAlignment();
       LoadSDNode *FirstLoad = cast<LoadSDNode>(LoadNodes[0].MemNode);

       // Scan the memory operations on the chain and find the first
       // non-consecutive load memory address. These variables hold the index in
       // the store node array.
@@ 147 lines not shown @@ MachineMemOperand::Flags StMMOFlags = IsNonTemporalStore
                                               ? MachineMemOperand::MONonTemporal
                                               : MachineMemOperand::MONone;

       SDValue NewLoad, NewStore;
       if (UseVectorTy || !DoIntegerTruncate) {
         NewLoad = DAG.getLoad(
             JointMemOpVT, LoadDL, FirstLoad->getChain(), FirstLoad->getBasePtr(),
             FirstLoad->getPointerInfo(), FirstLoadAlign, LdMMOFlags);
+        SDValue StoreOp = NewLoad;
+        if (NeedRotate) {
+          unsigned LoadWidth = ElementSizeBytes * 8 * 2;
+          assert(JointMemOpVT == EVT::getIntegerVT(Context, LoadWidth) &&
+                 "Unexpected type for rotate-able load pair");
+          SDValue RotAmt =
+              DAG.getShiftAmountConstant(LoadWidth / 2, JointMemOpVT, LoadDL);
+          // Target can convert to the identical ROTR if it does not have ROTL.
+          StoreOp = DAG.getNode(ISD::ROTL, LoadDL, JointMemOpVT, NewLoad, RotAmt);
+        }
         NewStore = DAG.getStore(
-            NewStoreChain, StoreDL, NewLoad, FirstInChain->getBasePtr(),
+            NewStoreChain, StoreDL, StoreOp, FirstInChain->getBasePtr(),
             FirstInChain->getPointerInfo(), FirstStoreAlign, StMMOFlags);
       } else { // This must be the truncstore/extload case
         EVT ExtendedTy =
             TLI.getTypeToTransformTo(*DAG.getContext(), JointMemOpVT);
         NewLoad = DAG.getExtLoad(ISD::EXTLOAD, LoadDL, ExtendedTy,
                                  FirstLoad->getChain(), FirstLoad->getBasePtr(),
                                  FirstLoad->getPointerInfo(), JointMemOpVT,
                                  FirstLoadAlign, LdMMOFlags);
@@ 5,384 lines not shown @@
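
The two DAGCombiner.cpp hunks can be modelled outside of LLVM roughly as follows. This is a simplified sketch under stated assumptions, not the patch itself: `LoadRef` is a hypothetical stand-in for the `MemOpLink` entries, the `hasOperation(ISD::ROTL/ROTR, PairVT)` query is reduced to a plain `bool` parameter, and the rotate helpers exist only to show why the "identical ROTR" comment holds for a rotate by half the width.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical stand-in for a MemOpLink entry: which load, and where it reads.
struct LoadRef {
  int id;
  std::int64_t offsetFromBase;
};

// Mirrors the `Offset0 - Offset1 == ElementSizeBytes` test: the load feeding
// the first store sits exactly one element above the load feeding the second.
// On success the pair is swapped into address order and a rotate is required.
inline bool rotateIntoPlace(LoadRef (&loads)[2], std::int64_t elementSizeBytes,
                            bool targetHasRotate) {
  if (loads[0].offsetFromBase - loads[1].offsetFromBase != elementSizeBytes ||
      !targetHasRotate)
    return false;
  std::swap(loads[0], loads[1]);
  return true;
}

// Plain 32-bit rotates; the amount must be strictly between 0 and 32.
inline std::uint32_t rotl32(std::uint32_t v, unsigned n) {
  assert(n > 0 && n < 32);
  return (v << n) | (v >> (32 - n));
}

inline std::uint32_t rotr32(std::uint32_t v, unsigned n) {
  assert(n > 0 && n < 32);
  return (v >> n) | (v << (32 - n));
}
```

With a rotate amount of exactly half the width, `rotl32(v, 16)` and `rotr32(v, 16)` give the same value, so a target that only has one of the two rotates loses nothing.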

llvm/test/CodeGen/AArch64/merge-store-dependency.ll

@@ 89 lines not shown @@ while.end.i:
   %call.i = tail call i8* @foo()
   store i8* %call.i, i8** bitcast (%struct1* @gv1 to i8**), align 8
   br label %exit

 exit:
   ret void
 }

+; TODO: rev16?
+
 define void @rotate16_in_place(i8* %p) {
 ; A53-LABEL: rotate16_in_place:
 ; A53: // %bb.0:
 ; A53-NEXT: ldrb w8, [x0, #1]
 ; A53-NEXT: ldrb w9, [x0]
 ; A53-NEXT: strb w8, [x0]
 ; A53-NEXT: strb w9, [x0, #1]
 ; A53-NEXT: ret
   %p0 = getelementptr i8, i8* %p, i64 0
   %p1 = getelementptr i8, i8* %p, i64 1
   %i0 = load i8, i8* %p0, align 1
   %i1 = load i8, i8* %p1, align 1
   store i8 %i1, i8* %p0, align 1
   store i8 %i0, i8* %p1, align 1
   ret void
 }

+; TODO: rev16?
+
 define void @rotate16(i8* %p, i8* %q) {
 ; A53-LABEL: rotate16:
 ; A53: // %bb.0:
 ; A53-NEXT: ldrb w8, [x0, #1]
 ; A53-NEXT: ldrb w9, [x0]
 ; A53-NEXT: strb w8, [x1]
 ; A53-NEXT: strb w9, [x1, #1]
 ; A53-NEXT: ret
   %p0 = getelementptr i8, i8* %p, i64 0
   %p1 = getelementptr i8, i8* %p, i64 1
   %q0 = getelementptr i8, i8* %q, i64 0
   %q1 = getelementptr i8, i8* %q, i64 1
   %i0 = load i8, i8* %p0, align 1
   %i1 = load i8, i8* %p1, align 1
   store i8 %i1, i8* %q0, align 1
   store i8 %i0, i8* %q1, align 1
   ret void
 }

 define void @rotate32_in_place(i16* %p) {
 ; A53-LABEL: rotate32_in_place:
 ; A53: // %bb.0:
-; A53-NEXT: ldrh w8, [x0, #2]
-; A53-NEXT: ldrh w9, [x0]
-; A53-NEXT: strh w8, [x0]
-; A53-NEXT: strh w9, [x0, #2]
+; A53-NEXT: ldr w8, [x0]
+; A53-NEXT: ror w8, w8, #16
+; A53-NEXT: str w8, [x0]
 ; A53-NEXT: ret
   %p0 = getelementptr i16, i16* %p, i64 0
   %p1 = getelementptr i16, i16* %p, i64 1
   %i0 = load i16, i16* %p0, align 2
   %i1 = load i16, i16* %p1, align 2
   store i16 %i1, i16* %p0, align 2
   store i16 %i0, i16* %p1, align 2
   ret void
 }

 define void @rotate32(i16* %p) {
 ; A53-LABEL: rotate32:
 ; A53: // %bb.0:
-; A53-NEXT: ldrh w8, [x0, #2]
-; A53-NEXT: ldrh w9, [x0]
-; A53-NEXT: strh w8, [x0, #84]
-; A53-NEXT: strh w9, [x0, #86]
+; A53-NEXT: ldr w8, [x0]
+; A53-NEXT: ror w8, w8, #16
+; A53-NEXT: str w8, [x0, #84]
 ; A53-NEXT: ret
   %p0 = getelementptr i16, i16* %p, i64 0
   %p1 = getelementptr i16, i16* %p, i64 1
   %p42 = getelementptr i16, i16* %p, i64 42
   %p43 = getelementptr i16, i16* %p, i64 43
   %i0 = load i16, i16* %p0, align 2
   %i1 = load i16, i16* %p1, align 2
   store i16 %i1, i16* %p42, align 2
   store i16 %i0, i16* %p43, align 2
   ret void
 }

+; Prefer paired memops over rotate.
+
 define void @rotate64_in_place(i32* %p) {
 ; A53-LABEL: rotate64_in_place:
 ; A53: // %bb.0:
 ; A53-NEXT: ldp w9, w8, [x0]
 ; A53-NEXT: stp w8, w9, [x0]
 ; A53-NEXT: ret
   %p0 = getelementptr i32, i32* %p, i64 0
   %p1 = getelementptr i32, i32* %p, i64 1
   %i0 = load i32, i32* %p0, align 4
   %i1 = load i32, i32* %p1, align 4
   store i32 %i1, i32* %p0, align 4
   store i32 %i0, i32* %p1, align 4
   ret void
 }

+; Prefer paired memops over rotate.
+
 define void @rotate64(i32* %p) {
 ; A53-LABEL: rotate64:
 ; A53: // %bb.0:
 ; A53-NEXT: ldp w9, w8, [x0]
 ; A53-NEXT: stp w8, w9, [x0, #8]
 ; A53-NEXT: ret
   %p0 = getelementptr i32, i32* %p, i64 0
   %p1 = getelementptr i32, i32* %p, i64 1
@@ 12 lines not shown @@

llvm/test/CodeGen/X86/stores-merging.ll

@@ 240 lines not shown @@ ; CHECK-NEXT: retq
   %b = bitcast i8* %a to i1*
   store i1 true, i1* %b, align 1
   ret void
 }

 define void @rotate16_in_place(i8* %p) {
 ; CHECK-LABEL: rotate16_in_place:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: movb (%rdi), %al
-; CHECK-NEXT: movb 1(%rdi), %cl
-; CHECK-NEXT: movb %cl, (%rdi)
-; CHECK-NEXT: movb %al, 1(%rdi)
+; CHECK-NEXT: rolw $8, (%rdi)
 ; CHECK-NEXT: retq
   %p0 = getelementptr i8, i8* %p, i64 0
   %p1 = getelementptr i8, i8* %p, i64 1
   %i0 = load i8, i8* %p0, align 1
   %i1 = load i8, i8* %p1, align 1
   store i8 %i1, i8* %p0, align 1
   store i8 %i0, i8* %p1, align 1
   ret void
 }

 define void @rotate16(i8* %p, i8* %q) {
 ; CHECK-LABEL: rotate16:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: movb (%rdi), %al
-; CHECK-NEXT: movb 1(%rdi), %cl
-; CHECK-NEXT: movb %cl, (%rsi)
-; CHECK-NEXT: movb %al, 1(%rsi)
+; CHECK-NEXT: movzwl (%rdi), %eax
+; CHECK-NEXT: rolw $8, %ax
+; CHECK-NEXT: movw %ax, (%rsi)
 ; CHECK-NEXT: retq
   %p0 = getelementptr i8, i8* %p, i64 0
   %p1 = getelementptr i8, i8* %p, i64 1
   %q0 = getelementptr i8, i8* %q, i64 0
   %q1 = getelementptr i8, i8* %q, i64 1
   %i0 = load i8, i8* %p0, align 1
   %i1 = load i8, i8* %p1, align 1
   store i8 %i1, i8* %q0, align 1
   store i8 %i0, i8* %q1, align 1
   ret void
 }

 define void @rotate32_in_place(i16* %p) {
 ; CHECK-LABEL: rotate32_in_place:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: movzwl (%rdi), %eax
-; CHECK-NEXT: movzwl 2(%rdi), %ecx
-; CHECK-NEXT: movw %cx, (%rdi)
-; CHECK-NEXT: movw %ax, 2(%rdi)
+; CHECK-NEXT: roll $16, (%rdi)
 ; CHECK-NEXT: retq
   %p0 = getelementptr i16, i16* %p, i64 0
   %p1 = getelementptr i16, i16* %p, i64 1
   %i0 = load i16, i16* %p0, align 2
   %i1 = load i16, i16* %p1, align 2
   store i16 %i1, i16* %p0, align 2
   store i16 %i0, i16* %p1, align 2
   ret void
 }

 define void @rotate32(i16* %p) {
 ; CHECK-LABEL: rotate32:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: movzwl (%rdi), %eax
-; CHECK-NEXT: movzwl 2(%rdi), %ecx
-; CHECK-NEXT: movw %cx, 84(%rdi)
-; CHECK-NEXT: movw %ax, 86(%rdi)
+; CHECK-NEXT: movl (%rdi), %eax
+; CHECK-NEXT: roll $16, %eax
+; CHECK-NEXT: movl %eax, 84(%rdi)
 ; CHECK-NEXT: retq
   %p0 = getelementptr i16, i16* %p, i64 0
   %p1 = getelementptr i16, i16* %p, i64 1
   %p42 = getelementptr i16, i16* %p, i64 42
   %p43 = getelementptr i16, i16* %p, i64 43
   %i0 = load i16, i16* %p0, align 2
   %i1 = load i16, i16* %p1, align 2
   store i16 %i1, i16* %p42, align 2
   store i16 %i0, i16* %p43, align 2
   ret void
 }

 define void @rotate64_in_place(i32* %p) {
 ; CHECK-LABEL: rotate64_in_place:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: movl (%rdi), %eax
-; CHECK-NEXT: movl 4(%rdi), %ecx
-; CHECK-NEXT: movl %ecx, (%rdi)
-; CHECK-NEXT: movl %eax, 4(%rdi)
+; CHECK-NEXT: rolq $32, (%rdi)
 ; CHECK-NEXT: retq
   %p0 = getelementptr i32, i32* %p, i64 0
   %p1 = getelementptr i32, i32* %p, i64 1
   %i0 = load i32, i32* %p0, align 4
   %i1 = load i32, i32* %p1, align 4
   store i32 %i1, i32* %p0, align 4
   store i32 %i0, i32* %p1, align 4
   ret void
 }

 define void @rotate64(i32* %p) {
 ; CHECK-LABEL: rotate64:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: movl (%rdi), %eax
-; CHECK-NEXT: movl 4(%rdi), %ecx
-; CHECK-NEXT: movl %ecx, 8(%rdi)
-; CHECK-NEXT: movl %eax, 12(%rdi)
+; CHECK-NEXT: movq (%rdi), %rax
+; CHECK-NEXT: rolq $32, %rax
+; CHECK-NEXT: movq %rax, 8(%rdi)
 ; CHECK-NEXT: retq
   %p0 = getelementptr i32, i32* %p, i64 0
   %p1 = getelementptr i32, i32* %p, i64 1
   %p2 = getelementptr i32, i32* %p, i64 2
   %p3 = getelementptr i32, i32* %p, i64 3
   %i0 = load i32, i32* %p0, align 4
   %i1 = load i32, i32* %p1, align 4
   store i32 %i1, i32* %p2, align 4
   store i32 %i0, i32* %p3, align 4
   ret void
 }

 define void @rotate64_iterate(i16* %p) {
 ; CHECK-LABEL: rotate64_iterate:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: movl (%rdi), %eax
-; CHECK-NEXT: movl 4(%rdi), %ecx
-; CHECK-NEXT: movl %ecx, 84(%rdi)
-; CHECK-NEXT: movl %eax, 88(%rdi)
+; CHECK-NEXT: movq (%rdi), %rax
+; CHECK-NEXT: rolq $32, %rax
+; CHECK-NEXT: movq %rax, 84(%rdi)
 ; CHECK-NEXT: retq
   %p0 = getelementptr i16, i16* %p, i64 0
   %p1 = getelementptr i16, i16* %p, i64 1
   %p2 = getelementptr i16, i16* %p, i64 2
   %p3 = getelementptr i16, i16* %p, i64 3
   %p42 = getelementptr i16, i16* %p, i64 42
   %p43 = getelementptr i16, i16* %p, i64 43
   %p44 = getelementptr i16, i16* %p, i64 44
   %p45 = getelementptr i16, i16* %p, i64 45
   %i0 = load i16, i16* %p0, align 2
   %i1 = load i16, i16* %p1, align 2
   %i2 = load i16, i16* %p2, align 2
   %i3 = load i16, i16* %p3, align 2
   store i16 %i2, i16* %p42, align 2
   store i16 %i3, i16* %p43, align 2
   store i16 %i0, i16* %p44, align 2
   store i16 %i1, i16* %p45, align 2
   ret void
 }

+; TODO: recognize this as 2 rotates?
+
 define void @rotate32_consecutive(i16* %p) {
 ; CHECK-LABEL: rotate32_consecutive:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: movzwl (%rdi), %eax
 ; CHECK-NEXT: movzwl 2(%rdi), %ecx
 ; CHECK-NEXT: movzwl 4(%rdi), %edx
 ; CHECK-NEXT: movzwl 6(%rdi), %esi
 ; CHECK-NEXT: movw %cx, 84(%rdi)
@@ 15 lines not shown @@ ; CHECK-NEXT: retq
   %i3 = load i16, i16* %p3, align 2
   store i16 %i1, i16* %p42, align 2
   store i16 %i0, i16* %p43, align 2
   store i16 %i3, i16* %p44, align 2
   store i16 %i2, i16* %p45, align 2
   ret void
 }

+; Same as above, but now the stores are not all consecutive.
+
 define void @rotate32_twice(i16* %p) {
 ; CHECK-LABEL: rotate32_twice:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: movzwl (%rdi), %eax
-; CHECK-NEXT: movzwl 2(%rdi), %ecx
-; CHECK-NEXT: movzwl 4(%rdi), %edx
-; CHECK-NEXT: movzwl 6(%rdi), %esi
-; CHECK-NEXT: movw %cx, 84(%rdi)
-; CHECK-NEXT: movw %ax, 86(%rdi)
-; CHECK-NEXT: movw %si, 108(%rdi)
-; CHECK-NEXT: movw %dx, 110(%rdi)
+; CHECK-NEXT: movl (%rdi), %eax
+; CHECK-NEXT: movl 4(%rdi), %ecx
+; CHECK-NEXT: roll $16, %eax
+; CHECK-NEXT: roll $16, %ecx
+; CHECK-NEXT: movl %eax, 84(%rdi)
+; CHECK-NEXT: movl %ecx, 108(%rdi)
 ; CHECK-NEXT: retq
   %p0 = getelementptr i16, i16* %p, i64 0
   %p1 = getelementptr i16, i16* %p, i64 1
   %p2 = getelementptr i16, i16* %p, i64 2
   %p3 = getelementptr i16, i16* %p, i64 3
   %p42 = getelementptr i16, i16* %p, i64 42
   %p43 = getelementptr i16, i16* %p, i64 43
   %p54 = getelementptr i16, i16* %p, i64 54
@@ 11 lines not shown @@