Details

Reviewers

pengfei
RKSimon
craig.topper
xbolva00

Commits

rG285b8abce483: [x86] limit vector increment fold to allow load folding

Summary

The tests are based on the example from:
https://llvm.org/PR52032

I suspect that the codegen in that report looks worse than it actually is. :)
That is, llvm-mca reports no uop/timing difference on Haswell (and probably other targets) between the pcmpeq sequence and the broadcast sequence with load folding.
The load folding definitely makes the code smaller, so it's good for that at least. So this requires carving a narrow hole in the transform to get just this case without changing others that look good as-is (in other words, the transform still seems good for most examples).

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision. Oct 25 2021, 9:36 AM

Herald added subscribers: hiraditya, mcrosier. Oct 25 2021, 9:36 AM

spatel requested review of this revision. Oct 25 2021, 9:36 AM

Herald added a project: Restricted Project. Oct 25 2021, 9:36 AM

Harbormaster completed remote builds in B130488: Diff 382022. Oct 25 2021, 9:36 AM

lebedev.ri added a subscriber: lebedev.ri. Oct 25 2021, 9:39 AM

lebedev.ri added inline comments.

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
902	What about multi-use load, we usually don't load-fold those?

spatel marked an inline comment as done. Oct 25 2021, 11:53 AM

spatel added inline comments.

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
902	Yes, good point. I really want to use the same logic as in X86ISelLowering's MayFoldLoad, but it's a static function. Let me update that.

Patch updated:

Make mayFoldLoad visible to X86ISelDAGToDAG and use it in the predicate.
Added a test with extra uses of loads, so we can see the result in that case (unchanged / negative test for this patch).

I can make the helper visibility/formatting changes an NFC pre-commit if that looks ok. We have functions like that scattered around, so it's not clear to me if there's a better way.

Harbormaster completed remote builds in B130522: Diff 382078. Oct 25 2021, 11:59 AM

craig.topper added inline comments. Oct 25 2021, 12:12 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
5053 ↗	(On Diff #382078)	While you're in here, can you use `cast` here, since the result of the `dyn_cast` isn't checked for null?
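
For readers unfamiliar with the distinction behind this request: LLVM's `dyn_cast<>` returns null on a type mismatch and is meant to be tested, while `cast<>` asserts that the type is right. The sketch below is a simplified standalone model of that difference, not LLVM's implementation; `Node`, `LoadNode`, `dynCastModel` and `castModel` are hypothetical names invented for illustration.

```cpp
#include <cassert>

// Hypothetical two-class hierarchy with an LLVM-style kind tag.
struct Node {
  enum Kind { K_Load, K_Other };
  explicit Node(Kind K) : TheKind(K) {}
  Kind getKind() const { return TheKind; }

private:
  Kind TheKind;
};

struct LoadNode : Node {
  LoadNode() : Node(K_Load) {}
  bool isSimple() const { return true; }
  // Type test used by both cast models below.
  static bool classof(const Node *N) { return N->getKind() == K_Load; }
};

// Simplified model of dyn_cast: yields nullptr on a type mismatch, so the
// caller is expected to check the result before dereferencing it.
template <typename To> const To *dynCastModel(const Node *N) {
  return To::classof(N) ? static_cast<const To *>(N) : nullptr;
}

// Simplified model of cast: the caller promises the type is right, and the
// promise is enforced by an assertion instead of a null result.
template <typename To> const To *castModel(const Node *N) {
  assert(To::classof(N) && "castModel<To>() argument of incompatible type");
  return static_cast<const To *>(N);
}
```

Under this model, dereferencing an unchecked `dynCastModel` result would be a silent null dereference on a mismatch, whereas `castModel` documents the assumption and fails loudly in builds with assertions enabled.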

SGTM in general.

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
903	lebedev.ri: I'm stuck on this one-use check. Is it really reasonable? We don't really know that the constant is used nearby, do we? There doesn't seem to be a test for it. https://godbolt.org/z/Ea9sad3Ya

spatel added inline comments. Oct 25 2021, 12:21 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
5053 ↗	(On Diff #382078)	Yes - I can do that as a 1-liner. Beyond that, I wasn't sure what cleanup we want to do. Could update this group of 4 "MayFold*" functions together (pull the declarations into the header and X86 namespace even though there's no current external user of the other 3)?

spatel added inline comments. Oct 25 2021, 1:18 PM

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
903	Right, we don't know exactly where the other use of that constant is or even if it's part of another increment op with this check, so it's a (weak) heuristic.

I'll add a test like you've shown. Without that check, we would alter codegen on it as shown below. It's a close call, but I think it's a regression to increase the load uops on that -- even if it is one less macro instruction.

diff --git a/llvm/test/CodeGen/X86/combine-sub.ll b/llvm/test/CodeGen/X86/combine-sub.ll
index a399c5175dd6..5090895c0ab8 100644
--- a/llvm/test/CodeGen/X86/combine-sub.ll
+++ b/llvm/test/CodeGen/X86/combine-sub.ll
@@ -290,9 +290,8 @@ define void @PR52032_oneuse_constant(<8 x i32>* %p) {
 ;
 ; AVX-LABEL: PR52032_oneuse_constant:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vmovdqu (%rdi), %ymm0
-; AVX-NEXT:    vpcmpeqd %ymm1, %ymm1, %ymm1
-; AVX-NEXT:    vpsubd %ymm1, %ymm0, %ymm0
+; AVX-NEXT:    vpbroadcastd {{.*#+}} ymm0 = [1,1,1,1,1,1,1,1]
+; AVX-NEXT:    vpaddd (%rdi), %ymm0, %ymm0
 ; AVX-NEXT:    vmovdqu %ymm0, (%rdi)
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
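
Both sequences in this diff produce the same values: in two's-complement arithmetic, adding a splat of 1 equals subtracting a splat of all-ones (-1), and all-ones is what pcmpeq of a register with itself yields. A minimal scalar sketch of that identity in portable C++ follows; the `Vec8` alias and the helper names are hypothetical stand-ins for a vector register and the instructions, not LLVM code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a <8 x i32> vector register.
using Vec8 = std::array<std::uint32_t, 8>;

// Models vpaddd with a splat-1 constant: X + <1, 1, ...>.
Vec8 addSplatOne(const Vec8 &X) {
  Vec8 R{};
  for (std::size_t I = 0; I < X.size(); ++I)
    R[I] = X[I] + 1u; // unsigned arithmetic wraps, like the hardware lanes
  return R;
}

// Models vpcmpeqd of a register with itself: every lane compares equal,
// so every bit of every lane is set.
Vec8 allOnes() {
  Vec8 R{};
  R.fill(0xFFFFFFFFu);
  return R;
}

// Models vpsubd with the all-ones register: X - <-1, -1, ...>.
Vec8 subAllOnes(const Vec8 &X) {
  const Vec8 M = allOnes();
  Vec8 R{};
  for (std::size_t I = 0; I < X.size(); ++I)
    R[I] = X[I] - M[I];
  return R;
}

// Spot-check helper: fails an assertion if the two forms ever disagree.
void checkEquivalent(const Vec8 &X) {
  assert(addSplatOne(X) == subAllOnes(X));
}
```

Since the results are identical, the choice between the two forms is purely a cost question (constant load versus pcmpeq, folded versus separate load), which is what the heuristic in this review is deciding.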

Patch updated:
Added one-use-of-constant "1" test (currently negative for this patch) with explanatory comment.

Harbormaster completed remote builds in B130543: Diff 382106. Oct 25 2021, 1:29 PM

craig.topper added inline comments. Oct 25 2021, 1:49 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
5042 ↗	(On Diff #382106)	Not related to this patch, but this function is incorrect for SSE which requires alignment or a subtarget feature that disables the alignment check. Are all existing uses scalar or AVX only?

craig.topper added inline comments. Oct 25 2021, 1:51 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
5042 ↗	(On Diff #382106)	We also don't fold non-temporal loads if we have an instruction for it. See useNonTemporalLoad
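
The concerns raised in this thread (single use, SSE alignment, non-temporal loads) can be pictured as one legality check. The sketch below is a simplified standalone model under stated assumptions, not the code of `mayFoldLoad`, `useNonTemporalLoad` or D112545; the structs, field names and the 128-bit/16-byte rule as written here are illustrative assumptions.

```cpp
#include <cassert>

// Hypothetical, simplified description of a vector load.
struct LoadInfo {
  unsigned SizeInBits;   // width of the loaded value
  unsigned AlignInBytes; // known alignment of the address
  bool HasOneUse;        // the loaded value has a single user
  bool IsNonTemporal;    // the load carries a non-temporal hint
};

// Hypothetical, simplified description of the target.
struct TargetInfo {
  bool HasAVX;                // assumed: memory operands need not be aligned
  bool HasSSEUnalignedMem;    // assumed: feature disabling the SSE alignment check
  bool HasNonTemporalVecLoad; // assumed: a dedicated non-temporal load exists
};

// Returns true if, under the assumptions above, folding the load into the
// user instruction's memory operand looks legal and potentially profitable.
bool mayFoldLoadModel(const LoadInfo &L, const TargetInfo &T) {
  // A load with several users would have to be repeated if folded.
  if (!L.HasOneUse)
    return false;
  // Folding would drop the non-temporal hint when an instruction for it exists.
  if (L.IsNonTemporal && T.HasNonTemporalVecLoad)
    return false;
  // Without AVX, a 128-bit memory operand must be 16-byte aligned unless the
  // target explicitly tolerates unaligned SSE memory operands.
  if (!T.HasAVX && !T.HasSSEUnalignedMem && L.SizeInBits == 128 &&
      L.AlignInBytes < 16)
    return false;
  return true;
}

// Example: an 'align 4' 128-bit load with one user, as in the PR52032_2 test.
void checkAlign4Example() {
  const LoadInfo Ld{128, 4, true, false};
  assert(mayFoldLoadModel(Ld, TargetInfo{true, false, false}));   // AVX
  assert(!mayFoldLoadModel(Ld, TargetInfo{false, false, false})); // plain SSE
}
```

The real helper works on SelectionDAG nodes and the X86 subtarget, so treat the field names above as placeholders for whatever queries the actual code uses.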

RKSimon added inline comments. Oct 26 2021, 3:20 AM

llvm/lib/Target/X86/X86ISelLowering.h
915 ↗	(On Diff #382106)	(minor) Add description now that it's exposed

spatel mentioned this in rG2ab0148c140d: [x86] use cast instead of dyn_cast for unchecked usage; NFC. Oct 26 2021, 5:23 AM

spatel mentioned this in D112545: [x86] enhance mayFoldLoad to check alignment. Oct 26 2021, 8:08 AM

spatel added inline comments. Oct 26 2021, 8:12 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
5042 ↗	(On Diff #382106)	Let's try to deal with this first (and fold in the formatting changes to that patch): D112545

spatel mentioned this in rG6c0a2c2804c0: [x86] enhance mayFoldLoad to check alignment. Oct 27 2021, 4:54 AM

Patch updated:
Rebased after 6c0a2c2804c0 and added function definition code comment.

spatel marked an inline comment as done. Oct 27 2021, 6:28 AM

spatel marked 4 inline comments as done.

LGTM - although might be better to pre-commit the change to X86::mayFoldLoad separately from the actual PR52032 fix

Harbormaster completed remote builds in B130938: Diff 382643. Oct 27 2021, 7:43 AM

Any other comments?

Seems fine to me.

RKSimon accepted this revision. Oct 28 2021, 4:35 AM

This revision is now accepted and ready to land. Oct 28 2021, 4:35 AM

This revision was landed with ongoing or failed builds. Oct 29 2021, 12:53 PM

Closed by commit rG285b8abce483: [x86] limit vector increment fold to allow load folding (authored by spatel).

This revision was automatically updated to reflect the committed changes.

spatel mentioned this in rG837518d6a08b: [x86] make mayFold* helpers visible to more files; NFC.

spatel added a commit: rG285b8abce483: [x86] limit vector increment fold to allow load folding.

Diff 383470

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp

[874 preceding lines not shown; nearest context: if (N->getOpcode() == X86ISD::AND && !N->hasAnyUseOfValue(1)) {]
                                     N->getOperand(0), N->getOperand(1));
       --I;
       CurDAG->ReplaceAllUsesOfValueWith(SDValue(N, 0), Res);
       ++I;
       MadeChange = true;
       continue;
     }
 
-    /// Convert vector increment or decrement to sub/add with an all-ones
-    /// constant:
-    /// add X, <1, 1...> --> sub X, <-1, -1...>
-    /// sub X, <1, 1...> --> add X, <-1, -1...>
-    /// The all-ones vector constant can be materialized using a pcmpeq
-    /// instruction that is commonly recognized as an idiom (has no register
-    /// dependency), so that's better/smaller than loading a splat 1 constant.
+    // Convert vector increment or decrement to sub/add with an all-ones
+    // constant:
+    // add X, <1, 1...> --> sub X, <-1, -1...>
+    // sub X, <1, 1...> --> add X, <-1, -1...>
+    // The all-ones vector constant can be materialized using a pcmpeq
+    // instruction that is commonly recognized as an idiom (has no register
+    // dependency), so that's better/smaller than loading a splat 1 constant.
+    //
+    // But don't do this if it would inhibit a potentially profitable load
+    // folding opportunity for the other operand. That only occurs with the
+    // intersection of:
+    // (1) The other operand (op0) is load foldable.
+    // (2) The op is an add (otherwise, we are creating an add and can still
+    //     load fold the other op).
+    // (3) The target has AVX (otherwise, we have a destructive add and can't
+    //     load fold the other op without killing the constant op).
+    // (4) The constant 1 vector has multiple uses (so it is profitable to load
+    //     into a register anyway).
+    auto mayPreventLoadFold = [&]() {
+      return X86::mayFoldLoad(N->getOperand(0), *Subtarget) &&
+             N->getOpcode() == ISD::ADD && Subtarget->hasAVX() &&
+             !N->getOperand(1).hasOneUse();
+    };
     if ((N->getOpcode() == ISD::ADD || N->getOpcode() == ISD::SUB) &&
-        N->getSimpleValueType(0).isVector()) {
+        N->getSimpleValueType(0).isVector() && !mayPreventLoadFold()) {
 
       APInt SplatVal;
       if (X86::isConstantSplat(N->getOperand(1), SplatVal) &&
           SplatVal.isOne()) {
         SDLoc DL(N);
 
         MVT VT = N->getSimpleValueType(0);
         unsigned NumElts = VT.getSizeInBits() / 32;
         SDValue AllOnes =
[5,207 following lines not shown]
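
To make the four-way intersection in the new comment easier to check against the tests, here is a standalone model of the decision. It is only a sketch under simplifying assumptions: `IncrementCandidate` and its fields are hypothetical stand-ins for the SelectionDAG and subtarget queries used by the real lambda.

```cpp
#include <cassert>

// Hypothetical, simplified inputs; the real code queries SelectionDAG nodes
// and the X86 subtarget instead.
struct IncrementCandidate {
  bool IsAdd;                 // true for 'add X, splat(1)', false for 'sub'
  bool Op0IsFoldableLoad;     // stands in for X86::mayFoldLoad(op0, ...)
  bool TargetHasAVX;          // stands in for Subtarget->hasAVX()
  unsigned NumUsesOfSplatOne; // number of users of the constant-1 vector
};

// Mirrors the intersection of conditions (1)-(4) from the patch comment.
bool mayPreventLoadFold(const IncrementCandidate &C) {
  return C.Op0IsFoldableLoad && C.IsAdd && C.TargetHasAVX &&
         C.NumUsesOfSplatOne != 1; // models !hasOneUse()
}

// The all-ones (pcmpeq) rewrite is applied only when it cannot cost a
// load-folding opportunity.
bool shouldUseAllOnesForm(const IncrementCandidate &C) {
  return !mayPreventLoadFold(C);
}

// Expected decisions for the situations covered by the tests in this patch.
void checkPatchScenarios() {
  // PR52032 on AVX: foldable load, 'add', constant shared by two adds.
  assert(!shouldUseAllOnesForm({true, true, true, 2u}));
  // PR52032_oneuse_constant: the constant has a single use, keep pcmpeq.
  assert(shouldUseAllOnesForm({true, true, true, 1u}));
  // SSE: the add is destructive, so the fold is not available anyway.
  assert(shouldUseAllOnesForm({true, true, false, 2u}));
  // PR52032_3: starting from 'sub', the rewrite creates a foldable add.
  assert(shouldUseAllOnesForm({false, true, true, 2u}));
  // PR52032_4: the load has extra uses, so it is not foldable.
  assert(shouldUseAllOnesForm({true, false, true, 2u}));
}
```

Only the first scenario suppresses the rewrite, which matches the "narrow hole" described in the summary.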

llvm/test/CodeGen/X86/combine-sub.ll

[270 preceding lines not shown]
 ; AVX-NEXT:    vpcmpeqd %xmm1, %xmm1, %xmm1
 ; AVX-NEXT:    vpsubd %xmm1, %xmm0, %xmm0
 ; AVX-NEXT:    retq
   %xor = xor <4 x i32> %x, <i32 -1, i32 -1, i32 -1, i32 -1>
   %sub = sub <4 x i32> zeroinitializer, %xor
   ret <4 x i32> %sub
 }
 
+; With AVX, this could use broadcast (an extra load) and
+; load-folded 'add', but currently we favor the virtually
+; free pcmpeq instruction.
+
 define void @PR52032_oneuse_constant(<8 x i32>* %p) {
 ; SSE-LABEL: PR52032_oneuse_constant:
 ; SSE:       # %bb.0:
 ; SSE-NEXT:    movdqu (%rdi), %xmm0
 ; SSE-NEXT:    movdqu 16(%rdi), %xmm1
 ; SSE-NEXT:    pcmpeqd %xmm2, %xmm2
 ; SSE-NEXT:    psubd %xmm2, %xmm1
 ; SSE-NEXT:    psubd %xmm2, %xmm0
[10 lines not shown]
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
   %i3 = load <8 x i32>, <8 x i32>* %p, align 4
   %i4 = add nsw <8 x i32> %i3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
   store <8 x i32> %i4, <8 x i32>* %p, align 4
   ret void
 }
 
+; With AVX, we don't transform 'add' to 'sub' because that prevents load folding.
+; With SSE, we do it because we can't load fold the other op without overwriting the constant op.
+
 define void @PR52032(<8 x i32>* %p) {
 ; SSE-LABEL: PR52032:
 ; SSE:       # %bb.0:
 ; SSE-NEXT:    pcmpeqd %xmm0, %xmm0
 ; SSE-NEXT:    movdqu (%rdi), %xmm1
 ; SSE-NEXT:    movdqu 16(%rdi), %xmm2
 ; SSE-NEXT:    movdqu 32(%rdi), %xmm3
 ; SSE-NEXT:    movdqu 48(%rdi), %xmm4
 ; SSE-NEXT:    psubd %xmm0, %xmm2
 ; SSE-NEXT:    psubd %xmm0, %xmm1
 ; SSE-NEXT:    movdqu %xmm1, (%rdi)
 ; SSE-NEXT:    movdqu %xmm2, 16(%rdi)
 ; SSE-NEXT:    psubd %xmm0, %xmm4
 ; SSE-NEXT:    psubd %xmm0, %xmm3
 ; SSE-NEXT:    movdqu %xmm3, 32(%rdi)
 ; SSE-NEXT:    movdqu %xmm4, 48(%rdi)
 ; SSE-NEXT:    retq
 ;
 ; AVX-LABEL: PR52032:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vpcmpeqd %ymm0, %ymm0, %ymm0
-; AVX-NEXT:    vmovdqu (%rdi), %ymm1
-; AVX-NEXT:    vmovdqu 32(%rdi), %ymm2
-; AVX-NEXT:    vpsubd %ymm0, %ymm1, %ymm1
+; AVX-NEXT:    vpbroadcastd {{.*#+}} ymm0 = [1,1,1,1,1,1,1,1]
+; AVX-NEXT:    vpaddd (%rdi), %ymm0, %ymm1
 ; AVX-NEXT:    vmovdqu %ymm1, (%rdi)
-; AVX-NEXT:    vpsubd %ymm0, %ymm2, %ymm0
+; AVX-NEXT:    vpaddd 32(%rdi), %ymm0, %ymm0
 ; AVX-NEXT:    vmovdqu %ymm0, 32(%rdi)
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
   %i3 = load <8 x i32>, <8 x i32>* %p, align 4
   %i4 = add nsw <8 x i32> %i3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
   store <8 x i32> %i4, <8 x i32>* %p, align 4
   %p2 = getelementptr inbounds <8 x i32>, <8 x i32>* %p, i64 1
   %i8 = load <8 x i32>, <8 x i32>* %p2, align 4
   %i9 = add nsw <8 x i32> %i8, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
   store <8 x i32> %i9, <8 x i32>* %p2, align 4
   ret void
 }
 
+; Same as above, but 128-bit ops:
+; With AVX, we don't transform 'add' to 'sub' because that prevents load folding.
+; With SSE, we do it because we can't load fold the other op without overwriting the constant op.
+
 define void @PR52032_2(<4 x i32>* %p) {
 ; SSE-LABEL: PR52032_2:
 ; SSE:       # %bb.0:
 ; SSE-NEXT:    pcmpeqd %xmm0, %xmm0
 ; SSE-NEXT:    movdqu (%rdi), %xmm1
 ; SSE-NEXT:    movdqu 16(%rdi), %xmm2
 ; SSE-NEXT:    psubd %xmm0, %xmm1
 ; SSE-NEXT:    movdqu %xmm1, (%rdi)
 ; SSE-NEXT:    psubd %xmm0, %xmm2
 ; SSE-NEXT:    movdqu %xmm2, 16(%rdi)
 ; SSE-NEXT:    retq
 ;
 ; AVX-LABEL: PR52032_2:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vpcmpeqd %xmm0, %xmm0, %xmm0
-; AVX-NEXT:    vmovdqu (%rdi), %xmm1
-; AVX-NEXT:    vmovdqu 16(%rdi), %xmm2
-; AVX-NEXT:    vpsubd %xmm0, %xmm1, %xmm1
+; AVX-NEXT:    vpbroadcastd {{.*#+}} xmm0 = [1,1,1,1]
+; AVX-NEXT:    vpaddd (%rdi), %xmm0, %xmm1
 ; AVX-NEXT:    vmovdqu %xmm1, (%rdi)
-; AVX-NEXT:    vpsubd %xmm0, %xmm2, %xmm0
+; AVX-NEXT:    vpaddd 16(%rdi), %xmm0, %xmm0
 ; AVX-NEXT:    vmovdqu %xmm0, 16(%rdi)
 ; AVX-NEXT:    retq
   %i3 = load <4 x i32>, <4 x i32>* %p, align 4
   %i4 = add nsw <4 x i32> %i3, <i32 1, i32 1, i32 1, i32 1>
   store <4 x i32> %i4, <4 x i32>* %p, align 4
   %p2 = getelementptr inbounds <4 x i32>, <4 x i32>* %p, i64 1
   %i8 = load <4 x i32>, <4 x i32>* %p2, align 4
   %i9 = add nsw <4 x i32> %i8, <i32 1, i32 1, i32 1, i32 1>
   store <4 x i32> %i9, <4 x i32>* %p2, align 4
   ret void
 }
 
+; If we are starting with a 'sub', it is always better to do the transform.
+
 define void @PR52032_3(<4 x i32>* %p) {
 ; SSE-LABEL: PR52032_3:
 ; SSE:       # %bb.0:
 ; SSE-NEXT:    pcmpeqd %xmm0, %xmm0
 ; SSE-NEXT:    movdqu (%rdi), %xmm1
 ; SSE-NEXT:    movdqu 16(%rdi), %xmm2
 ; SSE-NEXT:    paddd %xmm0, %xmm1
 ; SSE-NEXT:    movdqu %xmm1, (%rdi)
[14 lines not shown; nearest context: ; AVX-NEXT: retq]
   store <4 x i32> %i4, <4 x i32>* %p, align 4
   %p2 = getelementptr inbounds <4 x i32>, <4 x i32>* %p, i64 1
   %i8 = load <4 x i32>, <4 x i32>* %p2, align 4
   %i9 = sub nsw <4 x i32> %i8, <i32 1, i32 1, i32 1, i32 1>
   store <4 x i32> %i9, <4 x i32>* %p2, align 4
   ret void
 }
 
+; If there's no chance of profitable load folding (because of extra uses), we convert 'add' to 'sub'.
+
 define void @PR52032_4(<4 x i32>* %p, <4 x i32>* %q) {
 ; SSE-LABEL: PR52032_4:
 ; SSE:       # %bb.0:
 ; SSE-NEXT:    movdqu (%rdi), %xmm0
 ; SSE-NEXT:    movdqa %xmm0, (%rsi)
 ; SSE-NEXT:    pcmpeqd %xmm1, %xmm1
 ; SSE-NEXT:    psubd %xmm1, %xmm0
 ; SSE-NEXT:    movdqu %xmm0, (%rdi)
[30 lines not shown]

This is an archive of the discontinued LLVM Phabricator instance.

[x86] limit vector increment fold to allow load folding
ClosedPublic
