This is an archive of the discontinued LLVM Phabricator instance.

[LoopStrengthReduce, x86] don't add cost for a cmp that will be macro-fused (PR35681)
ClosedPublic

Authored by spatel on Jan 26 2018, 3:31 PM.

Details

Summary

I think there would be a small code size win from this change and possibly a slight perf win, but I don't have a representative benchmarking system to test that theory. I figure it's worth posting this patch to get feedback and let others give it a try if they're interested. If you have access to SPEC or other standard benchmarks, I'd be most grateful to know if it helps.

The irony is that AMD Jaguar apparently does not have macro-fusion, so the target I was hoping to help the most is excluded from consideration...

In the motivating case from PR35681, which is represented by the new test in this patch:
https://bugs.llvm.org/show_bug.cgi?id=35681
...there's a 37 -> 31 byte size win for the loop because we eliminate the big base address offsets.
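
To make the mechanism concrete, here is a minimal C++ sketch of the cost-model idea only (illustrative, not the code in this patch; the hook name and cost fields are assumptions): when rating an LSR formula, an explicit compare before the loop-ending branch should not be charged as an extra instruction if the target will macro-fuse the cmp with the branch.

// Minimal sketch of the idea -- not the actual LLVM patch. The parameter
// TargetMacroFusesCmpBr and the cost fields are illustrative names.
struct LoopCost {
  unsigned Insns = 0;     // instructions executed inside the loop body
  unsigned BaseAdds = 0;  // extra base-address arithmetic / large offsets
};

LoopCost rateFormula(bool NeedsSeparateCmp, unsigned BaseAdds,
                     bool TargetMacroFusesCmpBr) {
  LoopCost C;
  C.BaseAdds = BaseAdds;
  // Keeping an explicit cmp only costs a real slot if the CPU does not fuse
  // cmp+jcc into a single op; otherwise treat it as free, which lets LSR
  // prefer small constant offsets plus a cmp over large folded offsets.
  if (NeedsSeparateCmp && !TargetMacroFusesCmpBr)
    ++C.Insns;
  return C;
}

In the real change the decision would come from a target query, so only subtargets that actually fuse cmp+branch get the free compare.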

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision. Jan 26 2018, 3:31 PM
qcolombet accepted this revision. Jan 29 2018, 11:00 AM
qcolombet added a subscriber: qcolombet.

LGTM

Couple of nits on the test case.

test/Transforms/LoopStrengthReduce/X86/macro-fuse-cmp.ll
3 (On Diff #131663)

Could you also make an IR-to-IR test with opt -loop-reduce?

57 (On Diff #131663)

Could you run instnamer on the test case?

This revision is now accepted and ready to land. Jan 29 2018, 11:00 AM

It is not obvious that constants in addresses can hurt performance.
I'll run the patch on the benchmarks I have. However, I believe the performance issue in PR35681 is about complicated addresses, which limit the execution ports available for stores.

Hi,

I agree that removing the lengthy (9-byte) instructions and reducing the size of the loop is good. But on the performance side, I need to run some tests.

spatel updated this revision to Diff 132009. Jan 30 2018, 11:32 AM
spatel marked 2 inline comments as done.

Patch updated (no code changes; improved the new test file based on Quentin's feedback):

  1. Give the IR values real names, so the test is easier to understand.
  2. Add IR output testing, so we see what that difference looks like independent of anything else in the x86 backend.

Thanks @evstupac and @venkataramanan.kumar.llvm for the feedback. I micro-benchmarked the maxArray code in its -O2 unrolled form from PR28343 (see below for the object dumps with address offsets), but as expected, I can't measure any perf difference on Haswell.

It does shrink the loop from 134 bytes to 97 bytes though.
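
For context, the maxArray kernel from PR28343 looks roughly like the following C++ (a reconstruction for illustration; treat the exact signature and trip count as assumptions inferred from the dumps below):

// Approximate maxArray loop from PR28343 (reconstruction; the signature and
// the trip count of 65536 are assumptions). At -O2 it vectorizes and unrolls
// into the movupd/maxpd sequences shown in the dumps below.
void maxArray(double *__restrict x, double *__restrict y) {
  for (int i = 0; i < 65536; i++)
    if (y[i] > x[i])
      x[i] = y[i];
}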

Baseline (doesn't account for macro-fusion, so fewer instructions are considered better):

0000000100001000	movq	$-0x80000, %rax
0000000100001007	nopw	(%rax,%rax)
0000000100001010	movupd	0x80000(%rsi,%rax), %xmm0
0000000100001019	movupd	0x80000(%rdi,%rax), %xmm1
0000000100001022	maxpd	%xmm1, %xmm0
0000000100001026	movupd	0x80010(%rdi,%rax), %xmm1
000000010000102f	movupd	0x80020(%rdi,%rax), %xmm2
0000000100001038	movupd	0x80030(%rdi,%rax), %xmm3
0000000100001041	movupd	%xmm0, 0x80000(%rdi,%rax)
000000010000104a	movupd	0x80010(%rsi,%rax), %xmm0
0000000100001053	maxpd	%xmm1, %xmm0
0000000100001057	movupd	%xmm0, 0x80010(%rdi,%rax)
0000000100001060	movupd	0x80020(%rsi,%rax), %xmm0
0000000100001069	maxpd	%xmm2, %xmm0
000000010000106d	movupd	0x80030(%rsi,%rax), %xmm1
0000000100001076	movupd	%xmm0, 0x80020(%rdi,%rax)
000000010000107f	maxpd	%xmm3, %xmm1
0000000100001083	movupd	%xmm1, 0x80030(%rdi,%rax)
000000010000108c	addq	$0x40, %rax
0000000100001090	jne	0x100001010
0000000100001096	retq

Use macro-fusion in the cost calculation (one extra instruction, but it allows smaller constant offsets):

0000000100001000	xorl	%eax, %eax
0000000100001002	nopw	%cs:(%rax,%rax)
000000010000100c	nopl	(%rax)
0000000100001010	movupd	(%rsi,%rax,8), %xmm0
0000000100001015	movupd	0x10(%rsi,%rax,8), %xmm1
000000010000101b	movupd	(%rdi,%rax,8), %xmm2
0000000100001020	maxpd	%xmm2, %xmm0
0000000100001024	movupd	0x10(%rdi,%rax,8), %xmm2
000000010000102a	maxpd	%xmm2, %xmm1
000000010000102e	movupd	0x20(%rdi,%rax,8), %xmm2
0000000100001034	movupd	0x30(%rdi,%rax,8), %xmm3
000000010000103a	movupd	%xmm0, (%rdi,%rax,8)
000000010000103f	movupd	%xmm1, 0x10(%rdi,%rax,8)
0000000100001045	movupd	0x20(%rsi,%rax,8), %xmm0
000000010000104b	maxpd	%xmm2, %xmm0
000000010000104f	movupd	0x30(%rsi,%rax,8), %xmm1
0000000100001055	maxpd	%xmm3, %xmm1
0000000100001059	movupd	%xmm0, 0x20(%rdi,%rax,8)
000000010000105f	movupd	%xmm1, 0x30(%rdi,%rax,8)
0000000100001065	addq	$0x8, %rax
0000000100001069	cmpq	$0x10000, %rax
000000010000106f	jne	0x100001010
0000000100001071	retq

I ran SPEC2017 (C/C++ benchmarks) on Ryzen with -O2 -fno-unroll-loops. There is no significant change in performance with the patch.

Great - thank you for checking that. If there are no objections, I'll commit this so we get wider testing (or I can wait for more data to make sure there are no regressions).

That will allow more experimentation with other LSR changes (that might have more impact) without worrying about this case.

This revision was automatically updated to reflect the committed changes.