This is an archive of the discontinued LLVM Phabricator instance.

[CGP / PowerPC] avoid multi-block overhead for simple memcmp expansion
Closed · Public

Authored by spatel on Jun 7 2017, 11:08 AM.


Details

Reviewers

efriedma
syzaara
nemanjai
hfinkel
courbet

Commits

rGe7c5041c2ae1: [CGP / PowerPC] avoid multi-block overhead for simple memcmp expansion
rL304987: [CGP / PowerPC] avoid multi-block overhead for simple memcmp expansion

Summary

The test diff for PowerPC is minimal, but for x86, there's a substantial difference because branches are assumed cheap and SDAG can't optimize across blocks. Instead of this:

_cmp_eq8:
	movq	(%rdi), %rax
	cmpq	(%rsi), %rax
	je	LBB23_1
## BB#2:                                ## %res_block
	movl	$1, %ecx
	jmp	LBB23_3
LBB23_1:
	xorl	%ecx, %ecx
LBB23_3:                                ## %endblock
	xorl	%eax, %eax
	testl	%ecx, %ecx
	sete	%al
	retq

We get this:

cmp_eq8:   
	movq	(%rdi), %rcx
	xorl	%eax, %eax
	cmpq	(%rsi), %rcx
	sete	%al
	retq

And that matches the optimal codegen that we get from the current expansion in SelectionDAGBuilder::visitMemCmpCall(). If this looks right, then I just need to confirm that vector-sized expansion will work from here, and we can enable CGP memcmp() expansion for x86. That is, we'll bypass the power-of-2 special cases currently optimized in SDAG because we can lower the IR produced here optimally.
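As a rough source-level illustration of what the summary describes, the single-block case of memcmp(x, y, 8) == 0 reduces to one load per operand and one compare. The sketch below is not code from the patch; the function names are hypothetical and only mirror the cmp_eq8 label in the assembly above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reference behaviour: what the original library call computes.
bool cmp_eq8_reference(const void *x, const void *y) {
  return std::memcmp(x, y, 8) == 0;
}

// Hypothetical source-level model of the single-block expansion:
// one 8-byte load per operand and one compare, with no branch or phi.
bool cmp_eq8(const void *x, const void *y) {
  std::uint64_t a, b;
  std::memcpy(&a, x, sizeof a); // alignment-safe 8-byte load
  std::memcpy(&b, y, sizeof b);
  return a == b;
}

// Usage: both forms agree on equal and unequal buffers.
void check_cmp_eq8() {
  const char p[8] = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'};
  const char q[8] = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'x'};
  assert(cmp_eq8(p, p) == cmp_eq8_reference(p, p));
  assert(cmp_eq8(p, q) == cmp_eq8_reference(p, q));
}
```

Because only equality is needed, byte order does not matter here, which is why this case needs no result block.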

Diff Detail

Repository: rL LLVM

Event Timeline

spatel created this revision. · Jun 7 2017, 11:08 AM

Herald added a subscriber: mcrosier. · Jun 7 2017, 11:08 AM

LGTM, you might want to wait for other comments as I'm new here :)

lib/CodeGen/CodeGenPrepare.cpp
Line 1703 (on Diff #101780): I think the comment should also explain why in this case (only one block) we don't want to abort everything right now and let the SDAG do the lowering (IIUC, something along the lines of "in that case, we still want to do the memcmp expansion here because this code handles vector expansions better").
Line 1729 (on Diff #101780): The debug location is never used; this looks like a remnant of some previous code before refactoring. It looks like the intent was to use that builder in member functions. Now the builder is recreated every time (see e.g. MemCmpExpansion::emitLoadCompareByteBlock, line 1750). Builder should be made a member (maybe in another revision).

courbet accepted this revision. · Jun 7 2017, 12:08 PM

This revision is now accepted and ready to land. · Jun 7 2017, 12:08 PM

In D34005#775445, @courbet wrote:

LGTM, you might want to wait for other comments as I'm new here :)

Thanks for the quick review! I'll give the other reviewers a little more time to comment in case they see any problems/improvements.

lib/CodeGen/CodeGenPrepare.cpp
Line 1703 (on Diff #101780): Sure. I think there will be scalar expansions that are also better handled here if we lift MemCmpNumLoadsPerBlock above '1'. It could have been done in the SDAG, but now that we have this general infrastructure here, I think it's better to keep the memcmp expansion together.
Line 1729 (on Diff #101780): This hasn't changed since the initial commit of D28637 / rL304313, but I agree we should clean it up. I'll make that a follow-up step.
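To illustrate the point about more than one load per block: a zero-equality comparison can combine several load pairs in a single block by XOR-ing each pair and OR-ing the differences, which is what the XorList and OrList names in getCompareLoadPairs suggest. The following is only a sketch under that assumption, for a 16-byte compare with 8-byte loads; load64 and memcmp16_ne are hypothetical names, not functions from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: alignment-safe 8-byte load from p + offset.
static std::uint64_t load64(const unsigned char *p, std::size_t offset) {
  std::uint64_t v;
  std::memcpy(&v, p + offset, sizeof v);
  return v;
}

// Sketch of a 16-byte zero-equality compare done in one block:
// XOR each load pair, OR the differences, test once against zero.
// Returns 1 if the buffers differ and 0 if they are equal.
int memcmp16_ne(const void *x, const void *y) {
  const unsigned char *a = static_cast<const unsigned char *>(x);
  const unsigned char *b = static_cast<const unsigned char *>(y);
  std::uint64_t diff0 = load64(a, 0) ^ load64(b, 0);
  std::uint64_t diff1 = load64(a, 8) ^ load64(b, 8);
  return (diff0 | diff1) != 0;
}

// Usage: the sketch agrees with memcmp(...) != 0.
void check_memcmp16_ne() {
  unsigned char p[16] = {0};
  unsigned char q[16] = {0};
  assert(memcmp16_ne(p, q) == (std::memcmp(p, q, 16) != 0));
  q[9] = 1;
  assert(memcmp16_ne(p, q) == (std::memcmp(p, q, 16) != 0));
}
```

The single test at the end is what keeps the whole comparison in one block: no early exit is taken after the first pair.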

Patch updated - no code changes, but:

Added comment about overlapped responsibility with the DAG for lowering the single-block case.
Rebased to r304974. The PPC test now avoids cmp/isel altogether. I think this was due to either D33718 or D33720. Thanks, @nemanjai!

spatel mentioned this in rL304979: [x86] add tests for memcmp expansion; NFC. · Jun 8 2017, 8:02 AM

Closed by commit rL304987: [CGP / PowerPC] avoid multi-block overhead for simple memcmp expansion (authored by spatel). · Jun 8 2017, 9:53 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/CodeGen/CodeGenPrepare.cpp	67 lines
llvm/trunk/test/CodeGen/PowerPC/memCmpUsedInZeroEqualityComparison.ll	10 lines

Diff 101936

llvm/trunk/lib/CodeGen/CodeGenPrepare.cpp

@@ 1,669 lines hidden @@ void emitLoadCompareBlock(unsigned Index, int LoadSize, int GEPIndex,
                             bool IsLittleEndian);
   Value *getCompareLoadPairs(unsigned Index, unsigned Size,
                              unsigned &NumBytesProcessed, IRBuilder<> &Builder);
   void emitLoadCompareBlockMultipleLoads(unsigned Index, unsigned Size,
                                          unsigned &NumBytesProcessed);
   void emitLoadCompareByteBlock(unsigned Index, int GEPIndex);
   void emitMemCmpResultBlock(bool IsLittleEndian);
   Value *getMemCmpExpansionZeroCase(unsigned Size, bool IsLittleEndian);
+  Value *getMemCmpEqZeroOneBlock(unsigned Size);
   unsigned getLoadSize(unsigned Size);
   unsigned getNumLoads(unsigned Size);

 public:
   MemCmpExpansion(CallInst *CI, uint64_t Size, unsigned MaxLoadSize,
                   unsigned NumLoadsPerBlock);
   Value *getMemCmpExpansion(uint64_t Size, bool IsLittleEndian);
 };

 MemCmpExpansion::ResultBlock::ResultBlock()
     : BB(nullptr), PhiSrc1(nullptr), PhiSrc2(nullptr) {}

 // Initialize the basic block structure required for expansion of memcmp call
 // with given maximum load size and memcmp size parameter.
 // This structure includes:
 // 1. A list of load compare blocks - LoadCmpBlocks.
 // 2. An EndBlock, split from original instruction point, which is the block to
 // return from.
 // 3. ResultBlock, block to branch to for early exit when a
 // LoadCmpBlock finds a difference.
 MemCmpExpansion::MemCmpExpansion(CallInst *CI, uint64_t Size,
                                  unsigned MaxLoadSize, unsigned LoadsPerBlock)
     : CI(CI), MaxLoadSize(MaxLoadSize), NumLoadsPerBlock(LoadsPerBlock) {

-  IRBuilder<> Builder(CI->getContext());
+  // A memcmp with zero-comparison with only one block of load and compare does
+  // not need to set up any extra blocks. This case could be handled in the DAG,
+  // but since we have all of the machinery to flexibly expand any memcpy here,
+  // we choose to handle this case too to avoid fragmented lowering.
+  IsUsedForZeroCmp = isOnlyUsedInZeroEqualityComparison(CI);
+  NumBlocks = calculateNumBlocks(Size);
+  if (!IsUsedForZeroCmp || NumBlocks != 1) {
     BasicBlock *StartBlock = CI->getParent();
     EndBlock = StartBlock->splitBasicBlock(CI, "endblock");
     setupEndBlockPHINodes();
-  IsUsedForZeroCmp = isOnlyUsedInZeroEqualityComparison(CI);
-
-  // Calculate how many load compare blocks are required for an expansion of
-  // given Size.
-  NumBlocks = calculateNumBlocks(Size);
     createResultBlock();

     // If return value of memcmp is not used in a zero equality, we need to
     // calculate which source was larger. The calculation requires the
     // two loaded source values of each load compare block.
     // These will be saved in the phi nodes created by setupResultBlockPHINodes.
     if (!IsUsedForZeroCmp)
       setupResultBlockPHINodes();

     // Create the number of required load compare basic blocks.
     createLoadCmpBlocks();

     // Update the terminator added by splitBasicBlock to branch to the first
     // LoadCmpBlock.
-  Builder.SetCurrentDebugLocation(CI->getDebugLoc());
     StartBlock->getTerminator()->setSuccessor(0, LoadCmpBlocks[0]);
   }
+
+  IRBuilder<> Builder(CI->getContext());
+  Builder.SetCurrentDebugLocation(CI->getDebugLoc());
+}

 void MemCmpExpansion::createLoadCmpBlocks() {
   for (unsigned i = 0; i < NumBlocks; i++) {
     BasicBlock *BB = BasicBlock::Create(CI->getContext(), "loadbb",
                                         EndBlock->getParent(), EndBlock);
     LoadCmpBlocks.push_back(BB);
   }
 }

@@ 68 lines hidden @@ Value *MemCmpExpansion::getCompareLoadPairs(unsigned Index, unsigned Size,
                                             IRBuilder<> &Builder) {
   std::vector<Value *> XorList, OrList;
   Value *Diff;

   unsigned RemainingBytes = Size - NumBytesProcessed;
   unsigned NumLoadsRemaining = getNumLoads(RemainingBytes);
   unsigned NumLoads = std::min(NumLoadsRemaining, NumLoadsPerBlock);

+  // For a single-block expansion, start inserting before the memcmp call.
+  if (LoadCmpBlocks.empty())
+    Builder.SetInsertPoint(CI);
+  else
     Builder.SetInsertPoint(LoadCmpBlocks[Index]);

   Value *Cmp = nullptr;
   for (unsigned i = 0; i < NumLoads; ++i) {
     unsigned LoadSize = getLoadSize(RemainingBytes);
     unsigned GEPIndex = NumBytesProcessed / LoadSize;
     NumBytesProcessed += LoadSize;
     RemainingBytes -= LoadSize;

     Type *LoadSizeType = IntegerType::get(CI->getContext(), LoadSize * 8);
@@ 244 lines hidden @@ Value *MemCmpExpansion::getMemCmpExpansionZeroCase(unsigned Size,
   // handle multiple loads per block.
   for (unsigned i = 0; i < NumBlocks; ++i)
     emitLoadCompareBlockMultipleLoads(i, Size, NumBytesProcessed);

   emitMemCmpResultBlock(IsLittleEndian);
   return PhiRes;
 }

+/// A memcmp expansion that compares equality with 0 and only has one block of
+/// load and compare can bypass the compare, branch, and phi IR that is required
+/// in the general case.
+Value *MemCmpExpansion::getMemCmpEqZeroOneBlock(unsigned Size) {
+  unsigned NumBytesProcessed = 0;
+  IRBuilder<> Builder(CI->getContext());
+  Value *Cmp = getCompareLoadPairs(0, Size, NumBytesProcessed, Builder);
+  return Builder.CreateZExt(Cmp, Type::getInt32Ty(CI->getContext()));
+}
+
 // This function expands the memcmp call into an inline expansion and returns
 // the memcmp result.
 Value *MemCmpExpansion::getMemCmpExpansion(uint64_t Size, bool IsLittleEndian) {
   if (IsUsedForZeroCmp)
-    return getMemCmpExpansionZeroCase(Size, IsLittleEndian);
+    return NumBlocks == 1 ? getMemCmpEqZeroOneBlock(Size) :
+                            getMemCmpExpansionZeroCase(Size, IsLittleEndian);

   // This loop calls emitLoadCompareBlock for comparing Size bytes of the two
   // memcmp sources. It starts with loading using the maximum load size set by
   // the target. It processes any remaining bytes using a load size which is the
   // next smallest power of 2.
   int LoadSize = MaxLoadSize;
   int NumBytesToBeProcessed = Size;
   unsigned Index = 0;
@@ 4,360 lines hidden @@
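
The comment in getMemCmpExpansion describes starting at the target's maximum load size and covering the remainder with successively smaller powers of 2. A minimal standalone sketch of that decomposition follows; splitIntoLoads is a hypothetical helper written for illustration, not part of CodeGenPrepare.cpp, and it assumes MaxLoadSize is a power of two.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not the patch's code): split Size bytes into load
// sizes, starting at MaxLoadSize and halving whenever the current load
// size no longer fits in the bytes that remain.
std::vector<unsigned> splitIntoLoads(unsigned Size, unsigned MaxLoadSize) {
  // Assumption: MaxLoadSize is a non-zero power of two.
  assert(MaxLoadSize != 0 && (MaxLoadSize & (MaxLoadSize - 1)) == 0);
  std::vector<unsigned> Loads;
  unsigned LoadSize = MaxLoadSize;
  while (Size != 0) {
    if (LoadSize <= Size) {
      Loads.push_back(LoadSize); // emit one load of this width
      Size -= LoadSize;
    } else {
      LoadSize /= 2; // fall back to the next smaller power of 2
    }
  }
  return Loads;
}
```

Under these assumptions a 15-byte compare with 8-byte maximum loads splits into loads of 8, 4, 2 and 1 bytes, while a 4-byte compare needs a single 4-byte load, which is the one-block case this patch targets.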

llvm/trunk/test/CodeGen/PowerPC/memCmpUsedInZeroEqualityComparison.ll

@@ 11 lines hidden @@
 @zeroEqualityTest04.buffer1 = private unnamed_addr constant [15 x i32] [i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14], align 4
 @zeroEqualityTest04.buffer2 = private unnamed_addr constant [15 x i32] [i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 13], align 4

 declare signext i32 @memcmp(i8* nocapture, i8* nocapture, i64) local_unnamed_addr #1

 ; Check 4 bytes - requires 1 load for each param.
 define signext i32 @zeroEqualityTest02(i8* %x, i8* %y) {
 ; CHECK-LABEL: zeroEqualityTest02:
-; CHECK: # BB#0: # %loadbb
+; CHECK: # BB#0:
 ; CHECK-NEXT: lwz 3, 0(3)
 ; CHECK-NEXT: lwz 4, 0(4)
-; CHECK-NEXT: li 5, 1
-; CHECK-NEXT: cmplw 3, 4
-; CHECK-NEXT: isel 3, 0, 5, 2
-; CHECK-NEXT: clrldi 3, 3, 32
+; CHECK-NEXT: xor 3, 3, 4
+; CHECK-NEXT: cntlzw 3, 3
+; CHECK-NEXT: srwi 3, 3, 5
+; CHECK-NEXT: xori 3, 3, 1
 ; CHECK-NEXT: blr
   %call = tail call signext i32 @memcmp(i8* %x, i8* %y, i64 4)
   %not.cmp = icmp ne i32 %call, 0
   %. = zext i1 %not.cmp to i32
   ret i32 %.
 }

 ; Check 16 bytes - requires 2 loads for each param (or use vectors?).
@@ 164 lines hidden @@
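
The new PowerPC sequence in zeroEqualityTest02 (xor, cntlzw, srwi, xori) computes the zero-extended "not equal" result without a compare or isel. The step-by-step model below is a sketch for readers unfamiliar with the idiom; clz32 and ne32 are hypothetical names, and the model only covers the 32-bit values loaded by the two lwz instructions.

```cpp
#include <cassert>
#include <cstdint>

// Count leading zeros of a 32-bit value; returns 32 for zero, as cntlzw does.
static unsigned clz32(std::uint32_t v) {
  unsigned n = 0;
  for (std::uint32_t bit = 0x80000000u; bit != 0 && (v & bit) == 0; bit >>= 1)
    ++n;
  return n;
}

// Model of the new CHECK sequence: returns 1 if a != b, otherwise 0.
std::uint32_t ne32(std::uint32_t a, std::uint32_t b) {
  std::uint32_t r = a ^ b; // xor 3, 3, 4  : zero iff the words are equal
  r = clz32(r);            // cntlzw 3, 3  : 32 iff r was zero, else 0..31
  r >>= 5;                 // srwi 3, 3, 5 : 1 iff the count was 32
  r ^= 1;                  // xori 3, 3, 1 : invert, so 1 means "different"
  return r;
}

// Usage: the model matches a plain inequality test.
void check_ne32() {
  assert(ne32(7, 7) == (7u != 7u));
  assert(ne32(7, 8) == (7u != 8u));
}
```

The shift by 5 works because 32 is the only possible count with bit 5 set, so the equal case is isolated without any branch.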