This is an archive of the discontinued LLVM Phabricator instance.

In your test case aren't we merging these stores into a memset to be later lowered to a series of stores (or single 64-bit store) on most targets? The final code generated will be better in this case, because it can merge the 3 stores. In other cases I'm concerned the rather small memset might perturb optimizations that don't know how to deal with memsets. Don't we have a pass that merges stores into wider stores (e.g., SLP vectorizer or the counterpart to the LoadMerge pass or something)?

In your test case aren't we merging these stores into a memset to be later lowered to a series of stores (or single 64-bit store) on most targets? The final code generated will be better in this case, because it can merge the 3 stores.

Yes. Even for a small number of stores, isProfitableToUseMemset() tries to determine whether merging them into a memset, which will be lowered later, reduces the total number of stores. But MaxIntSize must be in bytes for the heuristic to be correct.

In other cases I'm concerned the rather small memset might perturb optimizations that don't know how to deal with memsets.

I'm not sure exactly which pass, but I think it could happen if a pass gives up on its optimization when it encounters a call instruction. If we are concerned about merging a small number of stores into a memset here, we might need to discuss whether doing so is profitable at all.

Don't we have a pass that merges stores into wider stores (e.g., SLP vectorizer or the counterpart to the LoadMerge pass or something)?

For sequential stores, it seems that the narrow stores are combined at the DAG level:

store i16 0, i16* %0
store i16 0, i16* %1

into

STRWui %vreg1, %vreg0, 0;

As far as I can tell, however, this does not handle the test case in this change, where I mixed the store sizes: 2 bytes, 4 bytes, 2 bytes.

mehdi_amini added inline comments. May 13 2016, 8:33 AM

lib/Transforms/Scalar/MemCpyOptimizer.cpp
188 ↗	(On Diff #56938)	Good catch! Looks like a misnamed method, should be `getLargestLegalIntTypeSizeInBits()`

LGTM. Thanks for the clarification, Jun.

My comments weren't in opposition to this patch; I was just commenting on peculiarities of the test case. Feel free to post a patch to rename the function as Mehdi suggests. You might also investigate why the stores aren't being merged.

This revision is now accepted and ready to land. May 13 2016, 8:42 AM

My comments weren't in opposition to this patch; I was just commenting on peculiarities of the test case. Feel free to post a patch to rename the function as Mehdi suggests. You might also investigate why the stores aren't being merged.

Thanks, Chad. I see your point. Following up on what you pointed out, I found a potential improvement in DAG combine, which merges:

store i16 0, i16* %0
store i16 0, i16* %1

into

STRWui

but lowers

store i16 0, i16* %0
store i16 0, i16* %1
store i32 0, i32* %2

into

STRHHui
STRHHui
STRWui

Closed by commit rL269433: [MemCpyOpt] Use MaxIntSize in byte instead of bit (authored by junbuml). May 13 2016, 9:58 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp	2 lines
llvm/trunk/test/Transforms/MemCpyOpt/profitable-memset.ll	20 lines

Diff 57207

llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp

bool MemsetRange::isProfitableToUseMemset(const DataLayout &DL) const {
  // memset will be split into 2 32-bit stores anyway) and doing so can
  // pessimize the llvm optimizer.
  //
  // Since we don't have perfect knowledge here, make some assumptions: assume
  // the maximum GPR width is the same size as the largest legal integer
  // size. If so, check to see whether we will end up actually reducing the
  // number of stores used.
  unsigned Bytes = unsigned(End-Start);
- unsigned MaxIntSize = DL.getLargestLegalIntTypeSize();
+ unsigned MaxIntSize = DL.getLargestLegalIntTypeSize() / 8;
  if (MaxIntSize == 0)
    MaxIntSize = 1;
  unsigned NumPointerStores = Bytes / MaxIntSize;

  // Assume the remaining bytes if any are done a byte at a time.
  unsigned NumByteStores = Bytes % MaxIntSize;

  // If we will reduce the # stores (according to this heuristic), do the

llvm/trunk/test/Transforms/MemCpyOpt/profitable-memset.ll

				; RUN: opt < %s -memcpyopt -S | FileCheck %s

				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"

				; CHECK-LABEL: @foo(
				; CHECK-NOT: store
				; CHECK: call void @llvm.memset.p0i8.i64(i8* %2, i8 0, i64 8, i32 2, i1 false)

				define void @foo(i64* nocapture %P) {
				entry:
				%0 = bitcast i64* %P to i16*
				%arrayidx = getelementptr inbounds i16, i16* %0, i64 1
				%1 = bitcast i16* %arrayidx to i32*
				%arrayidx1 = getelementptr inbounds i16, i16* %0, i64 3
				store i16 0, i16* %0, align 2
				store i32 0, i32* %1, align 4
				store i16 0, i16* %arrayidx1, align 2
				ret void
				}
