Details

Reviewers

kbarton
nemanjai
syzaara
lei
sfertile
jtony
rnk
efriedma
chandlerc
sanjoy

Commits

rG422e80aee2ba: [SROA] enable splitting for non-whole-alloca loads and stores
rL319407: [SROA] enable splitting for non-whole-alloca loads and stores

Summary

Currently, SROA splits loads and stores only when they access the whole alloca.
This patch relaxes that limitation: a load or store may be split if every other load and store to the alloca is either disjoint from it or fully included in it. As long as no other load or store crosses the boundary of the current load/store, the existing splitting implementation works as is.
Whole-alloca loads and stores meet this new condition, so they remain splittable.
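
The relaxed condition can be stated pairwise. The sketch below is only an illustration of the rule in the summary, not the code in SROA.cpp; `ByteRange`, `disjointOrIncluded`, and `staysSplittable` are hypothetical names, and each access is modeled as a half-open byte range within the alloca.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an SROA slice: a half-open byte range [begin, end).
struct ByteRange {
  uint64_t begin;
  uint64_t end;
};

// True if `other` is disjoint from `s` or fully contained in it, i.e. `other`
// does not cross either boundary of `s`.
bool disjointOrIncluded(const ByteRange &s, const ByteRange &other) {
  bool disjoint = other.end <= s.begin || other.begin >= s.end;
  bool included = other.begin >= s.begin && other.end <= s.end;
  return disjoint || included;
}

// Pairwise (quadratic) statement of the relaxed condition: the access at
// `index` stays splittable only if no other access crosses its boundaries.
bool staysSplittable(const std::vector<ByteRange> &accesses, std::size_t index) {
  assert(index < accesses.size());
  for (std::size_t i = 0; i < accesses.size(); ++i)
    if (i != index && !disjointOrIncluded(accesses[index], accesses[i]))
      return false;
  return true;
}
```

For an alloca of type { i64, i32, i32 }, an 8-byte store at [8, 16) together with 4-byte accesses at [8, 12) and [12, 16) satisfies the condition for the 8-byte store, while an additional access at [4, 12) would not.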

Here is a simplified motivating example.

struct record {
    long long a;
    int b;
    int c;
};

int func(struct record r) {
    for (int i = 0; i < r.c; i++)
        r.b++;
    return r.b;
}

When updating r.b (or r.c), LLVM generates redundant instructions on some platforms (such as x86_64 and ppc64); on these platforms, r.b and r.c are packed into one 64-bit GPR when the struct is passed as a function argument.

With this patch, the above example is compiled into only a few instructions with no loop.
Without the patch, SROA introduces an unnecessary loop-carried dependency, and the later optimizers cannot eliminate the loop.
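
To make the packing concrete, here is a rough model of the two forms of the example. It is a sketch under stated assumptions, not compiler output: it assumes a little-endian layout with b in the low 32 bits and c in the high 32 bits (on a big-endian target the halves would presumably be swapped), it uses unsigned arithmetic instead of the original int fields, and all function names are hypothetical.

```cpp
#include <cstdint>

// Assumed little-endian layout: b occupies the low 32 bits, c the high 32 bits.
uint64_t pack(uint32_t b, uint32_t c) {
  return (static_cast<uint64_t>(c) << 32) | b;
}

// Increment b while it is still embedded in the packed 64-bit value.
uint64_t incrementBPacked(uint64_t packed) {
  uint32_t b = static_cast<uint32_t>(packed) + 1;  // extract the low half and add
  return (packed & 0xFFFFFFFF00000000ULL) | b;     // mask out old b, insert new b
}

// func() from the summary written against the packed form: the value carried
// around the loop is the whole 64-bit word, so each iteration masks and ors.
uint32_t funcPacked(uint64_t packed) {
  uint32_t c = static_cast<uint32_t>(packed >> 32);
  for (uint32_t i = 0; i < c; ++i)
    packed = incrementBPacked(packed);
  return static_cast<uint32_t>(packed);
}

// Once b and c are separate 32-bit values, the loop reduces to a closed form.
uint32_t funcSplit(uint32_t b, uint32_t c) {
  return b + c;
}
```

Both functions return the same value for the same inputs; the difference is only in how much work the loop body has to do while b stays embedded in the 64-bit word.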

The path length of std::string's assign method, when called with istreambuf_iterator (example below), is reduced by about 10% on ppc64.

assign((std::istreambuf_iterator<char>(ifs)),
       (std::istreambuf_iterator<char>()));
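
For context, a self-contained version of that call might look like the following. This is a minimal sketch: `readAll` is a hypothetical helper, and any std::istream works in place of the `ifs` file stream used in the fragment above.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Read the remaining contents of a stream into a string using the
// iterator-range overload of std::string::assign.
std::string readAll(std::istream &in) {
  std::string s;
  s.assign((std::istreambuf_iterator<char>(in)),
           (std::istreambuf_iterator<char>()));
  return s;
}
```

Called on a std::istringstream or std::ifstream, it returns everything from the current read position to end of stream.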

Diff Detail

Event Timeline

inouehrs created this revision. May 9 2017, 5:52 AM

rnk added a reviewer: rnk. May 12 2017, 3:35 PM

inouehrs updated this revision to Diff 98944. May 14 2017, 11:06 PM

inouehrs edited the summary of this revision.

efriedma added a subscriber: efriedma. May 15 2017, 1:03 PM

I added a new unit test and modified affected existing unit tests.

rnk added inline comments. May 18 2017, 10:54 AM

lib/Transforms/Scalar/SROA.cpp
4074–4075	Does this have to be n^2 in the number of slices? Can't you leverage the sorting to be more efficient?
test/Transforms/SROA/basictest.ll
1707–1709	These CHECKs have no colon, they aren't firing.

jroelofs added a subscriber: jroelofs. May 18 2017, 11:15 AM

jroelofs added inline comments.

test/Transforms/SROA/basictest.ll
1707–1709	Also, the `NEXT` must be capitalized.

Implemented a linear-time algorithm (instead of one that is quadratic in the number of slices) for finding splittable slices.
Fixed errors in the unit test.
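
The idea behind the single-pass version, as it appears in the SROA.cpp diff further down, can be sketched outside LLVM as follows. This is a simplified model rather than the committed code: it uses std::vector<bool> in place of SmallBitVector, plain byte ranges in place of slices, and hypothetical names throughout. Note that the marking pass is linear in the number of bytes the accesses cover rather than strictly in the number of slices.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an SROA slice: a half-open byte range [begin, end).
struct ByteRange {
  uint64_t begin;
  uint64_t end;
};

// Pass 1: an offset lying strictly inside any access cannot be a split point,
// because splitting there would cut that access in two.
std::vector<bool> computeSplittableOffsets(const std::vector<ByteRange> &accesses,
                                           uint64_t allocaSize) {
  std::vector<bool> splittable(allocaSize + 1, true);
  for (const ByteRange &r : accesses)
    for (uint64_t o = r.begin + 1; o < r.end && o < allocaSize; ++o)
      splittable[o] = false;
  return splittable;
}

// Pass 2: an access keeps its splittable status only if both of its end
// points are still valid split points (or lie past the end of the alloca).
bool keepsSplittable(const ByteRange &r, const std::vector<bool> &splittable,
                     uint64_t allocaSize) {
  bool beginOk = r.begin > allocaSize || splittable[r.begin];
  bool endOk = r.end > allocaSize || splittable[r.end];
  return beginOk && endOk;
}
```

With accesses [8, 16), [8, 12), and [12, 16) in a 16-byte alloca, offset 12 is cleared by the first access, so the 8-byte access keeps its splittable status while the two 4-byte accesses do not.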

inouehrs marked 3 inline comments as done. May 18 2017, 11:37 PM

inouehrs added inline comments.

lib/Transforms/Scalar/SROA.cpp
4074–4075	I changed the algorithm to a linear-time one. I confirmed during the bootstrap test that this algorithm generates the same results as the previous N^2 algorithm.
test/Transforms/SROA/basictest.ll
1707–1709	Thank you so much for pointing this out. Fixed these errors.

inouehrs marked 2 inline comments as done. May 18 2017, 11:38 PM

javed.absar added a subscriber: javed.absar. May 19 2017, 1:14 AM

efriedma added a reviewer: chandlerc. May 19 2017, 12:37 PM

Gentle ping.

This looks OK to me, but I don't have enough knowledge of SROA to accept this.

I'd like to see a response from Chandler, since he was the last one to touch this code.

Some more general performance numbers would also be nice (LLVM testsuite or SPEC); this could impact a lot of code.

The performance changes in SPEC on POWER8 are not significant; they are within the range of run-to-run fluctuation.
I am making additional SPEC runs.

Average of three runs with and without this patch.
A positive number means improvement by the patch.

400.perlbench	-0.11%
401.bzip2	-0.01%
403.gcc		0.64%
429.mcf		-1.14%
445.gobmk	-0.05%
456.hmmer	0.03%
458.sjeng	-0.16%
462.libquantum	0.36%
464.h264ref	-0.04%
471.omnetpp	1.39%
473.astar	0.08%
483.xalancbmk	0.18%
433.milc	1.33%
444.namd	-0.04%
447.dealII	-0.27%
450.soplex	0.20%
453.povray	-0.25%
470.lbm		-0.29%
482.sphinx3	0.30%
GEOMEAN		0.11%
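
As a sanity check on the summary row, a geometric mean over percentage changes can be computed as below. This is only a sketch: it assumes each percentage is a relative change in score (ratio = 1 + p/100), which the comment does not state explicitly, and `geomeanPercent` is a hypothetical name.

```cpp
#include <cmath>
#include <vector>

// Geometric mean of relative changes given in percent, returned in percent.
// Each entry p is treated as the ratio 1 + p/100 (an assumption).
double geomeanPercent(const std::vector<double> &percentChanges) {
  if (percentChanges.empty())
    return 0.0;
  double logSum = 0.0;
  for (double p : percentChanges)
    logSum += std::log1p(p / 100.0);  // log of the ratio 1 + p/100
  return std::expm1(logSum / percentChanges.size()) * 100.0;
}
```

Applied to the 19 per-benchmark rows above, this comes out close to the reported 0.11%.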

I conducted more performance measurements.
Overall, I did not find significant degradations or improvements except for the iterator example above.

tcmalloc (on ppc64le)
This patch enables additional splitting, but there is no visible change in the malloc_bench score.

gipfeli (on ppc64le)
This patch enables additional splitting in the benchmark harness (for istreambuf_iterator) but not in the compressor/decompressor.

snappy (on ppc64le)
This patch enables additional splitting, but there is no visible change in the snappy-unittest score.

SPECCPU2006 (on x86_64)
No additional splitting observed in 12 out of 19 C/C++ benchmarks.
Even for benchmarks with additional splitting, the change in score is within the range of fluctuation.

403.gcc	0.00%
462.libquantum	0.00%
471.omnetpp	-0.42%
473.astar	0.00%
483.xalancbmk	0.20%
447.dealII	0.13%
453.povray	-0.06%
SPECINT (5)	-0.04%
SPECFP  (2)	0.04%
TOTAL   (7)	-0.02%

@chandlerc @efriedma I appreciate any suggestions on this. Thanks!

rebased to the latest tree and ran tests
did minor touchup in comments

@chandlerc Ping?

rebase to the latest tree

We should try to decide what we're doing about this patch.
@inouehrs Does this have enough of an effect on performance with ToT to continue to pursue this, or should it be abandoned?
If we do want to continue with it, it seems that there are no real objections (pending any from @chandlerc), so I imagine we can proceed with the patch. Perhaps add maintainers of some other targets to the review so they can try it on their targets. You've already ensured there are no performance regressions for PPC and X86; perhaps a few other targets can confirm this as well.

I think I understand the new code, and it seems like an improvement. I'd land it and see what breaks.

This revision is now accepted and ready to land. Nov 29 2017, 10:28 AM

Closed by commit rL319407: [SROA] enable splitting for non-whole-alloca loads and stores (authored by inouehrs). Nov 29 2017, 11:45 PM

This revision was automatically updated to reflect the committed changes.

Diff 119463

lib/Transforms/Scalar/SROA.cpp

Show All 24 Lines

#include "llvm/Transforms/Scalar/SROA.h"		#include "llvm/Transforms/Scalar/SROA.h"
#include "llvm/ADT/APInt.h"		#include "llvm/ADT/APInt.h"
#include "llvm/ADT/ArrayRef.h"		#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMap.h"		#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/PointerIntPair.h"		#include "llvm/ADT/PointerIntPair.h"
#include "llvm/ADT/STLExtras.h"		#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetVector.h"		#include "llvm/ADT/SetVector.h"
		#include "llvm/ADT/SmallBitVector.h"
#include "llvm/ADT/SmallPtrSet.h"		#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
#include "llvm/ADT/StringRef.h"		#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/Twine.h"		#include "llvm/ADT/Twine.h"
#include "llvm/ADT/iterator.h"		#include "llvm/ADT/iterator.h"
#include "llvm/ADT/iterator_range.h"		#include "llvm/ADT/iterator_range.h"
#include "llvm/Analysis/AssumptionCache.h"		#include "llvm/Analysis/AssumptionCache.h"
▲ Show 20 Lines • Show All 4,001 Lines • ▼ Show 20 Lines	bool SROA::splitAlloca(AllocaInst &AI, AllocaSlices &AS) {

unsigned NumPartitions = 0;		unsigned NumPartitions = 0;
bool Changed = false;		bool Changed = false;
const DataLayout &DL = AI.getModule()->getDataLayout();		const DataLayout &DL = AI.getModule()->getDataLayout();

// First try to pre-split loads and stores.		// First try to pre-split loads and stores.
Changed |= presplitLoadsAndStores(AI, AS);		Changed |= presplitLoadsAndStores(AI, AS);

// Now that we have identified any pre-splitting opportunities, mark any		// Now that we have identified any pre-splitting opportunities,
// splittable (non-whole-alloca) loads and stores as unsplittable. If we fail		// mark loads and stores unsplittable except for the following case.
// to split these during pre-splitting, we want to force them to be		// We leave a slice splittable if all other slices are disjoint or fully
// rewritten into a partition.		// included in the slice, such as whole-alloca loads and stores.
		// If we fail to split these during pre-splitting, we want to force them
		// to be rewritten into a partition.
bool IsSorted = true;		bool IsSorted = true;

		// If a byte boundary is included in any load or store, a slice starting or
		// ending at the boundary is not splittable.
		unsigned AllocaSize = DL.getTypeAllocSize(AI.getAllocatedType());
		SmallBitVector SplittableOffset(AllocaSize+1, true);
		for (Slice &S : AS)
		for (unsigned O = S.beginOffset() + 1; O < S.endOffset() && O < AllocaSize;
		O++)
		SplittableOffset.reset(O);

for (Slice &S : AS) {		for (Slice &S : AS) {
if (!S.isSplittable())		if (!S.isSplittable())
continue;		continue;
// FIXME: We currently leave whole-alloca splittable loads and stores. This
// used to be the only splittable loads and stores and we need to be		if ((S.beginOffset() > AllocaSize || SplittableOffset[S.beginOffset()]) &&
// confident that the above handling of splittable loads and stores is		(S.endOffset() > AllocaSize || SplittableOffset[S.endOffset()]))
// completely sufficient before we forcibly disable the remaining handling.
if (S.beginOffset() == 0 &&
S.endOffset() >= DL.getTypeAllocSize(AI.getAllocatedType()))
continue;		continue;

if (isa<LoadInst>(S.getUse()->getUser()) ||		if (isa<LoadInst>(S.getUse()->getUser()) ||
isa<StoreInst>(S.getUse()->getUser())) {		isa<StoreInst>(S.getUse()->getUser())) {
S.makeUnsplittable();		S.makeUnsplittable();
IsSorted = false;		IsSorted = false;
}		}
}		}
if (!IsSorted)		if (!IsSorted)
std::sort(AS.begin(), AS.end());		std::sort(AS.begin(), AS.end());
▲ Show 20 Lines • Show All 319 Lines • Show Last 20 Lines

test/DebugInfo/X86/sroasplit-2.ll

	Show All 15 Lines
	; int foo(Outer outer) {			; int foo(Outer outer) {
	; Inner i1 = outer.inner[1];			; Inner i1 = outer.inner[1];
	; return i1.a;			; return i1.a;
	; }			; }
	;			;

	; Verify that SROA creates a variable piece when splitting i1.			; Verify that SROA creates a variable piece when splitting i1.
	; CHECK: call void @llvm.dbg.value(metadata i64 %outer.coerce0, metadata ![[O:[0-9]+]], metadata !DIExpression(DW_OP_LLVM_fragment, 0, 64)),			; CHECK: call void @llvm.dbg.value(metadata i64 %outer.coerce0, metadata ![[O:[0-9]+]], metadata !DIExpression(DW_OP_LLVM_fragment, 0, 64)),
	; CHECK: call void @llvm.dbg.value(metadata i64 %outer.coerce1, metadata ![[O]], metadata !DIExpression(DW_OP_LLVM_fragment, 64, 64)),			; CHECK: call void @llvm.dbg.value(metadata i32 {{.*}}, metadata ![[O]], metadata !DIExpression(DW_OP_LLVM_fragment, 64, 32)),
				; CHECK: call void @llvm.dbg.value(metadata i32 {{.*}}, metadata ![[O]], metadata !DIExpression(DW_OP_LLVM_fragment, 96, 32)),
	; CHECK: call void @llvm.dbg.value({{.*}}, metadata ![[I1:[0-9]+]], metadata !DIExpression(DW_OP_LLVM_fragment, 0, 32)),			; CHECK: call void @llvm.dbg.value({{.*}}, metadata ![[I1:[0-9]+]], metadata !DIExpression(DW_OP_LLVM_fragment, 0, 32)),
	; CHECK-DAG: ![[O]] = !DILocalVariable(name: "outer",{{.*}} line: 10			; CHECK-DAG: ![[O]] = !DILocalVariable(name: "outer",{{.*}} line: 10
	; CHECK-DAG: ![[I1]] = !DILocalVariable(name: "i1",{{.*}} line: 11			; CHECK-DAG: ![[I1]] = !DILocalVariable(name: "i1",{{.*}} line: 11

	; ModuleID = 'sroasplit-2.c'			; ModuleID = 'sroasplit-2.c'
	target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-apple-macosx10.9.0"			target triple = "x86_64-apple-macosx10.9.0"

	▲ Show 20 Lines • Show All 66 Lines • Show Last 20 Lines

test/Transforms/SROA/basictest.ll

Show First 20 Lines • Show All 1,609 Lines • ▼ Show 20 Lines	entry:
%load = load atomic volatile i64, i64* %ptr seq_cst, align 8		%load = load atomic volatile i64, i64* %ptr seq_cst, align 8
ret void		ret void
}		}

define i16 @PR24463() {		define i16 @PR24463() {
; Ensure we can handle a very interesting case where there is an integer-based		; Ensure we can handle a very interesting case where there is an integer-based
; rewrite of the uses of the alloca, but where one of the integers in that is		; rewrite of the uses of the alloca, but where one of the integers in that is
; a sub-integer that requires extraction and extends past the end of the		; a sub-integer that requires extraction and extends past the end of the
; alloca. In this case, we should extract the i8 and then zext it to i16.		; alloca. SROA can split the alloca to avoid shift or trunc.
;		;
; CHECK-LABEL: @PR24463(		; CHECK-LABEL: @PR24463(
; CHECK-NOT: alloca		; CHECK-NOT: alloca
; CHECK: %[[SHIFT:.*]] = lshr i16 0, 8		; CHECK-NOT: trunc
; CHECK: %[[TRUNC:.*]] = trunc i16 %[[SHIFT]] to i8		; CHECK-NOT: lshr
; CHECK: %[[ZEXT:.*]] = zext i8 %[[TRUNC]] to i16		; CHECK: %[[ZEXT:.*]] = zext i8 {{.*}} to i16
; CHECK: ret i16 %[[ZEXT]]		; CHECK: ret i16 %[[ZEXT]]
entry:		entry:
%alloca = alloca [3 x i8]		%alloca = alloca [3 x i8]
%gep1 = getelementptr inbounds [3 x i8], [3 x i8]* %alloca, i64 0, i64 1		%gep1 = getelementptr inbounds [3 x i8], [3 x i8]* %alloca, i64 0, i64 1
%bc1 = bitcast i8* %gep1 to i16*		%bc1 = bitcast i8* %gep1 to i16*
store i16 0, i16* %bc1		store i16 0, i16* %bc1
%gep2 = getelementptr inbounds [3 x i8], [3 x i8]* %alloca, i64 0, i64 2		%gep2 = getelementptr inbounds [3 x i8], [3 x i8]* %alloca, i64 0, i64 2
%bc2 = bitcast i8* %gep2 to i16*		%bc2 = bitcast i8* %gep2 to i16*
▲ Show 20 Lines • Show All 57 Lines • ▼ Show 20 Lines
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
bb1:		bb1:
%e.7.sroa.6.i = alloca i32, align 1		%e.7.sroa.6.i = alloca i32, align 1
%e.7.sroa.6.0.load81.i = load i32, i32* %e.7.sroa.6.i, align 1		%e.7.sroa.6.0.load81.i = load i32, i32* %e.7.sroa.6.i, align 1
%0 = bitcast i32* %e.7.sroa.6.i to i8*		%0 = bitcast i32* %e.7.sroa.6.i to i8*
call void @llvm.lifetime.end.p0i8(i64 2, i8* %0)		call void @llvm.lifetime.end.p0i8(i64 2, i8* %0)
ret void		ret void
}		}

		define void @test28(i64 %v) #0 {
		; SROA should split the first i64 store to avoid additional and/or instructions
		; when storing into i32 fields

		; CHECK-LABEL: @test28(
		; CHECK-NOT: alloca
		; CHECK-NOT: and
		; CHECK-NOT: or
		; CHECK: %[[shift:.*]] = lshr i64 %v, 32
		; CHECK-NEXT: %{{.*}} = trunc i64 %[[shift]] to i32
		; CHECK-NEXT: ret void

		entry:
		%t = alloca { i64, i32, i32 }

		%b = getelementptr { i64, i32, i32 }, { i64, i32, i32 }* %t, i32 0, i32 1
		%0 = bitcast i32* %b to i64*
		store i64 %v, i64* %0

		%1 = load i32, i32* %b
		%c = getelementptr { i64, i32, i32 }, { i64, i32, i32 }* %t, i32 0, i32 2
		store i32 %1, i32* %c
		ret void
		}

test/Transforms/SROA/big-endian.ll

	Show First 20 Lines • Show All 77 Lines • ▼ Show 20 Lines

	; CHECK-NOT: store			; CHECK-NOT: store
	; CHECK-NOT: load			; CHECK-NOT: load

	%a0i16ptr = bitcast i8* %a0ptr to i16*			%a0i16ptr = bitcast i8* %a0ptr to i16*
	store i16 1, i16* %a0i16ptr			store i16 1, i16* %a0i16ptr

	store i8 1, i8* %a2ptr			store i8 1, i8* %a2ptr
	; CHECK: %[[mask1:.*]] = and i40 undef, 4294967295
	; CHECK-NEXT: %[[insert1:.*]] = or i40 %[[mask1]], 4294967296

	%a3i24ptr = bitcast i8* %a3ptr to i24*			%a3i24ptr = bitcast i8* %a3ptr to i24*
	store i24 1, i24* %a3i24ptr			store i24 1, i24* %a3i24ptr
	; CHECK-NEXT: %[[mask2:.*]] = and i40 %[[insert1]], -4294967041
	; CHECK-NEXT: %[[insert2:.*]] = or i40 %[[mask2]], 256

	%a2i40ptr = bitcast i8* %a2ptr to i40*			%a2i40ptr = bitcast i8* %a2ptr to i40*
	store i40 1, i40* %a2i40ptr			store i40 1, i40* %a2i40ptr
	; CHECK-NEXT: %[[ext3:.*]] = zext i40 1 to i56
	; CHECK-NEXT: %[[mask3:.*]] = and i56 undef, -1099511627776			; the alloca is splitted into multiple slices
	; CHECK-NEXT: %[[insert3:.*]] = or i56 %[[mask3]], %[[ext3]]			; Here, i8 1 is for %a[6]
				; CHECK: %[[ext1:.*]] = zext i8 1 to i40
				; CHECK-NEXT: %[[mask1:.*]] = and i40 undef, -256
				; CHECK-NEXT: %[[insert1:.*]] = or i40 %[[mask1]], %[[ext1]]

				; Here, i24 0 is for %a[3] to %a[5]
				; CHECK-NEXT: %[[ext2:.*]] = zext i24 0 to i40
				; CHECK-NEXT: %[[shift2:.*]] = shl i40 %[[ext2]], 8
				; CHECK-NEXT: %[[mask2:.*]] = and i40 %[[insert1]], -4294967041
				; CHECK-NEXT: %[[insert2:.*]] = or i40 %[[mask2]], %[[shift2]]

				; Here, i8 0 is for %a[2]
				; CHECK-NEXT: %[[ext3:.*]] = zext i8 0 to i40
				; CHECK-NEXT: %[[shift3:.*]] = shl i40 %[[ext3]], 32
				; CHECK-NEXT: %[[mask3:.*]] = and i40 %[[insert2]], 4294967295
				; CHECK-NEXT: %[[insert3:.*]] = or i40 %[[mask3]], %[[shift3]]

				; CHECK-NEXT: %[[ext4:.*]] = zext i40 %[[insert3]] to i56
				; CHECK-NEXT: %[[mask4:.*]] = and i56 undef, -1099511627776
				; CHECK-NEXT: %[[insert4:.*]] = or i56 %[[mask4]], %[[ext4]]

	; CHECK-NOT: store			; CHECK-NOT: store
	; CHECK-NOT: load			; CHECK-NOT: load

	%aiptr = bitcast [7 x i8]* %a to i56*			%aiptr = bitcast [7 x i8]* %a to i56*
	%ai = load i56, i56* %aiptr			%ai = load i56, i56* %aiptr
	%ret = zext i56 %ai to i64			%ret = zext i56 %ai to i64
	ret i64 %ret			ret i64 %ret
	; CHECK-NEXT: %[[ext4:.*]] = zext i16 1 to i56			; Here, i16 1 is for %a[0] to %a[1]
	; CHECK-NEXT: %[[shift4:.*]] = shl i56 %[[ext4]], 40			; CHECK-NEXT: %[[ext5:.*]] = zext i16 1 to i56
	; CHECK-NEXT: %[[mask4:.*]] = and i56 %[[insert3]], 1099511627775			; CHECK-NEXT: %[[shift5:.*]] = shl i56 %[[ext5]], 40
	; CHECK-NEXT: %[[insert4:.*]] = or i56 %[[mask4]], %[[shift4]]			; CHECK-NEXT: %[[mask5:.*]] = and i56 %[[insert4]], 1099511627775
	; CHECK-NEXT: %[[ret:.*]] = zext i56 %[[insert4]] to i64			; CHECK-NEXT: %[[insert5:.*]] = or i56 %[[mask5]], %[[shift5]]
				; CHECK-NEXT: %[[ret:.*]] = zext i56 %[[insert5]] to i64
	; CHECK-NEXT: ret i64 %[[ret]]			; CHECK-NEXT: ret i64 %[[ret]]
	}			}

	define i64 @PR14132(i1 %flag) {			define i64 @PR14132(i1 %flag) {
	; CHECK-LABEL: @PR14132(			; CHECK-LABEL: @PR14132(
	; Here we form a PHI-node by promoting the pointer alloca first, and then in			; Here we form a PHI-node by promoting the pointer alloca first, and then in
	; order to promote the other two allocas, we speculate the load of the			; order to promote the other two allocas, we speculate the load of the
	; now-phi-node-pointer. In doing so we end up loading a 64-bit value from an i8			; now-phi-node-pointer. In doing so we end up loading a 64-bit value from an i8
	▲ Show 20 Lines • Show All 117 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[SROA] enable splitting for non-whole-alloca loads and stores
ClosedPublic
