This is an archive of the discontinued LLVM Phabricator instance.

[PATCH][SROA]Also slice the STORE when slicing a LOAD in AllocaSliceRewriter
AbandonedPublic

Authored by HaoLiu on Aug 18 2014, 3:29 AM.

Details

Reviewers
chandlerc
Summary

Hi,

When SROA's AllocaSliceRewriter slices a load whose value is then stored directly, we currently generate redundant IR such as ZEXT, SHL, AND, and OR.

For the example in my patch:

%2 = load i64* %ref.tmp, align 8
store i64 %2, i64* %1, align 4

The load is sliced into two i32 loads, and the two i32 values are then recombined into an i64 with ZEXT, SHL, AND, and OR before being stored. If we instead slice the i64 store into two i32 stores, these extra operations become unnecessary.
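
For reference, here is a minimal self-contained sketch of such an input, using the pre-3.7 typed-pointer syntax used elsewhere in this thread. The exact shape of the test case in the diff may differ; the value names %lo and %hi and the way the two halves of the alloca are addressed are assumptions. Running it through opt (e.g. opt -S -sroa) should show the sliced load being recombined before the i64 store.

%struct.point.two.i32 = type { i32, i32 }

define void @load_store_i64(i32 %a, %struct.point.two.i32* %ptr) {
entry:
  %ref.tmp = alloca i64, align 8
  ; write the two i32 halves of the temporary separately
  ; (%lo/%hi and the alignments are illustrative, not copied from the patch)
  %lo = bitcast i64* %ref.tmp to i32*
  store i32 %a, i32* %lo, align 8
  %hi = getelementptr inbounds i32* %lo, i64 1
  store i32 %a, i32* %hi, align 4
  ; reload the whole temporary as an i64 and store it to the destination
  %1 = bitcast %struct.point.two.i32* %ptr to i64*
  %2 = load i64* %ref.tmp, align 8
  store i64 %2, i64* %1, align 4
  ret void
}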

This patch detects and slices such a STORE when slicing a LOAD.

Please review.

Thanks,
-Hao

Diff Detail

Event Timeline

HaoLiu updated this revision to Diff 12614.Aug 18 2014, 3:29 AM
HaoLiu retitled this revision to [PATCH][SROA]Also slice the STORE when slicing a LOAD in AllocaSliceRewriter.
HaoLiu updated this object.
HaoLiu edited the test plan for this revision. (Show Details)
HaoLiu added a reviewer: chandlerc.

ping...

Thanks,
-Hao

chandlerc edited edge metadata.Aug 25 2014, 6:33 PM

Why is this the correct approach?

In particular, why did the load need to be sliced up but the store didn't?
That doesn't really make sense.

And what poor-code problem is this actually trying to solve?

Hi Chandler,

Sorry that I didn't provide more details.
Take the specific case @load_store_i64 in the patch. There is an i64 alloca, which is written by two i32 STOREs and then read by an i64 LOAD. Currently, SROA splits the i64 LOAD so that the two i32 STOREs and the i64 LOAD can be optimized away.
The problem is that splitting the i64 LOAD introduces additional ZEXT/SHL/AND/OR instructions to handle the remaining i64 uses: they recombine the two i32 values into one i64, which is then stored. So why not store the two i32 values separately, so that these extra instructions can be removed?
I put this patch in SROA for the following three reasons:

  1. It is SROA that introduces these additional instructions.
  2. It is also easier to do this optimization in SROA: if the LOAD can be sliced, the STORE can be sliced as well, so we can simply split the STORE. If we instead keep the extra instructions and the i64 STORE, a later IR or backend optimization would have to analyze both, and we currently have no such optimization.
  3. This patch does the same thing SROA already does for memory copies. For example, if the i64 LOAD and STORE in the test case:

     %1 = bitcast %struct.point.two.i32* %ptr to i64*
     %2 = load i64* %ref.tmp, align 8
     store i64 %2, i64* %1, align 4

     were instead a memory copy:

     %1 = bitcast %struct.point.two.i32* %ptr to i8*
     %2 = bitcast i64* %ref.tmp to i8*
     call void @llvm.memcpy.p0i8.p0i8.i64(i8* %1, i8* %2, i64 8, i32 4, i1 false)

     the memcpy would likewise be sliced into two i32 LOADs and two i32 STOREs (roughly as sketched below).
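
A rough sketch of what that sliced memcpy looks like after the AllocaSliceRewriter runs; the value names (%dst.lo, %dst.hi, %ref.tmp.sroa.0, %ref.tmp.sroa.2) and the alignments are illustrative assumptions, not output copied from the patch:

; destination pointers for the two i32 elements of %ptr
%dst.lo = getelementptr inbounds %struct.point.two.i32* %ptr, i64 0, i32 0
%dst.hi = getelementptr inbounds %struct.point.two.i32* %ptr, i64 0, i32 1
; the two i32 slices of the source temporary are copied individually
%ref.tmp.sroa.0.0.copyload = load i32* %ref.tmp.sroa.0, align 8
store i32 %ref.tmp.sroa.0.0.copyload, i32* %dst.lo, align 4
%ref.tmp.sroa.2.0.copyload = load i32* %ref.tmp.sroa.2, align 4
store i32 %ref.tmp.sroa.2.0.copyload, i32* %dst.hi, align 4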

Running @load_store_i64 through "opt" gives the following code:
opt -S -O3 < input.ll
without the patch:

%1 = bitcast %struct.point.two.i32* %ptr to i64*
%ref.tmp.sroa.2.0.insert.ext = zext i32 %a to i64
%ref.tmp.sroa.2.0.insert.shift = shl nuw i64 %ref.tmp.sroa.2.0.insert.ext, 32
%ref.tmp.sroa.0.0.insert.insert = or i64 %ref.tmp.sroa.2.0.insert.shift, %ref.tmp.sroa.2.0.insert.ext
store i64 %ref.tmp.sroa.0.0.insert.insert, i64* %1, align 4

with the patch:

%ref.tmp.sroa.0.0..sroa_idx = getelementptr inbounds %struct.point.two.i32* %ptr, i64 0, i32 0
%ref.tmp.sroa.2.0..sroa_idx = getelementptr inbounds %struct.point.two.i32* %ptr, i64 0, i32 1
store i32 %a, i32* %ref.tmp.sroa.0.0..sroa_idx, align 4
store i32 %a, i32* %ref.tmp.sroa.2.0..sroa_idx, align 4

The second version is clearly simpler. If we compile both with "llc", we get the following results for AArch64 and X86:
llc -march=aarch64 < input.ll
1st version in AArch64:

ubfx	x8, x0, #0, #32
bfi	x8, x8, #32, #32
str	 x8, [x1]

2nd version in AArch64:

stp	 w0, w0, [x1]

llc < input.ll
1st version in X86:

movl	%edi, %eax
movq	%rax, %rcx
shlq	$32, %rcx
orq	%rax, %rcx
movq	%rcx, (%rsi)

2nd version in X86:

movl	%edi, (%rsi)
movl	%edi, 4(%rsi)

Thanks,
-Hao

Did Hao answer your question, Chandler?

chandlerc requested changes to this revision.Mar 29 2015, 1:25 PM
chandlerc edited edge metadata.

I don't think this patch is correct; the existing IR formulation is the right one.

We have worked very hard to not slice stores, and I think we should continue to do so where tractable. The cases in this test seem quite reasonable to lower as a single store.

If lowering on a particular platform is faster with two stores rather than one, we should do that in the backend. We could even create target-independent code to do this and let each target opt into it. The IR should preserve the wide store, though, as that provides strictly more information to the backend about the set of possible lowering strategies.

This revision now requires changes to proceed.Mar 29 2015, 1:25 PM
HaoLiu abandoned this revision.May 6 2015, 2:21 AM