This is an archive of the discontinued LLVM Phabricator instance.

[PATCH][SROA]Also slice the STORE when slicing a LOAD in AllocaSliceRewriter
AbandonedPublic

Authored by HaoLiu on Aug 18 2014, 3:29 AM.

Details

Reviewers
chandlerc
Summary

Hi,

When SROA's AllocaSliceRewriter slices a load whose value is then stored directly, we currently generate redundant IR such as ZEXT, SHL, AND, and OR.

For the example in my patch:

%2 = load i64* %ref.tmp, align 8
store i64 %2, i64* %1, align 4

The load is sliced into two i32 loads, and the two i32 values are then recombined into an i64 with ZEXT, SHL, AND, and OR before being stored. If we instead slice the i64 store into two i32 stores, these extra operations become unnecessary.
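
For reference, here is a minimal self-contained sketch of such an input, using the pre-3.7 typed-pointer syntax used elsewhere in this thread. The exact shape of the test case in the diff may differ; the value names %lo and %hi and the way the two halves of the alloca are addressed are assumptions. Running it through opt (e.g. opt -S -sroa) should show the sliced load being recombined before the i64 store.

%struct.point.two.i32 = type { i32, i32 }

define void @load_store_i64(i32 %a, %struct.point.two.i32* %ptr) {
entry:
  %ref.tmp = alloca i64, align 8
  ; write the two i32 halves of the temporary separately
  ; (%lo/%hi and the alignments are illustrative, not copied from the patch)
  %lo = bitcast i64* %ref.tmp to i32*
  store i32 %a, i32* %lo, align 8
  %hi = getelementptr inbounds i32* %lo, i64 1
  store i32 %a, i32* %hi, align 4
  ; reload the whole temporary as an i64 and store it to the destination
  %1 = bitcast %struct.point.two.i32* %ptr to i64*
  %2 = load i64* %ref.tmp, align 8
  store i64 %2, i64* %1, align 4
  ret void
}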

This patch detects and slices such a STORE when slicing a LOAD.

Please review.

Thanks,
-Hao

Diff Detail

Event Timeline

HaoLiu updated this revision to Diff 12614.Aug 18 2014, 3:29 AM
HaoLiu retitled this revision to [PATCH][SROA]Also slice the STORE when slicing a LOAD in AllocaSliceRewriter.
HaoLiu updated this object.
HaoLiu edited the test plan for this revision. (Show Details)
HaoLiu added a reviewer: chandlerc.

ping...

Thanks,
-Hao

chandlerc edited edge metadata.Aug 25 2014, 6:33 PM

Why is this the correct approach?

In particular, why did the load need to be sliced up but the store didn't?
That doesn't really make sense.

And what poor-code problem is this actually trying to solve?

Hi Chandler,

Sorry that I didn't provide more details.
Take the specific case @load_store_i64 in the patch. There is an i64 alloca, which is written by two i32 STOREs and then read by an i64 LOAD. Currently, SROA splits the i64 LOAD so that the two i32 STOREs and the i64 LOAD can be optimized away.
The problem is that splitting the i64 LOAD introduces additional ZEXT/SHL/AND/OR instructions to handle the remaining i64 uses: they recombine the two i32 values into one i64, which is then stored. So why not store the two i32 values separately, so that these extra instructions can be removed?
I put this patch in SROA for the following three reasons:

  1. It is SROA that introduces these additional instructions.
  2. It is also easier to do this optimization in SROA: if the LOAD can be sliced, the STORE can be sliced as well, so we can simply split the STORE. If we instead keep the extra instructions and the i64 STORE, a later IR or backend optimization would have to analyze both, and we currently have no such optimization.
  3. This patch does the same thing SROA already does for memory copies. For example, if the i64 LOAD and STORE in the test case:

     %1 = bitcast %struct.point.two.i32* %ptr to i64*
     %2 = load i64* %ref.tmp, align 8
     store i64 %2, i64* %1, align 4

     were instead a memory copy:

     %1 = bitcast %struct.point.two.i32* %ptr to i8*
     %2 = bitcast i64* %ref.tmp to i8*
     call void @llvm.memcpy.p0i8.p0i8.i64(i8* %1, i8* %2, i64 8, i32 4, i1 false)

     the memcpy would likewise be sliced into two i32 LOADs and two i32 STOREs (roughly as sketched below).
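
A rough sketch of what that sliced memcpy looks like after the AllocaSliceRewriter runs; the value names (%dst.lo, %dst.hi, %ref.tmp.sroa.0, %ref.tmp.sroa.2) and the alignments are illustrative assumptions, not output copied from the patch:

; destination pointers for the two i32 elements of %ptr
%dst.lo = getelementptr inbounds %struct.point.two.i32* %ptr, i64 0, i32 0
%dst.hi = getelementptr inbounds %struct.point.two.i32* %ptr, i64 0, i32 1
; the two i32 slices of the source temporary are copied individually
%ref.tmp.sroa.0.0.copyload = load i32* %ref.tmp.sroa.0, align 8
store i32 %ref.tmp.sroa.0.0.copyload, i32* %dst.lo, align 4
%ref.tmp.sroa.2.0.copyload = load i32* %ref.tmp.sroa.2, align 4
store i32 %ref.tmp.sroa.2.0.copyload, i32* %dst.hi, align 4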

Running @load_store_i64 through "opt" gives the following code:
opt -S -O3 < input.ll
without the patch:

%1 = bitcast %struct.point.two.i32* %ptr to i64*
%ref.tmp.sroa.2.0.insert.ext = zext i32 %a to i64
%ref.tmp.sroa.2.0.insert.shift = shl nuw i64 %ref.tmp.sroa.2.0.insert.ext, 32
%ref.tmp.sroa.0.0.insert.insert = or i64 %ref.tmp.sroa.2.0.insert.shift, %ref.tmp.sroa.2.0.insert.ext
store i64 %ref.tmp.sroa.0.0.insert.insert, i64* %1, align 4

with the patch:

%ref.tmp.sroa.0.0..sroa_idx = getelementptr inbounds %struct.point.two.i32* %ptr, i64 0, i32 0
%ref.tmp.sroa.2.0..sroa_idx = getelementptr inbounds %struct.point.two.i32* %ptr, i64 0, i32 1
store i32 %a, i32* %ref.tmp.sroa.0.0..sroa_idx, align 4
store i32 %a, i32* %ref.tmp.sroa.2.0..sroa_idx, align 4

The second version is clearly simpler. If we compile both with "llc", we get the following results for AArch64 and X86:
llc -march=aarch64 < input.ll
1st version in AArch64:

ubfx	x8, x0, #0, #32
bfi	x8, x8, #32, #32
str	 x8, [x1]

2nd version in AArch64:

stp	 w0, w0, [x1]

llc < input.ll
1st version in X86:

movl	%edi, %eax
movq	%rax, %rcx
shlq	$32, %rcx
orq	%rax, %rcx
movq	%rcx, (%rsi)

2nd version in X86:

movl	%edi, (%rsi)
movl	%edi, 4(%rsi)

Thanks,
-Hao

Did Hao answer your question, Chandler?

chandlerc requested changes to this revision.Mar 29 2015, 1:25 PM
chandlerc edited edge metadata.

I don't think this patch is correct; the existing IR formulation is the right one.

We have worked very hard to not slice stores, and I think we should continue to do so where tractable. The cases in this test seem quite reasonable to lower as a single store.

If lowering on a particular platform is faster with two stores rather than one, we should do that in the backend. We could even create target-independent code to do this and let each target opt into it. The IR should preserve the wide store, though, as that provides strictly more information to the backend about the set of possible lowering strategies.

This revision now requires changes to proceed.Mar 29 2015, 1:25 PM
HaoLiu abandoned this revision.May 6 2015, 2:21 AM