This is an archive of the discontinued LLVM Phabricator instance.

Fix SROA for intrinsics
AbandonedPublic

Authored by apazos on Jan 28 2015, 5:46 PM.

Details

Summary

For intrinsics which do generate lifetime start and end evaluate all conditions before deciding whether to slice vector loads or not.

Diff Detail

Event Timeline

mgrang updated this revision to Diff 18935.Jan 28 2015, 5:46 PM
mgrang retitled this revision from to Fix SROA for intrinsics.
mgrang updated this object.
mgrang edited the test plan for this revision. (Show Details)
mgrang added reviewers: chandlerc, resistor, mcrosier, apazos.
mgrang added a subscriber: Unknown Object (MLST).
mgrang updated this revision to Diff 19017.Jan 29 2015, 5:54 PM

Added "target datalayout" and "target triple" to the test case .ll file

mgrang updated this revision to Diff 19421.Feb 5 2015, 11:29 AM
mcrosier edited edge metadata.Feb 5 2015, 11:45 AM

Mandeep,
I'll commit this shortly once I have verified the test case.

Chad

mgrang updated this revision to Diff 19432.Feb 5 2015, 2:12 PM
mgrang edited edge metadata.
mcrosier requested changes to this revision.Mar 2 2015, 11:58 AM
mcrosier edited edge metadata.

Had an offline discussion with Mandeep about this change. This patch needs additional investigation.

This revision now requires changes to proceed.Mar 2 2015, 11:58 AM
apazos commandeered this revision.Mar 12 2015, 2:12 PM
apazos edited reviewers, added: mgrang; removed: apazos.

I took a closer look at the degradation caused by Owen's patch on AArch64.

With Owen's patch SROA promotes more allocas to vector values and generates a lot of scattered vector insert element instructions. But the backend is not able to optimize those scattered vector insert element instructions when there are extension operations in between the load and the insert instruction. It ends up executing a lot more vector insert instructions degrading performance.

Here is a simple example:

x = ld
y = ld
y = insert x v1, 1
z = insert y v1, 5

Generates:
ld1 { v0.b }[1], [x0]
ld1 { v0.b }[5], [x1]

But
x = ld
ex = ext x
y = ld
ey = ex y
z = insert ex v1, 1
k = insert ey v1, 5

Generates:

ldrb     w8, [x0]
ldrb     w9, [x1]
ins    v0.h[1], w8
ins    v0.h[5], w9

Better code would be:

ld1    { v0.b }[1], [x0]
ld1    { v0.b }[5], [x1]
ushll    v0.8h, v0.8b, #0

Even though it is SROA who is generating the vector insert instructions (b.t.w, same issue with vecttor extract instructions), I do not think we should fix it there.

Chandler , what do you think? Should we try to generate better code from SROA?

In my opinion we should do an IR optimization (Instr Combine or even SLP vectorizer?) to allow the backend to generate better machine code. Here is what the transformation would look like:

; Problem: The difference in element size prevents optimized code from being generated
define <8 x i16> @test_ins4(i8* %arrayidx1, i8* %arrayidx2) {

%1 = load i8* %arrayidx1
%conv1 = zext i8 %1 to i16
%2 = load i8* %arrayidx2
%conv2 = zext i8 %2 to i16
%x = insertelement <8 x i16> undef, i16 %conv1, i32 1
%y = insertelement <8 x i16> undef, i16 %conv2, i32 5
%z = add <8 x i16> %x, %y
ret <8 x i16> %z

}

; Solution: Transforming the IR to eliminate the difference in element size allowing us to generate optimized code
define <8 x i16> @test_ins5(i8* %arrayidx1, i8* %arrayidx2) {

%1 = load i8* %arrayidx1
%conv1 = zext i8 %1 to i16
%2 = load i8* %arrayidx2
%conv2 = zext i8 %2 to i16
%x = insertelement <8 x i8> undef, i8 %1, i32 1
%y = insertelement <8 x i8> %x, i8 %2, i32 5
%z = zext <8 x i8> %y to <8 x i16>
ret <8 x i16> %z

}

With all of the above I think we can close this revision. We do not nee to change Owen's patch (tough the logic is quite confusing in that function).

apazos abandoned this revision.Mar 12 2015, 2:16 PM