For intrinsics which do generate lifetime start and end evaluate all conditions before deciding whether to slice vector loads or not.
Diff Detail
Event Timeline
Had an offline discussion with Mandeep about this change. This patch needs additional investigation.
I took a closer look at the degradation caused by Owen's patch on AArch64.
With Owen's patch SROA promotes more allocas to vector values and generates a lot of scattered vector insert element instructions. But the backend is not able to optimize those scattered vector insert element instructions when there are extension operations in between the load and the insert instruction. It ends up executing a lot more vector insert instructions degrading performance.
Here is a simple example:
x = ld
y = ld
y = insert x v1, 1
z = insert y v1, 5
Generates:
ld1 { v0.b }[1], [x0]
ld1 { v0.b }[5], [x1]
But
x = ld
ex = ext x
y = ld
ey = ex y
z = insert ex v1, 1
k = insert ey v1, 5
Generates:
ldrb w8, [x0] ldrb w9, [x1] ins v0.h[1], w8 ins v0.h[5], w9
Better code would be:
ld1 { v0.b }[1], [x0] ld1 { v0.b }[5], [x1] ushll v0.8h, v0.8b, #0
Even though it is SROA who is generating the vector insert instructions (b.t.w, same issue with vecttor extract instructions), I do not think we should fix it there.
Chandler , what do you think? Should we try to generate better code from SROA?
In my opinion we should do an IR optimization (Instr Combine or even SLP vectorizer?) to allow the backend to generate better machine code. Here is what the transformation would look like:
; Problem: The difference in element size prevents optimized code from being generated
define <8 x i16> @test_ins4(i8* %arrayidx1, i8* %arrayidx2) {
%1 = load i8* %arrayidx1 %conv1 = zext i8 %1 to i16 %2 = load i8* %arrayidx2 %conv2 = zext i8 %2 to i16 %x = insertelement <8 x i16> undef, i16 %conv1, i32 1 %y = insertelement <8 x i16> undef, i16 %conv2, i32 5 %z = add <8 x i16> %x, %y ret <8 x i16> %z
}
; Solution: Transforming the IR to eliminate the difference in element size allowing us to generate optimized code
define <8 x i16> @test_ins5(i8* %arrayidx1, i8* %arrayidx2) {
%1 = load i8* %arrayidx1 %conv1 = zext i8 %1 to i16 %2 = load i8* %arrayidx2 %conv2 = zext i8 %2 to i16 %x = insertelement <8 x i8> undef, i8 %1, i32 1 %y = insertelement <8 x i8> %x, i8 %2, i32 5 %z = zext <8 x i8> %y to <8 x i16> ret <8 x i16> %z
}
With all of the above I think we can close this revision. We do not nee to change Owen's patch (tough the logic is quite confusing in that function).