Apart from some minor improvements in SystemZTargetTransformInfo, this patch deals with the loop vectorizer's handling of addressing:
The loop vectorizer usually vectorizes any instruction it can and then extracts the elements for a scalarized use.
This is not good for addressing when there is no support for vector gather/scatter, at least not on SystemZ. On SystemZ, all elements containing addresses must be extracted into address registers (GRs). Since this extraction is not free, it is better to have the address in a suitable register to begin with. Forcing address arithmetic instructions and loads of addresses to be scalar after vectorization has two benefits:
- No need to extract the register
- LSR optimizations trigger (LSR isn't handling vector addresses currently)
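To illustrate the idea of keeping address computations scalar, here is a minimal sketch on a toy instruction model. It is not the code in the patch: `Inst`, `Kind`, and `collectAddressComputations` are hypothetical names, and the rule used (walk backwards from the pointer operand of every load/store and stop at PHIs) is an assumption about how such a set could be collected.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Toy instruction model; hypothetical, not LLVM's Instruction class.
enum class Kind { Phi, Arith, Load, Store };

struct Inst {
  Kind K;
  std::vector<std::size_t> Operands; // indices of the defining instructions
  int PtrOperand;                    // position of the address in Operands, -1 if none
};

// Collect the instructions that only exist to compute an address, i.e.
// everything reachable backwards from the pointer operand of a load or
// store. PHIs are left alone (assumption: the induction variable is
// handled separately).
std::set<std::size_t> collectAddressComputations(const std::vector<Inst> &Body) {
  std::set<std::size_t> Scalar;
  std::vector<std::size_t> Worklist;

  // Seed the worklist with the address operand of every memory access.
  for (const Inst &I : Body)
    if ((I.K == Kind::Load || I.K == Kind::Store) && I.PtrOperand >= 0)
      Worklist.push_back(I.Operands[static_cast<std::size_t>(I.PtrOperand)]);

  while (!Worklist.empty()) {
    std::size_t Idx = Worklist.back();
    Worklist.pop_back();
    const Inst &Def = Body[Idx];
    // Skip PHIs and anything already visited.
    if (Def.K == Kind::Phi || !Scalar.insert(Idx).second)
      continue;
    // Address arithmetic and loads of addresses: keep walking their operands.
    for (std::size_t Op : Def.Operands)
      Worklist.push_back(Op);
  }
  return Scalar;
}
```

For a body shaped like the first example loop (a shift of the induction variable feeding two GEPs, one used by a load and one by a store), this marks the shift and both GEPs as scalar, while the loaded value that is stored stays a candidate for vectorization.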
Preliminary benchmark results show nice improvements on SystemZ with this new behaviour.
I was not sure what other targets might think of this, so I added a new TTI hook, 'prefersVectorizedAddressing()', which defaults to true. It should be fairly straightforward for another target to extend this in the future so that any address used for gather/scatter is vectorized, while any other scalar pointer gets scalarized address computations. If that is desired, the new hook could be removed again.
(I have done the benchmarking with the SystemZTTI improvements included, and I hope it's not confusing to include those files in the patch.)
As an example, the final output for this loop on trunk shows both shifts and extracts, which are completely eliminated with the patch:
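As a rough illustration of the hook's shape, the sketch below uses stand-in classes rather than the real TTI interface. Only the hook name 'prefersVectorizedAddressing()' and its default of true come from this description; `TargetCostModelBase`, `SystemZLikeCostModel`, and `shouldScalarizeAddressing` are hypothetical names, and the assumption is that a target without gather/scatter overrides the hook to return false.

```cpp
// Hypothetical stand-in for the TTI interface; not the real LLVM classes.
struct TargetCostModelBase {
  virtual ~TargetCostModelBase() = default;
  // Default: keep the existing behaviour and vectorize address computations.
  virtual bool prefersVectorizedAddressing() const { return true; }
};

// Assumed override for a target where addresses must live in scalar
// registers, so vectorized addresses would have to be extracted again.
struct SystemZLikeCostModel : TargetCostModelBase {
  bool prefersVectorizedAddressing() const override { return false; }
};

// How a vectorizer-side client might consult the hook (hypothetical helper).
inline bool shouldScalarizeAddressing(const TargetCostModelBase &TTI) {
  return !TTI.prefersVectorizedAddressing();
}
```

With a default of true, targets that do not override the hook see no change in behaviour, which is the reason for making the new behaviour opt-in.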
Loop before vectorize pass:
for.body144.us:                                   ; preds = %for.cond142.preheader.us, %for.body144.us
  %indvars.iv227 = phi i64 [ 0, %for.cond142.preheader.us ], [ %indvars.iv.next228, %for.body144.us ]
  %47 = shl nsw i64 %indvars.iv227, 1
  %arrayidx147.us = getelementptr inbounds double, double* %colB138.0191.us, i64 %47
  %48 = bitcast double* %arrayidx147.us to i64*
  %49 = load i64, i64* %48, align 8, !tbaa !13
  %arrayidx150.us = getelementptr inbounds double, double* %colA137.0190.us, i64 %47
  %50 = bitcast double* %arrayidx150.us to i64*
  store i64 %49, i64* %50, align 8, !tbaa !13
  %51 = or i64 %47, 1
  %arrayidx154.us = getelementptr inbounds double, double* %colB138.0191.us, i64 %51
  %52 = bitcast double* %arrayidx154.us to i64*
  %53 = load i64, i64* %52, align 8, !tbaa !13
  %arrayidx158.us = getelementptr inbounds double, double* %colA137.0190.us, i64 %51
  %54 = bitcast double* %arrayidx158.us to i64*
  store i64 %53, i64* %54, align 8, !tbaa !13
  %indvars.iv.next228 = add nuw nsw i64 %indvars.iv227, 1
  %cmp143.us = icmp slt i64 %indvars.iv.next228, %46
  br i1 %cmp143.us, label %for.body144.us, label %for.cond142.for.end161_crit_edge.us

Trunk:
BB#42: derived from LLVM BB %vector.body263
    Live Ins: %V0 %V1 %V2 %V3 %R4D %R5D %R6D %R7D %R8D %R9D %R10D %R11D %R12D %R13D %R14D %R3L
    Predecessors according to CFG: BB#41 BB#42
        %V4<def> = VESLG %V3, %noreg, 1
        %R0D<def> = VLGVG %V4, %noreg, 0
        %R1D<def> = SLLG %R0D<kill>, %noreg, 3
        %V5<def> = VL %R8D, 0, %R1D; mem:LD16[%109](align=8)(tbaa=!15)(alias.scope=!135)
        VSTEG %V5, %R7D, 0, %R1D, 0; mem:ST8[%113](tbaa=!15)(alias.scope=!138)(noalias=!135)
        %V6<def> = VL %R8D, 16, %R1D<kill>; mem:LD16[%109+16](align=8)(tbaa=!15)(alias.scope=!135)
        %R0D<def> = VLGVG %V4, %noreg, 1
        %V4<def> = VO %V4<kill>, %V1
        %R1D<def> = SLLG %R0D<kill>, %noreg, 3
        %R0D<def> = VLGVG %V4, %noreg, 0
        %V3<def> = VAG %V3<kill>, %V2
        VSTEG %V6, %R7D, 0, %R1D<kill>, 0; mem:ST8[%114](tbaa=!15)(alias.scope=!138)(noalias=!135)
        %R1D<def> = SLLG %R0D<kill>, %noreg, 3
        %R0D<def> = VLGVG %V4<kill>, %noreg, 1
        VSTEG %V5<kill>, %R7D, 0, %R1D<kill>, 1; mem:ST8[%122](tbaa=!15)(alias.scope=!138)(noalias=!135)
        %R1D<def> = SLLG %R0D<kill>, %noreg, 3
        %R6D<def,tied1> = AGHI %R6D<tied0>, -2, %CC<imp-def>
        VSTEG %V6<kill>, %R7D, 0, %R1D<kill>, 1; mem:ST8[%123](tbaa=!15)(alias.scope=!138)(noalias=!135)
        BRC 15, 7, <BB#42>, %CC<imp-use,kill>
    Successors according to CFG: BB#43(0x04000000 / 0x80000000 = 3.12%) BB#42(0x7c000000 / 0x80000000 = 96.88%)

Dev:
BB#42: derived from LLVM BB %vector.body263
    Live Ins: %R0D %R1D %R4D %R5D %R7D %R8D %R9D %R10D %R11D %R12D %R13D %R14D %R3L
    Predecessors according to CFG: BB#41 BB#42
        %V0<def> = VL %R8D, 0, %R1D; mem:LD16[%109](align=8)(tbaa=!15)(alias.scope=!135)
        VST %V0<kill>, %R7D, 0, %R1D; mem:ST16[%112](align=8)
        %V0<def> = VL %R8D, 16, %R1D; mem:LD16[%109+16](align=8)(tbaa=!15)(alias.scope=!135)
        VST %V0<kill>, %R7D, 16, %R1D; mem:ST16[%115](align=8)
        %R0D<def,tied1> = AGHI %R0D<tied0>, -2, %CC<imp-def>
        %R1D<def> = LA %R1D<kill>, 32, %noreg
        BRC 15, 7, <BB#42>, %CC<imp-use,kill>
    Successors according to CFG: BB#43(0x04000000 / 0x80000000 = 3.12%) BB#42(0x7c000000 / 0x80000000 = 96.88%)
Another example: test/Transforms/LoopVectorize/bsd_regex.ll
trunk:
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.ind = phi <2 x i64> [ <i64 0, i64 1>, %vector.ph ], [ %vec.ind.next, %vector.body ]
  %step.add = add <2 x i64> %vec.ind, <i64 2, i64 2>
  %0 = shl nsw <2 x i64> %vec.ind, <i64 2, i64 2>
  %1 = shl nsw <2 x i64> %step.add, <i64 2, i64 2>
  %2 = extractelement <2 x i64> %0, i32 0
  %3 = getelementptr inbounds i32, i32* %A, i64 %2
  %4 = extractelement <2 x i64> %0, i32 1
  %5 = getelementptr inbounds i32, i32* %A, i64 %4
  %6 = extractelement <2 x i64> %1, i32 0
  %7 = getelementptr inbounds i32, i32* %A, i64 %6
  %8 = extractelement <2 x i64> %1, i32 1
  %9 = getelementptr inbounds i32, i32* %A, i64 %8
  store i32 4, i32* %3, align 4
  store i32 4, i32* %5, align 4
  store i32 4, i32* %7, align 4
  store i32 4, i32* %9, align 4
  %index.next = add i64 %index, 4
  %vec.ind.next = add <2 x i64> %vec.ind, <i64 4, i64 4>
  %10 = icmp eq i64 %index.next, 10000
  br i1 %10, label %middle.block, label %vector.body, !llvm.loop !0

.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        veslg   %v3, %v0, 2
        vlgvg   %r3, %v3, 0
        sllg    %r3, %r3, 2
        st      %r1, 0(%r3,%r2)
        vlgvg   %r3, %v3, 1
        vag     %v3, %v3, %v1
        sllg    %r3, %r3, 2
        st      %r1, 0(%r3,%r2)
        vlgvg   %r3, %v3, 0
        sllg    %r3, %r3, 2
        vag     %v0, %v0, %v2
        st      %r1, 0(%r3,%r2)
        vlgvg   %r3, %v3, 1
        sllg    %r3, %r3, 2
        aghi    %r0, -4
        st      %r1, 0(%r3,%r2)
        jne     .LBB0_1

w/ patch:
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %0 = shl nsw i64 %index, 2
  %1 = shl i64 %index, 2
  %2 = or i64 %1, 4
  %3 = shl i64 %index, 2
  %4 = or i64 %3, 8
  %5 = shl i64 %index, 2
  %6 = or i64 %5, 12
  %7 = getelementptr inbounds i32, i32* %A, i64 %0
  %8 = getelementptr inbounds i32, i32* %A, i64 %2
  %9 = getelementptr inbounds i32, i32* %A, i64 %4
  %10 = getelementptr inbounds i32, i32* %A, i64 %6
  store i32 4, i32* %7, align 4
  store i32 4, i32* %8, align 4
  store i32 4, i32* %9, align 4
  store i32 4, i32* %10, align 4
  %index.next = add i64 %index, 4
  %11 = icmp eq i64 %index.next, 10000
  br i1 %11, label %middle.block, label %vector.body, !llvm.loop !0

.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        sty     %r0, -32(%r1,%r2)
        sty     %r0, -48(%r1,%r2)
        sty     %r0, -16(%r1,%r2)
        st      %r0, 0(%r1,%r2)
        la      %r1, 64(%r1)
        cgfi    %r1, 160048
        jlh     .LBB0_1
Can't you check for scatter/gather support directly?