This is an archive of the discontinued LLVM Phabricator instance.

LoopVectorizer: let target prefer scalar addressing computations (+ minor improvements in SystemZTTI)
ClosedPublic

Authored by jonpa on Apr 24 2017, 1:51 AM.

Details

Summary

Apart from some minor improvements in SystemZTargetTransformInfo, this patch deals with the loop vectorizer's handling of addressing:

The loop vectorizer usually vectorizes any instruction it can and then extracts the elements for a scalarized use.

This is not good for addressing without support for vector gather/scatter, at least not on SystemZ. On SystemZ, all elements containing addresses must be extracted into address registers (GRs). Since this extraction is not free, it is better to have the address in a suitable register to begin with. By forcing address arithmetic instructions and loads of addresses to be scalar after vectorization, two benefits result:

  • No need to extract the address elements into registers
  • LSR optimizations trigger (LSR currently does not handle vector addresses)

Preliminary benchmark results show nice improvements on SystemZ with this new behaviour.

I was not sure what other targets might think of this, so I used a new TTI hook 'prefersVectorizedAddressing()', which defaults to true. It should be fairly straightforward for another target to extend this in the future so that any address used for gather/scatter is vectorized, while any other scalar pointer gets scalarized address computations. If that is desired, this new hook might be removed again.
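
For reference, the hook boils down to something like the following (a simplified sketch; the real change goes through the usual TargetTransformInfo plumbing):

// Default (TargetTransformInfoImplBase or equivalent): keep vectorizing
// address computations, i.e. current behavior for all targets.
bool prefersVectorizedAddressing() { return true; }

// SystemZ override (SystemZTTIImpl): addresses must be extracted into GRs
// anyway, so prefer scalar address computations.
bool prefersVectorizedAddressing() { return false; }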

(I have done benchmarking with the SystemZTTI improvements included, and I hope it's not confusing to include those files in the patch.)

As an example, the final output for this loop on trunk shows both shifts and extracts that are completely eliminated with the patch:

Loop before vectorize pass:

for.body144.us:                                   ; preds = %for.cond142.preheader.us, %for.body144.us
  %indvars.iv227 = phi i64 [ 0, %for.cond142.preheader.us ], [ %indvars.iv.next228, %for.body144.us ]
  %47 = shl nsw i64 %indvars.iv227, 1
  %arrayidx147.us = getelementptr inbounds double, double* %colB138.0191.us, i64 %47
  %48 = bitcast double* %arrayidx147.us to i64*
  %49 = load i64, i64* %48, align 8, !tbaa !13
  %arrayidx150.us = getelementptr inbounds double, double* %colA137.0190.us, i64 %47
  %50 = bitcast double* %arrayidx150.us to i64*
  store i64 %49, i64* %50, align 8, !tbaa !13
  %51 = or i64 %47, 1
  %arrayidx154.us = getelementptr inbounds double, double* %colB138.0191.us, i64 %51
  %52 = bitcast double* %arrayidx154.us to i64*
  %53 = load i64, i64* %52, align 8, !tbaa !13
  %arrayidx158.us = getelementptr inbounds double, double* %colA137.0190.us, i64 %51
  %54 = bitcast double* %arrayidx158.us to i64*
  store i64 %53, i64* %54, align 8, !tbaa !13
  %indvars.iv.next228 = add nuw nsw i64 %indvars.iv227, 1
  %cmp143.us = icmp slt i64 %indvars.iv.next228, %46
  br i1 %cmp143.us, label %for.body144.us, label %for.cond142.for.end161_crit_edge.us


Trunk:
BB#42: derived from LLVM BB %vector.body263
    Live Ins: %V0 %V1 %V2 %V3 %R4D %R5D %R6D %R7D %R8D %R9D %R10D %R11D %R12D %R13D %R14D %R3L
    Predecessors according to CFG: BB#41 BB#42
        %V4<def> = VESLG %V3, %noreg, 1
        %R0D<def> = VLGVG %V4, %noreg, 0
        %R1D<def> = SLLG %R0D<kill>, %noreg, 3
        %V5<def> = VL %R8D, 0, %R1D; mem:LD16[%109](align=8)(tbaa=!15)(alias.scope=!135)
        VSTEG %V5, %R7D, 0, %R1D, 0; mem:ST8[%113](tbaa=!15)(alias.scope=!138)(noalias=!135)
        %V6<def> = VL %R8D, 16, %R1D<kill>; mem:LD16[%109+16](align=8)(tbaa=!15)(alias.scope=!135)
        %R0D<def> = VLGVG %V4, %noreg, 1
        %V4<def> = VO %V4<kill>, %V1
        %R1D<def> = SLLG %R0D<kill>, %noreg, 3
        %R0D<def> = VLGVG %V4, %noreg, 0
        %V3<def> = VAG %V3<kill>, %V2
        VSTEG %V6, %R7D, 0, %R1D<kill>, 0; mem:ST8[%114](tbaa=!15)(alias.scope=!138)(noalias=!135)
        %R1D<def> = SLLG %R0D<kill>, %noreg, 3
        %R0D<def> = VLGVG %V4<kill>, %noreg, 1
        VSTEG %V5<kill>, %R7D, 0, %R1D<kill>, 1; mem:ST8[%122](tbaa=!15)(alias.scope=!138)(noalias=!135)
        %R1D<def> = SLLG %R0D<kill>, %noreg, 3
        %R6D<def,tied1> = AGHI %R6D<tied0>, -2, %CC<imp-def>
        VSTEG %V6<kill>, %R7D, 0, %R1D<kill>, 1; mem:ST8[%123](tbaa=!15)(alias.scope=!138)(noalias=!135)
        BRC 15, 7, <BB#42>, %CC<imp-use,kill>
    Successors according to CFG: BB#43(0x04000000 / 0x80000000 = 3.12%) BB#42(0x7c000000 / 0x80000000 = 96.88%)

Dev:
BB#42: derived from LLVM BB %vector.body263
    Live Ins: %R0D %R1D %R4D %R5D %R7D %R8D %R9D %R10D %R11D %R12D %R13D %R14D %R3L
    Predecessors according to CFG: BB#41 BB#42
        %V0<def> = VL %R8D, 0, %R1D; mem:LD16[%109](align=8)(tbaa=!15)(alias.scope=!135)
        VST %V0<kill>, %R7D, 0, %R1D; mem:ST16[%112](align=8)
        %V0<def> = VL %R8D, 16, %R1D; mem:LD16[%109+16](align=8)(tbaa=!15)(alias.scope=!135)
        VST %V0<kill>, %R7D, 16, %R1D; mem:ST16[%115](align=8)
        %R0D<def,tied1> = AGHI %R0D<tied0>, -2, %CC<imp-def>
        %R1D<def> = LA %R1D<kill>, 32, %noreg
        BRC 15, 7, <BB#42>, %CC<imp-use,kill>
    Successors according to CFG: BB#43(0x04000000 / 0x80000000 = 3.12%) BB#42(0x7c000000 / 0x80000000 = 96.88%)

Another example: test/Transforms/LoopVectorize/bsd_regex.ll

trunk:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.ind = phi <2 x i64> [ <i64 0, i64 1>, %vector.ph ], [ %vec.ind.next, %vector.body ]
  %step.add = add <2 x i64> %vec.ind, <i64 2, i64 2>
  %0 = shl nsw <2 x i64> %vec.ind, <i64 2, i64 2>
  %1 = shl nsw <2 x i64> %step.add, <i64 2, i64 2>
  %2 = extractelement <2 x i64> %0, i32 0
  %3 = getelementptr inbounds i32, i32* %A, i64 %2
  %4 = extractelement <2 x i64> %0, i32 1
  %5 = getelementptr inbounds i32, i32* %A, i64 %4
  %6 = extractelement <2 x i64> %1, i32 0
  %7 = getelementptr inbounds i32, i32* %A, i64 %6
  %8 = extractelement <2 x i64> %1, i32 1
  %9 = getelementptr inbounds i32, i32* %A, i64 %8
  store i32 4, i32* %3, align 4
  store i32 4, i32* %5, align 4
  store i32 4, i32* %7, align 4
  store i32 4, i32* %9, align 4
  %index.next = add i64 %index, 4
  %vec.ind.next = add <2 x i64> %vec.ind, <i64 4, i64 4>
  %10 = icmp eq i64 %index.next, 10000
  br i1 %10, label %middle.block, label %vector.body, !llvm.loop !0


.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        veslg   %v3, %v0, 2
        vlgvg   %r3, %v3, 0
        sllg    %r3, %r3, 2
        st      %r1, 0(%r3,%r2)
        vlgvg   %r3, %v3, 1
        vag     %v3, %v3, %v1
        sllg    %r3, %r3, 2
        st      %r1, 0(%r3,%r2)
        vlgvg   %r3, %v3, 0
        sllg    %r3, %r3, 2
        vag     %v0, %v0, %v2
        st      %r1, 0(%r3,%r2)
        vlgvg   %r3, %v3, 1
        sllg    %r3, %r3, 2
        aghi    %r0, -4
        st      %r1, 0(%r3,%r2)
        jne     .LBB0_1


w/ patch:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %0 = shl nsw i64 %index, 2
  %1 = shl i64 %index, 2
  %2 = or i64 %1, 4
  %3 = shl i64 %index, 2
  %4 = or i64 %3, 8
  %5 = shl i64 %index, 2
  %6 = or i64 %5, 12
  %7 = getelementptr inbounds i32, i32* %A, i64 %0
  %8 = getelementptr inbounds i32, i32* %A, i64 %2
  %9 = getelementptr inbounds i32, i32* %A, i64 %4
  %10 = getelementptr inbounds i32, i32* %A, i64 %6
  store i32 4, i32* %7, align 4
  store i32 4, i32* %8, align 4
  store i32 4, i32* %9, align 4
  store i32 4, i32* %10, align 4
  %index.next = add i64 %index, 4
  %11 = icmp eq i64 %index.next, 10000
  br i1 %11, label %middle.block, label %vector.body, !llvm.loop !0


.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        sty     %r0, -32(%r1,%r2)
        sty     %r0, -48(%r1,%r2)
        sty     %r0, -16(%r1,%r2)
        st      %r0, 0(%r1,%r2)
        la      %r1, 64(%r1)
        cgfi    %r1, 160048
        jlh     .LBB0_1

Diff Detail

Event Timeline

jonpa created this revision.Apr 24 2017, 1:51 AM
jonpa updated this revision to Diff 99270.May 17 2017, 4:53 AM

(This patch also includes cost modeling of division by a power of 2 for SystemZ, which is separate from the LoopVectorizer scalar addressing change.)

The scalar addressing generation is guarded by the target with a new hook, prefersVectorizedAddressing(), which defaults to true. Only SystemZ, which returns false, is affected by this patch.

A new per-VF set, 'ForcedScalars', is used. The new code inserts the address-computing instructions into it. This means that

  • They are inserted into Scalars so that they get scalarized
  • getInstructionCost() returns scalar cost * VF (scalarization overhead excluded)

I added a comment explaining how these memory instructions relate to the heuristic that at least one instruction must be vectorized in order to proceed with vectorization.

The search is done in setCostBasedWideningDecision(), because these instructions also include loads of addresses, and the widening decision for those loads must be set in that method.
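
As a rough sketch of the cost handling (assuming ForcedScalars maps each VF to its set of forced-scalar instructions; this is illustrative, not the literal diff):

// Sketch only: in getInstructionCost(), a forced-scalar instruction is costed
// as VF copies of its scalar cost, with no extract/insert overhead added.
auto FSIt = ForcedScalars.find(VF);
if (VF > 1 && FSIt != ForcedScalars.end() && FSIt->second.count(I))
  return VF * getInstructionCost(I, /*VF=*/1);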

jonpa updated this revision to Diff 99286.May 17 2017, 6:47 AM

ping!

This patch has now been confirmed on benchmarks, and is ready to commit. See previous comment for a summary.

I added a comment like:
FIXME: Currently widenPHIInstruction() often creates a dead vector
induction variable when the PHI user is scalarized.

I am not sure whether it is justified or in the right place... It is nothing new with this patch.

(The SystemZ part which was unrelated has been removed from this review)

rengolin edited edge metadata.May 17 2017, 7:02 AM

Hi Jonas,

I understand your problem and the SystemZ part is probably fine (I can't review that myself), but I fear introducing such a callback will not address the underlying cause.

What we really need is to know whether the shuffle costs will be higher than the savings, and that should be done by asking for the shuffle costs directly.

I'd assume that targets without scatter/gather support would return higher costs for those operations (probably an order of magnitude higher), so maybe there's an issue in the load/store cost analysis that, if fixed, could considerably simplify this.

cheers,
--renato

include/llvm/Analysis/TargetTransformInfo.h
394

Can't you check for scatter/gather support directly?

jonpa added a comment.May 17 2017, 7:32 AM

Hi Jonas,

I understand your problem and the SystemZ part is probably fine (I can't review that myself), but I fear introducing such a callback will not address the underlying cause.

What we really need is to know whether the shuffle costs will be higher than the savings, and that should be done by asking for the shuffle costs directly.

I'd assume that targets without scatter/gather support would return higher costs for those operations (probably an order of magnitude higher), so maybe there's an issue in the load/store cost analysis that, if fixed, could considerably simplify this.

cheers,
--renato

I don't quite follow what this has to do with vector shuffles...? On SystemZ, all addresses residing in vectors must be extracted before use. (There are slow vector element gather/scatter instructions, which the backend does not use, so they are irrelevant.)

I see your point that there is a theoretical possibility that if an address is computed in vectors, and that sequence of computations is long enough, it might still be cheaper. In practice, however, this seems impossible at the moment, because LSR does not improve vectorized addressing. So even if we computed the cost of the vector instructions plus extracts and compared it to the scalar instructions, it would be inaccurate.

include/llvm/Analysis/TargetTransformInfo.h
394

I didn't want to use isLegalMaskedScatter() / isLegalMaskedGather(), because "masked" has nothing to do with this.
I guess I could instead of this new hook check if getGatherScatterOpCost() returns INT_MAX.

I am not sure however if it will always be true that targets want to keep it this simple. Couldn't it be that a target with such support actually wants scalarized addressing for scalarized/vectorized memory accesses, while still doing gather/scatter whenever possible?

rengolin added subscribers: delena, jmolloy.

I don't quite follow what this has to do with vector shuffles...? On SystemZ, all addresses residing in vectors must be extracted before use. (There are slow vector element gather/scatter instructions, which the backend does not use, so they are irrelevant.)

Right, I see what you mean about the "address computation" being forced into scalar registers. Does that mean that you have to use the same GR to load into all lanes of the vector, so supposedly load+increment+load+increment...?

Regardless, between iterations of the loop, you want to keep the addresses in the GRs throughout the execution of the loop. Is that correct?

I see your point that there is a theoretical possibility that if an address is computed in vectors, and that sequence of computations is long enough, it might still be cheaper. In practice, however, this seems impossible at the moment, because LSR does not improve vectorized addressing. So even if we computed the cost of the vector instructions plus extracts and compared it to the scalar instructions, it would be inaccurate.

Right, so this is not the cost of extract/insert, but the cost of indirect access, which is not the same as the masked scatter/gather that we currently have, I agree.

I'm adding Elena and James to see if they have some more ideas.

I guess I could instead of this new hook check if getGatherScatterOpCost() returns INT_MAX.

That's what I was thinking...

I am not sure however if it will always be true that targets want to keep it this simple. Couldn't it be that a target with such support actually wants scalarized addressing for scalarized/vectorized memory accesses, while still doing gather/scatter whenever possible?

There are two issues to discern here:

  1. If the target has any kind of scatter/gather support.
  2. What is the cost of doing so in a particular case.

The first problem could be solved by having a sub-target feature or similar. The second may need to inspect which instruction we're talking about, etc., which is in line with some changes to the cost functions we've seen recently.

Both solutions could be applied on the same function, for example:

int getGatherScatterOpCost(Instruction *I) {
  int Cost = 1;
  if (!Target->hasScatterGather())
    return INT_MAX;
  // Use I to make sure this is what I think it is...
  ...
  return Cost;
}

Does that make sense? @delena @jmolloy do we even need this level of complexity?

cheers,
--renato

jonpa added a comment.May 17 2017, 8:13 AM

Right, I see what you mean about the "address computation" being forced into scalar registers. Does that mean that you have to use the same GR to load into all lanes of the vector, so supposedly load+increment+load+increment...?

It doesn't have to be the same GR for all the lanes, but it must be a GR (since we don't use the slow "element gather/scatter")

Regardless, between iterations of the loop, you want to keep the addresses in the GRs throughout the execution of the loop. Is that correct?

Yes, that's better than having to extract elements. The extracts are relatively expensive.

Does that make sense?

I am thinking that we are already past the point of making the widening decision when this patch starts to run in setCostBasedWideningDecision(). The only question then is what type of address computation the target wants.

  1. For a gather/scatter access, it is obviously in vector registers - this is something the patch shouldn't change.
  2. For the scalarized/widened/interleaved access, I think it is a general question of preference which may vary from target to target. It seems far-fetched to handle this at instruction level, at least on SystemZ.

BTW, if we indeed want to let targets with gather/scatter support have a say about 2), there should also be a check so that gather/scatter accesses aren't included in AddrDefs, since they shouldn't be scalarized.

Right, I see what you mean about the "address computation" being forced into scalar registers. Does that mean that you have to use the same GR to load into all lanes of the vector, so supposedly load+increment+load+increment...?

It doesn't have to be the same GR for all the lanes, but it must be a GR (since we don't use the slow "element gather/scatter")

Regardless, between iterations of the loop, you want to keep the addresses in the GRs throughout the execution of the loop. Is that correct?

Yes, that's better than having to extract elements. The extracts are relatively expensive.

Ack.

I am thinking that we are already past the point of making the widening decision when this patch starts to run in setCostBasedWideningDecision(). The only question then is what type of address computation the target wants.

  1. For a gather/scatter access, it is obviously in vector registers - this is something the patch shouldn't change.
  2. For the scalarized/widened/interleaved access, I think it is a general question of preference which may vary from target to target. It seems far-fetched to handle this at instruction level, at least on SystemZ.

I think you have a good point there. I may be overthinking it.

I'm ok with the patch as it is, but I'll let it simmer for a bit so that Ulrich, Hal, Elena and James can have a look. I'm happy if they are.

cheers,
--renato

delena edited edge metadata.May 17 2017, 1:54 PM

In general, I think that one-level-up scalarization is good for all targets. In any case, the attempt to scalarize the Ptr should be done if the decision taken is "Scalarize".

lib/Transforms/Vectorize/LoopVectorize.cpp
7301

But interleaved loads are also here. You don't need to scalarize their pointers.

7305

You will find the getPointerOperand(Inst) utility in this file. It retrieves the pointer from a load or store.

7322

In your examples you show only one level up. I don't believe that we have too many real cases of multiple levels of address indirection, but theoretically it may not be so profitable if you go many levels up.

7322

The scalarized instruction should belong to the same basic block and, probably, have only one user.

7331

At this point you change the widening decision of an already analyzed load instruction. Make sure that this instruction is not part of an interleave group.

jonpa marked an inline comment as done.May 18 2017, 8:52 AM

Elena, thanks for a nice review! I'll wait for your comments before updating the patch.

I have tried all your points and explain my findings inline below.

In general, I think that one-level-up scalarization is good for all targets.

You mean to handle just the case where we know this only changes one instruction's decision to CM_Scalarize?

lib/Transforms/Vectorize/LoopVectorize.cpp
7301

the attempt to scalarize the Ptr should be done if the decision taken is "Scalarize".

I tried

if (PtrDef && TheLoop->contains(PtrDef) &&
    getWideningDecision(&I, VF) == CM_Scalarize)
  AddrDefs.insert(PtrDef);

This was a minor change: only 17 loops became different. This seems to relate to the cases of mixed context (used both as address and otherwise), which means that instead of extracting an address we now do inserts.

Not sure exactly if this matters, but from what I can see, it seemed slightly better in terms of final MIs to scalarize "anything" (not just the scalarized case), perhaps because of the big benefit of LSR working on scalar addresses?

If you don't have a strong opinion, I guess I would be most happy not to change anything here.

7305

Aah, thanks.

7322

In your examples you show only one level up. I don't believe that we have too many real cases of multiple levels of address indirection, but theoretically it may not be so profitable if you go many levels up.

In my experience, if there are many levels of address computation stages, I think it would still be wise to scalarize them, let this become "too expensive" in those cases, and let LSR do its magic, rather than vectorize complex addressing. In practice, benchmarks show that about the same number of loops still get vectorized, so this shouldn't be common.

I still think it is a pity to see vectorized code that gets no LSR treatment -- feel free to join the discussion at https://reviews.llvm.org/D32166 and share any thoughts...

The scalarized instruction should belong to the same Basic block

Changing the check from "contained in the loop" to "contained in the same BB" actually gave identical results...

and , probably, have only one user.

I have experimented with something like this and found a few cases where an address computation was also used in another context. What I did then was an iterative pruning from AddrDefs of any instructions used outside of AddrDefs. It was quite rare and did not seem to matter much, and I left it like this with the idea that maybe it is best to help the memory access be fast, but this is not confirmed.

I tried this now - just adding "InstOp->hasOneUse() &&" here - and found that on SPEC, 73 loops were affected. This seems quite clearly to be for the worse, with just more extracts, so if you do not have a very strong opinion, I would like to leave this alone...

7331

I tried to only change the widening decision if it was CM_Widen, and reran SPEC. 17 loops (0.6% of vectorized loops) now got different output, and all but 2 of them got slightly worse results. So this isn't a big deal, but still I would insist that there isn't a good reason here to interleave loads of addresses. They are 64 bits, which means 2 per vector register, so there isn't that much room for vectorization anyway.

Here's an example:

Loop before vectorize pass:

for.body61:                                       ; preds = %for.body61, %for.body61.lr.ph
  %indvars.iv81 = phi i64 [ %indvars.iv.next82, %for.body61 ], [ 0, %for.body61.lr.ph ]
  %next65 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %indvars.iv81, i32 1
  %20 = load i32, i32* %next65, align 4, !tbaa !11
  %idxprom66 = sext i32 %20 to i64
  %arrayidx67 = getelementptr inbounds i32, i32* %1, i64 %idxprom66
  %21 = load i32, i32* %arrayidx67, align 4, !tbaa !7
  store i32 %21, i32* %next65, align 4, !tbaa !11
  %indvars.iv.next82 = add nuw nsw i64 %indvars.iv81, 1
  %cmp59 = icmp slt i64 %indvars.iv81, %16
  br i1 %cmp59, label %for.body61, label %for.cond75.preheader.loopexit

PATCH UNMODIFIED:

Loop after vectorize pass:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
  %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
  %20 = add i64 %index, 0
  %21 = add i64 %index, 1
  %22 = add i64 %index, 2
  %23 = add i64 %index, 3
  %24 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %20, i32 1
  %25 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %21, i32 1
  %26 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %22, i32 1
  %27 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %23, i32 1
  %28 = load i32, i32* %24, align 4, !tbaa !11
  %29 = load i32, i32* %25, align 4, !tbaa !11
  %30 = load i32, i32* %26, align 4, !tbaa !11
  %31 = load i32, i32* %27, align 4, !tbaa !11
  %32 = sext i32 %28 to i64
  %33 = sext i32 %29 to i64
  %34 = sext i32 %30 to i64
  %35 = sext i32 %31 to i64
  %36 = getelementptr inbounds i32, i32* %1, i64 %32
  %37 = getelementptr inbounds i32, i32* %1, i64 %33
  %38 = getelementptr inbounds i32, i32* %1, i64 %34
  %39 = getelementptr inbounds i32, i32* %1, i64 %35
  %40 = load i32, i32* %36, align 4, !tbaa !7
  %41 = load i32, i32* %37, align 4, !tbaa !7
  %42 = load i32, i32* %38, align 4, !tbaa !7
  %43 = load i32, i32* %39, align 4, !tbaa !7
  store i32 %40, i32* %24, align 4, !tbaa !11
  store i32 %41, i32* %25, align 4, !tbaa !11
  store i32 %42, i32* %26, align 4, !tbaa !11
  store i32 %43, i32* %27, align 4, !tbaa !11
  %index.next = add i64 %index, 4
  %44 = icmp eq i64 %index.next, %n.vec
  br i1 %44, label %middle.block, label %vector.body, !llvm.loop !12

->

Final Header: 
BB#22: derived from LLVM BB %vector.body
    Live Ins: %R0D %R1D %R2D %R3D %R4D %R12D %R13D
    Predecessors according to CFG: BB#21 BB#23
        %R5D<def> = LGF %R4D, -32, %noreg; mem:LD4[%scevgep112](tbaa=!21)
        %R14D<def> = LGF %R4D, -40, %noreg; mem:LD4[%scevgep113](tbaa=!21)
        %R5D<def> = SLLG %R5D<kill>, %noreg, 2
        %R14D<def> = SLLG %R14D<kill>, %noreg, 2
        %R5L<def> = L %R12D, 0, %R5D<kill>; mem:LD4[%41](tbaa=!2)
        %R5H<def> = LFH %R12D, 0, %R14D<kill>; mem:LD4[%40](tbaa=!2)
        %R14D<def> = LGF %R4D, -48, %noreg; mem:LD4[%scevgep114](tbaa=!21)
        %R11D<def> = LGF %R4D, -56, %noreg; mem:LD4[%scevgep115](tbaa=!21)
        %R14D<def> = SLLG %R14D<kill>, %noreg, 2
        %R11D<def> = SLLG %R11D<kill>, %noreg, 2
        %R14L<def> = L %R12D, 0, %R14D<kill>; mem:LD4[%39](tbaa=!2)
        %R14H<def> = LFH %R12D, 0, %R11D<kill>; mem:LD4[%38](tbaa=!2)
        STFH %R14H<kill>, %R4D, -56, %noreg; mem:ST4[%scevgep115](tbaa=!21)
        STY %R14L<kill>, %R4D, -48, %noreg; mem:ST4[%scevgep114](tbaa=!21)
        STFH %R5H<kill>, %R4D, -40, %noreg; mem:ST4[%scevgep113](tbaa=!21)
        STY %R5L<kill>, %R4D, -32, %noreg; mem:ST4[%scevgep112](tbaa=!21)
        CGIJ %R3D, 0, 8, <BB#24>, %CC<imp-def,dead>

instr-count:  17

MODIFIED TO NOT TOUCH INTERLEAVED LOADS:

Loop after vectorize pass:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
  %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
  %20 = add i64 %index, 0
  %21 = add i64 %index, 1
  %22 = add i64 %index, 2
  %23 = add i64 %index, 3
  %24 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %20, i32 1
  %25 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %21, i32 1
  %26 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %22, i32 1
  %27 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %23, i32 1
  %28 = getelementptr i32, i32* %24, i32 0
  %29 = bitcast i32* %28 to <8 x i32>*
  %wide.vec = load <8 x i32>, <8 x i32>* %29, align 4, !tbaa !11
  %strided.vec = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %30 = extractelement <4 x i32> %strided.vec, i32 0
  %31 = sext i32 %30 to i64
  %32 = extractelement <4 x i32> %strided.vec, i32 1
  %33 = sext i32 %32 to i64
  %34 = extractelement <4 x i32> %strided.vec, i32 2
  %35 = sext i32 %34 to i64
  %36 = extractelement <4 x i32> %strided.vec, i32 3
  %37 = sext i32 %36 to i64
  %38 = getelementptr inbounds i32, i32* %1, i64 %31
  %39 = getelementptr inbounds i32, i32* %1, i64 %33
  %40 = getelementptr inbounds i32, i32* %1, i64 %35
  %41 = getelementptr inbounds i32, i32* %1, i64 %37
  %42 = load i32, i32* %38, align 4, !tbaa !7
  %43 = load i32, i32* %39, align 4, !tbaa !7
  %44 = load i32, i32* %40, align 4, !tbaa !7
  %45 = load i32, i32* %41, align 4, !tbaa !7
  store i32 %42, i32* %24, align 4, !tbaa !11
  store i32 %43, i32* %25, align 4, !tbaa !11
  store i32 %44, i32* %26, align 4, !tbaa !11
  store i32 %45, i32* %27, align 4, !tbaa !11
  %index.next = add i64 %index, 4
  %46 = icmp eq i64 %index.next, %n.vec
  br i1 %46, label %middle.block, label %vector.body, !llvm.loop !12

Final Header: 
BB#22: derived from LLVM BB %vector.body
    Live Ins: %R0D %R1D %R2D %R3D %R4D %R12D %R13D
    Predecessors according to CFG: BB#21 BB#23
        %V0<def> = VL %R4D, 16, %noreg; mem:LD16[%lsr.iv106113+16](align=4)(tbaa=!21)
        %R5D<def> = VLGVF %V0, %noreg, 2
        %R5D<def> = LGFR %R5L<kill>, %R5D<imp-use,kill>
        %R5D<def> = SLLG %R5D<kill>, %noreg, 2
        %R5L<def> = L %R12D, 0, %R5D<kill>; mem:LD4[%44](tbaa=!2)
        %R14D<def> = VLGVF %V0<kill>, %noreg, 0
        %V0<def> = VL %R4D, 0, %noreg; mem:LD16[%lsr.iv106113](align=4)(tbaa=!21)
        %R14D<def> = LGFR %R14L<kill>, %R14D<imp-use,kill>
        %R14D<def> = SLLG %R14D<kill>, %noreg, 2
        %R5H<def> = LFH %R12D, 0, %R14D<kill>; mem:LD4[%43](tbaa=!2)
        %R14D<def> = VLGVF %V0, %noreg, 0
        %R14D<def> = LGFR %R14L<kill>, %R14D<imp-use,kill>
        %R14D<def> = SLLG %R14D<kill>, %noreg, 2
        %R14L<def> = L %R12D, 0, %R14D<kill>; mem:LD4[%41](tbaa=!2)
        %R11D<def> = VLGVF %V0<kill>, %noreg, 2
        %R11D<def> = LGFR %R11L<kill>, %R11D<imp-use,kill>
        %R11D<def> = SLLG %R11D<kill>, %noreg, 2
        %R14H<def> = LFH %R12D, 0, %R11D<kill>; mem:LD4[%42](tbaa=!2)
        STFH %R14H<kill>, %R4D, 8, %noreg; mem:ST4[%scevgep117](tbaa=!21)
        ST %R14L<kill>, %R4D, 0, %noreg; mem:ST4[%lsr.iv106108](tbaa=!21)
        STFH %R5H<kill>, %R4D, 16, %noreg; mem:ST4[%scevgep116](tbaa=!21)
        ST %R5L<kill>, %R4D, 24, %noreg; mem:ST4[%scevgep115](tbaa=!21)
        CGIJ %R3D, 0, 8, <BB#24>, %CC<imp-def,dead>

instr-count:  23

I experimented with scalarizing the whole interleave group instead, like:

for (auto *I : AddrDefs) {
  if (isa<LoadInst>(I)) {
    if (getWideningDecision(I, VF) == CM_Widen)
      // Scalarize a widened load of address.
      setWideningDecision(I, VF, CM_Scalarize,
                          (VF * getMemoryInstructionCost(I, 1)));
    else if (auto Group = Legal->getInterleavedAccessGroup(I)) {
      // Scalarize an interleave group of address loads.
      for (unsigned I = 0; I < Group->getFactor(); ++I) {
        if (Instruction *Member = Group->getMember(I))
          setWideningDecision(Member, VF, CM_Scalarize,
                              (VF * getMemoryInstructionCost(Member, 1)));
      }
    }
  } else
    // Make sure I gets scalarized and a cost estimate without
    // scalarization overhead.
    ForcedScalars[VF].insert(I);
}

This was also very close to the original patch - just 5 new loops now became vectorized, which seems to be a slight improvement in those loops. This is because the cost is now accurate for the scalarized loads.

So to summarize, this really doesn't matter that much on SPEC, but it seems that the best thing to do is to scalarize the whole interleave group, second best is to just scalarize (unmodified patch), and third place is to interleave loads of addresses, which is not so good since we then have to do extracts...

delena added inline comments.May 19 2017, 12:56 AM
include/llvm/Analysis/TargetTransformInfo.h
394
"Couldn't it be that a target with such support actually wants scalarized addressing for scalarized/vectorized memory accesses, while still doing gather/scatter whenever possible?"

Scalarizing memory addressing before gather/scatter does not make sense on X86. I can't say anything about other targets.

394

"I didn't want to use isLegalMaskedScatter() / isLegalMaskedGather(), because "masked" has nothing to do with this."
The API could be extended. We can add isLegalGather()/isLegalScatter() and it will be consistent and clear.
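
Hypothetically, the new hooks would just mirror the existing masked variants, e.g.:

// Hypothetical unmasked counterparts of isLegalMaskedGather() /
// isLegalMaskedScatter(); they do not exist in TTI as of this review.
bool isLegalGather(Type *DataType);
bool isLegalScatter(Type *DataType);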

include/llvm/Analysis/TargetTransformInfoImpl.h
234

I assume that X86 would prefer to scalarize addressing for scalarized memops. But it is not clear how to use this interface...

lib/Transforms/Vectorize/LoopVectorize.cpp
2107

Maybe we can add these instructions to the widening decisions map? This map contains only memops now, right?

7301

I just do not understand why you need to scalarize addressing if the instruction itself is not scalarized.
Can you give me an IR example?

7331

You can't change the decision of one load from the interleave group - it's either all or nothing.

7331
> "So this isn't a big deal, but still I would insist that there isn't a good reason here to interleave loads of addresses. "

This is a cost model decision. Interleaving of 64-bit loads is profitable on X86.

jonpa marked an inline comment as done.May 20 2017, 12:32 AM
jonpa added inline comments.
include/llvm/Analysis/TargetTransformInfo.h
394

Scalarizing memory addressing before gather/scatter does not make sense on X86. I can't say anything about other targets.

I meant that this could be applied to cases *other* than gather/scatter.

394

I suppose that "preferrsVectorizedAddressing()" could be named something else, but the point with this is currently that it is separate from deciding on the widening decision, which means that even though it currently is only a way to enable this for SystemZ only, it could become also for other targets a way to get scalarized addressing for *non* gather/scatter accesses (or CM_Scalarize only, perhaps).

So first, the widening decision is made, which includes calling isLegalScatter/Gather or similar. With all memory accesses mapped to such a decision, we then ask whether, for any non-gather/scatter access, the target wants the current vectorized addressing or would rather keep it scalar.

include/llvm/Analysis/TargetTransformInfoImpl.h
234

OK - so far this patch only works for SystemZ. As discussed already, we will have to improve it somehow if X86 has an interest.

lib/Transforms/Vectorize/LoopVectorize.cpp
2107

They are not just CM_Scalarize, because they don't have any scalarization overhead.
(With a more intelligent handling of scalarized instruction sequences I guess this would become equivalent)

ForcedScalars contains not just memory instructions, but any other instruction leading to an address.

7301

Again, "This seems to relate to the cases of mixed context (used both as address and otherwise)"...

For example:

for.body.i:                                       ; preds = %for.body.i, %for.body.preheader.i
  %indvars.iv241.i = phi i64 [ %47, %for.body.preheader.i ], [ %indvars.iv.next242.i, %for.body.i ]
  %indvars.iv.next242.i = add nsw i64 %indvars.iv241.i, -1
  %arrayidx.i.i = getelementptr inbounds i32, i32* %42, i64 %indvars.iv.next242.i
  %48 = trunc i64 %indvars.iv.next242.i to i32
  store i32 %48, i32* %arrayidx.i.i, align 4, !tbaa !47
  %tobool.i = icmp eq i32 %48, 0
  br i1 %tobool.i, label %for.end.i.loopexit, label %for.body.i

This is a mixed use where the truncated IV is stored at the address indexed by the IV. Due to the non-address use, it will be contained in a vector.

7331

You can't change decision of one load from the interleave group - or all, or nothing.

Ok - I will update the patch then to scalarize the whole interleave group, which seemed best also.

7331

This is a cost model decision. Interleaving of 64-bit loads is profitable on X86.

It would have been nice to just return a prohibitively high value in getInterleavedMemoryOpCost() for loads of addresses, but I found that there are cases where an index is loaded, so it is not just a matter of checking whether the passed type is a pointer type to catch this.

An alternative would then be to pass the instruction pointer into that method and do a search there to see if the loaded value is involved in addressing. That way, the widening decision for such a load would always be CM_Scalarize before this patch starts to work. This is, however, just extra work compared to the current patch, which already finds those instructions.

Perhaps another alternative would be an extra boolean argument to that method and to getMemoryOpCost(), something like IsInvolvedInAddressing, so that this patch could call them again to check whether they return a very high value. This would then be done at the point where the patch is scalarizing things.

jonpa added inline comments.May 20 2017, 11:27 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
7301

In this example, it is the same instruction (the store) that uses the value both as an address and as a general value. There are also other cases where it may be used in a store but also in a different instruction, like an arithmetic one...

I would like to point out that this is actually wanted right now, since performance measurements show better results on SystemZ, at least on one benchmark, by doing this "as much as possible", meaning that any mixed use like this gets included in this transformation. (I have actually tried this version of the patch and compared it to a less aggressive version that does not touch mixed uses, and the more aggressive one turned out to be better...)

jonpa updated this revision to Diff 100055.May 24 2017, 2:35 AM

Elena,

I have updated the patch based on your review. The changes are as follows (a simplified sketch of the resulting search appears after the list):

  • Use getPointerOperand()
  • Only handle memory accesses which *are not gather/scatter*. This way, SystemZ gets the behavior it wants, and other targets such as x86 can enable this without disrupting gather/scatter accesses.
  • Check for operand in same BB, instead of inside loop (currently completely equivalent)
  • Handle the whole interleave group when changing the widening decision. Also added a comment as to why this is not (currently) handled by the cost functions.
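
For reference, a simplified sketch of the resulting search in setCostBasedWideningDecision() (details may differ from the actual diff; the interleave-group handling itself is as in the earlier snippet):

// Sketch only, based on the points above.
// Collect loop-defined pointer operands of non-gather/scatter accesses.
SmallPtrSet<Instruction *, 8> AddrDefs;
for (BasicBlock *BB : TheLoop->blocks())
  for (Instruction &I : *BB) {
    Instruction *PtrDef =
        dyn_cast_or_null<Instruction>(getPointerOperand(&I));
    if (PtrDef && TheLoop->contains(PtrDef) &&
        getWideningDecision(&I, VF) != CM_GatherScatter)
      AddrDefs.insert(PtrDef);
  }

// Transitively add the feeding instructions, staying within the same BB.
SmallVector<Instruction *, 8> Worklist(AddrDefs.begin(), AddrDefs.end());
while (!Worklist.empty()) {
  Instruction *I = Worklist.pop_back_val();
  for (Value *Op : I->operands())
    if (auto *InstOp = dyn_cast<Instruction>(Op))
      if (InstOp->getParent() == I->getParent() && !isa<PHINode>(InstOp) &&
          AddrDefs.insert(InstOp).second)
        Worklist.push_back(InstOp);
}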

Does this look ok to you now?

delena added inline comments.May 24 2017, 3:06 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7308

At this point you may have an "interleave" decision. I assume you have nothing to do with it.
I'd check getWideningDecision(&I, VF) == CM_Scalarize

7321

AddrDefs.insert(InstOp).second == true) -> AddrDefs.count(InstOp)

jonpa added inline comments.May 24 2017, 3:20 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7308

As explained before, there are just a few loops that this relates to. When I tried to handle only CM_Scalarize like you suggested, it changed just 17 loops, but it seemed to be just a bit better to handle the interleaved accesses as well.

I believe in this case the address is scalar, but that register is also used for something else. That means it ends up in a vector register and has to be extracted.

The reason I would also like to see this is, as explained before, that due to LSR it is generally better, all else being equal, to keep addressing scalar.

7321

No - I want to insert InstOp into AddrDefs only if it isn't already there, right?

The code LGTM, I do not have more comments on it. I can't say anything about the profitability of this algorithm for SystemZ; maybe you need an additional LGTM from one of the SystemZ developers?

lib/Transforms/Vectorize/LoopVectorize.cpp
7308

Ok. You scalarize the address calculation even for wide loads. It probably makes sense for SystemZ.

7321

ok, you are right!

uweigand accepted this revision.May 24 2017, 6:09 AM

The code LGTM, I do not have more comments on it. I can't say anything about the profitability of this algorithm for SystemZ; maybe you need an additional LGTM from one of the SystemZ developers?

We've seen that this is profitable in measurements, so if you're fine with the implementation, the patch LGTM as well.

This revision is now accepted and ready to land.May 24 2017, 6:09 AM
jonpa closed this revision.May 24 2017, 6:43 AM

Thanks for review, everyone.
r303744.

What test suite did you use to demonstrate performance?
