This is an archive of the discontinued LLVM Phabricator instance.

LoopVectorizer: let target prefer scalar addressing computations (+ minor improvements in SystemZTTI)
ClosedPublic

Authored by jonpa on Apr 24 2017, 1:51 AM.

Details

Summary

Apart from some minor improvements in SystemZTargetTransformInfo, this patch deals with the loop vectorizer's handling of addressing:

The loop vectorizer usually vectorizes any instruction it can and then extracts the elements for a scalarized use.

This is not good for addressing without support for vector gather/scatter, at least not on SystemZ. On SystemZ, all elements containing addresses must be extracted into address registers (GRs). Since this extraction is not free, it is better to have the address in a suitable register to begin with. By forcing address arithmetic instructions and loads of addresses to be scalar after vectorization, two benefits result:

  • No need to extract the address elements into registers
  • LSR optimizations trigger (LSR currently does not handle vector addresses)

Preliminary benchmark results show nice improvements on SystemZ with this new behaviour.

I was not sure what other targets might think of this, so I used a new TTI hook 'prefersVectorizedAddressing()', which defaults to true. It should be fairly straightforward for another target to extend this in the future so that any address used for gather/scatter is vectorized, while any other scalar pointer gets scalarized address computations. If that is desired, this new hook might be removed again.
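
For reference, the hook boils down to something like the following (a simplified sketch; the real change goes through the usual TargetTransformInfo plumbing):

// Default (TargetTransformInfoImplBase or equivalent): keep vectorizing
// address computations, i.e. current behavior for all targets.
bool prefersVectorizedAddressing() { return true; }

// SystemZ override (SystemZTTIImpl): addresses must be extracted into GRs
// anyway, so prefer scalar address computations.
bool prefersVectorizedAddressing() { return false; }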

(I have done benchmarking with the SystemZTTI improvements included, and I hope it's not confusing to include those files in the patch.)

As an example, the final output for this loop on trunk shows both shifts and extracts that are completely eliminated with the patch:

Loop before vectorize pass:

for.body144.us:                                   ; preds = %for.cond142.preheader.us, %for.body144.us
  %indvars.iv227 = phi i64 [ 0, %for.cond142.preheader.us ], [ %indvars.iv.next228, %for.body144.us ]
  %47 = shl nsw i64 %indvars.iv227, 1
  %arrayidx147.us = getelementptr inbounds double, double* %colB138.0191.us, i64 %47
  %48 = bitcast double* %arrayidx147.us to i64*
  %49 = load i64, i64* %48, align 8, !tbaa !13
  %arrayidx150.us = getelementptr inbounds double, double* %colA137.0190.us, i64 %47
  %50 = bitcast double* %arrayidx150.us to i64*
  store i64 %49, i64* %50, align 8, !tbaa !13
  %51 = or i64 %47, 1
  %arrayidx154.us = getelementptr inbounds double, double* %colB138.0191.us, i64 %51
  %52 = bitcast double* %arrayidx154.us to i64*
  %53 = load i64, i64* %52, align 8, !tbaa !13
  %arrayidx158.us = getelementptr inbounds double, double* %colA137.0190.us, i64 %51
  %54 = bitcast double* %arrayidx158.us to i64*
  store i64 %53, i64* %54, align 8, !tbaa !13
  %indvars.iv.next228 = add nuw nsw i64 %indvars.iv227, 1
  %cmp143.us = icmp slt i64 %indvars.iv.next228, %46
  br i1 %cmp143.us, label %for.body144.us, label %for.cond142.for.end161_crit_edge.us


Trunk:
BB#42: derived from LLVM BB %vector.body263
    Live Ins: %V0 %V1 %V2 %V3 %R4D %R5D %R6D %R7D %R8D %R9D %R10D %R11D %R12D %R13D %R14D %R3L
    Predecessors according to CFG: BB#41 BB#42
        %V4<def> = VESLG %V3, %noreg, 1
        %R0D<def> = VLGVG %V4, %noreg, 0
        %R1D<def> = SLLG %R0D<kill>, %noreg, 3
        %V5<def> = VL %R8D, 0, %R1D; mem:LD16[%109](align=8)(tbaa=!15)(alias.scope=!135)
        VSTEG %V5, %R7D, 0, %R1D, 0; mem:ST8[%113](tbaa=!15)(alias.scope=!138)(noalias=!135)
        %V6<def> = VL %R8D, 16, %R1D<kill>; mem:LD16[%109+16](align=8)(tbaa=!15)(alias.scope=!135)
        %R0D<def> = VLGVG %V4, %noreg, 1
        %V4<def> = VO %V4<kill>, %V1
        %R1D<def> = SLLG %R0D<kill>, %noreg, 3
        %R0D<def> = VLGVG %V4, %noreg, 0
        %V3<def> = VAG %V3<kill>, %V2
        VSTEG %V6, %R7D, 0, %R1D<kill>, 0; mem:ST8[%114](tbaa=!15)(alias.scope=!138)(noalias=!135)
        %R1D<def> = SLLG %R0D<kill>, %noreg, 3
        %R0D<def> = VLGVG %V4<kill>, %noreg, 1
        VSTEG %V5<kill>, %R7D, 0, %R1D<kill>, 1; mem:ST8[%122](tbaa=!15)(alias.scope=!138)(noalias=!135)
        %R1D<def> = SLLG %R0D<kill>, %noreg, 3
        %R6D<def,tied1> = AGHI %R6D<tied0>, -2, %CC<imp-def>
        VSTEG %V6<kill>, %R7D, 0, %R1D<kill>, 1; mem:ST8[%123](tbaa=!15)(alias.scope=!138)(noalias=!135)
        BRC 15, 7, <BB#42>, %CC<imp-use,kill>
    Successors according to CFG: BB#43(0x04000000 / 0x80000000 = 3.12%) BB#42(0x7c000000 / 0x80000000 = 96.88%)

Dev:
BB#42: derived from LLVM BB %vector.body263
    Live Ins: %R0D %R1D %R4D %R5D %R7D %R8D %R9D %R10D %R11D %R12D %R13D %R14D %R3L
    Predecessors according to CFG: BB#41 BB#42
        %V0<def> = VL %R8D, 0, %R1D; mem:LD16[%109](align=8)(tbaa=!15)(alias.scope=!135)
        VST %V0<kill>, %R7D, 0, %R1D; mem:ST16[%112](align=8)
        %V0<def> = VL %R8D, 16, %R1D; mem:LD16[%109+16](align=8)(tbaa=!15)(alias.scope=!135)
        VST %V0<kill>, %R7D, 16, %R1D; mem:ST16[%115](align=8)
        %R0D<def,tied1> = AGHI %R0D<tied0>, -2, %CC<imp-def>
        %R1D<def> = LA %R1D<kill>, 32, %noreg
        BRC 15, 7, <BB#42>, %CC<imp-use,kill>
    Successors according to CFG: BB#43(0x04000000 / 0x80000000 = 3.12%) BB#42(0x7c000000 / 0x80000000 = 96.88%)

Another example: test/Transforms/LoopVectorize/bsd_regex.ll

trunk:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.ind = phi <2 x i64> [ <i64 0, i64 1>, %vector.ph ], [ %vec.ind.next, %vector.body ]
  %step.add = add <2 x i64> %vec.ind, <i64 2, i64 2>
  %0 = shl nsw <2 x i64> %vec.ind, <i64 2, i64 2>
  %1 = shl nsw <2 x i64> %step.add, <i64 2, i64 2>
  %2 = extractelement <2 x i64> %0, i32 0
  %3 = getelementptr inbounds i32, i32* %A, i64 %2
  %4 = extractelement <2 x i64> %0, i32 1
  %5 = getelementptr inbounds i32, i32* %A, i64 %4
  %6 = extractelement <2 x i64> %1, i32 0
  %7 = getelementptr inbounds i32, i32* %A, i64 %6
  %8 = extractelement <2 x i64> %1, i32 1
  %9 = getelementptr inbounds i32, i32* %A, i64 %8
  store i32 4, i32* %3, align 4
  store i32 4, i32* %5, align 4
  store i32 4, i32* %7, align 4
  store i32 4, i32* %9, align 4
  %index.next = add i64 %index, 4
  %vec.ind.next = add <2 x i64> %vec.ind, <i64 4, i64 4>
  %10 = icmp eq i64 %index.next, 10000
  br i1 %10, label %middle.block, label %vector.body, !llvm.loop !0


.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        veslg   %v3, %v0, 2
        vlgvg   %r3, %v3, 0
        sllg    %r3, %r3, 2
        st      %r1, 0(%r3,%r2)
        vlgvg   %r3, %v3, 1
        vag     %v3, %v3, %v1
        sllg    %r3, %r3, 2
        st      %r1, 0(%r3,%r2)
        vlgvg   %r3, %v3, 0
        sllg    %r3, %r3, 2
        vag     %v0, %v0, %v2
        st      %r1, 0(%r3,%r2)
        vlgvg   %r3, %v3, 1
        sllg    %r3, %r3, 2
        aghi    %r0, -4
        st      %r1, 0(%r3,%r2)
        jne     .LBB0_1


w/ patch:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %0 = shl nsw i64 %index, 2
  %1 = shl i64 %index, 2
  %2 = or i64 %1, 4
  %3 = shl i64 %index, 2
  %4 = or i64 %3, 8
  %5 = shl i64 %index, 2
  %6 = or i64 %5, 12
  %7 = getelementptr inbounds i32, i32* %A, i64 %0
  %8 = getelementptr inbounds i32, i32* %A, i64 %2
  %9 = getelementptr inbounds i32, i32* %A, i64 %4
  %10 = getelementptr inbounds i32, i32* %A, i64 %6
  store i32 4, i32* %7, align 4
  store i32 4, i32* %8, align 4
  store i32 4, i32* %9, align 4
  store i32 4, i32* %10, align 4
  %index.next = add i64 %index, 4
  %11 = icmp eq i64 %index.next, 10000
  br i1 %11, label %middle.block, label %vector.body, !llvm.loop !0


.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        sty     %r0, -32(%r1,%r2)
        sty     %r0, -48(%r1,%r2)
        sty     %r0, -16(%r1,%r2)
        st      %r0, 0(%r1,%r2)
        la      %r1, 64(%r1)
        cgfi    %r1, 160048
        jlh     .LBB0_1

Diff Detail

Event Timeline

jonpa created this revision.Apr 24 2017, 1:51 AM
jonpa updated this revision to Diff 99270.May 17 2017, 4:53 AM

(This patch also includes cost modeling of division by a power of 2 for SystemZ, which is separate from the LoopVectorizer scalar addressing change.)

The scalar addressing generation is guarded by the target with a new hook, prefersVectorizedAddressing(), which defaults to true. Only SystemZ, which returns false, is affected by this patch.

A new per-VF set, 'ForcedScalars', is used. The new code inserts the address-computing instructions into it. This means that

  • They are inserted into Scalars so that they get scalarized
  • getInstructionCost() returns scalar cost * VF (scalarization overhead excluded)

I added a comment explaining how these memory instructions relate to the heuristic that at least one instruction must be vectorized in order to proceed with vectorization.

The search is done in setCostBasedWideningDecision(), because these instructions also include loads of addresses, and the widening decision for those loads must be set in that method.
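
As a rough sketch of the cost handling (assuming ForcedScalars maps each VF to its set of forced-scalar instructions; this is illustrative, not the literal diff):

// Sketch only: in getInstructionCost(), a forced-scalar instruction is costed
// as VF copies of its scalar cost, with no extract/insert overhead added.
auto FSIt = ForcedScalars.find(VF);
if (VF > 1 && FSIt != ForcedScalars.end() && FSIt->second.count(I))
  return VF * getInstructionCost(I, /*VF=*/1);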

jonpa updated this revision to Diff 99286.May 17 2017, 6:47 AM

ping!

This patch has now been confirmed on benchmarks, and is ready to commit. See previous comment for a summary.

I added a comment like:
FIXME: Currently widenPHIInstruction() often creates a dead vector
induction variable when the PHI user is scalarized.

I am not sure whether it is justified or in the right place... It is nothing new with this patch.

(The SystemZ part which was unrelated has been removed from this review)

rengolin edited edge metadata.May 17 2017, 7:02 AM

Hi Jonas,

I understand your problem and the SystemZ part is probably fine (I can't review that myself), but I fear introducing such a callback will not address the underlying cause.

What we really need is to know whether the shuffle costs will be higher than the savings, and that should be done by asking for the shuffle costs directly.

I'd assume that targets without scatter/gather support would return higher costs for those operations (probably an order of magnitude higher), so maybe there's an issue in the load/store cost analysis that, if fixed, could considerably simplify this.

cheers,
--renato

include/llvm/Analysis/TargetTransformInfo.h
394

Can't you check for scatter/gather support directly?

jonpa added a comment.May 17 2017, 7:32 AM

Hi Jonas,

I understand your problem and the SystemZ part is probably fine (I can't review that myself), but I fear introducing such a callback will not address the underlying cause.

What we really need is to know whether the shuffle costs will be higher than the savings, and that should be done by asking for the shuffle costs directly.

I'd assume that targets without scatter/gather support would return higher costs for those operations (probably an order of magnitude higher), so maybe there's an issue in the load/store cost analysis that, if fixed, could considerably simplify this.

cheers,
--renato

I don't quite follow what this has to do with vector shuffles...? On SystemZ, all addresses residing in vectors must be extracted before use. (There are slow vector element gather/scatter instructions, which the backend does not use, so they are irrelevant.)

I see your point that there is a theoretical possibility that if an address is computed in vectors, and that sequence of computations is long enough, it might still be cheaper. In practice, however, this seems impossible at the moment, because LSR does not improve vectorized addressing. So even if we computed the cost of the vector instructions plus extracts and compared it to the scalar instructions, it would be inaccurate.

include/llvm/Analysis/TargetTransformInfo.h
394

I didn't want to use isLegalMaskedScatter() / isLegalMaskedGather(), because "masked" has nothing to do with this.
I guess I could instead of this new hook check if getGatherScatterOpCost() returns INT_MAX.

I am not sure however if it will always be true that targets want to keep it this simple. Couldn't it be that a target with such support actually wants scalarized addressing for scalarized/vectorized memory accesses, while still doing gather/scatter whenever possible?

rengolin added subscribers: delena, jmolloy.

I don't quite follow what this has to do with vector shuffles...? On SystemZ, all addresses residing in vectors must be extracted before use. (There are slow vector element gather/scatter instructions, which the backend does not use, so they are irrelevant.)

Right, I see what you mean about the "address computation" being forced into scalar registers. Does that mean that you have to use the same GR to load into all lanes of the vector, so supposedly load+increment+load+increment...?

Regardless, between iterations of the loop, you want to keep the addresses in the GRs throughout the execution of the loop. Is that correct?

I see your point that there is a theoretical possibility that if an address is computed in vectors, and that sequence of computations is long enough, it might still be cheaper. In practice, however, this seems impossible at the moment, because LSR does not improve vectorized addressing. So even if we computed the cost of the vector instructions plus extracts and compared it to the scalar instructions, it would be inaccurate.

Right, so this is not the cost of extract/insert, but the cost of indirect access, which is not the same as the masked scatter/gather that we currently have, I agree.

I'm adding Elena and James to see if they have some more ideas.

I guess I could instead of this new hook check if getGatherScatterOpCost() returns INT_MAX.

That's what I was thinking...

I am not sure however if it will always be true that targets want to keep it this simple. Couldn't it be that a target with such support actually wants scalarized addressing for scalarized/vectorized memory accesses, while still doing gather/scatter whenever possible?

There are two issues to discern here:

  1. If the target has any kind of scatter/gather support.
  2. What is the cost of doing so in a particular case.

The first problem could be solved by having a sub-target feature or similar. The second may need to inspect which instruction we're talking about, etc., which is in line with some changes to the cost functions we've seen recently.

Both solutions could be applied on the same function, for example:

int getGatherScatterOpCost(Instruction *I) {
  int Cost = 1;
  if (!Target->hasScatterGather())
    return INT_MAX;
  // Use I to make sure this is what I think it is...
  ...
  return Cost;
}

Does that make sense? @delena @jmolloy do we even need this level of complexity?

cheers,
--renato

jonpa added a comment.May 17 2017, 8:13 AM

Right, I see what you mean about the "address computation" being forced into scalar registers. Does that mean that you have to use the same GR to load into all lanes of the vector, so supposedly load+increment+load+increment...?

It doesn't have to be the same GR for all the lanes, but it must be a GR (since we don't use the slow "element gather/scatter")

Regardless, between iterations of the loop, you want to keep the addresses in the GRs throughout the execution of the loop. Is that correct?

Yes, that's better than having to extract elements. The extracts are relatively expensive.

Does that make sense?

I am thinking that we are already past the point of making the widening decision when this patch starts to run in setCostBasedWideningDecision(). The only question then is what type of address computation the target wants.

  1. For a gather/scatter access, it is obviously in vector registers - this is something the patch shouldn't change.
  2. For the scalarized/widened/interleaved access, I think it is a general question of preference which may vary from target to target. It seems far-fetched to handle this at instruction level, at least on SystemZ.

BTW, if we indeed want to let targets with gather/scatter support have a say about 2), there should also be a check so that gather/scatter accesses aren't included in AddrDefs, since they shouldn't be scalarized.

Right, I see what you mean about the "address computation" being forced into scalar registers. Does that mean that you have to use the same GR to load into all lanes of the vector, so supposedly load+increment+load+increment...?

It doesn't have to be the same GR for all the lanes, but it must be a GR (since we don't use the slow "element gather/scatter")

Regardless, between iterations of the loop, you want to keep the addresses in the GRs throughout the execution of the loop. Is that correct?

Yes, that's better than having to extract elements. The extracts are relatively expensive.

Ack.

I am thinking that we are already past the point of making the widening decision when this patch starts to run in setCostBasedWideningDecision(). The only question then is what type of address computation the target wants.

  1. For a gather/scatter access, it is obviously in vector registers - this is something the patch shouldn't change.
  2. For the scalarized/widened/interleaved access, I think it is a general question of preference which may vary from target to target. It seems far-fetched to handle this at instruction level, at least on SystemZ.

I think you have a good point there. I may be overthinking it.

I'm ok with the patch as it is, but I'll let it simmer for a bit so that Ulrich, Hal, Elena and James can have a look. I'm happy if they are.

cheers,
--renato

delena edited edge metadata.May 17 2017, 1:54 PM

In general, I think that one-level-up scalarization is good for all targets. In any case, the attempt to scalarize the Ptr should be done if the decision taken is "Scalarize".

lib/Transforms/Vectorize/LoopVectorize.cpp
7301

But interleaved loads are also here. You don't need to scalarize their pointers.

7305

You will find the getPointerOperand(Inst) utility in this file. It retrieves the pointer from a load or store.

7322

In your examples you show only one level up. I don't believe that we have too many real cases of multiple levels of address indirection, but theoretically it may not be so profitable if you go many levels up.

7322

The scalarized instruction should belong to the same basic block and, probably, have only one user.

7331

At this point you change the widening decision of an already analyzed load instruction. Make sure that this instruction is not part of an interleave group.

jonpa marked an inline comment as done.May 18 2017, 8:52 AM

Elena, thanks for a nice review! I'll wait for your comments before updating the patch.

I have tried all your points and explain my findings inline below.

In general, I think that one-level-up scalarization is good for all targets.

You mean to handle just the case where we know this only changes one instruction's decision to CM_Scalarize?

lib/Transforms/Vectorize/LoopVectorize.cpp
7301

the attempt to scalarize the Ptr should be done if the decision taken is "Scalarize".

I tried

if (PtrDef && TheLoop->contains(PtrDef) &&
    getWideningDecision(&I, VF) == CM_Scalarize)
  AddrDefs.insert(PtrDef);

This was a minor change: only 17 loops became different. This seems to relate to the cases of mixed context (used both as address and otherwise), which means that instead of extracting an address we now do inserts.

Not sure exactly if this matters, but from what I can see, it seemed slightly better in terms of final MIs to scalarize "anything" (not just the scalarized case), perhaps because of the big benefit of LSR working on scalar addresses?

If you don't have a strong opinion, I guess I would be most happy not to change anything here.

7305

Aah, thanks.

7322

In your examples you show only one level up. I don't believe that we have too many real cases of multiple levels of address indirection, but theoretically it may not be so profitable if you go many levels up.

In my experience, if there are many levels of address computation stages, I think it would still be wise to scalarize them, let this become "too expensive" in those cases, and let LSR do its magic, rather than vectorize complex addressing. In practice, benchmarks show that about the same number of loops still get vectorized, so this shouldn't be common.

I still think it is a pity to see vectorized code that gets no LSR treatment -- feel free to join the discussion at https://reviews.llvm.org/D32166 and share any thoughts...

The scalarized instruction should belong to the same Basic block

Changing the check from "contained in the loop" to "contained in the same BB" actually gave identical results...

and , probably, have only one user.

I have experimented with something like this and found a few cases where an address computation was also used in another context. What I did then was an iterative pruning from AddrDefs of any instructions used outside of AddrDefs. It was quite rare and did not seem to matter much, and I left it like this with the idea that maybe it is best to help the memory access be fast, but this is not confirmed.

I tried this now - just adding "InstOp->hasOneUse() &&" here - and found that on SPEC, 73 loops were affected. This seems quite clearly to be for the worse, with just more extracts, so if you do not have a very strong opinion, I would like to leave this alone...

7331

I tried to only change the widening decision if it was CM_Widen, and reran SPEC. 17 loops (0.6% of vectorized loops) now got different output, and all but 2 of them got slightly worse results. So this isn't a big deal, but still I would insist that there isn't a good reason here to interleave loads of addresses. They are 64 bits, which means 2 per vector register, so there isn't that much room for vectorization anyway.

Here's an example:

Loop before vectorize pass:

for.body61:                                       ; preds = %for.body61, %for.body61.lr.ph
  %indvars.iv81 = phi i64 [ %indvars.iv.next82, %for.body61 ], [ 0, %for.body61.lr.ph ]
  %next65 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %indvars.iv81, i32 1
  %20 = load i32, i32* %next65, align 4, !tbaa !11
  %idxprom66 = sext i32 %20 to i64
  %arrayidx67 = getelementptr inbounds i32, i32* %1, i64 %idxprom66
  %21 = load i32, i32* %arrayidx67, align 4, !tbaa !7
  store i32 %21, i32* %next65, align 4, !tbaa !11
  %indvars.iv.next82 = add nuw nsw i64 %indvars.iv81, 1
  %cmp59 = icmp slt i64 %indvars.iv81, %16
  br i1 %cmp59, label %for.body61, label %for.cond75.preheader.loopexit

PATCH UNMODIFIED:

Loop after vectorize pass:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
  %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
  %20 = add i64 %index, 0
  %21 = add i64 %index, 1
  %22 = add i64 %index, 2
  %23 = add i64 %index, 3
  %24 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %20, i32 1
  %25 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %21, i32 1
  %26 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %22, i32 1
  %27 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %23, i32 1
  %28 = load i32, i32* %24, align 4, !tbaa !11
  %29 = load i32, i32* %25, align 4, !tbaa !11
  %30 = load i32, i32* %26, align 4, !tbaa !11
  %31 = load i32, i32* %27, align 4, !tbaa !11
  %32 = sext i32 %28 to i64
  %33 = sext i32 %29 to i64
  %34 = sext i32 %30 to i64
  %35 = sext i32 %31 to i64
  %36 = getelementptr inbounds i32, i32* %1, i64 %32
  %37 = getelementptr inbounds i32, i32* %1, i64 %33
  %38 = getelementptr inbounds i32, i32* %1, i64 %34
  %39 = getelementptr inbounds i32, i32* %1, i64 %35
  %40 = load i32, i32* %36, align 4, !tbaa !7
  %41 = load i32, i32* %37, align 4, !tbaa !7
  %42 = load i32, i32* %38, align 4, !tbaa !7
  %43 = load i32, i32* %39, align 4, !tbaa !7
  store i32 %40, i32* %24, align 4, !tbaa !11
  store i32 %41, i32* %25, align 4, !tbaa !11
  store i32 %42, i32* %26, align 4, !tbaa !11
  store i32 %43, i32* %27, align 4, !tbaa !11
  %index.next = add i64 %index, 4
  %44 = icmp eq i64 %index.next, %n.vec
  br i1 %44, label %middle.block, label %vector.body, !llvm.loop !12

->

Final Header: 
BB#22: derived from LLVM BB %vector.body
    Live Ins: %R0D %R1D %R2D %R3D %R4D %R12D %R13D
    Predecessors according to CFG: BB#21 BB#23
        %R5D<def> = LGF %R4D, -32, %noreg; mem:LD4[%scevgep112](tbaa=!21)
        %R14D<def> = LGF %R4D, -40, %noreg; mem:LD4[%scevgep113](tbaa=!21)
        %R5D<def> = SLLG %R5D<kill>, %noreg, 2
        %R14D<def> = SLLG %R14D<kill>, %noreg, 2
        %R5L<def> = L %R12D, 0, %R5D<kill>; mem:LD4[%41](tbaa=!2)
        %R5H<def> = LFH %R12D, 0, %R14D<kill>; mem:LD4[%40](tbaa=!2)
        %R14D<def> = LGF %R4D, -48, %noreg; mem:LD4[%scevgep114](tbaa=!21)
        %R11D<def> = LGF %R4D, -56, %noreg; mem:LD4[%scevgep115](tbaa=!21)
        %R14D<def> = SLLG %R14D<kill>, %noreg, 2
        %R11D<def> = SLLG %R11D<kill>, %noreg, 2
        %R14L<def> = L %R12D, 0, %R14D<kill>; mem:LD4[%39](tbaa=!2)
        %R14H<def> = LFH %R12D, 0, %R11D<kill>; mem:LD4[%38](tbaa=!2)
        STFH %R14H<kill>, %R4D, -56, %noreg; mem:ST4[%scevgep115](tbaa=!21)
        STY %R14L<kill>, %R4D, -48, %noreg; mem:ST4[%scevgep114](tbaa=!21)
        STFH %R5H<kill>, %R4D, -40, %noreg; mem:ST4[%scevgep113](tbaa=!21)
        STY %R5L<kill>, %R4D, -32, %noreg; mem:ST4[%scevgep112](tbaa=!21)
        CGIJ %R3D, 0, 8, <BB#24>, %CC<imp-def,dead>

instr-count:  17

MODIFIED TO NOT TOUCH INTERLEAVED LOADS:

Loop after vectorize pass:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
  %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
  %20 = add i64 %index, 0
  %21 = add i64 %index, 1
  %22 = add i64 %index, 2
  %23 = add i64 %index, 3
  %24 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %20, i32 1
  %25 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %21, i32 1
  %26 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %22, i32 1
  %27 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %23, i32 1
  %28 = getelementptr i32, i32* %24, i32 0
  %29 = bitcast i32* %28 to <8 x i32>*
  %wide.vec = load <8 x i32>, <8 x i32>* %29, align 4, !tbaa !11
  %strided.vec = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %30 = extractelement <4 x i32> %strided.vec, i32 0
  %31 = sext i32 %30 to i64
  %32 = extractelement <4 x i32> %strided.vec, i32 1
  %33 = sext i32 %32 to i64
  %34 = extractelement <4 x i32> %strided.vec, i32 2
  %35 = sext i32 %34 to i64
  %36 = extractelement <4 x i32> %strided.vec, i32 3
  %37 = sext i32 %36 to i64
  %38 = getelementptr inbounds i32, i32* %1, i64 %31
  %39 = getelementptr inbounds i32, i32* %1, i64 %33
  %40 = getelementptr inbounds i32, i32* %1, i64 %35
  %41 = getelementptr inbounds i32, i32* %1, i64 %37
  %42 = load i32, i32* %38, align 4, !tbaa !7
  %43 = load i32, i32* %39, align 4, !tbaa !7
  %44 = load i32, i32* %40, align 4, !tbaa !7
  %45 = load i32, i32* %41, align 4, !tbaa !7
  store i32 %42, i32* %24, align 4, !tbaa !11
  store i32 %43, i32* %25, align 4, !tbaa !11
  store i32 %44, i32* %26, align 4, !tbaa !11
  store i32 %45, i32* %27, align 4, !tbaa !11
  %index.next = add i64 %index, 4
  %46 = icmp eq i64 %index.next, %n.vec
  br i1 %46, label %middle.block, label %vector.body, !llvm.loop !12

Final Header: 
BB#22: derived from LLVM BB %vector.body
    Live Ins: %R0D %R1D %R2D %R3D %R4D %R12D %R13D
    Predecessors according to CFG: BB#21 BB#23
        %V0<def> = VL %R4D, 16, %noreg; mem:LD16[%lsr.iv106113+16](align=4)(tbaa=!21)
        %R5D<def> = VLGVF %V0, %noreg, 2
        %R5D<def> = LGFR %R5L<kill>, %R5D<imp-use,kill>
        %R5D<def> = SLLG %R5D<kill>, %noreg, 2
        %R5L<def> = L %R12D, 0, %R5D<kill>; mem:LD4[%44](tbaa=!2)
        %R14D<def> = VLGVF %V0<kill>, %noreg, 0
        %V0<def> = VL %R4D, 0, %noreg; mem:LD16[%lsr.iv106113](align=4)(tbaa=!21)
        %R14D<def> = LGFR %R14L<kill>, %R14D<imp-use,kill>
        %R14D<def> = SLLG %R14D<kill>, %noreg, 2
        %R5H<def> = LFH %R12D, 0, %R14D<kill>; mem:LD4[%43](tbaa=!2)
        %R14D<def> = VLGVF %V0, %noreg, 0
        %R14D<def> = LGFR %R14L<kill>, %R14D<imp-use,kill>
        %R14D<def> = SLLG %R14D<kill>, %noreg, 2
        %R14L<def> = L %R12D, 0, %R14D<kill>; mem:LD4[%41](tbaa=!2)
        %R11D<def> = VLGVF %V0<kill>, %noreg, 2
        %R11D<def> = LGFR %R11L<kill>, %R11D<imp-use,kill>
        %R11D<def> = SLLG %R11D<kill>, %noreg, 2
        %R14H<def> = LFH %R12D, 0, %R11D<kill>; mem:LD4[%42](tbaa=!2)
        STFH %R14H<kill>, %R4D, 8, %noreg; mem:ST4[%scevgep117](tbaa=!21)
        ST %R14L<kill>, %R4D, 0, %noreg; mem:ST4[%lsr.iv106108](tbaa=!21)
        STFH %R5H<kill>, %R4D, 16, %noreg; mem:ST4[%scevgep116](tbaa=!21)
        ST %R5L<kill>, %R4D, 24, %noreg; mem:ST4[%scevgep115](tbaa=!21)
        CGIJ %R3D, 0, 8, <BB#24>, %CC<imp-def,dead>

instr-count:  23

I experimented with scalarizing the whole interleave group instead, like:

for (auto *I : AddrDefs) {
  if (isa<LoadInst>(I)) {
    if (getWideningDecision(I, VF) == CM_Widen)
      // Scalarize a widened load of address.
      setWideningDecision(I, VF, CM_Scalarize,
                          (VF * getMemoryInstructionCost(I, 1)));
    else if (auto Group = Legal->getInterleavedAccessGroup(I)) {
      // Scalarize an interleave group of address loads.
      for (unsigned I = 0; I < Group->getFactor(); ++I) {
        if (Instruction *Member = Group->getMember(I))
          setWideningDecision(Member, VF, CM_Scalarize,
                              (VF * getMemoryInstructionCost(Member, 1)));
      }
    }
  } else
    // Make sure I gets scalarized and a cost estimate without
    // scalarization overhead.
    ForcedScalars[VF].insert(I);
}

This was also very close to the original patch - just 5 new loops now became vectorized, which seems to be a slight improvement in those loops. This is because the cost is now accurate for the scalarized loads.

So to summarize, this really doesn't matter that much on SPEC, but it seems that the best thing to do is to scalarize the whole interleave group, second best is to just scalarize (unmodified patch), and third place is to interleave loads of addresses, which is not so good since we then have to do extracts...

delena added inline comments.May 19 2017, 12:56 AM
include/llvm/Analysis/TargetTransformInfo.h
394
"Couldn't it be that a target with such support actually wants scalarized addressing for scalarized/vectorized memory accesses, while still doing gather/scatter whenever possible?"

Scalarizing memory addressing before gather/scatter does not make sense on X86. I can't say anything about other targets.

394

"I didn't want to use isLegalMaskedScatter() / isLegalMaskedGather(), because "masked" has nothing to do with this."
The API could be extended. We can add isLegalGather()/isLegalScatter() and it will be consistent and clear.
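
Hypothetically, the new hooks would just mirror the existing masked variants, e.g.:

// Hypothetical unmasked counterparts of isLegalMaskedGather() /
// isLegalMaskedScatter(); they do not exist in TTI as of this review.
bool isLegalGather(Type *DataType);
bool isLegalScatter(Type *DataType);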

include/llvm/Analysis/TargetTransformInfoImpl.h
234

I assume that X86 would prefer to scalarize addressing for scalarized memops. But it is not clear how to use this interface...

lib/Transforms/Vectorize/LoopVectorize.cpp
2107

Maybe we can add these instructions to the widening decisions map? This map contains only memops now, right?

7301

I just do not understand why you need to scalarize addressing if the instruction itself is not scalarized.
Can you give me an IR example?

7331

You can't change the decision of one load from the interleave group - it's either all or nothing.

7331
> "So this isn't a big deal, but still I would insist that there isn't a good reason here to interleave loads of addresses. "

This is a cost model decision. Interleaving of 64-bit loads is profitable on X86.

jonpa marked an inline comment as done.May 20 2017, 12:32 AM
jonpa added inline comments.
include/llvm/Analysis/TargetTransformInfo.h
394

Scalarizing memory addressing before gather/scatter does not make sense on X86. I can't say anything about other targets.

I meant that this could be applied to cases *other* than gather/scatter.

394

I suppose that "preferrsVectorizedAddressing()" could be named something else, but the point with this is currently that it is separate from deciding on the widening decision, which means that even though it currently is only a way to enable this for SystemZ only, it could become also for other targets a way to get scalarized addressing for *non* gather/scatter accesses (or CM_Scalarize only, perhaps).

So first, the widening decision is made, which includes calling isLegalScatter/Gather or similar. With all memory accesses mapped to such a decision, we then ask whether, for any non-gather/scatter access, the target wants the current vectorized addressing or would rather keep it scalar.

include/llvm/Analysis/TargetTransformInfoImpl.h
234

OK - so far this patch only works for SystemZ. As discussed already, we will have to improve it somehow if X86 has an interest.

lib/Transforms/Vectorize/LoopVectorize.cpp
2107

They are not just CM_Scalarize, because they don't have any scalarization overhead.
(With a more intelligent handling of scalarized instruction sequences I guess this would become equivalent)

ForcedScalars contains not just memory instructions, but any other instruction leading to an address.

7301

Again, "This seems to relate to the cases of mixed context (used both as address and otherwise)"...

For example:

for.body.i:                                       ; preds = %for.body.i, %for.body.preheader.i
  %indvars.iv241.i = phi i64 [ %47, %for.body.preheader.i ], [ %indvars.iv.next242.i, %for.body.i ]
  %indvars.iv.next242.i = add nsw i64 %indvars.iv241.i, -1
  %arrayidx.i.i = getelementptr inbounds i32, i32* %42, i64 %indvars.iv.next242.i
  %48 = trunc i64 %indvars.iv.next242.i to i32
  store i32 %48, i32* %arrayidx.i.i, align 4, !tbaa !47
  %tobool.i = icmp eq i32 %48, 0
  br i1 %tobool.i, label %for.end.i.loopexit, label %for.body.i

This is a mixed use where the truncated IV is stored at the address indexed by the IV. Due to the non-address use, it will be contained in a vector.

7331

You can't change decision of one load from the interleave group - or all, or nothing.

Ok - I will update the patch then to scalarize the whole interleave group, which seemed best also.

7331

This is a cost model decision. Interleaving of 64-bit loads is profitable on X86.

It would have been nice to just return a prohibitively high value in getInterleavedMemoryOpCost() for loads of addresses, but I found that there are cases where an index is loaded, so it is not just a matter of checking whether the passed type is a pointer type to catch this.

An alternative would then be to pass the instruction pointer into that method and do a search there to see if the loaded value is involved in addressing. That way, the widening decision for such a load would always be CM_Scalarize before this patch starts to work. This is, however, just extra work compared to the current patch, which already finds those instructions.

Perhaps another alternative would be an extra boolean argument to that method and to getMemoryOpCost(), something like IsInvolvedInAddressing, so that this patch could call them again to check whether they return a very high value. This would then be done at the point where the patch is scalarizing things.

jonpa added inline comments.May 20 2017, 11:27 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
7301

In this example, it is the same instruction (the store) that uses the value both as an address and as a general value. There are also other cases where it may be used in a store but also in a different instruction, like an arithmetic one...

I would like to point out that this is actually wanted right now, since performance measurements show better results on SystemZ, at least on one benchmark, by doing this "as much as possible", meaning that any mixed use like this gets included in this transformation. (I have actually tried this version of the patch and compared it to a less aggressive version that does not touch mixed uses, and the more aggressive one turned out to be better...)

jonpa updated this revision to Diff 100055.May 24 2017, 2:35 AM

Elena,

I have updated the patch based on your review. The changes are as follows (a simplified sketch of the resulting search appears after the list):

  • Use getPointerOperand()
  • Only handle memory accesses which *are not gather/scatter*. This way, SystemZ gets the behavior it wants, and other targets such as x86 can enable this without disrupting gather/scatter accesses.
  • Check for operand in same BB, instead of inside loop (currently completely equivalent)
  • Handle the whole interleave group when changing the widening decision. Also added a comment as to why this is not (currently) handled by the cost functions.
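
For reference, a simplified sketch of the resulting search in setCostBasedWideningDecision() (details may differ from the actual diff; the interleave-group handling itself is as in the earlier snippet):

// Sketch only, based on the points above.
// Collect loop-defined pointer operands of non-gather/scatter accesses.
SmallPtrSet<Instruction *, 8> AddrDefs;
for (BasicBlock *BB : TheLoop->blocks())
  for (Instruction &I : *BB) {
    Instruction *PtrDef =
        dyn_cast_or_null<Instruction>(getPointerOperand(&I));
    if (PtrDef && TheLoop->contains(PtrDef) &&
        getWideningDecision(&I, VF) != CM_GatherScatter)
      AddrDefs.insert(PtrDef);
  }

// Transitively add the feeding instructions, staying within the same BB.
SmallVector<Instruction *, 8> Worklist(AddrDefs.begin(), AddrDefs.end());
while (!Worklist.empty()) {
  Instruction *I = Worklist.pop_back_val();
  for (Value *Op : I->operands())
    if (auto *InstOp = dyn_cast<Instruction>(Op))
      if (InstOp->getParent() == I->getParent() && !isa<PHINode>(InstOp) &&
          AddrDefs.insert(InstOp).second)
        Worklist.push_back(InstOp);
}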

Does this look ok to you now?

delena added inline comments.May 24 2017, 3:06 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7308

At this point you may have an "interleave" decision. I assume you have nothing to do with it.
I'd check getWideningDecision(&I, VF) == CM_Scalarize

7321

AddrDefs.insert(InstOp).second == true) -> AddrDefs.count(InstOp)

jonpa added inline comments.May 24 2017, 3:20 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7308

As explained before, there are just a few loops that this relates to. When I tried to handle only CM_Scalarize like you suggested, it changed just 17 loops, but it seemed to be just a bit better to handle the interleaved accesses as well.

I believe in this case the address is scalar, but that register is also used for something else. That means it ends up in a vector register and has to be extracted.

The reason I would also like to see this is, as explained before, that due to LSR it is generally better, all else being equal, to keep addressing scalar.

7321

No - I want to insert InstOp into AddrDefs only if it isn't already there, right?

The code LGTM, I do not have more comments on it. I can't say anything about the profitability of this algorithm for SystemZ; maybe you need an additional LGTM from one of the SystemZ developers?

lib/Transforms/Vectorize/LoopVectorize.cpp
7308

Ok. You scalarize the address calculation even for wide loads. It probably makes sense for SystemZ.

7321

ok, you are right!

uweigand accepted this revision.May 24 2017, 6:09 AM

The code LGTM, I do not have more comments on it. I can't say anything about the profitability of this algorithm for SystemZ; maybe you need an additional LGTM from one of the SystemZ developers?

We've seen that this is profitable in measurements, so if you're fine with the implementation, the patch LGTM as well.

This revision is now accepted and ready to land.May 24 2017, 6:09 AM
jonpa closed this revision.May 24 2017, 6:43 AM

Thanks for review, everyone.
r303744.

What test suite did you use to demonstrate performance?
