Page MenuHomePhabricator

[LoopVectorize] Simplify scalar cost calculation in getInstructionCost
ClosedPublic

Authored by david-arm on Mar 12 2021, 7:50 AM.

Details

Summary

This patch simplifies the calculation of certain costs in
getInstructionCost when isScalarAfterVectorization() returns a true value.
There are a few places where we multiply a cost by a number N, i.e.

unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
return N * TTI.getArithmeticInstrCost(...

After some investigation it seems that there are only these cases that occur
in practice:

  1. VF is a scalar, in which case N = 1.
  2. VF is a vector. We can only get here if: a) the instruction is a GEP/bitcast with scalar uses, or b) this is an update to an induction variable that remains scalar.

I have changed the code so that N is assumed to always be 1. For GEPs
the cost is always 0, since this is calculated later on as part of the
load/store cost. For all other cases I have added an assert that none of the
users needs scalarising, which didn't fire in any unit tests.

Only one test required fixing and I believe the original cost for the scalar
add instruction to have been wrong, since only one copy remains after
vectorisation.

Diff Detail

Event Timeline

david-arm created this revision.Mar 12 2021, 7:50 AM
david-arm requested review of this revision.Mar 12 2021, 7:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 12 2021, 7:50 AM

I agree that the IV update in the test should have a cost of 1. (It seems strange it doesn't already, I suspect it isn't matched the same as an increment 1).

Would it always be true for the addresses of scalarized loads/stores? They do need to calculate N addresses.

This didn't show any change in the tests I ran on it, but there is potentially not a lot in there that would scalarize memops.

Hi @dmgreen, thanks for taking a look at the patch. So you raise a good point - the loads/stores will be dealt with before reaching this variant getInstructionCost, however the GEPs/Bitcasts for those will still be costed here. However, for GEPs at least you can see this code in getInstructionCost:

case Instruction::GetElementPtr:
  // We mark this instruction as zero-cost because the cost of GEPs in
  // vectorized code depends on whether the corresponding memory instruction
  // is scalarized or not. Therefore, we handle GEPs with the memory
  // instruction cost.
  return 0;

which only leaves us with Bitcasts with pointer type results. Normal ptr-ptr bitcasts will be 0 cost I think, so I imagine what's left are bitcasts of non-ptr types to ptr types. It's very unfortunate that the Scalars list includes so many different types of Scalars that need costing differently. What I could do here is have a single special case for Bitcasts where the cost (likely 0) is multiplied by VF, with an assertion that we're not dealing with scalable vectors.

I guess for bitcasts the cost may be non-trivial for big endian targets.

fhahn added a subscriber: fhahn.Mar 15 2021, 3:25 AM

I agree that the IV update in the test should have a cost of 1. (It seems strange it doesn't already, I suspect it isn't matched the same as an increment 1).

Would it always be true for the addresses of scalarized loads/stores? They do need to calculate N addresses.

Hmm, but if there are uses of the IV in this case, don't we need to generate a scalar value for each element in the VF? E.g consider the example below

define void @test(half* %b, i32 %n) {
entry:
  br label %for.body

for.body:
  %i = phi i32 [ 0, %entry ], [ %i.next, %for.body ]
  %tmp0 = getelementptr half, half* %b, i32 %i
  %lv = load half, half* %tmp0
  %add = fadd half %lv, 40.0
  store half %add, half* %tmp0, align 1
  %i.next = add nuw nsw i32 %i, 3
  %cond = icmp eq i32 %i.next, %n
  br i1 %cond, label %for.end, label %for.body

for.end:
  ret void
}

Generates the vector body below with -S -force-vector-width=4 -mtriple=arm64-apple-iphoneos. Note %5 = add i32 %offset.idx, 0, %6 = add i32 %offset.idx, 3, %7 = add i32 %offset.idx, 6 & %8 = add i32 %offset.idx, 9. With this change, the cost of the induction increment is 1, but scalarization requires us to create 4 scalar increments. I don't think the cost of that is accounted elsewhere?

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %offset.idx = mul i32 %index, 3
  %5 = add i32 %offset.idx, 0
  %6 = add i32 %offset.idx, 3
  %7 = add i32 %offset.idx, 6
  %8 = add i32 %offset.idx, 9
  %9 = getelementptr half, half* %b, i32 %5
  %10 = getelementptr half, half* %b, i32 %6
  %11 = getelementptr half, half* %b, i32 %7
  %12 = getelementptr half, half* %b, i32 %8
  %13 = getelementptr half, half* %9, i32 0
  %14 = bitcast half* %13 to <12 x half>*
  %wide.vec = load <12 x half>, <12 x half>* %14, align 2
  %strided.vec = shufflevector <12 x half> %wide.vec, <12 x half> poison, <4 x i32> <i32 0, i32 3, i32 6, i32 9>
  %15 = fadd <4 x half> %strided.vec, <half 0xH5100, half 0xH5100, half 0xH5100, half 0xH5100>
  %16 = extractelement <4 x half> %15, i32 0
  store half %16, half* %9, align 1
  %17 = extractelement <4 x half> %15, i32 1
  store half %17, half* %10, align 1
  %18 = extractelement <4 x half> %15, i32 2
  store half %18, half* %11, align 1
  %19 = extractelement <4 x half> %15, i32 3
  store half %19, half* %12, align 1
  %index.next = add i32 %index, 4
  %20 = icmp eq i32 %index.next, %n.vec
  br i1 %20, label %middle.block, label %vector.body, !llvm.loop !0
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7413–7415

I think conceptually there's nothing preventing us from reaching this code with an instruction that needs to be scalarized if the VF > 1. It might not occur in practice for some/most backends at the moment, because most of the opcodes above can be widened on most backends and we do not really properly track which instructions will be scalarized at this point, except for pointer & induction variables.

If we make the code less general, I think we should have at least an assertion to make sure we never reach this code when VF > 1 && scalarization is requested, so we at least notice once it really matters and we need to be more general again.

Thanks for the review @fhahn. If anything this just highlights the poor code coverage we have at the moment for cost calculations when scalarising. At the very least I'll add some tests that cover some of these corner cases, since there was nothing that really broke when I made this change!

Although I think the adds you mentioned above @fhahn don't fall into the Scalars variable - I'll double check this though. This patch only affects instructions that are members of Scalars. Users of the IV don't fall into that - only updates to the IV I think.

Hi @fhahn, I tested the example you pasted above before and after applying my patch and got this debug output:

Before:
LV: Found an estimated cost of 0 for VF 4 For instruction:   %i = phi i32 [ 0, %entry ], [ %i.next, %for.body ]
LV: Found an estimated cost of 0 for VF 4 For instruction:   %tmp0 = getelementptr half, half* %b, i32 %i
LV: Found an estimated cost of 3 for VF 4 For instruction:   %lv = load half, half* %tmp0, align 2
LV: Found an estimated cost of 2 for VF 4 For instruction:   %add = fadd half %lv, 0xH5100
LV: Found an estimated cost of 23 for VF 4 For instruction:   store half %add, half* %tmp0, align 1
LV: Found an estimated cost of 4 for VF 4 For instruction:   %i.next = add nuw nsw i32 %i, 3
LV: Found an estimated cost of 1 for VF 4 For instruction:   %cond = icmp eq i32 %i.next, %n
LV: Found an estimated cost of 0 for VF 4 For instruction:   br i1 %cond, label %for.end, label %for.body

After:
LV: Found an estimated cost of 0 for VF 4 For instruction:   %i = phi i32 [ 0, %entry ], [ %i.next, %for.body ]
LV: Found an estimated cost of 0 for VF 4 For instruction:   %tmp0 = getelementptr half, half* %b, i32 %i
LV: Found an estimated cost of 3 for VF 4 For instruction:   %lv = load half, half* %tmp0, align 2
LV: Found an estimated cost of 2 for VF 4 For instruction:   %add = fadd half %lv, 0xH5100
LV: Found an estimated cost of 23 for VF 4 For instruction:   store half %add, half* %tmp0, align 1
LV: Found an estimated cost of 1 for VF 4 For instruction:   %i.next = add nuw nsw i32 %i, 3
LV: Found an estimated cost of 1 for VF 4 For instruction:   %cond = icmp eq i32 %i.next, %n
LV: Found an estimated cost of 0 for VF 4 For instruction:   br i1 %cond, label %for.end, label %for.body

It looks like the only change is the IV update which goes from 4 -> 1. I think the adds you mentioned are included as part of the load/store cost.

fhahn added a comment.Mar 15 2021, 4:19 AM

It looks like the only change is the IV update which goes from 4 -> 1. I think the adds you mentioned are included as part of the load/store cost.

I am not sure. From a quick look at the code, it seems like the high load/store cost comes mostly from the required scalarization of the stored/loaded values and the cost of the extra increments seems not fully included at least on the IR instruction level. Unfortunately I don't think there's really any different option to write a test that hits the code paths in the patch, even if the scalarized values are used in other places later on. So it's probably not worth worrying too much about it. I still think we should at least have an assertion as mentioned above, to catch the gap if the cost-modeling is improved.

david-arm added a comment.EditedMar 15 2021, 6:17 AM

It looks like the only change is the IV update which goes from 4 -> 1. I think the adds you mentioned are included as part of the load/store cost.

I am not sure. From a quick look at the code, it seems like the high load/store cost comes mostly from the required scalarization of the stored/loaded values and the cost of the extra increments seems not fully included at least on the IR instruction level. Unfortunately I don't think there's really any different option to write a test that hits the code paths in the patch, even if the scalarized values are used in other places later on. So it's probably not worth worrying too much about it. I still think we should at least have an assertion as mentioned above, to catch the gap if the cost-modeling is improved.

Hi @fhahn I guess I'm just a bit confused as to what assert you're asking for here to be honest. We do reach the Add case for the scalar IV increment (that stays scalar after vectorisation) that happens in your test case above, where VF > 1. What assert are you proposing here? Something like:

assert(VF == 1 || !isScalarAfterVectorization(Instruction)); ?

If so, I think that this is guaranteed to fail with your test case. Alternatively, something like this should work I think:

assert(VF == 1 || !InstsToScalarize.count(Instruction));

since scalarised vector instructions should not get into this function.

In your test above the other adds that appear near the start of the loop after vectorisation I believe are due to expansion of the GEP during vectorisation, rather than at the cost phase. The cost for the GEP is explicitly 0 in getInstructionCost and the comment says that the cost is calculated as part of the load/store cost. The purpose of this patch isn't really to fix/improve the cost calculation of loads/stores by taking the extra generated adds into account - it's mainly just to simplify the calculation of costs for the few cases where isScalarAfterVectorization returns true.

It looks like the only change is the IV update which goes from 4 -> 1. I think the adds you mentioned are included as part of the load/store cost.

I am not sure. From a quick look at the code, it seems like the high load/store cost comes mostly from the required scalarization of the stored/loaded values and the cost of the extra increments seems not fully included at least on the IR instruction level. Unfortunately I don't think there's really any different option to write a test that hits the code paths in the patch, even if the scalarized values are used in other places later on. So it's probably not worth worrying too much about it. I still think we should at least have an assertion as mentioned above, to catch the gap if the cost-modeling is improved.

Hi @fhahn I guess I'm just a bit confused as to what assert you're asking for here to be honest. We do reach the Add case for the scalar IV increment (that stays scalar after vectorisation) that happens in your test case above, where VF > 1. What assert are you proposing here? Something like:

assert(VF == 1 || !isScalarAfterVectorization(Instruction)); ?

If so, I think that this is guaranteed to fail with your test case. Alternatively, something like this should work I think:

assert(VF == 1 || !InstsToScalarize.count(Instruction));

since scalarised vector instructions should not get into this function.

Yes I think if we would use !isScalarAfterVectorization(Instruction) in the assertion, we would at least need to also allow VF != 1 for induction updates, but that's going to be a bit hard to check. So checking InstsToScalarize is probably the best option for now

In your test above the other adds that appear near the start of the loop after vectorisation I believe are due to expansion of the GEP during vectorisation, rather than at the cost phase. The cost for the GEP is explicitly 0 in getInstructionCost and the comment says that the cost is calculated as part of the load/store cost. The purpose of this patch isn't really to fix/improve the cost calculation of loads/stores by taking the extra generated adds into account - it's mainly just to simplify the calculation of costs for the few cases where isScalarAfterVectorization returns true.

That should be fine. Could you also add a comment explaining why we do not need to scale if isScalarAfterVectorization currently.

david-arm updated this revision to Diff 332257.Mar 22 2021, 5:29 AM
  • Added an assert that if an instruction remains scalar after vectorisation, then both the instruction and all it's users will not be scalarised. If any users were to be scalarised then potentially that means multiple copies of the instruction will be generated. In the case of GEPs the cost is actually calculated as part of the load/store cost.
david-arm edited the summary of this revision. (Show Details)Mar 22 2021, 5:30 AM
david-arm edited the summary of this revision. (Show Details)
fhahn accepted this revision.Mar 23 2021, 2:29 PM

LGTM, thanks! Please wait a day or 2 with committing in case there are additional comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7253

nit: Change to have an early exit for VF.isScalar?

7267

nit: I'd just define hasSingleCopyAfterVectorization as lambda here, because the comment already explains the intent.

This revision is now accepted and ready to land.Mar 23 2021, 2:29 PM
sdesmalen added inline comments.Mar 24 2021, 2:15 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7250

I'd prefer to either remove the ifndef here so that it's always defined (and the compiler can optimize it away if it isn't called), or implementing the lambda suggestion from Florian.

7560–7561

This should probably return TTI.getArithmeticInstrCost(..) now, because the VF.getFixedValue() multiplier and possible scalarization overhead are now represented as part of VectorTy and the call to TTI.getArithmeticInstrCost. There be no reason why this default opcode should behave differently from the other opcodes now that you've done the work to verify the right type is passed to the TTI.get*Cost calls.

david-arm updated this revision to Diff 333001.Mar 24 2021, 8:16 AM
david-arm marked 2 inline comments as done.
This revision was landed with ongoing or failed builds.Mar 26 2021, 4:27 AM
This revision was automatically updated to reflect the committed changes.