This is an archive of the discontinued LLVM Phabricator instance.

[LV] Use the known trip count when costing non-tail folded VFs
ClosedPublic

Authored by dmgreen on Apr 6 2023, 9:29 AM.

Details

Summary

Now that we store the ScalarCost in the VectorizationFactor it is possible to use it to get a slightly more accurate cost in isMoreProfitable between two vector factors. This extends the logic added in D101726 to non-tail-folded cases, using the costs of VecCost * (TripCount / VF) + ScalarCost * (TripCount % VF) to compare VFs where the TripCount is known.

This shouldn't alter very much as small trip counts are usually not vectorized, but does seem to help in the testcase where 4 * VF4 is chosen as profitable compared to 2 * VF8 + 4* scalar.

Diff Detail

Event Timeline

dmgreen created this revision.Apr 6 2023, 9:29 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 6 2023, 9:29 AM
dmgreen requested review of this revision.Apr 6 2023, 9:29 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 6 2023, 9:29 AM
david-arm added inline comments.Apr 19 2023, 6:46 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5406

Hi @dmgreen, perhaps I've missed something here but it looks like there is a divide-by-zero in the code?

If A.Width.getFixedValue() is non-zero then we calculate (CostA * divideCeil(MaxTripCount, A.Width.getFixedValue())), and if it is zero then we do (CostA * (MaxTripCount / A.Width.getFixedValue()) + ...

dmgreen added inline comments.Apr 21 2023, 5:53 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5406

Oh yeah, that was wrong. It was meant to be based on foldTailByMasking. That must have been confusing. I've reran the testing and it still looks OK, this patch doesnt usually alter a lot.

dmgreen updated this revision to Diff 516316.Apr 24 2023, 2:30 AM

Update this to what it should have been. One of the tests has changed because 12345 as a i8 is really 57, which was too many scalar iterations to pick v16 over v8. I've changed it to 241 so that it keeps testing the same thing.

sdesmalen added inline comments.Apr 24 2023, 7:35 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5410–5414

nit: Is it worth using a lambda for this, e.g.

auto GetCostForTC = [MaxTripCount, this](unsigned VF, InstructionCost VectorCost,
                                         InstructionCost ScalarCost) {
  return foldTailByMasking() ?
    VectorCost * divideCeil(MaxTripCount, VF);
    VectorCost * (MaxTripCount / VF) + ScalarCost * (MaxTripCount % VF);
};

auto RTCostA = GetCostForTC(A.Width.getFixedValue(), CostA, A.ScalarCost);
auto RTCostB = GetCostForTC(B.Width.getFixedValue(), CostB, B.ScalarCost);
llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll
98

I'm curious why this test needed changing. What VF does it pick with 12345?

dmgreen added inline comments.Apr 24 2023, 7:38 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5410–5414

Sounds good I'll do that now.

llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll
98

12345 as a i8 is really 57, which was too many scalar iterations to pick v16 over v8. It is 3*vf16 + 9*vf1 vs 7*vf8 + 1*vf1.

I've changed it to 241 so that it keeps testing the same thing. And is a i8 value.

dmgreen updated this revision to Diff 516408.Apr 24 2023, 7:43 AM

Update as suggested

sdesmalen accepted this revision.Apr 24 2023, 7:58 AM

LGTM!

llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll
98

Ha, I didn't spot it said i8 12345, that's silly. Your explanation makes sense, thanks for clarifying!

This revision is now accepted and ready to land.Apr 24 2023, 7:58 AM
This revision was landed with ongoing or failed builds.Apr 24 2023, 2:02 PM
This revision was automatically updated to reflect the committed changes.