This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][CostModel] Make sext/zext free if folded into a masked load
ClosedPublic

Authored by david-arm on Apr 12 2023, 6:16 AM.

Details

Summary

The BasicTTIImpl implementation of getCastInstrCost ensures
that the cost of a zext/sext is 0 when it follows a load, provided
we know the combined extending load is legal. For SVE we can do
the same for masked loads too, since they use exactly the
same underlying instruction.
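
Roughly speaking, the BasicTTIImpl logic being referred to looks like this (a simplified sketch rather than a verbatim quote of the in-tree code, which has a few more guards):

// Inside BasicTTIImpl::getCastInstrCost (simplified sketch): a sext/zext
// whose operand is a load is reported as free if the target can fold the
// pair into a single extending load.
if (I && isa<LoadInst>(I->getOperand(0)) &&
    CCH == TTI::CastContextHint::Normal) {
  EVT ExtVT = EVT::getEVT(Dst);
  EVT LoadVT = EVT::getEVT(Src);
  unsigned LType =
      (Opcode == Instruction::ZExt) ? ISD::ZEXTLOAD : ISD::SEXTLOAD;
  if (TLI->isLoadExtLegal(LType, ExtVT, LoadVT))
    return 0; // The extend folds into the load, so it costs nothing.
}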

Diff Detail

Event Timeline

david-arm created this revision.Apr 12 2023, 6:16 AM
david-arm requested review of this revision.Apr 12 2023, 6:16 AM
Herald added a project: Restricted Project.Apr 12 2023, 6:16 AM
sdesmalen added inline comments.Apr 12 2023, 6:49 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1830

nit: this can be hasSVEorSME

1833–1834

When the target has SVE and the CCH is masked, is it worth just calling getCastInstrCost again, but passing CastContextHint::Normal instead?
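
In other words, something along these lines (a hypothetical sketch of the suggestion only; the exact guards in the committed change differ, and a later update restricts this to legal destination types):

// Hypothetical sketch for AArch64TTIImpl::getCastInstrCost: with SVE/SME an
// extending masked load uses the same instruction as an extending normal
// load, so re-query the cost with the Normal context hint. ST is the
// AArch64Subtarget member.
if (CCH == TTI::CastContextHint::Masked && ST->hasSVEorSME())
  return getCastInstrCost(Opcode, Dst, Src, TTI::CastContextHint::Normal,
                          CostKind, I);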

david-arm updated this revision to Diff 512847.Apr 12 2023, 8:23 AM
  • Address review comments
david-arm marked 2 inline comments as done.Apr 12 2023, 8:23 AM
david-arm added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1833–1834

Good suggestion, thanks @sdesmalen !

sdesmalen accepted this revision.Apr 12 2023, 8:43 AM
This revision is now accepted and ready to land.Apr 12 2023, 8:43 AM
dmgreen added inline comments.Apr 12 2023, 10:16 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2131

Is this always true that they are equivalent?

For an nxv8i16 load zext to nxv8i32 (without masking) you can convert it into a pair of extending loads (each costing 1, so the load+zext costs 2 in total).

The same can't be done for an nxv8i16 masked_load zext to nxv8i32 without either converting the nxv8i1 mask into two nxv4i1 masks, or zexting a single load with a pair of uunpk's. (For MVE both are expensive, so we give the instruction a high cost, preferring lower vectorization factors.)

llvm/test/Analysis/CostModel/AArch64/masked_ldst.ll
138

If I'm reading this correctly, there are tests here for loading smaller types and extending them to legal types, but none for loading legal types and extending them. They might be worth adding.

david-arm updated this revision to Diff 514235.Apr 17 2023, 7:52 AM
david-arm marked an inline comment as done.
  • Only consider types where the destination is legal.
david-arm marked 2 inline comments as done.Apr 17 2023, 7:53 AM
david-arm added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2131

Good point! I think the new version should fix that.

dmgreen added inline comments.Apr 17 2023, 8:14 AM
llvm/test/Analysis/CostModel/AArch64/masked_ldst.ll
111–112

Do you know why this has gone from 3 to 1, if the Dst type is not legal? I think I would expect the score to be 2!

Matt added a subscriber: Matt.Apr 17 2023, 9:38 AM
david-arm marked an inline comment as done.Apr 18 2023, 1:01 AM
david-arm added inline comments.
llvm/test/Analysis/CostModel/AArch64/masked_ldst.ll
111–112

I assumed this was because it was deciding to create two legal extending loads and then stitch the results together, i.e. reinvoke getCastInstrCost with the type split in half.

%load.nxv8i8.1 = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0(ptr undef, i32 8, <vscale x 8 x i1> undef, <vscale x 8 x i8> undef)
%zext.nxv8i8to16.1 = zext <vscale x 8 x i8> %load.nxv8i8.1 to <vscale x 8 x i16>
%load.nxv8i8.2 = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0(ptr undef, i32 8, <vscale x 8 x i1> undef, <vscale x 8 x i8> undef)
%zext.nxv8i8to16.2 = zext <vscale x 8 x i8> %load.nxv8i8.2 to <vscale x 8 x i16>
%zext.nxv16i8to16 = concat (%zext.nxv8i8to16.1, %zext.nxv8i8to16.2)

so that the only cost of the zext is just the concat, but it's a good question. I can double check.

david-arm added inline comments.Apr 18 2023, 1:17 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2131

This is in BasicTTIImpl::getCastInstrCost:

// If we are legalizing by splitting, query the concrete TTI for the cost
// of casting the original vector twice. We also need to factor in the
// cost of the split itself. Count that as 1, to be consistent with
// getTypeLegalizationCost().
bool SplitSrc =
    TLI->getTypeAction(Src->getContext(), TLI->getValueType(DL, Src)) ==
    TargetLowering::TypeSplitVector;
bool SplitDst =
    TLI->getTypeAction(Dst->getContext(), TLI->getValueType(DL, Dst)) ==
    TargetLowering::TypeSplitVector;
if ((SplitSrc || SplitDst) && SrcVTy->getElementCount().isVector() &&
    DstVTy->getElementCount().isVector()) {
  Type *SplitDstTy = VectorType::getHalfElementsVectorType(DstVTy);
  Type *SplitSrcTy = VectorType::getHalfElementsVectorType(SrcVTy);
  T *TTI = static_cast<T *>(this);
  // If both types need to be split then the split is free.
  InstructionCost SplitCost =
      (!SplitSrc || !SplitDst) ? TTI->getVectorSplitCost() : 0;
  return SplitCost +
         (2 * TTI->getCastInstrCost(Opcode, SplitDstTy, SplitSrcTy, CCH,
                                    CostKind, I));
}

which explains what's happening. It splits the types, then recalculates the zext/sext when the dest is a legal type. It just so happens that this becomes a legal extending load, which is correct! So the extend is absorbed into each load and becomes free. The only cost is then the SplitCost, which I guess could account for the additional load required.
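
For the nxv16i8 to nxv16i16 zext above, the cost returned for the extend then works out roughly as follows (assuming BasicTTIImpl's default getVectorSplitCost() of 1 and that each half-width extending masked load is free after this patch):

  SplitSrc  = false  (nxv16i8 is already a legal SVE type)
  SplitDst  = true   (nxv16i16 is split into two nxv8i16 halves)
  SplitCost = 1      (only one side splits, so the split is not free)
  Cost      = SplitCost + 2 * getCastInstrCost(zext, nxv8i16, nxv8i8, Masked)
            = 1 + 2 * 0
            = 1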

That would make sense for normal loads, but masked loads will not split like that (unless they can extend the mask). https://godbolt.org/z/x9T8vP6Kx. It may be simpler to be more precise about the cost if it returned it directly.

I see what you mean. At the moment we won't lower to two extending loads, although we could if we thought it would help - it would just require a punpklo and punpkhi instead of sunpklo and sunpkhi. However, I think that extends the scope of this patch beyond what was originally intended, which is to only consider extends to legal types. This patch isn't changing the behaviour of BasicTTIImpl::getCastInstrCost - before and after this patch we are splitting the types in exactly the same way. I can have a look at what's required to be more precise, but I think that's going to require adding many entries to the existing table to deal with all the possible combinations. I feel that might be better in a separate patch, rather than complicating this one?

That would make sense for normal loads, but masked loads will not split like that (unless they can extend the mask). https://godbolt.org/z/x9T8vP6Kx. It may be simpler to be more precise about the cost if it returned it directly.

Hi @dmgreen, I've rebased this patch now that https://reviews.llvm.org/D142456 has landed, which explicitly accounts for extends to illegal types, so I think your concerns should be addressed now!

dmgreen accepted this revision.Apr 19 2023, 9:02 AM

I see. I am surprised that patch could be committed without causing regressions again.

The way this patch works makes it look like extends of masked loads should cost the same as extends of normal loads. I can't say I'm a huge fan of that, just because it is not really how they work, and you have to reason about what the costs will become. So long as there are tests, though, that should not be a problem, and we can see we are getting the right costs out. The numbers do now look good.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2130

You can drop the brackets around CCH == TTI::CastContextHint::Masked

I see. I am surprised that patch could be committed without causing regressions again.

Yeah, @hassnaa-arm was able to reland that because of this patch https://reviews.llvm.org/D147522

This revision was landed with ongoing or failed builds.Apr 20 2023, 1:49 AM
This revision was automatically updated to reflect the committed changes.