This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ, SLP] Enable FP horizontal reductions and fix SLP cost computation.
Needs ReviewPublic

Authored by jonpa on Feb 20 2023, 3:38 AM.

Details

Reviewers
uweigand
ABataev
Summary

Implement SystemZTTIImpl::getArithmeticReductionCost() and SystemZTTIImpl::getMinMaxReductionCost() for floating point. I ran into some issues with the integer versions so it seemed simplest to do the FP opcodes first.

In order to enable reductions also when the elements are loaded from non-consecutive addresses, SLP needs a small patch for the computation in getGatherCost(). If the target supports it, there is no extra cost for a vector element load. This did not change any existing tests. Given the element load instructions, I think it makes sense to do this.

This gives a nice improvement on f519.lbm_r (with -funsafe-math-optimizations).

fp128 also benefits from this I think, due to the reassociation of the operands. In order to enable it I gave an arbitrary discount by dividing the number of elements with 2 and using that as the cost.

@ABataev : Does the SLP change look good to you?

Diff Detail

Event Timeline

jonpa created this revision.Feb 20 2023, 3:38 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 20 2023, 3:38 AM
jonpa requested review of this revision.Feb 20 2023, 3:38 AM
ABataev added inline comments.Feb 20 2023, 5:17 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
8470–8472

Better to calculate scalarization overhead for the loadinst elements and subtract from the gather cost.

llvm/test/Analysis/CostModel/SystemZ/vector-reductions-fp.ll
1

Better to precomit new tests to better see the changes/differences in a separate patch

ABataev added inline comments.Feb 20 2023, 5:43 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
8470–8472

Or better to include this check to the getGatherCost function

jonpa updated this revision to Diff 499915.Feb 23 2023, 10:46 AM

Better to calculate scalarization overhead for the loadinst elements and subtract from the gather cost.

Or better to include this check to the getGatherCost function

Thanks for review. Good suggestion to add back the scalarization overhead as it was originally added (instead of just '1'). This made me realize that on SystemZ there is an even further complication here, namely that for i64 vectors, two GPRs can be inserted with a single instruction.

I have experimented further with this and found a solution where the getGatherCost() function is lifted into TTI instead of being SLP specific. That way there can be a default implementation per our previous discussions and also a SystemZ specific handling of the i64 vector costs. Does this seem reasonable?

ABataev added inline comments.Feb 23 2023, 10:55 AM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
1250–1254 ↗(On Diff #499915)

Better to exclude DemandedElts[Idx] before calling getScalarizationOverhead for such loads rather than subtract the cost.

llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
1030–1051

I don't see the call for getScalariationOverhead. Also, can you try to implement overloaded version of getScalarizationOverhead, that knows how to handle it, instead of moving getGatherCost to TTI ?