Implement SystemZTTIImpl::getArithmeticReductionCost() and SystemZTTIImpl::getMinMaxReductionCost() for floating point. I ran into some issues with the integer versions, so it seemed simplest to do the FP opcodes first.
To also enable reductions when the elements are loaded from non-consecutive addresses, SLP needs a small patch to the computation in getGatherCost(): if the target supports it, a vector element load incurs no extra cost. This did not change any existing tests. Given the element load instructions, I think it makes sense to do this.
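As a rough illustration of the gather-cost idea (not the actual SLP code), the sketch below charges nothing for elements that come straight from a load when the target can load directly into a vector lane. `ToyTarget`, `toyGatherCost`, and the cost values are hypothetical names invented for this example, not LLVM APIs.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a target description; not an LLVM type.
struct ToyTarget {
  bool HasVectorElementLoad;  // target can load one element directly into a lane
  unsigned InsertElementCost; // assumed cost of inserting a scalar into a lane
};

// Cost of building a vector out of scalars. IsLoaded[i] tells whether
// element i is produced directly by a load.
unsigned toyGatherCost(const ToyTarget &T, const std::vector<bool> &IsLoaded) {
  assert(!IsLoaded.empty() && "expected at least one element");
  unsigned Cost = 0;
  for (bool Loaded : IsLoaded) {
    // A loaded element needs no separate insert if the load itself
    // can place the value in the vector lane.
    if (Loaded && T.HasVectorElementLoad)
      continue;
    Cost += T.InsertElementCost;
  }
  return Cost;
}
```

With an insert cost of 1 and three of four elements loaded, this toy model returns 1 when element loads are available and 4 when they are not.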
This gives a nice improvement on f519.lbm_r (with -funsafe-math-optimizations).
I think fp128 also benefits from this, due to the reassociation of the operands. To enable it, I applied an arbitrary discount: the cost is the number of elements divided by 2.
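A minimal sketch of that fp128 discount, assuming integer division (so odd element counts round down). `toyFP128ReductionCost` is a hypothetical helper for illustration only and is not the function in the patch.

```cpp
#include <cassert>

// Hypothetical helper: arbitrary discounted cost for an fp128 reduction,
// computed as half the number of elements (integer division assumed).
unsigned toyFP128ReductionCost(unsigned NumElts) {
  assert(NumElts >= 2 && "a reduction needs at least two elements");
  return NumElts / 2;
}
```

Under this assumption, a 4-element fp128 reduction costs 2 and an 8-element one costs 4.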
@ABataev: Does the SLP change look good to you?
I don't see the call to getScalarizationOverhead. Also, can you try to implement an overloaded version of getScalarizationOverhead that knows how to handle this, instead of moving getGatherCost to TTI?