Implement SystemZTTIImpl::getArithmeticReductionCost() and SystemZTTIImpl::getMinMaxReductionCost() for floating point. I ran into some issues with the integer versions, so it seemed simplest to do the FP opcodes first.
To also enable reductions when the elements are loaded from non-consecutive addresses, SLP needs a small patch to the computation in getGatherCost(): if the target supports it, an element that is loaded directly into a vector element adds no extra cost. This did not change any existing tests. Given the vector element load instructions on SystemZ, I think it makes sense to do this.
This gives a nice improvement on f519.lbm_r (with -funsafe-math-optimizations).
I think fp128 also benefits from this, due to the reassociation of the operands. To enable it, I gave it an arbitrary discount: the cost is the number of elements divided by 2.
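As a rough illustration of the fp128 discount described above, here is a minimal standalone sketch. The function names are hypothetical and not part of the patch, and the scalar baseline of one unit per add is an assumption made only for comparison.

```cpp
#include <cassert>

// Hypothetical stand-in for the fp128 branch of a reduction cost hook:
// the cost is simply the number of elements divided by 2.
unsigned fp128ReductionCost(unsigned NumElts) {
  assert(NumElts >= 2 && "a reduction needs at least two elements");
  return NumElts / 2;
}

// Assumed baseline for comparison only: a scalar chain of NumElts - 1 adds,
// each costing one unit.
unsigned scalarChainCost(unsigned NumElts) {
  assert(NumElts >= 2 && "a reduction needs at least two elements");
  return NumElts - 1;
}
```

Under these assumptions the discounted cost is lower than the scalar chain for any reduction of more than two elements, which is what lets the reduction be selected.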
@ABataev: Does the SLP change look good to you?
Better to exclude DemandedElts[Idx] before calling getScalarizationOverhead for such loads, rather than subtracting the cost afterwards.
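The difference between the two approaches can be sketched with a toy model. Everything here is hypothetical: `toyInsertOverhead` stands in for the real overhead query and assumes a target where one instruction inserts two elements, which is not how any particular target is known to behave. It only shows why masking out elements first is more robust than subtracting a per-element cost afterwards when the overhead is not linear in the element count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy overhead query (assumption): one instruction inserts two demanded
// elements, so the overhead is ceil(count / 2).
int toyInsertOverhead(const std::vector<bool> &Demanded) {
  int Count = 0;
  for (bool D : Demanded)
    Count += D ? 1 : 0;
  return (Count + 1) / 2;
}

// Subtracting approach: cost every element, then subtract a fixed
// per-element cost for each element that comes from a load.
int gatherCostSubtract(const std::vector<bool> &IsLoad, int PerElementCost) {
  assert(PerElementCost >= 0 && "per-element cost must not be negative");
  std::vector<bool> Demanded(IsLoad.size(), true);
  int Cost = toyInsertOverhead(Demanded);
  for (bool L : IsLoad)
    if (L)
      Cost -= PerElementCost;
  return Cost;
}

// Excluding approach: clear the demanded bit of each loaded element first,
// then query the overhead for the remaining elements only.
int gatherCostExclude(const std::vector<bool> &IsLoad) {
  std::vector<bool> Demanded(IsLoad.size(), true);
  for (std::size_t I = 0; I < IsLoad.size(); ++I)
    if (IsLoad[I])
      Demanded[I] = false;
  return toyInsertOverhead(Demanded);
}
```

With three loaded elements out of four, the subtracting variant goes negative in this model, while the excluding variant returns the overhead of the one remaining element.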