Limiting scalar hadd/hsub generation to the lowest xmm looks to be unnecessary - we will be extracting one upper xmm whatever, and we can remove a shuffle by using the hop which is inline with what shouldUseHorizontalOp expects to happen anyway.
Testing on btver2 (the main target for fast-hops) shows this is beneficial even for float ops where we have a 'shuffle' to extract the float result:
https://godbolt.org/z/0R-U-K