As reported on PR39920, "slow horizontal ops" targets tend to internally expand to 2*shuffle+add/sub - so if we can reduce 2*shuffle+add/sub to a hadd/sub then we should do it - similar port usage but reduced instruction count.
This works out in most cases, although the "PR22377" regression in vector-shuffle-combining.ll is annoying - going from 2*shuffle+add+shuffle to hadd+2*shuffle - I'm open to suggestions - I've been trying to think of ways to get foldShuffleOfHorizOp to work with general shuffles but haven't found anything yet.