The new IR fixes a mismatch in the final extractelement for the i32 intrinsics. Previously we extracted a 64-bit element even though we only wanted 32 bits.
SimplifyDemandedElts isn't able to make FP elements undef now and the shuffle mask I used prevents the use of horizontal add we had before. Not sure we should have been using horizontal add anyway. It's implemented on Intel with two port 5 shuffles and an add. So we have on less shuffle now, but an additional instruction to decode.