Wide reductions (like 16xi64) are legalized to 2xi64, and setting the cost of
that type to 2 makes load-reduce and load-zext-reduce patterns profitable.
The few performance measurements that I did on an AArch64 machine confirm that
these patterns are actually faster when vectorized.
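
For illustration, the two patterns correspond roughly to source loops like the
ones in this C sketch. The function names are hypothetical and not taken from
the patch or its tests, and whether a given loop is actually vectorized depends
on the compiler version and flags.

```c
#include <stdint.h>

/* load-reduce: load 16 x i64 and add them up (a 16xi64 add reduction). */
int64_t reduce_i64(const int64_t *p) {
    int64_t sum = 0;
    for (int i = 0; i < 16; ++i)
        sum += p[i];
    return sum;
}

/* load-zext-reduce: load 16 x i8, zero-extend each element to 64 bits,
 * then add them up in a 64-bit accumulator. */
uint64_t reduce_zext_u8(const uint8_t *p) {
    uint64_t sum = 0;
    for (int i = 0; i < 16; ++i)
        sum += p[i]; /* implicit zero-extension from uint8_t to uint64_t */
    return sum;
}
```

In both cases the reduction operates on 16 lanes of 64-bit values, which is the
wide shape that gets legalized down to 2xi64 pieces.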