If we don't have NEON, we use the generic fallback, which takes 12 instructions. Make sure the costs reflect that.
(On a related note, we could optimize the generic fallback a bit. It currently uses sequences like lsr+and+add; if we use and+lsr+add instead, with the mask shifted to match, we can fold the lsr into the add as a shifted operand. But I'm not planning to look into that at the moment.)
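
For reference, a rough sketch of why the reordering is safe. This is not the actual lowering code: the function names, the 32-bit width, and the mask/shift parameters are assumptions made for illustration. It only shows that one lsr+and+add step and its and+lsr+add counterpart compute the same value, and that in the second form the shift feeds the add directly, which is the shape a shifted-operand add can absorb.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: current shape, shift first, then mask, then add
 * (lsr + and + add, plus the and for the low half). Requires k < 32. */
static uint32_t step_shift_then_mask(uint32_t x, unsigned k, uint32_t m) {
    uint32_t hi = (x >> k) & m;  /* lsr, and */
    uint32_t lo = x & m;         /* and */
    return hi + lo;              /* add */
}

/* Hypothetical helper: proposed shape, mask first with the mask moved up
 * by k, so the shift becomes the last operation before the add. */
static uint32_t step_mask_then_shift(uint32_t x, unsigned k, uint32_t m) {
    uint32_t hi = x & (m << k);  /* and, using the pre-shifted mask */
    uint32_t lo = x & m;         /* and */
    return lo + (hi >> k);       /* add with the lsr as its shifted operand */
}

/* Spot-check that both shapes agree for a few illustrative masks. */
static void check_equivalence(uint32_t x) {
    assert(step_shift_then_mask(x, 1, 0x55555555u) ==
           step_mask_then_shift(x, 1, 0x55555555u));
    assert(step_shift_then_mask(x, 2, 0x33333333u) ==
           step_mask_then_shift(x, 2, 0x33333333u));
    assert(step_shift_then_mask(x, 4, 0x0F0F0F0Fu) ==
           step_mask_then_shift(x, 4, 0x0F0F0F0Fu));
}
```

The two forms agree for any unsigned x because x >> k already has its top k bits clear, so masking before or after the shift selects the same bits.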