For vector strided instructions, as the RVV spec says:
When rs2=x0, then an implementation is allowed, but not required, to
perform fewer memory operations than the number of active elements, and
may perform different numbers of memory operations across different
dynamic executions of the same static instruction.
So the compiler shouldn't assume that fewer memory operations will be
performed when rs2=x0.
We add a target feature to specify whether the microarchitecture
supports optimized zero-stride vector loads, and we perform the vector
splat optimization only if this feature is supported.
This feature is enabled by default since most designs implement this
optimization.
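The gating described above can be sketched roughly as follows. This is
a standalone illustration, not the code in this patch: SubtargetSketch,
SplatLowering, selectSplatOfLoad and describe are hypothetical names,
and the instruction strings only show the general shape of each
lowering.

```cpp
#include <string>

// Hypothetical stand-in for a subtarget query; not LLVM's actual API.
struct SubtargetSketch {
  // Assumed to default to true, mirroring "enabled by default" above.
  bool HasOptimizedZeroStrideLoad = true;
};

// The two ways a splat of a loaded scalar could be lowered.
enum class SplatLowering { ZeroStrideLoad, ScalarLoadThenSplat };

// Use the zero-stride form only when the target says it is optimized;
// otherwise fall back to an explicit scalar load followed by a splat.
SplatLowering selectSplatOfLoad(const SubtargetSketch &ST) {
  return ST.HasOptimizedZeroStrideLoad ? SplatLowering::ZeroStrideLoad
                                       : SplatLowering::ScalarLoadThenSplat;
}

// Illustrative assembly for each choice (registers are arbitrary).
std::string describe(SplatLowering L) {
  if (L == SplatLowering::ZeroStrideLoad)
    return "vlse32.v v8, (a0), zero";
  return "lw t0, 0(a0); vmv.v.x v8, t0";
}
```

With the default settings the sketch picks the zero-stride load; a
target that clears the flag gets the scalar load plus vmv.v.x instead.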
Rename: FeatureNoOptimizedZeroStrideLoad -> TuneNoOptimizedZeroStrideLoad.