In vectorized float min/max reduction code, the final "reduce" step
is sub-optimal. On AArch64, this change will combine:
  svn0 = vector_shuffle t0, undef<2,3,u,u>
  fmin = fminnum t0, svn0
  svn1 = vector_shuffle fmin, undef<1,u,u,u>
  cc = setcc fmin, svn1, ole
  n0 = extract_vector_elt cc, #0
  n1 = extract_vector_elt fmin, #0
  n2 = extract_vector_elt fmin, #1
  result = select n0, n1, n2
into:
result = llvm.aarch64.neon.fminnmv t0
This change extends r247575.
Open question: should FMAXNAN and FMINNAN be handled the same way (-> FMAXV, FMINV)?