The AArch64 backend has had some patterns that optimize code of the form:
  ldrsh w8, [x0]
  scvtf s0, w8
to:
  ldr   h0, [x0]
  sshll v0.4s, v0.4h, #0
  scvtf s0, s0
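For reference, a minimal C snippet that typically lowers to the first sequence on AArch64 (the function name load_short_to_float is just for illustration):

  #include <stdint.h>

  /* Sign-extending halfword load followed by an int-to-float convert.
     Without the alternate pattern this typically compiles to
     ldrsh + scvtf (a GPR -> FPR move); with it, to the
     ldr/sshll/scvtf sequence shown above. */
  float load_short_to_float(const int16_t *p) {
      return (float)*p;
  }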
The idea is to remove the GPR->FPR move, but in reality it makes the code larger and slower (or at best the same) on all the CPUs I tried.
This patch adds the UseAlternateSExtLoadCVTF32 predicate to these patterns, similar to the nearby related pattern.