There was no pattern to fold sqdmull followed by sqadd/sqsub into a single
sqdmlal/sqdmlsl instruction. This patch adds
the pattern obtained from the following ACLE intrinsics so that they
generate sqdmlal/sqdmlsl instructions instead of separate sqdmull and
sqadd/sqsub instructions:
- vqdmlalh_s16, vqdmlslh_s16
- vqdmlalh_lane_s16, vqdmlalh_laneq_s16, vqdmlslh_lane_s16, vqdmlslh_laneq_s16 (when the lane index is 0)
It also modifies the result of the existing pattern for the latter, when
the lane index is not 0, to use the v1i32_indexed instructions instead
of the v4i16_indexed ones.
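
For reference, the arithmetic these intrinsics are expected to perform can be sketched in portable C. This is only a model, not the arm_neon.h implementation: the `ref_*` names and `sat32` are hypothetical helpers introduced here, and the vector operand of the lane variant is modeled as a plain 4-element array rather than `int16x4_t`. It illustrates why a sqdmull followed by sqadd/sqsub computes the same value as a single sqdmlal/sqdmlsl.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 64-bit intermediate to the signed 32-bit range. */
static int32_t sat32(int64_t v) {
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Model of sqdmull on 16-bit scalars: saturating doubling multiply. */
static int32_t ref_sqdmull_h(int16_t b, int16_t c) {
    return sat32(2 * (int64_t)b * (int64_t)c);
}

/* Model of vqdmlalh_s16: sqdmull, then saturating add (sqadd). */
static int32_t ref_vqdmlalh_s16(int32_t a, int16_t b, int16_t c) {
    return sat32((int64_t)a + ref_sqdmull_h(b, c));
}

/* Model of vqdmlslh_s16: sqdmull, then saturating subtract (sqsub). */
static int32_t ref_vqdmlslh_s16(int32_t a, int16_t b, int16_t c) {
    return sat32((int64_t)a - ref_sqdmull_h(b, c));
}

/* Model of the _lane variant: pick one element, then do the scalar op.
   The real intrinsic takes a vector type and a constant lane index. */
static int32_t ref_vqdmlalh_lane_s16(int32_t a, int16_t b,
                                     const int16_t v[4], int lane) {
    assert(lane >= 0 && lane < 4);
    return ref_vqdmlalh_s16(a, b, v[lane]);
}
```

In this model the only difference between the lane-0 and non-zero-lane cases is which element is selected; the choice of instruction form is purely a codegen matter.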
Fixes #49997.
This matches the vqdmlalh_lane_s16, vqdmlalh_laneq_s16, vqdmlslh_lane_s16, and vqdmlslh_laneq_s16 ACLE functions when the lane index is not 0.
I can simply remove this definition, and the SQDML*Lv1i32_indexed instructions will be used instead; however, that will result in longer AArch64 code (an extra dup instruction).
Is this FIXME really an issue?