This patch adds a new feature flag:
sme-f16f16 to represent FEAT_SME-F16F16
This patch add the following instructions:
SME2.1 stand alone instructions:
MOVAZ (array to vector, four registers): Move and zero four ZA single-vector groups to vector registers. (array to vector, two registers): Move and zero two ZA single-vector groups to vector registers. (tile to vector, four registers): Move and zero four ZA tile slices to vector registers. (tile to vector, single): Move and zero ZA tile slice to vector register. (tile to vector, two registers): Move and zero two ZA tile slices to vector registers. LUTI2 (Strided four registers): Lookup table read with 2-bit indexes. (Strided two registers): Lookup table read with 2-bit indexes. LUTI4 (Strided four registers): Lookup table read with 4-bit indexes. (Strided two registers): Lookup table read with 4-bit indexes. ZERO (double-vector): Zero ZA double-vector groups. (quad-vector): Zero ZA quad-vector groups. (single-vector): Zero ZA single-vector groups.
SME2p1 and SME-F16F16:
All instructions are half precision elements:
FADD: Floating-point add multi-vector to ZA array vector accumulators. FSUB: Floating-point subtract multi-vector from ZA array vector accumulators. FMLA (multiple and indexed vector): Multi-vector floating-point fused multiply-add by indexed element. (multiple and single vector): Multi-vector floating-point fused multiply-add by vector. (multiple vectors): Multi-vector floating-point fused multiply-add. FMLS (multiple and indexed vector): Multi-vector floating-point fused multiply-subtract by indexed element. (multiple and single vector): Multi-vector floating-point fused multiply-subtract by vector. (multiple vectors): Multi-vector floating-point fused multiply-subtract. FCVT (widening): Multi-vector floating-point convert from half-precision to single-precision (in-order). FCVTL: Multi-vector floating-point convert from half-precision to deinterleaved single-precision. FMOPA (non-widening): Floating-point outer product and accumulate. FMOPS (non-widening): Floating-point outer product and subtract.
SME2p1 and B16B16:
BFADD: BFloat16 floating-point add multi-vector to ZA array vector accumulators. BFSUB: BFloat16 floating-point subtract multi-vector from ZA array vector accumulators. BFCLAMP: Multi-vector BFloat16 floating-point clamp to minimum/maximum number. BFMLA (multiple and indexed vector): Multi-vector BFloat16 floating-point fused multiply-add by indexed element. (multiple and single vector): Multi-vector BFloat16 floating-point fused multiply-add by vector. (multiple vectors): Multi-vector BFloat16 floating-point fused multiply-add. BFMLS (multiple and indexed vector): Multi-vector BFloat16 floating-point fused multiply-subtract by indexed element. (multiple and single vector): Multi-vector BFloat16 floating-point fused multiply-subtract by vector. (multiple vectors): Multi-vector BFloat16 floating-point fused multiply-subtract. BFMAX (multiple and single vector): Multi-vector BFloat16 floating-point maximum by vector. (multiple vectors): Multi-vector BFloat16 floating-point maximum. BFMAXNM (multiple and single vector): Multi-vector BFloat16 floating-point maximum number by vector. (multiple vectors): Multi-vector BFloat16 floating-point maximum number. BFMIN (multiple and single vector): Multi-vector BFloat16 floating-point minimum by vector. (multiple vectors): Multi-vector BFloat16 floating-point minimum. BFMINNM (multiple and single vector): Multi-vector BFloat16 floating-point minimum number by vector. (multiple vectors): Multi-vector BFloat16 floating-point minimum number. BFMOPA (non-widening): BFloat16 floating-point outer product and accumulate. BFMOPS (non-widening): BFloat16 floating-point outer product and subtract.
The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2022-09
Depends on: D137410
clang-format suggested style edits found: