I've implemented them as target-specific IR intrinsics rather than
using @llvm.experimental.vector.reduce.add, on the grounds that the
'experimental' intrinsic doesn't currently have much code generation
benefit, and my replacements encapsulate the sign- or zero-extension
so that you don't expose the illegal MVE vector type (<4 x i64>) in
IR.
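
At the source level the effect is that the widening is internal to a
single operation. As a minimal sketch (assuming the standard ACLE
spellings from <arm_mve.h> such as vaddlvq_s32, which aren't quoted in
the text above; compile for an MVE target, e.g. -march=armv8.1-m.main+mve):

    #include <arm_mve.h>

    /* Sketch only. The reduction sign-extends each 32-bit lane to 64
       bits internally and returns a 64-bit scalar, so no <4 x i64>
       vector value ever needs to exist in the IR. */
    int64_t long_sum(int32x4_t v)
    {
        return vaddlvq_s32(v);   /* expected to select VADDLV.S32 */
    }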
The machine instructions come in two versions: with and without an
input accumulator. My new IR intrinsics, like the 'experimental' one,
don't take an accumulator parameter: we represent that by just adding
on the input value using an ordinary i32 or i64 add. So if you write
the vaddvaq C-language intrinsic with an input accumulator of zero,
it can be optimised to VADDV, and conversely, if you write something
like x += vaddvq(y) then that can be combined into VADDVA.
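
For example (a sketch with assumed ACLE spellings, here vaddvaq_u32 and
vaddvq_u32):

    #include <arm_mve.h>

    uint32_t sum(uint32x4_t v)
    {
        /* accumulator input of zero: expected to optimise to a plain
           VADDV */
        return vaddvaq_u32(0, v);
    }

    uint32_t accumulate(uint32_t x, uint32x4_t v)
    {
        /* separate scalar add of the reduction result: expected to be
           combined into VADDVA */
        x += vaddvq_u32(v);
        return x;
    }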
Most of this is achieved in isel lowering, by converting these IR
intrinsics into the existing ARMISD::VADDV family of custom SDNode
types. For the difficult case (64-bit accumulators), isel lowering
already implements the optimisation of folding an addition into a
VADDLV to make a VADDLVA; so once we've made a VADDLV, our job is
already done, except that I had to introduce a parallel set of ARMISD
nodes for the predicated forms of VADDLV.
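
The 64-bit case looks the same in source form (again a sketch with an
assumed ACLE spelling, vaddlvq_u32): a separate i64 add of the VADDLV
result is exactly what the existing fold turns into VADDLVA.

    #include <arm_mve.h>

    uint64_t accumulate_long(uint64_t acc, uint32x4_t v)
    {
        /* the i64 add of the reduction result is expected to be folded
           into the VADDLV, giving VADDLVA */
        acc += vaddlvq_u32(v);
        return acc;
    }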
For the simpler VADDV, we handle the predicated form by just leaving
the IR intrinsic alone and matching it in an ordinary dag pattern.
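
As a last sketch (assuming the ACLE predicated spelling vaddvq_p_u32
and the mve_pred16_t predicate type), the predicated form is just the
intrinsic call itself, left for that pattern to match:

    #include <arm_mve.h>

    uint32_t sum_predicated(uint32x4_t v, mve_pred16_t p)
    {
        /* the intrinsic survives to isel and is matched by the
           ordinary dag pattern, selecting the predicated VADDV */
        return vaddvq_p_u32(v, p);
    }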