If you're writing C code using the ACLE MVE intrinsics that passes the
result of a vcmp as input to a predicated intrinsic, e.g.

  mve_pred16_t pred = vcmpeqq(v1, v2);
  v_out = vaddq_m(v_inactive, v3, v4, pred);

then clang's codegen for the compare intrinsic will create calls to
@llvm.arm.mve.pred.v2i to convert the output of icmp into an
mve_pred16_t integer representation, and then the next intrinsic
will call @llvm.arm.mve.pred.i2v to convert it straight back again.
This shows up in the generated code as a vmrs/vmsr pair that
pointlessly moves the predicate value out of p0 and back into it.
To prevent that, I've added InstCombine rules to remove round trips of
the form v2i(i2v(x)) and i2v(v2i(x)). I've also taught InstCombine
about the known and demanded bits of those intrinsics. As a result,
you now get just the generated code you wanted:

  vpt.u16 eq, q1, q2
  vaddt.u16 q0, q3, q4
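
For intuition about why the i2v(v2i(x)) round trip is a no-op, and what
the known-bits and demanded-bits facts amount to, here is a toy model in
portable C. It is only a sketch and not the LLVM implementation: the
names model_v2i, model_i2v and check_predicate_model are hypothetical, it
covers only the case of eight 16-bit lanes, and the rule that a lane is
active when the lower of its two predicate bits is set is a simplifying
assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LANES 8 /* eight 16-bit lanes in a 128-bit vector */

/* Toy model of pred.v2i: pack the lane booleans into a 16-bit predicate
   held in a 32-bit integer, using two predicate bits per 16-bit lane. */
uint32_t model_v2i(const bool lanes[LANES]) {
    uint32_t p = 0;
    for (int i = 0; i < LANES; i++)
        if (lanes[i])
            p |= (uint32_t)3 << (2 * i);
    return p; /* bits 16..31 are never set: the "known bits" fact */
}

/* Toy model of pred.i2v. Assumption of this sketch: a lane is active
   when the lower of its two predicate bits is set. Bits 16..31 of p are
   never read: the "demanded bits" fact. */
void model_i2v(uint32_t p, bool lanes[LANES]) {
    for (int i = 0; i < LANES; i++)
        lanes[i] = (p >> (2 * i)) & 1u;
}

/* Exhaustively check the model over all 256 lane patterns. */
void check_predicate_model(void) {
    for (unsigned pattern = 0; pattern < 256; pattern++) {
        bool in[LANES], out[LANES], out_dirty[LANES];
        for (int i = 0; i < LANES; i++)
            in[i] = (pattern >> i) & 1u;

        uint32_t p = model_v2i(in);
        assert(p < 65536u); /* top 16 bits known to be zero */

        model_i2v(p, out);                   /* i2v(v2i(x)) */
        model_i2v(p | 0xABCD0000u, out_dirty); /* garbage in the top half */
        for (int i = 0; i < LANES; i++) {
            assert(out[i] == in[i]);        /* round trip is the identity */
            assert(out_dirty[i] == out[i]); /* top half is not demanded */
        }
    }
}
```

Calling check_predicate_model() from a test program runs all three
checks. In this model the reverse direction, v2i(i2v(x)), is not examined.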