This patch is meant to discuss the prototypes of the RVV intrinsics and to implement code generation for them, based on the initial infrastructure in D89449.
What this patch has done:
- Propose RVV intrinsic prototype:
- Separate optional mask and vl arguments.
- Naming and operand order are aligned with rvv-intrinsic-doc. https://github.com/riscv/rvv-intrinsic-doc
For example, a VADD intrinsic has four prototypes.
VADD(op1, op2)
VADD.M(mask, maskedoff, op1, op2)
VADD.VL(op1, op2, vl)
VADD.M.VL(mask, maskedoff, op1, op2, vl)
Any thoughts on these RVV intrinsic prototypes?
- Code generation for VLE/VSE/VADD intrinsics without mask and vl. The implementation is based on the initial infrastructure D89449.
Do we really need those complex patterns written in the target description file?
With that approach, we may need five patterns to select four VVV-form intrinsics and one IR node for operations such as VADD and VSUB, and five more patterns for VWADD and VWSUB. Eventually, we may suffer maintenance hell.
Our solution is to select RVV intrinsics without any pattern-matching rules.
We build two searchable tables to provide the necessary information:
- RVVLMULIndex table: tells the selection function how to determine the LMUL and SEW of an intrinsic.
- RVVIntrinsicToPseudo table: looks up a pseudo RVV instruction by intrinsic and the LMUL inferred from the table above.
Example:
RVVLMULIndex table:
(VADDVV, index 1): LMUL can be inferred from the first operand.
(VWADDVV, index 1): LMUL can be inferred from the first operand.
(VWADDWV, index 1, dividedBy2): LMUL can be inferred from the first operand and then divided by 2.
LMULIndex = lookupLMULIndexByIntrinsic(VADDVV);
LMUL = inferLMUL(VADDVV, LMULIndex);
The LMUL can be inferred from the operand type. See: https://github.com/riscv/rvv-intrinsic-doc/blob/master/rvv-intrinsic-rfc.md#data-types
RVVIntrinsicToPseudo table:
(VADDVV, LMUL M1, VADDVV_M1)
(VADDVV, LMUL M2, VADDVV_M2)
(VADDVV, LMUL M4, VADDVV_M4)
(VADDVV, LMUL M8, VADDVV_M8)
PseudoOp = lookupPseudoByIntrinsicAndLMUL(VADDVV, LMUL);
All of the above can be done within a single C++ function.
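To make the two-table lookup concrete, here is a minimal standalone sketch of how such a selection function might look. It is not the code in this patch: the enums, struct layouts, the `selectPseudo` helper, and the exact function signatures are hypothetical stand-ins, operand types are reduced to a bare LMUL value, and fractional LMUL is not modeled. Only the table contents and the two lookup names are taken from the example above.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not LLVM types.
enum class Intrinsic { VADDVV, VWADDVV, VWADDWV };
enum class LMUL : uint8_t { M1 = 1, M2 = 2, M4 = 4, M8 = 8 };

// One row of the RVVLMULIndex table.
struct LMULIndexEntry {
  Intrinsic ID;
  unsigned OperandIndex; // 1-based operand whose type determines LMUL
  bool DividedBy2;       // halve the operand's LMUL (e.g. VWADDWV)
};

// One row of the RVVIntrinsicToPseudo table.
struct PseudoEntry {
  Intrinsic ID;
  LMUL Mul;
  const char *Pseudo;
};

static const LMULIndexEntry RVVLMULIndexTable[] = {
    {Intrinsic::VADDVV, 1, false},
    {Intrinsic::VWADDVV, 1, false},
    {Intrinsic::VWADDWV, 1, true},
};

static const PseudoEntry RVVIntrinsicToPseudoTable[] = {
    {Intrinsic::VADDVV, LMUL::M1, "VADDVV_M1"},
    {Intrinsic::VADDVV, LMUL::M2, "VADDVV_M2"},
    {Intrinsic::VADDVV, LMUL::M4, "VADDVV_M4"},
    {Intrinsic::VADDVV, LMUL::M8, "VADDVV_M8"},
};

// Linear search for brevity; a real searchable table would be sorted.
const LMULIndexEntry *lookupLMULIndexByIntrinsic(Intrinsic ID) {
  for (const LMULIndexEntry &E : RVVLMULIndexTable)
    if (E.ID == ID)
      return &E;
  return nullptr;
}

// OperandLMULs[i] is the LMUL implied by the type of operand i+1.
std::optional<LMUL> inferLMUL(const LMULIndexEntry &E,
                              const std::vector<LMUL> &OperandLMULs) {
  assert(E.OperandIndex >= 1 && "table indices are 1-based");
  if (E.OperandIndex > OperandLMULs.size())
    return std::nullopt;
  unsigned Value = static_cast<unsigned>(OperandLMULs[E.OperandIndex - 1]);
  if (!E.DividedBy2)
    return static_cast<LMUL>(Value);
  if (Value < 2)
    return std::nullopt; // would be a fractional LMUL; not modeled here
  return static_cast<LMUL>(Value / 2);
}

const char *lookupPseudoByIntrinsicAndLMUL(Intrinsic ID, LMUL Mul) {
  for (const PseudoEntry &E : RVVIntrinsicToPseudoTable)
    if (E.ID == ID && E.Mul == Mul)
      return E.Pseudo;
  return nullptr;
}

// Chains the two lookups; returns the pseudo name, or nullopt on a miss.
std::optional<std::string> selectPseudo(Intrinsic ID,
                                        const std::vector<LMUL> &OperandLMULs) {
  const LMULIndexEntry *Index = lookupLMULIndexByIntrinsic(ID);
  if (!Index)
    return std::nullopt;
  std::optional<LMUL> Mul = inferLMUL(*Index, OperandLMULs);
  if (!Mul)
    return std::nullopt;
  if (const char *Pseudo = lookupPseudoByIntrinsicAndLMUL(ID, *Mul))
    return std::string(Pseudo);
  return std::nullopt;
}
```

Under these assumptions, calling `selectPseudo(Intrinsic::VADDVV, {LMUL::M2, LMUL::M2})` walks both tables and yields `VADDVV_M2`. Adding a new intrinsic then means adding table rows rather than writing new selection patterns.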
This sets VL to VLMAX, which isn't what the spec wants. It should get the value from the previous vsetvl intrinsic, or maybe from the previous intrinsic that had a vl argument.
Our internal implementation has been implementing the intrinsics without vl by inserting a readvl intrinsic and a call to the intrinsics that take vl. But we've been finding issues with this. The readvl acts as an optimization barrier. It also doesn't have any ordering in IR with respect to intrinsics that have a vl argument unless we mark all intrinsics as having side effects.
What are your thoughts on this?