diff --git a/llvm/docs/RISCV/RISCVVectorExtension.rst b/llvm/docs/RISCV/RISCVVectorExtension.rst new file mode 100644 --- /dev/null +++ b/llvm/docs/RISCV/RISCVVectorExtension.rst @@ -0,0 +1,407 @@ +========================= + RISC-V Vector Extension +========================= + +.. contents:: + :local: + +The RISC-V Vector extension provides vector computation capabilities to the RISC-V architecture [RVV]_. + +This guide is based on the original RFC proposing code generation for the extension [RVV-CodeGen-RFC]_. It briefly outlines the features of the extension and gives an overview of how the RISC-V backend generates code for it. + +Overview +======== + +The vector extension adds 32 vector registers ``v0``, ``v1``, ..., ``v31`` to the ISA. +Unlike typical SIMD ISAs, the size in bits of each vector register is an implementation-specific parameter called ``VLEN`` and must be a power of two. +``VLEN`` may also have additional constraints depending on the exact vector extension; see :ref:`standard vector extensions` for more details. + +Vector registers are partitioned (i.e. densely packed) into elements whose size in bits is a power of two, ranging from 8 to a maximum called ``ELEN``. +``ELEN`` is also a power of two and :math:`\texttt{ELEN} \leq \texttt{VLEN}`. + +Due to encoding constraints, not all the operands of a vector operation are encoded in the instructions themselves. +Two CSRs (control and status registers) are used instead: + +- ``vl``: the number of elements being operated on, called the vector length. A vector instruction operates on elements ``0`` to ``vl-1``. +- ``vtype``: the vector type. This register encodes the element size of the operation, called the selected element width (SEW), and a vector grouping mechanism called the length multiplier (LMUL). + + +Length multiplier +----------------- + +The length multiplier (LMUL) can take values 1, 2, 4, 8, 1/2, 1/4, 1/8. 
+It is encoded as a power of two, where :math:`\text{LMUL} = 2^k, -3 \leq k \leq 3`. + +- When :math:`\text{LMUL} = 1` the vector instructions operate on the (32) vector registers. +- When :math:`\text{LMUL} \lt 1` the vector instructions operate on the lowest half, quarter or eighth of a vector register. +- When :math:`\text{LMUL} \gt 1` the vector instructions operate on vector groups encoded in the instruction using the lowest numbered vector register of the group. + A vector group is the set of consecutive vector registers ``v{LMUL*i}``, ``v{LMUL*i+1}``, ... , ``v{LMUL*(i + 1) - 1}``. So + + - :math:`\text{LMUL}=2` has 16 groups: ``v0``, ``v2``, ``v4``, ..., ``v28``, ``v30`` + - :math:`\text{LMUL}=4` has 8 groups: ``v0``, ``v4``, ``v8``, ``v12``, ``v16``, ``v20``, ``v24``, ``v28`` + - :math:`\text{LMUL}=8` has 4 groups: ``v0``, ``v8``, ``v16``, ``v24`` + +For instance, under :math:`\text{LMUL}=4`, a vector group ``v4`` operand includes vector registers ``v4``, ``v5``, ``v6`` and ``v7`` as if they had been concatenated as a four times larger vector register. + +LMUL is useful to align the number of elements in vector codes whose element sizes are different (say when combining vectors of 32- and 64-bit elements) or when doing *widenings* (zero, sign or fp extensions) or *narrowings* (truncations). + +Setting ``vl`` and ``vtype`` +---------------------------- + +A program must ensure that both ``vl`` and ``vtype`` have the correct values for a vector operation before executing a vector instruction. +This is done using the ``vsetvli`` instruction. + +.. code-block:: nasm + + vsetvli rdest, rsrc, sew,lmul,tx,mx # tx,mx is described in Masks and tails + +``rsrc`` is the application vector length (AVL) and will be used when setting the ``vl``. ``rdest`` is updated with the value of ``vl``. +The spec allows some latitude here but a simple functional model of what ``vsetvli`` does is the following: + +.. 
math:: + + \text{vl} &\gets \min(\text{rsrc}, \frac{\text{LMUL} \times \text{VLEN}}{\text{SEW}}) \\ + \text{vtype} &\gets \text{SEW},\text{LMUL},\dots + +There is also ``vsetivli`` for when the AVL is an immediate, and ``vsetvl`` for when the AVL and ``vtype`` are both registers. + +``vsetvli`` has a couple of special cases: + +- When ``rsrc`` is ``x0`` and ``rdest`` is not ``x0`` then :math:`\text{vl} \gets \text{LMUL} \times \frac{\text{VLEN}}{\text{SEW}}`. + In other words, it sets ``vl`` to the maximum vector length for a given LMUL and SEW. + This is useful for whole-register operations. + + .. code-block:: nasm + + vsetvli t0, x0, e32,m2,ta,ma # vl ← 2*VLEN/32 + # vtype ← e32,m2,… + # t0 ← vl + +- When ``rsrc`` and ``rdest`` are both ``x0`` (the hard-coded zero of RISC-V) then ``vl`` is used as the AVL. This can be used to change the ``vtype`` when we know the ratio :math:`\frac{\text{SEW}}{\text{LMUL}}` will be preserved. + + .. code-block:: nasm + + vsetvli x0, x0, e64,m4,ta,ma # changing vtype from e32,m2 to e64,m4 is OK (vl is unchanged) + # vtype ← e64,m4,… + +Two simple examples (register ``x10`` contains the AVL): + +- Add two 32-bit element vectors under :math:`\text{LMUL}=1` + + .. code-block:: nasm + + vsetvli x0, x10, e32,m1,ta,ma + vadd.vv v1, v2, v3 # v1[0:vl-1] ← v2[0:vl-1] + v3[0:vl-1] + # where v[i:j] is all v[x] where i <= x <= j + +- Add two 64-bit element vectors under :math:`\text{LMUL}=2` + + .. code-block:: nasm + + vsetvli x0, x10, e64,m2,ta,ma + vadd.vv v2, v4, v6 # Updates v2 and v3. Reads v4, v5 and v6, v7 + # v2[0:x-1] ← v4[0:x-1] + v6[0:x-1] where x = min(VLEN/64, vl) + # v3[0:y-1] ← v5[0:y-1] + v7[0:y-1] where y = vl - x + +.. note:: + + ``vsetvli`` is commonly used for stripmining, as in the example below: + + .. 
code-block:: nasm + + # on entry: + # a0 holds the total number of elements + # a1 holds the address of the source array + loop: + vsetvli t0, a0, e32,m8,ta,ma # set up vl, LMUL=8 + vle32.v v8, (a1) # load elements + vadd.vi v8, v8, 1 # process elements + vse32.v v8, (a1) # store updated elements + sub a0, a0, t0 # decrement count + slli t0, t0, 2 # scale element count to bytes (4 per element) + add a1, a1, t0 # advance address + bnez a0, loop # loop until all processed + + The way you would read the ``vsetvli`` is as follows: + + - ``e32,m8``: Group the registers together into groups of 8 (:math:`\text{LMUL}=8`) and partition them into 32-bit elements. + - ``ta,ma``: Be tail agnostic and mask agnostic: We don't care about what's in the elements that aren't processed. + - ``a0``: Try to process ``a0`` elements, or as many as the hardware supports. + - ``t0``: Store ``vl``, i.e. the number of elements that will be processed this iteration. + +.. _masks and tails: + +Masks and tails +--------------- +The RISC-V Vector extension supports masks in almost all of its instructions. +There are no distinguished mask registers; instead, vector registers can be used to represent masks. + +However, an instruction whose execution is masked can only use the ``v0`` register as the mask operand. +Elements of the destination register that are masked off by the mask are called *inactive elements*. + +A vector instruction can be executed under a ``vl`` setting where :math:`\texttt{vl} \lt \text{LMUL} \times \frac{\texttt{VLEN}}{\text{SEW}}`. +Elements of the destination register past the current ``vl`` are called *tail elements*. + +There are two modes for the tail and inactive elements: + +- undisturbed, in which the elements of the destination register are left unmodified +- agnostic, in which the elements of the destination register are either left unmodified or have all their bits set to 1 (for debugging purposes). 
 In this mode we cannot assume anything about the bits of those elements. + +``tx,mx`` in ``vsetvli`` above correspond to these two policies and can be combined in 4 ways: + +- ``tu,mu``: Both tail and inactive are left undisturbed +- ``ta,ma``: Both tail and inactive are agnostic +- ``tu,ma``: Tail is left undisturbed and inactive are agnostic +- ``ta,mu``: Tail is agnostic and inactive are left undisturbed + + + +.. _standard vector extensions: + +Standard vector extensions +-------------------------- + +Formally, the vector extension exists in multiple variants, each of which imposes additional constraints on ``VLEN`` and ``EEW`` (the effective ``SEW`` for a specific vector operand): + +``Zvl*`` + Extensions of the form ``Zvl32b``, ``Zvl64b``, etc. + These don't actually contain any instructions but just dictate the minimum required ``VLEN``. + All the extensions below require one of the ``Zvl`` extensions. + +``Zve`` + A smaller subset of the vector extension designed for use in embedded devices. + Specifies a minimum ``VLEN`` and the range of supported ``EEW`` values. + For example, ``Zve32x`` requires ``Zvl32b`` and supports ``EEW = {8, 16, 32}``. + ``Zve64f`` requires ``Zvl64b``, supports ``EEW = {8, 16, 32, 64}`` and also provides 32-bit floating point instructions. + +``v`` + This is the single letter version of the vector extension intended for use in application contexts. + It requires ``Zvl128b`` as well as the ``f`` and ``d`` extensions, and provides all the instructions defined in the specification. + + +Mapping to LLVM IR Types +======================== + +Since ``VLEN`` is an unknown constant from the compiler's perspective, the RISC-V backend takes the same approach as AArch64's SVE and uses scalable vector types [SVE-RFC]_. + +Scalable vector types are of the form ``<vscale x n x ty>``, which indicates a vector with a multiple of ``n`` elements of type ``ty``. +LLVM supports only ``ELEN=32`` or ``ELEN=64``, so ``vscale`` is defined as ``VLEN/64``. 
+This makes the LLVM IR types stable between the two ``ELEN`` values considered, i.e. every LLVM IR scalable vector type has exactly one corresponding pair of element type and LMUL, and vice versa. + ++-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+ +| | LMUL=⅛ | LMUL=¼ | LMUL=½ | LMUL=1 | LMUL=2 | LMUL=4 | LMUL=8 | ++===================+===============+================+==================+===================+===================+===================+===================+ +| i64 (ELEN=64) | N/A | N/A | N/A | <v x 1 x i64> | <v x 2 x i64> | <v x 4 x i64> | <v x 8 x i64> | ++-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+ +| i32 | N/A | N/A | <v x 1 x i32> | <v x 2 x i32> | <v x 4 x i32> | <v x 8 x i32> | <v x 16 x i32> | ++-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+ +| i16 | N/A | <v x 1 x i16> | <v x 2 x i16> | <v x 4 x i16> | <v x 8 x i16> | <v x 16 x i16> | <v x 32 x i16> | ++-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+ +| i8 | <v x 1 x i8> | <v x 2 x i8> | <v x 4 x i8> | <v x 8 x i8> | <v x 16 x i8> | <v x 32 x i8> | <v x 64 x i8> | ++-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+ +| double (ELEN=64) | N/A | N/A | N/A | <v x 1 x double> | <v x 2 x double> | <v x 4 x double> | <v x 8 x double> | ++-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+ +| float | N/A | N/A | <v x 1 x float> | <v x 2 x float> | <v x 4 x float> | <v x 8 x float> | <v x 16 x float> | ++-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+ +| half | N/A | <v x 1 x half> | <v x 2 x half> | <v x 4 x half> | <v x 8 x half> | <v x 16 x half> | <v x 32 x half> | ++-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+ + +(Read ``<v x k x ty>`` as ``<vscale x k x ty>``) + +One downside of this design is that it doesn’t allow vectors of i128 (that is, 
ELEN=128). +In that case the multiplier of ``vscale`` would have to be 1/2 under :math:`\text{LMUL}=1`. +This type (and its fp counterpart float128) is not that common, and in case of extreme necessity the types for :math:`\text{LMUL}=2` could be used instead. + +Additionally, this design prevents us from being able to compute a value for ``vscale`` when ``VLEN=32``, since ``VLEN/64`` is then not an integer. + +Mask vector types +----------------- + +As for mask vectors, they are physically represented using a layout of densely packed bits in a vector register. +They are mapped to the following LLVM IR types: + +- ``<vscale x 1 x i1>`` +- ``<vscale x 2 x i1>`` +- ``<vscale x 4 x i1>`` +- ``<vscale x 8 x i1>`` +- ``<vscale x 16 x i1>`` +- ``<vscale x 32 x i1>`` +- ``<vscale x 64 x i1>`` + +Two types with the same ratio SEW/LMUL will have the same related mask type. For instance, two different comparisons, one under SEW=64, LMUL=2 and the other under SEW=32, LMUL=1, will both generate a mask ``<vscale x 2 x i1>``. + +Register classes +================ + +There are four register classes for vectors: + +- ``VR`` for vector registers (``v0``, ``v1``, ..., ``v31``). Used when :math:`\text{LMUL} \leq 1` and for mask registers. +- ``VRM2`` for vector groups of length 2 i.e. :math:`\text{LMUL}=2` (``v0m2``, ``v2m2``, ..., ``v30m2``) +- ``VRM4`` for vector groups of length 4 i.e. :math:`\text{LMUL}=4` (``v0m4``, ``v4m4``, ..., ``v28m4``) +- ``VRM8`` for vector groups of length 8 i.e. :math:`\text{LMUL}=8` (``v0m8``, ``v8m8``, ..., ``v24m8``) + +:math:`\text{LMUL} \lt 1` types and mask types do not benefit from having a dedicated class, so ``VR`` is used in their case. + +.. _scalable vector codegen: + +Scalable Vector Codegen +======================= + +Let's consider a very simple case using a whole-register op (this example uses :math:`\text{LMUL}=2`): + +.. code-block:: llvm + + %c = add <vscale x 4 x i32> %a, %b + +From the above we get the following ISel DAG: + +.. code-block:: + + t5: nxv4i32 = add t2, t4 + +Which then gets selected as a pseudo instruction: + +.. 
code-block:: + + t6: nxv4i32 = PseudoVADD_VV_M2 t2, t4, TargetConstant:i32<-1>, TargetConstant:i32<5> + +Each vector instruction has multiple pseudo instructions defined in ``RISCVInstrInfoVPseudos.td``, with their patterns defined in ``RISCVInstrInfoVSDPatterns.td``. +For example, ``VADD_VV`` has pseudo instructions for ``PseudoVADD_VV_M1``, ``PseudoVADD_VV_M2``, and so on. + +The ``M2`` suffix means that we're operating on groups of :math:`\text{LMUL}=2`, and the ``VV`` suffix means we're doing a vector-vector operation (i.e. ``vadd.vv``). +Other suffixes include ``VX`` for vector-scalar and ``VI`` for vector-immediate. + +The first two operands ``t2`` and ``t4`` to the pseudo instruction are the inputs to the regular ``VADD_VV`` instruction, ``vs1`` and ``vs2`` respectively. + +The third is the AVL, i.e. how many elements do we want to operate on, and is of type ``XLenVT``. It's set to -1 here because we want to operate on all the elements. + +.. note:: + + Pseudo instructions ending in ``TU`` are executed in tail undisturbed mode (see :ref:`masks and tails`). + They take an additional merge operand which is a vector whose elements should be preserved in the tail. + +The last operand is SEW, which is encoded as ``5`` here. (``i32 = 2^5``) + +The AVL and SEW operands aren't actually part of the ``vadd.vv`` instruction, but instead are used by the ``RISCVInsertVSETVLI.cpp`` pass to insert the necessary ``vsetvli`` instruction in front of it, after which the MIR looks like this: + +.. code-block:: + + dead %3:gpr = PseudoVSETVLIX0 $x0, 209, implicit-def $vl, implicit-def $vtype + %2:vrm2 = PseudoVADD_VV_M2 %0:vrm2, %1:vrm2, -1, 5, implicit $vl, implicit $vtype + +Now the physical ``$vl`` and ``$vtype`` registers are set up correctly after being implicitly defined by the ``VSETVLI``, after which they are then implicitly used by the ``VADD``. +See ``RISCVVType::encodeVTYPE`` for details on how ``vtype`` is encoded (``209`` in this example). + +.. 
note:: + It is not necessary to emit a ``vsetvli`` instruction before every vector instruction if the current ``vl`` and ``vtype`` are still suitable for the intended vector operation, and ``RISCVInsertVSETVLI.cpp`` takes this into account: + It won't insert an instruction if neither ``vl`` nor ``vtype`` changes. + +After register allocation, the ``RISCVExpandPseudoInsts.cpp`` pass then expands out the ``PseudoVSETVLI``. + +.. code-block:: + + dead $x10 = VSETVLI $x0, 209, implicit-def $vtype, implicit-def $vl + renamable $v8m2 = PseudoVADD_VV_M2 killed renamable $v8m2, killed renamable $v10m2, -1, 5, implicit $vl, implicit $vtype + +Finally ``AsmPrinter`` lowers the pseudo instructions into real ``MCInsts``, discarding unneeded operands. +Note that the existing pseudo instruction remains until MCInst lowering. +See ``lowerRISCVVMachineInstrToMCInst`` for how the pseudo instruction is matched up with the actual instruction. + +.. code-block:: nasm + + vsetvli a0, zero, e32,m2,ta,ma + vadd.vv v8, v8, v10 + +Fixed Length Vector Codegen +=========================== + +As shown above, instruction selection works on scalable vectors, that is, vectors with a type like ``<vscale x n x ty>``. +Fixed-length vectors like ``<n x ty>`` therefore need to be converted to scalable vectors first. +To assist with this, an intermediate layer of nodes that take an explicit ``VL`` operand is used. +The nodes and their patterns are defined in ``RISCVInstrInfoVVLPatterns.td``. + +For example, take the following LLVM IR on a fixed-length vector of 4 elements: + +.. code-block:: llvm + + %x = add <4 x i32> %a, %b + +The initial ISel DAG will look like this: + +.. code-block:: + + t4: v4i32 = extract_subvector t2, Constant:i32<0> + t7: v4i32 = extract_subvector t6, Constant:i32<0> + t8: v4i32 = add t4, t7 + +But instead of being lowered to a ``PseudoVADD_VV``, it gets converted to a scalable vector and an ``ADD_VL`` SDNode is selected: + +.. 
code-block:: + + t15: nxv2i1 = RISCVISD::VMSET_VL Constant:i32<4> + t16: nxv2i32 = RISCVISD::ADD_VL t2, t6, undef:nxv2i32, t15, Constant:i32<4> + +These ``_VL`` suffixed nodes are counterparts to their pseudo instructions, but don't specify LMUL and are tagged with a ``VL`` operand, which is 4 here. +It will be later used by the pass inserting ``vsetvli`` so that it can statically set ``VL`` to the number of elements in the fixed-length vector. + +.. note:: + + Because the ``vadd`` can be masked, the third operand on this VL node is a merge operand that is used for undisturbed semantics (otherwise set to ``undef`` in this example). This operand is tied to the destination. If it is an actual value it entails ``tu,mu`` (see :ref:`masks and tails`). + + The following operand is a mask operand of type ````, which is set by ``VMSET``. + ``VMSET`` is a RISC-V pseudo instruction (not an LLVM pseudo instruction) that sets the destination register bits to all ones, so this is the equivalent of not using a mask. + Its operand is the AVL. + + The final operand is the explicit ``VL``, of type ``XLenVT``. + +It is then selected as the corresponding pseudo instruction with a suitable LMUL: + +.. code-block:: + + t15: nxv2i1 = PseudoVMSET_M_B2 TargetConstant:i32<4>, TargetConstant:i32<0> + t22: ch,glue = CopyToReg t0, Register:nxv2i1 $v0, t15 + t16: nxv2i32 = PseudoVADD_VV_M1_MASK undef:nxv2i32, t2, t6, Register:nxv2i1 $v0, TargetConstant:i32<4>, TargetConstant:i32<5>, TargetConstant:i32<1>, t22:1 + +During post-processing, ``RISCVDAGToDAGISel::doPeepholeMaskedRVV`` then detects that the mask in ``$v0`` is all ones and converts the masked form to the unmasked form: + +.. code-block:: + + t24: nxv2i32 = PseudoVADD_VV_M1 t2, t6, TargetConstant:i32<4>, TargetConstant:i32<5> + +Code generation then proceeds as normal as shown in :ref:`scalable vector codegen`. 
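As a rough cross-check of the operands seen in this section and in :ref:`scalable vector codegen`, the simple functional model of ``vsetvli`` and the ``vtype`` bit layout from the RVV specification can be sketched in Python. This is only an illustration under stated assumptions: the helper names (``vlmax``, ``vsetvli_model``, ``encode_vtype``, ``stripmine_vls``) are hypothetical and are not LLVM APIs, the sketch is not the implementation of ``RISCVVType::encodeVTYPE``, and the model ignores the latitude the specification gives implementations when choosing ``vl``.

```python
from fractions import Fraction

# vlmul field encodings (vtype[2:0]) as laid out in the RVV specification.
VLMUL_ENCODING = {
    Fraction(1): 0b000,
    Fraction(2): 0b001,
    Fraction(4): 0b010,
    Fraction(8): 0b011,
    Fraction(1, 8): 0b101,
    Fraction(1, 4): 0b110,
    Fraction(1, 2): 0b111,
}


def vlmax(vlen, sew, lmul):
    """Maximum vector length: LMUL * VLEN / SEW, rounded down."""
    return int(Fraction(lmul) * vlen / sew)


def vsetvli_model(avl, vlen, sew, lmul):
    """Simplified model of vsetvli: vl = min(AVL, LMUL * VLEN / SEW)."""
    return min(avl, vlmax(vlen, sew, lmul))


def encode_vtype(sew, lmul, tail_agnostic, mask_agnostic):
    """Pack vtype as vma (bit 7), vta (bit 6), vsew (bits 5:3), vlmul (bits 2:0)."""
    vsew = sew.bit_length() - 4  # log2(SEW) - 3: e8 -> 0, e16 -> 1, e32 -> 2, e64 -> 3
    vlmul = VLMUL_ENCODING[Fraction(lmul)]
    return (int(mask_agnostic) << 7) | (int(tail_agnostic) << 6) | (vsew << 3) | vlmul


def stripmine_vls(total, vlen, sew, lmul):
    """Return the vl chosen on each iteration of a stripmined loop."""
    if vlmax(vlen, sew, lmul) == 0:
        raise ValueError("no elements fit for this VLEN/SEW/LMUL combination")
    vls = []
    while total > 0:
        vl = vsetvli_model(total, vlen, sew, lmul)
        vls.append(vl)
        total -= vl  # mirrors "sub a0, a0, t0" in the stripmining loop
    return vls


# e32,m2,ta,ma is the vtype used by the whole-register vadd.vv example.
print(encode_vtype(32, 2, True, True))
# <4 x i32> with AVL = 4 under e32,m1 on a VLEN=128 machine.
print(vsetvli_model(4, 128, 32, 1))
# Ten 32-bit elements under e32,m1 on a VLEN=128 machine.
print(stripmine_vls(10, 128, 32, 1))
```

With these definitions, ``encode_vtype(32, 2, True, True)`` yields ``209``, the immediate in the ``PseudoVSETVLIX0`` example, and an AVL of 4 under ``e32,m1`` gives ``vl = 4`` for any ``VLEN`` of at least 128, which is why the ``VL`` of a fixed-length vector can be set statically.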
+ +Vector Predication instructions +=============================== + +Similarly to fixed-length vectors, vector predication intrinsics are lowered to ``VL`` nodes first. So the following ``@llvm.vp`` intrinsic call + +.. code-block:: llvm + + %x = call <vscale x 4 x i32> @llvm.vp.add.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, <vscale x 4 x i1> %m, i32 4) + +enters the DAG as a ``vp_add`` node: + +.. code-block:: + + t10: nxv4i32 = vp_add t2, t4, t6, t8 + +Which ``RISCVTargetLowering::lowerVPOp`` then lowers into the corresponding ``VL`` node: + +.. code-block:: + + t15: nxv4i32 = RISCVISD::ADD_VL t2, t4, undef:nxv4i32, t6, Constant:i32<4> + +And subsequently the corresponding masked pseudo instruction, where the mask is copied into ``$v0``: + +.. code-block:: + + t6: nxv4i1,ch = CopyFromReg t0, Register:nxv4i1 %2 + t20: ch,glue = CopyToReg t0, Register:nxv4i1 $v0, t6 + t16: nxv4i32 = PseudoVADD_VV_M2_MASK IMPLICIT_DEF:nxv4i32, t2, t4, Register:nxv4i1 $v0, t8, TargetConstant:i32<5>, TargetConstant:i32<1>, t20:1 + +References +========== + +.. [RVV] `RISC-V "V" Vector Extension `_ +.. [RVV-CodeGen-RFC] `[llvm-dev] [RFC] Code generation for RISC-V V-extension `_ +.. [SVE-RFC] `[RFC][SVE] Supporting SIMD instruction sets with variable vector lengths `_ diff --git a/llvm/docs/UserGuides.rst b/llvm/docs/UserGuides.rst --- a/llvm/docs/UserGuides.rst +++ b/llvm/docs/UserGuides.rst @@ -59,6 +59,7 @@ ResponseGuide Remarks RISCVUsage + RISCV/RISCVVectorExtension SourceLevelDebugging SPIRVUsage StackSafetyAnalysis @@ -261,3 +262,5 @@ :doc:`RISCVUsage` This document describes using the RISCV-V target. +:doc:`RISCV/RISCVVectorExtension` + This document describes how code is generated for the RISC-V Vector extension.