If we have a known (or bounded) index which definitely fits within a smaller LMUL register group, we can reduce the LMUL of the slide and extract instructions. This loosens constraints on register allocation and allows the hardware to do less work, at the potential cost of some additional VTYPE toggles. In practice, we appear (after prior patches) to do a decent job of eliminating the additional VTYPE toggles in most cases.
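To make the "definitely fits" condition concrete, here is a minimal sketch of the arithmetic. It is not the code in this patch: the helper name `smallestLMulForIndex` and its parameters are hypothetical, and it assumes the only thing known about the hardware is a guaranteed minimum VLEN (for example 128 bits).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, for illustration only (not an LLVM API).
// Returns the smallest integer LMUL (1, 2, 4 or 8) whose register group is
// guaranteed to contain element MaxIdx, given a guaranteed minimum VLEN in
// bits and the element width in bits. Returns 0 if even m8 is not enough.
unsigned smallestLMulForIndex(uint64_t MaxIdx, unsigned MinVLenBits,
                              unsigned EltBits) {
  assert(EltBits != 0 && MinVLenBits >= EltBits && "invalid configuration");
  // Elements guaranteed to be present in a single vector register.
  const uint64_t EltsPerReg = MinVLenBits / EltBits;
  // Stop at m1 on the low end, matching the choice described in this patch.
  for (unsigned LMul = 1; LMul <= 8; LMul *= 2)
    if (MaxIdx < EltsPerReg * LMul)
      return LMul;
  return 0;
}
```

With a minimum VLEN of 128 and 32-bit elements, each register holds at least four elements, so an extract at index 3 can be done at m1 while index 4 needs m2. Using the minimum VLEN is the conservative choice: a larger actual VLEN only makes the group hold more elements.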
A couple of side notes:
- I stopped at m1 here. For machines with a DLEN < VLEN, we should probably be doing mf2, but we need to make that change a bit more globally as well.
- Arguably, we should be narrowing the LMUL of *most* operations whose inputs and outputs provably don't care about the wider register group. We've got a few selected cases, but maybe it's time to build something more general? (Definitely future work!)