(Still somewhat WIP - posted for feedback, and frankly to grab a phab revision)
Each vslide operation is linear in LMUL on common hardware. (For instance, the sifive-x280 cost model models slides this way.) If we do a VL unique inserts, each with a cost linear in LMUL, the overall cost is O(LMUL2) * VLEN/ETYPE. To avoid the degenerate case, fallback to the stack if the cost is more than a fixed (linear) threshold.
For context, here's the sifive-x280 llvm-mca results for the current lowering and stack based lowering for each LMUL (using e64). Assumes code was compiled for V (i.e. zvl128b).
output/sifive-x280/buildvector_m1_via_stack.mca:Total Cycles: 1904
output/sifive-x280/buildvector_m2_via_stack.mca:Total Cycles: 2104
output/sifive-x280/buildvector_m4_via_stack.mca:Total Cycles: 2504
output/sifive-x280/buildvector_m8_via_stack.mca:Total Cycles: 3304
output/sifive-x280/buildvector_m1_via_vslidedown.mca:Total Cycles: 804
output/sifive-x280/buildvector_m2_via_vslidedown.mca:Total Cycles: 1604
output/sifive-x280/buildvector_m4_via_vslide1down.mca:Total Cycles: 6400
output/sifive-x280/buildvector_m8_via_vslide1down.mca:Total Cycles: 25599
There are other schemes we could use to cap the cost. The next best is recursive decomposition of the vector into smaller LMULs. That's still quadratic, but with a better constant. However, stack based seems to cost better on all LMULs, so we can just go with the simpler scheme.
Arguably, this patch is fixing a regression introduced with my D149667 as before that change, we'd always fallback to the stack, and thus didn't have the non-linearity.
space after switch