This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Cap build vector cost to avoid quadratic cost at high LMULs
ClosedPublic

Authored by reames on Aug 31 2023, 8:35 PM.

Details

Summary

(Still somewhat WIP - posted for feedback, and frankly to grab a phab revision)

Each vslide operation is linear in LMUL on common hardware. (For instance, the sifive-x280 cost model models slides this way.) If we do a VL unique inserts, each with a cost linear in LMUL, the overall cost is O(LMUL2) * VLEN/ETYPE. To avoid the degenerate case, fallback to the stack if the cost is more than a fixed (linear) threshold.

For context, here's the sifive-x280 llvm-mca results for the current lowering and stack based lowering for each LMUL (using e64). Assumes code was compiled for V (i.e. zvl128b).
output/sifive-x280/buildvector_m1_via_stack.mca:Total Cycles: 1904
output/sifive-x280/buildvector_m2_via_stack.mca:Total Cycles: 2104
output/sifive-x280/buildvector_m4_via_stack.mca:Total Cycles: 2504
output/sifive-x280/buildvector_m8_via_stack.mca:Total Cycles: 3304
output/sifive-x280/buildvector_m1_via_vslidedown.mca:Total Cycles: 804
output/sifive-x280/buildvector_m2_via_vslidedown.mca:Total Cycles: 1604
output/sifive-x280/buildvector_m4_via_vslide1down.mca:Total Cycles: 6400
output/sifive-x280/buildvector_m8_via_vslide1down.mca:Total Cycles: 25599

There are other schemes we could use to cap the cost. The next best is recursive decomposition of the vector into smaller LMULs. That's still quadratic, but with a better constant. However, stack based seems to cost better on all LMULs, so we can just go with the simpler scheme.

Arguably, this patch is fixing a regression introduced with my D149667 as before that change, we'd always fallback to the stack, and thus didn't have the non-linearity.

Diff Detail