This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Lower fixed vectors extract_vector_elt through stack at high LMUL
ClosedPublic

Authored by reames on Sep 1 2023, 12:49 PM.

Details

Summary

This is the extract side of D159332. The goal is to avoid non-linear costing on patterns where an entire vector is split back into scalars. This is an idiomatic pattern for SLP.

Each vslide operation is linear in LMUL on common hardware. (For instance, the sifive-x280 cost model models slides this way.) If we do VL unique extracts, each with a cost linear in LMUL, the overall cost is O(LMUL^2) * VLEN/ETYPE. To avoid the degenerate case, fall back to the stack if we're beyond LMUL2.
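To make the arithmetic above concrete, here is a rough sketch of the two cost shapes. It is a toy model, not the actual RISC-V cost model: the function names are hypothetical, every slide or whole-vector store is assumed to cost exactly LMUL units, every scalar load is assumed to cost 1 unit, and LMUL is assumed to be an integer.

```python
def num_elements(lmul, vlen_bits, elt_bits):
    """Number of elements in a vector that fills an LMUL register group."""
    return lmul * vlen_bits // elt_bits


def slide_explode_cost(lmul, vlen_bits, elt_bits):
    """One slide per element, each assumed to cost LMUL units."""
    return num_elements(lmul, vlen_bits, elt_bits) * lmul


def stack_explode_cost(lmul, vlen_bits, elt_bits):
    """One vector store (assumed LMUL units) plus one scalar load per element."""
    return lmul + num_elements(lmul, vlen_bits, elt_bits)


# Hypothetical configuration: VLEN=128, 64-bit elements, LMUL=8 (16 elements).
slide_cost = slide_explode_cost(8, 128, 64)
stack_cost = stack_explode_cost(8, 128, 64)
```

Under these assumptions, doubling LMUL quadruples the slide-based cost, while the stack-based cost only doubles.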

There's a subtlety here. For this to work, we're *relying* on an optimization in LegalizeDAG which tries to reuse the stack slot from a previous extract. In practice, this appears to trigger for patterns within a block, but if we ended up with an explode idiom split across multiple blocks, we'd still be in quadratic territory. I don't think that variant is fixable within SDAG.
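The effect of the stack slot reuse can be illustrated with a toy count of vector stores. This is only a sketch under simplifying assumptions, not a description of LegalizeDAG: it assumes exactly one vector store per basic block that contains at least one extract, a store cost of LMUL units, and a scalar load cost of 1 unit. The block labels and function names are hypothetical.

```python
def count_vector_stores(extract_blocks):
    """extract_blocks holds one block label per extract of the same vector.

    Assumes the stored copy is reused within a block but not across blocks.
    """
    return len(set(extract_blocks))


def stack_cost(extract_blocks, lmul):
    """Assumed cost: LMUL per vector store plus 1 per scalar load."""
    return count_vector_stores(extract_blocks) * lmul + len(extract_blocks)


# 16 extracts in a single block: one store, 16 loads.
same_block = stack_cost(["bb0"] * 16, lmul=8)

# 16 extracts spread over 16 blocks: 16 stores, 16 loads.
split_blocks = stack_cost([f"bb{i}" for i in range(16)], lmul=8)
```

In this model the single-block case stays linear in the number of elements, while the fully split case pays one LMUL-sized store per extract again.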

It's tempting to think we can do better than going through the stack, but, well, I haven't found it yet, if it exists. Here are the results for sifive-x280 on all the variants I wrote (all 16 x i64 with V):

output/sifive-x280/linear_decomp_with_slidedown.mca:Total Cycles:      20703
output/sifive-x280/linear_decomp_with_vrgather.mca:Total Cycles:      23903
output/sifive-x280/naive_linear_with_slidedown.mca:Total Cycles:      21604
output/sifive-x280/naive_linear_with_vrgather.mca:Total Cycles:      22804
output/sifive-x280/recursive_decomp_with_slidedown.mca:Total Cycles:      15204
output/sifive-x280/recursive_decomp_with_vrgather.mca:Total Cycles:      18404
output/sifive-x280/stack_by_vreg.mca:Total Cycles:      12104
output/sifive-x280/stack_element_by_element.mca:Total Cycles:      4304

I am deliberately excluding scalable vectors. It functionally works, but frankly, the code quality for an idiomatic explode loop is so terrible either way that it felt better to leave that for future work.

Event Timeline

reames created this revision. Sep 1 2023, 12:49 PM
reames requested review of this revision. Sep 1 2023, 12:49 PM
Herald added a project: Restricted Project.
reames updated this revision to Diff 555508. Sep 1 2023, 2:17 PM

(Correct patch this time)

Just to make sure I understand. For a full explode, we still handle the lower LMUL2 portion of the vector using slides, but the upper portion will use vector store plus scalar loads?

reames added a comment. Sep 5 2023, 8:28 AM

Just to make sure I understand. For a full explode, we still handle the lower LMUL2 portion of the vector using slides, but the upper portion will use vector store plus scalar loads?

That's correct. This isn't the goal per se, it just happens to fall out of the existing code structure and be "good enough".
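The split described in this exchange can be sketched as a per-index decision. This is a simplified model rather than the patch's code: the function names are hypothetical, and it assumes the number of elements covered by an LMUL2 register group is computed from a fixed VLEN (128 bits here, chosen only for illustration).

```python
def lowering_for_index(idx, vlen_bits=128, elt_bits=64):
    """Return which strategy a constant-index extract would use in this model."""
    # Elements that fit in an LMUL2 register group.
    elts_per_m2 = 2 * vlen_bits // elt_bits
    return "slide" if idx < elts_per_m2 else "stack"


def explode_plan(num_elts, vlen_bits=128, elt_bits=64):
    """Strategy for each element of a full explode of a fixed vector."""
    return [lowering_for_index(i, vlen_bits, elt_bits) for i in range(num_elts)]


# 16 x i64 with the assumed VLEN=128: indices 0-3 slide, 4-15 go through the stack.
plan = explode_plan(16)
```

With these assumed parameters, only the first four elements use slides; the remaining twelve share the vector store and use scalar loads.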

This revision is now accepted and ready to land. Sep 5 2023, 8:56 AM