These types of shuffles get lowered as a vmerge with a potential
vslide{up,down}.
There is still room for improvement in modeling these kinds of shuffles, including concatenating shuffles, which are classified as SK_InsertSubvector. Those can be lowered to a single vslideup provided the types are legal. (And sometimes, when the concatenated shuffle feeds something like an interleave shuffle, it has no cost at all, but there is currently no way to detect that from the target hooks.)
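For illustration (this is a made-up example, not one of the patch's test cases), the kind of concatenating shuffle meant here looks like the IR below: the mask inserts the low half of %b into %a at element index 4, so shuffle-cost queries would normally classify it as SK_InsertSubvector, and with legal types it should lower to a single vslideup.

```
; Hypothetical example: concatenate the low halves of %a and %b.
; The mask is an insert-subvector mask (low half of %b inserted into %a at
; element index 4), so TTI callers would typically report it as
; SK_InsertSubvector. With legal types this is expected to lower to one
; vslideup.vi (plus the usual vsetivli), with no vmerge needed.
define <8 x i32> @concat_low_halves(<8 x i32> %a, <8 x i32> %b) {
  %res = shufflevector <8 x i32> %a, <8 x i32> %b,
                       <8 x i32> <i32 0, i32 1, i32 2, i32 3,
                                  i32 8, i32 9, i32 10, i32 11>
  ret <8 x i32> %res
}
```

This mirrors the insert_subvector_offset_1_v8i64 test discussed in the inline comments below, where the whole shuffle is generated as a single vslideup.vi.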
Event Timeline
LGTM - once all the earlier patches have been approved.
Note that you explicitly do *not* need to add the slideup-only case to the costing for this LGTM. Doing so is a good idea, but this is already much closer to the actual lowering than the current code, and is thus a worthwhile stepping stone regardless.
Remove FIXME: The test case seems to be fully vectorizing the function, rather than half-vectorizing it as the comment states.
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp:299
Thanks for implementing this!
I have a question: LT is computed from Tp; is it supposed to use SubTp instead of Tp here?
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp:299
Good point, I'm not sure. AArch64 seems to use SubTp for costing their subvector inserts. Should we be using both legalisation costs?
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp:299
Using SubTp for the legalisation cost here gives us this diff for this test case:

```
define <8 x i64> @insert_subvector_offset_1_v8i64(<8 x i64> %v, <8 x i64> %w) {
; CHECK-LABEL: 'insert_subvector_offset_1_v8i64'
-; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %res = shufflevector <8 x i64> %v, <8 x i64> %w, <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 5, i32 6, i32 7>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %res = shufflevector <8 x i64> %v, <8 x i64> %w, <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 5, i32 6, i32 7>
; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret <8 x i64> %res
;
  %res = shufflevector <8 x i64> %v, <8 x i64> %w, <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 5, i32 6, i32 7>
  ret <8 x i64> %res
}
```

This is what's actually generated:

```
insert_subvector_offset_1_v8i64:        # @insert_subvector_offset_1_v8i64
	.cfi_startproc
# %bb.0:
	vsetivli	zero, 5, e64, m4, tu, ma
	vslideup.vi	v8, v12, 1
	ret
```

It's using LMUL=4 here so I would presume we still want to cost it as 4 * one vslideup.
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp:299
I thought the operation was done after 2*VLEN was written even though LMUL is 4, but on second thought I think it really depends on the HW implementation. So it is fine to me now.
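To make the 4-versus-2 above concrete, here is the LMUL arithmetic behind it (a back-of-the-envelope check assuming VLEN = 128; the concrete VLEN is my assumption for illustration):

\[
\texttt{v8i64}:\ 8 \times 64 = 512\ \text{bits} = 4 \times \text{VLEN} \Rightarrow \text{LMUL} = 4\ (\texttt{m4}), \qquad
\texttt{v4i64}:\ 4 \times 64 = 256\ \text{bits} = 2 \times \text{VLEN} \Rightarrow \text{LMUL} = 2\ (\texttt{m2})
\]

Costing the single vslideup by the legalized register group of the full type Tp therefore gives 4, while costing it by the inserted subvector type SubTp gives 2, matching the diff above. The open question in the thread is which of the two better reflects how much of the m4 register group the hardware actually touches when vl is only 5, and as noted above that appears to depend on the implementation.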