This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Model select and insertsubvector shuffle kinds
ClosedPublic

Authored by luke on Mar 23 2023, 10:58 AM.

Details

Summary

These types of shuffles get lowered as a vmerge with a potential
vslide{up,down}.
There is still room for improvement in modeling these types, including
concatenating shuffles which are designated SK_InsertSubvector. These
can be lowered to one vslideup provided the types are legal. (And
sometimes, if the concatenated shuffle is used in something like an
interleave shuffle, then it has no cost at all. But there isn't any way
to detect that from the target hooks.)

Diff Detail

Event Timeline

luke created this revision.Mar 23 2023, 10:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 23 2023, 10:58 AM
luke requested review of this revision.Mar 23 2023, 10:58 AM
luke updated this revision to Diff 507825.Mar 23 2023, 11:02 AM

Remove FIXME

reames accepted this revision.Mar 23 2023, 2:04 PM

LGTM - once all the earlier patches have been approved.

Note that you explicitly do *not* need to add the slideup only case to the costing for this LGTM. Doing so is a good idea, but this is much closer than the current code and is thus a worthwhile stepping stone regardless.

This revision is now accepted and ready to land.Mar 23 2023, 2:04 PM
luke updated this revision to Diff 508058.Mar 24 2023, 5:23 AM

Update costs to account for new vslideup patch

luke updated this revision to Diff 508121.Mar 24 2023, 9:06 AM

Update SLP vectorizer test case

luke updated this revision to Diff 508124.Mar 24 2023, 9:10 AM

Remove FIXME: The test case seems to be fully vectorizing the function, rather than half-vectorizing it as the comment states.

luke added inline comments.Mar 24 2023, 9:12 AM
llvm/test/Transforms/SLPVectorizer/RISCV/math-function.ll
204 ↗(On Diff #508124)

@reames Just want to draw your attention to these bits being vectorised now. Presumably this is caused by the above shuffles now having a cheaper cost.

luke updated this revision to Diff 508148.Mar 24 2023, 10:18 AM

clang-format

This revision was landed with ongoing or failed builds.Mar 24 2023, 10:31 AM
This revision was automatically updated to reflect the committed changes.
arcbbb added a subscriber: arcbbb.Mar 30 2023, 11:17 PM
arcbbb added inline comments.
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
299

Thanks for implementing this!
I have a question: LT is from Tp, is it supposed to use SubTp instead of Tp here?

luke added inline comments.Mar 31 2023, 2:52 AM
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
299

Good point, I'm not sure. Aarch64 seems to SubTp for costing their subvector inserts. Should we be using both legalisation costs?

luke added inline comments.Mar 31 2023, 3:03 AM
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
299

Using SubTp for the legalisation cost here gives us this diff for this test case

 define <8 x i64> @insert_subvector_offset_1_v8i64(<8 x i64> %v, <8 x i64> %w) {
 ; CHECK-LABEL: 'insert_subvector_offset_1_v8i64'
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %res = shufflevector <8 x i64> %v, <8 x i64> %w, <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 5, i32 6, i32 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %res = shufflevector <8 x i64> %v, <8 x i64> %w, <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 5, i32 6, i32 7>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: ret <8 x i64> %res
 ;
   %res = shufflevector <8 x i64> %v, <8 x i64> %w, <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 5, i32 6, i32 7>
   ret <8 x i64> %res
 }

This is what's actually generated:

insert_subvector_offset_1_v8i64:        # @insert_subvector_offset_1_v8i64
	.cfi_startproc
# %bb.0:
	vsetivli	zero, 5, e64, m4, tu, ma
	vslideup.vi	v8, v12, 1
	ret

It's using LMUL=4 here so I would presume we still want to cost it as 4 * one vslideup.

arcbbb added inline comments.Apr 1 2023, 11:05 AM
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
299

vsetivli zero, 5, e64, m4, tu, ma
vslideup.vi v8, v12, 1

I thought the operation was done after 2*VLEN was written even though LMUL is 4, but after having second thoughts, I think it really depends on HW implementation. So it is fine to me now.