This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Custom ISelLowering for 256b `shuffle_vector v, undef, <1, 1, 0, 0>`
Needs Review · Public

Authored by cameron.mcinally on May 3 2023, 7:57 AM.

Details

Summary

This is more of a bug report than an earnest patch...

We've found a couple of inefficiently lowered shuffles when targeting neoverse-v1 and VLS. This patch covers:

%x = shufflevector <2 x double> %v, <2 x double> poison, <4 x i32> <i32 1, i32 1, i32 0, i32 0>

It could be lowered in a number of ways, but I chose:

zip1 z0.d, z0.d, z0.d
ext z0.b, z0.b, z0.b, #16
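
To make the lane movement in that sequence easier to follow, here is a small lane-level model in Python. It is only a sketch: it assumes a 256-bit vector length (four 64-bit lanes), models `ext` only for whole-lane byte offsets, and the helper names `zip1` and `ext` are illustrative stand-ins for the instructions, not LLVM APIs.

```python
# Lane-level model of the two instructions, assuming a 256-bit vector length
# (four 64-bit lanes). Helper names are illustrative, not LLVM APIs.

def zip1(a, b):
    """Interleave the low halves of a and b: a0, b0, a1, b1, ..."""
    half = len(a) // 2
    out = []
    for x, y in zip(a[:half], b[:half]):
        out += [x, y]
    return out

def ext(a, b, byte_offset, lane_bytes=8):
    """Concatenate a:b and take one vector's worth starting at byte_offset."""
    assert byte_offset % lane_bytes == 0  # this model handles whole lanes only
    start = byte_offset // lane_bytes
    return (a + b)[start:start + len(a)]

# %v occupies the low two lanes; the upper two lanes are don't-care.
z0 = ["v0", "v1", "undef", "undef"]
after_zip = zip1(z0, z0)                  # low half interleaved with itself
after_ext = ext(after_zip, after_zip, 16) # rotate by two 8-byte lanes

# Reference result computed directly from the shuffle mask <1, 1, 0, 0>.
expected = [z0[i] for i in (1, 1, 0, 0)]
```

In this model `zip1` duplicates each of the two source lanes in place, and the 16-byte `ext` of the register with itself then swaps the two 128-bit halves, which matches the mask.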

The new lowering shows an 11% performance boost on 538.namd with our out-of-tree compiler.

Event Timeline

cameron.mcinally requested review of this revision. May 3 2023, 7:57 AM
Herald added a project: Restricted Project. May 3 2023, 7:57 AM
cameron.mcinally edited the summary of this revision. May 3 2023, 7:59 AM

For NEON, we would use the PerfectShuffle tables for something like this... should we try to use those tables here? I mean, I guess it's kind of narrow to implement perfect shuffle tables specifically for <4 x double>, but it might make sense...

We should probably also consider implementing a general-purpose fallback for shuffling that doesn't involve the stack. For a shuffle with one source, we can use tbl; I guess for the general case we'd have to use tbl+tbl+orr. (Sort of messy, but almost certainly better than the default fallback of storing to the stack element by element.)
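
The tbl-based fallback described above can be sketched at the lane level in Python. This is a rough model under stated assumptions, not backend code: it assumes four 64-bit lanes, treats lane values as integer bit patterns, relies on SVE `tbl` producing zero for an out-of-range index, and uses hypothetical helper names (`tbl`, `shuffle_one_source`, `shuffle_two_sources`).

```python
# Lane-level model of an SVE TBL-based shuffle fallback, assuming four
# 64-bit lanes (VL = 256 bits). Helper names are hypothetical.

LANE_MASK = (1 << 64) - 1  # indices live in unsigned 64-bit lanes

def tbl(src, idx):
    """Each result lane is src[idx[i]], or 0 when the index is out of range."""
    return [src[i] if 0 <= i < len(src) else 0 for i in idx]

def shuffle_one_source(src, mask):
    """Single-source shuffle: one table lookup with the mask as indices."""
    return tbl(src, mask)

def shuffle_two_sources(a, b, mask):
    """tbl + tbl + orr: each index is in range for exactly one lookup."""
    n = len(a)
    from_a = tbl(a, mask)
    # Subtracting n wraps in unsigned lane arithmetic, so indices that
    # select from `a` become huge (out of range) here and yield 0.
    from_b = tbl(b, [(i - n) & LANE_MASK for i in mask])
    return [x | y for x, y in zip(from_a, from_b)]

a = [0x10, 0x11, 0x12, 0x13]
b = [0x20, 0x21, 0x22, 0x23]
one_src = shuffle_one_source(a, [1, 1, 0, 0])
two_src = shuffle_two_sources(a, b, [1, 6, 0, 7])
```

Because every mask index is in range for exactly one of the two lookups, the other lookup contributes zero in that lane and the `orr` simply merges the two partial results. The cost of materializing the index vectors is not modeled here.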

@efriedma, I agree. There's a more general solution out there, but I'm too far removed from the AArch64 backend to see it.

The other interesting case from 538.namd is:

shufflevector <2 x double> %v, <2 x double> poison, <4 x i32> <i32 1, i32 0, i32 1, i32 0>

And I suspect there are more that I haven't found yet, especially from Complex.

Matt added a subscriber: Matt. May 5 2023, 2:38 PM