shuffle_vector instructions are serialized when targeting SVE fixed-length vectors; see https://reviews.llvm.org/D139111. This patch disables the optimizeExtendOrTruncateConversion peepholes that generate shuffle_vector.
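As a rough illustration of the "bail out early" shape of this change, here is a standalone model. It is not the LLVM code: `SubtargetModel`, its field, and `shouldApplyExtendTruncPeephole` are hypothetical names invented for this sketch, and the real check lives in AArch64ISelLowering.cpp.

```cpp
// Hypothetical stand-in for the subtarget query; not an LLVM type.
struct SubtargetModel {
  bool usesSVEForFixedVectors; // assumption: true when fixed vectors are lowered via SVE
};

// Hypothetical model of the peephole's entry check.
// Returns true if the shuffle_vector-based rewrite should be attempted.
bool shouldApplyExtendTruncPeephole(const SubtargetModel &ST) {
  // Bail out early: shuffle_vector is serialized when targeting SVE
  // fixed vectors, so the rewrite is skipped entirely in that case.
  if (ST.usesSVEForFixedVectors)
    return false;
  return true; // otherwise the existing peephole logic would run
}
```

The point of the sketch is only that the check sits at the top of the function, so nothing else in the peephole runs for SVE fixed vectors.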
Repository: rG LLVM Github Monorepo
Event Timeline
Hi Zino,
Looks like the test case is failing? See https://reviews.llvm.org/harbormaster/unit/view/5711271/
I am wondering how big of a hammer this is. Are there no cases at all where doing this is beneficial?
Is the test case minimal? I think I see a loop, which we don't need. To see the codegen differences, it would be good to precommit a test, but then we want it to be more minimal, if that is possible.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
14181 | Nit: I would prefer a pointer to source-code (e.g. a function name) rather than a hyperlink, or just omit it if it is obvious. |
It looks like this patch is SVE-related, rather than SME. Can you change the title from [AARCH64][SME] to [AARCH64][SVE] please? Thanks!
Thanks for reducing the test case. I think bailing out early here makes sense indeed, so LGTM.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
14178 | For clarification, is this a bad optimisation when SVE is available? Or is it the case that the current code generation for SVE is suboptimal? |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
14178 | In a nutshell, I don't know. Is the zext peephole that relies on the tbl instruction always profitable, even when targeting NEON? I have limited access to ARM hardware, and this peephole performs slower on my small toy test. On SVE, with the peephole disabled, uunpklo is generated and it is fast enough. Please feel free to propose better code generation options. Thanks. |
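For readers who want to reproduce this kind of comparison, a toy kernel might look like the sketch below. It is an assumption, not the test case from this patch: `widen_u8_to_u32` is a hypothetical name, and whether the compiler emits tbl, uunpklo, or something else depends on the compiler version, target flags, and vectorizer decisions.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical toy kernel: zero-extend each 8-bit element to 32 bits.
// When vectorized, the cast becomes a vector zext, which is the kind of
// conversion the peephole discussed above rewrites.
void widen_u8_to_u32(const std::uint8_t *src, std::uint32_t *dst,
                     std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    dst[i] = static_cast<std::uint32_t>(src[i]); // zext i8 -> i32
}
```

Inspecting the generated assembly for this function with and without the patch is one way to see the codegen difference.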
I have double checked with Zino that a significant performance uplift was observed for a benchmark app, measured on SVE hardware.
While Zino is in the process of getting LLVM commit rights, I am happy/confident to land this on his behalf, as the codegen improvement for the examples I have seen is obvious. I think we can iterate on this should there be other/better things to do in this area.