This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Lower 3 and 4 sources buildvectors to TBL
ClosedPublic

Authored by dmgreen on Mar 7 2022, 10:19 AM.

Details

Summary

The default expansion for buildvectors is to extract each element and insert it into a new vector. That involves a lot of copying to/from GPRs. TBL3 and TBL4 can be relatively slow instructions, and the mask needs to be loaded from a constant pool, but they are at least better than all the moves to/from GPRs.
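As a rough illustration of why a single table lookup can replace the per-element extract/insert sequence, the sketch below emulates the byte-lookup semantics of a multi-register TBL in plain C++. It is not the lowering code from this patch; the names `V16` and `tbl` are hypothetical and exist only for this example. It assumes 16-byte source registers and the usual TBL behaviour that an out-of-range index produces 0.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 128-bit vector register viewed as 16 bytes.
using V16 = std::array<std::uint8_t, 16>;

// Emulates a TBL with N source registers (N = 3 or 4 for TBL3/TBL4).
// The sources are treated as one concatenated byte table of size 16 * N.
// Each mask byte selects one table byte; out-of-range indices yield 0.
template <std::size_t N>
V16 tbl(const std::array<V16, N> &srcs, const V16 &mask) {
  V16 out{}; // zero-initialised, so out-of-range lanes stay 0
  for (std::size_t i = 0; i < out.size(); ++i) {
    std::size_t idx = mask[i];
    if (idx < 16 * N)
      out[i] = srcs[idx / 16][idx % 16];
  }
  return out;
}
```

In this model, a buildvector whose elements come from lane 0 of four different sources needs only the mask bytes 0, 16, 32 and 48, instead of four extracts to GPRs and four inserts back.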


Event Timeline

dmgreen created this revision. Mar 7 2022, 10:19 AM
Herald added a project: Restricted Project. Mar 7 2022, 10:19 AM
dmgreen requested review of this revision. Mar 7 2022, 10:19 AM
samtebbs added inline comments. Mar 10 2022, 2:12 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9064

that -> than

9076

This rather complex calculation could do with a comment.

dmgreen updated this revision to Diff 417215. Mar 22 2022, 2:03 AM

Update comments and other cleanup.

This revision is now accepted and ready to land. Mar 23 2022, 9:05 AM
This revision was landed with ongoing or failed builds. Mar 24 2022, 3:02 AM
This revision was automatically updated to reflect the committed changes.

FYI, heads-up: I'm seeing a misoptimization introduced by this commit.


The misoptimization can be triggered within this standalone C file: https://martin.st/temp/dctref-preproc.c
Compiled with clang -target aarch64-linux-gnu -c -O3 dctref-preproc.c

For a full repro, you can follow these steps:

git clone git://source.ffmpeg.org/ffmpeg
cd ffmpeg
./configure --cc=clang
make -j$(nproc) fate-idct8x8-0

(The misoptimized object file is libavcodec/dctref.o.)

Thanks. I'll take a look.

I've recommitted with a fix (hopefully). Thanks for the reproducer, please let me know if anything else shows up as incorrect.