The default expansion for buildvectors extracts each element and inserts it into a new vector, which involves a lot of copying to and from GPRs. TBL3 and TBL4 can be relatively slow instructions, and the mask needs to be loaded from a constant pool, but they are still better than all the moves to and from GPRs.
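To illustrate the trade-off described above, here is a minimal sketch in portable C that models a multi-register TBL lookup next to an element-by-element build. It is not the lowering code from this patch: the names `tbl_model`, `build_by_elements` and `VEC_BYTES` are hypothetical, and the only hardware behaviour assumed is that TBL returns 0 for an index past the end of the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VEC_BYTES = 16 }; /* one 128-bit vector register */

/* Hypothetical model of TBL over `num_regs` consecutive table registers
   (3 for TBL3, 4 for TBL4), stored back to back in `table`.
   Result byte i is table[idx[i]]; an out-of-range index yields 0. */
void tbl_model(uint8_t out[VEC_BYTES], const uint8_t *table,
               size_t num_regs, const uint8_t idx[VEC_BYTES])
{
    assert(num_regs >= 1 && num_regs <= 4);
    const size_t table_bytes = num_regs * VEC_BYTES;
    for (size_t i = 0; i < VEC_BYTES; ++i)
        out[i] = idx[i] < table_bytes ? table[idx[i]] : 0;
}

/* Hypothetical model of the default expansion: every element is
   extracted to a scalar and then inserted into the result, one
   round trip per element. All indices must be in range here. */
void build_by_elements(uint8_t out[VEC_BYTES], const uint8_t *table,
                       size_t num_regs, const uint8_t idx[VEC_BYTES])
{
    for (size_t i = 0; i < VEC_BYTES; ++i) {
        assert(idx[i] < num_regs * VEC_BYTES);
        uint8_t scalar = table[idx[i]]; /* vector -> scalar move */
        out[i] = scalar;                /* scalar -> vector move */
    }
}
```

With an index mask such as 0, 4, 8, ..., 60 (the low byte of each 32-bit lane across four source registers), both functions produce the same 16 bytes. The difference is that the table form needs one mask and one lookup, while the element form needs 16 extract/insert pairs.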
Event Timeline
Comment Actions
The misoptimization can be triggered with this standalone C file: https://martin.st/temp/dctref-preproc.c
Compile it with: clang -target aarch64-linux-gnu -c -O3 dctref-preproc.c
For a full repro, you can follow these steps:
    git clone git://source.ffmpeg.org/ffmpeg
    cd ffmpeg
    ./configure --cc=clang
    make -j$(nproc) fate-idct8x8-0
(The misoptimized object file is libavcodec/dctref.o.)
Comment Actions
I've recommitted with a fix (hopefully). Thanks for the reproducer, please let me know if anything else shows up as incorrect.
that -> than