Currently tryToGatherExtractElements function analyzes the whole vector,
regrdless number of actual registers, used in this vector. It may
prevent some optimizations, because per-register analysis may allow to
simplify the final code by reusing more already emitted vectors and
better shuffles.
Details
- Reviewers
RKSimon vdmitrie dmgreen - Commits
- rGac254fc05598: [SLP]Improve tryToGatherExtractElements by using per-register analysis.
rG9dfdbd788707: [SLP]Improve tryToGatherExtractElements by using per-register analysis.
rG3e6d7c6d983d: [SLP]Improve tryToGatherExtractElements by using per-register analysis.
rG0a34aaedd8ec: [SLP]Improve tryToGatherExtractElements by using per-register analysis.
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
rebase?
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
557–558 | Pull out NFC-ish refactors like this into pre-commits. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
10013 | Could you extend the description too please? How input values are interpreted? What are output of the routine? Mask seems to serve both ways in and out - that needs to be described too. | |
10042 | Please add a comment describing intent of the code below | |
10460 | Use ScalarTy ? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
7225 | Does enumerate make Idx + I references? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
7225 | Will remove enumerate here, it is not needed. Yes, they both are refs |
This seems to have caused a misoptimization in ffmpeg for aarch64.
To reproduce, you can follow these steps, on aarch64 Linux:
$ git clone https://github.com/ffmpeg/ffmpeg $ mkdir ffmpeg-build $ cd ffmpeg-build $ ../ffmpeg/configure --cc=clang --samples=$(pwd)/../fate-samples $ make fate-rsync $ make -j$(nproc) fate-vp9-00-quantizer-18
The misoptimized object file is libavcodec/vp9dsp_8bpp.o.
The standalone preprocessed input for that object file is available at https://martin.st/temp/vp9dsp_8bpp-preproc.c, you can reproduce the misoptimization with clang -target aarch64-linux-gnu -c -O3 vp9dsp_8bpp-preproc.c -o vp9dsp_8bpp.o.
Can you look into this, and possibly revert if fixing takes some time?
Hi, thanks for the report. Generally speaking, this change does he same, what InstCombiner does with extractelement/insertelement sequences. I'll check what's the cause. Do not know the reason yet, but most probably either TTI cost model problem, or codegen (lowering) problem.
Compared the output, llvm ir output actually becomes smaller. I cannot do perf run. Looks like the lowering does not do the good job, need to create a ticket against AARCH64 codegen to improve it
I'm not saying the code became slower - I'm saying the code no longer produces the right result.
I'll push a revert for now, to unbreak things.
This causes asserts:
clang: /work/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:10082: Value *llvm::slpvectorizer::BoUpSLP::ShuffleInstructionBuilder::adjustExtracts(const TreeEntry *, MutableArrayRef<int>, unsigned int, bool &): Assertion `Part == 0 && "Expected firs part."' failed.
See https://bugs.chromium.org/p/chromium/issues/detail?id=1499846#c1 for a stand-alone reproducer.
I'll revert it for now.
we've bisected multiple test failures to this commit, trying to come up with a reduced repro
Tanks for the report, please try to update the compiler, I fixed one bug in this patch already. If it still fails, please send the reproducer, will fix it ASAP
I mean, earlier today I landed another one fix, try to update the compiler and check if it still fails
ah ok, 95703642e3ef617275fd80b5316b05c5a09c6219 does fix one of the tests I was looking at, thanks
Add back this newline