This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Improve tryToGatherExtractElements by using per-register analysis.
ClosedPublic

Authored by ABataev on Apr 20 2023, 4:04 PM.

Details

Summary

Currently tryToGatherExtractElements function analyzes the whole vector,
regrdless number of actual registers, used in this vector. It may
prevent some optimizations, because per-register analysis may allow to
simplify the final code by reusing more already emitted vectors and
better shuffles.

Diff Detail

Event Timeline

ABataev created this revision.Apr 20 2023, 4:04 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 20 2023, 4:04 PM
ABataev requested review of this revision.Apr 20 2023, 4:04 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 20 2023, 4:04 PM
ABataev updated this revision to Diff 516820.Apr 25 2023, 8:38 AM
ABataev updated this revision to Diff 538254.Jul 7 2023, 1:52 PM

Rebase, ping!

ABataev updated this revision to Diff 538738.Jul 10 2023, 10:51 AM

Rebase, ping!

ABataev updated this revision to Diff 541011.Jul 17 2023, 7:05 AM

Rebase, ping!

ABataev updated this revision to Diff 544358.Jul 26 2023, 7:16 AM

Rebase, ping!

ABataev updated this revision to Diff 546057.Aug 1 2023, 7:36 AM

Rebase, ping!!!
Required to unify the cost model + non-power-2.

ABataev updated this revision to Diff 547898.Aug 7 2023, 12:03 PM

Rebase, ping!

ABataev updated this revision to Diff 548999.Aug 10 2023, 5:54 AM

Rebase, ping!

ABataev updated this revision to Diff 552349.Aug 22 2023, 7:16 AM

Rebase, ping!

ABataev updated this revision to Diff 556448.Sep 11 2023, 10:15 AM

Rebase, ping!

rebase?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
557–558

Pull out NFC-ish refactors like this into pre-commits.

vdmitrie added inline comments.Oct 18 2023, 2:53 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10013

Could you extend the description too please? How input values are interpreted? What are output of the routine? Mask seems to serve both ways in and out - that needs to be described too.

10042

Please add a comment describing intent of the code below

10460

Use ScalarTy ?

ABataev updated this revision to Diff 557812.Oct 20 2023, 7:28 AM

Rebase, address comments

RKSimon added inline comments.Oct 27 2023, 4:26 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7225

Does enumerate make Idx + I references?

ABataev added inline comments.Oct 27 2023, 4:43 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7225

Will remove enumerate here, it is not needed. Yes, they both are refs

ABataev updated this revision to Diff 557911.Oct 27 2023, 6:20 AM

Rebase, remove enumerate where not required

RKSimon accepted this revision.Oct 30 2023, 9:50 AM

LGTM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
550

Add back this newline

This revision is now accepted and ready to land.Oct 30 2023, 9:50 AM

This seems to have caused a misoptimization in ffmpeg for aarch64.

To reproduce, you can follow these steps, on aarch64 Linux:

$ git clone https://github.com/ffmpeg/ffmpeg
$ mkdir ffmpeg-build
$ cd ffmpeg-build
$ ../ffmpeg/configure --cc=clang --samples=$(pwd)/../fate-samples
$ make fate-rsync
$ make -j$(nproc) fate-vp9-00-quantizer-18

The misoptimized object file is libavcodec/vp9dsp_8bpp.o.

The standalone preprocessed input for that object file is available at https://martin.st/temp/vp9dsp_8bpp-preproc.c, you can reproduce the misoptimization with clang -target aarch64-linux-gnu -c -O3 vp9dsp_8bpp-preproc.c -o vp9dsp_8bpp.o.

Can you look into this, and possibly revert if fixing takes some time?

This seems to have caused a misoptimization in ffmpeg for aarch64.

To reproduce, you can follow these steps, on aarch64 Linux:

$ git clone https://github.com/ffmpeg/ffmpeg
$ mkdir ffmpeg-build
$ cd ffmpeg-build
$ ../ffmpeg/configure --cc=clang --samples=$(pwd)/../fate-samples
$ make fate-rsync
$ make -j$(nproc) fate-vp9-00-quantizer-18

The misoptimized object file is libavcodec/vp9dsp_8bpp.o.

The standalone preprocessed input for that object file is available at https://martin.st/temp/vp9dsp_8bpp-preproc.c, you can reproduce the misoptimization with clang -target aarch64-linux-gnu -c -O3 vp9dsp_8bpp-preproc.c -o vp9dsp_8bpp.o.

Can you look into this, and possibly revert if fixing takes some time?

Hi, thanks for the report. Generally speaking, this change does he same, what InstCombiner does with extractelement/insertelement sequences. I'll check what's the cause. Do not know the reason yet, but most probably either TTI cost model problem, or codegen (lowering) problem.

This seems to have caused a misoptimization in ffmpeg for aarch64.

To reproduce, you can follow these steps, on aarch64 Linux:

$ git clone https://github.com/ffmpeg/ffmpeg
$ mkdir ffmpeg-build
$ cd ffmpeg-build
$ ../ffmpeg/configure --cc=clang --samples=$(pwd)/../fate-samples
$ make fate-rsync
$ make -j$(nproc) fate-vp9-00-quantizer-18

The misoptimized object file is libavcodec/vp9dsp_8bpp.o.

The standalone preprocessed input for that object file is available at https://martin.st/temp/vp9dsp_8bpp-preproc.c, you can reproduce the misoptimization with clang -target aarch64-linux-gnu -c -O3 vp9dsp_8bpp-preproc.c -o vp9dsp_8bpp.o.

Can you look into this, and possibly revert if fixing takes some time?

Compared the output, llvm ir output actually becomes smaller. I cannot do perf run. Looks like the lowering does not do the good job, need to create a ticket against AARCH64 codegen to improve it

This seems to have caused a misoptimization in ffmpeg for aarch64.

To reproduce, you can follow these steps, on aarch64 Linux:

$ git clone https://github.com/ffmpeg/ffmpeg
$ mkdir ffmpeg-build
$ cd ffmpeg-build
$ ../ffmpeg/configure --cc=clang --samples=$(pwd)/../fate-samples
$ make fate-rsync
$ make -j$(nproc) fate-vp9-00-quantizer-18

The misoptimized object file is libavcodec/vp9dsp_8bpp.o.

The standalone preprocessed input for that object file is available at https://martin.st/temp/vp9dsp_8bpp-preproc.c, you can reproduce the misoptimization with clang -target aarch64-linux-gnu -c -O3 vp9dsp_8bpp-preproc.c -o vp9dsp_8bpp.o.

Can you look into this, and possibly revert if fixing takes some time?

Compared the output, llvm ir output actually becomes smaller. I cannot do perf run. Looks like the lowering does not do the good job, need to create a ticket against AARCH64 codegen to improve it

I'm not saying the code became slower - I'm saying the code no longer produces the right result.

I'll push a revert for now, to unbreak things.

This seems to have caused a misoptimization in ffmpeg for aarch64.

To reproduce, you can follow these steps, on aarch64 Linux:

$ git clone https://github.com/ffmpeg/ffmpeg
$ mkdir ffmpeg-build
$ cd ffmpeg-build
$ ../ffmpeg/configure --cc=clang --samples=$(pwd)/../fate-samples
$ make fate-rsync
$ make -j$(nproc) fate-vp9-00-quantizer-18

The misoptimized object file is libavcodec/vp9dsp_8bpp.o.

The standalone preprocessed input for that object file is available at https://martin.st/temp/vp9dsp_8bpp-preproc.c, you can reproduce the misoptimization with clang -target aarch64-linux-gnu -c -O3 vp9dsp_8bpp-preproc.c -o vp9dsp_8bpp.o.

Can you look into this, and possibly revert if fixing takes some time?

Compared the output, llvm ir output actually becomes smaller. I cannot do perf run. Looks like the lowering does not do the good job, need to create a ticket against AARCH64 codegen to improve it

I'm not saying the code became slower - I'm saying the code no longer produces the right result.

I'll push a revert for now, to unbreak things.

Ah, ok, now I see. Ok, go ahead and revert it, I'll investigate it tomorrow.

hans added a subscriber: hans.Nov 6 2023, 4:56 AM

This causes asserts:

clang: /work/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:10082: Value *llvm::slpvectorizer::BoUpSLP::ShuffleInstructionBuilder::adjustExtracts(const TreeEntry *, MutableArrayRef<int>, unsigned int, bool &): Assertion `Part == 0 && "Expected firs part."' failed.

See https://bugs.chromium.org/p/chromium/issues/detail?id=1499846#c1 for a stand-alone reproducer.

I'll revert it for now.

we've bisected multiple test failures to this commit, trying to come up with a reduced repro

we've bisected multiple test failures to this commit, trying to come up with a reduced repro

Tanks for the report, please try to update the compiler, I fixed one bug in this patch already. If it still fails, please send the reproducer, will fix it ASAP

yeah, it repros with ToT after the reland

yeah, it repros with ToT after the reland

I mean, earlier today I landed another one fix, try to update the compiler and check if it still fails

yeah, it repros with ToT after the reland

I mean, earlier today I landed another one fix, try to update the compiler and check if it still fails

ah ok, 95703642e3ef617275fd80b5316b05c5a09c6219 does fix one of the tests I was looking at, thanks