This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Add detection of shuffled/perfect matching of tree entries.
ClosedPublic

Authored by ABataev on Apr 14 2021, 11:16 AM.

Details

Summary

SLP supports perfect diamond matching for the vectorized tree entries
but do not support it for gathered entries and does not support
non-perfect (shuffled) matching with 1 or 2 tree entries. Patch adds
support for this matching to improve cost of the vectorized tree.

Diff Detail

Event Timeline

ABataev created this revision.Apr 14 2021, 11:16 AM
ABataev requested review of this revision.Apr 14 2021, 11:16 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 14 2021, 11:16 AM
RKSimon retitled this revision from [SLP]Add detectyion of shuffled/perfect matching of tree entries. to [SLP] Add detection of shuffled/perfect matching of tree entries..Apr 15 2021, 3:16 AM

A few comments along the theme of reducing code duplication

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4290

This is the same as the getTreeEntry code above - merge them?

4305

This shuffle kind decode code feels like the kind of thing it should be somewhere like Analysis\VectorUtils.h ?

4599

Can't we merge these? if we assert that (Entries.size() == 1 || Entries.size() == 2) then Entries,.back() will return Entries.front() for a unary shuffle and Builder.CreateShuffleVector should handle the canonicalization.

ABataev updated this revision to Diff 338089.Apr 16 2021, 6:44 AM
ABataev marked an inline comment as done.

Rebase + address comments.

RKSimon accepted this revision.Apr 16 2021, 7:30 AM

LGTM, cheers - I will looking at moving the shuffle kind decode out in a separate commit

This revision is now accepted and ready to land.Apr 16 2021, 7:30 AM

Seeing a miscompile after this has landed. Trying to reduce a test case, but no luck so far.

Seeing a miscompile after this has landed. Trying to reduce a test case, but no luck so far.

Thanks for the report, waiting for the reproducer.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4305

Yes, that would be good, just need to agree where we should this, in getShuffleCost or somewhere else?

Seeing a miscompile after this has landed. Trying to reduce a test case, but no luck so far.

Looks like I know the cause of the problem, will try to prepare a fix later today

Seeing a miscompile after this has landed. Trying to reduce a test case, but no luck so far.

Hmm, looks like still need a reproducer because the possible problem I thought to exist in the patch, was fixed already. So, a possible miscompile is caused by another problem I don't currently see. Still need a reproducer. :(

Seeing a miscompile after this has landed. Trying to reduce a test case, but no luck so far.

Any updates about a reproducer?

phawkins added a subscriber: phawkins.EditedApr 25 2021, 11:28 AM

Seeing a miscompile after this has landed. Trying to reduce a test case, but no luck so far.

Any updates about a reproducer?

Here's the smallest reproducer we have so far.

Grab:

and run (on a machine that supports AVX 512):

llvm-project/build/bin/clang  module.ll  driver.cc -o repro -march=skylake-avx512 -O3 -lstdc++ -lm
./repro buffer-assignment.txt

Before this PR (at the preceding commit) this repro produces:

Output:
0.502001, -0.462127, 0.459155, 0.10126, -0.0533019, -0.557239

After this PR, this repro produces:

Output:
0.519662, -0.450291, 0.492509, 0.0910437, -0.0563238, -0.522649

We expect this procedure to be accurate to something more like 1e-10, so this is a significant difference.

Seeing a miscompile after this has landed. Trying to reduce a test case, but no luck so far.

Any updates about a reproducer?

Here's the smallest reproducer we have so far.

Grab:

and run (on a machine that supports AVX 512):

llvm-project/build/bin/clang  module.ll  driver.cc -o repro -march=skylake-avx512 -O3 -lstdc++ -lm
./repro buffer-assignment.txt

Before this PR (at the preceding commit) this repro produces:

Output:
0.502001, -0.462127, 0.459155, 0.10126, -0.0533019, -0.557239

After this PR, this repro produces:

Output:
0.519662, -0.450291, 0.492509, 0.0910437, -0.0563238, -0.522649

We expect this procedure to be accurate to something more like 1e-10, so this is a significant difference.

Thanks for the reproducer, will investigate it and prepare a fix ASAP!

Seeing a miscompile after this has landed. Trying to reduce a test case, but no luck so far.

Any updates about a reproducer?

Here's the smallest reproducer we have so far.

Grab:

and run (on a machine that supports AVX 512):

llvm-project/build/bin/clang  module.ll  driver.cc -o repro -march=skylake-avx512 -O3 -lstdc++ -lm
./repro buffer-assignment.txt

Before this PR (at the preceding commit) this repro produces:

Output:
0.502001, -0.462127, 0.459155, 0.10126, -0.0533019, -0.557239

After this PR, this repro produces:

Output:
0.519662, -0.450291, 0.492509, 0.0910437, -0.0563238, -0.522649

We expect this procedure to be accurate to something more like 1e-10, so this is a significant difference.

Thanks for the reproducer again, investigated it. This patch is not a real cause of the issue, it just revealed again a known problem with masked gathers. Just after this patch, we started to vectorize more code with gathers. If I completely disable masked gathers in the code, the result I get is the next:

0.502001, -0.462127, 0.459155, 0.10126, -0.0533019, -0.557239

. And looks like it is correct. @anton-afanasyev, maybe there is a bug in lowering of masked gathers? And we really need to check if it is profitable/allowed to use masked gathers. I have initial patch for it in the non-power-2 patch, will prepare it as a separate patch later today.

@ABataev If you can email me the IR with/without the masked gathers I can investigate if there's a problem in the backend (or a later pass).

@ABataev If you can email me the IR with/without the masked gathers I can investigate if there's a problem in the backend (or a later pass).

What kind of reproducer do you need? The reproducer reported by @phawkins and attached here after SLP vectorizer with and without masked gathers?

@ABataev If you can email me the IR with/without the masked gathers I can investigate if there's a problem in the backend (or a later pass).

What kind of reproducer do you need? The reproducer reported by @phawkins and attached here after SLP vectorizer with and without masked gathers?

Yes please, the IR files emitted after the SLP pass should be sufficient - how much diff is there other than with/without masked gathers ?

@ABataev If you can email me the IR with/without the masked gathers I can investigate if there's a problem in the backend (or a later pass).

What kind of reproducer do you need? The reproducer reported by @phawkins and attached here after SLP vectorizer with and without masked gathers?

Yes please, the IR files emitted after the SLP pass should be sufficient - how much diff is there other than with/without masked gathers ?

Ok, no problem, just give me few minutes

@ABataev If you can email me the IR with/without the masked gathers I can investigate if there's a problem in the backend (or a later pass).

What kind of reproducer do you need? The reproducer reported by @phawkins and attached here after SLP vectorizer with and without masked gathers?

Yes please, the IR files emitted after the SLP pass should be sufficient - how much diff is there other than with/without masked gathers ?

Uploaded. They must be quite similar except for some cases where we do not vectorize small trees of height 2 with second gather entry.