SLP supports perfect diamond matching for the vectorized tree entries
but do not support it for gathered entries and does not support
non-perfect (shuffled) matching with 1 or 2 tree entries. Patch adds
support for this matching to improve cost of the vectorized tree.
Details
- Reviewers
spatel RKSimon dtemirbulatov anton-afanasyev - Commits
- rGaf870e11aed7: [SLP] Add detection of shuffled/perfect matching of tree entries.
rGdaf6e18c55c2: [SLP] Add detection of shuffled/perfect matching of tree entries.
rGb232771acad6: [SLP] Add detection of shuffled/perfect matching of tree entries.
rGd6fde913790d: [SLP]Add detection of shuffled/perfect matching of tree entries.
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
A few comments along the theme of reducing code duplication
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4290 | This is the same as the getTreeEntry code above - merge them? | |
4305 | This shuffle kind decode code feels like the kind of thing it should be somewhere like Analysis\VectorUtils.h ? | |
4599 | Can't we merge these? if we assert that (Entries.size() == 1 || Entries.size() == 2) then Entries,.back() will return Entries.front() for a unary shuffle and Builder.CreateShuffleVector should handle the canonicalization. |
LGTM, cheers - I will looking at moving the shuffle kind decode out in a separate commit
Seeing a miscompile after this has landed. Trying to reduce a test case, but no luck so far.
Thanks for the report, waiting for the reproducer.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4305 | Yes, that would be good, just need to agree where we should this, in getShuffleCost or somewhere else? |
Hmm, looks like still need a reproducer because the possible problem I thought to exist in the patch, was fixed already. So, a possible miscompile is caused by another problem I don't currently see. Still need a reproducer. :(
Here's the smallest reproducer we have so far.
Grab:
- driver.cc: https://gist.github.com/hawkinsp/de3ee1adc72d9e4618e00cde5139a5c3
- buffer-assignment.txt: https://gist.github.com/hawkinsp/01719b73adb20cccb8d06a57267d42a3
- module.ll: https://gist.github.com/hawkinsp/cbe6e3d508f6bf8172d29bd0c6f986a0
and run (on a machine that supports AVX 512):
llvm-project/build/bin/clang module.ll driver.cc -o repro -march=skylake-avx512 -O3 -lstdc++ -lm ./repro buffer-assignment.txt
Before this PR (at the preceding commit) this repro produces:
Output: 0.502001, -0.462127, 0.459155, 0.10126, -0.0533019, -0.557239
After this PR, this repro produces:
Output: 0.519662, -0.450291, 0.492509, 0.0910437, -0.0563238, -0.522649
We expect this procedure to be accurate to something more like 1e-10, so this is a significant difference.
Thanks for the reproducer again, investigated it. This patch is not a real cause of the issue, it just revealed again a known problem with masked gathers. Just after this patch, we started to vectorize more code with gathers. If I completely disable masked gathers in the code, the result I get is the next:
0.502001, -0.462127, 0.459155, 0.10126, -0.0533019, -0.557239
. And looks like it is correct. @anton-afanasyev, maybe there is a bug in lowering of masked gathers? And we really need to check if it is profitable/allowed to use masked gathers. I have initial patch for it in the non-power-2 patch, will prepare it as a separate patch later today.
@ABataev If you can email me the IR with/without the masked gathers I can investigate if there's a problem in the backend (or a later pass).
What kind of reproducer do you need? The reproducer reported by @phawkins and attached here after SLP vectorizer with and without masked gathers?
Yes please, the IR files emitted after the SLP pass should be sufficient - how much diff is there other than with/without masked gathers ?
Uploaded. They must be quite similar except for some cases where we do not vectorize small trees of height 2 with second gather entry.
This is the same as the getTreeEntry code above - merge them?