This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Improve multinode analysis.
ClosedPublic

Authored by ABataev on Apr 22 2021, 2:01 PM.

Details

Summary

Changes the preliminary multinode analysis:

  1. Introduced scores for reversed loads/extractelements.
  2. Improved shallow score calculation.
  3. Lowered the cost of external uses (no need to consider it several times, just once).
  4. The initial lane for the analysis is the one that requires the fewest reorderings.

In general, these changes should reduce compile time and improve the
reordering in many cases.

Part of D57059.
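For illustration, the reversed-load scoring from point 1 can be sketched roughly as follows. This is a simplified standalone model, not the actual SLPVectorizer code; the score constants and the precomputed distance parameter are hypothetical.

#include <cstdint>
#include <optional>

// Hypothetical score constants: reversed-consecutive loads are rewarded
// similarly to forward-consecutive ones, since the lanes can be reordered.
constexpr int ScoreConsecutiveLoads = 3;
constexpr int ScoreReversedLoads = 3;
constexpr int ScoreFail = 0;

// Score a pair of loads given the distance between their addresses in
// elements (std::nullopt if the addresses are not provably related).
int scoreLoadPair(std::optional<int64_t> DistInElements) {
  if (!DistInElements)
    return ScoreFail;
  if (*DistInElements == 1)
    return ScoreConsecutiveLoads; // Forward-consecutive pair.
  if (*DistInElements == -1)
    return ScoreReversedLoads;    // Consecutive, but in reverse order.
  return ScoreFail;
}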

Diff Detail

Event Timeline

ABataev created this revision.Apr 22 2021, 2:01 PM
ABataev requested review of this revision.Apr 22 2021, 2:01 PM
Herald added a project: Restricted Project.Apr 22 2021, 2:01 PM
ABataev updated this revision to Diff 345547.May 14 2021, 1:45 PM

Have you been able to investigate any of these instruction increase regressions?

llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll
87

Regression?

231

Regression?

llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll
83

Regression?

231

Regression?

llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll
592

Regression?

Have you been able to investigate any of these instruction increase regressions?

I think most of them can be fixed; it requires D103247 and 1 or 2 extra patches to allow tree reordering for larger subsets of trees.

Matt added a subscriber: Matt.Jun 4 2021, 8:27 AM

rebase?

Need to prepare 1 or 2 extra patches to fix the regressions introduced in this patch (allow reordering for insertelements etc.). Will rebase it after this.

There are still regressions, even after we allowed reordering of insertelements. This is because the reordering is not quite effective. I have an idea of how to improve it (and avoid rebuilding the tree a second time, improving compile time); I'll try to implement it next week.

RKSimon added inline comments.Jun 16 2021, 12:37 AM
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
139

A lot of these tests aren't preserving the broadcast any more - I'm not sure if it really matters, although the test names now look wrong?

ABataev added inline comments.Jun 16 2021, 4:03 AM
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
139

I'll rename the affected test cases.

It depends on D105020, which should fix all the regressions caused by this patch.

RKSimon added inline comments.Jul 13 2021, 2:16 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1015

Should this be m_Deferred or m_Specific? I thought m_Deferred was only necessary in the same match call?
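For reference, the difference as I understand it, as a minimal sketch (the helper functions are made up for illustration):

#include "llvm/IR/PatternMatch.h"
using namespace llvm;
using namespace llvm::PatternMatch;

// m_Deferred(X): X is bound by m_Value(X) earlier in the *same* match()
// call, and the second occurrence must resolve to that same value,
// e.g. V == X + X.
static bool isSelfAdd(Value *V) {
  Value *X;
  return match(V, m_Add(m_Value(X), m_Deferred(X)));
}

// m_Specific(X): X is already known from outside the match() call, and the
// operand must be exactly that value.
static bool isMulByKnownValue(Value *V, Value *X) {
  return match(V, m_Mul(m_Specific(X), m_Value()));
}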

1020

Can this be a single line comment?

ABataev updated this revision to Diff 360863.Jul 22 2021, 10:02 AM

Rebase, fixes and addressed comments.

vporpo added a subscriber: vporpo.Nov 6 2021, 3:20 AM

Perhaps the score changes could be split into a separate patch?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1317

Could you add a comment explaining what these unsigned values are for (or perhaps use a struct instead)?
Could you also describe the purpose of HashMap and what the keys and values are?

1318

Is there any reasoning behind the iteration in reverse? If so, could you please add a comment?

1329

NIT: I find Code a bit confusing; also, perhaps there is no need to refer to Parent in the variable name? Perhaps rename it to something like NumOpsWithSameOpcode?

1330

Could you add a bit more text to the comment about what the hash is used for? I can see that it is used as a key in the HashMap above, but could you explain how it is being used?

1336–1363

Could you elaborate a bit on this? If I understand correctly, the more similar opcodes we can find, the easier it is to reorder them, so this can act as a tie-breaker when NumOfAPOs is equal?

1366

If I am not mistaken, this code counts the consecutive operands with the same opcode and BB. Is that because this is a good enough approximation?

Perhaps the score changes could be split into a separate patch?

Not on their own; they cause regressions. I will try to separate the cost-model changes.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1317

Yes, I just forgot to add some extra comments; I will add them after the updates.
std::pair<unsigned, unsigned> is used to implement a simple voting algorithm and to choose the lane with the fewest operands that can freely move about, or the lane that is less profitable to reorder because it already has the most optimal set of operands.
The first unsigned is a counter for voting; the second unsigned is a counter of lanes with instructions with same/alternate opcodes and the same parent basic block.
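Roughly, it is something like this (an illustrative sketch with made-up names, not the exact code from the patch):

#include "llvm/ADT/DenseMap.h"
#include <utility>

// Key: a hash of a lane's operands (position ids + opcodes).
// Value: .first  - how many lanes voted for this operand pattern,
//        .second - how many of those lanes have operands with same/alternate
//                  opcodes and the same parent basic block.
llvm::DenseMap<unsigned, std::pair<unsigned, unsigned>> HashMap;

// Each lane casts a vote for its operand pattern; the lane whose pattern
// wins the vote is taken as the starting lane for the analysis.
void voteForLane(unsigned LaneHash, bool SameOpcodeAndParent) {
  auto &Votes = HashMap[LaneHash];
  ++Votes.first;
  if (SameOpcodeAndParent)
    ++Votes.second;
}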

1318

This is just to stay closer to the original results from before this patch when we have multiple lanes with the same cost, nothing else.

1329

I will rename it, but I'd rather keep Parent in the name, because I compare not only the opcodes but the parents too.

1330

It is used to count operands, actually their position ids and opcode values. It is used in the voting mechanism to find the lane with the fewest operands that can freely move about, or the lane that is less profitable to reorder because it already has the most optimal set of operands. I could use a SmallVector<unsigned> instead, but a hash code is faster and requires less memory.
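A rough sketch of how such a hash key can be built (illustrative only; hash_combine_range is the real LLVM API, the rest of the names are made up):

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/Hashing.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instruction.h"

// Fold each operand's position id and opcode into one hash code, so a lane's
// whole operand set becomes a cheap map key instead of a SmallVector<unsigned>
// of (position, opcode) pairs.
llvm::hash_code hashLaneOperands(llvm::ArrayRef<llvm::Value *> Operands) {
  llvm::SmallVector<unsigned, 16> Keys;
  for (unsigned Idx = 0, E = Operands.size(); Idx != E; ++Idx) {
    unsigned Opcode = 0;
    if (auto *I = llvm::dyn_cast<llvm::Instruction>(Operands[Idx]))
      Opcode = I->getOpcode();
    Keys.push_back(Idx);    // Position id.
    Keys.push_back(Opcode); // Opcode (0 for non-instructions).
  }
  return llvm::hash_combine_range(Keys.begin(), Keys.end());
}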

1336–1363

If the lane already has operands with the same opcode and the same parent, there is no need to swap the operands in this lane; with high probability, such a lane can already be vectorized effectively.

1366

Yes, exactly, in most cases it results in the optimal values in the lane.

ABataev updated this revision to Diff 386887.Nov 12 2021, 10:35 AM

Rebase + address comments

@vporpo Any more comments?

vporpo added inline comments.Dec 3 2021, 12:41 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1368

Are you using NumOpsWithSameOpcodeParent == 0 as a check for the first iteration? Shouldn't you be using !OpcodeI instead?

I find this code a bit hard to follow, because I can't tell which of the if conditions are for checking for the first iteration and which ones are part of the heuristic. Should it be updating the OpcodeI and Parent only in the first iteration (like below), or should it be doing it whenever there is a mismatch?

if (auto *I = dyn_cast<Instruction>(OpData.V)) {
  // First iteration
  if (!OpcodeI) {
    OpcodeI = I;
    Parent = I->getParent();
  }
  // Mismatch: decrement (capped at 0); match: increment.
  if (!getSameOpcode({OpcodeI, I}).getOpcode() || I->getParent() != Parent)
    NumOpsWithSameOpcodeParent = std::max(NumOpsWithSameOpcodeParent - 1, 0);
  else
    ++NumOpsWithSameOpcodeParent;
}

Perhaps peeling the first iteration might make the code easier to follow?
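For example, the peeled variant could look roughly like this (again just a sketch, reusing the made-up names from above):

// Peeled first iteration: seed the candidate opcode/parent and the counter.
if (auto *I = dyn_cast<Instruction>(Operands.front().V)) {
  OpcodeI = I;
  Parent = I->getParent();
  NumOpsWithSameOpcodeParent = 1;
}
// Remaining iterations only apply the heuristic: +1 on a match,
// -1 (capped at 0) on a mismatch.
for (const auto &OpData : llvm::drop_begin(Operands)) {
  auto *I = dyn_cast<Instruction>(OpData.V);
  if (!I)
    continue;
  if (getSameOpcode({OpcodeI, I}).getOpcode() && I->getParent() == Parent)
    ++NumOpsWithSameOpcodeParent;
  else
    NumOpsWithSameOpcodeParent = std::max(NumOpsWithSameOpcodeParent - 1, 0);
}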

1369

Why is NumOpsWithSameOpcodeParent set to 1 the first time a mismatch is found? Shouldn't it be set to 0?

ABataev added inline comments.Dec 7 2021, 1:56 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1368

This is again a kind of voting algorithm. This code runs every time we start voting on a value with a new opcode, not only on the first iteration. We just try to find an opcode with at least NumOperands/2 occurrences here; if there is no such opcode, we just choose any of them, since there are no profitable elements.

1369

It is a way of counting the first element in the new sequence.

vporpo added inline comments.Dec 7 2021, 3:19 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1369

Yes, it is increasing it, but shouldn't it be decreasing it instead (or letting it remain 0)? This code block executes when there is a mismatch of opcode or parent (or if it is the first iteration), so shouldn't we be decreasing the value of `NumOpsWithSameOpcodeParent` (like in line 1463)?

What confuses me here is that NumOpsWithSameOpcodeParent looks like a normal counter that counts the opcode/parent matches. So I would expect it to increase by one if the opcodes/parents match (like what line 1466 does), and to decrease by one if there is a mismatch. But it seems to be more complicated than that: when it reaches 0 it is forced to 1 even when there is an opcode mismatch. I find this a bit counterintuitive.

For example, if we have mismatching opcodes in sequence, I would expect it to keep decreasing, or at least be capped at 0. But it seems like the value of NumOpsWithSameOpcodeParent will be 0, then 1, then 0, then 1, like so:

before the loop: 0
iteration 1:  1 (because it was == 0)
iteration 2:  0 (because of opcode mismatch)
iteration 3:  1 (because it was == 0)

ABataev added inline comments.Dec 7 2021, 3:24 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1369

This is how the voting algorithm works. The main idea is described here: https://www.geeksforgeeks.org/boyer-moore-majority-voting-algorithm/
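For reference, a minimal standalone sketch of that voting scheme applied to opcodes (illustrative only, not the patch code):

#include <vector>

// Boyer-Moore majority vote: keep a candidate and a counter; a matching
// element increments the counter, a mismatch decrements it, and when the
// counter hits zero the current element becomes the new candidate with a
// count of 1 (which is the "set to 1" discussed above).
unsigned majorityOpcode(const std::vector<unsigned> &Opcodes) {
  unsigned Candidate = 0;
  int Count = 0;
  for (unsigned Opcode : Opcodes) {
    if (Count == 0) {
      Candidate = Opcode;
      Count = 1;
    } else if (Opcode == Candidate) {
      ++Count;
    } else {
      --Count;
    }
  }
  // A second pass would be needed to confirm a true majority exists.
  return Candidate;
}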

vporpo added inline comments.Dec 7 2021, 4:10 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1369

OK, that makes sense now, thanks for clarifying! Could you please add a comment saying that this loop is a Boyer-Moore majority vote for finding the majority opcode and the number of times it occurs?

ABataev added inline comments.Dec 7 2021, 4:15 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1369

Sure, will do it tomorrow.

ABataev updated this revision to Diff 394028.Dec 13 2021, 1:19 PM

Rebase + improve analysis for extractelements.

vporpo accepted this revision.Dec 13 2021, 3:16 PM

LGTM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1016–1022

Nit: Could you use temporary variables and perhaps try to simplify the expression to make it a bit easier to read, something like:

bool MatchExtract1 = match(V1, m_ExtractElt(m_Value(EV1), m_ConstantInt(Ex1Idx)));
bool MatchExtract2 = match(V2, m_ExtractElt(m_Value(EV2), m_CombineOr(m_ConstantInt(Ex2Idx), m_Undef())));
bool AcceptedEV2 = !EV2 || (isUndefVector(EV2) && EV2->getType() == EV1->getType()) || EV2 == EV1;
if ((MatchExtract1 && isa<UndefValue>(V2)) ||
    (MatchExtract1 && MatchExtract2 && AcceptedEV2)) {

This revision is now accepted and ready to land.Dec 13 2021, 3:16 PM
This revision was landed with ongoing or failed builds.Dec 14 2021, 6:18 AM
This revision was automatically updated to reflect the committed changes.
ye-luo added a subscriber: ye-luo.Dec 29 2021, 11:59 PM

This patch caused many test failures in my application on Power9. Although this patch sounds like it affects SLP, adding -fno-slp-vectorize doesn't improve the pass rate, but changing -O3 to -O0 does.

This patch caused many test failures in my application on Power9. Although this patch sounds like it affects SLP, adding -fno-slp-vectorize doesn't improve the pass rate, but changing -O3 to -O0 does.

Hi, do you have a reproducer?

ye-luo added a comment.Jan 3 2022, 3:21 PM

This patch caused many test failures in my application on Power9. Although this patch sounds like it affects SLP, adding -fno-slp-vectorize doesn't improve the pass rate, but changing -O3 to -O0 does.

Hi, do you have a reproducer?

Initially I was not sure where the issue came from and just reported my observation. After a careful inspection, I found it is an interaction between clang and the random number generator in the boost libraries. Since I had little knowledge of the internal details of the library, I decided not to debug it. Instead, I just moved my application off boost, and the RNG from the C++ standard library works well with Clang. So I won't work on a reproducer. If the boost developers find an issue, they will report bugs. Right now, assume everything is good.

This patch caused many test failures in my application on Power9. Although this patch sounds like it affects SLP, adding -fno-slp-vectorize doesn't improve the pass rate, but changing -O3 to -O0 does.

Hi, do you have a reproducer?

Initially I was not sure where the issue came from and just reported my observation. After a careful inspection, I found it is an interaction between clang and the random number generator in the boost libraries. Since I had little knowledge of the internal details of the library, I decided not to debug it. Instead, I just moved my application off boost, and the RNG from the C++ standard library works well with Clang. So I won't work on a reproducer. If the boost developers find an issue, they will report bugs. Right now, assume everything is good.

Ok, thanks for letting me know!