[SLP] Fix for PR30787: Failure to beneficially vectorize 'copyable' elements in integer binary ops.
Needs ReviewPublic

Authored by dtemirbulatov on Jan 19 2017, 8:56 AM.

Details

Summary

Patch tries to improve vectorization of the following code:

void add1(int * __restrict dst, const int * __restrict src) {
  *dst++ = *src++;
  *dst++ = *src++ + 1;
  *dst++ = *src++ + 2;
  *dst++ = *src++ + 3;
}

Currently this code cannot be vectorized because the very first operation is not a binary add, but just a load.

Diff Detail

There are a very large number of changes, so older changes are hidden. Show Older Changes
dtemirbulatov added inline comments.Dec 20 2017, 1:40 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4090–4091

Yes, thanks, I see the problem in setInsertPointAfterBundle function.

dtemirbulatov added inline comments.Dec 20 2017, 2:19 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3595–3596

For consecutive loads [load0, load1, load2, load3] in HorizontalReduction::matchAssociativeReduction function ReducedVals vector was formed as [load1, load0, load2, load3] and later TreeEntry was added with State.OpValue pointing to load1 in BoUpSLP::buildTree_rec, while TreeEntry->Scalars contained correct sequence of loads [load0, load1, load2, load3].

ABataev added inline comments.Dec 20 2017, 6:50 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3595–3596

Could you investigate why does this happen?

dtemirbulatov added inline comments.Dec 20 2017, 7:20 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3595–3596

ok

Update proposed change, fixed scheduling issue by avoiding using dominators.

ABataev added inline comments.Jan 12 2018, 1:30 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2718

auto *Inst

3297

auto *VL0

4092

auto *I

4100

Why you're checking for ExtractValues here?

4102

auto *I1

4103
  1. I1 != nullptr -> I1
  2. I1 = vis very bad name
4108

I still don't understand what are checking for here. Need more description and real tests

4110

auto *UserInst

4114
  1. auto *I1
  2. I1 is very bad name
4512
  1. Wrong formatting. Use clang-format
  2. Enclose into braces
test/Transforms/SLPVectorizer/X86/pr35497.ll
1

Add test checks

Fixed more scheduling errors, still scheduling algorithm could be improved in term of complexity, but I tried to minimally impact existing scheduling.

dtemirbulatov added inline comments.Jan 25 2018, 3:42 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4108

I hit the issue without "(SameOrAlt <= (VL.size() / 2))" limitation, I can send you testcases offline. but I believe with could trigger the issue without this check even in the current implementation.

dtemirbulatov added inline comments.Jan 25 2018, 3:48 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2718

done.

3297

done.

4100

done. I don't have any limitation for the extract element operation anymore.

4102

done.

4103

done.

4110

done.

4114

done.

4512

done.

test/Transforms/SLPVectorizer/X86/pr35497.ll
1

done.

dtemirbulatov added inline comments.Jan 25 2018, 4:02 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4550

And these dominance errors occur rarely.

Remove dominance usage from the scheduler by adding support for nonalternative opcodes in calculateDependencies.

ABataev added inline comments.Jan 26 2018, 9:48 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2349–2350

You can use VL0 instead of E->State.OpValue

4080

DominatorTree *DT is not required.

4094

Seems to me, we should extend scheduling region only for instructions with the same or alternate opcode. Maybe, it will resolve all your problems.

4108

The check itself is very expensive and not clear.That's why it causes a lot of questions. You should describe the problem in more details, add the test case to the patch that reveals the problem and prove, that this fix is universal.

4497
  1. Use SmallSet instead
  2. Why you can't use SmallVector instead?
dtemirbulatov added inline comments.Jan 28 2018, 12:23 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4094

No, in that case, we would not be able to extend scheduling region with non-alternative opcodes.

Update after ABataev remarks.

dtemirbulatov added inline comments.Jan 29 2018, 1:37 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2349–2350

done.

4080

done.

4108

Added testcase internal-dep.ll for that, the issue could be reproduced if we remove this check and remove another "(SameOrAlt <= (VL.size() / 2))" limitation at line 1556. I believe that the issue is present even in the current implementation, but I just don't have enough code base to prove it.

4497

Replaced to SmallSet, We could not use Vector here since it is not possible to have multiple entities to a single Instruction.

ABataev added inline comments.Jan 30 2018, 7:12 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4094

Did you try that? Actually, we're not going to vectorize instructions with the different opcodes. So, maybe, we should exclude them from the scheduling region?

4497

The main problem with this code that you're checking only one level of dependency. What if you have dependency deeper, in 2, 3 or more level? Will it work? The code itself is very complex and, if we're going to keep this solution, must be outlined in a separate function.

test/Transforms/SLPVectorizer/X86/internal-dep.ll
10 ↗(On Diff #131752)

That's why I suggest to try to exclude this instruction from the scheduling. Actually, we should not schedule %sub.i, we should schedule pseudo instruction lshr i32 %sub.i, 0. Shall we include it into scheduling?

dtemirbulatov added inline comments.Jan 30 2018, 7:30 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4094

oh, I see what you mean now.

4497

yes, I thought the same, I have another heavier version that utilizes Dependancy Analyst, but I have not seen yet such an example where current implementation is not enough. I have 2 or 3 testcases for the issue so far. I will think again about the problem.

test/Transforms/SLPVectorizer/X86/internal-dep.ll
10 ↗(On Diff #131752)

ok, Thanks

dtemirbulatov added inline comments.Jan 30 2018, 7:52 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4497

probably level 2 or 3 dependencies might be ok since it is not encapsulated in a single operation.

What is happening with this patch? It's been in development for over a year now and still seems to be having problems. Getting PR30787 fixed would be VERY useful and I'm thinking we should be looking at alternatives if this patch is going to carry on stalling/reverting.

I will update the solution shortly, I am currently in changing scheduler in order to avoid non-alternative opcodes in a bundle.

dtemirbulatov updated this revision to Diff 140208.EditedMar 29 2018, 4:38 AM

Removed internal vector dependency check as incorrect and I added the dependency check in case of partial bundle vectorization with non-alternative operations only. At BoUpSLP::vectorizeTree, BoUpSLP::buildTree.

ABataev added inline comments.Mar 29 2018, 6:34 AM
test/Transforms/SLPVectorizer/SystemZ/pr34619.ll
2 ↗(On Diff #140208)

Commit this test as a separate NFC patch with the checks against trunk

test/Transforms/SLPVectorizer/X86/partail.ll
2

Commit this test as a separate NFC patch with the checks against trunk

test/Transforms/SLPVectorizer/X86/pr35497.ll
2

Commit this test as a separate NFC patch with the checks against trunk

test/Transforms/SLPVectorizer/X86/resched.ll
2 ↗(On Diff #140208)

Commit this test as a separate NFC patch with the checks against trunk

Update after tests commit, formatting, delete Instruction::ExtractElement restricktion from tryToRepresentAsInstArg.

ABataev added inline comments.Mar 30 2018, 9:08 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1538
  1. Remove extra parens around Entry->State.IsNonAlt
  2. Why are you skipping bundles with non alternarive opcodes only?
3146–3150

Rewrite it this way:

SmallPtrSet<Instruction *, 4> BundleInst;
Bundle = Bundle->FirstInBundle;
LastInst = Bundle->Inst;
while (Bundle) {
  BundleInst.insert(Bundle->Inst);
  Bundle = Bundle->NextInBundle;
}
3151–3153

Why do you need to scan all the instructions in the basic block starting from First? Why you can't use only scheduled instructions?

3154

Move ++Iter to the header of for loop

3963

Name of the variable must start from capital letter.

3966
  1. The code is not formatted
  2. Seems to me you missed break;

Update after Alexey's remarks.

dtemirbulatov added inline comments.Mar 31 2018, 8:10 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1538

yes, we don't need to check Entry->State.IsNonAlt here, that is similar to getTreeEntry(U, Scalar) == Entry.

3146–3150

Done. Thanks.

3151–3153

What do you mean? please elaborate. If you mean ScheduleData here then it is also not always gives up correct sequence of scheduled instructions in a block. Like, bundle member with NextInBundle equals to null is not guaranty to be the last instruction of this bundle among other instructions. If we start with First then it is highly likely that we could iterate across all bundle members and exit instead of iterating to the end of the basic block. Also, I am thinking, to avoid this overhead we could note the last scheduled instruction in BB during scheduling in scheduleBlock() and keep this information in ScheduleData structure.

3154

Done.

3963

Done.

3966

Correct, Thanks.

Rebased, improved complexity of setInsertPointAfterBundle() for already scheduled instructions, minor changes in scheduleBlock()

Also, your code seems not quite formatted. Please, use clang-format on your changes to format it properly.

lib/Transforms/Vectorize/SLPVectorizer.cpp
433

Enclose it into braces

435

This too

475
bool IsNonAlt = llvm::one_of(VL, [Opcode, AltOpcode](Value *V) {return isa<Instruction>(V) && !sameOpcodeOrAlt(Opcode, AltOpcode, cast<Instruction>(V)->getOpcode());});
1622

Add the debug message here

3147

Why you can't put bundles in the list in the right order: from the very first instruction to the very last?

3965–3972

Better to do it this way:

if (llvm::any_of(Scalar->users(), [this, Entry, Scalar](User *U){return !getTreeEntry(U) && getTreeEntry(U, Scalar) == Entry;}))
  continue;
4493

Remove this, it is not needed

4507

pickedInst->PickedInst

4507–4509

Remove it, does not do anything

Update after remarks from Alexey Bataev.

dtemirbulatov added inline comments.Apr 16 2018, 2:27 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3147

I could do this in scheduleBlock() function with a queue, but that could add additional complexity.

ABataev added inline comments.Apr 17 2018, 8:10 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3147

Could you explain why?

dtemirbulatov added inline comments.Apr 22 2018, 6:19 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3147

One instruction could belong to one or more separate bundles... and while we try to change order in bundles at scheduleBlock() we have to update ScheduleDataMap, ExtraScheduleDataMap.

dtemirbulatov added inline comments.Apr 23 2018, 4:09 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3147

I mean pseudo operation could occur in more than one bundle.

ABataev added inline comments.Apr 23 2018, 6:24 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3147

But these schedule bundles must have different scheduling region id and they must be in a different bundles, why their order changes?

Implement post scheduling bundle reorder of the last element of the bundle according how it was scheduled.

dtemirbulatov added inline comments.Tue, May 22, 6:00 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3147

The bundle is differerent, but scheduling region id is the same.

dtemirbulatov added inline comments.Tue, May 22, 6:27 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3147

I mean, for example, for this function:
define void @add0(i32* noalias %dst, i32* noalias %src) {
entry:

%incdec.ptr = getelementptr inbounds i32, i32* %src, i64 1
%0 = load i32, i32* %src, align 4
%add = add nsw i32 %0, 1
%incdec.ptr1 = getelementptr inbounds i32, i32* %dst, i64 1
store i32 %add, i32* %dst, align 4
%incdec.ptr2 = getelementptr inbounds i32, i32* %src, i64 2
%1 = load i32, i32* %incdec.ptr, align 4
%add3 = add nsw i32 %1, 1
%incdec.ptr4 = getelementptr inbounds i32, i32* %dst, i64 2
store i32 %add3, i32* %incdec.ptr1, align 4
%incdec.ptr5 = getelementptr inbounds i32, i32* %src, i64 3
%2 = load i32, i32* %incdec.ptr2, align 4
%add6 = add nsw i32 %2, 2
%incdec.ptr7 = getelementptr inbounds i32, i32* %dst, i64 3
store i32 %add6, i32* %incdec.ptr4, align 4
%3 = load i32, i32* %incdec.ptr5, align 4
%add9 = add nsw i32 %3, 3
store i32 %add9, i32* %incdec.ptr7, align 4
ret void

}

We have two bundles:
[ %3 = load i32, i32* %src, align 4; %add3 = add nsw i32 %2, 1; %add6 = add nsw i32 %1, 2; %add9 = add nsw i32 %0, 3]
and
[ %3 = load i32, i32* %src, align 4; %2 = load i32, i32* %incdec.ptr, align 4; %1 = load i32, i32* %incdec.ptr2, align 4; %0 = load i32, i32* %incdec.ptr5, align 4]
with the same instruction %3 = load i32, i32* %src, align 4 and one is a pseudo instruction in this bundle [ %3 = load i32, i32* %src, align 4; %add3; %add6; %add9]
and all in the same scheduling region id that equal to 1.

dtemirbulatov added inline comments.Tue, May 22, 6:34 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3147

why their order changes?

sometimes we have to reschedule a pseudo instruction first in both bundles in order to form correct dependencies.