[SLP] Fix for PR30787: Failure to beneficially vectorize 'copyable' elements in integer binary ops.
Needs ReviewPublic

Authored by dtemirbulatov on Jan 19 2017, 8:56 AM.

Details

Summary

Patch tries to improve vectorization of the following code:

void add1(int * __restrict dst, const int * __restrict src) {
  *dst++ = *src++;
  *dst++ = *src++ + 1;
  *dst++ = *src++ + 2;
  *dst++ = *src++ + 3;
}

Currently this code cannot be vectorized because the very first operation is not a binary add, but just a load.

Diff Detail

There are a very large number of changes, so older changes are hidden. Show Older Changes
ABataev added inline comments.Jan 30 2018, 7:12 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4557

The main problem with this code that you're checking only one level of dependency. What if you have dependency deeper, in 2, 3 or more level? Will it work? The code itself is very complex and, if we're going to keep this solution, must be outlined in a separate function.

test/Transforms/SLPVectorizer/X86/internal-dep.ll
10 ↗(On Diff #131752)

That's why I suggest to try to exclude this instruction from the scheduling. Actually, we should not schedule %sub.i, we should schedule pseudo instruction lshr i32 %sub.i, 0. Shall we include it into scheduling?

dtemirbulatov added inline comments.Jan 30 2018, 7:30 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4077

oh, I see what you mean now.

4557

yes, I thought the same, I have another heavier version that utilizes Dependancy Analyst, but I have not seen yet such an example where current implementation is not enough. I have 2 or 3 testcases for the issue so far. I will think again about the problem.

test/Transforms/SLPVectorizer/X86/internal-dep.ll
10 ↗(On Diff #131752)

ok, Thanks

dtemirbulatov added inline comments.Jan 30 2018, 7:52 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4557

probably level 2 or 3 dependencies might be ok since it is not encapsulated in a single operation.

What is happening with this patch? It's been in development for over a year now and still seems to be having problems. Getting PR30787 fixed would be VERY useful and I'm thinking we should be looking at alternatives if this patch is going to carry on stalling/reverting.

I will update the solution shortly, I am currently in changing scheduler in order to avoid non-alternative opcodes in a bundle.

dtemirbulatov updated this revision to Diff 140208.EditedMar 29 2018, 4:38 AM

Removed internal vector dependency check as incorrect and I added the dependency check in case of partial bundle vectorization with non-alternative operations only. At BoUpSLP::vectorizeTree, BoUpSLP::buildTree.

ABataev added inline comments.Mar 29 2018, 6:34 AM
test/Transforms/SLPVectorizer/SystemZ/pr34619.ll
2 ↗(On Diff #140208)

Commit this test as a separate NFC patch with the checks against trunk

test/Transforms/SLPVectorizer/X86/partail.ll
2 ↗(On Diff #140208)

Commit this test as a separate NFC patch with the checks against trunk

test/Transforms/SLPVectorizer/X86/pr35497.ll
2

Commit this test as a separate NFC patch with the checks against trunk

test/Transforms/SLPVectorizer/X86/resched.ll
2 ↗(On Diff #140208)

Commit this test as a separate NFC patch with the checks against trunk

Update after tests commit, formatting, delete Instruction::ExtractElement restricktion from tryToRepresentAsInstArg.

ABataev added inline comments.Mar 30 2018, 9:08 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1514
  1. Remove extra parens around Entry->State.IsNonAlt
  2. Why are you skipping bundles with non alternarive opcodes only?
3108–3112

Rewrite it this way:

SmallPtrSet<Instruction *, 4> BundleInst;
Bundle = Bundle->FirstInBundle;
LastInst = Bundle->Inst;
while (Bundle) {
  BundleInst.insert(Bundle->Inst);
  Bundle = Bundle->NextInBundle;
}
3113–3115

Why do you need to scan all the instructions in the basic block starting from First? Why you can't use only scheduled instructions?

3116

Move ++Iter to the header of for loop

3947

Name of the variable must start from capital letter.

3950
  1. The code is not formatted
  2. Seems to me you missed break;

Update after Alexey's remarks.

dtemirbulatov added inline comments.Mar 31 2018, 8:10 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1514

yes, we don't need to check Entry->State.IsNonAlt here, that is similar to getTreeEntry(U, Scalar) == Entry.

3108–3112

Done. Thanks.

3113–3115

What do you mean? please elaborate. If you mean ScheduleData here then it is also not always gives up correct sequence of scheduled instructions in a block. Like, bundle member with NextInBundle equals to null is not guaranty to be the last instruction of this bundle among other instructions. If we start with First then it is highly likely that we could iterate across all bundle members and exit instead of iterating to the end of the basic block. Also, I am thinking, to avoid this overhead we could note the last scheduled instruction in BB during scheduling in scheduleBlock() and keep this information in ScheduleData structure.

3116

Done.

3947

Done.

3950

Correct, Thanks.

Rebased, improved complexity of setInsertPointAfterBundle() for already scheduled instructions, minor changes in scheduleBlock()

Also, your code seems not quite formatted. Please, use clang-format on your changes to format it properly.

lib/Transforms/Vectorize/SLPVectorizer.cpp
319

Enclose it into braces

321

This too

372
bool IsNonAlt = llvm::one_of(VL, [Opcode, AltOpcode](Value *V) {return isa<Instruction>(V) && !sameOpcodeOrAlt(Opcode, AltOpcode, cast<Instruction>(V)->getOpcode());});
1598

Add the debug message here

3109

Why you can't put bundles in the list in the right order: from the very first instruction to the very last?

3949–3956

Better to do it this way:

if (llvm::any_of(Scalar->users(), [this, Entry, Scalar](User *U){return !getTreeEntry(U) && getTreeEntry(U, Scalar) == Entry;}))
  continue;
4553

Remove this, it is not needed

4563

pickedInst->PickedInst

4567–4569

Remove it, does not do anything

Update after remarks from Alexey Bataev.

dtemirbulatov added inline comments.Apr 16 2018, 2:27 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3109

I could do this in scheduleBlock() function with a queue, but that could add additional complexity.

ABataev added inline comments.Apr 17 2018, 8:10 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3109

Could you explain why?

dtemirbulatov added inline comments.Apr 22 2018, 6:19 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3109

One instruction could belong to one or more separate bundles... and while we try to change order in bundles at scheduleBlock() we have to update ScheduleDataMap, ExtraScheduleDataMap.

dtemirbulatov added inline comments.Apr 23 2018, 4:09 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3109

I mean pseudo operation could occur in more than one bundle.

ABataev added inline comments.Apr 23 2018, 6:24 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3109

But these schedule bundles must have different scheduling region id and they must be in a different bundles, why their order changes?

Implement post scheduling bundle reorder of the last element of the bundle according how it was scheduled.

dtemirbulatov added inline comments.May 22 2018, 6:00 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3109

The bundle is differerent, but scheduling region id is the same.

dtemirbulatov added inline comments.May 22 2018, 6:27 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3109

I mean, for example, for this function:
define void @add0(i32* noalias %dst, i32* noalias %src) {
entry:

%incdec.ptr = getelementptr inbounds i32, i32* %src, i64 1
%0 = load i32, i32* %src, align 4
%add = add nsw i32 %0, 1
%incdec.ptr1 = getelementptr inbounds i32, i32* %dst, i64 1
store i32 %add, i32* %dst, align 4
%incdec.ptr2 = getelementptr inbounds i32, i32* %src, i64 2
%1 = load i32, i32* %incdec.ptr, align 4
%add3 = add nsw i32 %1, 1
%incdec.ptr4 = getelementptr inbounds i32, i32* %dst, i64 2
store i32 %add3, i32* %incdec.ptr1, align 4
%incdec.ptr5 = getelementptr inbounds i32, i32* %src, i64 3
%2 = load i32, i32* %incdec.ptr2, align 4
%add6 = add nsw i32 %2, 2
%incdec.ptr7 = getelementptr inbounds i32, i32* %dst, i64 3
store i32 %add6, i32* %incdec.ptr4, align 4
%3 = load i32, i32* %incdec.ptr5, align 4
%add9 = add nsw i32 %3, 3
store i32 %add9, i32* %incdec.ptr7, align 4
ret void

}

We have two bundles:
[ %3 = load i32, i32* %src, align 4; %add3 = add nsw i32 %2, 1; %add6 = add nsw i32 %1, 2; %add9 = add nsw i32 %0, 3]
and
[ %3 = load i32, i32* %src, align 4; %2 = load i32, i32* %incdec.ptr, align 4; %1 = load i32, i32* %incdec.ptr2, align 4; %0 = load i32, i32* %incdec.ptr5, align 4]
with the same instruction %3 = load i32, i32* %src, align 4 and one is a pseudo instruction in this bundle [ %3 = load i32, i32* %src, align 4; %add3; %add6; %add9]
and all in the same scheduling region id that equal to 1.

dtemirbulatov added inline comments.May 22 2018, 6:34 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3109

why their order changes?

sometimes we have to reschedule a pseudo instruction first in both bundles in order to form correct dependencies.

@dtemirbulatov @ABataev Is there anything that can be done to please keep this patch moving? The next release branch won't be far away now and I originally requested this over a year and a half ago (PR30787), and ideally I'd like to get this in and then I can more easily implement PR33744 as well in time.

I notice there's still some minor refactoring in the patch to use InstructionState in newTreeEntry/tryScheduleBundle calls - is it worth getting that done quickly first?

What about testing - how much testing has been done with external code?

@dtemirbulatov @ABataev Is there anything that can be done to please keep this patch moving? The next release branch won't be far away now and I originally requested this over a year and a half ago (PR30787), and ideally I'd like to get this in and then I can more easily implement PR33744 as well in time.

I think that algorithm at 4512 is incorrect, we need to copy common instructions for bundles, instead of reordering bundles. Alexey?

ABataev added a subscriber: Ayal.Jun 4 2018, 5:44 PM

@dtemirbulatov @ABataev Is there anything that can be done to please keep this patch moving? The next release branch won't be far away now and I originally requested this over a year and a half ago (PR30787), and ideally I'd like to get this in and then I can more easily implement PR33744 as well in time.

I notice there's still some minor refactoring in the patch to use InstructionState in newTreeEntry/tryScheduleBundle calls - is it worth getting that done quickly first?

What about testing - how much testing has been done with external code?

Hi guys, sorry, but I cannot review this patch currently as I'm on the vacation for 3 more weeks. It would be good if somebody else could look at this patch. Try to ask Ayal Zaks (@Ayal), maybe he could help you. As to the implementation itself, I don't like the way it is implemented now. Current scheduling implementation looks like a crutch or a hack. It is very complex, has some unclear logic that may be broken in many ways.

Vasilis added a subscriber: Vasilis.Jun 5 2018, 3:02 PM
Vasilis removed a subscriber: Vasilis.Jun 5 2018, 3:04 PM
vporpo added a subscriber: vporpo.Jun 5 2018, 3:16 PM

Hi, when do you plan to update the path for the latest version? I have an idea of improving the scheduling model, maybe it will help us to resolve the problems with the patch.

Hi, when do you plan to update the path for the latest version? I have an idea of improving the scheduling model, maybe it will help us to resolve the problems with the patch.

Hi, I think I have one issue remaining with LNT's MultiSource/Benchmarks/Olden/bh/bh, I will rebased the current version today, tomorrow. Thanks.

@dtemirbulatov @ABataev I'm not sure how much this will cross over but I'm investigating how to extend the alternate opcode mechanism to work with non-binary instructions. Initially I'm just looking at cast + call operators (e.g. sitofp/uitofp, floor/ceil etc.) but even that requires moderate refactoring of the getEntryCost and vectorizeTree ShuffleVector handling. What I'm not sure of is whether we could extend this idea to accept any partial vectorization and the alternate becomes a pass through - what do you think?

If not there is scope to further simplify this patch, for instance @spatel's work in InstCombine for PR37806 should mean that you can rely on InstCombine to perform much of the work in getDefaultConstantForOpcode etc. and all SLP needs to do is create a "passthrough" SK_Select shuffle stage.

spatel added a comment.Jul 2 2018, 9:46 AM

If not there is scope to further simplify this patch, for instance @spatel's work in InstCombine for PR37806 should mean that you can rely on InstCombine to perform much of the work in getDefaultConstantForOpcode etc. and all SLP needs to do is create a "passthrough" SK_Select shuffle stage.

For reference in D48830, I'm proposing that we use (and extend) ConstantExpr::getBinOpIdentity(). IIUC, that would be the same thing that is shown as getDefaultConstantForOpcode() here.

@dtemirbulatov @ABataev I'm not sure how much this will cross over but I'm investigating how to extend the alternate opcode mechanism to work with non-binary instructions. Initially I'm just looking at cast + call operators (e.g. sitofp/uitofp, floor/ceil etc.) but even that requires moderate refactoring of the getEntryCost and vectorizeTree ShuffleVector handling. What I'm not sure of is whether we could extend this idea to accept any partial vectorization and the alternate becomes a pass through - what do you think?

If not there is scope to further simplify this patch, for instance @spatel's work in InstCombine for PR37806 should mean that you can rely on InstCombine to perform much of the work in getDefaultConstantForOpcode etc. and all SLP needs to do is create a "passthrough" SK_Select shuffle stage.

I think this is possible if I correctly understood your idea. But we need to make this patch land at first. To do so we need to resolve the problems with the scheduling.

If not there is scope to further simplify this patch, for instance @spatel's work in InstCombine for PR37806 should mean that you can rely on InstCombine to perform much of the work in getDefaultConstantForOpcode etc. and all SLP needs to do is create a "passthrough" SK_Select shuffle stage.

Yes, this is also a possible solution. What we need to do in this case is to tweak the cost model + improve gathering algorithm. Yes, that might work.

If not there is scope to further simplify this patch, for instance @spatel's work in InstCombine for PR37806 should mean that you can rely on InstCombine to perform much of the work in getDefaultConstantForOpcode etc. and all SLP needs to do is create a "passthrough" SK_Select shuffle stage.

For reference in D48830, I'm proposing that we use (and extend) ConstantExpr::getBinOpIdentity(). IIUC, that would be the same thing that is shown as getDefaultConstantForOpcode() here.

Yes, I see, we can try to use it. AT least we need to think about this solution, need to estimate all pros/cons here

here is change before 74cd05c4a4f94a27daf2d3fc173e7213060cc47c commit, I am currently rebaseing the change.

dtemirbulatov added inline comments.Jul 2 2018, 12:17 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
622

we don't need to do this.

dtemirbulatov added inline comments.Jul 2 2018, 12:22 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3109

I will move this solution to a dedicated function, so we don't have to measure distance here.

OK - please shout if there is anything I can do to help - I realise my alternate opcode refactoring has made rebasing this patch less straightforward!

Rebase. Still, I need to implement bundle reordering instead of the change in setInsertPointAfterBundle(), fully tested this change. I will split this change in several independent reviews.

Fixed one more issue with duplicating memory dependencies in calculateDependencies() by checking whether we already counted this particular dependency.

RKSimon added inline comments.Tue, Jul 31, 4:07 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
476

InstructionState was keeping the Base/Alt instructions the same if AltOpcodeNum ==0 (BaseIndex ==AltIndex) - why are you inserting nulls?

1529

Can we use ConstantExpr::getBinOpIdentity instead?

dtemirbulatov added inline comments.Tue, Jul 31, 5:41 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
476

yes, correct, Thanks.

1529

no, ConstantExpr::getBinOpIdentity does support only commutative operations.

lebedev.ri added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
1529

Are you sure?

Constant *ConstantExpr::getBinOpIdentity(unsigned Opcode, Type *Ty,
                                         bool AllowRHSConstant) {
...

  // Non-commutative opcodes: AllowRHSConstant must be set.
  if (!AllowRHSConstant)
    return nullptr;
dtemirbulatov added inline comments.Tue, Jul 31, 6:01 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1529

oh, yes, sorry, it just a different behaviour here. we want for example 1 division and ConstantExpr::getBinOpIdentity would return 0.

Rebase, Fixed RKSimon's remark, implemented bundle reordering function.