[SLP] Fix for PR30787: Failure to beneficially vectorize 'copyable' elements in integer binary ops.
Needs ReviewPublic

Authored by dtemirbulatov on Jan 19 2017, 8:56 AM.

Details

Summary

Patch tries to improve vectorization of the following code:

void add1(int * __restrict dst, const int * __restrict src) {
  *dst++ = *src++;
  *dst++ = *src++ + 1;
  *dst++ = *src++ + 2;
  *dst++ = *src++ + 3;
}

Currently this code cannot be vectorized because the very first operation is not a binary add, but just a load.

Diff Detail

There are a very large number of changes, so older changes are hidden. Show Older Changes
ABataev added inline comments.Mar 29 2018, 6:34 AM
test/Transforms/SLPVectorizer/SystemZ/pr34619.ll
2 ↗(On Diff #140208)

Commit this test as a separate NFC patch with the checks against trunk

test/Transforms/SLPVectorizer/X86/partail.ll
2 ↗(On Diff #140208)

Commit this test as a separate NFC patch with the checks against trunk

test/Transforms/SLPVectorizer/X86/pr35497.ll
2

Commit this test as a separate NFC patch with the checks against trunk

test/Transforms/SLPVectorizer/X86/resched.ll
2 ↗(On Diff #140208)

Commit this test as a separate NFC patch with the checks against trunk

Update after tests commit, formatting, delete Instruction::ExtractElement restricktion from tryToRepresentAsInstArg.

ABataev added inline comments.Mar 30 2018, 9:08 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1565
  1. Remove extra parens around Entry->State.IsNonAlt
  2. Why are you skipping bundles with non alternarive opcodes only?
3134–3138

Rewrite it this way:

SmallPtrSet<Instruction *, 4> BundleInst;
Bundle = Bundle->FirstInBundle;
LastInst = Bundle->Inst;
while (Bundle) {
  BundleInst.insert(Bundle->Inst);
  Bundle = Bundle->NextInBundle;
}
3139–3141

Why do you need to scan all the instructions in the basic block starting from First? Why you can't use only scheduled instructions?

3142

Move ++Iter to the header of for loop

3999

Name of the variable must start from capital letter.

4002
  1. The code is not formatted
  2. Seems to me you missed break;

Update after Alexey's remarks.

dtemirbulatov added inline comments.Mar 31 2018, 8:10 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1565

yes, we don't need to check Entry->State.IsNonAlt here, that is similar to getTreeEntry(U, Scalar) == Entry.

3134–3138

Done. Thanks.

3139–3141

What do you mean? please elaborate. If you mean ScheduleData here then it is also not always gives up correct sequence of scheduled instructions in a block. Like, bundle member with NextInBundle equals to null is not guaranty to be the last instruction of this bundle among other instructions. If we start with First then it is highly likely that we could iterate across all bundle members and exit instead of iterating to the end of the basic block. Also, I am thinking, to avoid this overhead we could note the last scheduled instruction in BB during scheduling in scheduleBlock() and keep this information in ScheduleData structure.

3142

Done.

3999

Done.

4002

Correct, Thanks.

Rebased, improved complexity of setInsertPointAfterBundle() for already scheduled instructions, minor changes in scheduleBlock()

Also, your code seems not quite formatted. Please, use clang-format on your changes to format it properly.

lib/Transforms/Vectorize/SLPVectorizer.cpp
319

Enclose it into braces

321

This too

369
bool IsNonAlt = llvm::one_of(VL, [Opcode, AltOpcode](Value *V) {return isa<Instruction>(V) && !sameOpcodeOrAlt(Opcode, AltOpcode, cast<Instruction>(V)->getOpcode());});
1644

Add the debug message here

3135

Why you can't put bundles in the list in the right order: from the very first instruction to the very last?

4001–4008

Better to do it this way:

if (llvm::any_of(Scalar->users(), [this, Entry, Scalar](User *U){return !getTreeEntry(U) && getTreeEntry(U, Scalar) == Entry;}))
  continue;
4565

Remove this, it is not needed

4574

pickedInst->PickedInst

4579–4581

Remove it, does not do anything

Update after remarks from Alexey Bataev.

dtemirbulatov added inline comments.Apr 16 2018, 2:27 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3135

I could do this in scheduleBlock() function with a queue, but that could add additional complexity.

ABataev added inline comments.Apr 17 2018, 8:10 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3135

Could you explain why?

dtemirbulatov added inline comments.Apr 22 2018, 6:19 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3135

One instruction could belong to one or more separate bundles... and while we try to change order in bundles at scheduleBlock() we have to update ScheduleDataMap, ExtraScheduleDataMap.

dtemirbulatov added inline comments.Apr 23 2018, 4:09 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3135

I mean pseudo operation could occur in more than one bundle.

ABataev added inline comments.Apr 23 2018, 6:24 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3135

But these schedule bundles must have different scheduling region id and they must be in a different bundles, why their order changes?

Implement post scheduling bundle reorder of the last element of the bundle according how it was scheduled.

dtemirbulatov added inline comments.May 22 2018, 6:00 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3135

The bundle is differerent, but scheduling region id is the same.

dtemirbulatov added inline comments.May 22 2018, 6:27 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3135

I mean, for example, for this function:
define void @add0(i32* noalias %dst, i32* noalias %src) {
entry:

%incdec.ptr = getelementptr inbounds i32, i32* %src, i64 1
%0 = load i32, i32* %src, align 4
%add = add nsw i32 %0, 1
%incdec.ptr1 = getelementptr inbounds i32, i32* %dst, i64 1
store i32 %add, i32* %dst, align 4
%incdec.ptr2 = getelementptr inbounds i32, i32* %src, i64 2
%1 = load i32, i32* %incdec.ptr, align 4
%add3 = add nsw i32 %1, 1
%incdec.ptr4 = getelementptr inbounds i32, i32* %dst, i64 2
store i32 %add3, i32* %incdec.ptr1, align 4
%incdec.ptr5 = getelementptr inbounds i32, i32* %src, i64 3
%2 = load i32, i32* %incdec.ptr2, align 4
%add6 = add nsw i32 %2, 2
%incdec.ptr7 = getelementptr inbounds i32, i32* %dst, i64 3
store i32 %add6, i32* %incdec.ptr4, align 4
%3 = load i32, i32* %incdec.ptr5, align 4
%add9 = add nsw i32 %3, 3
store i32 %add9, i32* %incdec.ptr7, align 4
ret void

}

We have two bundles:
[ %3 = load i32, i32* %src, align 4; %add3 = add nsw i32 %2, 1; %add6 = add nsw i32 %1, 2; %add9 = add nsw i32 %0, 3]
and
[ %3 = load i32, i32* %src, align 4; %2 = load i32, i32* %incdec.ptr, align 4; %1 = load i32, i32* %incdec.ptr2, align 4; %0 = load i32, i32* %incdec.ptr5, align 4]
with the same instruction %3 = load i32, i32* %src, align 4 and one is a pseudo instruction in this bundle [ %3 = load i32, i32* %src, align 4; %add3; %add6; %add9]
and all in the same scheduling region id that equal to 1.

dtemirbulatov added inline comments.May 22 2018, 6:34 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3135

why their order changes?

sometimes we have to reschedule a pseudo instruction first in both bundles in order to form correct dependencies.

@dtemirbulatov @ABataev Is there anything that can be done to please keep this patch moving? The next release branch won't be far away now and I originally requested this over a year and a half ago (PR30787), and ideally I'd like to get this in and then I can more easily implement PR33744 as well in time.

I notice there's still some minor refactoring in the patch to use InstructionState in newTreeEntry/tryScheduleBundle calls - is it worth getting that done quickly first?

What about testing - how much testing has been done with external code?

@dtemirbulatov @ABataev Is there anything that can be done to please keep this patch moving? The next release branch won't be far away now and I originally requested this over a year and a half ago (PR30787), and ideally I'd like to get this in and then I can more easily implement PR33744 as well in time.

I think that algorithm at 4512 is incorrect, we need to copy common instructions for bundles, instead of reordering bundles. Alexey?

ABataev added a subscriber: Ayal.Jun 4 2018, 5:44 PM

@dtemirbulatov @ABataev Is there anything that can be done to please keep this patch moving? The next release branch won't be far away now and I originally requested this over a year and a half ago (PR30787), and ideally I'd like to get this in and then I can more easily implement PR33744 as well in time.

I notice there's still some minor refactoring in the patch to use InstructionState in newTreeEntry/tryScheduleBundle calls - is it worth getting that done quickly first?

What about testing - how much testing has been done with external code?

Hi guys, sorry, but I cannot review this patch currently as I'm on the vacation for 3 more weeks. It would be good if somebody else could look at this patch. Try to ask Ayal Zaks (@Ayal), maybe he could help you. As to the implementation itself, I don't like the way it is implemented now. Current scheduling implementation looks like a crutch or a hack. It is very complex, has some unclear logic that may be broken in many ways.

Vasilis added a subscriber: Vasilis.Jun 5 2018, 3:02 PM
Vasilis removed a subscriber: Vasilis.Jun 5 2018, 3:04 PM
vporpo added a subscriber: vporpo.Jun 5 2018, 3:16 PM

Hi, when do you plan to update the path for the latest version? I have an idea of improving the scheduling model, maybe it will help us to resolve the problems with the patch.

Hi, when do you plan to update the path for the latest version? I have an idea of improving the scheduling model, maybe it will help us to resolve the problems with the patch.

Hi, I think I have one issue remaining with LNT's MultiSource/Benchmarks/Olden/bh/bh, I will rebased the current version today, tomorrow. Thanks.

@dtemirbulatov @ABataev I'm not sure how much this will cross over but I'm investigating how to extend the alternate opcode mechanism to work with non-binary instructions. Initially I'm just looking at cast + call operators (e.g. sitofp/uitofp, floor/ceil etc.) but even that requires moderate refactoring of the getEntryCost and vectorizeTree ShuffleVector handling. What I'm not sure of is whether we could extend this idea to accept any partial vectorization and the alternate becomes a pass through - what do you think?

If not there is scope to further simplify this patch, for instance @spatel's work in InstCombine for PR37806 should mean that you can rely on InstCombine to perform much of the work in getDefaultConstantForOpcode etc. and all SLP needs to do is create a "passthrough" SK_Select shuffle stage.

spatel added a comment.Jul 2 2018, 9:46 AM

If not there is scope to further simplify this patch, for instance @spatel's work in InstCombine for PR37806 should mean that you can rely on InstCombine to perform much of the work in getDefaultConstantForOpcode etc. and all SLP needs to do is create a "passthrough" SK_Select shuffle stage.

For reference in D48830, I'm proposing that we use (and extend) ConstantExpr::getBinOpIdentity(). IIUC, that would be the same thing that is shown as getDefaultConstantForOpcode() here.

@dtemirbulatov @ABataev I'm not sure how much this will cross over but I'm investigating how to extend the alternate opcode mechanism to work with non-binary instructions. Initially I'm just looking at cast + call operators (e.g. sitofp/uitofp, floor/ceil etc.) but even that requires moderate refactoring of the getEntryCost and vectorizeTree ShuffleVector handling. What I'm not sure of is whether we could extend this idea to accept any partial vectorization and the alternate becomes a pass through - what do you think?

If not there is scope to further simplify this patch, for instance @spatel's work in InstCombine for PR37806 should mean that you can rely on InstCombine to perform much of the work in getDefaultConstantForOpcode etc. and all SLP needs to do is create a "passthrough" SK_Select shuffle stage.

I think this is possible if I correctly understood your idea. But we need to make this patch land at first. To do so we need to resolve the problems with the scheduling.

If not there is scope to further simplify this patch, for instance @spatel's work in InstCombine for PR37806 should mean that you can rely on InstCombine to perform much of the work in getDefaultConstantForOpcode etc. and all SLP needs to do is create a "passthrough" SK_Select shuffle stage.

Yes, this is also a possible solution. What we need to do in this case is to tweak the cost model + improve gathering algorithm. Yes, that might work.

If not there is scope to further simplify this patch, for instance @spatel's work in InstCombine for PR37806 should mean that you can rely on InstCombine to perform much of the work in getDefaultConstantForOpcode etc. and all SLP needs to do is create a "passthrough" SK_Select shuffle stage.

For reference in D48830, I'm proposing that we use (and extend) ConstantExpr::getBinOpIdentity(). IIUC, that would be the same thing that is shown as getDefaultConstantForOpcode() here.

Yes, I see, we can try to use it. AT least we need to think about this solution, need to estimate all pros/cons here

here is change before 74cd05c4a4f94a27daf2d3fc173e7213060cc47c commit, I am currently rebaseing the change.

dtemirbulatov added inline comments.Jul 2 2018, 12:17 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
615

we don't need to do this.

dtemirbulatov added inline comments.Jul 2 2018, 12:22 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3135

I will move this solution to a dedicated function, so we don't have to measure distance here.

OK - please shout if there is anything I can do to help - I realise my alternate opcode refactoring has made rebasing this patch less straightforward!

Rebase. Still, I need to implement bundle reordering instead of the change in setInsertPointAfterBundle(), fully tested this change. I will split this change in several independent reviews.

Fixed one more issue with duplicating memory dependencies in calculateDependencies() by checking whether we already counted this particular dependency.

RKSimon added inline comments.Jul 31 2018, 4:07 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
469

InstructionState was keeping the Base/Alt instructions the same if AltOpcodeNum ==0 (BaseIndex ==AltIndex) - why are you inserting nulls?

1575

Can we use ConstantExpr::getBinOpIdentity instead?

dtemirbulatov added inline comments.Jul 31 2018, 5:41 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
469

yes, correct, Thanks.

1575

no, ConstantExpr::getBinOpIdentity does support only commutative operations.

lebedev.ri added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
1575

Are you sure?

Constant *ConstantExpr::getBinOpIdentity(unsigned Opcode, Type *Ty,
                                         bool AllowRHSConstant) {
...

  // Non-commutative opcodes: AllowRHSConstant must be set.
  if (!AllowRHSConstant)
    return nullptr;
dtemirbulatov added inline comments.Jul 31 2018, 6:01 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1575

oh, yes, sorry, it just a different behaviour here. we want for example 1 division and ConstantExpr::getBinOpIdentity would return 0.

Rebase, Fixed RKSimon's remark, implemented bundle reordering function.

Add hash code based indexing instead of instruction based, split ScheduleData into InstScheduleData, PseudoScheduleData, improve diagnostics of bundle scheduling.

Why you don't want to use pair<Instruction*, Opcode> as the key in all maps/sets? I expect that it will lead to much more simpler slution.

lib/Transforms/Vectorize/SLPVectorizer.cpp
308

Instead, I would check for the supported instructions rather than for the unsupported ones.

390

Not sure if this always produces unique hashes, might lead to the incorrect compiler work.

Implemented pair<Value*, Opcode> as the key in all maps/sets.
Fixed issue with incorrect memory dependency that is attached in testcase memory-dep.ll.
Allow Non-alterative operations to be stored in InstScheduleDataMap.
Removed IsNonAlt field out of InstructionsState.

dtemirbulatov added inline comments.Wed, Oct 10, 3:46 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
308

The other list is a bit long in my implementation, this one looks better.

Update after I found another couple of errors after additional testing the change. Here are changes:
Removed OpValue field out of PseudoScheduleData.
Forbid any bundles with non-alternative operations and remainder operation, see rem-bundle.ll.
Fixed error in setInsertPointAfterBundle() function by using getScheduleData() instead of getInstrScheduleData and if a bundle member is present multiple bundles at the same time then walk through the bundle to find the last scheduled member of the bundle. see insert-after-multiple-bundle.ll
Restore MemoryDependencies to SmallVector, we don't have to count a member presents in calculateDependencies().