This is an archive of the discontinued LLVM Phabricator instance.

[SLP][X86] Improve reordering to consider alternate instruction bundles
ClosedPublic

Authored by vporpo on May 16 2022, 11:01 AM.

Details

Summary

During the reordering transformation we should try to avoid reordering bundles
like fadd,fsub, because this may block them from being matched into a single vector
instruction on x86.
We do this by checking whether a TreeEntry is such a pattern and adding it to the
list of TreeEntries whose orders need to be considered.
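
For illustration, a minimal standalone sketch of the lane pattern being protected (plain C++ with illustrative names, not the LLVM code): even lanes subtract and odd lanes add, which is exactly the layout x86 ADDSUB implements, so a bundle already in that order should keep its order during reordering.

#include <cstddef>
#include <vector>

enum class Opc { FAdd, FSub };

// ADDSUBPD/ADDSUBPS subtract in the even lanes and add in the odd lanes, so a
// bundle such as {fsub, fadd} or {fsub, fadd, fsub, fadd} maps to one
// instruction, while the swapped order needs blend + add + sub.
bool looksLikeAddSub(const std::vector<Opc> &LaneOpcodes) {
  if (LaneOpcodes.empty() || LaneOpcodes.size() % 2 != 0)
    return false;
  for (std::size_t Lane = 0; Lane < LaneOpcodes.size(); ++Lane) {
    Opc Expected = (Lane % 2 == 0) ? Opc::FSub : Opc::FAdd;
    if (LaneOpcodes[Lane] != Expected)
      return false;
  }
  return true;
}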

Diff Detail

Event Timeline

vporpo created this revision.May 16 2022, 11:01 AM
Herald added a project: Restricted Project.May 16 2022, 11:01 AM
vporpo requested review of this revision.May 16 2022, 11:01 AM
bgraur added a subscriber: bgraur.May 17 2022, 12:39 AM

Do you have many more examples of these patterns? If the addsub<->subadd pattern is the main problem, it shouldn't take much to fix it later on in the backend.

I am not sure if there is any other pattern, but this one showed up in a regression in the eigen benchmark.
Hmm, yeah, fixing it in the back-end may be an option, but what if we end up not vectorizing the code because of this? The alternative is a blend + add + sub, which should increase the cost quite a bit. So it is probably best to teach reordering not to mess up this pattern in the first place.

vporpo updated this revision to Diff 431043.May 20 2022, 1:41 PM

Fixed matching of AddSub pattern.

ABataev added inline comments.May 25 2022, 7:50 AM
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll
121–127

I don't quite understand what's the difference here. Could you explain, please?

vporpo added inline comments.May 25 2022, 9:24 AM
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll
121–127

Before this patch, the shuffle + fadd + fsub pattern lowers to 3 instructions: blend + vector add + vector sub (the shuffle selects TMP3[0],TMP2[1], which is fadd[0],fsub[1], the inverse of the addsub pattern).

With this patch, shuffle + fadd + fsub lowers to a single addsub instruction (the shuffle selects TMP2[0],TMP3[2], which is fsub[0],fadd[1]).
This saves 2 instructions, which means that during reordering we should keep track of this pattern, since reordering it can increase the overhead.

ABataev added inline comments.May 25 2022, 9:46 AM
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll
121–127

Ok, why not fix it in the backend? This new function you added does not affect the cost, but ignoring the shuffle actually increases the cost of the tree.

vporpo added inline comments.May 25 2022, 9:58 AM
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll
121–127

I think it is better to do it here, because otherwise we may end up not vectorizing some code due to the additional cost of the blend + add + sub pattern. I think I can fix the tree cost with a follow-up patch that fixes the cost of the altshuffle pattern when it corresponds to the addsub instruction.

ABataev added inline comments.May 25 2022, 10:07 AM
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll
121–127

Ah, I see, thanks. What about trying to do both - lowering in the backend and the cost adjustment? Is it possible?

vporpo added inline comments.May 25 2022, 10:55 AM
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll
121–127

Hmm, I guess in the back-end we won't have access to a cost-benefit analysis like the one we are doing during the reordering step (i.e., finding the most popular order). So we would have to do a simple conversion of any sub-add pattern to an add-sub + shuffles, but I am not sure that this would always be profitable. I think that the addsub pattern should be taken into consideration when looking for the most popular ordering so that it can influence the decision, but I am not so sure we can always justify the cost of the extra shuffles.

ABataev added inline comments.May 25 2022, 11:02 AM
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll
121–127

The backend does not need the cost; it just checks the pattern and lowers the sequence to the instructions. Why does it need the cost?

vporpo added inline comments.May 25 2022, 11:36 AM
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll
121–127

I think there is a difference between doing the transformation here vs in the back-end. In the back-end we can't easily check if reordering a sub-add to an add-sub can also remove some of the shuffles that are already in the code.

For example if we have code like this:

%vsub = fsub <2 x double> %subop, ...
%vadd = fadd <2 x double> %addop, ...
%shuffle = shufflevector <2 x double> %vsub, <2 x double> %vadd, <2 x i32> <i32 0, i32 3>  ; vsub[0], vadd[1]
store <2 x double> %shuffle, ...

This will be lowered to: add + sub + blend + store.

But if we convert the sub-add pattern (i.e., add + sub + blend) to an add-sub + shuffles in the back-end, then this will introduce 3 shuffles: 2 for the operands and 1 for the user, resulting in a pattern: pshuf + pshuf + addsub + pshuf + store. This is clearly worse than the original code.

But the transformation could be profitable if we could remove some of the shuffles, for example if the code looked like this:

%vsub = fsub <2 x double> %subop, ...
%vadd = fadd <2 x double> %addop, ...
%shuffle = shufflevector <2 x double> %vsub, <2 x double> %vadd, <2 x i32> <i32 0, i32 3>  ; vsub[0], vadd[1]
%shuffle2 = shufflevector <2 x double> %shuffle, <2 x double> zeroinitializer, <2 x i32> <i32 1, i32 0>  ; shuffle[1], shuffle[0]
store <2 x double> %shuffle2, ...

Converting this to add-sub in the back-end would result in pshuf + pshuf + addsub + pshuf + pshuf + store, but the latter two pshuf instructions negate each other, so they could be optimized away. A similar optimization could happen if the inputs were already reordered with a shuffle earlier.

What I am trying to say is that simply converting a sub-add pattern to an add-sub pattern does not look profitable unless we can also get rid of some of the shuffles that cancel each other out. I can't think of how we could do this in the back-end more effectively than we could do it here, but perhaps I am missing something.

RKSimon added inline comments.May 26 2022, 2:35 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
685

If we're going to do this (and I'm still not convinced we should), we're going to need a shuffle mask - otherwise addsub matching is not going to be accurate.

vporpo added inline comments.May 26 2022, 9:11 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
685

I agree, the shuffle mask will make matching a lot more accurate; I will update the patches.

Why do you think we should not be going ahead with this? To me this is very similar to the shuffle-reducing transformations that we already have, so it seems to fit pretty well within the scope of the SLP pass.

I'm concerned it's yet another thing being added to SLP to make up for a weakness somewhere else (usually the cost model mechanism), and it may take years to unravel again.

In this case we don't have a way to determine realistic costs for instruction sequences - at best we can optionally provide operands when getting the cost of a single instruction. We already have this problem with folded loads, and this could end up being a lot more complex :-(

I understand your concerns, but I feel that this pattern is not as tricky as folded loads. We can accurately detect the add-sub pattern within the SLP pass itself, since it matches the AltShuffle TreeEntry. So we can even make the tree cost more accurate (this is in a follow-up patch). The only issue that I see is perhaps with cost modeling of this pattern from within a pass other than SLP, since that would require special logic to skip the cost of the vector add and vector sub. Given the current infrastructure I don't think we can do anything about that, but it shouldn't stop us from making getShuffleCost() more accurate, even if it will only benefit SLP.

Let's get the shuffle mask arg added - we need it to correctly cost alt patterns. We might be able to reduce the SLP impact to a getAltCost() wrapper of some kind.

vporpo updated this revision to Diff 432373.May 26 2022, 1:29 PM

Added shufflemask argument to TTI function.

vdmitrie added inline comments.May 26 2022, 6:36 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2387

I'd say we don't really need this interface.
IMO something like buildAltShuffleMask, which you could feed directly into the TTI interface you added, is likely to be more useful. See also buildShuffleEntryMask, which could also benefit from having it.
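
For illustration, a minimal sketch of what such a helper might look like (the name buildAltOpcodeMask and its exact signature are assumptions for illustration, not the committed code):

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallBitVector.h"
#include "llvm/IR/Instruction.h"

// Hypothetical helper: given the bundle's scalar instructions and the
// alternate opcode, produce a per-lane bit vector with a set bit wherever the
// lane uses the alternate opcode, ready to be fed into the TTI query.
static llvm::SmallBitVector
buildAltOpcodeMask(llvm::ArrayRef<llvm::Value *> Scalars, unsigned AltOpcode) {
  llvm::SmallBitVector OpcodeMask(Scalars.size(), false);
  for (unsigned Lane = 0, E = Scalars.size(); Lane != E; ++Lane)
    if (llvm::cast<llvm::Instruction>(Scalars[Lane])->getOpcode() == AltOpcode)
      OpcodeMask.set(Lane);
  return OpcodeMask;
}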

vporpo added inline comments.May 26 2022, 8:01 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2387

Well, the issue is that the TTI function can't check whether this is an actual add-sub pattern, because it doesn't have all the information needed. For this to work, it would have to know how we are planning to combine the scalar instructions into vectors, which is an SLP-specific thing. So it feels like this logic should not belong in TTI; instead it should all be in the SLP pass.

Regarding the mask, I don't really like the way this is done now. Actually, I am not even sure passing the shuffle mask is a good idea, because it doesn't actually help with checking the pattern. It is probably best to just pass the even/odd opcodes and the vector type and ask TTI to verify whether they are compatible with the addsub instructions.

Wdyt?

vdmitrie added inline comments.May 27 2022, 9:10 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2387

My 2 cents wrt the TTI interface:
The interface as defined in the current revision makes two assumptions: that there are two opcodes, and that the final result is built with a select shuffle. The latter is its main drawback. A more generic version could look like
bool isLegalSIMDInstruction(Type *VecTy, ArrayRef<unsigned> Opcodes, ArrayRef<unsigned> LaneOpcMask)
where LaneOpcMask[Lane] is an index into the Opcodes array that gives the per-lane opcode for a SIMD instruction.
Or, if you want to limit it to just 2 opcodes (which perhaps makes sense), LaneOpcMask could be represented with a bit vector.
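
For illustration, a call under that proposal might look roughly like this (the declaration below is the suggested interface, not a committed LLVM API; the example values are hypothetical):

#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Type.h"

// Declaration of the *proposed* query (not a committed LLVM API).
bool isLegalSIMDInstruction(llvm::Type *VecTy, llvm::ArrayRef<unsigned> Opcodes,
                            llvm::ArrayRef<unsigned> LaneOpcMask);

// Example call for a <4 x double> bundle where even lanes are fsub and odd
// lanes are fadd, i.e. the VADDSUBPD lane layout.
bool queryAddSub4xF64(llvm::LLVMContext &Ctx) {
  llvm::Type *VecTy =
      llvm::FixedVectorType::get(llvm::Type::getDoubleTy(Ctx), 4);
  unsigned Opcodes[] = {llvm::Instruction::FSub, llvm::Instruction::FAdd};
  unsigned LaneOpcMask[] = {0, 1, 0, 1}; // per-lane index into Opcodes
  return isLegalSIMDInstruction(VecTy, Opcodes, LaneOpcMask);
}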

vporpo added inline comments.May 27 2022, 9:20 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2387

Yeah, this makes sense. I will update the patch.

vporpo updated this revision to Diff 432605.May 27 2022, 11:07 AM

Replaced shuffle mask argument with a bitvector and moved all the opcode checking code to TTI->isLegalAltInstr().

vporpo added inline comments.Jun 1 2022, 10:00 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2387

I updated the isLegalAltInstr() TTI interface function. I think it now looks more like what you described in your previous comment.
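
For illustration, a sketch of how the SLP side might query the new hook (this assumes it landed roughly as isLegalAltInstr(VecTy, Opcode0, Opcode1, OpcodeMask), where a set bit in OpcodeMask marks a lane that uses the alternate opcode; see the actual patch for the exact signature):

#include "llvm/ADT/SmallBitVector.h"
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Instruction.h"

// Sketch: ask the target whether an fsub/fadd alternating bundle of the given
// width is legal as a single alternate instruction (e.g. ADDSUB).
bool bundleIsSingleAltInstr(const llvm::TargetTransformInfo &TTI,
                            llvm::FixedVectorType *VecTy) {
  llvm::SmallBitVector OpcodeMask(VecTy->getNumElements(), false);
  for (unsigned Lane = 1, E = VecTy->getNumElements(); Lane < E; Lane += 2)
    OpcodeMask.set(Lane); // odd lanes use the alternate opcode (fadd)
  return TTI.isLegalAltInstr(VecTy, llvm::Instruction::FSub,
                             llvm::Instruction::FAdd, OpcodeMask);
}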

vdmitrie accepted this revision.Jun 1 2022, 10:50 AM

Looks good to me.
@RKSimon , @ABataev , do you have any comments/objections?

This revision is now accepted and ready to land.Jun 1 2022, 10:50 AM

Don't we need to account for cost somehow?

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
5372

The AVX case is unnecessary as any NumElements % 4 == 0 case will be supported as multiple ADDSUB ops.

5375

Same here - NumElements % 2 == 0 is enough.
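
For illustration, a simplified sketch of the kind of X86 legality check under discussion (hedged: this mirrors the review comments rather than the committed X86TargetTransformInfo code, and the per-element-type modulo checks are assumptions based on the comments above):

#include "llvm/ADT/SmallBitVector.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Instruction.h"

// Simplified sketch: the opcodes must be fsub/fadd with fsub in the even lanes
// and fadd in the odd lanes (the ADDSUB layout). Wider vectors split into
// multiple ADDSUB ops, so a multiple of the native width with SSE3 is enough;
// no separate AVX case is needed.
static bool isLegalAddSub(bool HasSSE3, llvm::FixedVectorType *VecTy,
                          unsigned Opcode0, unsigned Opcode1,
                          const llvm::SmallBitVector &OpcodeMask) {
  if (!HasSSE3)
    return false;
  if (Opcode0 != llvm::Instruction::FSub ||
      Opcode1 != llvm::Instruction::FAdd)
    return false;
  // Set bits mark lanes using Opcode1 (fadd); ADDSUB adds in odd lanes only.
  for (unsigned Lane = 0, E = OpcodeMask.size(); Lane != E; ++Lane)
    if (OpcodeMask.test(Lane) != (Lane % 2 == 1))
      return false;
  unsigned NumElts = VecTy->getNumElements();
  if (VecTy->getElementType()->isDoubleTy())
    return NumElts % 2 == 0; // ADDSUBPD handles 2 x f64 per op
  if (VecTy->getElementType()->isFloatTy())
    return NumElts % 4 == 0; // ADDSUBPS handles 4 x f32 per op
  return false;
}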

vporpo updated this revision to Diff 433581.Jun 1 2022, 3:52 PM
vporpo marked 2 inline comments as done.

Addressed comments.

vporpo added a comment.Jun 1 2022, 3:52 PM

Don't we need to account for cost somehow?

The cost calculation changes are in a followup patch: https://reviews.llvm.org/D126432

@RKSimon any more comments on this?

This revision was landed with ongoing or failed builds.Jun 21 2022, 4:46 PM
This revision was automatically updated to reflect the committed changes.