This is an archive of the discontinued LLVM Phabricator instance.

[SLP] avoid reduction transform on patterns that the backend can load-combine
ClosedPublic

Authored by spatel on Sep 20 2019, 9:45 AM.

Details

Summary

I don't see an ideal solution to these 2 related, potentially large, perf regressions:
https://bugs.llvm.org/show_bug.cgi?id=42708
https://bugs.llvm.org/show_bug.cgi?id=43146

We decided that load combining was unsuitable for IR because it could obscure other optimizations in IR. So we removed the LoadCombiner pass and deferred to the backend. Therefore, preventing SLP from destroying load-combine opportunities requires that it recognize patterns that could be combined later, but not perform the optimization itself (it's not a vector combine anyway, so it's probably out of scope for SLP).

In the x86 tests shown (and discussed in more detail in the bug reports), SDAG combining will produce a single instruction like:

movbe   rax, qword ptr [rdi]

or:

mov     rax, qword ptr [rdi]

Not the (half-)vector monstrosity we currently produce using SLP:

vpmovzxbq       ymm0, dword ptr [rdi + 1] # ymm0 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero,mem[2],zero,zero,zero,zero,zero,zero,zero,mem[3],zero,zero,zero,zero,zero,zero,zero
vpsllvq ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
movzx   eax, byte ptr [rdi]
movzx   ecx, byte ptr [rdi + 5]
shl     rcx, 40
movzx   edx, byte ptr [rdi + 6]
shl     rdx, 48
or      rdx, rcx
movzx   ecx, byte ptr [rdi + 7]
shl     rcx, 56
or      rcx, rdx
or      rcx, rax
vextracti128    xmm1, ymm0, 1
vpor    xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 78          # xmm1 = xmm0[2,3,0,1]
vpor    xmm0, xmm0, xmm1
vmovq   rax, xmm0
or      rax, rcx
vzeroupper
ret

Diff Detail

Event Timeline

spatel created this revision.Sep 20 2019, 9:45 AM
Herald added a project: Restricted Project.Sep 20 2019, 9:45 AM

We decided that load combining was unsuitable for IR because it could obscure other optimizations in IR. So we removed the LoadCombiner pass and deferred to the backend.

For reference, can link those patches/discussions?

Is this similar to D42981?

We decided that load combining was unsuitable for IR because it could obscure other optimizations in IR. So we removed the LoadCombiner pass and deferred to the backend.

For reference, can link those patches/discussions?

http://lists.llvm.org/pipermail/llvm-dev/2016-September/105291.html

More recently:
http://llvm.1065342.n5.nabble.com/llvm-dev-Load-combine-pass-td101187i20.html#a131076

This is the commit to remove the pass from trunk:
https://reviews.llvm.org/rL306067

Is this similar to D42981?

I think it's trying to accomplish a similar goal, but that patch affects far more than this (and I'm not sure it would solve this problem). Ie, this is intentionally trying hard to affect only the patterns that we know will be combined into a single larger scalar load.

Is this similar to D42981?

I think it's trying to accomplish a similar goal, but that patch affects far more than this (and I'm not sure it would solve this problem). Ie, this is intentionally trying hard to affect only the patterns that we know will be combined into a single larger scalar load.

I looked at the other patch again, and it is not solving the examples in the motivating bug reports. The problem in these cases is not the vector cost model; it's the scalar cost model. Unless we resurrect LoadCombiner in IR, we need to recognize that small consecutive scalar loads can be combined to a larger scalar op.

The pattern-matching here isn't specific enough to fully recognize a bswap, for example, but that's not necessary IMO. We have perf regressions, and this is a conservative change to the cost model calculation that fixes them. If we underestimate the scalar cost for some sequence that can't be reduced to a single load, that can be improved later.

ABataev added inline comments.Sep 25 2019, 11:10 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6458

Maybe it would be better to make this a member of TargetTransformInfo rather than of the SLP vectorizer?

spatel updated this revision to Diff 221961.Sep 26 2019, 8:27 AM
spatel marked an inline comment as done.

Patch updated:
Move the analysis/calculation of load-combining from SLP to the TTI cost model.

ABataev added inline comments.Sep 26 2019, 8:46 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6496

Shall we move all this stuff to TargetTransformInfo::getArithmeticInstrCost?

spatel marked 2 inline comments as done.Sep 26 2019, 9:04 AM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6496

Do you have a specific suggestion about how to alter that API to make this fit?

I'm not seeing how to do it without making this more confusing than having a dedicated function.

ABataev added inline comments.Sep 26 2019, 10:02 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6496

The getArithmeticInstrCost function has an optional parameter, Args. We can use it when we call this function and try to pattern-match the arguments of the instruction rather than the instruction itself, as in your original patch. Also, you will need to change the way you estimate the cost of the instruction: estimate it for a single instruction. That is going to be less precise than your current implementation, but this is a common problem of the cost model, which should be addressed separately.

spatel updated this revision to Diff 222018.Sep 26 2019, 1:16 PM
spatel marked 2 inline comments as done.

Patch updated:
Make load combining cost calculation a specialization of general arithmetic instruction cost by using optional parameter.

ABataev added inline comments.Sep 26 2019, 1:38 PM
llvm/lib/Analysis/TargetTransformInfo.cpp
633

Oops, sorry, my fault; I pointed to the wrong function. It would be better to modify a function in include/llvm/CodeGen/BasicTTIImpl.h, BasicTTIImplBase::getArithmeticInstrCost, if this is a common limitation.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6500

I don't think it is a good idea to break the contract of the function. It would be better to pass the arguments of the instruction, as the function expects.

spatel updated this revision to Diff 222178.Sep 27 2019, 8:21 AM
spatel marked an inline comment as done.

Patch updated:
Pass in reduction root operands to cost model rather than reduced value.

spatel marked an inline comment as done.Sep 27 2019, 8:28 AM
spatel added inline comments.
llvm/lib/Analysis/TargetTransformInfo.cpp
633

AFAICT, if we change the concept/model base classes, that will only be called after any target-specific override, so we would need to change all of the override implementations to access the new code. Adding code to the concrete class allows us to add the target-independent customization for a load-combine before the target does its usual logic.

Can you point to an example of what you'd like to see and/or show it exactly?

ABataev added inline comments.Sep 27 2019, 8:42 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
633

Yes, this is what I was afraid of. In general, it would be good to put it into the base class, but if you think that is not possible without code duplication in the derived classes, it is better to leave it as is. Other opinions?

spatel added a comment.Oct 4 2019, 5:40 AM

Ping.

(I did draft a code change that altered the concept/model classes, but as mentioned earlier, I couldn't make that work as intended.)

ABataev accepted this revision.Oct 4 2019, 5:52 AM

I'm good with this unless there are other opinions.

This revision is now accepted and ready to land.Oct 4 2019, 5:52 AM
RKSimon accepted this revision.Oct 4 2019, 8:09 AM

LGTM as a stopgap

This revision was automatically updated to reflect the committed changes.

This caused lots of failed asserts in building many different projects, see https://bugs.llvm.org/show_bug.cgi?id=43582, so I went ahead and reverted it for now.

RKSimon reopened this revision.Oct 7 2019, 2:50 AM
This revision is now accepted and ready to land.Oct 7 2019, 2:50 AM
RKSimon requested changes to this revision.Oct 7 2019, 2:50 AM
This revision now requires changes to proceed.Oct 7 2019, 2:50 AM
spatel added a comment.Oct 7 2019, 7:13 AM

This caused lots of failed asserts in building many different projects, see https://bugs.llvm.org/show_bug.cgi?id=43582, so I went ahead and reverted it for now.

Thanks. I looked at the test cases attached to the bug report, and this patch causes scary behavior:
LV: Found an estimated cost of 4294967293 for VF 1 For instruction: %or75 = or i32 %shl74, %shl71

The loop vectorizer assumes that costs are always positive (it converts the value returned by the cost model to an *unsigned* value).
This matches the assert in the getArithmeticInstrCost() implementation that we tried to bypass:

assert(Cost >= 0 && "TTI should not produce negative costs!");

But we want SLP to weigh the *relative* cost of scalar code (that will be reduced) vs. vector code.

I think we should use the earlier revision of this patch that created a dedicated function for estimating a load combining pattern. Ie, we tried to squeeze this into the more general getArithmeticInstrCost() API, but it does not belong there. Existing callers have made assumptions about using that cost model API, and we violated the contract:

/// This is an approximation of reciprocal throughput of a math/logic op.
/// A higher cost indicates less expected throughput.
/// From Agner Fog's guides, reciprocal throughput is "the average number of
/// clock cycles per instruction when the instructions are not part of a
/// limiting dependency chain."
/// Therefore, costs should be scaled to account for multiple execution units
/// on the target that can process this type of instruction. For example, if
/// there are 5 scalar integer units and 2 vector integer units that can
/// calculate an 'add' in a single cycle, this model should indicate that the
/// cost of the vector add instruction is 2.5 times the cost of the scalar
/// add instruction.
/// \p Args is an optional argument which holds the instruction operands
/// values so the TTI can analyze those values searching for special
/// cases or optimizations based on those values.
int getArithmeticInstrCost(
spatel updated this revision to Diff 223613.Oct 7 2019, 8:33 AM

Patch updated:
Jump back to the earlier revision which created a new method for cost of a load-combine pattern. This is independent of the existing arithmetic instruction cost API, so we can be sure that there is no conflict with other cost model users. It's also more accurate because we can add the cost of a wider load just once for the entire pattern. Test case for the loop vectorizer crash was added at rL373913.

spatel updated this revision to Diff 223625.Oct 7 2019, 10:25 AM

Patch updated:
Moved "using PatternMatch" line within function that uses that API.

Side note: filed https://bugs.llvm.org/show_bug.cgi?id=43591 for larger questions about the cost model. I really don't want to hold up the underlying motivating bugs for this patch while we try to untangle that mess.

Ping.

There seems to be general agreement that this is a temporary (stopgap) solution until we can do limited load combining in IR. So it's a question of whether we're ok with a cost model hack to overcome the motivating bugs while we figure out how to add a new pass.

ABataev added inline comments.Oct 11 2019, 9:39 AM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
1698 ↗(On Diff #223625)

Shall we just bail out of the reduction for this special construct unconditionally? Do we need this cost calculation only to prevent the reduction every time we see this pattern? If so, then we probably don't need to calculate the cost.
There is a function isTreeTinyAndNotFullyVectorizable(). Can you put the pattern-matching analysis for this particular construct in that function, without any additional cost analysis?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6495–6501

Maybe better to put the check for the operation into the function itself?

spatel marked 2 inline comments as done.Oct 16 2019, 4:46 AM
spatel added inline comments.
llvm/include/llvm/CodeGen/BasicTTIImpl.h
1698 ↗(On Diff #223625)

I'm not opposed to ignoring costs for this patch. It makes the patch simpler (although less flexible, since targets can't then override via the cost model).

I don't think we should put this directly into isTreeTinyAndNotFullyVectorizable() though. In this case, the tree may not be tiny, and it may be fully vectorizable, so that would be wrong on both counts. :)

I think the first revision of the patch was already close in spirit to this suggestion - it modified SLP alone rather than the cost model. Let me rework that, and we'll see if that looks better.

spatel updated this revision to Diff 225197.Oct 16 2019, 5:01 AM

Patch updated:
Change implementation to a pure SLP bailout based on pattern-matching of a load-combine-reduction opportunity. Same test results.

This revision was not accepted when it landed; it landed in state Needs Review.Oct 16 2019, 11:05 AM
This revision was automatically updated to reflect the committed changes.