
[SLP] add another bailout for load-combine patterns
Closed · Public

Authored by spatel on Apr 28 2020, 6:01 AM.

Details

Summary

This builds on the or-reduction bailout that was added with D67841. We still do not have IR-level load combining, although that could be a target-specific enhancement for -vector-combiner.

The heuristic is narrowly defined to catch the motivating case from PR39538:
https://bugs.llvm.org/show_bug.cgi?id=39538
...while preserving existing functionality.

That is, there is an unmodified test of a pure load/zext/store sequence at llvm/test/Transforms/SLPVectorizer/X86/cast.ll that is not affected by this patch. That is why the logic differs: it requires the 'or' instructions. The chances that vectorization actually helps a memory-bound sequence like that seem small, but the output looks nicer with:

vpmovzxwd	(%rsi), %xmm0
vmovdqu	%xmm0, (%rdi)

rather than:

movzwl	(%rsi), %eax
movl	%eax, (%rdi)
...
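The load/zext/store pattern above can be sketched in C. This is a hedged illustration: the function name and layout are made up for this example, not taken from the cast.ll test file.

```c
/* Hedged sketch: a pure load/zext/store sequence similar in spirit to
   the unmodified test at llvm/test/Transforms/SLPVectorizer/X86/cast.ll.
   The function name and shape are illustrative, not from that file. */
#include <stdint.h>

void widen4(uint32_t *dst, const uint16_t *src) {
    /* Four independent load/zext/store chains; SLP can vectorize
       these into the vpmovzxwd + vmovdqu pair shown above. */
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}
```

Because there are no 'or' instructions in this chain, the bailout in this patch leaves it alone and SLP can still vectorize it.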

In the motivating test, we avoid creating a vector mess that is unrecoverable in the backend, and SDAG forms the expected bswap instructions after load combining. Without the patch, the SLP vectorizer produces:

movzbl (%rdi), %eax
vmovd %eax, %xmm0
movzbl 1(%rdi), %eax
vmovd %eax, %xmm1
movzbl 2(%rdi), %eax
vpinsrb $4, 4(%rdi), %xmm0, %xmm0
vpinsrb $8, 8(%rdi), %xmm0, %xmm0
vpinsrb $12, 12(%rdi), %xmm0, %xmm0
vmovd %eax, %xmm2
movzbl 3(%rdi), %eax
vpinsrb $1, 5(%rdi), %xmm1, %xmm1
vpinsrb $2, 9(%rdi), %xmm1, %xmm1
vpinsrb $3, 13(%rdi), %xmm1, %xmm1
vpslld $24, %xmm0, %xmm0
vpmovzxbd %xmm1, %xmm1 # xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
vpslld $16, %xmm1, %xmm1
vpor %xmm0, %xmm1, %xmm0
vpinsrb $1, 6(%rdi), %xmm2, %xmm1
vmovd %eax, %xmm2
vpinsrb $2, 10(%rdi), %xmm1, %xmm1
vpinsrb $3, 14(%rdi), %xmm1, %xmm1
vpinsrb $1, 7(%rdi), %xmm2, %xmm2
vpinsrb $2, 11(%rdi), %xmm2, %xmm2
vpmovzxbd %xmm1, %xmm1 # xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
vpinsrb $3, 15(%rdi), %xmm2, %xmm2
vpslld $8, %xmm1, %xmm1
vpmovzxbd %xmm2, %xmm2 # xmm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero
vpor %xmm2, %xmm1, %xmm1
vpor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rsi)

With this patch, the scalar code reaches the backend intact, and load combining produces:
movl	(%rdi), %eax
movl	4(%rdi), %ecx
movl	8(%rdi), %edx
movbel	%eax, (%rsi)
movbel	%ecx, 4(%rsi)
movl	12(%rdi), %ecx
movbel	%edx, 8(%rsi)
movbel	%ecx, 12(%rsi)
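The load-combine idiom that this bailout protects can be sketched in C. This is a hedged illustration, not PR39538's reproducer; the helper name is made up. Each byte is loaded, zero-extended, shifted, and 'or'ed together, which the backend can collapse into a single load plus byte swap (the movbe forms above) only if SLP leaves the scalar chain alone.

```c
/* Hedged sketch of the load-combine idiom: byte loads merged with
   shifts and 'or's into one big-endian 32-bit load. The backend can
   collapse this chain into a single load plus bswap (or movbe), but
   only while it remains scalar. */
#include <stdint.h>

static uint32_t load_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
           (uint32_t)p[3];
}
```

If SLP vectorizes the individual byte loads instead, the single-load-plus-bswap form is no longer recoverable, which is the "vector mess" shown above.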

Diff Detail

Event Timeline

spatel created this revision. Apr 28 2020, 6:01 AM
Herald added a project: Restricted Project. Apr 28 2020, 6:01 AM
ABataev added inline comments. Apr 28 2020, 6:33 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3725–3736

Maybe make it part of the isTreeTinyAndNotFullyVectorizable() function?

spatel marked an inline comment as done. Apr 28 2020, 7:13 AM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3725–3736

I was against that in D67841 because:
"I don't think we should put this directly into isTreeTinyAndNotFullyVectorizable() though. In this case, the tree may not be tiny, and it may be fully vectorizable, so that would be wrong on both counts. :)"

Also, there's still hope that we can eventually remove these heuristics, so that will be simpler if we keep this logic independent.

ABataev added a comment.

Might it be that on some platforms it is profitable to vectorize such sequences rather than use scalarized code? If so, it would be better to implement cost analysis.

spatel added a comment. Tue, May 5, 7:52 AM

> Might it be that on some platforms it is profitable to vectorize such sequences rather than use scalarized code? If so, it would be better to implement cost analysis.

I think we already went down this path as an intermediate step with:
https://reviews.llvm.org/D67841?id=223377

And that had to be reverted.

Whether the vectorization may be profitable is not in question. The cost model -- as this x86 example shows -- says the vectorization is profitable, and that is correct given the IR.

The problem is that the cost model can't tell us that the scalar code will eventually be transformed by the backend into something much simpler. So we need a heuristic somewhere, and we agreed in the last patch that this is only a problem for SLP. Trying to make the patch a general cost model fix exposed much larger problems in the definition/usage of the cost models themselves (see for example https://bugs.llvm.org/show_bug.cgi?id=43591 ). That's being addressed with a major rework of the cost models (D76124, D78547, D78922, etc).

This revision is now accepted and ready to land. Tue, May 5, 8:06 AM
This revision was automatically updated to reflect the committed changes.
hans added a subscriber: hans. Thu, May 7, 7:32 AM

This caused asserts in Chromium builds, see https://bugs.chromium.org/p/chromium/issues/detail?id=1079294#c2 for a stand-alone reproducer.

I've reverted in c54c6ee1a7a26c994eff32ce35862641db47f305 for now.

spatel added a comment. Thu, May 7, 9:04 AM

> This caused asserts in Chromium builds, see https://bugs.chromium.org/p/chromium/issues/detail?id=1079294#c2 for a stand-alone reproducer.
>
> I've reverted in c54c6ee1a7a26c994eff32ce35862641db47f305 for now.

Thanks. Looks like constant expressions strike again.