Page MenuHomePhabricator

[SLP] allow matching integer min/max intrinsics as reduction ops
ClosedPublic

Authored by spatel on Mar 19 2021, 12:53 PM.

Details

Summary

As noted in D98152, we need to patch SLP to avoid regressions when we start canonicalizing to integer min/max intrinsics.
Most of the real work to make this possible was in 7202f47508 .
But I am posting this for review just in case anyone sees or knows of other problems that may result from the switch to intrinsics.

I suspect that we will need to adjust the cost models or tests because the PhaseOrdering test in D98152 still doesn't vectorize with only x86 SSE2 (it does change if I add an attribute for SSE4.1).

Diff Detail

Event Timeline

spatel created this revision.Mar 19 2021, 12:53 PM
spatel requested review of this revision.Mar 19 2021, 12:53 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 19 2021, 12:53 PM

I suspect that we will need to adjust the cost models or tests because the PhaseOrdering test in D98152 still doesn't vectorize with only x86 SSE2 (it does change if I add an attribute for SSE4.1).

Taking a closer look at that example, and I think this will actually be an improvement (ie, we should adjust the test, not the cost model).

Currently, we are favoring vectorization based on x86 SSE2 costs, but it seems wrong...

Without vectorizing we have a chain of cmov:

movl	(%rdi), %eax
movl	4(%rdi), %ecx
cmpl	%eax, %ecx
cmovll	%ecx, %eax
movl	8(%rdi), %ecx
cmpl	%eax, %ecx
cmovll	%ecx, %eax
movl	12(%rdi), %ecx
cmpl	%eax, %ecx
cmovll	%ecx, %eax

With vectorization (but without the expected min/max instructions or even blendv), we have more code + transfer from xmm to GPR:

movdqu	(%rdi), %xmm0
pshufd	$238, %xmm0, %xmm1              # xmm1 = xmm0[2,3,2,3]
movdqa	%xmm1, %xmm2
pcmpgtd	%xmm0, %xmm2
pand	%xmm2, %xmm0
pandn	%xmm1, %xmm2
por	%xmm0, %xmm2
pshufd	$85, %xmm2, %xmm0               # xmm0 = xmm2[1,1,1,1]
movdqa	%xmm0, %xmm1
pcmpgtd	%xmm2, %xmm1
pand	%xmm1, %xmm2
pandn	%xmm0, %xmm1
por	%xmm2, %xmm1
movd	%xmm1, %eax
RKSimon accepted this revision.Mar 22 2021, 7:15 AM

LGTM - I agree pre-SSE41 4i32 min/max patterns aren't particularly great so you should probably just ensure we have SSE4.1 + later test coverage in D98152

This revision is now accepted and ready to land.Mar 22 2021, 7:15 AM

FYI: this is causing bug https://bugs.llvm.org/show_bug.cgi?id=49730. It seems that there is more than one place scattered over the code that match exactly select instruction, and they need some update. I propose to revert this and re-enable after all such places are updated properly.

spatel reopened this revision.Mar 26 2021, 11:41 AM

Reopening - the original commit was reverted because it could crash ( https://llvm.org/PR49730 ).

This revision is now accepted and ready to land.Mar 26 2021, 11:41 AM
spatel updated this revision to Diff 333602.Mar 26 2021, 11:48 AM
spatel added a reviewer: mkazantsev.

Patch updated:
This version includes a change to create min/max intrinsics only if we started by matching min/max intrinsics. We continue to create cmp+select if the original code matched that pattern. This avoids the crashing seen in PR49730. Hopefully, we can remove all of the select matching/creation after we canonicalize to the intrinsics.

We could pull the new part of this patch into an NFC-preliminary commit, but I could not find a way to show a test difference from that part alone.

mkazantsev added a comment.EditedMar 28 2021, 10:25 PM

I confirm that with this version of patch the original failure is gone. For me it looks fine, but I'm not a compitent reviewer to approve it because I don't know SLP well enough. Thanks!

RKSimon accepted this revision.Mar 29 2021, 3:54 AM

LGTM

I confirm that with this version of patch the original failure is gone. For me it looks fine, but I'm not a compitent reviewer to approve it because I don't know SLP well enough. Thanks!

Thanks again for fuzz testing this! I can't say if there are more corner-cases that I haven't accounted for.
If we can't get this right, an alternative approach would be to give up on trying to handle both forms (cmp+select and intrinsic) of min/max. We could change SLP to only recognize min/max intrinsics simultaneously with the patch for instcombine that canonicalizes to those forms.