This is an archive of the discontinued LLVM Phabricator instance.

[x86] narrow extract subvector of vector select
ClosedPublic

Authored by spatel on Jun 6 2019, 10:52 AM.

Details

Summary

This is a potentially large perf win for AVX1 targets because of the way we auto-vectorize to 256-bit but then expect the backend to legalize/optimize for the half-implemented AVX1 ISA.

On the motivating example from PR37428 (even though this patch doesn't solve the vector shift issue):
https://bugs.llvm.org/show_bug.cgi?id=37428
...there's a 16% speedup when compiling with "-mavx" (perf tested on Haswell) because we eliminate the remaining 256-bit vblendv ops.

I added comments on a couple of tests that require further work. If we have 256-bit logic ops separating the vselect and extract, we should probably narrow everything to 128-bit, but that requires a larger pattern match.

Diff Detail

Event Timeline

spatel created this revision.Jun 6 2019, 10:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 6 2019, 10:52 AM
RKSimon added inline comments.Jun 6 2019, 2:18 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
43172

Should we limit this to cases where all the uses of Sel.getOperand(0) are EXTRACT_SUBVECTORs?

RKSimon added inline comments.Jun 6 2019, 2:23 PM
llvm/test/CodeGen/X86/horizontal-reduce-smax.ll
541

This case probably has a extract_element(extract_subvector(vselect())) pattern - should the existing DAGCombine code catch it?

spatel marked 2 inline comments as done.Jun 6 2019, 4:01 PM
spatel added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
43172

This would be difficult because we peeked through bitcasts to get here (because that was required to match most of the cases in the regression tests). So we'd have to check uses of uses. A 1st attempt at that didn't go well - we lost almost all of the improvements.

llvm/test/CodeGen/X86/horizontal-reduce-smax.ll
541

I don't think we have generic combines for shrinking/scalarizing vselects. We would almost certainly need a TLI hook to decide when those are profitable.

The reason I've proposed this as an x86-specific fold is because it allows us to be more aggressive for these specific patterns. We know the 128-bit sequences are going to be cheaper than the 256-bit variants in almost all cases - even if it means we have to extract the true/false values or have extra uses of the select.

RKSimon accepted this revision.Jun 7 2019, 1:29 AM

agreed, lets go with this. LGTM.

This revision is now accepted and ready to land.Jun 7 2019, 1:29 AM
This revision was automatically updated to reflect the committed changes.