This is an archive of the discontinued LLVM Phabricator instance.

[x86] make SLM extract vector element more expensive than default
ClosedPublic

Authored by spatel on Nov 22 2019, 10:51 AM.

Details

Summary

I'm not sure what the effect of this change will be on all of the affected tests or a larger benchmark, but it fixes the horizontal add/sub problems noted here:
https://reviews.llvm.org/D59710?vs=227972&id=228095&whitespace=ignore-most#toc

The costs are based on reciprocal throughput numbers in Agner's tables for PEXTR*; these appear to be very slow ops on Silvermont.

This is a small step towards the larger motivation discussed in PR43605:
https://bugs.llvm.org/show_bug.cgi?id=43605

Also, it seems likely that insert/extract is the source of perf regressions on other CPUs (up to 30%) that were cited as part of the reason to revert D59710, so maybe we'll extend the table-based approach to other subtargets.

Diff Detail

Event Timeline

spatel created this revision.Nov 22 2019, 10:51 AM
Herald added a project: Restricted Project.Nov 22 2019, 10:51 AM
craig.topper added inline comments.Nov 22 2019, 11:07 AM
llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
302–305

I'm not sure I understand what's happening here. SLM doesn't have 256-bit vectors. Is this going to codegen well?

RKSimon added inline comments.Nov 22 2019, 12:19 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
2412

You should be able to do:

int ISD = TLI->InstructionOpcodeToISD(Opcode);
assert(ISD);
llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
302–305

Probably the cost model type legalization has kicked in. It may be that it's not handling EXTRACT_SUBVECTOR shuffle costs or something, so it ends up scalarizing?

spatel marked an inline comment as done.Nov 22 2019, 1:02 PM
spatel added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
302–305

I didn't step through SLP, but I agree this is suspicious. Either way, we end up with virtually identical asm before and after this change:

movdqa	%xmm0, %xmm4
movdqa	%xmm1, %xmm5
punpckhqdq	%xmm2, %xmm0    # xmm0 = xmm0[1],xmm2[1]
punpckhqdq	%xmm3, %xmm1    # xmm1 = xmm1[1],xmm3[1]
punpcklqdq	%xmm2, %xmm4    # xmm4 = xmm4[0],xmm2[0]
punpcklqdq	%xmm3, %xmm5    # xmm5 = xmm5[0],xmm3[0]
paddq	%xmm4, %xmm0
paddq	%xmm5, %xmm1
spatel updated this revision to Diff 230715.Nov 22 2019, 1:17 PM
spatel marked an inline comment as done.

Patch updated:
Use InstructionOpcodeToISD() to simplify code.

spatel marked an inline comment as done.Nov 24 2019, 6:00 AM
spatel added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
302–305

I'm still not clear on exactly how SLP does its accounting, but debug output shows that when it used to evaluate the 4-wide vector ops, it saw this:
SLP: Spill Cost = 0.
SLP: Extract Cost = 4.
SLP: Total Cost = 6.

...and decided that would not be profitable. But when it then evaluates doing the ops as 2-wide (128-bit), it sees this:

SLP: Spill Cost = 0.
SLP: Extract Cost = 2.
SLP: Total Cost = -1.
SLP: Vectorizing list at cost:-5.

So that's worth doing. With this patch, it now sees this at 4-wide:

SLP: Spill Cost = 0.
SLP: Extract Cost = 56.
SLP: Total Cost = -40.
SLP: Vectorizing list at cost:-44.

This seems more truthful: the cost of an extract on SLM is very large relative to the cost of the vector ops.

The cost model itself deals with illegal types (as here - 256-bit on a subtarget where that is not legal) by doing a simple scaling: see lines 2393, 2412 in the source code diff in this patch.

This revision is now accepted and ready to land.Nov 26 2019, 11:30 AM
This revision was automatically updated to reflect the committed changes.