Page MenuHomePhabricator

[CostModel][X86] Improve extract/insert element costs (PR43605)
ClosedPublic

Authored by RKSimon on Feb 21 2020, 10:48 AM.

Details

Summary

This tries to improve the accuracy of extract/insert element costs by accounting for subvector extraction/insertion for >128-bit vectors and the shuffling of elements to/from the 0'th index.

It also adds INSERTPS for f32 types and PINSR/PEXTR costs for integer types (at the moment we assume the same cost as MOVD/MOVQ - which isn't always true).

Diff Detail

Event Timeline

RKSimon created this revision.Feb 21 2020, 10:48 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 21 2020, 10:48 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
RKSimon updated this revision to Diff 245927.Feb 21 2020, 11:00 AM

Cleaned up some test regenerations

lebedev.ri added inline comments.Feb 21 2020, 11:05 AM
llvm/test/Transforms/LoopVectorize/X86/interleaving.ll
2–6

This test no longer does what it was doing, i think you want to add some faster cpu runline

RKSimon updated this revision to Diff 245932.Feb 21 2020, 11:26 AM

Simplify extract cost - we only care about getting the element into the 0'th index (so not a full unary shuffle, something much cheaper).

Add AVX1/AVX2 interleaving target tests

RKSimon marked an inline comment as done.Feb 21 2020, 11:27 AM
RKSimon marked an inline comment as done.Feb 21 2020, 11:37 AM
RKSimon added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/commutativity.ll
37

It doesn't look like we were ever detecting this correctly as an insert+broadcast, we were relying on the cheap insert cost.

RKSimon planned changes to this revision.Feb 22 2020, 2:43 PM
RKSimon marked an inline comment as done.
RKSimon added inline comments.
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
2425

We need to make this extractelement only.

RKSimon updated this revision to Diff 246189.Feb 24 2020, 5:47 AM

Free/cheap 'Index == 0' cases should only be used for extractelement

RKSimon updated this revision to Diff 246210.Feb 24 2020, 7:45 AM
RKSimon edited the summary of this revision. (Show Details)

For many fp cases where we are inserting into index #0 we can fold the insertion into a preceding scalar-fp op. It seems that assuming this is always true isn't too much of a stretch.

lebedev.ri added inline comments.Feb 25 2020, 1:21 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
2408–2423

I don't really like how we are conflating vector element count and vector/element bit width,
It took me a moment to notice that Width is actually element count..

RKSimon updated this revision to Diff 246674.Feb 26 2020, 4:27 AM

Rename Width/SubWidth with NumElts/SubNumElts

lebedev.ri added inline comments.Feb 26 2020, 5:15 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
2449

insertps/extractps

llvm/test/Analysis/CostModel/X86/vector-extract.ll
41

Hm, why is this cost=0, don't we need to extract 3'th sub-register here?

RKSimon marked an inline comment as done.Feb 26 2020, 6:01 AM
RKSimon added inline comments.
llvm/test/Analysis/CostModel/X86/vector-extract.ll
41

AVX1/2 - this will legalize to 2 * <4 x double> to fit into ymm registers, so its already at index0 of the 2nd <4 x double>

lebedev.ri marked an inline comment as done.Feb 26 2020, 6:35 AM

Thanks, no further comments, this looks about reasonable to me.

llvm/test/Analysis/CostModel/X86/vector-extract.ll
41

Right.

andreadb accepted this revision.Feb 27 2020, 6:56 AM

Looks good to me.

This revision is now accepted and ready to land.Feb 27 2020, 6:56 AM
This revision was automatically updated to reflect the committed changes.

We got a report that this slows down PDF rendering in pdfium rendering of one pdf by 11%, and it slowed down rendering of 6% of pdfs of a test corpus by 2+%.

https://bugs.chromium.org/p/chromium/issues/detail?id=1067111 has repro steps. Could you take a look?

We got a report that this slows down PDF rendering in pdfium rendering of one pdf by 11%, and it slowed down rendering of 6% of pdfs of a test corpus by 2+%.

https://bugs.chromium.org/p/chromium/issues/detail?id=1067111 has repro steps. Could you take a look?

Sure, but my access to a decent linux box is limited while I'm on lockdown - is there anyway that you can raise a bug and post the IR/asm diffs from the repro please?

Sure, but my access to a decent linux box is limited while I'm on lockdown - is there anyway that you can raise a bug and post the IR/asm diffs from the repro please?

Filed https://bugs.llvm.org/show_bug.cgi?id=45418 and added a link to before/after asm.