Page MenuHomePhabricator

[X86][AVX] Prefer VINSERTF128 over VPERM2F128 for 128->256 subvector concatenations
ClosedPublic

Authored by RKSimon on Oct 17 2021, 5:08 AM.

Details

Summary

The VINSERTF128 instruction is often much quicker, and never slower, than the more general VPERM2F128 instruction, so we should try to use that in more circumstances.

This requires a fallback to a commuted VPERM2F128 for the case where we need to fold the 256-bit vector source instead of the 128-bit subvector source.

There is one interesting side effect - DAGCombine's narrowExtractedVectorLoad combine gets called in a number of locations, this always creates an extracted subvector load without regard to other uses of the original wider load. I'm expecting AVX cpus to be capable of merging such aliased loads, but I do wonder whether narrowExtractedVectorLoad's call to X86TargetLowering::shouldReduceLoadWidth needs to be altered to check for more partial uses?

Noticed while investigating the quality of interleaved load/store codegen.

Diff Detail

Event Timeline

RKSimon created this revision.Oct 17 2021, 5:08 AM
RKSimon requested review of this revision.Oct 17 2021, 5:08 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 17 2021, 5:08 AM
RKSimon edited the summary of this revision. (Show Details)Oct 17 2021, 5:18 AM
pengfei added inline comments.Oct 17 2021, 7:27 AM
llvm/lib/Target/X86/X86InstrFragmentsSIMD.td
1061

Nit: one more space.

llvm/test/CodeGen/X86/pr50823.ll
11–13 ↗(On Diff #380238)

Is this a regression?

RKSimon added inline comments.Oct 17 2021, 7:50 AM
llvm/test/CodeGen/X86/pr50823.ll
11–13 ↗(On Diff #380238)

I don't believe so: https://simd.godbolt.org/z/rhrqsss5a - as I said in the summary, vinsertX128 tends to be cheaper than more general cross-lane shuffles.

lebedev.ri added inline comments.Oct 17 2021, 8:08 AM
llvm/test/CodeGen/X86/pr50823.ll
11–13 ↗(On Diff #380238)

We were loading 128 bits, and then fold-loading 128 more bits,
and now we load 256 bits, and then fold-load high 128 bits we just loaded, no?
The vinsertf128 should be dropped because it is a no-op.

RKSimon added inline comments.Oct 17 2021, 8:16 AM
llvm/test/CodeGen/X86/pr50823.ll
11–13 ↗(On Diff #380238)

Sorry - brain fog - the offset is +32 bytes, not +16bytes - so isn't that loading the lower half of the next <8 x float>?

lebedev.ri added inline comments.Oct 17 2021, 9:01 AM
llvm/test/CodeGen/X86/pr50823.ll
11–13 ↗(On Diff #380238)

Err, right. But still, the first load is wider now.
Is that intentional to break the false dep,
or is this a demandedelts failure?
IIRC demandedelts generally doesn't want to narrow the loads,
so perhaps the fix should lie in not forming this YMM load in the first place.

Speaking of later updating the costs, what do you think about marking the tuples
needing revisiting in the cost tables for which their codegen has changed
in llvm/test/CodeGen/X86/vector-interleaved-* here?

Speaking of later updating the costs, what do you think about marking the tuples
needing revisiting in the cost tables for which their codegen has changed
in llvm/test/CodeGen/X86/vector-interleaved-* here?

I agree - we should be trying to help maintain the link between the two.

pengfei added inline comments.Oct 17 2021, 6:10 PM
llvm/test/CodeGen/X86/avx512-shuffles/partial_permute.ll
4419–4422 ↗(On Diff #380238)

How about this one, seems we have one more vinsertf128 now?
https://simd.godbolt.org/z/4MMs99E7K

llvm/test/CodeGen/X86/pr50823.ll
11–13 ↗(On Diff #380238)

I understand it now, thank you.

RKSimon updated this revision to Diff 381894.Oct 25 2021, 2:26 AM

Limited the fold to AVX1 only - until I can determine a better way to prefer PERMQ/PERMPD

pengfei accepted this revision.Oct 29 2021, 6:32 AM

LGTM.

llvm/lib/Target/X86/X86ISelLowering.cpp
36871

Ending with space.

llvm/lib/Target/X86/X86InstrSSE.td
7136–7138

So these patterns are not used in this patch?

This revision is now accepted and ready to land.Oct 29 2021, 6:32 AM
RKSimon added inline comments.Oct 29 2021, 6:52 AM
llvm/lib/Target/X86/X86InstrSSE.td
7136–7138

They are used - although on AVX2 targets (down at line 7753-7756), there's no way to make use of the new 'Folding "To" vector' pattern inside vinsert_lowering for VINSERTI128, but the other patterns are still used on AVX2 targets.

pengfei added inline comments.Oct 29 2021, 7:05 AM
llvm/lib/Target/X86/X86InstrSSE.td
7136–7138

Oh, I guess I wanted to comment on 7753 :)

This revision was landed with ongoing or failed builds.Nov 1 2021, 3:51 AM
This revision was automatically updated to reflect the committed changes.