This is an archive of the discontinued LLVM Phabricator instance.

[X86][AVX] scalar_to_vector(load_scalar()) -> load_vector() for fast dereferenceable loads
AbandonedPublic

Authored by RKSimon on Jul 19 2021, 7:43 AM.

Details

Summary

As reported on PR51075, we fail to make use of dereferenceable 128-bit vector loads for float2 loads that are then widened for float4 operations, preventing a useful load-fold.

We already perform a similar fold for insert_subvector patterns of 128-bit loads through 256-bit dereferenceable pointers.
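
For illustration, a minimal IR sketch of the pattern in question (function and value names are hypothetical, not taken from the PR51075 reproducer): an 8-byte <2 x float> load from a pointer known to be dereferenceable for 16 bytes, widened to <4 x float> for later float4 arithmetic.

; Hypothetical reduction of the PR51075 pattern.
; Only 8 bytes are loaded, but dereferenceable(16) guarantees that a full
; 16-byte load from %p cannot fault (it says nothing about the extra bytes' contents).
define <4 x float> @widen_float2(ptr dereferenceable(16) %p) {
  %v2 = load <2 x float>, ptr %p, align 4
  %v4 = shufflevector <2 x float> %v2, <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
  ret <4 x float> %v4
}

With the fold, the backend is free to select a single 128-bit vector load here instead of a 64-bit load plus an insert.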

Diff Detail

Event Timeline

RKSimon created this revision. Jul 19 2021, 7:43 AM
RKSimon requested review of this revision. Jul 19 2021, 7:43 AM
Herald added a project: Restricted Project. Jul 19 2021, 7:43 AM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
8611

Can you please precommit this case change?

efriedma added inline comments.
llvm/test/CodeGen/X86/load-partial-dot-product.ll
183

Even if we're allowed to do this, it doesn't seem wise; having zero in the high bits of the register is better than random junk. Can we mark up the loads somehow?

RKSimon added inline comments. Jul 19 2021, 9:43 AM
llvm/test/CodeGen/X86/load-partial-dot-product.ll
183

Isn't that what the dereferenceable(16) tag is doing?

pengfei added inline comments. Jul 19 2021, 5:46 PM
llvm/test/CodeGen/X86/load-partial-dot-product.ll
183

I have the same doubt. dereferenceable(16) tells us the memory for the high bits is accessible, but shouldn't we always prefer loading fewer bytes for performance?
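
For context, per the LangRef, dereferenceable(16) only promises that 16 bytes starting at the pointer can be accessed without faulting; it says nothing about the contents of those extra bytes, and nothing about whether touching them is profitable. A hypothetical signature carrying the attribute, for illustration only:

declare float @dot2(ptr dereferenceable(16), ptr dereferenceable(16))

So the tag makes the wider load legal; whether it is desirable is the question being debated here.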

RKSimon planned changes to this revision. Jul 20 2021, 3:47 AM

I'll have a think about possible alternatives for now

Please note that this patch is very partial.
The actual assembly diff should be as follows:
https://godbolt.org/z/W47nvzc4e

I think it is clear from that diff that the wide load is unquestionably better.

llvm/test/CodeGen/X86/load-partial-dot-product.ll
183

You are comparing apples to oranges here.
The problem is that the vinsertps is clearly redundant and should go away.
Once it does, the wide load is the obvious win - we have one fewer memory access.

pengfei added inline comments. Jul 20 2021, 6:09 AM
llvm/test/CodeGen/X86/load-partial-dot-product.ll
183

I see it now. It makes sense if the intent is to turn

vmovsd {{.*#+}} xmm0 = mem[0],zero
vinsertps {{.*#+}} xmm0 = xmm0[0,1],mem[0],xmm0[3]

into
vmovups (%rdi), %xmm0

Took a stab at the VectorCombine side of the puzzle: D106399
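
At the IR level, that transform amounts to roughly the following rewrite (a sketch only, not the actual D106399 code; it assumes %p is known to be dereferenceable for 16 bytes, e.g. via an argument attribute, and that the high lanes of the widened value are don't-care):

; before: 8-byte load, widened with undefined high lanes
%v2 = load <2 x float>, ptr %p, align 4
%v4 = shufflevector <2 x float> %v2, <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>

; after: one 16-byte load (legal because the extra bytes are known to be readable)
%v4 = load <4 x float>, ptr %p, align 4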

Should this wait for D106447?

> Should this wait for D106447?

No, probably not.

RKSimon abandoned this revision. Jan 25 2022, 5:07 AM