This is an archive of the discontinued LLVM Phabricator instance.

[VectorCombine] Widening of partial vector loads
Abandoned · Public

Authored by lebedev.ri on Jul 20 2021, 2:40 PM.

Details

Summary

If we are loading a vector that we know will be legalized into (potentially several) vector registers, and we load fewer bytes than the total size of those occupied registers, then legalization will have a hard time: at worst, the load will be at least partially scalarized, with the scalar elements then reinserted to form the narrow vector.

But sometimes, if the vector will be widened, we can tell that we are allowed
to load those extra 'padding' elements, based on dereferenceability or alignment knowledge.

I think this approach of asking legalization for the final vector size
is the most straightforward.

I've checked, and I believe this is endianness-agnostic, as per alive2.
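For illustration, a hypothetical sketch (in modern opaque-pointer IR; the function names and types are made up, not taken from the patch) of the rewrite being described: a 3-element load that would otherwise be partially scalarized becomes a full-register load plus a shuffle that drops the padding lane, valid only because the extra bytes are known dereferenceable:

```llvm
; Before: the <3 x float> load occupies one XMM register after
; legalization, but only 12 of the register's 16 bytes are loaded.
define <3 x float> @narrow(ptr dereferenceable(16) %p) {
  %v = load <3 x float>, ptr %p, align 16
  ret <3 x float> %v
}

; After: load all 16 bytes, then shuffle away the padding element.
define <3 x float> @widened(ptr dereferenceable(16) %p) {
  %wide = load <4 x float>, ptr %p, align 16
  %v = shufflevector <4 x float> %wide, <4 x float> poison,
                     <3 x i32> <i32 0, i32 1, i32 2>
  ret <3 x float> %v
}
```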

Diff Detail

Event Timeline

lebedev.ri created this revision.Jul 20 2021, 2:40 PM
lebedev.ri edited the summary of this revision. (Show Details)Jul 20 2021, 2:43 PM

Streamline the code.

Thanks for looking at this - I'd delayed requesting something like this until we have a better idea of what the non-pow2 SLP IR from D57059 is likely to look like.

I also ended up wondering whether we should consider using the dereferenceable data in DAGTypeLegalizer::GenWidenVectorLoads, as that would help with the float3 case on D106280 that raised concern

Thanks for looking at this - I'd delayed requesting something like this until we have a better idea of what the non-pow2 SLP IR from D57059 is likely to look like.

Even if SLP learns to emit wide-enough loads (which it should, regardless of non-pow2 vectorization support/etc),
I would guess we'd still want this, because here in IR we have much more information for deducing
the legality of such a transformation than the backend does.
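As a sketch of the kind of IR-level information meant here (a made-up example, not from the patch): besides explicit dereferenceability, alignment alone can justify widening, since a sufficiently aligned region cannot straddle a page boundary:

```llvm
; The 12-byte load below is 16-byte aligned, so the 4 padding bytes lie
; within the same 16-byte region, and hence the same page; speculatively
; loading them cannot introduce a new fault on targets with page-granular
; memory protection.
define <3 x i32> @aligned(ptr align 16 %p) {
  %v = load <3 x i32>, ptr %p, align 16
  ret <3 x i32> %v
}
```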

I also ended up wondering whether we should consider using the dereferenceable data in DAGTypeLegalizer::GenWidenVectorLoads, as that would help with the float3 case on D106280 that raised concern

I actually tried looking at exactly that before doing this, without much success.

Thanks for working on this! I agree that this is useful independent of whatever we can/should do to improve SLP.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
290

worth -> width

312–313

Use the unary variant of CreateShuffleVector here.
Could call this value ExtractSubvector or state that in the code comment. Can we always assume that the extract op is free, or should we add that potential cost into the equation?

llvm/test/Transforms/VectorCombine/X86/load-inseltpoison.ll
590–593

Can we preserve the better alignment?

spatel added inline comments.Jul 21 2021, 7:36 AM
llvm/test/Transforms/VectorCombine/X86/load-widening.ll
1–2

Hmm...I don't think I've ever tried that experiment. :)
Did you confirm that the layout "wins" over the target to provide the coverage you expected?

D57059 looks to be performing masked loads when the wider range is dereferenceable (costs permitting) - see the llvm/test/Transforms/SLPVectorizer/X86/dot-product.ll test changes

lebedev.ri marked 4 inline comments as done.

Thanks for taking a look!
Addressing nits.

D57059 looks to be performing masked loads when the wider range is dereferenceable (costs permitting) - see the llvm/test/Transforms/SLPVectorizer/X86/dot-product.ll test changes

Then we'll also need a load unmasking transform.
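Such an unmasking transform might look like the following hypothetical sketch (names made up): when every byte the mask could touch is known dereferenceable, the masked load can be replaced by a plain load, since the plain load only refines the poison passthru lane:

```llvm
declare <4 x float> @llvm.masked.load.v4f32.p0(ptr, i32, <4 x i1>, <4 x float>)

; Masked form, as SLP might emit it when only 12 bytes are known loadable:
; the disabled high lane yields the poison passthru.
define <4 x float> @masked(ptr %p) {
  %v = call <4 x float> @llvm.masked.load.v4f32.p0(ptr %p, i32 16,
           <4 x i1> <i1 true, i1 true, i1 true, i1 false>,
           <4 x float> poison)
  ret <4 x float> %v
}

; Unmasked form, legal once all 16 bytes are known dereferenceable; the
; plain load refines the poison lane, so this is a valid replacement.
define <4 x float> @unmasked(ptr dereferenceable(16) %p) {
  %v = load <4 x float>, ptr %p, align 16
  ret <4 x float> %v
}
```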

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
290

This is not a typo, but okay.

312–313

The subvector we're extracting here will be legalized (widened) back into the full vector we've just loaded,
so the shuffle just pretends that a few high elements of that legal vector don't exist.
That is the whole assumption behind the transformation.
There isn't any actual subvector extraction here,
so I'm not really seeing why we'd need to check the cost of that.
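A hypothetical sketch of the point being made (made-up names): the narrow result of the unary shuffle is itself widened back to the register-sized type during legalization, so the "extract" never materializes as an instruction:

```llvm
define <3 x i32> @no_real_extract(<4 x i32> %wide) {
  ; The <3 x i32> result is widened back to <4 x i32> by the legalizer,
  ; so this shuffle only relabels which lanes are considered live; no
  ; subvector-extract operation is emitted for it.
  %narrow = shufflevector <4 x i32> %wide, <4 x i32> poison,
                          <3 x i32> <i32 0, i32 1, i32 2>
  ret <3 x i32> %narrow
}
```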

llvm/test/Transforms/VectorCombine/X86/load-inseltpoison.ll
590–593

Some other transformation does this; I'll take a look.

llvm/test/Transforms/VectorCombine/X86/load-widening.ll
1–2
spatel added inline comments.Jul 21 2021, 8:28 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
312–313

Ok - please add a code comment with that explanation then. That way, we'll document why we don't factor in the cost of the shuffle.

llvm/test/Transforms/VectorCombine/X86/load-widening.ll
1–2

I'm not questioning the logic, just that there is no big-endian x86 target, so I don't know what is happening internally in LLVM in this situation.

I think it would be better to add a different test file for a target that actually does support both modes (AArch64 or PowerPC?). We still want an x86 test file to verify that SSE vs. AVX is behaving as expected though.

194

Can you explain the difference between this test and vec_with_7elts_256bits for an SSE target? It's not obvious to me why we are ok widening to 256-bit if that's not legal, but not wider than that.

Something to keep in mind: loading more data than was stored can prevent store-to-load forwarding. If there is a store in the store buffer that hasn't been written to the cache yet, its data can be forwarded to a load instead of waiting for the store to reach the cache. This doesn't work if the load is larger than the store; such a load has to wait until the store reaches the cache and merges with the surrounding data.
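A made-up IR sketch of the hazard being described, not taken from the patch's tests:

```llvm
; Store-to-load-forwarding hazard: the 16-byte load fully covers but is
; wider than the in-flight 8-byte store, so a typical CPU cannot forward
; the store-buffer data and the load stalls until the store reaches the
; cache and merges with the surrounding bytes.
define <4 x i32> @stlf_blocked(ptr %p) {
  store <2 x i32> <i32 1, i32 2>, ptr %p, align 16
  %v = load <4 x i32>, ptr %p, align 16
  ret <4 x i32> %v
}
```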

lebedev.ri marked an inline comment as done.Jul 21 2021, 12:04 PM
lebedev.ri marked an inline comment as done.Jul 21 2021, 12:10 PM
lebedev.ri added inline comments.
llvm/test/Transforms/VectorCombine/X86/load-widening.ll
172

We widen this to either 2x XMM or 1x YMM, and we know we can do this per the dereferenceability info.

194

We need to widen this to either 3x XMM or 2x YMM, but we don't know that we can load that many bytes.

There is another problem hiding here: even if we know we can load 3x XMM, we still try to widen to 4x XMM,
because that's what the legalizer told us, since it only knows how to double.
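To illustrate the doubling problem with a hypothetical case (not one of the patch's actual tests): a 12-element load needs exactly 3x XMM on an SSE target, but widening by doubling the element count overshoots to 4x XMM:

```llvm
; 48 bytes fit exactly in 3x XMM, yet the legalizer widens <12 x float>
; to the next power-of-two element count, <16 x float> (4x XMM, 64 bytes),
; so the transform ends up demanding 16 extra dereferenceable bytes that
; the 3x-XMM lowering would not actually need.
define <12 x float> @twelve(ptr dereferenceable(64) %p) {
  %v = load <12 x float>, ptr %p, align 16
  ret <12 x float> %v
}
```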

Matt added a subscriber: Matt.Jul 22 2021, 3:35 AM
lebedev.ri planned changes to this revision.Jul 22 2021, 5:48 AM
lebedev.ri marked an inline comment as done.

I guess I need to mimic X86TTIImpl::getMemoryOpCost() more closely.

lebedev.ri abandoned this revision.Jan 17 2022, 2:35 PM