This is an archive of the discontinued LLVM Phabricator instance.

[SLP] try to create vector loads from bitcasted scalar pointers
Abandoned · Public

Authored by spatel on Jul 3 2019, 10:14 AM.

Details

Summary

This doesn't help the motivating cases in:
https://bugs.llvm.org/show_bug.cgi?id=16739
...yet, but I'd like to get feedback on the general approach.

The general idea is that if we have a legal vector pointer type, but we are bitcasting that pointer to only load a subset of a vector, then load the whole vector (if that is safe) and extract the subset of the vector.

This will allow SLP and/or instcombine to fold subsequent scalar ops together more easily because they will see extractelement ops from a single vector rather than incomplete parts of that vector.

Currently, this transform will make no overall difference to these most basic patterns because the backend (DAGCombiner) will narrow the loads back down to scalars via narrowExtractedVectorLoad().
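The idea can be sketched with a minimal IR example (illustrative only; the element types and the dereferenceable size here are assumptions, not taken from the patch):

```llvm
; Before: a scalar load through a bitcast of a vector pointer.
define float @scalar_via_bitcast(<4 x float>* dereferenceable(16) %p) {
  %bc = bitcast <4 x float>* %p to float*
  %s = load float, float* %bc, align 16
  ret float %s
}

; After: since all 16 bytes are known dereferenceable, load the whole
; vector and extract the requested element.
define float @whole_vector_extract(<4 x float>* dereferenceable(16) %p) {
  %v = load <4 x float>, <4 x float>* %p, align 16
  %s = extractelement <4 x float> %v, i32 0
  ret float %s
}
```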

Diff Detail

Event Timeline

spatel created this revision. Jul 3 2019, 10:14 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jul 3 2019, 10:14 AM
ABataev added inline comments. Jul 3 2019, 10:37 AM
llvm/test/Transforms/SLPVectorizer/X86/load-bitcast-vec.ll
7–9

Seems to me, it must be a masked load rather than just a load. Plus, what about the cost? This does not look cost-optimal.

spatel marked an inline comment as done. Jul 3 2019, 10:47 AM
spatel added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/load-bitcast-vec.ll
7–9

If the load is guaranteed dereferenceable, does that not allow a speculative load of the entire vector?

I'm open to suggestions about the cost calc. It's not clear to me if there's an existing TTI API for this or if we need to create a new one.

I am probably missing something, but isn't this profitable only if we have more than one of these scalar loads extracting from the same vector load? Perhaps we could use these scalar loads as seeds and do a short top-down SLP?
Also, isn't it better to run this after vectorizeStoreChains()?

I am probably missing something, but isn't this profitable only if we have more than one of these scalar loads extracting from the same vector load?

It's almost certainly more profitable with >1 load, but that's not a requirement for profitability from what I can tell. Here's a more realistic example for the single load case:

define <2 x i64> @load_splat(<4 x float>* dereferenceable(16) %p, <2 x i64> %y) {
  %bc = bitcast <4 x float>* %p to i64*
  %ld = load i64, i64* %bc, align 16
  %ins = insertelement <2 x i64> undef, i64 %ld, i32 0
  %splat = shufflevector <2 x i64> %ins, <2 x i64> undef, <2 x i32> zeroinitializer
  %add  = add <2 x i64> %splat, %y
  ret <2 x i64> %add
}

Current codegen for an x86 SSE target:

movq	(%rdi), %xmm1           # xmm1 = mem[0],zero
pshufd	$68, %xmm1, %xmm1       # xmm1 = xmm1[0,1,0,1]
paddq	%xmm1, %xmm0

With this patch, the IR is reduced to:

$ ./opt load-bitcast-vec.ll -S -mtriple=x86_64-- -slp-vectorizer -instcombine
define <2 x i64> @larger_scalar(<4 x float>* dereferenceable(16) %p, <2 x i64> %y) {
  %1 = bitcast <4 x float>* %p to <2 x i64>*
  %2 = load <2 x i64>, <2 x i64>* %1, align 16
  %splat = shufflevector <2 x i64> %2, <2 x i64> undef, <2 x i32> zeroinitializer
  %add = add <2 x i64> %splat, %y
  ret <2 x i64> %add
}

And codegen improves by folding the load:

pshufd	$68, (%rdi), %xmm1      # xmm1 = mem[0,1,0,1]
paddq	%xmm1, %xmm0

Perhaps we could use these scalar loads as seeds and do a short top-down SLP?
Also, isn't it better to run this after vectorizeStoreChains()?

I don't have enough familiarity with SLP to know how to best fit these pieces together, but those seem like reasonable ideas. I was only trying to assess viability with this initial proposal - and not break anything. :)

Thanks for sharing the example.

Isn't this something that should be pattern-matched in instcombine/codegen and not in SLP?
What I mean is that if we have multiple of these loads, then this transformation should obviously be performed by the vectorizer. But if we only have one scalar load, then it looks a bit odd to do it in SLP.

Isn't this something that should be pattern-matched in instcombine/codegen and not in SLP?

Great question. I've been wondering how to solve PR16739 for about 5 years now! Here are my current answers:

  1. It's not possible for instcombine to create vector loads from scalar loads because we have no legality/cost model there. In fact, we have a request to do an opposing transform in instcombine in PR42424:

https://bugs.llvm.org/show_bug.cgi?id=42424 (but as I said there, I don't think instcombine is allowed to do that transform either)

  2. It is possible to handle the most basic case in DAGCombiner, but it requires propagating the dereferenceable attribute/metadata through the transition from IR to the SDAG. There's partial precedent for that, but I'm not sure if it's enough to handle the general case. The disadvantage of waiting that long is that we may miss IR-based (instcombine) transforms and then have to recreate those in SDAG.

What I mean is that if we have multiple of these loads, then this transformation should obviously be performed by the vectorizer. But if we only have one scalar load, then it looks a bit odd to do it in SLP.

Yes, I'd like to extend this to handle the >1 load case, but I was starting with the minimal pattern and hoping to build on that. I'm imagining that if we have >1 load from a base pointer, then we'll group those together and create a sequence of extracts from a single vector load. If I need to show that as part of the initial patch, I'll extend this patch now.
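A hypothetical shape of that >1 load case (function and value names here are invented for illustration):

```llvm
; Two scalar loads from the same base pointer...
define i64 @two_scalar_loads(<2 x i64>* dereferenceable(16) %p) {
  %bc = bitcast <2 x i64>* %p to i64*
  %g1 = getelementptr i64, i64* %bc, i32 1
  %lo = load i64, i64* %bc, align 16
  %hi = load i64, i64* %g1, align 8
  %r = add i64 %lo, %hi
  ret i64 %r
}

; ...grouped into one vector load feeding a sequence of extracts.
define i64 @one_vector_load(<2 x i64>* dereferenceable(16) %p) {
  %v = load <2 x i64>, <2 x i64>* %p, align 16
  %lo = extractelement <2 x i64> %v, i32 0
  %hi = extractelement <2 x i64> %v, i32 1
  %r = add i64 %lo, %hi
  ret i64 %r
}
```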

Ping.

Unanswered questions:

  1. Is there a better cost query than checking whether the target has a vector register (TTI->getRegisterBitWidth(true)) that exceeds the load size?
  2. Do we require that multiple scalar loads be subsumed by the vector load?

I personally think this seems to be going in the right direction, though it isn't obvious without some more complicated tests that would show the further transforms this could allow.

llvm/test/Transforms/SLPVectorizer/X86/load-bitcast-vec.ll
7–9

I agree that there is no reason this should be a masked load.
Do we have opposite folds for this in dagcombine?

spatel marked an inline comment as done. Jul 25 2019, 4:35 AM
spatel added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/load-bitcast-vec.ll
7–9

Yes - see narrowExtractedVectorLoad() in DAGCombiner.

lebedev.ri marked an inline comment as done. Jul 25 2019, 5:45 AM
lebedev.ri added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/load-bitcast-vec.ll
7–9

Then as far as I'm concerned this is a zero-cost change.

ABataev added inline comments. Jul 25 2019, 6:35 AM
llvm/test/Transforms/SLPVectorizer/X86/load-bitcast-vec.ll
7–9

getCastInstrCost + getMemoryOpCost for scalar instructions.
getMemoryOpCost + getExtractWithExtendCost for vector instructions. No?

lebedev.ri added inline comments. Jul 25 2019, 6:45 AM
llvm/test/Transforms/SLPVectorizer/X86/load-bitcast-vec.ll
7–9

Then as far as I'm concerned this is a zero-cost change.

... in the sense that if further passes don't make more use of this load,
it is guaranteed to be demoted back into a simple scalar load.

spatel updated this revision to Diff 211803. Jul 25 2019, 12:55 PM

Patch updated:
Use the TTI cost model to compare the costs of the original and new load sequences.

spatel marked an inline comment as done. Aug 1 2019, 8:18 AM

Ping.

ABataev added inline comments. Aug 8 2019, 7:12 AM
llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
86

Comment?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5075

Not sure this is the right place to call this function. I would suggest doing it at the end of vectorization. Plus, this should update the value of the Changed variable.

5250

Load->isSimple() instead of Load->isVolatile() || Load->isAtomic()

5268–5272

Why? Long vectors could be split into several smaller vector loads successfully.

5280–5306

Could you reuse the original logic for building the vectorization tree, cost calculation, etc.? This looks like bikeshedding and does not allow extending it to other ops.

spatel updated this revision to Diff 214195. Aug 8 2019, 11:12 AM
spatel marked 4 inline comments as done.

Patch updated:

  1. Added documentation comment for vectorizeLoads().
  2. Use isSimple() to filter out 'volatile' and other loads that we don't want to alter.
  3. Moved vectorization of loads to end of processing per block (not sure if that answers the request for "end of vectorization" though).
  4. Removed check for load larger than vector register size (that was an attempt to not create something harmful, but now we are using the cost model).

It's not clear to me how to re-use the existing tree model/cost code, so I'm still looking into that. Suggestions appreciated.

@spatel Has this been superseded by D81766?

spatel abandoned this revision. Aug 8 2020, 9:26 AM

@spatel Has this been superseded by D81766?

Yes, we are keying off of a different pattern now, but with the same motivation (though neither implementation would help PR16739 yet... follow-ups expected).