This is an archive of the discontinued LLVM Phabricator instance.

[VectorCombine] Simplify to scalar store if only one element updated
Closed, Public

Authored by qiucf on Mar 9 2021, 2:00 AM.

Details

Summary

This is the VectorCombine version of D71828, which simplifies the load-insertelement-store pattern into a getelementptr plus scalar store.
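
For illustration, the kind of transform intended here (hand-written IR; function and value names are made up and not taken verbatim from the patch's tests) turns

; load the whole vector, replace one lane, store the whole vector back
define void @insert_store(<16 x i8>* %q, i8 %s) {
  %v = load <16 x i8>, <16 x i8>* %q
  %ins = insertelement <16 x i8> %v, i8 %s, i32 3
  store <16 x i8> %ins, <16 x i8>* %q
  ret void
}

into a store of only the updated element:

define void @insert_store(<16 x i8>* %q, i8 %s) {
  %gep = getelementptr inbounds <16 x i8>, <16 x i8>* %q, i32 0, i32 3
  store i8 %s, i8* %gep
  ret void
}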

Diff Detail

Event Timeline

qiucf created this revision.Mar 9 2021, 2:00 AM
qiucf requested review of this revision.Mar 9 2021, 2:00 AM
Herald added a project: Restricted Project. · Mar 9 2021, 2:00 AM
HLJ2009 added a subscriber: HLJ2009.Mar 9 2021, 2:26 AM
qiucf updated this revision to Diff 329261.Mar 9 2021, 2:35 AM

Fix pipeline tests

spatel added a subscriber: asbirlea.Mar 9 2021, 8:38 AM

Since the original proposal, MemorySSA has evolved. I still don't know much about that though. cc @asbirlea @nikic to see if this implementation is OK or whether we should use MemorySSA here?

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
777

Can shorten:

if (!SI || !SI->isSimple() ...)
829

This line of the comment is not accurate now. Remove or update.

fhahn added a comment.Mar 10 2021, 3:18 AM

Since the original proposal, MemorySSA has evolved. I still don't know much about that though. cc @asbirlea @nikic to see if this implementation is OK or whether we should use MemorySSA here?

MemorySSA may be handy in this case, but it looks like it's not available close-by in the current pipeline position?

nikic added inline comments.Mar 10 2021, 3:28 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
810

This line shouldn't be dropped.

qiucf updated this revision to Diff 329968.Mar 11 2021, 7:55 AM
qiucf marked 3 inline comments as done.

Address comments

fhahn added a comment.Apr 11 2021, 2:13 PM

I put up a patch to handle extractelement (load %ptr), %index in a similar fashion: D100273. Together with this patch, they greatly improve codegen for certain code generated with the C/C++ matrix extension: https://clang.godbolt.org/z/qsccPdPf4

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
761

I think we need a limit here, to avoid excessive compile time?

802

Do we need a pointer cast here? I *think* we can just create a GEP with the vector pointer and indices 0, Idx.

805

What about other metadata?

807

Is there a reason to remove the instruction here? I don't think the other functions do so, so it might be better to keep things consistent (or change it for other patterns as well)

qiucf updated this revision to Diff 336754.Apr 11 2021, 11:12 PM
qiucf marked 3 inline comments as done.

Address comments

qiucf added inline comments.Apr 11 2021, 11:13 PM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
807

I think that's because we're operating on a store, which will not be erased automatically just because it has no uses.

fhahn added inline comments.Apr 12 2021, 6:43 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
795

Why only allow constants here? If the index is a non-constant, it should be even more profitable, because most targets probably do not have instructions to insert at a variable index.
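
For illustration (a sketch with made-up names; it assumes the non-constant index %idx is known to be in range for the vector):

define void @insert_store_variable_idx(<4 x i32>* %p, i32 %s, i32 %idx) {
  %v = load <4 x i32>, <4 x i32>* %p
  %ins = insertelement <4 x i32> %v, i32 %s, i32 %idx
  store <4 x i32> %ins, <4 x i32>* %p
  ret void
}

; could become
%gep = getelementptr inbounds <4 x i32>, <4 x i32>* %p, i32 0, i32 %idx
store i32 %s, i32* %gep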

qiucf updated this revision to Diff 337032.Apr 12 2021, 8:09 PM
qiucf marked an inline comment as done.

Allow non-constant index.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
795

Good catch. Thanks.

LGTM, but @fhahn likely has a wider/fresher view of the expected patterns, so see if there are any other comments.

fhahn added inline comments.Apr 14 2021, 6:49 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
54

This seems a bit high to start with, perhaps we should start with a smaller limit? Unfortunately there are not many instances of this pattern in the test-suite/SPEC2000/SPEC2006 so we can't really evaluate the impact.

768

This could potentially be made a bit simpler by using any_of instead of the explicit loop.

800

I think we are also missing test coverage for the case where the load and store are not in the same BB?

801

do we have test coverage for that?

802

I think we still need some more coverage for the various scenarios for isMemModifiedBetween, e.g. the case where we have stores in between that are must-alias, no-alias and may-alias. Also, we should have a test for the limit.

884

I think there are some oddities with respect to GlobalsAA and it should also be preserved, e.g. see D82342 (same for the new PM)

lebedev.ri added inline comments.Apr 14 2021, 6:53 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
53–54

Option name doesn't match the description/usage. Iteration, to me, means how many times we will rerun this whole pass, while you use it as the number of instructions to scan.

qiucf updated this revision to Diff 337649.Apr 15 2021, 12:50 AM
qiucf marked 6 inline comments as done.

Address comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
884

Do you mean dropping this and using the result from GlobalsAA? I see GlobalsAA is already preserved.

RKSimon added inline comments.
llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll
3 (On Diff #337649)

You've made this an X86 test but used a very generic data layout (and tested big endian as well) - maybe move it out of the X86 subdirectory?

fhahn added inline comments.Apr 19 2021, 6:00 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
784

I'm not sure the vector GEP will properly handle types that are not a power of 2. Perhaps it would be good to limit this to types for which the following holds: DL.typeSizeEqualsStoreSize(LI->getType())?

884

Ah never mind, I missed that it was already preserved.

fhahn added inline comments.Apr 19 2021, 6:42 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
809

I think we need to set the alignment here. For example, the original store could have an alignment less than the default for the type.
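
Something along these lines, for example (illustrative IR with made-up names, not a test from the patch):

define void @insert_store_align1(<8 x i16>* %p, i16 %s) {
  %v = load <8 x i16>, <8 x i16>* %p, align 1
  %ins = insertelement <8 x i16> %v, i16 %s, i32 2
  store <8 x i16> %ins, <8 x i16>* %p, align 1
  ret void
}

; the scalarized store must not claim the default i16 alignment:
%gep = getelementptr inbounds <8 x i16>, <8 x i16>* %p, i32 0, i32 2
store i16 %s, i16* %gep, align 1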

llvm/test/Transforms/InstCombine/load-insert-store.ll
1

I'm not sure why this has been removed?

llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll
15 (On Diff #337649)

can you also add a test with element types > i8?

qiucf updated this revision to Diff 339536.Apr 22 2021, 2:34 AM
qiucf marked 3 inline comments as done.

Address some comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
784

Hmm.. Not sure what you mean, maybe something like

%0 = getelementptr inbounds <16 x i24>, <16 x i24>* %q, i32 0, i32 3
store i24 %s, i24* %0, align 4

This should be the expected result. Or do we need to ensure the scalar type size of the load equals the store's value type size?

llvm/test/Transforms/InstCombine/load-insert-store.ll
1

It's moved from InstCombine to VectorCombine, not deleted.

llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll
15 (On Diff #337649)

Do you mean insertelement <16 x i8> %0, i32 %s, i32 2? If not, there's a test for <8 x i16> below.

fhahn added a subscriber: bjope.Apr 23 2021, 6:59 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
784

Yes, this is what I meant. The created pointer looks OK, but I think this is an area where it's worth being extra careful. I am not too familiar with the potential problems, but I think @bjope has experience with such targets. @bjope WDYT, do you think this would work as expected?

(My preference would be to start by disallowing the transform if the sizes don't match and to enable it once we have confirmation that it is safe.)

809

Perhaps I missed it, but could you add a test that specifies a smaller alignment for a store (align 1) for a vector of i16 or larger?

llvm/test/Transforms/InstCombine/load-insert-store.ll
1

Oh right, so when it was originally added the plan was to optimize the pattern in instcombine? Might be worth moving the file in a separate commit and then just having the diff here show the changes from this patch.

llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll
15 (On Diff #337649)

yep I missed that one, sorry.

bjope added inline comments.Apr 26 2021, 12:43 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
784

Yes, I think this would be a bit more complicated if trying to support types for which DL.typeSizeEqualsStoreSize(SI->getType()) isn't true. Since such vectors are bit-packed, you can't simply address a single vector element using a GEP; you would end up clobbering adjacent elements when doing the store (unless doing some kind of read-modify-write operation).

Maybe you even need to look at the typeAllocSize (comparing it with the storeSize) if alignment is important (to avoid misaligned stores).

But I also wonder if using GEPs to address individual vector elements is something we do elsewhere. I know that https://llvm.org/docs/GetElementPtr.html#can-gep-index-into-vector-elements still says that "In the future, it will probably be outright disallowed.". Have there been discussions somewhere about opening up "Pandora's box" (?) and starting to do such things? If so, is that well tested somewhere?

For example something like this compiles without any complaints:

; RUN: llc -O3 -mtriple x86_64-- -o - %s
target datalayout = "E"
define void @foo(<8 x i4>* %q, i4 %s) {
  %p = getelementptr inbounds <8 x i4>, <8 x i4>* %q, i32 0, i32 3
  store i4 %s, i4* %p
  ret void
}

but it ends up writing to 3(%rdi), while element 3 of that vector actually occupies half of the byte at 1(%rdi).

fhahn added inline comments.Apr 26 2021, 7:23 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
784

Yes, I think this would be a bit more complicated if trying to support types for which DL.typeSizeEqualsStoreSize(SI->getType()) isn't true. Since such vectors are bit-packed, you can't simply address a single vector element using a GEP; you would end up clobbering adjacent elements when doing the store (unless doing some kind of read-modify-write operation).

Thanks for confirming!

Maybe you even need to look at the typeAllocSize (comparing it with the storeSize) if alignment is important (to avoid misaligned stores).

We are setting the alignment of the store to the minimum of the alignment for the scalar store and the original alignment of the store. Would that be missing something?

But I also wonder if using GEPs to address individual vector elements is something we do elsewhere.

It looks like at least instcombine likes to introduce such GEPs. Originally I was using something similar to the code in the link, which instcombine turned into a vector GEP, hence I updated D100273 to directly emit vector GEPs and suggested that here as well.

https://godbolt.org/z/q4o6fM3eP

qiucf updated this revision to Diff 341243.Apr 28 2021, 9:41 AM

Update tests and the store size constraint. Thanks for the explanation from @bjope and @fhahn.

fhahn added a comment.May 1 2021, 9:08 AM

Update tests and the store size constraint. Thanks for the explanation from @bjope and @fhahn.

Thanks for the update! I think the only thing missing is a few tests for uncommon vector types, like <8 x i4> or <4 x i31>. Probably worth having at least 2 different tests with different types.
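
For instance, something like this (an illustrative negative test with made-up names; with the store-size constraint the transform is expected to leave it alone, since i4's size does not equal its store size):

define void @insert_store_nonbyte(<8 x i4>* %q, i4 %s) {
  %v = load <8 x i4>, <8 x i4>* %q
  %ins = insertelement <8 x i4> %v, i4 %s, i32 3
  store <8 x i4> %ins, <8 x i4>* %q
  ret void
}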

qiucf updated this revision to Diff 343568.May 6 2021, 7:46 PM

Add more tests.

fhahn accepted this revision.May 7 2021, 12:53 PM

LGTM, thanks!

This revision is now accepted and ready to land.May 7 2021, 12:53 PM
hgreving added a subscriber: hgreving.EditedMay 25 2021, 8:08 PM

For targets that do not support scalar loads from vector memory (like ours), this breaks things:

%43 = load <8 x i32>, <8 x i32> addrspace(201)* %1, align 32, !tbaa !28
%44 = extractelement <8 x i32> %43, i32 0

Now:

%43 = getelementptr inbounds <8 x i32>, <8 x i32> addrspace(201)* %1, i32 0, i32 0
%44 = load i32, i32 addrspace(201)* %43, align 32

Are targets expected to provide patterns?

fhahn added a comment.May 26 2021, 1:14 AM

For targets that do not support scalar loads from vector memory (like ours), this breaks things:

%43 = load <8 x i32>, <8 x i32> addrspace(201)* %1, align 32, !tbaa !28
%44 = extractelement <8 x i32> %43, i32 0

Now:

%43 = getelementptr inbounds <8 x i32>, <8 x i32> addrspace(201)* %1, i32 0, i32 0
%44 = load i32, i32 addrspace(201)* %43, align 32

Are targets expected to provide patterns?

Interesting! I guess the code assumes that a scalar load is always possible & at least as cheap as the vector version. But I think it would make sense to ask the cost-model if that's the case. Not sure if it would be possible to test this with an in-tree target?

lebedev.ri added a comment.EditedMay 26 2021, 1:19 AM

Hmm wait, I completely ignored this patch :/
Does this really not do any cost modelling?
This should at least check that the scalar load isn't more costly than the original vector load + insertelement + vector store.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
768–770

Does this ignore debuginfo?

Sorry for completely ignoring this :(
I'm fine with fixing this up as a followup, if that happens today.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
810–813

I'm certain this is a miscompile. The alignment that was valid for the vector store is not guaranteed to be valid for the store of a single vector element, unless it's the 0th element, of course.

I think this needs to be something like

Align NewAlign = Align(1);
if (auto *C = dyn_cast<ConstantInt>(Idx)) {
  NewAlign = std::max(OldStoreAlign, OldLoadAlign);
  NewAlign = commonAlignment(
      NewAlign, C->getZExtValue() * DL.getTypeStoreSize(NewElement->getType()));
}
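
Concretely (a hand-written example, not from the tests): for a <4 x i32> store with align 16, element 1 sits at byte offset 4, so the scalarized store may only claim commonAlignment(16, 4) = align 4:

define void @align_example(<4 x i32>* %p, i32 %s) {
  %v = load <4 x i32>, <4 x i32>* %p, align 16
  %ins = insertelement <4 x i32> %v, i32 %s, i32 1
  store <4 x i32> %ins, <4 x i32>* %p, align 16
  ret void
}

; after scalarization:
%gep = getelementptr inbounds <4 x i32>, <4 x i32>* %p, i32 0, i32 1
store i32 %s, i32* %gep, align 4        ; claiming align 16 here would be a miscompile
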
lebedev.ri added inline comments.May 26 2021, 1:33 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
808

Technically, the size of the insertelement index doesn't have to match the size of the GEP index; the latter is controlled by the datalayout. I'm not sure if/how we need to deal with that here, however.

hgreving added a comment.EditedMay 26 2021, 7:01 AM

For targets that do not support scalar loads from vector memory (like ours), this breaks things:

%43 = load <8 x i32>, <8 x i32> addrspace(201)* %1, align 32, !tbaa !28
%44 = extractelement <8 x i32> %43, i32 0

Now:

%43 = getelementptr inbounds <8 x i32>, <8 x i32> addrspace(201)* %1, i32 0, i32 0
%44 = load i32, i32 addrspace(201)* %43, align 32

Are targets expected to provide patterns?

Interesting! I guess the code assumes that a scalar load is always possible & at least as cheap as the vector version. But I think it would make sense to ask the cost-model if that's the case. Not sure if it would be possible to test this with an in-tree target?

Hi, thanks for getting back to me. I'm not sure it's a cost-model question; a straight-up switch to disable morphing vector dereferences into scalar ones might be better? Is there anything else in this pass that might do that? Unfortunately, yes, I think there's no proper upstream target with this constraint, though I am guessing I am not the only downstream target with a vector memory like that. The problem with trying to make this work is that I am worried about what happens to the pointer. Will I always be able to rely on it being aligned? Probably not...

fhahn added a comment.May 30 2021, 5:01 AM

Interesting! I guess the code assumes that a scalar load is always possible & at least as cheap as the vector version. But I think it would make sense to ask the cost-model if that's the case. Not sure if it would be possible to test this with an in-tree target?

Hi, thanks for getting back to me. I'm not sure it's a cost-model question; a straight-up switch to disable morphing vector dereferences into scalar ones might be better? Is there anything else in this pass that might do that? Unfortunately, yes, I think there's no proper upstream target with this constraint, though I am guessing I am not the only downstream target with a vector memory like that. The problem with trying to make this work is that I am worried about what happens to the pointer. Will I always be able to rely on it being aligned? Probably not...

I guess it depends on whether the backend can legalize/convert back to a vector load. In general, relying on the middle-end not to scalarize those loads for correctness seems a bit fragile. I'm not sure it makes sense to add a TTI hook that's not used by any in-tree targets.

@qiucf would you be able to look into extending the code to check for the cost?

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
810–813

That's a good point. @qiucf can you look into this?

qiucf added a comment.May 30 2021, 7:10 PM

Interesting! I guess the code assumes that a scalar load is always possible & at least as cheap as the vector version. But I think it would make sense to ask the cost-model if that's the case. Not sure if it would be possible to test this with an in-tree target?

Hi, thanks for getting back to me. I'm not sure it's a cost-model question; a straight-up switch to disable morphing vector dereferences into scalar ones might be better? Is there anything else in this pass that might do that? Unfortunately, yes, I think there's no proper upstream target with this constraint, though I am guessing I am not the only downstream target with a vector memory like that. The problem with trying to make this work is that I am worried about what happens to the pointer. Will I always be able to rely on it being aligned? Probably not...

I guess it depends on whether the backend can legalize/convert back to a vector load. In general, relying on the middle-end not to scalarize those loads for correctness seems a bit fragile. I'm not sure it makes sense to add a TTI hook that's not used by any in-tree targets.

@qiucf would you be able to look into extending the code to check for the cost?

I'll look into it, thanks. Sorry for the late notice.