This is vector-combine version of revision D71828, which simplifies load-insertelt-store pattern into getelementptr-store.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
MemorySSA may be handy in this case, but it looks like it's not available close-by in the current pipeline position?
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
810 | This line shouldn't be dropped. |
I put up a similar patch to handle extractelement (load %ptr), %index in a similar fashion: D100273. Together with this patch, they greatly improve codegen for certain code generated using the C/C++ matrix extension https://clang.godbolt.org/z/qsccPdPf4
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
761 | I think we need a limit here, to avoid excessive compile time? | |
802 | Do we need a pointer cast here? I *think* we can just create a GEP with the vector pointer and indices 0, Idx. | |
805 | What about other metadata? | |
807 | Is there a reason to remove the instruction here? I don't think the other functions do so, so it might be better to keep things consistent (or change it for other patterns as well) |
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
807 | I think that's because we're operating on a store, which will not be erased automatically for empty use. |
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
795 | Why only allow constants here? If the index is a non-constant, it should be even more profitable, because most targets probably do not have instructions to insert at a variable index. |
Allow non-constant index.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
795 | Good catch. Thanks. |
LGTM, but @fhahn likely has a wider/fresher view of the expected patterns, so see if there are any other comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
54 | This seems a bit high to start with, perhaps we should start with a smaller limit? Unfortunately there are not many instances of this pattern in the test-suite/SPEC2000/SPEC2006 so we can't really evaluate the impact. | |
768 | this potentially could be made a bit simpler by using any_of instead of the explicit loop. | |
800 | I think we are also missing test coverage for the case where load and store are not in the same BB? | |
801 | do we have test coverage for that? | |
802 | I think we still need some more coverage for the various scenarios for isMemModifiedBetween, .e.g. the case where we have stores in between that are must-alias, no-alias and may-alias. Also we should have a test for the limit. | |
884 | I think there are some oddities with respect to GlobalsAA and it should also be preserved, e.g. see D82342 (same for the new PM) |
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
53–54 | Option name doesn't match the description/usage. |
Address comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
884 | Do you mean dropping this and use result from GlobalsAA? I see GlobalsAA is already preserved. |
llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll | ||
---|---|---|
3 | You've made this a X86 test but used a very generic data layout (and tested on big endian as well) - maybe move out of the x86 sub directory? |
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
784 | I'm not sure if the vector GEP will properly handle types that are non-power-of-2 properly. Perhaps it might be good to limit this to types for which the following holds? SI->typeSizeEqualsStoreSize(LI->getType())? | |
884 | Ah never mind, I missed that it was already preserved. |
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
809 | I think we need to set the alignment here. For example, the original store could have a alignment less than the default for the type. | |
llvm/test/Transforms/InstCombine/load-insert-store.ll | ||
1 | I'm not sure why this has been removed? | |
llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll | ||
15 | can you also add a test with element types > i8? |
Address some comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
784 | Hmm.. Not sure what you mean, maybe something like %0 = getelementptr inbounds <16 x i24>, <16 x i24>* %q, i32 0, i32 3 store i24 %s, i24* %0, align 4 this should be expected result. Or do we need to ensure scalar type size of load equals to store's value type size? | |
llvm/test/Transforms/InstCombine/load-insert-store.ll | ||
1 | It's moved from InstCombine to VectorCombine, not deleted. | |
llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll | ||
15 | Do you mean insertelement <16 x i8> %0, i32 %s, i32 2? If not, there's a test for <8 x i16> below. |
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
784 | Yes this was what I meant. The created pointer looks OK, but I think this is an area where it's worth to be extra careful. I am not too familiar with the potential problems, but I think @bjope has experience with such targets. @bjope WDYT, do you think this would work as expected? (my preference would be to start with dis-allowing the transform if the sizes don't match initially and enable it once we got confirmation that it is safe) | |
809 | Perhaps I missed it, but could you add a test that specifies a smaller alignment for a store (!align 1) for a vector of i16 or larger? | |
llvm/test/Transforms/InstCombine/load-insert-store.ll | ||
1 | Oh right, so when it was originally added the plan was to optimize the pattern in instcombine? Might be worth moving the file in a separate commit and then just have the diff here show the changes by this patch,. | |
llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll | ||
15 | yep I missed that one, sorry. |
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
784 | Yes, I think this would be a bit more complicated if trying to support types for with DL.typeSizeEqualsStoreSize(SI->getType()) isn't true. Since the vectors are bit-packed you can't simply address a single vector element using a GEP otherwise. You wouldn't end up clobbering adjacent elements when doing the store (unless doing some kind of read-modify-write operation). Maybe you even need to look at the typeAllocSize (comparing it with the storeSize) if alignment is important (to avoid misaligned stores). But I also wonder if using GEP:s to address individual vector elements is something we do elsewhere. I know that https://llvm.org/docs/GetElementPtr.html#can-gep-index-into-vector-elements still says that "In the future, it will probably be outright disallowed.". Has there been discussions somewhere about opening up "pandoras box" (?) and start doing such things? If so, is that well tested somewhere? For example something like this compiles without any complaints: ; RUN: llc -O3 -mtriple x86_64-- -o - %s target datalayout = "E" define void @foo(<8 x i4>* %q, i4 %s) { %p = getelementptr inbounds <8 x i4>, <8 x i4>* %q, i32 0, i32 3 store i4 %s, i4* %p ret void } but it ends up writing to 3(%rdi) while the third element in that vector actually is at half of the byte at 1(%rdi). |
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
784 |
Thanks for confirming!
We are setting the alignment of the store to the minimum of the alignment for the scalar store and the original alignment of the store. Would the be missing something?
It looks like at least instcombine likes to introduce such GEPs. Originally I was using something to the code in the link, which instcombine turned in a vector GEP, hence I updated D100273 to directly emit vector GEPS and suggested that here as well. |
Thanks for the update! I think the only thing missing is a few tests for un-common vector types, like <8 x i4> or <4 x i31>. Probably worth to have at least 2 different tests with different types.
For targets not supporting scalar load from vector memory (like ours), this breaks it:
%43 = load <8 x i32>, <8 x i32> addrspace(201)* %1, align 32, !tbaa !28 %44 = extractelement <8 x i32> %43, i32 0
Now:
%43 = getelementptr inbounds <8 x i32>, <8 x i32> addrspace(201)* %1, i32 0, i32 0 %44 = load i32, i32 addrspace(201)* %43, align 32
Are targets expected to provide patterns?
Interesting! I guess the code assumes that a scalar load is always possible & at least as cheap as the vector version. But I think it would make sense to ask the cost-model if that's the case. Not sure if it would be possible to test this with an in-tree target?
Hmm wait, i completely ignored this patch :/
Does this really not do any cost modelling?
This should at least check that scalar load isn't more costly
than the original vector load + insertelement + vector store.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
768–770 | Does this ignore debuginfo? |
Sorry for completely ignoring this :(
I'm fine with fixing this up as a followup, if that happens today.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
810–813 | I'm certain this is a miscompile. I think this needs to be something like newalign = 1; if(auto*C = dyn_cast<ConstantInt>(Idx)) { newalign = max(old store align, old load align); newalign = commonAlignment(newalign, Idx * DL.getTypeSize(NewElement.getType())) } |
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
808 | Technically, size of the insertelement index doesn't have to match the size of the GEP index, |
Hi thanks for getting back to me. I'm not sure if it's a cost model question, a straight-up disable switch for not morphing vector derefs into scalar might be better? Is there anything else in this pass that might do that? Unfortunately yes, I think there's no proper upstream target with this constraint. Though I am guessing I am not the only downstream target with a vector memory like that. The problem with trying to make this work is that I am worried about what happens to the pointer. Will I always be able to rely on that it will be aligned, probably not...
I guess it depends on whether the backend can legalize/convert back to a vector load. In general, relying on the middle-end to no scalarize those loads for correctness seems a bit fragile. I'm not sure if it makes sense to add a TTI hook that's not used by any in tree targets.
@qiucf would you be able to look into extending the code to check for the cost?
llvm/lib/Transforms/Vectorize/VectorCombine.cpp | ||
---|---|---|
810–813 | That's a good point. @qiucf can you look into this? |
This seems a bit high to start with, perhaps we should start with a smaller limit? Unfortunately there are not many instances of this pattern in the test-suite/SPEC2000/SPEC2006 so we can't really evaluate the impact.