This patch changes the order of GEPs generated by Splitting GEPs pass, specially when one of the GEPs has constant and the base is loop invariant, then we will generate the GEP with constant first when beneficial, to expose more cases for LICM. If originally Splitting GEP generate the following: do.body.i: %idxprom.i = sext i32 %shr.i to i64 %2 = bitcast %typeD* %s to i8* %3 = shl i64 %idxprom.i, 2 %uglygep = getelementptr i8, i8* %2, i64 %3 %uglygep7 = getelementptr i8, i8* %uglygep, i64 1032 ... Now it genereates: do.body.i: %idxprom.i = sext i32 %shr.i to i64 %2 = bitcast %typeD* %s to i8* %3 = shl i64 %idxprom.i, 2 %uglygep = getelementptr i8, i8* %2, i64 1032 %uglygep7 = getelementptr i8, i8* %uglygep, i64 %3 ... For no-loop cases, the original way of generating GEPs seems to expose more CSE cases, so we don't change the logic for no-loop cases, and only limit our change to the specific case we are interested in.
Details
Diff Detail
Event Timeline
Hi Lawrence,
I am not sure the transformation is legal, I’ll come back on that, but assuming it is, I think we could consider that moving the constant elements of a gep at the beginning of a gep-chain is the canonical representation.
If people agree on this, then this transformation would be better suited as an inst combine and we can get rid of all the is loop invariant checks.
Now, going back to the legality of the transformation, I am definitely not a language lawyer but it seems to me we may slightly modify the behavior of the program.
E.g., consider
C = gep A, B
D = gep C, 1023
Where gep C, 1023 produces an overflow.
After this transformation, we would get:
New = gep A, 1023
D = gep B, New
Now, the overflow may be on New = gep A, 1023, which potentially affects the behavior of the program.
Thoughts?
Cheers,
-Quentin
Hi, Quentin:
Thanks for your prompt response.
Just want to clarify, after the transformation, the new code would be:
C = gep A, 1023
D = gep C, B // Note that B is still the second operand.
Based on the testcase originated this, those GEPs were generated by Split GEP pass, that means originally there is only one GEP related to A, so overflow on the first or the second should not make too much difference, because they were all based on A originally , that's why I checked the type of A and C must be the same.
However whether the check is strict enough to make sure the transformation is legal, I am not 100% sure, definitely I'd like to hear more comments.
About doing it in inst combining, I am not an expert of inst combining, my question is: is it possible for inst combining to hoist instruction outside loop? The important aspect of this transformation is to enable LICM to hoist a loop invariant computation outside loop, which is impossible before since the first GEP is loop variant, and we have to check if the first GEP is loop variant (or don't know), because this transformation is not needed otherwise -- If LICM can hoist the first GEP out, then it will hoist the second GEP out too.
More comments are welcome.
Regards
Lawrence Hu
I believe that if you restrict to 'inbounds' GEPs, you can avoid the overflow question.
About doing it in inst combining, I am not an expert of inst combining, my question is: is it possible for inst combining to hoist instruction outside loop?
No, but Quentin is right. When this kind of interchange is legal, we should establish a canonical form and use instcombine to transform to it. This will also allow CSE/GVN, etc. to pick up more of these cases.
Please upload your patches with full context.
I don't think GEP reassociation should be done as a canonical InstCombine for two reasons
(1) the inbounds property would need to be dropped
(2) Unlike Reassociate, InstCombine has no knowledge of the CFG
I think this should be part of SeparateConstOffsetFromGEP, which is also optimizing for register pressure and addressing modes. As part of this pass it is target configurable and controlled by EnableGEPOpt.
Can you even do this if you don't have inbounds GEPs? Nuno, do you have an opinion about this?
(2) Unlike Reassociate, InstCombine has no knowledge of the CFG
I think this should be part of SeparateConstOffsetFromGEP, which is also optimizing for register pressure and addressing modes. As part of this pass it is target configurable and controlled by EnableGEPOpt.
Thanks Hal & Andrew for your valuable comments.
I uploaded the full content diff.
Plus:
- This patches originally is developed as part of Splitting GEP, later I moved it into LICM because I can reuse loopinfo in LICM and it save an extra loop to collect information (won't increase compile time much, but should avoid if possible anyway), and avoid some mess check to avoid performance regression. It completely ok to move it back to Splitting GEP if needed, as long as very one agrees.
- For the pattern originated this, inbound is set to false by Splitting GEP.
- If we want to do it in InstCombine, then we must make sure there is one pass of Splitting GEP, InstCombine, LICM happen in sequence, I will check that.
I will not change the design of this patch for now, will do it after most of the people reach an agreement.
Regards
Lawrence Hu
I think Andy is right.
If you don't have the inbounds tag, then you can safely perform this transformation, since LLVM's semantics guarantee that it will compute the overflowing value correctly. The implementation of the lowering and constant folding of GEPs in LLVM matches the semantics.
If you have the inbounds tag, then the transformation can still be performed, although inbounds has to dropped in most cases, since otherwise we would need to prove that the intermediate results don't overflow.
This problem is similar to reassociation of integer expressions. It's always safe when there are no nsw/nuw, but if there are, they usually have to be dropped.
Nuno
Hi, Guys:
Can we reach an agreement about where to do it? I posted my older patch which is part of Split GEP at http://reviews.llvm.org/D11443.
Regards
You need to address this:
If you have the inbounds tag, then the transformation can still be performed, although inbounds has to dropped in most cases, since otherwise we would need to prove that the intermediate results don't overflow.
By either restricting to GEPs that don't have inbounds, or dropping inbounds after performing the transformation. Specifically, I think what Nuno is saying is that, for example, if we have:
ptr = (p + o) + c
and both are inbounds, then we know that (p + o) does not overflow, and specifically is no more than one byte past the end of the object to which p belongs. We also know that (p + o) + c offers the same guarantees. We don't, however, know that (p + c) is necessarily in bounds. Imagine that (p + o) is one past the end of the object, and c is -1. Then p + c might point before the beginning of the object.
We should not give up too easily, however, because is p is some known object we might be able to easily determine whether (p + c) is still inbounds. One way to do this would be to call:
V->stripAndAccumulateInBoundsConstantOffsets
on the base pointer and then see if we determine the size of the object. You can call getObjectSize(...) to attempt to figure this out (include/llvm/Analysis/MemoryBuiltins.h).
lib/Transforms/Scalar/LICM.cpp | ||
---|---|---|
415 ↗ | (On Diff #29689) | This seems a bit odd. We should not be looking only at the very next instruction. At the very least, check for users in the same block. |
Thanks Hal.
For you comments about stripAndAccumulateInBoundsConstantOffsets and getObjectSize.., I didn't understand it 100%, please let me know if that is what you are thinking.
Regards
I mean that you can call stripAndAccumulateInBoundsConstantOffsets on the first GEP you form (the one with the constant index), which will give you a new base and an offset. You can then call getObjectSize on that new base, and that might give you the object size. If the Offset <= ObjectSize, then you can mark the first GEP as inbounds.
Regards
Some comments about the inbounds changes. They also need to be covered by the regression tests.
lib/Transforms/Scalar/LICM.cpp | ||
---|---|---|
1145 ↗ | (On Diff #30640) | I think this can be uge, because inbounds pointers are allowed to point to one byte past the end of the object (and thus, equal the size). Actually, you want Offset.ule (the offset must be less than or equal to the size for it to be inbounds). |
1147 ↗ | (On Diff #30640) | But you also need to set inbounds to false is the offset is larger than the size (even if the original GEP was inbounds). You need to set the inbounds on the second GEP to false. |
lib/Transforms/Scalar/LICM.cpp | ||
---|---|---|
1147 ↗ | (On Diff #30640) | Thanks Hal. The second GEP should be safe to have whatever inbound attribute as before, since originally the first is p+o, the second is p+o+c, now the first is p+c, the second is p+c+o==p+o+c. |
lib/Transforms/Scalar/LICM.cpp | ||
---|---|---|
1147 ↗ | (On Diff #30834) | I don't follow this comment. My reading of langref is that both the GEP's base and result must be inbounds. So if the first GEP is not inbounds and is the base of the second GEP, then the second GEP cannot be inbounds. (This is why I generally don't like this transformation.) |
lib/Transforms/Scalar/LICM.cpp | ||
---|---|---|
1147 ↗ | (On Diff #30834) | Andy is correct. The LangRef is clear on this matter, saying: "If the inbounds keyword is present, the result value of the getelementptr is a poison value *if the base pointer* is not an in bounds address of an allocated object, or if any of the addresses..." So you can't have inbounds on the second GEP either. |
If you need to do this transformation is cases where the inbounds flag will be dropped it seems best to do it late in the pipeline (maybe even after LSR) as a target-controlled optimization. Arranging for the hoisting to happen seems like the easy part of this problem.
Out of curiosity, I wonder if the LSR fix here http://reviews.llvm.org/D11212 has any impact on this test case.
I agree. Also, as it turns out, both NVPTX and PowerPC run LICM late in the pipeline, and AArch64 can as well when -aarch64-gep-opt is enabled (PowerPC also runs it under the control of -ppc-gep-opt, but unlike AArch64, it is currently on by default).
Out of curiosity, I wonder if the LSR fix here http://reviews.llvm.org/D11212 has any impact on this test case.
Oh yeah. I forgot about my first comment on this patch. I'll repeat it:
I think this should be part of SeparateConstOffsetFromGEP, which is also optimizing for register pressure and addressing modes. As part of this pass it is target configurable and controlled by EnableGEPOpt.
Hi, Andrew:
About your comments, I have asked to reach an agreement, and posted my earlier version of this patch (in SeparateConstOffsetFromGEP) to http://reviews.llvm.org/D11443 to see if you guys like it or not. Since you and Hal both agree it should be done somewhere else, I assume SeparateConstOffsetFromGEP is a good place to go, I will move the code back to SeparateConstOffsetFromGEP then, hopefully I can do better then http://reviews.llvm.org/D11443.
Regards
Lawrence Hu
Forgot to mention.
This transformation is useful, it helps one of our benchmark by 7%, and that is the main reason why llvm is behind gcc for this benchmarks.
I don't have any serious objections.
- NumOfCommonBaseGEPs==1 seems very artificial. Can you generalize this to NumOfCommonBaseGEPs > 0?
- Quentin may want to do an arm64 benchmark run before we ok this. I'll leave that up to him.
I am making a quick assembly diff of the test suite to have a sense of the impact.
Q.
Hi Lawrence,
There are no diffs on the test-suite, so you’re good on that side :).
I don’t have any problems with the patch and let you address Andy’s comment.
Thanks,
-Quentin
lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp | ||
---|---|---|
1092 | hasOneUse |
Thanks Andrew and Quentin.
Quentin, the whole Splitting GEP pass is turned off, could you please try -mllvm -aarch64-gep-opt=true or revert 89cc8dd3b84a636b2798028a994f4b71c78e0163, internally, I have to do that to see difference, of course, it depends on testcase.
I will address Andrew's comment in next few days, stuck on bugs now.
Lawrence,
Just a friendly reminder that the community prefers a svn revision number over a git hash any day.. :)
Chad
I didn't realize the pass was off. We don't need to benchmark the change in that case, although whoever is using the pass might want to ?!
Thanks, Chad:
commit 89cc8dd3b84a636b2798028a994f4b71c78e0163
Author: James Molloy <james.molloy@arm.com>
Date: Wed Apr 22 09:11:38 2015 +0000
[AArch64] Disable complex GEP optimization by default. Enough concerns were raised that this optimization is pessimising some code patterns. The obvious fix, to add a Reassociate run afterwards, causes even more pessimisation in some cases due to fewer complex addressing modes being matched. As there isn't a trivial fix for this, backing this out by default until someone gets a chance to fix the addressing mode matcher. git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@235491 91177308-0d34-0410-b5e6-96231b3b80d8
Lawrence, Chad, does this pass help you independent from the proposed patch? Is it something you will work toward enabling by default. If so, great. Otherwise, you probably want to decide which part will be controlled by flag and which part you want to enable by default. Obviously, changes to the default should be benchmarked.
Hi, Andy
I have seen up and down with Splitting GEP pass.
Yes, I am considering fixing a few more issues, then enable it by default, of course, we will do benchmark measurement at that time, can I ask your help at then to measure your benchmark?
Regards.
I don't regularly benchmark llvm test-suite any more, but I can help find someone who does.
I believe we saw a 3-4% regression on Spec200x/mcf with James disabled the pass. This was on our internal branch in the context of LTO, which departs greatly from the community LTO flow (we're hoping to fix that soon).
Is it something you will work toward enabling by default. If so, great.
@hulx2000: Let me know if you'd like to discuss this offline. Have you or Ana discussed your plans with James?
Thanks, Quentin.
Interesting though, James said he turned off that Pass because Apple seeing non-negligible slowdowns, he said ok to turn it back on with tweaks, but "you’ll have to provide evidence to convince Gerolf".
lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp | ||
---|---|---|
758 | You shouldn't need to pre-compute this, you can examine the use list if necessary, something like this: auto TooManyUsesInLoop = [](Value *V, Loop *L) { int UsesInLoop = 0; for (User *U : V->users()) { if (Instruction *User = dyn_cast<Instruction>(U)) if (L->contains(User)) if (++UsesInLoop > 1) return true; return false; } }; | |
803 | Then -> then | |
1060 | This seems unnecessary (see above). |
merged as git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@248420 91177308-0d34-0410-b5e6-96231b3b80d8
You shouldn't need to pre-compute this, you can examine the use list if necessary, something like this: