This is an archive of the discontinued LLVM Phabricator instance.

[SeparateConstOffsetFromGEP] Fix bugs and improve the current SeparateConstOffsetFromGEP
AbandonedPublic

Authored by HaoLiu on Oct 19 2014, 10:50 PM.

Details

Reviewers
t.p.northover
Summary

Hi Jingyue and other reviewers,

I investigated the SeparateConstOffsetFromGEP pass Jingyue added on May this year and find it is also beneficial for the address calculation in AArch64 backend. So we want to enable it in AArch64 backend. But before that, I want to fix some problems.

I find there are some run time failures in llvm test-suite benchmark sqlite3. They are caused by following two reasons:
(1) "int64 modulo uint64" generates an incorrect result. See test case "@sign_mod_unsign_bug" in my patch. This is fixed by a cast from uint64_t to int64_t
(2) The OR instruction may generate incorrect result. See the test case "@test_or_bug". This is fixed with the improvement described later.

Also, I find an opportunity to improve this pass. As your comments say, currently it can only find one constant in one index. For the index having several constants like " (a + 4) + (b + 1)", enabling instcombine may not always work as following reasons:
(1) We find some passes may generate such cases after instcombine. E.g. We find the CFGSimplifyPass can even generate index from two constants like "a = 1 + 2", which the current SeparateConstOffsetFromGEP pass can only find constant 1. Also, there are many cases like "(a + 4) + 1" or even more complex.
(2) The ADD-OR situation like the test case "@test_or_bug" will not be optimized by instcombine. We can only find constant 1 in "(a<<2 + 4) | 1" and ignore constant 4.

This patch improves this pass to handle such situations. I don't modify too much code. As the current logic can work well to find one constant in one index, I just add some code to enable it to find more constants in one index. Previously, it finds a constant in either operand of a binary instruction, now it can find the constants in both operands.

I tested this patch on llvm test-suite, now all the benchmarks can pass. Also spec cpu2000 and spec cpu2006 are tested. I don't know what benchmarks you tested on NVPTX backend, But I think this improvement can benefit NVPTX as well. At least it has no regression.

Review please.

Thanks,
-Hao

Diff Detail

Event Timeline

HaoLiu retitled this revision from to [SeparateConstOffsetFromGEP] Fix bugs and improve the current SeparateConstOffsetFromGEP.
HaoLiu updated this object.
HaoLiu edited the test plan for this revision. (Show Details)
HaoLiu added reviewers: jingyue, t.p.northover.
HaoLiu added a subscriber: Unknown Object (MLST).
HaoLiu updated this revision to Diff 15169.Oct 21 2014, 1:02 AM

Add another patch fixing some typo issues.

jingyue edited edge metadata.Oct 24 2014, 11:00 PM

Thanks for working on this!

First round.

I'll fix the two correctness bugs separately. After that, please rebase your diff against the new version.

I am very curious what CFGSimplificationPass does to generate a = 1 + 2. Does a = 1 + 2 appear at the end of -O3? I don't think leaving that clear misoptimization at the end of -O3 is reasonable.

Jingyue

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
120

s/outputs/returns/

129

Not the same as Extract any more.

475

Why do you prefer "unsigned &ChainIndex" to "unsigned ChainIndex"? The latter seems to use more space because unsigned& takes the size of a pointer while unsigned only takes 32 bits. Also, the original logic seems clearer.

712

// Splits this GEP index into a variadic part and a constant offset, and uses the variadic part as the new index.

739

Not quite true. The sum of offsets being 0 doesn't mean every offset is zero. As long as there's a non-zero offset, it's worth creating a new GEP.

774

Nice find! I like this bug :)

test/Transforms/SeparateConstOffsetFromGEP/AArch64/split-gep.ll
5

Remove the extra dash?

37

((a << 2) + 12)

Two correctness bugs are fixed in r220615 and r220618. Thanks for reporting them and providing tests and initial patches!

HaoLiu edited edge metadata.

Hi Jingyue,

First round.

I'll fix the two correctness bugs separately. After that, please rebase your diff against the new version.

Thanks very much for the code review and the commits for bugs.

I am very curious what CFGSimplificationPass does to generate a = 1 + 2. Does a = 1 + 2 appear at the end of -O3? I don't think leaving that clear misoptimization at the end of -O3 is reasonable.


I've attached a simple test case "simple.c" can reproduce it as following:

clang -O3 -S -emit-llvm simple.c
llc -march=aarch64 < simple.ll -print-after-all (or llc -march=nvptx < simple.ll -print-after-all)

For AArch64 backend We can find such IRs after CFGSimplification pass:

%inc.1.1 = add nsw i32 2, 1
...
%inc.2.1 = add nsw i32 %inc.1.1, 1
...
%inc.1.2 = add nsw i32 %inc.1.1, 2

Similar IRs can be found in NVPTX backend even though it doesn't call CFGSimplification pass.

Besides that, the test case for OR also requires us to find constants in both operands. InstCombine will never optimize test case like "((a<<2) + 12) | 1".

Why do you prefer "unsigned &ChainIndex" to "unsigned ChainIndex"? The latter seems to use more space because unsigned& takes the size of a pointer while unsigned only takes 32 bits. Also, the original logic seems clearer.

This is because I need to know the index of the next node. Previously, it find one constant throw a single path, so when it find one constant, the index is 0. But to find constants in both operands of a binary instruction, it needs to traverse throw a tree (As my example in the comments), which is a little different from previous single path. When we find a constant in an operand of a binary instruction, the current index may be not 0, and current single path is end. Then we can go back to the other operand and traverse another single path to find another constant. See the code

if (BO->getOperand(1) == UserChain[ChainIndex - 1])
if (ChainIndex != 0 && BO->getOperand(0) == UserChain[ChainIndex - 1])

Both operands are checked and visited. To make sure all nodes have been visited, an assertion for "ChainIndex == 0" is added.

A new patch has been added after modification according to the comments and rebase. Review please.

Thanks,
-Hao

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
129

I just think we don't need to return an int64_t, which is only used to check whether the extraction is successful. As we always need to return a NewIdx, we can check whether the NewIdx is nullptr.

739

Sorry I still not quite understand. This is after setting new operands stage, which means the old GEP has already been replaced by new indices. The following code is only to create the second GEP with one index whose value is 0. So I still think a GEP with index 0 is unnecessary.

I think this 1+2 statement should be optimized away regardless of SeparateConstOffset, because leaving such misoptimization can hurt other passes down the road. This can be done by either running instsimplify after simplifycfg or fixing simplifycfg so that it simplifies the new instructions it creates. It's worth a separate bug report. Btw, I tried simple.c on the NVPTX backend, and found 1+2 is produced by CodeGenPrepare instead of simplifycfg. We probably need to fix the NVPTX backend similarly.

Handling "((a << 2) + 12) | 1" should probably be a feature request to instcombine, because doing so benefits other passes and the code quality in general.

These are all very nice finds, and I like them! I just think they should be fixed in a more general way.

Regarding the improvement on being able to extract multiple constants, I think your implementation is correct in general and I really appreciate that you figured it out. However, I am concerned about the complexity of the code. First, with the above issues in instcombine and simplifycfg fixed, are we still motivated in handling multiple constant offsets in this pass? Second, if we are still motivated, instead of finding and extracting all non-zero constant offsets in one DFS traversal, can we repeatedly call find+extract until we cannot improve any more? Your current implementation is surely faster, but I suspect (1) most GEPs have no constant offsets, and (2) for those who have, the latter approach would only run a very limited number (mostly two) of iterations. In general, I'd trade premature optimization for code simplicity.

Let me know what you think. Thanks!

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
129

Yes, I agree and I like that. I was just saying the comments need to be updated :)

739

Sounds good.

Nits: change the comment s/as/if, and add a blank line after GEP->setIsInBoiunds(false). Thanks!

Hi Jingyue,

I agree with you.
The test case about "1 + 2" are corner cases, it won't affect too much about the performance.
For the "((a << 2) + 12) | 1" issue, it is more common than above case. If instcombine can do optimization for such case, I think it can have some small improvement on performance.

I agree to not do such improvement. For other minor changes which you agree, I've added them to the patch in D5864, which is the key thing I want to do. I hope you can have a look and review it.

Thanks,
-Hao

Does that mean we should abandon this patch and move on to D5864?

In D5863#15, @jingyue wrote:

Does that mean we should abandon this patch and move on to D5864?

Yes. Thanks for your comments.

Could you click "abandon changes" so that it doesn't show up in reviewers' TODO list :)? Thanks!

If you are not going to fix the issues you discovered with instcombine and simplifycfg right away, please track them in the buganizer.

jingyue resigned from this revision.Nov 2 2014, 9:14 AM
jingyue removed a reviewer: jingyue.
HaoLiu abandoned this revision.Nov 3 2014, 5:49 PM

For the CFGSimplification case, a new ticket 21473 in bugzilla has been created.