This is an archive of the discontinued LLVM Phabricator instance.

Fix CombineToPostIndexedLoadStore in DAGCombiner.cpp
Needs RevisionPublic

Authored by fdeferriere on Apr 20 2015, 5:28 AM.

Details

Reviewers
qcolombet
Summary

There is a missed optimization in the DAGCombiner.cpp LLVM file for the selection of Post Indexed Load and Store operations.
This patch fixes the code in the function CombineToPostIndexedLoadStore, that checks the uses of an ADD/SUB operation, which does not correctly check the real uses.

Diff Detail

Event Timeline

fdeferriere retitled this revision from to Fix CombineToPostIndexedLoadStore in DAGCombiner.cpp.
fdeferriere updated this object.
fdeferriere edited the test plan for this revision. (Show Details)
fdeferriere edited subscribers, added: Unknown Object (MLST); removed: aemerson.
qcolombet edited edge metadata.Apr 20 2015, 2:07 PM

Hi François,

Since you are fixing two problems, I would prefer having two different patches/commits.

The fix for #1 looks good to me.
Please add a test case for it and commit it separately.

The fix for #2 is almost good. You just need the change "TryNext to RealUse" and the removal of the ADD/SUB check. The outer loop must not be removed.
Please add your test case to the patch (not just in the comment) so that it runs with make check and upload the new patch.

Thanks,
-Quentin

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9055

I don’t get why this change is useful.

9094

This change looks wrong to me.
You basically assume that BasePtr == Ptr, which AFAICT, is not necessarily true.

To me, the proper fix would be to simply replace TryNext by RealUse and use the opposite logic, like you did.

fdeferriere updated this object.
fdeferriere edited the test plan for this revision. (Show Details)
fdeferriere edited edge metadata.

Thanks Quentin for your review.

I have uploaded a new version of the patch, where I removed the first part of the original patch, that will be delivered as a separate patch. I also took into account your remarks, except for one point :

I added the following code before the inner loop :
  if (Use != Op)
     continue;

With this code, the semantics remains identical to my previous commit, and I checked again this morning on our target that we need this check to avoid catching unprofitable cases. However I don't know for other targets, and on our target in most cases this does not make a big difference, so I can remove this check if you prefer.

fdeferriere added inline comments.Apr 21 2015, 5:24 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9055

Since N is a Load/Store operation, the check is redundant with the ADD/SUB check. This change was just to simplify the code, I will remove it.

Hi François,

Almost good to me. See my inlined comments.

Could you run clang-format on your patch?
Some indents are suspicious.

Thanks,
-Quentin

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9086

This check still does not make sense to me.
This is indeed equivalent to your previous commit, but then, it has the same problem.

The bottom line is, yes, please remove it :).

9087

I would remove the RealUse variable…

9087–9088

And directly set TryNext here...

9088–9091

Then, this if-block becomes useless.

Removed the RealUse variable and the extra check.

qcolombet accepted this revision.Apr 22 2015, 9:37 AM
qcolombet edited edge metadata.

Hi François,

LGTM with a pair of nitpicks.

Please commit with those fixes or let me know if you want I commit for you.

Nice catch BTW.

Thanks,
-Quentin

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9079

Period at the end of the comment.

test/CodeGen/ARM/automod-test.ll
34

Add a CHECK-LABEL line.

This revision is now accepted and ready to land.Apr 22 2015, 9:37 AM
fdeferriere edited edge metadata.

Fixed comments by Q.Colombet.

Please Quentin proceed with the commit, or tell me how to do it myself.
Thanks.

  • François.

Hi François,

The test part needs to be updated for top of tree trunk.
Indeed, thanks to your change we grab much more post/pre addressing mode for ARM and we must update the regexp of the related test case for them to pass.

Let me know if you need help.

Thanks,
-Quentin

PS: Attached the your patch for TOT, your test case needed to be updated as well.

Hi Quentin,

 When running the ARM self tests I found 9 failing tests.

 I could fix 5 of them :
    2012-10-04-AAPCS-byval-align8.ll
    2012-10-18-PR14099-ByvalFrameAddress.ll
    2013-01-21-PR14992.ll
    byval_load_align.ll
    ldrd.ll

There are 3 tests for which the code is worse, due to save/restore of registers modified by the automod. On our target, we added a target specific pass to fix these cases and fully benefit from the automod addressing mode. These tests are :
    truncstore-dag-combine.ll
    unaligned_load_store.ll
    wrong-t2stmia-size-opt.ll

There is one test for which I suspect that the new code is wrong. A muls is generated in a sequence where it is checked that no muls is generated. The test is :
     avoid-cpsr-rmw.ll

Also, what do you mean by "Attached the your patch for TOT, your test case needed to be updated as well" ? 

Please tell me how you want me to proceed.

Thanks,

 - François.

Hi François,

There are 3 tests for which the code is worse, due to save/restore of registers modified by the auto mod.

For those we can file a PR when the change lands. Just make sure to update those tests with a comment to explain why it is worse and that they still pass make check.

There is one test for which I suspect that the new code is wrong. A muls is generated in a sequence where it is checked that no muls is generated. The test is :

Please investigate this one. Let me know if you need help.

Also, what do you mean by "Attached the your patch for TOT, your test case needed to be updated as well" ?

I meant that I have attached to my previous email your patch where I updated the test case so that we use the new "load <dstty>, <ptrty>" syntax instead of "load <ty>”. That was supposed to help you, not confuse you :). Feel free to ignore it.

Cheers,
-Quentin

fdeferriere updated this revision to Diff 29157.Jul 7 2015, 1:43 AM

Fixed the ARM regressions tests for :
test/CodeGen/ARM/2012-10-18-PR14099-ByvalFrameAddress.ll
test/CodeGen/ARM/2013-01-21-PR14992.ll
test/CodeGen/ARM/automod-test.ll
test/CodeGen/ARM/avoid-cpsr-rmw.ll
test/CodeGen/ARM/byval_load_align.ll
test/CodeGen/ARM/ldrd.ll
test/CodeGen/ARM/wrong-t2stmia-size-opt.ll

Opened a PR 24049 to report additional copies in some cases after this fix, and marked the following ARM regression tests as XFAIL :
test/CodeGen/ARM/truncstore-dag-combine.ll
test/CodeGen/ARM/unaligned_load_store.ll

Hi François,

Thanks for fixing the tests.
I’ll have a quick look at the tests that you’ve XFAILed before deciding whether or not this is OK.

Cheers,
-Quentin

test/CodeGen/ARM/2012-10-18-PR14099-ByvalFrameAddress.ll
29 ↗(On Diff #29157)

Looks like the load store optimizer could use some improvement to catch this case (assuming r0 is not used afterward).

test/CodeGen/ARM/avoid-cpsr-rmw.ll
31 ↗(On Diff #29157)

Could you comment why this check is failing now?
I am not saying this is wrong, I’d like to check we are not missing something.

Hi François,

A couple more comments.

It seems your patch exposed a few short coming in the load store optimizer for ARM. I am not sure we can proceed with this commit without fixing the load store optimizer first, otherwise we may regress a bunch of thing… Which is unfortunate since your patch seems a good general improvement to me.

Did you happen to run benchmark on ARM with/without this change?
That would help making our mind.

Anyway, let me look at the XFAILed test cases.

Cheers,
-Quentin

test/CodeGen/ARM/automod-test.ll
28

Looks like we could improve the load/store optimizer to catch those cases as well.

test/CodeGen/ARM/avoid-cpsr-rmw.ll
31 ↗(On Diff #29157)

Never mind me, I found the answer in an exchange we had offline.

For the record, the assembly code generated without and with your patch are the following :
Without the patch


ldr.w r9, [r0]
ldr r3, [r0, #4]
ldr r2, [r0, #8]
ldr.w r12, [r0, #12]
adds r0, #16 <— after this point, having muls would be wrong.
mul r3, r3, r9
mul r2, r3, r2
mul r2, r2, r12

With the patch


ldr.w r12, [r0, #4]
ldr r3, [r0, #8]
ldr.w r9, [r0, #12]
ldr r2, [r0], #16
mul r2, r12, r2
muls r2, r3, r2 <— Now the muls is necessary, since we got rid of the adds.
mul r2, r2, r9

qcolombet requested changes to this revision.Jul 7 2015, 4:21 PM
qcolombet edited edge metadata.

Hi François,

Looks like the XFAILed test cases were just us being lucky in the previous version.
This is unfortunate, but like I said, I do not think we can land the patch as is, since it may regress some stuff.

Looking at the test cases makes me feel like we may not want to form the pre/post addressing mode that early, but instead rely on later passes to catch those. The rational is that I would prefer we grab the multiple load/store first since they impact the performance, whereas pre/post increment are mainly size optimization and thus less important.

The bottom line is that IMHO, we should investigate that it gives to completely get rid of that DAG combine and do the pre/post optimization later.

Let me know what do you think?

Cheers,
-Quentin

test/CodeGen/ARM/2012-10-04-AAPCS-byval-align8.ll
33 ↗(On Diff #29157)

This one is funny because we change:
ld [@addr]
ld [@addr, 4]

into:
ld [@addr], #4
ld [@addr]
{reg used for @addr is redefined}

And that prevents the load store optimizer to catch it.

test/CodeGen/ARM/2013-01-21-PR14992.ll
18 ↗(On Diff #29157)

This one is worrisome as I expect it may expose runtime regressions and because it seems plausible to happen frequently.

Without:
ldm r0!, {r4, r5, r6}
bl bar

With:
ldr r4, [r0, #4]
ldr r5, [r0, #8]
ldr r6, [r0], #12
bl bar

The problem here is that after register allocation, without recoloring and with the support of PRE and POST automod, we won’t be able to form a ldm since the registers are not in the right order (r6, r4, r5, instead of r4, r5, r6). The pre- regalloc load/store optimizer could be taught to reorder them to make that possible but that does not seem trivial work.

test/CodeGen/ARM/wrong-t2stmia-size-opt.ll
21 ↗(On Diff #29157)

We miss that we can express this as:
[@addr], #8
[@addr, -#4]

That said, the stored registers are not in the right order after reg alloc to be able to recognize that this is stm @addr!, {r1, r2}.

This revision now requires changes to proceed.Jul 7 2015, 4:21 PM
fdeferriere added a comment.EditedJul 9 2015, 12:33 AM

Hi Quentin,

Thanks for your detailed review.

I agree with you that this patch introduces too much regressions to be delivered. I see now two possibilities :

1- Still rely on this pass to select pre/post addressing modes, but add a kind of "repair" pass after code selection to fix the regressions. This is what we did on our target. However, I am not sure if on ARM this could fix the code to catch again the multiple load/store instructions,

2- Disable this pass and add a new pass on machine instructions. The main advantage I see is that it can allow to look for pre/post patterns on regions larger than a DAG, which currently is a limitation. For our target, it would probably replace our "repair" pass.

Should I add a comment "Abandon Revision" to close this patch ?
Should I log a kind of "request for enhancement" (in Bugzilla ?) to keep a trace of this analysis ?

Regards,

-François.

Hi François,

Either way sounds good to me, depends what is most comfortable for you I guess.

See my comments inlined for few hints/remarks.