There is a missed optimization in the DAGCombiner.cpp LLVM file for the selection of Post Indexed Load and Store operations.
This patch fixes the code in the function CombineToPostIndexedLoadStore, that checks the uses of an ADD/SUB operation, which does not correctly check the real uses.
Details
- Reviewers
- qcolombet 
Diff Detail
Event Timeline
Hi François,
Since you are fixing two problems, I would prefer having two different patches/commits.
The fix for #1 looks good to me.
Please add a test case for it and commit it separately.
The fix for #2 is almost good. You just need the change "TryNext to RealUse" and the removal of the ADD/SUB check. The outer loop must not be removed.
Please add your test case to the patch (not just in the comment) so that it runs with make check and upload the new patch.
Thanks,
-Quentin
| lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
|---|---|---|
| 8853 | I don’t get why this change is useful. | |
| 8878 | This change looks wrong to me. To me, the proper fix would be to simply replace TryNext by RealUse and use the opposite logic, like you did. | |
Thanks Quentin for your review.
I have uploaded a new version of the patch, where I removed the first part of the original patch, that will be delivered as a separate patch. I also took into account your remarks, except for one point :
I added the following code before the inner loop :
  if (Use != Op)
     continue;With this code, the semantics remains identical to my previous commit, and I checked again this morning on our target that we need this check to avoid catching unprofitable cases. However I don't know for other targets, and on our target in most cases this does not make a big difference, so I can remove this check if you prefer.
| lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
|---|---|---|
| 8853 | Since N is a Load/Store operation, the check is redundant with the ADD/SUB check. This change was just to simplify the code, I will remove it. | |
Hi François,
Almost good to me. See my inlined comments.
Could you run clang-format on your patch?
Some indents are suspicious.
Thanks,
-Quentin
| lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
|---|---|---|
| 9088 | This check still does not make sense to me. The bottom line is, yes, please remove it :). | |
| 9092 | I would remove the RealUse variable… | |
| 9094 | And directly set TryNext here... | |
| 9101 | Then, this if-block becomes useless. | |
Please Quentin proceed with the commit, or tell me how to do it myself.
Thanks.
- François.
Hi François,
The test part needs to be updated for top of tree trunk.
Indeed, thanks to your change we grab much more post/pre addressing mode for ARM and we must update the regexp of the related test case for them to pass.
Let me know if you need help.
Thanks,
-Quentin
PS: Attached the your patch for TOT, your test case needed to be updated as well.
Hi Quentin,
 When running the ARM self tests I found 9 failing tests.
 I could fix 5 of them :
    2012-10-04-AAPCS-byval-align8.ll
    2012-10-18-PR14099-ByvalFrameAddress.ll
    2013-01-21-PR14992.ll
    byval_load_align.ll
    ldrd.ll
There are 3 tests for which the code is worse, due to save/restore of registers modified by the automod. On our target, we added a target specific pass to fix these cases and fully benefit from the automod addressing mode. These tests are :
    truncstore-dag-combine.ll
    unaligned_load_store.ll
    wrong-t2stmia-size-opt.ll
There is one test for which I suspect that the new code is wrong. A muls is generated in a sequence where it is checked that no muls is generated. The test is :
     avoid-cpsr-rmw.ll
Also, what do you mean by "Attached the your patch for TOT, your test case needed to be updated as well" ? 
Please tell me how you want me to proceed.
Thanks,
 - François.Hi François,
There are 3 tests for which the code is worse, due to save/restore of registers modified by the auto mod.
For those we can file a PR when the change lands. Just make sure to update those tests with a comment to explain why it is worse and that they still pass make check.
There is one test for which I suspect that the new code is wrong. A muls is generated in a sequence where it is checked that no muls is generated. The test is :
Please investigate this one. Let me know if you need help.
Also, what do you mean by "Attached the your patch for TOT, your test case needed to be updated as well" ?
I meant that I have attached to my previous email your patch where I updated the test case so that we use the new "load <dstty>, <ptrty>" syntax instead of "load <ty>”. That was supposed to help you, not confuse you :). Feel free to ignore it.
Cheers,
-Quentin
Fixed the ARM regressions tests for :
test/CodeGen/ARM/2012-10-18-PR14099-ByvalFrameAddress.ll
test/CodeGen/ARM/2013-01-21-PR14992.ll
test/CodeGen/ARM/automod-test.ll
test/CodeGen/ARM/avoid-cpsr-rmw.ll
test/CodeGen/ARM/byval_load_align.ll
test/CodeGen/ARM/ldrd.ll
test/CodeGen/ARM/wrong-t2stmia-size-opt.ll
Opened a PR 24049 to report additional copies in some cases after this fix, and marked the following ARM regression tests as XFAIL :
test/CodeGen/ARM/truncstore-dag-combine.ll
test/CodeGen/ARM/unaligned_load_store.ll
Hi François,
Thanks for fixing the tests.
I’ll have a quick look at the tests that you’ve XFAILed before deciding whether or not this is OK.
Cheers,
-Quentin
| test/CodeGen/ARM/2012-10-18-PR14099-ByvalFrameAddress.ll | ||
|---|---|---|
| 29 | Looks like the load store optimizer could use some improvement to catch this case (assuming r0 is not used afterward). | |
| test/CodeGen/ARM/avoid-cpsr-rmw.ll | ||
| 31 | Could you comment why this check is failing now? | |
Hi François,
A couple more comments.
It seems your patch exposed a few short coming in the load store optimizer for ARM. I am not sure we can proceed with this commit without fixing the load store optimizer first, otherwise we may regress a bunch of thing… Which is unfortunate since your patch seems a good general improvement to me.
Did you happen to run benchmark on ARM with/without this change?
That would help making our mind.
Anyway, let me look at the XFAILed test cases.
Cheers,
-Quentin
| test/CodeGen/ARM/automod-test.ll | ||
|---|---|---|
| 27 | Looks like we could improve the load/store optimizer to catch those cases as well. | |
| test/CodeGen/ARM/avoid-cpsr-rmw.ll | ||
| 31 | Never mind me, I found the answer in an exchange we had offline. For the record, the assembly code generated without and with your patch are the following : ldr.w	r9, [r0] With the patch ldr.w	r12, [r0, #4] | |
Hi François,
Looks like the XFAILed test cases were just us being lucky in the previous version.
This is unfortunate, but like I said, I do not think we can land the patch as is, since it may regress some stuff.
Looking at the test cases makes me feel like we may not want to form the pre/post addressing mode that early, but instead rely on later passes to catch those. The rational is that I would prefer we grab the multiple load/store first since they impact the performance, whereas pre/post increment are mainly size optimization and thus less important.
The bottom line is that IMHO, we should investigate that it gives to completely get rid of that DAG combine and do the pre/post optimization later.
Let me know what do you think?
Cheers,
-Quentin
| test/CodeGen/ARM/2012-10-04-AAPCS-byval-align8.ll | ||
|---|---|---|
| 33 | This one is funny because we change: into: And that prevents the load store optimizer to catch it. | |
| test/CodeGen/ARM/2013-01-21-PR14992.ll | ||
| 18 | This one is worrisome as I expect it may expose runtime regressions and because it seems plausible to happen frequently. Without: With: The problem here is that after register allocation, without recoloring and with the support of PRE and POST automod, we won’t be able to form a ldm since the registers are not in the right order (r6, r4, r5, instead of r4, r5, r6). The pre- regalloc load/store optimizer could be taught to reorder them to make that possible but that does not seem trivial work. | |
| test/CodeGen/ARM/wrong-t2stmia-size-opt.ll | ||
| 21 | We miss that we can express this as: That said, the stored registers are not in the right order after reg alloc to be able to recognize that this is stm @addr!, {r1, r2}. | |
Hi Quentin,
Thanks for your detailed review.
I agree with you that this patch introduces too much regressions to be delivered. I see now two possibilities :
1- Still rely on this pass to select pre/post addressing modes, but add a kind of "repair" pass after code selection to fix the regressions. This is what we did on our target. However, I am not sure if on ARM this could fix the code to catch again the multiple load/store instructions,
2- Disable this pass and add a new pass on machine instructions. The main advantage I see is that it can allow to look for pre/post patterns on regions larger than a DAG, which currently is a limitation. For our target, it would probably replace our "repair" pass.
Should I add a comment "Abandon Revision" to close this patch ?
Should I log a kind of "request for enhancement" (in Bugzilla ?) to keep a trace of this analysis ?
Regards,
-François.
Hi François,
Either way sounds good to me, depends what is most comfortable for you I guess.
See my comments inlined for few hints/remarks.
This one is funny because we change:
ld [@addr]
ld [@addr, 4]
into:
ld [@addr], #4
ld [@addr]
{reg used for @addr is redefined}
And that prevents the load store optimizer to catch it.