Summary

DAGCombiner.cpp in LLVM misses an optimization when selecting post-indexed load and store operations.
This patch fixes the code in the function CombineToPostIndexedLoadStore that checks the uses of an ADD/SUB operation; that code does not correctly check for real uses.
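
For readers less familiar with the addressing mode: a post-indexed access uses the base register as the address and then writes base plus offset back into that register. The following is only a minimal C++ model of that behaviour; `postIndexedLoad` is a hypothetical helper that is not part of LLVM, and the offset is counted in elements rather than bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical model of a post-indexed load such as `ldr r3, [r0], #48`:
// the access uses the old base, then the base is advanced (write-back).
inline std::int32_t postIndexedLoad(const std::int32_t *&Base,
                                    std::ptrdiff_t Elems) {
  std::int32_t Value = *Base; // load from the address currently in the base
  Base += Elems;              // write-back: base = base + offset
  return Value;
}
```

The combine discussed here tries to fold a separate ADD/SUB of the pointer into the load or store in exactly this way, so that the pointer update no longer needs its own instruction.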

Event Timeline

fdeferriere updated this revision to Diff 24014. Apr 20 2015, 5:28 AM

fdeferriere retitled this revision to Fix CombineToPostIndexedLoadStore in DAGCombiner.cpp.

fdeferriere updated this object.

fdeferriere edited the test plan for this revision.

Herald added a subscriber: aemerson. Apr 20 2015, 5:28 AM

fdeferriere added a reviewer: qcolombet. Apr 20 2015, 5:31 AM

fdeferriere edited subscribers, added: Unknown Object (MLST); removed: aemerson.

Hi François,

Since you are fixing two problems, I would prefer having two different patches/commits.

The fix for #1 looks good to me.
Please add a test case for it and commit it separately.

The fix for #2 is almost good. You just need the change "TryNext to RealUse" and the removal of the ADD/SUB check. The outer loop must not be removed.
Please add your test case to the patch (not just in the comment) so that it runs with make check and upload the new patch.

Thanks,
-Quentin

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9055	I don’t get why this change is useful.
9104	This change looks wrong to me. You basically assume that BasePtr == Ptr, which AFAICT, is not necessarily true. To me, the proper fix would be to simply replace TryNext by RealUse and use the opposite logic, like you did.

Thanks Quentin for your review.

I have uploaded a new version of the patch, in which I removed the first part of the original patch; that part will be delivered as a separate patch. I also took your remarks into account, except for one point:

I added the following code before the inner loop :
  if (Use != Op)
     continue;

With this code, the semantics remain identical to my previous commit, and I checked again this morning on our target that we need this check to avoid catching unprofitable cases. However, I don't know about other targets, and on our target this does not make a big difference in most cases, so I can remove this check if you prefer.

fdeferriere added inline comments. Apr 21 2015, 5:24 AM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9055	Since N is a Load/Store operation, the check is redundant with the ADD/SUB check. This change was just to simplify the code, I will remove it.

Hi François,

Almost good to me. See my inlined comments.

Could you run clang-format on your patch?
Some indents are suspicious.

Thanks,
-Quentin

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9088	This check still does not make sense to me. This is indeed equivalent to your previous commit, but then, it has the same problem. The bottom line is, yes, please remove it :).
9092	I would remove the RealUse variable…
9094	And directly set TryNext here...
9101	Then, this if-block becomes useless.
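
The simplification suggested in the inline comments above can be sketched on a toy use graph. This is only an illustration under stated assumptions: `ToyNode`, `tryNextOld` and `tryNextNew` are hypothetical names, and the `FoldsAsAddress` flag stands in for the result of `canFoldInAddressingMode(Use, UseUse, DAG, TLI)`. The "old" function follows the left-hand column of the diff shown on this page; the "new" one drops `RealUse` and clears `TryNext` directly.

```cpp
#include <vector>

// Hypothetical, simplified stand-in for an SDNode.
struct ToyNode {
  bool IsAddOrSub = false;           // models ISD::ADD / ISD::SUB
  bool FoldsAsAddress = false;       // models canFoldInAddressingMode(...)
  std::vector<const ToyNode *> Uses; // users of this node
};

// Before the patch: give up as soon as one ADD/SUB use of BasePtr has only
// users that fold it into an addressing mode.
bool tryNextOld(const ToyNode &BasePtr, const ToyNode *Ptr) {
  for (const ToyNode *Use : BasePtr.Uses) {
    if (Use == Ptr || !Use->IsAddOrSub)
      continue;
    bool RealUse = false;
    for (const ToyNode *UseUse : Use->Uses)
      if (!UseUse->FoldsAsAddress)
        RealUse = true;
    if (!RealUse)
      return true;
  }
  return false;
}

// As suggested in review: start from "skip" and clear the flag as soon as
// one real (non-foldable) use is found.
bool tryNextNew(const ToyNode &BasePtr, const ToyNode *Ptr) {
  bool TryNext = true;
  for (const ToyNode *Use : BasePtr.Uses) {
    if (Use == Ptr || !Use->IsAddOrSub)
      continue;
    for (const ToyNode *UseUse : Use->Uses)
      if (!UseUse->FoldsAsAddress) {
        TryNext = false;
        break;
      }
  }
  return TryNext;
}
```

With one ADD whose users all fold and a second ADD that has a real use, the old predicate asks to skip the candidate while the new one lets the transformation proceed, which appears to be the missed case the summary describes.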

Removed the RealUse variable and the extra check.

Hi François,

LGTM with a pair of nitpicks.

Please commit with those fixes, or let me know if you want me to commit for you.

Nice catch BTW.

Thanks,
-Quentin

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9079	Period at the end of the comment.
test/CodeGen/ARM/automod-test.ll
34	Add a CHECK-LABEL line.

This revision is now accepted and ready to land. Apr 22 2015, 9:37 AM

Fixed comments by Q.Colombet.

Quentin, please proceed with the commit, or tell me how to do it myself.
Thanks.

François.

Hi François,

The test part needs to be updated for top-of-tree trunk.
Indeed, thanks to your change we grab many more post/pre addressing modes for ARM, and we must update the regexps of the related test cases for them to pass.

Let me know if you need help.

Thanks,
-Quentin

PS: Attached the your patch for TOT, your test case needed to be updated as well.

francois1.patch (3 KB)

Hi Quentin,

 When running the ARM self tests I found 9 failing tests.

 I could fix 5 of them :
    2012-10-04-AAPCS-byval-align8.ll
    2012-10-18-PR14099-ByvalFrameAddress.ll
    2013-01-21-PR14992.ll
    byval_load_align.ll
    ldrd.ll

There are 3 tests for which the code is worse, due to save/restore of registers modified by the automod. On our target, we added a target specific pass to fix these cases and fully benefit from the automod addressing mode. These tests are :
    truncstore-dag-combine.ll
    unaligned_load_store.ll
    wrong-t2stmia-size-opt.ll

There is one test for which I suspect that the new code is wrong. A muls is generated in a sequence where it is checked that no muls is generated. The test is :
     avoid-cpsr-rmw.ll

Also, what do you mean by "Attached the your patch for TOT, your test case needed to be updated as well" ? 

Please tell me how you want me to proceed.

Thanks,

 - François.

Hi François,

> There are 3 tests for which the code is worse, due to save/restore of registers modified by the automod.

For those we can file a PR when the change lands. Just make sure to update those tests with a comment to explain why it is worse and that they still pass make check.

> There is one test for which I suspect that the new code is wrong. A muls is generated in a sequence where it is checked that no muls is generated. The test is :

Please investigate this one. Let me know if you need help.

> Also, what do you mean by "Attached the your patch for TOT, your test case needed to be updated as well" ?

I meant that I have attached to my previous email your patch where I updated the test case so that we use the new "load <dstty>, <ptrty>" syntax instead of "load <ty>". That was supposed to help you, not confuse you :). Feel free to ignore it.

Cheers,
-Quentin

Fixed the ARM regression tests for:
test/CodeGen/ARM/2012-10-18-PR14099-ByvalFrameAddress.ll
test/CodeGen/ARM/2013-01-21-PR14992.ll
test/CodeGen/ARM/automod-test.ll
test/CodeGen/ARM/avoid-cpsr-rmw.ll
test/CodeGen/ARM/byval_load_align.ll
test/CodeGen/ARM/ldrd.ll
test/CodeGen/ARM/wrong-t2stmia-size-opt.ll

Opened PR 24049 to report additional copies in some cases after this fix, and marked the following ARM regression tests as XFAIL:
test/CodeGen/ARM/truncstore-dag-combine.ll
test/CodeGen/ARM/unaligned_load_store.ll

Hi François,

Thanks for fixing the tests.
I’ll have a quick look at the tests that you’ve XFAILed before deciding whether or not this is OK.

Cheers,
-Quentin

test/CodeGen/ARM/2012-10-18-PR14099-ByvalFrameAddress.ll
29 ↗	(On Diff #29157)	Looks like the load store optimizer could use some improvement to catch this case (assuming r0 is not used afterward).
test/CodeGen/ARM/avoid-cpsr-rmw.ll
31 ↗	(On Diff #29157)	Could you comment why this check is failing now? I am not saying this is wrong, I’d like to check we are not missing something.

Hi François,

A couple more comments.

It seems your patch exposed a few shortcomings in the load store optimizer for ARM. I am not sure we can proceed with this commit without fixing the load store optimizer first, otherwise we may regress a bunch of things. This is unfortunate, since your patch seems a good general improvement to me.

Did you happen to run benchmarks on ARM with and without this change?
That would help us make up our minds.

Anyway, let me look at the XFAILed test cases.

Cheers,
-Quentin

test/CodeGen/ARM/automod-test.ll
28	Looks like we could improve the load/store optimizer to catch those cases as well.
test/CodeGen/ARM/avoid-cpsr-rmw.ll
31 ↗	(On Diff #29157)	Never mind me, I found the answer in an exchange we had offline. For the record, the assembly code generated without and with your patch is the following:
  Without the patch:
    ldr.w r9, [r0]
    ldr r3, [r0, #4]
    ldr r2, [r0, #8]
    ldr.w r12, [r0, #12]
    adds r0, #16    <-- after this point, having muls would be wrong.
    mul r3, r3, r9
    mul r2, r3, r2
    mul r2, r2, r12
  With the patch:
    ldr.w r12, [r0, #4]
    ldr r3, [r0, #8]
    ldr.w r9, [r0, #12]
    ldr r2, [r0], #16
    mul r2, r12, r2
    muls r2, r3, r2    <-- Now the muls is necessary, since we got rid of the adds.
    mul r2, r2, r9

Hi François,

Looks like the XFAILed test cases were just us being lucky in the previous version.
This is unfortunate, but like I said, I do not think we can land the patch as is, since it may regress some stuff.

Looking at the test cases makes me feel like we may not want to form the pre/post addressing mode that early, but instead rely on later passes to catch those. The rationale is that I would prefer we grab the multiple load/store first since they impact performance, whereas pre/post increments are mainly a size optimization and thus less important.

The bottom line is that, IMHO, we should investigate what we would gain by completely getting rid of that DAG combine and doing the pre/post optimization later.

Let me know what you think.

Cheers,
-Quentin

test/CodeGen/ARM/2012-10-04-AAPCS-byval-align8.ll
33 ↗	(On Diff #29157)	This one is funny because we change:
    ld [@addr]
    ld [@addr, 4]
  into:
    ld [@addr], #4
    ld [@addr]    {reg used for @addr is redefined}
  And that prevents the load store optimizer from catching it.
test/CodeGen/ARM/2013-01-21-PR14992.ll
18 ↗	(On Diff #29157)	This one is worrisome as I expect it may expose runtime regressions and because it seems plausible to happen frequently.
  Without:
    ldm r0!, {r4, r5, r6}
    bl bar
  With:
    ldr r4, [r0, #4]
    ldr r5, [r0, #8]
    ldr r6, [r0], #12
    bl bar
  The problem here is that after register allocation, without recoloring and with the support of PRE and POST automod, we won’t be able to form a ldm since the registers are not in the right order (r6, r4, r5, instead of r4, r5, r6). The pre-regalloc load/store optimizer could be taught to reorder them to make that possible, but that does not seem to be trivial work.
test/CodeGen/ARM/wrong-t2stmia-size-opt.ll
21 ↗	(On Diff #29157)	We miss that we can express this as:
    [@addr], #8
    [@addr, -#4]
  That said, the stored registers are not in the right order after reg alloc to be able to recognize that this is stm @addr!, {r1, r2}.
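
The register-order constraint mentioned in the comments on 2013-01-21-PR14992.ll and wrong-t2stmia-size-opt.ll can be sketched as a small check. This is a simplified model, not the ARM load/store optimizer: `WordLoad` and `canMergeIntoLdm` are hypothetical names, and the only rule modelled is that a load-multiple reads consecutive words into registers whose numbers increase with the address.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// One word-sized load from a shared base: {byte offset, destination register}.
using WordLoad = std::pair<int, int>;

// Hypothetical helper: true when the loads cover consecutive words and the
// destination register numbers increase together with the address.
bool canMergeIntoLdm(std::vector<WordLoad> Loads) {
  if (Loads.size() < 2)
    return false;
  std::sort(Loads.begin(), Loads.end()); // order by offset
  for (std::size_t I = 1; I < Loads.size(); ++I) {
    bool NextWord = Loads[I].first == Loads[I - 1].first + 4;
    bool RegAscends = Loads[I].second > Loads[I - 1].second;
    if (!NextWord || !RegAscends)
      return false;
  }
  return true;
}
```

In this model the sequence without the patch (r4, r5, r6 at offsets 0, 4, 8) passes the check, while the sequence with the patch (r6 at offset 0, r4 at 4, r5 at 8) does not, which matches the ordering problem described in the comment.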

This revision now requires changes to proceed. Jul 7 2015, 4:21 PM

Hi Quentin,

Thanks for your detailed review.

I agree with you that this patch introduces too many regressions to be delivered. I now see two possibilities:

1- Still rely on this pass to select pre/post addressing modes, but add a kind of "repair" pass after code selection to fix the regressions. This is what we did on our target. However, I am not sure whether on ARM this could fix the code so that the multiple load/store instructions are caught again.

2- Disable this pass and add a new pass on machine instructions. The main advantage I see is that it would allow looking for pre/post patterns in regions larger than a DAG, which is currently a limitation. For our target, it would probably replace our "repair" pass.

Should I add a comment "Abandon Revision" to close this patch ?
Should I log a kind of "request for enhancement" (in Bugzilla ?) to keep a trace of this analysis ?

Regards,

-François.

Hi François,

Either way sounds good to me, depends what is most comfortable for you I guess.

See my inline comments for a few hints/remarks.

Diff 24118

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

if (LoadSDNode *LD = dyn_cast<LoadSDNode>(N)) {		if (LoadSDNode *LD = dyn_cast<LoadSDNode>(N)) {
return false;		return false;
}		}

if (Ptr.getNode()->hasOneUse())		if (Ptr.getNode()->hasOneUse())
return false;		return false;

for (SDNode *Op : Ptr.getNode()->uses()) {		for (SDNode *Op : Ptr.getNode()->uses()) {
if (Op == N ||		if (Op == N ||
(Op->getOpcode() != ISD::ADD && Op->getOpcode() != ISD::SUB))		(Op->getOpcode() != ISD::ADD && Op->getOpcode() != ISD::SUB))
		qcolombet: I don’t get why this change is useful.
		fdeferriere: Since N is a Load/Store operation, the check is redundant with the ADD/SUB check. This change was just to simplify the code, I will remove it.
continue;		continue;

SDValue BasePtr;		SDValue BasePtr;
SDValue Offset;		SDValue Offset;
ISD::MemIndexedMode AM = ISD::UNINDEXED;		ISD::MemIndexedMode AM = ISD::UNINDEXED;
if (TLI.getPostIndexedAddressParts(N, Op, BasePtr, Offset, AM, DAG)) {		if (TLI.getPostIndexedAddressParts(N, Op, BasePtr, Offset, AM, DAG)) {
// Don't create a indexed load / store with zero offset.		// Don't create a indexed load / store with zero offset.
if (isa<ConstantSDNode>(Offset) &&		if (isa<ConstantSDNode>(Offset) &&
cast<ConstantSDNode>(Offset)->isNullValue())		cast<ConstantSDNode>(Offset)->isNullValue())
continue;		continue;

// Try turning it into a post-indexed load / store except when		// Try turning it into a post-indexed load / store except when
// 1) All uses are load / store ops that use it as base ptr (and		// 1) All uses are load / store ops that use it as base ptr (and
// it may be folded as addressing mmode).		// it may be folded as addressing mmode).
// 2) Op must be independent of N, i.e. Op is neither a predecessor		// 2) Op must be independent of N, i.e. Op is neither a predecessor
// nor a successor of N. Otherwise, if Op is folded that would		// nor a successor of N. Otherwise, if Op is folded that would
// create a cycle.		// create a cycle.

if (isa<FrameIndexSDNode>(BasePtr) || isa<RegisterSDNode>(BasePtr))		if (isa<FrameIndexSDNode>(BasePtr) || isa<RegisterSDNode>(BasePtr))
continue;		continue;

// Check for #1.		// Check for #1.
bool TryNext = false;		// Look for a RealUse, i.e. one use that is not a load / store op, or one
		// that cannot be folded as addressing mode
		qcolombet: Period at the end of the comment.
		// Check for #1.
		bool TryNext = true;
for (SDNode *Use : BasePtr.getNode()->uses()) {		for (SDNode *Use : BasePtr.getNode()->uses()) {
if (Use == Ptr.getNode())		if (Use == Ptr.getNode())
continue;		continue;

// If all the uses are load / store addresses, then don't do the		if (Use != Op)
// transformation.		continue;

		qcolombet: This check still does not make sense to me. This is indeed equivalent to your previous commit, but then, it has the same problem. The bottom line is, yes, please remove it :).
		// If one use is not a load / store address, then do the transformation.
if (Use->getOpcode() == ISD::ADD || Use->getOpcode() == ISD::SUB){		if (Use->getOpcode() == ISD::ADD || Use->getOpcode() == ISD::SUB){
bool RealUse = false;		bool RealUse = false;
for (SDNode *UseUse : Use->uses()) {		for (SDNode *UseUse : Use->uses()) {
		qcolombet: I would remove the RealUse variable…
if (!canFoldInAddressingMode(Use, UseUse, DAG, TLI))		if (!canFoldInAddressingMode(Use, UseUse, DAG, TLI)) {
RealUse = true;		RealUse = true;
		qcolombet: And directly set TryNext here...
		break;
		}
}		}

if (!RealUse) {		if (RealUse) {
TryNext = true;		TryNext = false;
break;		break;
		qcolombet: Then, this if-block becomes useless.
}		}
}		}
}		}
		qcolombet: This change looks wrong to me. You basically assume that BasePtr == Ptr, which AFAICT, is not necessarily true. To me, the proper fix would be to simply replace TryNext by RealUse and use the opposite logic, like you did.

if (TryNext)		if (TryNext)
continue;		continue;

// Check for #2		// Check for #2
if (!Op->isPredecessorOf(N) && !N->isPredecessorOf(Op)) {		if (!Op->isPredecessorOf(N) && !N->isPredecessorOf(Op)) {
SDValue Result = isLoad		SDValue Result = isLoad
? DAG.getIndexedLoad(SDValue(N,0), SDLoc(N),		? DAG.getIndexedLoad(SDValue(N,0), SDLoc(N),

test/CodeGen/ARM/automod-test.ll

				; Test that checks that automod addressing mode is selected
				;
				; RUN: llc -O2 < %s -march=arm | FileCheck %s
				;
				; ======================================================
				; Without the fix, the generated code is the following :
				;
				; ldrh r3, [r2, #2]
				; strh r3, [r1, #-2]
				; ldrh r3, [r2]
				; sub r2, r2, #6
				; strh r3, [r1]
				; ldr r3, [r0], #48
				; add r1, r1, #6
				; cmp r3, #0
				; bne .LBB0_1
				;
				; With the patch, post modifying addressing modes are selected :
				;
				; ldrh r3, [r2, #2]
				; strh r3, [r1, #-2]
				; ldrh r3, [r2], #-6
				; strh r3, [r1], #6
				; ldr r3, [r0], #48
				; cmp r3, #0
				; bne .LBB0_1
				; ======================================================

				qcolombet: Looks like we could improve the load/store optimizer to catch those cases as well.
				@input_tab64 = common global [32 x i16] zeroinitializer, align 2
				@output_tab64 = common global [32 x i16] zeroinitializer, align 2

				; Function Attrs: nounwind
				define void @compute(i32* nocapture readonly %IDX) #0 {
				entry:
				qcolombet: Add a CHECK-LABEL line.

				%0 = load i32* %IDX, align 4
				%tobool14 = icmp eq i32 %0, 0
				br i1 %tobool14, label %for.end, label %for.body

				for.body: ; preds = %entry, %for.body
				%i.015 = phi i32 [ %add8, %for.body ], [ 0, %entry ]
				%sub = sub nsw i32 32, %i.015
				%sub1 = add nsw i32 %sub, -1
				%arrayidx2 = getelementptr inbounds [32 x i16]* @input_tab64, i32 0, i32 %sub1
				%1 = load i16* %arrayidx2, align 2
				%arrayidx3 = getelementptr inbounds [32 x i16]* @output_tab64, i32 0, i32 %i.015
				store i16 %1, i16* %arrayidx3, align 2
				%sub5 = add nsw i32 %sub, -2
				%arrayidx6 = getelementptr inbounds [32 x i16]* @input_tab64, i32 0, i32 %sub5
				%2 = load i16* %arrayidx6, align 2
				%add = add nsw i32 %i.015, 1
				%arrayidx7 = getelementptr inbounds [32 x i16]* @output_tab64, i32 0, i32 %add
				store i16 %2, i16* %arrayidx7, align 2
				%add8 = add nsw i32 %i.015, 3
				%shl = shl i32 %add8, 2
				%arrayidx = getelementptr inbounds i32* %IDX, i32 %shl
				%3 = load i32* %arrayidx, align 4
				%tobool = icmp eq i32 %3, 0
				br i1 %tobool, label %for.end, label %for.body

				; CHECK: ldrh r{{[0-9]+}}, [r{{[0-9]+}}], #-6
				; CHECK: strh r{{[0-9]+}}, [r{{[0-9]+}}], #6

				for.end: ; preds = %for.body, %entry
				ret void
				}

This is an archive of the discontinued LLVM Phabricator instance.

Fix CombineToPostIndexedLoadStore in DAGCombiner.cpp
Needs Revision, Public
