This is an archive of the discontinued LLVM Phabricator instance.


Differential D26367

Fix DAGCombiner match
ClosedPublic

Authored by evstupac on Nov 7 2016, 12:50 PM.


Details

Reviewers

RKSimon
rengolin
aschwaighofer
jmolloy
hfinkel
mcrosier

Commits

rGc88697dc16fb: The patch fixes (base, index, offset) match.

Summary

The patch fixes the (base, index, offset) match in DAGCombiner so that the following stores/loads can be combined (a, b, c are char arrays):
for (i = 0; i < n; i += 2) {
  char c1 = c[*b];
  char c2 = c[*b + 1];
  a[i] = c1;
  a[i + 1] = c2;
  b++;
}
into (a and c are the same arrays, accessed as short):
for (i = 0; i < n; i++) {
  short c1 = c[*b];
  a[i] = c1;
  b++;
}
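As a rough illustration of why the merge is safe, here is a minimal C++ sketch comparing the byte-wise loop with a version that moves both bytes as one 16-bit unit. The function names are hypothetical and are not part of the patch. It uses memcpy for the 16-bit access to stay within standard C++, and it assumes n is even and that every *b is a valid non-negative index into c.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Byte-wise form: two 8-bit loads and two 8-bit stores per iteration.
void copyBytewise(char *a, const char *b, const char *c, std::size_t n) {
  for (std::size_t i = 0; i < n; i += 2) {
    char c1 = c[*b];
    char c2 = c[*b + 1];
    a[i] = c1;
    a[i + 1] = c2;
    b++;
  }
}

// Merged form: one 16-bit load and one 16-bit store per iteration.
// memcpy keeps the unaligned access well defined in portable C++.
void copyMerged(char *a, const char *b, const char *c, std::size_t n) {
  for (std::size_t i = 0; i < n; i += 2) {
    std::uint16_t w;
    std::memcpy(&w, c + *b, sizeof w);
    std::memcpy(a + i, &w, sizeof w);
    b++;
  }
}

// Runs both forms on the same (made-up) input and compares the results.
bool mergedMatchesBytewise() {
  const char c[] = {10, 11, 12, 13, 14, 15, 16, 17};
  const char b[] = {0, 3, 6, 2};
  char a1[8] = {};
  char a2[8] = {};
  copyBytewise(a1, b, c, 8);
  copyMerged(a2, b, c, 8);
  return std::memcmp(a1, a2, 8) == 0;
}
```

Both forms write the same bytes to a. The combine only applies when both accesses are recognised as sharing the same base and index and differing by one byte in offset.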

Without the patch, match returned the following (base, index, offset) triples:
*(a + i) -> (a, i, 0)
*(a + i + 1) -> (a + i, undef, 1)
so the loads/stores were not combined because their bases differed.
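The difference can be sketched with a toy model in C++. This is not LLVM code: Expr, Triple, matchOld and matchNew are hypothetical stand-ins for SDValue, BaseIndexOffset and the match function, and sign-extension handling is left out. The sketch assumes only what the summary states: the old match stopped at VAR + CONST, while the patched match carries the constant along as a partial offset and keeps matching VAR.

```cpp
#include <cstdint>
#include <string>

// Hypothetical toy expression node; it stands in for SDValue and is not LLVM API.
struct Expr {
  enum Kind { Var, Const, Add } kind;
  std::string name;          // used by Var
  std::int64_t value = 0;    // used by Const
  const Expr *lhs = nullptr; // used by Add
  const Expr *rhs = nullptr; // used by Add
};

struct Triple {
  const Expr *base;
  const Expr *index; // nullptr plays the role of "undef"
  std::int64_t offset;
};

// Pre-patch behaviour: stop as soon as VAR + CONST is seen.
Triple matchOld(const Expr *ptr) {
  if (ptr->kind != Expr::Add)
    return {ptr, nullptr, 0};
  if (ptr->rhs->kind == Expr::Const)
    return {ptr->lhs, nullptr, ptr->rhs->value};
  return {ptr->lhs, ptr->rhs, 0};
}

// Patched behaviour: carry the constant along and keep matching VAR.
Triple matchNew(const Expr *ptr, std::int64_t partialOffset = 0) {
  if (ptr->kind != Expr::Add)
    return {ptr, nullptr, partialOffset};
  if (ptr->rhs->kind == Expr::Const)
    return matchNew(ptr->lhs, ptr->rhs->value + partialOffset);
  return {ptr->lhs, ptr->rhs, partialOffset};
}

// Builds a + i and (a + i) + 1, then checks both matchers.
bool patchedMatchSharesBase() {
  Expr a{Expr::Var, "a"};
  Expr i{Expr::Var, "i"};
  Expr one{Expr::Const, "", 1};
  Expr ai{Expr::Add, "", 0, &a, &i};
  Expr ai1{Expr::Add, "", 0, &ai, &one};

  Triple oldLo = matchOld(&ai), oldHi = matchOld(&ai1);
  Triple newLo = matchNew(&ai), newHi = matchNew(&ai1);

  bool oldDiffers = oldLo.base != oldHi.base; // a versus a + i
  bool newAgrees = newLo.base == newHi.base && newLo.index == newHi.index;
  return oldDiffers && newAgrees && newHi.offset - newLo.offset == 1;
}
```

In this model, matchOld reports bases a and a + i for the two addresses, while matchNew reports (a, i, 0) and (a, i, 1). Equal base and index with offsets one byte apart is the condition the merge needs.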

Diff Detail

Repository: rL LLVM

Event Timeline

evstupac updated this revision to Diff 77081. Nov 7 2016, 12:50 PM

evstupac retitled this revision to Fix DAGCombiner match.

evstupac updated this object.

evstupac added reviewers: aschwaighofer, RKSimon, hfinkel.

evstupac set the repository for this revision to rL LLVM.

evstupac added a subscriber: llvm-commits.

evstupac added a subscriber: Farhana.

PING.

This LGTM but I think we should probably add tests for other targets as well.

This LGTM but I think we should probably add tests for other targets as well.

There are similar tests on ARM only. When I tried to extend them I've got the following:

ldrb r12, [lr, r12]!
ldrb lr, [lr, #1]

Instead of 1 ldrh.

I'm not sure what is more profitable for ARM and which -mtriple/-mcpu is better. So I'd better leave the test inserting for ARM guys.

In D26367#599175, @evstupac wrote:
This LGTM but I think we should probably add tests for other targets as well.

There are similar tests on ARM only. When I tried to extend them I've got the following:
ldrb r12, [lr, r12]!
ldrb lr, [lr, #1]
Instead of 1 ldrh.

I'm not sure what is more profitable for ARM and which -mtriple/-mcpu is better. So I'd better leave the test inserting for ARM guys.

Renato/James - is there any insight you can offer here please?

The patch looks not that complicated. One of my next patches depends on it. Let's commit it and Arm guys will add test if necessary.

In D26367#609267, @evstupac wrote:

The patch looks not that complicated. One of my next patches depends on it. Let's commit it and Arm guys will add test if necessary.

I'd agree with you but the lack of existing test coverage makes me very nervous.

Hi,

In D26367#599175, @evstupac wrote:
This LGTM but I think we should probably add tests for other targets as well.

There are similar tests on ARM only. When I tried to extend them I've got the following:
ldrb r12, [lr, r12]!
ldrb lr, [lr, #1]
Instead of 1 ldrh.

I'm not sure what is more profitable for ARM and which -mtriple/-mcpu is better. So I'd better leave the test inserting for ARM guys.

This is not a good result. It seems fairly obvious to me that two instructions are worse than one - an LDRH would be what I would expect generated.

James

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
Line 11261: Why has IsIndexSignExt been removed here?

Given that this affects all targets, have you done performance testing using the test-suite? What were the results?

Cheers,

James

In D26367#611654, @jmolloy wrote:

Given that this affects all targets, have you done performance testing using the test-suite? What were the results?

I'm not sure what is "the test-suite".
However, I've tested this on specs and some other benchmarks. There are no significant performance changes (only noise). Some structure loads were optimized, but not in hot loops.

In D26367#612055, @evstupac wrote:

I'm not sure what is "the test-suite".

http://llvm.org/docs/TestingGuide.html
http://llvm.org/svn/llvm-project/test-suite/trunk

This is not a good result. It seems fairly obvious to me that two instructions are worse than one - an LDRH would be what I would expect generated.

I'm not an ARM specialist.
The only similar test I've found is:
test/CodeGen/ARM/MergeConsecutiveStores.ll
But it uses the option -mtriple=armv7-apple-darwin. With that option I get the result I've posted. I don't want to modify it.
For -mtriple=aarch64 I get the expected result:

ldrh w9, [x2, x9]

But for AArch64 there are no other tests on store/load merge.
I can restructure all this however I feel is better.
But again, I'm not an ARM specialist. And (since there are no regressions) it is better to leave all this for someone who has better knowledge of ARM.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
Line 11261: It was not removed. It is added when the return value BaseIndexOffset(..,..,..) is formed. Here we just do not stop when we find VAR + OFFSET, but recursively call match on VAR, because VAR may be not just BASE but BASE + INDEX.

http://llvm.org/docs/TestingGuide.html
http://llvm.org/svn/llvm-project/test-suite/trunk

There are no new fails on LLVM LIT tests.

http://llvm.org/svn/llvm-project/test-suite/trunk

For this I can get only X86 results. Is this ok?

Hi Evgeny,

There are no new fails on LLVM LIT tests.

The LLVM LIT tests are essentially unit tests so provide a quick indication that something is broken. The test-suite is a series of actual programs and benchmarks that provide a more rigorous correctness and performance test.

The test-suite can be run via LNT - see here for example: http://llvm.org/docs/lnt/quickstart.html#running-tests . For this change, I'm worried about the potential impact of the change beyond the regression test you've written, so I'd prefer if you could run performance testing and check there are no major performance regressions. (Make sure you use -j1 when invoking LNT for this, otherwise you'll end up with noisy results! also use --multisample to get an idea of the variance as some tests can vary wildly).

For this I can get only X86 results. Is this ok?

That's fine - X86 is the only hardware that we can reasonably expect everyone to have access to.

+Chad, to run an eye over the load/store merging changes.

That's fine - X86 is the only hardware that we can reasonably expect everyone to have access to.

For X86, the builds are identical for everything except MiBench/consumer-lame, which shows no performance difference.

In D26367#599175, @evstupac wrote:
ldrb r12, [lr, r12]!
ldrb lr, [lr, #1]
Instead of 1 ldrh.

The question here is: Is LLVM emitting one LDRH today, or is it already emitting two LDRBs?

If the former, then this is a regression. If the latter, then you can safely ignore ARM.

The AArch64 side looks good, but I'd also like to know if we were emitting two byte-loads before.

However, even on the ARM side, this looks more like an issue in the ARM code-gen than in the DAG combiner, so it should be fine to push this through and fix the ARM code-gen later.

But we need to know what the behaviour is today, so that we can correctly file the bug we open on ARM and mention whether it is a regression or not.

cheers,
--renato

Today the behavior is the same 2 byte loads. So there is no regressions. I'm just unable to add new test for ARM.

evstupac mentioned this in D27695: Add Instruction number to LSR cost model (PR23384). Dec 12 2016, 6:07 PM

In D26367#620246, @evstupac wrote:

Today the behavior is the same 2 byte loads. So there is no regressions. I'm just unable to add new test for ARM.

Right, I have just confirmed this behaviour. Also, your patch improves AArch64 (as you said), so I think the best course of action now is to:

add the same snippet above to the ARM test, checking for two ldrbs and with a big FIXME saying that we need to fix PartialOffset in the DAGCombiner to make it work.
Copy that test to the AArch64 directory, change the triple to "aarch64-linux-gnu" and change the register patterns from r[0-9] to w[0-9].
Change the last (added) test in AArch64 to expect ldrh and see it pass.

I can help you with the tests, but the error messages should be reasonably simple to match.

James, given that this is not changing the existing behaviour on ARM and is improving x86 and AArch64 code (and didn't show variation in spec), it should be ok to merge.

cheers,
--renato

add the same snippet above to the ARM test, checking for two ldrbs and with a big FIXME saying that we need to fix PartialOffset in the DAGCombiner to make it work.

Hmm... It looks like a bug report is better.

Copy that test to the AArch64 directory, change the triple to "aarch64-linux-gnu" and change the register patterns from r[0-9] to w[0-9].

I'd leave this move for a separate commit from James, you or someone else from ARM. It should be easier for you, and this move is not related to my fix.

Change the last (added) test in AArch64 to expect ldrh and see it pass.

If you move the test before I commit the patch, I'll add the new check there. If after, you'll need to add it.

Thanks,
Evgeny

In D26367#622049, @rengolin wrote:

In D26367#620246, @evstupac wrote:

Today the behavior is the same 2 byte loads. So there is no regressions. I'm just unable to add new test for ARM.

Right, I have just confirmed this behaviour. Also, your patch improves AArch64 (as you said), so I think the best course of action now is to:

add the same snippet above to the ARM test, checking for two ldrbs and with a big FIXME saying that we need to fix PartialOffset in the DAGCombiner to make it work.

Copy that test to the AArch64 directory, change the triple to "aarch64-linux-gnu" and change the register patterns from r[0-9] to w[0-9].

Change the last (added) test in AArch64 to expect ldrh and see it pass.

I can help you with the tests, but the error messages should be reasonably simple to match.

James, given that this is not changing the existing behaviour on ARM and is improving x86 and AArch64 code (and didn't show variation in spec), it should be ok to merge.

cheers,
--renato

Ok, LGTM from my side.

LGTM on X86

rengolin accepted this revision. Dec 15 2016, 1:34 AM

rengolin edited edge metadata.

This revision is now accepted and ready to land. Dec 15 2016, 1:34 AM

Committed.
rL291012

Fixed in rL291012

Revision Contents

Path                                          Size
lib/CodeGen/SelectionDAG/DAGCombiner.cpp      18 lines
test/CodeGen/X86/MergeConsecutiveStores.ll    34 lines

Diff 77081

lib/CodeGen/SelectionDAG/DAGCombiner.cpp


[Preceding 11,224 lines not shown; hunk context: BaseIndexOffset(SDValue Base, SDValue Index, int64_t Offset,]
       Base(Base), Index(Index), Offset(Offset), IsIndexSignExt(IsIndexSignExt) {}

   bool equalBaseIndex(const BaseIndexOffset &Other) {
     return Other.Base == Base && Other.Index == Index &&
       Other.IsIndexSignExt == IsIndexSignExt;
   }

   /// Parses tree in Ptr for base, index, offset addresses.
-  static BaseIndexOffset match(SDValue Ptr, SelectionDAG &DAG) {
+  static BaseIndexOffset match(SDValue Ptr, SelectionDAG &DAG,
+                               int64_t PartialOffset = 0) {
     bool IsIndexSignExt = false;

     // Split up a folded GlobalAddress+Offset into its component parts.
     if (GlobalAddressSDNode *GA = dyn_cast<GlobalAddressSDNode>(Ptr))
       if (GA->getOpcode() == ISD::GlobalAddress && GA->getOffset() != 0) {
         return BaseIndexOffset(DAG.getGlobalAddress(GA->getGlobal(),
                                                     SDLoc(GA),
                                                     GA->getValueType(0),
-                                                    /*Offset=*/0,
+                                                    /*Offset=*/PartialOffset,
                                                     /*isTargetGA=*/false,
                                                     GA->getTargetFlags()),
                                SDValue(),
                                GA->getOffset(),
                                IsIndexSignExt);
       }

     // We only can pattern match BASE + INDEX + OFFSET. If Ptr is not an ADD
     // instruction, then it could be just the BASE or everything else we don't
     // know how to handle. Just use Ptr as BASE and give up.
     if (Ptr->getOpcode() != ISD::ADD)
-      return BaseIndexOffset(Ptr, SDValue(), 0, IsIndexSignExt);
+      return BaseIndexOffset(Ptr, SDValue(), PartialOffset, IsIndexSignExt);

     // We know that we have at least an ADD instruction. Try to pattern match
     // the simple case of BASE + OFFSET.
     if (isa<ConstantSDNode>(Ptr->getOperand(1))) {
       int64_t Offset = cast<ConstantSDNode>(Ptr->getOperand(1))->getSExtValue();
-      return BaseIndexOffset(Ptr->getOperand(0), SDValue(), Offset,
-                             IsIndexSignExt);
+      return match(Ptr->getOperand(0), DAG, Offset + PartialOffset);
     }

     // Inside a loop the current BASE pointer is calculated using an ADD and a
     // MUL instruction. In this case Ptr is the actual BASE pointer.
     // (i64 add (i64 %array_ptr)
     //          (i64 mul (i64 %induction_var)
     //                   (i64 %element_size)))
     if (Ptr->getOperand(1)->getOpcode() == ISD::MUL)
-      return BaseIndexOffset(Ptr, SDValue(), 0, IsIndexSignExt);
+      return BaseIndexOffset(Ptr, SDValue(), PartialOffset, IsIndexSignExt);

     // Look at Base + Index + Offset cases.
     SDValue Base = Ptr->getOperand(0);
     SDValue IndexOffset = Ptr->getOperand(1);

     // Skip signextends.
     if (IndexOffset->getOpcode() == ISD::SIGN_EXTEND) {
       IndexOffset = IndexOffset->getOperand(0);
       IsIndexSignExt = true;
     }

     // Either the case of Base + Index (no offset) or something else.
     if (IndexOffset->getOpcode() != ISD::ADD)
-      return BaseIndexOffset(Base, IndexOffset, 0, IsIndexSignExt);
+      return BaseIndexOffset(Base, IndexOffset, PartialOffset, IsIndexSignExt);

     // Now we have the case of Base + Index + offset.
     SDValue Index = IndexOffset->getOperand(0);
     SDValue Offset = IndexOffset->getOperand(1);

     if (!isa<ConstantSDNode>(Offset))
-      return BaseIndexOffset(Ptr, SDValue(), 0, IsIndexSignExt);
+      return BaseIndexOffset(Ptr, SDValue(), PartialOffset, IsIndexSignExt);

     // Ignore signextends.
     if (Index->getOpcode() == ISD::SIGN_EXTEND) {
       Index = Index->getOperand(0);
       IsIndexSignExt = true;
     } else IsIndexSignExt = false;

     int64_t Off = cast<ConstantSDNode>(Offset)->getSExtValue();
-    return BaseIndexOffset(Base, Index, Off, IsIndexSignExt);
+    return BaseIndexOffset(Base, Index, Off + PartialOffset, IsIndexSignExt);
   }
 };
 } // namespace

 // This is a helper function for visitMUL to check the profitability
 // of folding (mul (add x, c1), c2) -> (add (mul x, c2), c1*c2).
 // MulNode is the original multiply, AddNode is (add x, c1),
 // and ConstNode is c2.
[Following 4,156 lines not shown]

test/CodeGen/X86/MergeConsecutiveStores.ll

[Preceding 365 lines not shown; hunk context: ; <label>:1]
   %12 = icmp eq i32 %11, 0
   br i1 %12, label %13, label %1

 ; <label>:13
   ret void
 }

 ; Make sure that we merge the consecutive load/store sequence below and use a
+; word (16 bit) instead of a byte copy for complicated address calculation.
+; .
+; CHECK-LABEL: MergeLoadStoreBaseIndexOffsetComplicated:
+; BWON: movzwl (%{{.*}},%{{.*}}), %e[[REG:[a-z]+]]
+; BWOFF: movw (%{{.*}},%{{.*}}), %[[REG:[a-z]+]]
+; CHECK: movw %[[REG]], (%{{.*}})
+define void @MergeLoadStoreBaseIndexOffsetComplicated(i8* %a, i8* %b, i8* %c, i64 %n) {
+  br label %1
+
+; <label>:1
+  %.09 = phi i64 [ 0, %0 ], [ %13, %1 ]
+  %.08 = phi i8* [ %b, %0 ], [ %12, %1 ]
+  %2 = load i8, i8* %.08, align 1
+  %3 = sext i8 %2 to i64
+  %4 = getelementptr inbounds i8, i8* %c, i64 %3
+  %5 = load i8, i8* %4, align 1
+  %6 = add nsw i64 %3, 1
+  %7 = getelementptr inbounds i8, i8* %c, i64 %6
+  %8 = load i8, i8* %7, align 1
+  %9 = getelementptr inbounds i8, i8* %a, i64 %.09
+  store i8 %5, i8* %9, align 1
+  %10 = or i64 %.09, 1
+  %11 = getelementptr inbounds i8, i8* %a, i64 %10
+  store i8 %8, i8* %11, align 1
+  %12 = getelementptr inbounds i8, i8* %.08, i64 1
+  %13 = add nuw nsw i64 %.09, 2
+  %14 = icmp slt i64 %13, %n
+  br i1 %14, label %1, label %15
+
+; <label>:15
+  ret void
+}
+
+; Make sure that we merge the consecutive load/store sequence below and use a
 ; word (16 bit) instead of a byte copy even if there are intermediate sign
 ; extensions.
 ; CHECK-LABEL: MergeLoadStoreBaseIndexOffsetSext:
 ; BWON: movzwl (%{{.*}},%{{.*}}), %e[[REG:[a-z]+]]
 ; BWOFF: movw (%{{.*}},%{{.*}}), %[[REG:[a-z]+]]
 ; CHECK: movw %[[REG]], (%{{.*}})
 define void @MergeLoadStoreBaseIndexOffsetSext(i8* %a, i8* %b, i8* %c, i32 %n) {
   br label %1
[Following 175 lines not shown]