
Details

Reviewers

spatel
qcolombet
RKSimon
reames
simon.f.whittaker
hfinkel

Summary

Poor code gen identified in Andreas Fredriksson's GDC 2016 Talk 'Taming the Jaguar: x86 Optimization at Insomniac Games':
http://schedule.gdconf.com/session/taming-the-jaguar-x86-optimization-at-insomniac-games

(see details in PR27136)

Diff Detail

Event Timeline

avt77 updated this revision to Diff 71919. Sep 20 2016, 6:17 AM

avt77 retitled this revision to "Failure to hoist constant out of loop".

avt77 updated this object.

avt77 added a reviewer: ABataev.

I updated the patch after the first review from Alexey Bataev.

Resolved another request from Alexey.

avt77 added reviewers: simon.f.whittaker, RKSimon, spatel. Sep 21 2016, 5:33 AM

Fixed the test according to RKSimon's request.

avt77 updated this object. Sep 22 2016, 7:16 AM

Metadata removed from the test

The previously committed test was updated to mirror the changes in CodeGen. Please review this version.

This certainly seems like the right thing to do. Have you run the test suite, etc. to check for performance regressions?

In D24760#558980, @hfinkel wrote:

This certainly seems like the right thing to do. Have you run the test suite, etc. to check for performance regressions?

No, I did not run any special test suite. Could you give me a hint on how to do it?

djasper removed a reviewer: djasper. Oct 4 2016, 11:29 PM

In D24760#558980, @hfinkel wrote:

This certainly seems like the right thing to do. Have you run the test suite, etc. to check for performance regressions?

All tests from llvm-test-suite passed and there are no performance degradations. I used the following command:

lnt runtest nt -v --sandbox SANDBOX --cc WORKSPACE/build/bin/clang --test-suite ~/llvm-test-suite

Is this enough, or should I do more testing?

A couple of small comments, but fair warning: I am not a qualified reviewer for this code and won't be able to sign off.

lib/CodeGen/MachineLICM.cpp
Line 1083: If I understand the issue correctly, this might be better phrased in terms of this predicate. Essentially, we know that we can rematerialize the constant without creating a copy even if there is a PHI use. I can see why this follows for loop exit PHIs. I'm not quite as clear that it follows for PHIs inside the loop.
test/CodeGen/X86/loop-search.ll
Line 58: Just to make sure I understand the case you're trying to fix: the previous code fell over on this instruction because it is a use of the instruction that materializes the constant, right?

Moved the comment about the hoisted constant to the proper place.

avt77 added inline comments. Oct 11 2016, 4:04 AM

lib/CodeGen/MachineLICM.cpp
Line 1083: I'm not sure I understand you. The idea is very simple, and you described it yourself: yes, we can rematerialize the constant without creating a copy, which is why it is profitable to hoist it. Is that OK?
test/CodeGen/X86/loop-search.ll
Line 58: It seems I put my comment in the wrong place: it should be moved from line 28 to line 33. I'll fix it ASAP. The previous code prepared the TRUE result (a constant) inside the loop body (see lines 18 and 19 on the left). Now we materialize this constant outside the loop (see lines 32 and 33 on the right).

I created a simple test to check the performance:
#include <stdint.h>
#include <stdbool.h>

#define N 1000000000
#define M 100000000000000000

/* Linear search: returns true as soon as needle is found in haystack. */
bool search(long needle, const uint32_t *haystack, int count) {
  for (int i = 0; i < count; ++i)
    if (needle == haystack[i])
      return true;
  return false;
}

long a[N];

int main() {
  int i;
  for (i = 0; i < N; i++) {
    a[i] = i;
  }
  /* As posted: a is an array of long viewed through a uint32_t pointer. */
  for (long j = 0; j < M; j++) {
    search(j % N, (const uint32_t *)&a, N);
  }
  return 0;
}
I ran this test with and without my fix, like this:

time ./search.o

The improvement is about 8%.

Hal, could you give me your LGTM on this patch?
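For readers who want to try a measurement of this kind, here is a scaled-down, self-timing sketch. It is not the author's benchmark: the function run_search_benchmark and its parameters are hypothetical, the haystack is a plain uint32_t array, and the hit count is returned so the searches cannot be discarded as dead code. Timings from it are not comparable to the 8% figure above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

bool search(uint32_t needle, const uint32_t *haystack, int count) {
  for (int i = 0; i < count; ++i)
    if (needle == haystack[i])
      return true;
  return false;
}

/* Runs `queries` searches over a haystack of `n` elements and returns the
   number of hits (or -1 on invalid size or allocation failure). The elapsed
   CPU time in seconds is stored in *seconds. */
long run_search_benchmark(int n, long queries, double *seconds) {
  assert(seconds != NULL);
  if (n <= 0)
    return -1;
  uint32_t *haystack = malloc((size_t)n * sizeof *haystack);
  if (!haystack)
    return -1;
  for (int i = 0; i < n; ++i)
    haystack[i] = (uint32_t)i;

  long hits = 0;
  clock_t start = clock();
  for (long j = 0; j < queries; ++j)
    hits += search((uint32_t)(j % n), haystack, n);
  *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

  free(haystack);
  return hits;
}
```

Every needle is taken modulo n, so each query is a hit and the returned count should equal the number of queries; compare the reported time for binaries built with and without the patch.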

In D24760#570267, @avt77 wrote:
Hal, could you give me your LGTM on this patch?

Okay; LGTM. Can you also put together a patch to add your test to the test-suite? That way we can track the performance.

This revision is now accepted and ready to land. Oct 14 2016, 7:14 AM

Sorry, I don't understand: do you mean I should add my perf test to the patch? But it is not a LIT test, it's an application. How can I add it to the test suite? I'll do it in a follow-up patch (next week), but please explain how I should do it.

In D24760#570503, @avt77 wrote:

Sorry, I don't understand: do you mean I should add my perf test to the patch? But it is not a LIT test, it's an application. How can I add it to the test suite? I'll do it in a follow-up patch (next week), but please explain how I should do it.

Applications get added to the test suite (not the lit-run regressions tests). You know how to run the test suite (you did it with lnt as you indicated above). You could submit a patch to add your test to https://llvm.org/svn/llvm-project/test-suite/trunk, probably by placing it in the SingleSource/UnitTests subdirectory. The patch gets directed to llvm-commits as with this one.

avt77 added a reviewer: kparzysz. Oct 18 2016, 11:59 PM

kparzysz, not long ago you added the test tail-dup-merge-loop-headers.ll (maybe you added some other tests as well at that time). This test (plus some others) now fails when I apply this patch to trunk. Could you review my really tiny patch and give me a hint about what goes wrong in your code when this patch is applied?

In D24760#573827, @avt77 wrote:

kparzysz, not long ago you added the test tail-dup-merge-loop-headers.ll (maybe you added some other tests as well at that time). This test (plus some others) now fails when I apply this patch to trunk. Could you review my really tiny patch and give me a hint about what goes wrong in your code when this patch is applied?

That was actually Kyle Butt (@iteratee) who added that test.

For tail-dup-merge-loop-headers.ll

That is a layout test, and your change affected the final layout. Your change is innocuous because it's really checking for the presence/absence of blocks.

In general:
By my count, you have 5 other tests to look at. Please look carefully at them to see if you can understand the intent of the test, and if the resulting code is in keeping with the intent of the test. If so, feel free to change the test to match. If you can't, try to get someone to look at the specific test.

avt77 edited edge metadata. Oct 20 2016, 7:13 AM

avt77 added subscribers: ab, jroelofs, logan.

logan, jroelofs, ab, iteratee, tnorthover: this patch fails your tests:
atomic-cmpxchg
cmpxchg-idioms
ifcvt-rescan-diamonds
But it seems the new version is better than the current one. Could you review the newly generated code and allow me to fix the tests?

atomic-cmpxchg.current.s (749 B)
atomic-cmpxchg.new.s (736 B)
tail-dup-merge-loop-headers.new.s (4 KB)
cmpxchg-idioms.current.s (1 KB)
ifcvt-rescan-diamonds.current.s (2 KB)
ifcvt-rescan-diamonds.new.s (2 KB)
cmpxchg-idioms.new.s (1 KB)
tail-dup-merge-loop-headers.current.s (4 KB)

In D24760#575410, @avt77 wrote:

logan, jroelofs, ab, iteratee, tnorthover: this patch fails your tests:
atomic-cmpxchg
cmpxchg-idioms
ifcvt-rescan-diamonds
But it seems the new version is better than the current one. Could you review the newly generated code and allow me to fix the tests?

Mind uploading these as full-context diffs? Phab's way of showing them as raw files isn't particularly helpful for reviewing or making comments.

For ifcvt-rescan-diamonds.ll: nothing is broken; the test just gets optimized away.

Change the 0 in the phi of %cond.end84 to a load and you'll be fine. Change the CHECK lines to match.

I've fixed all the failing tests. Test owners, please review my changes: they are rather small, so it should not take much of your time.

Herald added a subscriber: anna. Oct 26 2016, 7:35 AM

I don't see any changes to ifcvt-rescan-diamonds.ll

In D24760#580189, @iteratee wrote:

I don't see any changes to ifcvt-rescan-diamonds.ll

Yes, the problem has disappeared; I don't know why. Sorry for the noise.

Hi @avt77:

For atomic-cmpxchg.ll, I found that your output is the same as the version I originally committed. The test was updated by @danielcdh in D24818 / rL284757. You may wish to add him as reviewer.

Looks like the hoisting issue described in PR27136 is already fixed at head? My guess is that the fix is from https://reviews.llvm.org/rL284757

In D24760#584793, @danielcdh wrote:

Looks like the hoisting issue described in PR27136 is already fixed at head? My guess is that the fix is from https://reviews.llvm.org/rL284757

Yes, you're right: the current trunk does not have the issue with test/CodeGen/X86/loop-search.ll. It was fixed. But the hoisting issue is still there: this patch really improves the code for several tests. Should we continue with this patch, or should we simply close it because loop-search.ll was fixed?

kparzysz resigned from this revision. Nov 2 2016, 6:39 AM

kparzysz removed a reviewer: kparzysz.

danielcdh added inline comments. Nov 2 2016, 10:18 AM

test/CodeGen/X86/loop-search.ll
Line 15: It looks like hoisting this instruction is not the best choice because it will be executed speculatively. More specifically, if the search fails, i.e. the loop terminates when i == count (no early exit), performance will be worse because the hoisted movb will be redundant. OTOH, if the loop terminates with an early exit (needle == haystack[i]), there will be no redundancy, but the live range of ax will be much longer and will overlap with many other live ranges. This adds extra burden to RA. So it looks to me like hoisting is an overall loss here?

ABataev resigned from this revision. Feb 13 2017, 11:44 AM

Abandon this? It seems to be fixed in trunk. There are still some issues in PR27136, but not the constant hoisting.

This revision is now accepted and ready to land. Apr 7 2018, 9:17 AM

Herald added a subscriber: javed.absar. Apr 7 2018, 9:17 AM

Stripping approval

This revision now requires changes to proceed. Apr 7 2018, 9:18 AM

Trunk has fixed the issue.

Diff 72167

lib/CodeGen/MachineLICM.cpp

[1,056 lines not shown]
 }

 /// Return true if it is potentially profitable to hoist the given loop
 /// invariant.
 bool MachineLICM::IsProfitableToHoist(MachineInstr &MI) {
   if (MI.isImplicitDef())
     return true;

+  // Rematerializable instructions should always be hoisted since the register
+  // allocator can just pull them down again when needed.
+  if (TII->isTriviallyReMaterializable(MI, AA))
+    return true;
+
   // Besides removing computation from the loop, hoisting an instruction has
   // these effects:
   //
   // - The value defined by the instruction becomes live across the entire
   //   loop. This increases register pressure in the loop.
   //
   // - If the value is used by a PHI in the loop, a copy will be required for
   //   lowering the PHI after extending the live range.
   //
   // - When hoisting the last use of a value in the loop, that value no longer
   //   needs to be live in the loop. This lowers register pressure in the loop.

   bool CheapInstr = IsCheapInstruction(MI);
   bool CreatesCopy = HasLoopPHIUse(&MI);

   // Don't hoist a cheap instruction if it would create a copy in the loop.
   if (CheapInstr && CreatesCopy) {
     DEBUG(dbgs() << "Won't hoist cheap instr with loop PHI use: " << MI);
     return false;
   }

-  // Rematerializable instructions should always be hoisted since the register
-  // allocator can just pull them down again when needed.
-  if (TII->isTriviallyReMaterializable(MI, AA))
-    return true;
-
   // FIXME: If there are long latency loop-invariant instructions inside the
   // loop at this point, why didn't the optimizer's LICM hoist them?
   for (unsigned i = 0, e = MI.getDesc().getNumOperands(); i != e; ++i) {
     const MachineOperand &MO = MI.getOperand(i);
     if (!MO.isReg() || MO.isImplicit())
       continue;
     unsigned Reg = MO.getReg();
     if (!TargetRegisterInfo::isVirtualRegister(Reg))
[290 lines not shown]
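The effect of this reordering can be sketched with a small model of the early checks. This is an illustration under simplifying assumptions, not LLVM code: the struct, enum, and function names are hypothetical, the instruction is reduced to four booleans, and the later checks in the real function are collapsed into an UNDECIDED result.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified inputs to the hoisting decision (hypothetical model). */
struct instr_facts {
  bool implicit_def;
  bool trivially_rematerializable;
  bool cheap;
  bool has_loop_phi_use; /* hoisting would create a copy in the loop */
};

enum decision { DONT_HOIST, HOIST, UNDECIDED };

/* Left-hand side of the diff: the PHI-copy check runs first. */
enum decision decide_before_patch(struct instr_facts f) {
  if (f.implicit_def) return HOIST;
  if (f.cheap && f.has_loop_phi_use) return DONT_HOIST;
  if (f.trivially_rematerializable) return HOIST;
  return UNDECIDED; /* later checks are not modeled here */
}

/* Right-hand side of the diff: the rematerialization check runs first. */
enum decision decide_after_patch(struct instr_facts f) {
  if (f.implicit_def) return HOIST;
  if (f.trivially_rematerializable) return HOIST;
  if (f.cheap && f.has_loop_phi_use) return DONT_HOIST;
  return UNDECIDED;
}

/* The two orderings can only disagree for a cheap, rematerializable
   instruction that has a loop PHI use. */
bool orderings_disagree(struct instr_facts f) {
  bool disagree = decide_before_patch(f) != decide_after_patch(f);
  assert(!disagree || (f.trivially_rematerializable && f.cheap &&
                       f.has_loop_phi_use));
  return disagree;
}
```

In this model the patch changes the outcome for exactly one combination: a cheap, trivially rematerializable instruction with a loop PHI use goes from "don't hoist" to "hoist", which matches the constant-materialization case discussed in the review.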

test/CodeGen/X86/licm-nested.ll

 ; REQUIRES: asserts
-; RUN: llc -mtriple=x86_64-apple-darwin -march=x86-64 < %s -o /dev/null -stats -info-output-file - | grep "hoisted out of loops" | grep 4
+; RUN: llc -mtriple=x86_64-apple-darwin -march=x86-64 < %s -o /dev/null -stats -info-output-file - | FileCheck %s

 ; MachineLICM should be able to hoist the symbolic addresses out of
 ; the inner loops.

+; CHECK: 6{{.*}}hoisted out of loops
+
 @main.flags = internal global [8193 x i8] zeroinitializer, align 16 ; <[8193 x i8]*> [#uses=3]
 @.str = private constant [11 x i8] c"Count: %d\0A\00" ; <[11 x i8]*> [#uses=1]

 define i32 @main(i32 %argc, i8** nocapture %argv) nounwind ssp {
 entry:
   %cmp = icmp eq i32 %argc, 2 ; <i1> [#uses=1]
   br i1 %cmp, label %while.cond.preheader, label %bb.nph53

[76 lines not shown]

test/CodeGen/X86/loop-search.ll

+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-apple-darwin | FileCheck %s
+
+; This test comes from PR27136
+; We should hoist loop constant invariant
+
+define zeroext i1 @search(i32 %needle, i32* nocapture readonly %haystack, i32 %count) {
+; CHECK-LABEL: search:
+; CHECK: ## BB#0: ## %entry
+; CHECK-NEXT: testl %edx, %edx
+; CHECK-NEXT: jle LBB0_1
+; CHECK-NEXT: ## BB#2: ## %for.body.preheader
+; CHECK-NEXT: movslq %edx, %rcx
+; CHECK-NEXT: xorl %eax, %eax
+; CHECK-NEXT: xorl %edx, %edx
+; CHECK-NEXT: .p2align 4, 0x90
+; CHECK-NEXT: LBB0_4: ## %for.body
+; CHECK-NEXT: ## =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: cmpl %edi, (%rsi,%rdx,4)
+; CHECK-NEXT: je LBB0_5
+; CHECK-NEXT: ## BB#3: ## %for.cond
+; CHECK-NEXT: ## in Loop: Header=BB0_4 Depth=1
+; CHECK-NEXT: incq %rdx
+; CHECK-NEXT: cmpq %rcx, %rdx
+; CHECK-NEXT: jl LBB0_4
+; CHECK-NEXT: jmp LBB0_6
+; CHECK-NEXT: LBB0_1:
+; CHECK-NEXT: xorl %eax, %eax
+; CHECK-NEXT: ## kill: %AL<def> %AL<kill> %RAX<kill>
+; CHECK-NEXT: retq
+; CHECK-NEXT: LBB0_5:
+; ## the TRUE result value moved here from %for.body
+; CHECK-NEXT: movb $1, %al
+; CHECK-NEXT: LBB0_6: ## %cleanup
+; CHECK-NEXT: ## kill: %AL<def> %AL<kill> %RAX<kill>
+; CHECK-NEXT: retq
+;
+entry:
+  %cmp5 = icmp sgt i32 %count, 0
+  br i1 %cmp5, label %for.body.preheader, label %cleanup
+
+for.body.preheader: ; preds = %entry
+  %0 = sext i32 %count to i64
+  br label %for.body
+
+for.cond: ; preds = %for.body
+  %cmp = icmp slt i64 %indvars.iv.next, %0
+  br i1 %cmp, label %for.body, label %cleanup.loopexit
+
+for.body: ; preds = %for.body.preheader, %for.cond
+  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.cond ]
+  %arrayidx = getelementptr inbounds i32, i32* %haystack, i64 %indvars.iv
+  %1 = load i32, i32* %arrayidx, align 4
+  %cmp1 = icmp eq i32 %1, %needle
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  br i1 %cmp1, label %cleanup.loopexit, label %for.cond
+
+cleanup.loopexit: ; preds = %for.cond, %for.body
+  %.ph = phi i1 [ false, %for.cond ], [ true, %for.body ]
+  br label %cleanup
+
+cleanup: ; preds = %cleanup.loopexit, %entry
+  %2 = phi i1 [ false, %entry ], [ %.ph, %cleanup.loopexit ]
+  ret i1 %2
+}

This is an archive of the discontinued LLVM Phabricator instance.

Failure to hoist constant out of loop
Abandoned, Public
