This is an archive of the discontinued LLVM Phabricator instance.

Differential D7507

[SimplifyCFG] Be more aggressive
ClosedPublic

Authored by jmolloy on Feb 9 2015, 9:09 AM.

Details

Reviewers

andreadb
hfinkel

Summary

Up the phi node folding threshold from a cheap "1" to a meagre "2".

Update tests for extra added selects and slight code churn.

Diff Detail

Repository: rL LLVM

Event Timeline

jmolloy updated this revision to Diff 19589. Feb 9 2015, 9:09 AM

jmolloy retitled this revision from to [SimplifyCFG] Be more aggressive.

jmolloy updated this object.

jmolloy edited the test plan for this revision.

jmolloy added a reviewer: hfinkel.

jmolloy set the repository for this revision to rL LLVM.

jmolloy added a subscriber: Unknown Object (MLST).

kristof.beyls added a subscriber: kristof.beyls. Feb 9 2015, 9:14 AM

Update to add a new test, checking that the idiomatic "clamp()" function is correctly optimized to two selects.
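For reference, the "clamp()" idiom mentioned above can be sketched in C++ as follows. This is only an illustration of the pattern and is not the clamp.ll test added in this revision; the function names are hypothetical. The first function is the branchy form as it is usually written, and the second is the equivalent form in which each conditional expression would map to one IR select.

```cpp
#include <cassert>

// Branchy form: the shape a hand-written clamp usually has in source code.
int clamp_branchy(int x, int lo, int hi) {
  assert(lo <= hi && "clamp bounds must be ordered");
  if (x < lo)
    return lo;
  if (x > hi)
    return hi;
  return x;
}

// Select form: no control flow, two conditional expressions.
// Each one is the source-level analogue of a single IR `select`.
int clamp_selects(int x, int lo, int hi) {
  assert(lo <= hi && "clamp bounds must be ordered");
  int upper = (x > hi) ? hi : x;  // first select
  return (x < lo) ? lo : upper;   // second select
}
```

Both functions return the same value for every input with lo <= hi; the second simply keeps all of the work in one basic block.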

jmolloy mentioned this in D7506: [SimplifyCFG] Swap to using TargetTransformInfo for cost analysis. Feb 10 2015, 1:04 AM

andreadb added a reviewer: andreadb. Feb 10 2015, 10:36 AM

hfinkel added inline comments. Feb 10 2015, 10:51 AM

test/Transforms/SimplifyCFG/clamp.ll
5 ↗	(On Diff #19647)	Please actually check for the clamp code you expect.

hfinkel added inline comments. Feb 10 2015, 10:52 AM

lib/Transforms/Utils/SimplifyCFG.cpp
58	And please add a comment here explaining why the number is 2 (so that we'll get the clamp pattern, etc.).

Hi Hal,

Thanks for the review. Changes made.

Cheers,

James

Can you please provide a summary of how this affects our normal performance benchmarks?

lib/Transforms/Utils/SimplifyCFG.cpp
57	Do you mean two selects?

Hi Hal,

Sure. There is no difference in any of the industry-standard test suites I have. In LNT, I see two small regressions and 9 small improvements (on AArch64):

Regressions:

telecomm-gsm: 4.0%
Nodesplitting-dbl: 1.04%

Improvements:

aha: -7.8%
lambda: -6.16%
covariance (from polybench): -3.65%
gemver: -2.65%
+5 more sub-2% improvements.

Cheers,

James

lib/Transforms/Utils/SimplifyCFG.cpp
57	No; the heuristic has to be enough that it will hoist one select, to then remove the branch and in the end cause two selects.
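A toy model of why a threshold of 1 may not be enough for this pattern is sketched below. It rests on two assumptions that are not stated in this thread: that every cheap instruction costs one unit against the threshold, and that the block to be hoisted holds a compare feeding the already-formed select. The `Inst`, `canSpeculate`, and `clampInnerBlock` names are hypothetical and are not taken from SimplifyCFG.cpp, whose actual accounting may differ.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one instruction and its speculation cost.
struct Inst {
  const char *name;
  unsigned cost;
};

// Returns true if every instruction in the block can be paid for out of the
// budget given by `threshold`; each instruction consumes its own cost.
bool canSpeculate(const std::vector<Inst> &block, unsigned threshold) {
  unsigned remaining = threshold;
  for (const Inst &inst : block) {
    if (inst.cost > remaining)
      return false;
    remaining -= inst.cost;
  }
  return true;
}

// Assumed contents of the block that must be hoisted to finish the clamp:
// a compare plus the select produced by folding the inner branch.
std::vector<Inst> clampInnerBlock() {
  return {{"icmp", 1}, {"select", 1}};
}
```

Under these assumptions, `canSpeculate(clampInnerBlock(), 1)` is false and `canSpeculate(clampInnerBlock(), 2)` is true, which matches the move from 1 to 2 in this revision.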

When I run this change on my POWER7 box, I see no improvements and one major regression:

MultiSource/Benchmarks/Olden/power/power
23.3258% +/- 9.53904%

I'll attempt to figure out what is going on here.

Generally speaking, I'd like to discuss a bit more, from a modeling perspective, what makes this a good idea? And, should we be using the same threshold for both FoldTwoEntryPHINode and SpeculativelyExecuteBB? We have costs for the instructions, do we need a cost for the branch? Do we need to consider whether or not we're speculating multiple instructions that are dependent on each other vs. independent?

lib/Transforms/Utils/SimplifyCFG.cpp
57	Okay, please explain that in the comment (there is no need to be terse).

Hi Hal,

I can make the comment change, no problem.

With regard to modelling, I think this is going to be a very difficult problem if we want to model more accurately. The penalty for speculating instructions is a function of the number and type of instructions speculated, the in-orderness of the CPU, the predictability of the branch condition, and probably a bunch of other factors too.

I feel that, given how much better the optimizer can reason about things when they're in a single basic block, for small sequences like this the speculated version should be the canonical form. Then CodeGenPrepare or something similarly nearer the backend can make a target-specific decision whether to expand it or not. So really, the heuristic here is mainly a cutoff to stop horribly expensive stuff from being speculated, but the backend should have the final say.

That is, the indirect benefits outweigh the direct benefits, so by more accurately modelling the direct cost in SimplifyCFG we may end up making the wrong decision.

James
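To make the factors listed above concrete, here is a deliberately crude break-even sketch. Everything in it is an assumption for illustration only: the `BranchProfile` fields, the linear cost formulas, and any cycle counts passed in are hypothetical and do not come from LLVM or from any measurement in this thread.

```cpp
#include <cassert>

// Hypothetical description of how one conditional branch behaves at run time.
struct BranchProfile {
  double takenProb;         // probability that the guarded block executes
  double mispredictRate;    // fraction of executions that are mispredicted
  double mispredictPenalty; // cycles lost per misprediction
};

// Expected cycles when the branch is kept: the body is paid for only when it
// runs, and mispredictions add their penalty on top.
double branchyCost(const BranchProfile &p, double bodyCycles) {
  assert(p.takenProb >= 0.0 && p.takenProb <= 1.0);
  assert(p.mispredictRate >= 0.0 && p.mispredictRate <= 1.0);
  return p.takenProb * bodyCycles + p.mispredictRate * p.mispredictPenalty;
}

// Cycles when the body is always executed and a select picks the result.
double speculatedCost(double bodyCycles, double selectCycles) {
  return bodyCycles + selectCycles;
}

// True if, under this toy model, speculating is cheaper than branching.
bool speculationWins(const BranchProfile &p, double bodyCycles,
                     double selectCycles) {
  return speculatedCost(bodyCycles, selectCycles) < branchyCost(p, bodyCycles);
}
```

Even in this simplified form the answer flips with the branch profile: a coin-flip branch with a large misprediction penalty favours speculation, while a well-predicted, rarely taken branch does not. Those inputs are target- and workload-specific, which is the reason given above for leaving the final decision to the backend.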

To provide one more data point: On an Intel Sandy Bridge box, I see no regressions and one speedup:

SingleSource/Benchmarks/CoyoteBench/huffbench
-11.4659% +/- 3.75373%

In D7507#122614, @jmolloy wrote:

Hi Hal,

I can make the comment change, no problem.

With regard to modelling, I think this is going to be a very difficult problem if we want to model more accurately. The penalty for speculating instructions is a function of the number and type of instructions speculated, the in-orderness of the CPU, the predictability of the branch condition, and probably a bunch of other factors too.

I feel that, given how much better the optimizer can reason about things when they're in a single basic block, for small sequences like this the speculated version should be the canonical form. Then CodeGenPrepare or something similarly nearer the backend can make a target-specific decision whether to expand it or not. So really, the heuristic here is mainly a cutoff to stop horribly expensive stuff from being speculated, but the backend should have the final say.

That is, the indirect benefits outweigh the direct benefits, so by more accurately modelling the direct cost in SimplifyCFG we may end up making the wrong decision.

Fair enough, but sometimes we need to take the "do no harm" approach. Nevertheless, it turns out my P7 performance regression was a combination of a faulty script and a compiler crash, so we can ignore that (it is a good thing that I investigated it, however, because it turns out it was a serious regression). I think that we may need to return to the modeling question here, but I think we can move forward with this for now (all of my testing is neutral, both on PPC and on X86, with a speedup on x86). LGTM.

James

This revision is now accepted and ready to land. Feb 12 2015, 4:29 PM

Thanks Hal,

Landed in r229099.

James

jmolloy closed this revision. Feb 13 2015, 2:51 AM

Revision Contents

Path                                           Size
lib/Transforms/Utils/SimplifyCFG.cpp           4 lines
test/CodeGen/AArch64/analyzecmp.ll             8 lines
test/CodeGen/AArch64/arm64-promote-const.ll    36 lines
test/CodeGen/Thumb2/ifcvt-neon.ll              6 lines

Diff 19589

lib/Transforms/Utils/SimplifyCFG.cpp

	Show First 20 Lines • Show All 48 Lines • ▼ Show 20 Lines
	#include <map>			#include <map>
	#include <set>			#include <set>
	using namespace llvm;			using namespace llvm;
	using namespace PatternMatch;			using namespace PatternMatch;

	#define DEBUG_TYPE "simplifycfg"			#define DEBUG_TYPE "simplifycfg"

	static cl::opt<unsigned>			static cl::opt<unsigned>
	PHINodeFoldingThreshold("phi-node-folding-threshold", cl::Hidden, cl::init(1),			PHINodeFoldingThreshold("phi-node-folding-threshold", cl::Hidden, cl::init(2),
	cl::desc("Control the amount of phi node folding to perform (default = 1)"));			cl::desc("Control the amount of phi node folding to perform (default = 2)"));

	static cl::opt<bool>			static cl::opt<bool>
	DupRet("simplifycfg-dup-ret", cl::Hidden, cl::init(false),			DupRet("simplifycfg-dup-ret", cl::Hidden, cl::init(false),
	cl::desc("Duplicate return instructions into unconditional branches"));			cl::desc("Duplicate return instructions into unconditional branches"));

	static cl::opt<bool>			static cl::opt<bool>
	SinkCommon("simplifycfg-sink-common", cl::Hidden, cl::init(true),			SinkCommon("simplifycfg-sink-common", cl::Hidden, cl::init(true),
	cl::desc("Sink common instructions down to the end block"));			cl::desc("Sink common instructions down to the end block"));
	▲ Show 20 Lines • Show All 4,558 Lines • Show Last 20 Lines

test/CodeGen/AArch64/analyzecmp.ll

	; RUN: llc -O3 -mcpu=cortex-a57 < %s \| FileCheck %s			; RUN: llc -O3 -mcpu=cortex-a57 < %s \| FileCheck %s

	; CHECK-LABLE: @test			; CHECK-LABEL: @test
	; CHECK: tst [[CMP:x[0-9]+]], #0x8000000000000000			; CHECK: and
	; CHECK: csel [[R0:x[0-9]+]], [[S0:x[0-9]+]], [[S1:x[0-9]+]], eq			; CHECK: csel
	; CHECK: csel [[R1:x[0-9]+]], [[S2:x[0-9]+]], [[S3:x[0-9]+]], eq			; CHECK: csel
	target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"			target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
	target triple = "arm64--linux-gnueabi"			target triple = "arm64--linux-gnueabi"

	define void @test(i64 %a, i64* %ptr1, i64* %ptr2) #0 align 2 {			define void @test(i64 %a, i64* %ptr1, i64* %ptr2) #0 align 2 {
	entry:			entry:
	%conv = and i64 %a, 4294967295			%conv = and i64 %a, 4294967295
	%add = add nsw i64 %conv, -1			%add = add nsw i64 %conv, -1
	%div = sdiv i64 %add, 64			%div = sdiv i64 %add, 64
	Show All 18 Lines

test/CodeGen/AArch64/arm64-promote-const.ll

	Show First 20 Lines • Show All 129 Lines • ▼ Show 20 Lines

	; Two different uses of the sane constant in two different basic blocks,			; Two different uses of the sane constant in two different basic blocks,
	; one is in a phi.			; one is in a phi.
	define <16 x i8> @test5(<16 x i8> %arg, i32 %path) {			define <16 x i8> @test5(<16 x i8> %arg, i32 %path) {
	; PROMOTED-LABEL: test5:			; PROMOTED-LABEL: test5:
	; In stress mode, constant vector are promoted			; In stress mode, constant vector are promoted
	; Since, the constant is the same as the previous function,			; Since, the constant is the same as the previous function,
	; the same address must be used			; the same address must be used
	; PROMOTED: adrp [[PAGEADDR:x[0-9]+]], [[CSTV1]]@PAGE			; PROMOTED: ldr
	; PROMOTED-NEXT: ldr q[[REGNUM:[0-9]+]], {{\[}}[[PAGEADDR]], [[CSTV1]]@PAGEOFF]			; PROMOTED-NOT: ldr
	; PROMOTED-NEXT: cbz w0, [[LABEL:LBB.*]]			; PROMOTED: ret
	; Next BB
	; PROMOTED: add.16b [[DESTV:v[0-9]+]], v0, v[[REGNUM]]
	; PROMOTED-NEXT: mul.16b v[[REGNUM]], [[DESTV]], v[[REGNUM]]
	; Next BB
	; PROMOTED-NEXT: [[LABEL]]:
	; PROMOTED-NEXT: mul.16b [[TMP1:v[0-9]+]], v[[REGNUM]], v[[REGNUM]]
	; PROMOTED-NEXT: mul.16b [[TMP2:v[0-9]+]], [[TMP1]], [[TMP1]]
	; PROMOTED-NEXT: mul.16b [[TMP3:v[0-9]+]], [[TMP2]], [[TMP2]]
	; PROMOTED-NEXT: mul.16b v0, [[TMP3]], [[TMP3]]
	; PROMOTED-NEXT: ret

	; REGULAR-LABEL: test5:			; REGULAR-LABEL: test5:
	; REGULAR: cbz w0, [[LABELelse:LBB.*]]			; REGULAR: ldr
	; Next BB			; REGULAR: ret
	; REGULAR: adrp [[PAGEADDR:x[0-9]+]], [[CSTLABEL:lCP.*]]@PAGE
	; REGULAR-NEXT: ldr q[[REGNUM:[0-9]+]], {{\[}}[[PAGEADDR]], [[CSTLABEL]]@PAGEOFF]
	; REGULAR-NEXT: add.16b [[DESTV:v[0-9]+]], v0, v[[REGNUM]]
	; REGULAR-NEXT: mul.16b v[[DESTREGNUM:[0-9]+]], [[DESTV]], v[[REGNUM]]
	; REGULAR-NEXT: b [[LABELend:LBB.*]]
	; Next BB
	; REGULAR-NEXT: [[LABELelse]]
	; REGULAR-NEXT: adrp [[PAGEADDR:x[0-9]+]], [[CSTLABEL:lCP.*]]@PAGE
	; REGULAR-NEXT: ldr q[[DESTREGNUM]], {{\[}}[[PAGEADDR]], [[CSTLABEL]]@PAGEOFF]
	; Next BB
	; REGULAR-NEXT: [[LABELend]]:
	; REGULAR-NEXT: mul.16b [[TMP1:v[0-9]+]], v[[DESTREGNUM]], v[[DESTREGNUM]]
	; REGULAR-NEXT: mul.16b [[TMP2:v[0-9]+]], [[TMP1]], [[TMP1]]
	; REGULAR-NEXT: mul.16b [[TMP3:v[0-9]+]], [[TMP2]], [[TMP2]]
	; REGULAR-NEXT: mul.16b v0, [[TMP3]], [[TMP3]]
	; REGULAR-NEXT: ret
	entry:			entry:
	%tobool = icmp eq i32 %path, 0			%tobool = icmp eq i32 %path, 0
	br i1 %tobool, label %if.end, label %if.then			br i1 %tobool, label %if.end, label %if.then

	if.then: ; preds = %entry			if.then: ; preds = %entry
	%add.i = add <16 x i8> %arg, <i8 -40, i8 -93, i8 -118, i8 -99, i8 -75, i8 -105, i8 74, i8 -110, i8 62, i8 -115, i8 -119, i8 -120, i8 34, i8 -124, i8 0, i8 -128>			%add.i = add <16 x i8> %arg, <i8 -40, i8 -93, i8 -118, i8 -99, i8 -75, i8 -105, i8 74, i8 -110, i8 62, i8 -115, i8 -119, i8 -120, i8 34, i8 -124, i8 0, i8 -128>
	%mul.i26 = mul <16 x i8> %add.i, <i8 -40, i8 -93, i8 -118, i8 -99, i8 -75, i8 -105, i8 74, i8 -110, i8 62, i8 -115, i8 -119, i8 -120, i8 34, i8 -124, i8 0, i8 -128>			%mul.i26 = mul <16 x i8> %add.i, <i8 -40, i8 -93, i8 -118, i8 -99, i8 -75, i8 -105, i8 74, i8 -110, i8 62, i8 -115, i8 -119, i8 -120, i8 34, i8 -124, i8 0, i8 -128>
	br label %if.end			br label %if.end
	Show All 27 Lines

test/CodeGen/Thumb2/ifcvt-neon.ll

	; RUN: llc -mtriple=thumb-eabi -mcpu=cortex-a8 %s -o - \| FileCheck %s			; RUN: llc -mtriple=thumb-eabi -mcpu=cortex-a8 %s -o - \| FileCheck %s
	; rdar://7368193			; rdar://7368193

	@a = common global float 0.000000e+00 ; <float*> [#uses=2]			@a = common global float 0.000000e+00 ; <float*> [#uses=2]
	@b = common global float 0.000000e+00 ; <float*> [#uses=1]			@b = common global float 0.000000e+00 ; <float*> [#uses=1]

	define float @t(i32 %c) nounwind {			define float @t(i32 %c) nounwind {
	entry:			entry:
	%0 = icmp sgt i32 %c, 1 ; <i1> [#uses=1]			%0 = icmp sgt i32 %c, 1 ; <i1> [#uses=1]
	%1 = load float* @a, align 4 ; <float> [#uses=2]			%1 = load float* @a, align 4 ; <float> [#uses=2]
	%2 = load float* @b, align 4 ; <float> [#uses=2]			%2 = load float* @b, align 4 ; <float> [#uses=2]
	br i1 %0, label %bb, label %bb1			br i1 %0, label %bb, label %bb1

	bb: ; preds = %entry			bb: ; preds = %entry
	; CHECK: ite lt			; CHECK: vsub.f32
	; CHECK: vsublt.f32			; CHECK-NEXT: vadd.f32
	; CHECK-NEXT: vaddge.f32			; CHECK: it gt
	%3 = fadd float %1, %2 ; <float> [#uses=1]			%3 = fadd float %1, %2 ; <float> [#uses=1]
	br label %bb2			br label %bb2

	bb1: ; preds = %entry			bb1: ; preds = %entry
	%4 = fsub float %1, %2 ; <float> [#uses=1]			%4 = fsub float %1, %2 ; <float> [#uses=1]
	br label %bb2			br label %bb2

	bb2: ; preds = %bb1, %bb			bb2: ; preds = %bb1, %bb
	%storemerge = phi float [ %4, %bb1 ], [ %3, %bb ] ; <float> [#uses=2]			%storemerge = phi float [ %4, %bb1 ], [ %3, %bb ] ; <float> [#uses=2]
	store float %storemerge, float* @a			store float %storemerge, float* @a
	ret float %storemerge			ret float %storemerge
	}			}
