This is an archive of the discontinued LLVM Phabricator instance.

Codegen: [X86] Set preferred loop alignment to 32 bytes.

Authored by iteratee on Apr 26 2016, 2:26 PM.



I recently discovered a performance regression in
test-suite/MultiSource/Benchmarks/Ptrdist/ks/ks.test due solely to a
change in alignment. If you add 16 nops at the top of
FindMaxGpAndSwap, performance drops by about 10%. If you then add an
alignment directive of 32 bytes at the top of the first loop of the
function, the performance comes back.

Verified on both Haswell and Sandybridge.
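To make the mechanics concrete, here is a small sketch (illustrative only, not part of the patch) of the padding a 32-byte loop-alignment directive implies. The offsets used below are made-up examples; the point is that inserting 16 bytes ahead of a 32-byte-aligned loop header moves it to the middle of a 32-byte block, and an alignment directive pads it back to a boundary:

```python
def pad_to_align(addr: int, align: int = 32) -> int:
    """Number of padding (nop) bytes needed to bring `addr` up to
    the next `align`-byte boundary (0 if already aligned)."""
    return (-addr) % align

# A loop header at a hypothetical offset 0x47 needs 25 bytes of
# padding to reach the next 32-byte boundary at 0x60.
assert pad_to_align(0x47) == 25

# A header at 0x40 is already aligned; inserting 16 bytes ahead of it
# (0x40 -> 0x50) strands it mid-way through a 32-byte block, which is
# the situation the 16-nop experiment above creates.
assert pad_to_align(0x40) == 0
assert pad_to_align(0x50) == 16
```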

Diff Detail


Event Timeline

iteratee updated this revision to Diff 55091.Apr 26 2016, 2:26 PM
iteratee retitled this revision from to Codegen: [X86] Set preferred loop alignment to 32 bytes..
iteratee updated this object.
iteratee added reviewers: craig.topper, echristo.
iteratee set the repository for this revision to rL LLVM.

As a default I think this is reasonable, and I don't know that we need to make it subtarget-dependent. Sandy Bridge is going back a ways at this point and Haswell is fairly recent; I don't know how performance stands on Broadwell, though.

Adding Quentin here for that perspective.


qcolombet edited edge metadata.Apr 27 2016, 10:39 AM

Hi Kyle,

What performance numbers do you get beyond just that benchmark?

I am trying to ascertain how broadly this will impact performance and thus whether it is worth doing.

My problem is that changing this attribute will have a non-negligible impact on code size (do you have numbers, BTW?), and I am not ready to take that hit without broader numbers. In particular, I think ks is known to be noisy, so I would just accept that regression and be done with it.


PS: Please add llvm-commits as a Subscriber.

Hi Quentin,

To be clear, we're seeing multiple benchmarks speed up by 10% because of this change. That said, size numbers would be good just so we know what we're getting into, though I don't think they should be a blocking factor here. If size is an issue, we can make loop alignment in general dependent upon optimization level, but for CPU optimization we don't want size to be the guiding factor (i.e. the patch is still fine, but we might want to change where we align loops).



mehdi_amini added inline comments.Apr 27 2016, 11:22 AM

The settings just above differ depending on OptSize; if there is an impact on code size, it makes sense to make the PrefLoopAlignment handling conditional in the same way.
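A sketch of what that suggestion amounts to (hypothetical policy, not the actual patch): pad hot loops to 32 bytes when optimizing for speed, but skip the padding entirely when optimizing for size, mirroring how the neighbouring settings already branch on OptSize.

```python
def pref_loop_alignment(opt_for_size: bool) -> int:
    """Hypothetical policy sketch: preferred loop alignment in bytes.

    32-byte alignment for speed builds; effectively no padding (1-byte
    alignment) under size optimization, so alignment nops never count
    against -Os/-Oz code size.
    """
    return 1 if opt_for_size else 32

# Speed builds pad loop headers to a 32-byte boundary.
assert pref_loop_alignment(opt_for_size=False) == 32
# Size builds add no padding at all.
assert pref_loop_alignment(opt_for_size=True) == 1
```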

I'd like to see more data showing the real benefit of this, isolated from other factors (read to the end to see what I mean), because I believe this 10% is a side effect and not a real consequence of the realignment.

First, ks is absolutely not a reliable test; I spent a long week on this test alone for the same issue, trying to align loops in this benchmark. At the time I was doing some performance tuning and noticed a 10% regression after one of my changes. It turned out that my A/B test was expanding the __FILE__ macro to a different size, and that led to performance swings on this test.

Here are my notes from last October:

I measured the time while replacing the __FILE__ macro with a string growing one byte at a time:

0->4: ~750ms
5->19: ~850ms
20->35: ~730ms
36->51: ~850ms
52->67: ~730ms
68->83: ~850ms
84->99: ~750ms
100->115: ~870ms

The pattern continues (I checked up to 1024): 16 bytes fast, then 16 bytes slow.
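The measurements above can be captured by a toy model (an assumption fitted to these notes, not Mehdi's actual analysis): growing the string shifts all subsequent code, and performance alternates between the two 16-byte halves of each 32-byte window. The 20-byte phase below is fitted to the bands starting at 20; the very first band (0->4) is one byte off from this model, which fits with the test being noisy.

```python
def band(extra_bytes: int) -> str:
    """Toy model of the timing bands: code shifted by `extra_bytes`
    lands in alternating 16-byte fast/slow halves of a 32-byte period.
    The 20-byte phase offset is fitted to the measurements above."""
    return "fast" if (extra_bytes - 20) % 32 < 16 else "slow"

# Reproduces the banding in the notes: 20->35 fast, 36->51 slow, ...
assert band(20) == "fast"
assert band(36) == "slow"
assert band(52) == "fast"
assert band(68) == "slow"
```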

My first thought was to increase the loop alignment, but it didn't provide great results all the time. I don't remember the details, but simply aligning the header of the hot loop, independently of the rest of the code, didn't help as much (it can help by side effect on the alignment of the rest of the code).

At some point, Bob heard that I was working on this test (ks) and pointed me to a bug report; I invite you to read Zia's answers there.
You should also check the slides he attached to the bug, which detail the issue in a very nice way.
Here is the relevant llvm-dev post about this:

In general, I'm not a big fan of blindly aligning loops (on any boundary), as this can cause random effects, both positive and negative.
It's very easy to come up with examples where aligning a loop causes regressions on certain architectures, specifically in loops that have control flow.

Having said that, I'd anticipate a change like this causing a fairly broad, essentially random impact on code, along with size bloat.

If the affected benchmark is important, it would be best to try to figure out the exact cause and see if there's a reasonable compiler heuristic that can be applied to target the specific issue (for many alignment-type regressions, there isn't one). If you need help with this part of the analysis because the cause is tricky to determine, I'd be happy to try and help.


After speaking with Kyle, I'm going to walk back my earlier comments. I was under the impression this helped more than the one benchmark; it turns out something else helps those.

So, just ignore me here :)



Is there anything else to discuss here, or should we close this revision?


I meant to abandon this patch.

So just close it in Phabricator (via the drop-down menu above the comment box).

iteratee abandoned this revision.May 18 2016, 11:20 AM