This is an archive of the discontinued LLVM Phabricator instance.

Differential D156235

[MachineBlockPlacement] Remove the pad limit for no-fallthrough loops
Needs Review · Public

Authored by chill on Jul 25 2023, 6:32 AM.

Details

Reviewers

efriedma
dmgreen
xen0n
SixWeining

Summary

This patch removes the limit on how many padding bytes are allowed to
be inserted in order to align loop blocks that have no fallthrough
edges into them and are either a loop header or are preceded in the
layout by a block in a different loop.

This change gives some small performance improvements on AArch64 and
also makes benchmark results less susceptible to variations due to
block placement.

Diff Detail

Unit Tests: Failed

Time	Test
340 ms	x64 debian > LLVM.CodeGen/RISCV::attributes.ll

Event Timeline

chill created this revision. · Jul 25 2023, 6:32 AM

Herald added a project: Restricted Project. · Jul 25 2023, 6:32 AM

Herald added subscribers: StephenFan, hiraditya, kristof.beyls.

chill requested review of this revision. · Jul 25 2023, 6:32 AM

Herald added a project: Restricted Project. · Jul 25 2023, 6:32 AM

Herald added a subscriber: llvm-commits.

efriedma wrote:

(Pretty far outside my area of expertise)

I'm having a hard time following what changes are actually semantically significant here. Can you split the "remove MBB argument from getMaxPermittedBytesForAlignment" part into a separate patch?

dmgreen wrote:

I think if someone specifies -max-bytes-for-alignment, the command line argument should not be ignored for non-fallthrough loop headers. It should ideally take precedence over this heuristic.

I would also make this AArch64 specific, as it has not been verified on any other architectures. That is debatable though, it just might be using more space than is desirable. If that can't be done via a basic block arg, maybe getMaxPermittedBytesForAlignment could take a bool indicating whether it is non-fallthrough loop header.

Harbormaster completed remote builds in B247975: Diff 543946. · Jul 25 2023, 1:23 PM

chill updated this revision to Diff 544327. · Jul 26 2023, 5:44 AM

chill removed a reviewer: nikic.

chill added a parent revision: D156324: [MachineBlockPlacement] Respect target limits on padding amount when aligning all blocks.

In D156235#4532640, @efriedma wrote:

> I'm having a hard time following what changes are actually semantically significant here. Can you split the "remove MBB argument from getMaxPermittedBytesForAlignment" part into a separate patch?

Done.

In D156235#4532920, @dmgreen wrote:

> I think if someone specifies -max-bytes-for-alignment, the command line argument should not be ignored for non-fallthrough loop headers. It should ideally take precedence over this heuristic.

Done.

> I would also make this AArch64 specific, as it has not been verified on any other architectures. That is debatable though, it just might be using more space than is desirable. If that can't be done via a basic block arg, maybe getMaxPermittedBytesForAlignment could take a bool indicating whether it is non-fallthrough loop header.

The behaviour already exists on all the other architectures (except LoongArch) since they keep TargetLoweringBase::MaxBytesForAlignment initialised to 0, i.e. no limit. Thus I would not expect regressions.
I've added people who touched loop alignment on LoongArch as reviewers.
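
As a rough illustration of what "0, i.e. no limit" means for padding, the sketch below models the arithmetic: how many bytes are needed to reach a 2^n boundary, and how many are emitted under a byte limit. The names `paddingNeeded` and `paddingEmitted` are hypothetical and not part of LLVM; the rule that nothing is emitted when the required padding exceeds a non-zero limit is assumed to match the behaviour of `.p2align n, , max`.

```cpp
#include <cstdint>

// Bytes needed to move `addr` up to the next multiple of 2^log2Align
// (0 if `addr` is already aligned). Assumes log2Align < 64.
std::uint64_t paddingNeeded(std::uint64_t addr, unsigned log2Align) {
  const std::uint64_t align = std::uint64_t{1} << log2Align;
  return (align - (addr & (align - 1))) & (align - 1);
}

// Bytes actually emitted under a limit. A limit of 0 is treated as
// "no limit" (assumption mirroring the convention described above);
// if the required padding exceeds a non-zero limit, no padding is emitted.
std::uint64_t paddingEmitted(std::uint64_t addr, unsigned log2Align,
                             unsigned maxBytes) {
  const std::uint64_t need = paddingNeeded(addr, log2Align);
  if (maxBytes != 0 && need > maxBytes)
    return 0;
  return need;
}
```

Under this model, a block at address 0x1004 with 32-byte alignment needs 28 bytes of padding: with a limit of 8 it stays unaligned, while with a limit of 0 it is padded in full.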

Harbormaster completed remote builds in B248234: Diff 544327. · Jul 26 2023, 10:44 AM

efriedma added inline comments. · Jul 26 2023, 11:48 AM

llvm/lib/CodeGen/MachineBlockPlacement.cpp
Line 2954: The `!LayoutPred->isSuccessor(ChainBB)` check already ensures the padding will never be executed. Given that, I guess the remaining checks here are to try to maximize icache hits. In that context, why are loop headers special? Do we care if LayoutPred is part of a subloop of L?

chill added inline comments. · Jul 27 2023, 2:36 AM

llvm/lib/CodeGen/MachineBlockPlacement.cpp
Line 2954:

> Given that, I guess the remaining checks here are to try to maximize icache hits.

Yes, I'd like to avoid excessive padding inside a loop.

> In that context, why are loop headers special?

They aren't really special; the check for the loop header is a shortcut to avoid diving into `getLoopFor()`. IIUC `findBestLoopTop` will either place the loop header first (so the layout predecessor will be in a different loop) or place another loop block in front of the header with a fallthrough to the header.

> Do we care if LayoutPred is part of a subloop of L?

I haven't thought about this case. I'm going to experiment with doing this only for innermost loops.

chill updated this revision to Diff 545084. · Jul 28 2023, 3:10 AM

chill marked an inline comment as done.

Harbormaster completed remote builds in B248797: Diff 545084. · Jul 28 2023, 3:38 AM

Ping?

dmgreen mentioned this in D124092: CodeGen: Remove MaxBytesForAlignment from MachineBasicBlock. · Sep 4 2023, 1:02 AM

Revision Contents

Path	Size
llvm/lib/CodeGen/MachineBlockPlacement.cpp	7 lines
llvm/test/CodeGen/AArch64/loop-align-limit.ll	43 lines
llvm/test/CodeGen/AArch64/merge-store-dependency.ll	6 lines

Diff 545084

llvm/lib/CodeGen/MachineBlockPlacement.cpp

[2,944 lines not shown]
   for (MachineBasicBlock *ChainBB : FunctionChain) {
     // Check for the existence of a non-layout predecessor which would benefit
     // from aligning this block.
     MachineBasicBlock *LayoutPred =
         &*std::prev(MachineFunction::iterator(ChainBB));

     // Force alignment if all the predecessors are jumps. We already checked
     // that the block isn't cold above.
     if (!LayoutPred->isSuccessor(ChainBB)) {
-      ChainBB->setAlignment(Align, MaxBytesForAlignment);
+      if (MaxBytesForAlignmentOverride.getNumOccurrences() == 0 &&
+          L->isInnermost() &&
+          (ChainBB == LoopHeader || MLI->getLoopFor(LayoutPred) != L))
+        ChainBB->setAlignment(Align, 0);
+      else
+        ChainBB->setAlignment(Align, MaxBytesForAlignment);
       continue;
     }

     // Align this block if the layout predecessor's edge into this block is
     // cold relative to the block. When this is true, other predecessors make up
     // all of the hot entries into the block and thus alignment is likely to be
     // important.
     BranchProbability LayoutProb =
[726 lines not shown]
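
The branch added in this hunk could be modelled in isolation roughly as follows. `BlockInfo` and `maxPadBytes` are hypothetical stand-ins for the pass's state, not LLVM types. The sketch assumes the caller has already established that the layout predecessor does not fall through into the block, and that `maxBytesForAlignment` already carries the `-max-bytes-for-alignment` value when that option was given.

```cpp
// Hypothetical summary of the facts the hunk consults for one block.
struct BlockInfo {
  int loopId;            // loop containing the block (L in the hunk)
  bool loopIsInnermost;  // L->isInnermost()
  bool isLoopHeader;     // ChainBB == LoopHeader
  int layoutPredLoopId;  // loop containing the layout predecessor
};

// Returns the byte limit to pass to setAlignment for a block with no
// fallthrough from its layout predecessor. 0 means "no limit".
unsigned maxPadBytes(const BlockInfo &b, bool overrideGiven,
                     unsigned maxBytesForAlignment) {
  const bool startsLoopInLayout =
      b.isLoopHeader || b.layoutPredLoopId != b.loopId;
  // The command-line override takes precedence over the heuristic.
  if (!overrideGiven && b.loopIsInnermost && startsLoopInLayout)
    return 0;
  return maxBytesForAlignment;
}
```

In this model the limit is lifted only for innermost loops and only when the option was not passed; every other case keeps the existing limit.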

llvm/test/CodeGen/AArch64/loop-align-limit.ll

This file was added.

; RUN: llc < %s | FileCheck %s -check-prefixes CHECK,CHECK-NOLIMIT
; RUN: llc -max-bytes-for-alignment=2 < %s | FileCheck %s -check-prefixes CHECK,CHECK-LIMIT
target triple = "aarch64-linux"

declare i1 @cond(i64, i64)
declare i32 @h(i32)

define i32 @g(ptr %a, i64 %n, i32 %d) "tune-cpu"="neoverse-v1" {
; CHECK-LABEL: g:
; CHECK: b .LBB0_2
; CHECK-NOLIMIT-NEXT: .p2align 5{{$}}
; CHECK-LIMIT-NEXT: .p2align 5, , 2
; CHECK-NEXT: // %if.end
entry:
  br label %loop

loop:
  %i = phi i64 [0, %entry], [%i.next, %if.end]
  %s = phi i32 [0, %entry], [%s.next, %if.end]
  %c = icmp slt i64 %i, %n
  br i1 %c, label %loop.body, label %exit

loop.body:
  %p = getelementptr i32, ptr %a, i64 %i
  %v = load i32, ptr %p
  %c1 = icmp slt i32 %d, 1
  br i1 %c1, label %if.then, label %if.end

if.then:
  %v0 = call i32 @h(i32 %v)
  br label %if.end

if.end:
  %w = phi i32 [%v0, %if.then], [%v, %loop.body]
  %s.next = add i32 %s, %w
  %i.next = add i64 %i, 1
  br label %loop


exit:
  ret i32 %s
}
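
The two check prefixes in this test differ only in the optional third operand of `.p2align`. The sketch below shows how the expected directive text relates to the byte limit, assuming a limit of 0 is printed with no third operand. `p2alignDirective` is a hypothetical helper written for illustration; it is not the AsmPrinter code.

```cpp
#include <string>

// Builds the directive text the CHECK lines expect for a given
// log2 alignment and byte limit (0 = no limit, so no third operand).
std::string p2alignDirective(unsigned log2Align, unsigned maxBytes) {
  std::string text = ".p2align " + std::to_string(log2Align);
  if (maxBytes != 0)
    text += ", , " + std::to_string(maxBytes);
  return text;
}
```

With a log2 alignment of 5, a limit of 0 yields the CHECK-NOLIMIT form and a limit of 2 yields the CHECK-LIMIT form.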

llvm/test/CodeGen/AArch64/merge-store-dependency.ll

[13 lines not shown]
 ; A53-NEXT: .cfi_def_cfa_offset 16
 ; A53-NEXT: .cfi_offset w19, -8
 ; A53-NEXT: .cfi_offset w30, -16
 ; A53-NEXT: .cfi_remember_state
 ; A53-NEXT: movi v0.2d, #0000000000000000
 ; A53-NEXT: mov x8, x0
 ; A53-NEXT: mov x19, x8
 ; A53-NEXT: mov w0, w1
-; A53-NEXT: mov w9, #256
+; A53-NEXT: mov w9, #256 // =0x100
 ; A53-NEXT: stp x2, x3, [x8, #32]
 ; A53-NEXT: mov x2, x8
 ; A53-NEXT: str q0, [x19, #16]!
 ; A53-NEXT: str w1, [x19]
-; A53-NEXT: mov w1, #4
+; A53-NEXT: mov w1, #4 // =0x4
 ; A53-NEXT: str q0, [x8]
 ; A53-NEXT: strh w9, [x8, #24]
 ; A53-NEXT: str wzr, [x8, #20]
 ; A53-NEXT: bl fcntl
 ; A53-NEXT: adrp x9, gv0
 ; A53-NEXT: add x9, x9, :lo12:gv0
 ; A53-NEXT: cmp x19, x9
 ; A53-NEXT: b.eq .LBB0_4
[10 lines not shown]
 ; A53-NEXT: bl foo
 ; A53-NEXT: adrp x8, gv1
 ; A53-NEXT: str x0, [x8, :lo12:gv1]
 ; A53-NEXT: ldp x30, x19, [sp], #16 // 16-byte Folded Reload
 ; A53-NEXT: .cfi_def_cfa_offset 0
 ; A53-NEXT: .cfi_restore w19
 ; A53-NEXT: .cfi_restore w30
 ; A53-NEXT: ret
-; A53-NEXT: .p2align 4, , 8
+; A53-NEXT: .p2align 4
 ; A53-NEXT: .LBB0_4: // %while.body.i.split
 ; A53-NEXT: // =>This Inner Loop Header: Depth=1
 ; A53-NEXT: .cfi_restore_state
 ; A53-NEXT: b .LBB0_4
 entry:
 tail call void @llvm.memset.p0.i64(ptr align 8 %fde, i8 0, i64 40, i1 false)
 %state = getelementptr inbounds %struct1, ptr %fde, i64 0, i32 4
 store i16 256, ptr %state, align 8
[147 lines not shown]