This is an archive of the discontinued LLVM Phabricator instance.

Differential D129178

[RISCV] Enable the GlobalMerge pass by default
Needs Review · Public

Authored by asb on Jul 6 2022, 2:19 AM.

Details

Reviewers

craig.topper
reames
kito-cheng
luismarques

Summary

As a follow-up to D130481, this patch enables GlobalMerge by default. Posting for comment / testing rather than seriously suggesting we go ahead and enable this yet (the earliest we would possibly do this is after the LLVM 15 branch).

Note: an earlier version of this review both added support for GlobalMerge and enabled it by default. This has now been split out.

Event Timeline

asb created this revision. Jul 6 2022, 2:19 AM

Herald added a project: Restricted Project. Jul 6 2022, 2:19 AM

Herald added subscribers: wingo, sunshaoce, pmatos and 30 others.

asb requested review of this revision. Jul 6 2022, 2:19 AM

Herald added subscribers: pcwang-thead, eopXD, MaskRay.

Harbormaster completed remote builds in B173830: Diff 442476. Jul 6 2022, 3:39 AM

Allen added a subscriber: Allen. Jul 6 2022, 6:24 PM

asb mentioned this in D129686: [RISCV] Reuse a materialised global address in preference to merging into a load/store. Jul 13 2022, 12:30 PM

asb added a child revision: D129686: [RISCV] Reuse a materialised global address in preference to merging into a load/store. Jul 13 2022, 12:47 PM

Ping and rebase.

Any thoughts? One suggestion would be to tweak this patch so it adds the global merging tests and command-line option but leaves the pass off by default, and we can revisit enabling it after the 15.x branch takes place.

Harbormaster completed remote builds in B176203: Diff 445746. Jul 19 2022, 3:33 AM

This patch seems to prevent GP linker relaxation on dhrystone. Presumably because the merged symbols are no longer in the .sbss section?

asb mentioned this in D130481: [RISCV] Add the GlobalMerge pass (disabled by default). Jul 25 2022, 6:25 AM

Change this patch so it follows on from the newly split D130481 (which adds support for GlobalMerge). This allows us to separately review and merge basic support vs changes to the default pass pipeline.

asb added a parent revision: D130481: [RISCV] Add the GlobalMerge pass (disabled by default). Jul 25 2022, 6:28 AM

asb retitled this revision from [RISCV] Enable the GlobalMerge pass for RISC-V to [RISCV] Enable the GlobalMerge pass by default.

Harbormaster completed remote builds in B177359: Diff 447306. Jul 25 2022, 7:18 AM

In D129178#3669396, @craig.topper wrote:

This patch seems to prevent GP linker relaxation on dhrystone. Presumably because the merged symbols are no longer in the .sbss section?

Yeah, probably because of that (or .sdata?). It seems the heuristic of being small is failing as a predictor of what you want to put in the relaxation range. I guess the actual metric you want to optimize for is the total number of symbol references that you can move from the general address range to the optimizable address range? Not the kind of thing you can do from a linker script... How big is the regression? For comparison, Embench O3 and Oz have code size improvements of 3.35% and 2.92%, respectively (D129686 doesn't seem to change that at all).
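
To make the suggested metric concrete, here is a minimal sketch. It assumes only that a gp-relative access reaches globals within a signed 12-bit offset of gp; the `SymbolUse` record and both function names are hypothetical, and a real linker applies further conditions before relaxing a reference.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record: one global, its distance from gp in bytes, and the
// number of address-materialising references the program makes to it.
struct SymbolUse {
  int64_t OffsetFromGP;
  unsigned NumRefs;
};

// A gp-relative load/store or addi carries a signed 12-bit immediate.
constexpr bool fitsGPRelative(int64_t Offset) {
  return Offset >= -2048 && Offset <= 2047;
}

// The metric described above: total references (not symbols) that land in
// the relaxable window for a given layout.
inline unsigned countRelaxableRefs(const std::vector<SymbolUse> &Syms) {
  unsigned Total = 0;
  for (const SymbolUse &S : Syms)
    if (fitsGPRelative(S.OffsetFromGP))
      Total += S.NumRefs;
  return Total;
}
```

Under this model a layout is better when it scores higher, which is why a small but rarely referenced global is a worse candidate for the window than a larger, frequently referenced one.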

llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
213–217	Please add a comment describing in prose when the option is enabled.

craig.topper mentioned this in rG51ae462447d9: [RISCV] Add the GlobalMerge pass (disabled by default). Sep 8 2022, 6:51 PM

The following is basically a brain dump on a few things vaguely related to GlobalMerge for RISCV. This isn't a review comment on this review per se. Some of this came from discussion w/Palmer because I nerd sniped myself into thinking about this a bit too hard, and he was willing to brainstorm with me. I then did the same to @craig.topper a bit later, and edited in some further changes.

Profitability wise, we have three known cases.

Case 1 is where the alignment guarantees the second address could fold into the consuming load/store instruction. The simplest case would be to restrict to when at least one of the globals being merged had a sufficiently large alignment. https://reviews.llvm.org/D129686#inline-1380320 has some brainstorming on a more advanced boundary align mechanism, but building that out is likely non-trivial. There have been some other use cases for analogous features in the past, but I don't have details.

Case 2 is when we have three or more accesses using the same global (regardless of alignment). In this case, we only need one lui/addi pair + one access with a small folded offset for each of the original accesses. This is a one-instruction saving for each additional access.

Case 3 is a size optimization only. This is Alex's https://reviews.llvm.org/D129686 and is geared at using compressed instructions to share common addresses.
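
The arithmetic behind cases 1 and 2 can be sketched as follows. This is an illustration under stated assumptions, not code from the patch: it assumes medlow-style `%hi`/`%lo` addressing with non-negative addresses, the helper names are hypothetical, and the instruction counts ignore everything except address materialisation and the accesses themselves. It needs C++11 or later.

```cpp
#include <cstdint>

// %hi/%lo split: lui supplies Hi * 4096 and the load/store immediate
// supplies the sign-extended low 12 bits, so Hi is rounded to nearest.
constexpr int64_t hi20(int64_t Addr) { return (Addr + 0x800) >> 12; }
constexpr int64_t lo12(int64_t Addr) { return Addr - hi20(Addr) * 4096; }

// Case 1: an access at Base + Offset can reuse the lui emitted for Base only
// if the combined immediate still fits in 12 signed bits.
constexpr bool foldsIntoSameLui(int64_t Base, int64_t Offset) {
  return lo12(Base) + Offset >= -2048 && lo12(Base) + Offset <= 2047;
}

// An 8-byte-aligned base always leaves room for a +4 offset ...
static_assert(foldsIntoSameLui(0x10008, 4), "aligned base folds");
// ... but a base that is only 4-byte aligned may sit right at the boundary.
static_assert(!foldsIntoSameLui(0x107fc, 4), "immediate would be 2048");

// Case 2: N accesses to N distinct globals in one function.
// Unmerged: one lui plus one access per global.
constexpr unsigned separateCost(unsigned N) { return 2 * N; }
// Merged: one lui/addi pair, then one access with a small offset each.
constexpr unsigned mergedCost(unsigned N) { return 2 + N; }

static_assert(separateCost(2) == mergedCost(2), "break-even at two");
static_assert(separateCost(3) - mergedCost(3) == 1, "one saved at three");
```

In this model the merged form breaks even at two accesses and saves one instruction for each access beyond that, which matches the "three or more" threshold in case 2.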

For the GP interaction, we may want to take a close look at how gcc models global merging vs how we do. Per Palmer, it keeps around the symbols for each global, and that may impact the heuristic that LD uses for selecting globals to place near GP. We may be able to massage our output a bit to line up with the existing heuristics.

There's a question of how worthwhile this is. For anything beyond static builds with medlow, we need to worry about pc-relative addresses. Out of the three known profitable cases above, cases 2 and 3 apply to pc-relative sequences without knowing the alignment of the auipc, but case 1 does not. For case 1, we'd need to additionally account for the alignment of the auipc. We could potentially insert an align directive, but that wastes space. Per Palmer, there was some previous discussion around a relocation type for an optimized "aligned auipc" construct which used (at most) a single extra instruction. However, no one has pushed this forward.

My current thinking is that we should probably enable this for code size minimization only, and return to it at a later point.

Herald added a subscriber: luke. Jan 30 2023, 2:08 PM

In D129178#4091957, @reames wrote:

The following is basically a brain dump on a few things vaguely related to GlobalMerge for RISCV. This isn't a review comment on this review per se. Some of this came from discussion w/Palmer because I nerd sniped myself into thinking about this a bit too hard, and he was willing to brainstorm with me.

Profitability wise, we could potentially restrict GM to cases where the alignment guarantees the second address could fold into the consuming load/store instruction. The simplest case would be to restrict to when at least one of the globals being merged had a sufficiently large alignment. https://reviews.llvm.org/D129686#inline-1380320 has some brainstorming on a more advanced boundary align mechanism, but building that out is likely non-trivial. There have been some other use cases for analogous features in the past, but I don't have details.

For the GP interaction, we may want to take a close look at how gcc models global merging vs how we do. Per Palmer, it keeps around the symbols for each global, and that may impact the heuristic that LD uses for selecting globals to place near GP. We may be able to massage our output a bit to line up with the existing heuristics.

There's a question of how worthwhile this is. For anything beyond static builds with medlow, we need to worry about pc-relative addresses. Given that, we'd need to additionally account for the alignment of the auipc. We could potentially insert an align directive, but that wastes space. Per Palmer, there was some previous discussion around a relocation type for an optimized "aligned auipc" construct which used (at most) a single extra instruction. However, no one has pushed this forward.

My current thinking is that we should probably enable this for code size minimization only, and return to it at a later point.

We saw regressions in 400.perlbench and 458.sjeng from SPEC2006INT on one of our cores when enabling GlobalMerge. I suspect this was due to a loss of GP relaxation.

liaolucy added a subscriber: liaolucy. Feb 1 2023, 11:57 PM

Revision Contents

Path                                               Size
llvm/lib/Target/RISCV/RISCVTargetMachine.cpp       10 lines
llvm/test/CodeGen/RISCV/O3-pipeline.ll             1 line
llvm/test/CodeGen/RISCV/global-merge-minsize.ll    43 lines

Diff 447306

llvm/lib/Target/RISCV/RISCVTargetMachine.cpp

 bool RISCVPassConfig::addPreISel() {
   if (TM->getOptLevel() != CodeGenOpt::None) {
     // Add a barrier before instruction selection so that we will not get
     // deleted block address after enabling default outlining. See D99707 for
     // more details.
     addPass(createBarrierNoopPass());
   }

-  if (EnableGlobalMerge == cl::BOU_TRUE) {
-    addPass(createGlobalMergePass(TM, /* MaxOffset */ 2047,
-                                  /* OnlyOptimizeForSize */ false,
+  if ((TM->getOptLevel() != CodeGenOpt::None &&
+       EnableGlobalMerge == cl::BOU_UNSET) ||
+      EnableGlobalMerge == cl::BOU_TRUE) {
+    bool OnlyOptimizeForSize = (TM->getOptLevel() < CodeGenOpt::Aggressive) &&
+                               (EnableGlobalMerge == cl::BOU_UNSET);
+    addPass(createGlobalMergePass(TM, /* MaxOffset */ 2047, OnlyOptimizeForSize,
                                   /* MergeExternalByDefault */ true));
   }

   return false;
 }

 bool RISCVPassConfig::addInstSelector() {
   addPass(createRISCVISelDag(getRISCVTargetMachine(), getOptLevel()));
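
As a reading aid for the condition changed in `addPreISel`, the sketch below models it as a standalone function. The two enums are stand-ins for `cl::boolOrDefault` and `CodeGenOpt::Level`, and `GlobalMergeDecision` and `decide` are hypothetical names that do not appear in the patch. It needs C++14 or later.

```cpp
// Stand-in for cl::boolOrDefault (BOU_UNSET, BOU_TRUE, BOU_FALSE).
enum class BoolOrDefault { Unset, True, False };
// Stand-in for CodeGenOpt::Level, in increasing order.
enum class OptLevel { None, Less, Default, Aggressive };

// Hypothetical summary of what the new condition decides.
struct GlobalMergeDecision {
  bool RunPass;             // is GlobalMerge added to the pipeline?
  bool OnlyOptimizeForSize; // if so, is it limited to size-optimised code?
};

constexpr GlobalMergeDecision decide(OptLevel Opt, BoolOrDefault Flag) {
  // Run when the option is unset and we are optimising, or when forced on.
  bool Run = (Opt != OptLevel::None && Flag == BoolOrDefault::Unset) ||
             Flag == BoolOrDefault::True;
  // Below the most aggressive level, and without an explicit request,
  // merging is restricted to functions optimised for size.
  bool SizeOnly = Opt < OptLevel::Aggressive && Flag == BoolOrDefault::Unset;
  return {Run, Run && SizeOnly};
}

static_assert(!decide(OptLevel::None, BoolOrDefault::Unset).RunPass, "");
static_assert(decide(OptLevel::Default, BoolOrDefault::Unset).OnlyOptimizeForSize, "");
static_assert(!decide(OptLevel::Aggressive, BoolOrDefault::Unset).OnlyOptimizeForSize, "");
static_assert(decide(OptLevel::None, BoolOrDefault::True).RunPass, "");
static_assert(!decide(OptLevel::Aggressive, BoolOrDefault::False).RunPass, "");
```

In prose: with the option unset, the pass runs at every level except the unoptimised one, and merges for all functions only at the aggressive level; passing the option explicitly as true runs it everywhere without the size restriction, and passing it as false disables it.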

llvm/test/CodeGen/RISCV/O3-pipeline.ll

 ; CHECK-NEXT: Expand reduction intrinsics
 ; CHECK-NEXT: Natural Loop Information
 ; CHECK-NEXT: TLS Variable Hoist
 ; CHECK-NEXT: CodeGen Prepare
 ; CHECK-NEXT: Dominator Tree Construction
 ; CHECK-NEXT: Exception handling preparation
 ; CHECK-NEXT: A No-Op Barrier Pass
 ; CHECK-NEXT: FunctionPass Manager
+; CHECK-NEXT: Merge internal globals
 ; CHECK-NEXT: Safe Stack instrumentation pass
 ; CHECK-NEXT: Insert stack protectors
 ; CHECK-NEXT: Module Verifier
 ; CHECK-NEXT: Dominator Tree Construction
 ; CHECK-NEXT: Basic Alias Analysis (stateless AA impl)
 ; CHECK-NEXT: Function Alias Analysis Results
 ; CHECK-NEXT: Natural Loop Information
 ; CHECK-NEXT: Post-Dominator Tree Construction

llvm/test/CodeGen/RISCV/global-merge-minsize.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=riscv32 -verify-machineinstrs < %s \
				; RUN: | FileCheck %s
				; RUN: llc -mtriple=riscv64 -verify-machineinstrs < %s \
				; RUN: | FileCheck %s

				@eg1 = dso_local global i32 0, align 4
				@eg2 = dso_local global i32 0, align 4
				@eg3 = dso_local global i32 0, align 4
				@eg4 = dso_local global i32 0, align 4

				; Demonstrate that at the default optimisation level, global merging takes
				; place for globals referenced in minsize functions but not others.

				define void @f1(i32 %a) nounwind {
				; CHECK-LABEL: f1:
				; CHECK: # %bb.0:
				; CHECK-NEXT: lui a1, %hi(eg1)
				; CHECK-NEXT: sw a0, %lo(eg1)(a1)
				; CHECK-NEXT: lui a1, %hi(eg2)
				; CHECK-NEXT: sw a0, %lo(eg2)(a1)
				; CHECK-NEXT: ret
				store i32 %a, ptr @eg1, align 4
				store i32 %a, ptr @eg2, align 4
				ret void
				}

				; TODO: It would be better for code size to alter the first store below by
				; first fully materialising .L_MergedGlobals in a1 and then storing to it with
				; a 0 offset.

				define void @f2(i32 %a) nounwind minsize optsize {
				; CHECK-LABEL: f2:
				; CHECK: # %bb.0:
				; CHECK-NEXT: lui a1, %hi(.L_MergedGlobals)
				; CHECK-NEXT: sw a0, %lo(.L_MergedGlobals)(a1)
				; CHECK-NEXT: addi a1, a1, %lo(.L_MergedGlobals)
				; CHECK-NEXT: sw a0, 4(a1)
				; CHECK-NEXT: ret
				store i32 %a, ptr @eg3, align 4
				store i32 %a, ptr @eg4, align 4
				ret void
				}