This is an archive of the discontinued LLVM Phabricator instance.

Differential D61947

[AArch64] Merge globals when optimising for size
ClosedPublic

Authored by SjoerdMeijer on May 15 2019, 7:36 AM.

Details

Reviewers

efriedma
Jiangning
john.brawn
ab
ramred01

Commits

rGde73404b8c43: [AArch64] Merge globals when optimising for size
rL363130: [AArch64] Merge globals when optimising for size

Summary

Global constants are not merged on AArch64, even when they are used in successive instructions. The compiler emits a separate label for each of the two constants and therefore materializes the two label addresses separately before loading the constants. If it instead merged the two constants under a common label, it would load the label address once into a base register and then load each constant at an offset from that label.

This phenomenon is seen on AArch64 but not on arm-none-eabi because merging of external globals is not enabled by default for the GlobalMerge pass on AArch64, when ideally it should be. It is disabled only for Mach-O systems, where we emit the .subsections_via_symbols directive and it is not safe to merge external globals. This patch enables MergeExternalByDefault for the GlobalMerge pass.
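
As a rough source-level picture of the transformation described above, the following sketch contrasts two independent globals with a single merged object. The `MergedGlobals` struct and the function names are hypothetical illustrations, not what GlobalMerge literally emits; the pass works on LLVM IR and produces a private merged symbol.

```cpp
#include <cstddef>
#include <cstdint>

// Unmerged: two independent symbols. Each access needs its own
// address materialisation (an adrp plus :lo12: pair on AArch64).
int32_t global0 = 0;
int32_t global1 = 0;

// Hypothetical picture of the merged layout: one base symbol, with
// each original global living at a fixed offset from it.
struct MergedGlobals {
  int32_t global0; // offset 0
  int32_t global1; // offset 4
};
MergedGlobals merged = {0, 0};

// Two symbols, so two addresses have to be formed.
int32_t sumSeparate() { return global0 + global1; }

// One base address, two loads at small constant offsets.
int32_t sumMerged() { return merged.global0 + merged.global1; }
```

With the merged layout, both loads can share one base register, which is what makes a paired load possible in the test added by this patch.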

Diff Detail

Repository: rL LLVM

Event Timeline

ramred01 created this revision. May 15 2019, 7:36 AM

Herald added subscribers: kristof.beyls, javed.absar. May 15 2019, 7:36 AM

This probably makes sense (the change basically just makes this code match 32-bit ARM), but I'd like to see codesize and performance numbers, since this will substantially change code generation in a lot of cases.

In D61947#1503450, @efriedma wrote:

This probably makes sense (the change basically just makes this code match 32-bit ARM), but I'd like to see codesize and performance numbers, since this will substantially change code generation in a lot of cases.

Here are the performance numbers for the LNT and SPEC Benchmarks:

`Performance Regressions - execution_time
========================================

External/SPEC/CINT2000/164.gzip/164.gzip                                       2.13%

Performance Improvements - execution_time
=========================================

MultiSource/Benchmarks/VersaBench/8b10b/8b10b                                 -8.50%
SingleSource/Benchmarks/McGill/queens                                         -5.98%
SingleSource/Benchmarks/Misc/himenobmtxpa                                     -3.99%
MultiSource/Applications/aha/aha                                              -3.17%
SingleSource/Benchmarks/Adobe-C++/loop_unroll                                 -2.23%
MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl                       -2.10%
External/SPEC/CINT2000/186.crafty/186.crafty                                  -1.46%
MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl   -1.21%
SingleSource/Benchmarks/Misc/richards_benchmark                               -1.10%

Performance Improvements - mem_bytes
====================================

SingleSource/Benchmarks/Misc/flops-5                                         -12.00%
SingleSource/Benchmarks/Misc/flops-8                                         -11.93%
SingleSource/Benchmarks/Misc/flops                                           -11.64%`

As can be seen, apart from a 2.13% regression in gzip, execution time improves by 1-8.5% and code size by about 12% for some of the benchmarks. The code-size result is the more relevant one, since we are optimizing for code size.

These numbers were obtained by running the benchmarks with -Oz (and not -O3), both for the baseline run as well as with the patch.

A 2% regression in gzip is concerning; SPEC is relatively important, and the numbers are usually pretty stable. Is the regression reproducible? Do you have any idea what's causing it?

t.p.northover added a subscriber: t.p.northover. May 25 2019, 1:33 AM

t.p.northover added inline comments.

test/CodeGen/AArch64/global_merge_aarc64_ac6.ll
Lines 25–35 (on Diff #199605): There's no need to include most of this extra stuff in the test. You might need to replace the `#0` on the definition of `@func` with `minsize` though.

SjoerdMeijer commandeered this revision. Jun 11 2019, 9:04 AM

SjoerdMeijer added a reviewer: ramred01.

I have enabled this only when we optimise for code size. The performance results show that there's potential, but as pointed out, there is this SPEC regression. At the moment we are interested in this patch for code-size reasons, and it shows good improvements there. I've left a FIXME noting that it would be worth investigating the regression so that this could be enabled for performance too in a follow-up patch.

Herald added a project: Restricted Project. Jun 11 2019, 9:10 AM

Herald added a subscriber: hiraditya.

SjoerdMeijer updated this revision to Diff 204085. Jun 11 2019, 9:24 AM

LGTM

This revision is now accepted and ready to land. Jun 11 2019, 2:48 PM

Closed by commit rL363130: [AArch64] Merge globals when optimising for size (authored by SjoerdMeijer). Jun 12 2019, 1:25 AM

This revision was automatically updated to reflect the committed changes.
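
The gating that was agreed on in this thread (external merging on by default only when optimising for size, and never on Mach-O) can be modelled as a small standalone function. This is a sketch, not the committed code: `OptLevel`, `BoolOrDefault`, `GlobalMergeConfig`, and `decideGlobalMerge` are hypothetical stand-ins for LLVM's `CodeGenOpt::Level`, `cl::boolOrDefault`, and the arguments passed to `createGlobalMergePass`.

```cpp
// Hypothetical stand-ins for LLVM's CodeGenOpt::Level and
// cl::boolOrDefault; the real types live in LLVM headers.
enum class OptLevel { None, Less, Default, Aggressive };
enum class BoolOrDefault { Unset, True, False };

// The decisions the patched addPreISel() makes, gathered in one place.
struct GlobalMergeConfig {
  bool RunPass;
  bool OnlyOptimizeForSize;
  bool MergeExternalByDefault;
};

GlobalMergeConfig decideGlobalMerge(OptLevel Opt, BoolOrDefault Enable,
                                    bool IsMachO) {
  GlobalMergeConfig C{false, false, false};

  // The pass runs by default whenever optimisation is on, or when
  // it is explicitly requested.
  C.RunPass = (Opt != OptLevel::None && Enable == BoolOrDefault::Unset) ||
              Enable == BoolOrDefault::True;
  if (!C.RunPass)
    return C;

  // Below the most aggressive level, and without an explicit request,
  // merging is restricted to the size-optimising case.
  C.OnlyOptimizeForSize =
      Opt < OptLevel::Aggressive && Enable == BoolOrDefault::Unset;

  // External globals: never on Mach-O, and (per the FIXME) only when
  // optimising for size.
  C.MergeExternalByDefault = !IsMachO && C.OnlyOptimizeForSize;
  return C;
}
```

Under this model, an ELF target at the default optimisation level gets external merging, while the same target at the aggressive level or any Mach-O target does not.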

Revision Contents

Path                                                      Size
llvm/trunk/lib/Target/AArch64/AArch64TargetMachine.cpp    15 lines
llvm/trunk/test/CodeGen/AArch64/global-merge-minsize.ll   21 lines

Diff 204230

llvm/trunk/lib/Target/AArch64/AArch64TargetMachine.cpp

@@ bool AArch64PassConfig::addPreISel() {
     // FIXME: On AArch64, this depends on the type.
     // Basically, the addressable offsets are up to 4095 * Ty.getSizeInBytes().
     // and the offset has to be a multiple of the related size in bytes.
     if ((TM->getOptLevel() != CodeGenOpt::None &&
          EnableGlobalMerge == cl::BOU_UNSET) ||
         EnableGlobalMerge == cl::BOU_TRUE) {
       bool OnlyOptimizeForSize = (TM->getOptLevel() < CodeGenOpt::Aggressive) &&
                                  (EnableGlobalMerge == cl::BOU_UNSET);
-      addPass(createGlobalMergePass(TM, 4095, OnlyOptimizeForSize));
+      // Merging of extern globals is enabled by default on non-Mach-O as we
+      // expect it to be generally either beneficial or harmless. On Mach-O it
+      // is disabled as we emit the .subsections_via_symbols directive which
+      // means that merging extern globals is not safe.
+      bool MergeExternalByDefault = !TM->getTargetTriple().isOSBinFormatMachO();
+
+      // FIXME: extern global merging is only enabled when we optimise for size
+      // because there are some regressions with it also enabled for performance.
+      if (!OnlyOptimizeForSize)
+        MergeExternalByDefault = false;
+
+      addPass(createGlobalMergePass(TM, 4095, OnlyOptimizeForSize,
+                                    MergeExternalByDefault));
     }

     return false;
   }

   bool AArch64PassConfig::addInstSelector() {
     addPass(createAArch64ISelDag(getAArch64TargetMachine(), getOptLevel()));
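
The first FIXME in addPreISel() notes that the flat 4095 limit passed to the pass is an approximation: the addressable offset actually scales with the access size and must be a multiple of it. A minimal sketch of that constraint, assuming the scaled unsigned 12-bit immediate form of the load/store instructions, might look like this. The helper name `fitsScaledUnsignedOffset` is hypothetical and does not exist in LLVM.

```cpp
#include <cstdint>

// Returns true if Offset can be expressed as an unsigned 12-bit
// immediate scaled by the access size: it must be a multiple of
// SizeInBytes and at most 4095 * SizeInBytes.
bool fitsScaledUnsignedOffset(uint64_t Offset, uint64_t SizeInBytes) {
  if (SizeInBytes == 0)
    return false; // no meaningful access size
  return Offset % SizeInBytes == 0 && Offset / SizeInBytes <= 4095;
}
```

For a 4-byte access this admits offsets up to 16380, so a fixed limit of 4095 is conservative for wider types.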

llvm/trunk/test/CodeGen/AArch64/global-merge-minsize.ll

; RUN: llc %s -o - -verify-machineinstrs | FileCheck %s

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-arm-none-eabi"

@global0 = dso_local local_unnamed_addr global i32 0, align 4
@global1 = dso_local local_unnamed_addr global i32 0, align 4

define dso_local i32 @func() minsize optsize {
; CHECK-LABEL: @func
; CHECK: adrp x8, .L_MergedGlobals
; CHECK-NEXT: add x8, x8, :lo12:.L_MergedGlobals
; CHECK-NEXT: ldp w9, w8, [x8]
; CHECK-NEXT: add w0, w8, w9
; CHECK-NEXT: ret
entry:
  %0 = load i32, i32* @global0, align 4
  %1 = load i32, i32* @global1, align 4
  %add = add nsw i32 %1, %0
  ret i32 %add
}