This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Merge globals when optimising for size
ClosedPublic

Authored by SjoerdMeijer on May 15 2019, 7:36 AM.

Details

Summary

Merge of global constants does not happen on Aarch64 even when the constants are used in successive instructions. It generates two separate labels for the two constants and hence materializes the two label addresses separately before loading the two constants. Instead if it were to merg the two constants and create a common label, it would have meant loading the label address once in a base register and then loading the constants using an offset from that label.

The reason phenomenon is seen on Aarch64 but not on arm-none-eabi is because merging of external globals is not enabled by default for the GlobalMerge Pass on Aarch64, when it should ideally be. It is disabled only for Mach-O systems where we emit the .subsections_by_symbols directive and it is not safe to merge external globals. This patch enables the MergeExternalBydefault for the GlobalMerge pass.

Diff Detail

Repository
rL LLVM

Event Timeline

ramred01 created this revision.May 15 2019, 7:36 AM

This probably makes sense (the change basically just makes this code match 32-bit ARM), but I'd like to see codesize and performance numbers, since this will substantially change code generation in a lot of cases.

ramred01 added a comment.EditedMay 24 2019, 8:42 AM

This probably makes sense (the change basically just makes this code match 32-bit ARM), but I'd like to see codesize and performance numbers, since this will substantially change code generation in a lot of cases.

Here are the performance numbers for the LNT and SPEC Benchmarks:

`Performance Regressions - execution_time
===========================

External/SPEC/CINT2000/164.gzip/164.gzip                   2.13%



Performance Improvements - execution_time 
=============================	

MultiSource/Benchmarks/VersaBench/8b10b/8b10b         	  -8.50%
SingleSource/Benchmarks/McGill/queens                     -5.98%
SingleSource/Benchmarks/Misc/himenobmtxpa                 -3.99%
MultiSource/Applications/aha/aha                          -3.17%
SingleSource/Benchmarks/Adobe-C++/loop_unroll             -2.23% 
MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl   -2.10%
External/SPEC/CINT2000/186.crafty/186.crafty              -1.46% 
MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl   -1.21% 
SingleSource/Benchmarks/Misc/richards_benchmark           -1.10%



Performance Improvements - mem_bytes 	
===========================

SingleSource/Benchmarks/Misc/flops-5                     -12.00% 
SingleSource/Benchmarks/Misc/flops-8                     -11.93%
SingleSource/Benchmarks/Misc/flops                       -11.64%`

As can be seen, except for a performance degradation of 2.13% in gzip, there is 1-8.5% performance improvements in execution time and about 12% in code size for some of the benchmarks, the last one being more relevant since we are optimizing for code size.

These numbers were obtained by running the benchmarks with -Oz (and not -O3), both for the baseline run as well as with the patch.

A 2% regression in gzip is concerning; SPEC is relatively important, and the numbers are usually pretty stable. Is the regression reproducible? Do you have any idea what's causing it?

t.p.northover added inline comments.
test/CodeGen/AArch64/global_merge_aarc64_ac6.ll
25–35 ↗(On Diff #199605)

There's no need to include most of this extra stuff in the test. You might need to replace the #0 on the definition of @func with minsize though.

SjoerdMeijer commandeered this revision.Jun 11 2019, 9:04 AM
SjoerdMeijer added a reviewer: ramred01.
SjoerdMeijer retitled this revision from Merge of Global Constants not happening on Aarch64 to [AArch64] Merge globals when optimising for size.

I have enabled this only when we optimise for code-size. The performance results show that there's potential, but as pointed out there is this SPEC regression. But at the moment, we are interested in this patch for code size reasons, and it shows good improvements. I've left a FIXME that it would be worth investigating the regression so that it could be enabled for performance too in a follow up patch.

Herald added a project: Restricted Project. · View Herald TranscriptJun 11 2019, 9:10 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
This revision is now accepted and ready to land.Jun 11 2019, 2:48 PM
This revision was automatically updated to reflect the committed changes.