Page MenuHomePhabricator

Merge of Global Constants not happening on Aarch64
Needs ReviewPublic

Authored by ramred01 on Wed, May 15, 7:36 AM.

Details

Summary

Merge of global constants does not happen on Aarch64 even when the constants are used in successive instructions. It generates two separate labels for the two constants and hence materializes the two label addresses separately before loading the two constants. Instead if it were to merg the two constants and create a common label, it would have meant loading the label address once in a base register and then loading the constants using an offset from that label.

The reason phenomenon is seen on Aarch64 but not on arm-none-eabi is because merging of external globals is not enabled by default for the GlobalMerge Pass on Aarch64, when it should ideally be. It is disabled only for Mach-O systems where we emit the .subsections_by_symbols directive and it is not safe to merge external globals. This patch enables the MergeExternalBydefault for the GlobalMerge pass.

Diff Detail

Event Timeline

ramred01 created this revision.Wed, May 15, 7:36 AM

This probably makes sense (the change basically just makes this code match 32-bit ARM), but I'd like to see codesize and performance numbers, since this will substantially change code generation in a lot of cases.

ramred01 added a comment.EditedFri, May 24, 8:42 AM

This probably makes sense (the change basically just makes this code match 32-bit ARM), but I'd like to see codesize and performance numbers, since this will substantially change code generation in a lot of cases.

Here are the performance numbers for the LNT and SPEC Benchmarks:

`Performance Regressions - execution_time
===========================

External/SPEC/CINT2000/164.gzip/164.gzip                   2.13%



Performance Improvements - execution_time 
=============================	

MultiSource/Benchmarks/VersaBench/8b10b/8b10b         	  -8.50%
SingleSource/Benchmarks/McGill/queens                     -5.98%
SingleSource/Benchmarks/Misc/himenobmtxpa                 -3.99%
MultiSource/Applications/aha/aha                          -3.17%
SingleSource/Benchmarks/Adobe-C++/loop_unroll             -2.23% 
MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl   -2.10%
External/SPEC/CINT2000/186.crafty/186.crafty              -1.46% 
MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl   -1.21% 
SingleSource/Benchmarks/Misc/richards_benchmark           -1.10%



Performance Improvements - mem_bytes 	
===========================

SingleSource/Benchmarks/Misc/flops-5                     -12.00% 
SingleSource/Benchmarks/Misc/flops-8                     -11.93%
SingleSource/Benchmarks/Misc/flops                       -11.64%`

As can be seen, except for a performance degradation of 2.13% in gzip, there is 1-8.5% performance improvements in execution time and about 12% in code size for some of the benchmarks, the last one being more relevant since we are optimizing for code size.

These numbers were obtained by running the benchmarks with -Oz (and not -O3), both for the baseline run as well as with the patch.

A 2% regression in gzip is concerning; SPEC is relatively important, and the numbers are usually pretty stable. Is the regression reproducible? Do you have any idea what's causing it?

t.p.northover added inline comments.
test/CodeGen/AArch64/global_merge_aarc64_ac6.ll
25–35

There's no need to include most of this extra stuff in the test. You might need to replace the #0 on the definition of @func with minsize though.