This is an archive of the discontinued LLVM Phabricator instance.

'CSE' of ADRP instructions used for load/store addressing
Needs ReviewPublic

Authored by Jiangning on Mar 30 2014, 8:18 PM.

Details

Reviewers
t.p.northover
Summary

Hi,

Attached patch is to do 'CSE' of ADRP instructions used for load/store addressing.

GCC always use common attribute to expose global symbols to other modules, but this seems not mandatory for most of the compilation scenarios (Let me know if I'm wrong). With -fno-common enabled, we could get a chance of merging global definitions within the current compilation module to be a single monolithic big structure. In this way, LLVM would be able to treat all global access as a field within the newly created big structure.

For global symbol access, we usually have the following instruction sequence,

adrp ; load page address (4K boundary)
add ; add the address within the page
load/store ; use the address from the last add

If the big structure can be always fit into a page, we would be able to see fewer adrp instructions.

LLVM already has a GlobalMerge Pass, and originally it only tries to merge static variables, so now this patch is to get it extended to support external variables for AArch64.

There are some other considerations,

  1. In theory, ADRP reduction may not always be good, because
  2. The live range holding page address would increase a lot, and register pressure can be increased accordingly.
  3. The data section may have some holes, and data cache behavior may be changed accordingly.
  4. The maximum offset being supported by AArch64 ldr/str instruction actually is 4096*Sizeof(elem_type), which is larger than a 4K page. In theory, we can create an even larger struct merged for global variables. But the static experiment shows fewer extra adrp reduction when increasing merged size from 4K to 8K.
  5. The scenario of applying this optimization is different from ARM target because of different addressing mode.
  6. Actually we can reduce the "add" instruction as well by propagating the in-page address, which is usually a relocation, to load/store instruction, and the original offset in load/store instruction can be dumped into fixup(addend) of relocation section. This is be a separate optimization for which we could give follow-up.
  7. I know ARM64 back-end is in trunk now, and I can also port it to ARM64 if needed.

Finally, I don't have real AArch64 machine to test the patch, so here comes the static statistic data only for SPEC2000. I would be extremely appreciative if anybody can help me to validate the performance.

(The before/after merge column contains is the number of ADRP instructions)

CINT2000before mergeafter mergedecreased
256.bzip265536743.97%
186.crafty389837862.87%
175.vpr163616200.98%
255.vortex733369505.22%
252.eon166016580.12%
181.mcf58580.00%
164.gzip71862213.37%
253.perlbmk946094430.18%
197.parser143713098.91%
300.twolf299327139.36%
254.gap614555968.93%
176.gcc146501290811.89%
183.equake26418131.44%
188.ammp9378628.00%
179.art34416252.91%
177.mesa423042300.00%

Thanks,
-Jiangning

Diff Detail

Event Timeline

Can't this be tested with "opt -global-merge"?

Hi Rafael,

No. I didn't enable global-merge-on-external by default, so you have to use -global-merge-on-external instead.

Thanks,
-Jiangning

zzheng added a comment.Apr 2 2014, 3:46 PM

Hi,

We applied this patch and are evaluating its performance. However, it only
works for AArch64. The test case added by this patch failed on ARM64.
Jiangning, can you take a look?

Thanks,
Zhaoshi

Jiangning updated this revision to Unknown Object (????).Apr 17 2014, 11:49 PM

Rebase the patch on TOT, because the last patch will raise build time failure on TOT.

Hi Jianging,

It looks to me that you have two change sets in the same patch:

  • One that adds the support of external linkage to the global merge pass.
  • One that enables the global merge pass for AArch64.

Could you split the patch to match that?

Thanks,
-Quentin

Hi Quentin,

Yes, you are correct. Now I split it into two separate patches.

One is at http://reviews.llvm.org/D3431 for enabling global merge pass, the
other is at http://reviews.llvm.org/D3432 for implementing ADRP CSE for
global symbols.

I don't use the original code review at http://reviews.llvm.org/D3223 to
avoid confusion.

Thanks,
-Jiangning

2014-04-19 0:48 GMT+08:00 Quentin Colombet <qcolombet@apple.com>:

Hi Jianging,

It looks to me that you have two change sets in the same patch:
- One that adds the support of external linkage to the global merge pass.
- One that enables the global merge pass for AArch64.

Could you split the patch to match that?

Thanks,
-Quentin

http://reviews.llvm.org/D3223