This is an archive of the discontinued LLVM Phabricator instance.

Implement ADRP CSE for global symbols
ClosedPublic

Authored by Jiangning on Apr 18 2014, 8:43 PM.

Details

Summary

This patch depends on another patch "Enable global merge pass." at http://reviews.llvm.org/D3431.

This patch implements ADRP CSE for global symbols defined within the module being compiled.
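To make the intent concrete, here is a rough before/after sketch in the spirit of the examples discussed below (the symbol names are illustrative): two adjacent globals that each cost an ADRP can share a single one once merged.

@x = global i32 0, align 4
@y = global i32 0, align 4

; before merging, each global needs its own ADRP:
adrp x8, _x@PAGE
str w0, [x8, _x@PAGEOFF]
adrp x9, _y@PAGE
str w1, [x9, _y@PAGEOFF]

; after merging into __MergedGlobals, a single ADRP serves both:
adrp x8, __MergedGlobals@PAGE
add x8, x8, __MergedGlobals@PAGEOFF
stp w0, w1, [x8]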

Diff Detail

Event Timeline

Hi Jiangning,

This sounds like a good idea. I've got a couple of smaller comments:

lib/IR/Globals.cpp
277–279

Are these clauses tested? I don't see any aliases or GEPs in your examples. I also vaguely remember Rafael saying that a GEP-based alias wasn't intended to be supported, though I wouldn't swear to it (and I'm fuzzy on exactly why it's a bad idea).

lib/Transforms/Scalar/GlobalMerge.cpp
287–290

This might be clearer split up. I know my brain starts to melt when I see massive nested conditionals like this. Perhaps linkage conditions vs intrinsic blockers?

test/CodeGen/AArch64/global_merge_2.ll
1 ↗(On Diff #8660)

And speaking of tests, perhaps this would be a good time to promote GlobalMerge to something that can be run through opt and use that to do the testing.
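For instance, the promoted tests could be driven with a RUN line along these lines (a sketch; the exact opt pass flag is an assumption here):

; RUN: opt -global-merge -S -o - %s | FileCheck %s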

Hi Jiangning,

I copy/paste my comments from the other thread here to ease the tracking.

I wanted to sort out the redundant ADRP stuff on ARM64 while using the pseudo instructions (for address computations) before doing any measurements.
It appears that your patch that adds the external support is not playing nicely (I haven't looked into why) with the existing framework.
I.e.,
Here is my test case (notice the use of internal for the globals):
@x = internal global i32 0, align 4
@y = internal global i32 0, align 4

define void @f1(i32 %a1, i32 %a2) {
  store i32 %a1, i32* @x, align 4
  store i32 %a2, i32* @y, align 4
  ret void
}

Test1, your patch is disabled:
llc -o - -mtriple=arm64-apple-ios test.ll -global-merge-on-external=false -arm64-collect-loh=false

_f1: ; @f1
.cfi_startproc
; BB#0:
adrp x8, __MergedGlobals@PAGE
add x8, x8, __MergedGlobals@PAGEOFF
stp w0, w1, [x8]
ret
.cfi_endproc

Test2, your patch is enabled:
llc -o - -mtriple=arm64-apple-ios test.ll -global-merge-on-external=true -arm64-collect-loh=false

_f1: ; @f1
.cfi_startproc
; BB#0:
adrp x8, __MergedGlobals@PAGE
adrp x9, __MergedGlobals@PAGE
add x9, x9, __MergedGlobals@PAGEOFF
str w0, [x8, __MergedGlobals@PAGEOFF]
str w1, [x9, #4]
ret
.cfi_endproc

Could you take a look please?

Thanks,
-Quentin

Jiangning updated this revision. Apr 21 2014, 10:50 PM

This patch fixes the issues raised by Quentin and Tim. The code was cleaned up, and the following changes were made:

  1. Change IsInternalOnly to IsExternal, and add an assertion.
  2. The behavior of -global-merge-on-external no longer affects static variables.
  3. Split a long if-condition expression.

Thanks,
-Jiangning

Jiangning added inline comments. Apr 21 2014, 11:32 PM
lib/IR/Globals.cpp
277–279

Tim, yes, these clauses are tested, and the following are the aliases being checked.

; CHECK: y = _MergedGlobals_x+4
; CHECK: z = _MergedGlobals_x+8

Without this change, the compiler would fail on the test case global_merge_2.ll.

The GEPs come from ".globl y" and ".globl z":

i32* getelementptr inbounds ({ i32, i32, i32 }* @_MergedGlobals_x, i32 0, i32 1)
i32* getelementptr inbounds ({ i32, i32, i32 }* @_MergedGlobals_x, i32 0, i32 2)

I don't know what Rafael said previously; if you can point me to that email thread, I can consider this issue.

lib/Transforms/Scalar/GlobalMerge.cpp
287–290

Tim, that's OK, and I've split it up in the new version.

Hi Quentin,

Thanks for your testing!

It appears that your patch that adds the external support is not playing nicely (I haven't looked why) with the existing framework.

I uploaded a new version to fix the issue you mentioned, and now -global-merge-on-external no longer affects static variable merging.

Thanks,
-Jiangning

Jiangning added inline comments. Apr 21 2014, 11:47 PM
test/CodeGen/AArch64/global_merge_2.ll
1 ↗(On Diff #8660)

Tim, do you mean we should move it out of the PreISel stage and hoist it to the middle end, where it can really be covered by opt? If yes, do you have any suggestions on where we should put this pass? I mean, which passes should come before/after this global merge pass? Anyway, I can investigate it myself; I thought it was placed in PreISel on purpose, although I don't know the reason yet.

Hi Jiangning,

Tim, do you mean we should move it out of the PreISel stage and hoist it to the middle end, where it can really be covered by opt? If yes, do you have any suggestions on where we should put this pass?

I didn't mean any change in the order (I've no idea where it would be best placed, though "as late as possible" seems reasonable, so PreISel may be right), just that it would be good to be able to run it via opt instead of llc. I did the same recently for the Atomics pass I've been working on.

Cheers.

Tim

Hi Jiangning,

I confirm that the changes you made fix the “regression” on ARM64; however, I do not think this is the way to go.
Indeed, what fixes the problem, as you know, is the fact that we do not set any alignment information for internal globals. Because of this lack of information, the ARM64 backend does not try to fold the ADDlow into the memory operation since it may not be correct.
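In other words (a sketch mirroring Test1/Test2 above): when the alignment guarantees the page offset fits the encoding, the backend can fold the low part into the memory operation,

adrp x8, __MergedGlobals@PAGE
str w0, [x8, __MergedGlobals@PAGEOFF]

whereas without that guarantee the ADDlow has to stay a separate instruction:

adrp x8, __MergedGlobals@PAGE
add x8, x8, __MergedGlobals@PAGEOFF
str w0, [x8]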

That said, if we can set some reasonable alignment, we should do it, and not back this off because of ARM64's current limitation. Nevertheless, we need to be careful not to introduce any regressions here, because so far the global merge pass was producing unaligned globals.

Now, regarding a reasonable alignment, I do not think the current setting makes sense.
Setting the alignment to the size of the struct looks wrong to me (think of a struct with 20 copies of the same type). I guess this information should be set using the data layout on MergedTy. After all, all the types are currently compliant with the ABI (otherwise, we would have excluded the related globals).

I also have a couple of minor comments; see the inline comments.

I am looking into making the folding smarter to avoid the redundant ADRP.

Thanks,
-Quentin

lib/Transforms/Scalar/GlobalMerge.cpp
167

Should be hoisted outside of the loop, since we assert this does not change within the loop.

188

Ditto.

203

This is the alignment thing I was talking about.

Jiangning updated this revision to Diff 8787. Apr 24 2014, 1:16 AM
Jiangning edited edge metadata.

This new version fixes the following:

  1. Following Tim's feedback, make the global merge pass work with opt, so the test cases under test/Transforms/GlobalMerge are updated to use opt only.
  2. Move the original test under test/Transforms/GlobalMerge to test/CodeGen and keep the llc test. The llc test should still be meaningful, because it guarantees that the global merge pass is really enabled in PreISel at optimization level -O1 and above; the opt test can't cover this.
  3. Following Quentin's feedback, hoist some code out of the loop.

Thanks,
-Jiangning

Jiangning updated this revision to Diff 8790. Apr 24 2014, 1:24 AM
Jiangning edited edge metadata.

This patch makes a few more small changes:

  1. Remove a file mode change.
  2. Remove useless comments.

Thanks,
-Jiangning

Jiangning added inline comments. Apr 24 2014, 2:58 AM
lib/Transforms/Scalar/GlobalMerge.cpp
167

Accepted and fixed in the new version uploaded.

188

Accepted and fixed in the new version uploaded.

203

Hi Quentin,

I appreciate your further feedback!

I understand your point. Actually, I think we have three solutions to this ADRP issue; let me summarize them to make sure our discussion is on the same page.

  1. Materialize the base address of the merged global data structure with a single register. MOVAddr in ARM64 does exactly this: it uses "adrp + add" to compute the base of the merged global data structure. This way, the merged global data structure doesn't need structure-size alignment, only natural alignment. The disadvantage of this solution is that at compile time we don't really know whether the offset in the add instruction can be propagated and combined into the load/store instructions, because it might be larger than a page size. For this solution the instruction sequence could look like the following (suppose we have three loads):

adrp r0, MergedGlobal@PAGE
add r1, r0, MergedGlobal@PAGE_OFF
load [r1, var1_offset_within_MergedGlobal]
load [r1, var2_offset_within_MergedGlobal]
load [r1, var3_offset_within_MergedGlobal]

Refer to the chart below as well,

page 1                                page 2
r0                            r1
|-----------------------------|------------|   // merged global data structure starts at r1
                              s--------e
var1_offset_within_MergedGlobal = e - s
(MergedGlobal@PAGE_OFF + var1_offset_within_MergedGlobal) > PageSize
  2. We don't use a single register for the base address of the merged global data structure at all. If we can guarantee the merged global data structure is always within a page and doesn't cross a page boundary, we can be sure the offset from the page boundary is always smaller than a page, so at compile time the offset from the page boundary can be fused into the load/store instruction directly. Setting the alignment to the merged data structure size (rounded up) could guarantee the merged data structure doesn't cross a page boundary. The disadvantage of this solution is that we would probably increase run-time memory consumption, because there might be some "holes" in the data section. With this solution, we would have the following instruction sequence:

adrp r0, MergedGlobal@PAGE
load [r0, var1_offset_within_PAGE]
load [r0, var2_offset_within_PAGE]
load [r0, var3_offset_within_PAGE]

Refer to the chart below as well,

page 1                                page 2
r0
|--------------|==========|----------------|   // merged global data structure, within page 1
s------------------e
var1_offset_within_PAGE = e - s

  3. We use a link-time solution to help remove the ADRP and ADD. Link-time optimization might not be free, because the dead instruction can only be replaced with a NOP instruction.

Both solutions 1) and 2) can remove some ADRPs statically, and the ADD introduced in solution 1) could probably be reduced by solution 3) as well.

Do you think the disadvantage of solution 2) is unacceptable?

Thanks,
-Jiangning

Hi Quentin,

I just realized that I probably misunderstood your point about using the data layout. I think you are correct that the merged size can't simply be accumulated from individual global variable sizes, because of the natural alignment requirements inside the merged data structure.

Sorry about that and I will change my code soon.

Thanks,
-Jiangning

Hi Jiangning,

I think you are correct the merged size can't be simply accumulated from individual global variable sizes, because of the natural alignment requirement inside the merged data structure.

The size is fine; we accumulate the AllocSize, i.e., with padding and such.
What I meant is that the alignment of the whole structure does not look reasonable :).

To be more specific, I think we should set the alignment of the merged structure to DL->getABIAlignment(MergedTy), or something like that.

The example I had in mind was:
@a1 = global i32 0, align 4
@a2 = global i32 0, align 4
@a3 = global i32 0, align 4
@a4 = global i32 0, align 4

=>

struct MergeGlobal {
  i32, i32, i32, i32
}

With the current approach we would get an alignment of 4*4 = 16, whereas 4 may have been enough. Anyhow, this depends on the data layout.
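For illustration, a minimal sketch of what I mean, with a hypothetical helper name (this is not the patch's actual code):

#include "llvm/IR/DataLayout.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/GlobalVariable.h"
using namespace llvm;

// Sketch: take the merged struct's alignment from the DataLayout's ABI
// alignment instead of from its total allocation size.
static void setMergedAlignmentFromABI(GlobalVariable &MergedGV,
                                      StructType *MergedTy,
                                      const DataLayout &DL) {
  // For { i32, i32, i32, i32 } this yields 4, not 16.
  MergedGV.setAlignment(DL.getABITypeAlignment(MergedTy));
}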

Thanks for looking into it.

-Quentin

Hi Quentin,

I thought I had misunderstood you, but looking at your explanation below, it seems I hadn't. See my answer below.

2014-04-25 1:06 GMT+08:00 Quentin Colombet <qcolombet@apple.com>:

Hi Jiangning,

I think you are correct the merged size can't be simply accumulated from individual global variable sizes, because of the natural alignment requirement inside the merged data structure.

The size is fine; we accumulate the AllocSize, i.e., with padding and such.
What I meant is that the alignment of the whole structure does not look reasonable :).

To be more specific, I think we should set the alignment of the merged structure to DL->getABIAlignment(MergedTy), or something like that.

The example I had in mind was:
@a1 = global i32 0, align 4
@a2 = global i32 0, align 4
@a3 = global i32 0, align 4
@a4 = global i32 0, align 4

=>

struct MergeGlobal {
  i32, i32, i32, i32
}

With the current approach we would get an alignment of 4*4 = 16, whereas 4 may have been enough. Anyhow, this depends on the data layout.

No. I think it should be 16; otherwise MergeGlobal may cross a page boundary, and the offset would not be able to be fused into the load/store instruction at compile time.

If we set the alignment to 16, we can guarantee at compile time that MergeGlobal is always within a page. The load/store instruction requires that the offset be no larger than what 4096 elements can cover. Conservatively, we can treat that as a page limitation, although in reality it only holds exactly for byte elements.

Could you please look again at my last reply at http://reviews.llvm.org/D3432#14, containing the summary of the three different solutions? My choice is solution 2). I drew some text charts to elaborate the idea. For your example, I'm hoping we can generate, at compile time, a single ADRP and no ADD instruction at all.

Thanks,
-Jiangning

Thanks for looking into it.

-Quentin


Hi Jiangning,

Ok, I see your point now.
This leads me to the following question: why don’t we do the same for internal globals?

Anyway, you should comment why we are setting this value for the alignment :).

When that is fixed, I can give it a shot.

Thanks,
-Quentin

Hi Jiangning,

On second thought, I do not think approach #2 is reasonable.
In your example:
adrp r0, MergedGlobal@PAGE
load [r0, var1_offset_within_PAGE]
load [r0, var2_offset_within_PAGE]
load [r0, var3_offset_within_PAGE]

Is in fact, syntax A
adrp r0, MergedGlobal@PAGE
load [r0, MergedGlobal@PAGEOFF+var1_offset]
load [r0, MergedGlobal@PAGEOFF+var2_offset]
load [r0, MergedGlobal@PAGEOFF+var3_offset]

Or put another way, syntax B
adrp r0, MergedGlobal@PAGE
load [r0, MergedGlobal_var1@PAGEOFF]
load [r0, MergedGlobal_var2@PAGEOFF]
load [r0, MergedGlobal_var3@PAGEOFF]

For syntax A to be valid, you need to be sure that MergedGlobal@PAGEOFF+varX_offset fits the encoding space.
For syntax B to be valid, you need to be sure that MergedGlobal@PAGE == MergedGlobal_varX@PAGE.

Thus, in both cases, this is not generally correct.

To increase the likelihood of this being correct, you’ll need to align the thing on a PAGE, and make sure that the reachable fields are within the encoding space (this should be fine, because of the get max offset thing). I’d say that is not desirable.

Therefore, I think we need to go for approach #1.
That said, if we can set an alignment that makes sense, that sounds good. In fact, for external globals, we may have to produce an accurate alignment for external users to get what they expect based on the original alignment of each field.

What do you think?

Thanks,
-Quentin

Hi Quentin,

Is in fact, syntax A
adrp r0, MergedGlobal@PAGE
load [r0, MergedGlobal@PAGEOFF+var1_offset]
load [r0, MergedGlobal@PAGEOFF+var2_offset]
load [r0, MergedGlobal@PAGEOFF+var3_offset]

Or put another way, syntax B
adrp r0, MergedGlobal@PAGE
load [r0, MergedGlobal_var1@PAGEOFF]
load [r0, MergedGlobal_var2@PAGEOFF]
load [r0, MergedGlobal_var3@PAGEOFF]

For syntax A to be valid, you need to be sure that MergedGlobal@PAGEOFF+varX_offset fits the encoding space.
For syntax B to be valid, you need to be sure that MergedGlobal@PAGE == MergedGlobal_varX@PAGE.

Thus, in both cases, this is not generally correct.

If we don't set MergedGlobal's alignment to RoundUpToPowerOfTwo(sizeof(MergedGlobal)), both syntax A and B are incorrect.

But if we set MergedGlobal's alignment to RoundUpToPowerOfTwo(sizeof(MergedGlobal)), both syntax A and B should be correct. I.e., for syntax A, it guarantees (MergedGlobal@PAGEOFF+varX_offset) < 4096, which meets the encoding requirement of the load/store instructions; for syntax B, it guarantees MergedGlobal@PAGE == MergedGlobal_varX@PAGE.

To increase the likelihood of this being correct, you’ll need to align the thing on a PAGE, and make sure that the reachable fields are within the encoding space (this should be fine, because of the get max offset thing). I’d say that is not desirable.

We don't really need to align MergedGlobal to a PAGE; aligning MergedGlobal to RoundUpToPowerOfTwo(sizeof(MergedGlobal)) is enough.
If you don't think this is true, can you give me a counterexample?
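For reference, a small self-contained sketch of the rounding and of why it implies page containment (the helper name is hypothetical):

#include <cstdint>

// Round Size up to the next power of two (the RoundUpToPowerOfTwo above).
static uint64_t roundUpToPowerOfTwo(uint64_t Size) {
  uint64_t Align = 1;
  while (Align < Size)
    Align <<= 1;
  return Align;
}

// If Align = roundUpToPowerOfTwo(sizeof(MergedGlobal)) and Align <= 4096,
// an Align-aligned MergedGlobal lies entirely inside one Align-sized block,
// hence inside one 4 KiB page, so every field's offset from the page base
// stays below 4096.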

Thanks,
-Jiangning

Therefore, I think we need to go for approach #1.
That said, if we can set an alignment that makes sense, that sounds good. In fact, for external globals, we may have to produce an accurate alignment for external users to get what they expect based on the original alignment of each field.

Hi Jiangning,

If we don't set MergedGlobal's alignment to RoundUpToPowerOfTwo(sizeof(MergedGlobal)), both syntax A and B are incorrect.

Yes, they are incorrect and that’s why I said we should go for your first approach:
adrp r0, MergedGlobal@PAGE
add r0, r0, MergedGlobal@PAGEOFF
load [r0, #var1_offset]
load [r0, #var2_offset]
load [r0, #var3_offset]

But if we set MergedGlobal's alignment to RoundUpToPowerOfTwo(sizeof(MergedGlobal)), both syntax A and B should be correct.

Yes, but my point is, we do not want to have this big alignment. Especially because it may make sense *only* for ARM64. Moreover, the fact that this alignment works is because both ADD and LDR have the same encoding space for the immediate. In my opinion this is a lot of assumptions that would be hard to match for all targets to make that a desirable change.

Therefore, I think we should go for your approach #1 and set a *reasonable* alignment based on the data layout.
Otherwise, you will have to demonstrate that this big alignment is desirable for all targets.

Thanks,
-Quentin

Hi Quentin,

But if we set MergedGlobal's alignment to RoundUpToPowerOfTwo(sizeof(MergedGlobal)), both syntax A and B should be correct.

Yes, but my point is, we do not want to have this big alignment. Especially because it may make sense *only* for ARM64. Moreover, the fact that this alignment works is because both ADD and LDR have the same encoding space for the immediate. In my opinion this is a lot of assumptions that would be hard to match for all targets to make that a desirable change.

I see your point now. I think you are correct, and there is an assumption here. What if we create a target-specific hook and allow the target/back-end to choose the alignment between RoundUpToPowerOfTwo(sizeof(MergedGlobal)) and NaturalAlignment(MergedGlobal)?

Thanks,
-Jiangning

Hi Jiangning,

A target hook seems like a good idea.
I would be more aggressive about its scope, though.

What if we create a target-specific hook and allow the target/back-end to choose the alignment between RoundUpToPowerOfTwo(sizeof(MergedGlobal)) and NaturalAlignment(MergedGlobal)?

Instead of choosing between RoundUpToPowerOfTwo(sizeof(MergedGlobal)) and NaturalAlignment(MergedGlobal), what about choosing an alignment, period :).
We could then assert that the returned alignment is at least equal to the natural alignment. The default implementation of the target hook could just return the natural alignment.

What do you think?

Thanks,
-Quentin

Jiangning updated this revision to Diff 9137. May 6 2014, 6:49 PM

Hi Quentin,

Sorry for my late response! I was on vacation and busy with other tasks.

I totally agree with you, so I added a target hook returning the alignment required by global merge on external, and in the global merge pass there is an assertion to make sure the alignment returned by the target hook always meets the natural alignment requirement.

Thanks,
-Jiangning

The changes in this version are mainly the following, and nothing else.

Thanks,
-Jiangning

include/llvm/Target/TargetLowering.h
889

This is the newly added target hook that returns the alignment required by different targets.

lib/Target/ARM64/ARM64ISelLowering.cpp
640

The ARM64-specific version returns RoundUpToPowerOfTwo(sizeof(MergedGlobal)).
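A rough sketch of what this hook pair could look like; the name getGlobalMergeAlignment and the exact signatures are illustrative, not necessarily what the patch uses:

// Default in TargetLowering: the natural ABI alignment from the DataLayout.
virtual unsigned getGlobalMergeAlignment(StructType *MergedTy) const {
  return getDataLayout()->getABITypeAlignment(MergedTy);
}

// ARM64 override: round the allocation size up to a power of two so the
// merged struct can never straddle a 4 KiB page.
unsigned ARM64TargetLowering::getGlobalMergeAlignment(
    StructType *MergedTy) const {
  uint64_t Size = getDataLayout()->getTypeAllocSize(MergedTy);
  return (unsigned)NextPowerOf2(Size - 1); // llvm::NextPowerOf2, MathExtras.h
}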

Hi Jiangning,

Thanks for the update.

I’ll give the patch a shot as it is and get back to you with performance numbers.

Cheers,

-Quentin

lib/Transforms/Scalar/GlobalMerge.cpp
215

Any reason why we do not want to set an alignment for the internal ones too?

Hi Jiangning,

I've stopped the benchmarking because I am seeing compiler crashes with the current patch when merge on external is enabled:
Assertion failed: (Globals[i]->hasExternalLinkage() == IsExternal), function doMerge, file lib/Transforms/Scalar/GlobalMerge.cpp, line 199.

The following tests do not compile:
External/SPEC/CINT2000/255.vortex/255.vortex
External/SPEC/CINT2006/445.gobmk/445.gobmk
External/SPEC/CINT95/147.vortex/147.vortex
MultiSource/Applications/oggenc/oggenc
MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame

I am attaching the reduced test case from 445.gobmk.


To reproduce:
llc -global-merge-on-external=true reduced_testcase.ll

Could you have a look please?

As soon as this is fixed, I can start over the benchmarking.

Thanks,

Quentin

Jiangning updated this revision to Diff 9301. May 12 2014, 12:06 AM

Hi Quentin,

Sorry for those failures; yes, they are all my fault, and the new patch should have all the issues fixed.

There was a logic error around the original assertion: it only checked the first global of each set of merged global variables. This was actually introduced by the first refactoring patch.

Anyway, this new patch fixes the following:

  1. The compilation failures you mentioned. A new test case, global-merge-3.ll, is added to detect this regression. This test intentionally mixes "internal global" and "global", and tests multiple merged globals.

If "internal globals" and "globals" are mixed in the same merged global, we use the first "global" name as the suffix of the newly created merged global variable.

  2. Ask "internal globals" to also use the alignment target hook. Yes, you are right, we can also apply this special hook to internal globals. I didn't do this previously because I didn't want to mix the globals that expose their names to other modules with the internal globals, whose names don't need to be exposed at all. Also, I think applying this alignment hook to "internal globals" should be orthogonal to -global-merge-on-external, so I introduced another command-line switch, -global-merge-aligned (see the example command after this list). I.e.,

-global-merge-on-external: controls whether globals other than "internal globals" are merged.
-global-merge-aligned: controls whether the special alignment is applied to merged globals.

  3. This patch is also formatted with git-clang-format.
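For example, both switches can be exercised together with a command along these lines (the triple is borrowed from the earlier tests; the flag names are the ones introduced above):

llc -mtriple=arm64-apple-ios -global-merge-on-external=true -global-merge-aligned=true test.ll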

Thanks,
-Jiangning

Hi Jiangning,

Also, I think applying this alignment hook to "internal globals" should be orthogonal to -global-merge-on-external.

Makes sense.

I introduced another command line switch -global-merge-aligned.

That’s a good idea, thanks!

That said, shouldn’t we set the natural alignment for external globals by default? I.e., I guess external users will assume that at least the natural alignment is set.

Anyway, I’ll give a shot to the new patch.

Thanks,
-Quentin

Hi Quentin,

I introduced another command line switch -global-merge-aligned.

That’s a good idea, thanks!

That said, shouldn’t we set the natural alignment for external globals by default? I.e., I guess external users will assume that at least the natural alignment is set.

I think if I don't explicitly set an alignment, it defaults to the natural alignment for both internal and external global variables, right? The original code for internal globals doesn't explicitly set an alignment either.

Now that I've got an A57/A53 big.LITTLE board, I did some experiments, and the data shows there isn't really a performance impact for SPEC2000 int on either A57 or A53 (I believe you understand I can't post the performance data directly here due to legal issues), although I do see the following ADRP instruction reductions for ARM64:

bzip2 0.28%
crafty 1.20%
vpr 0.00%
vortex 4.83%
eon 0.05%
mcf 0.00%
gzip 3.04%
perlbmk 0.02%
parser 4.49%
twolf 0.00%
gap 2.08%
gcc 1.39%

I think one of the reasons might be that this optimization only affects global symbol addresses, which are essentially loop invariants and should always be hoisted out of hot loops, so at least for SPEC2000 we wouldn't see a performance impact.

Once the pseudo-instruction MOVAddr issue is solved for ARM64, there might be further ADRP instruction reductions, but even after that change I would still not expect a performance impact for SPEC2000, because my measurements for AArch64 didn't show one either.

As far as this optimization itself is concerned, I now think code size reduction is the key benefit, and it shouldn't hurt performance.

Considering this patch has been up for almost 4 weeks and we are approaching the deadline for migrating AArch64 to ARM64, do you agree to commit this patch first? If this can happen before AArch64 is phased out, it will save us a lot of rebasing effort, because the final switch from AArch64 to ARM64, and renaming ARM64 to AArch64, will be a significant change on TOT.

Anyway, I would be extremely appreciative if you could understand the current situation and agree to upstream this patch first.

Thanks,
-Jiangning

Anyway, I’ll give a shot to the new patch.

Thanks,
-Quentin

Hi Jiangning,

Sorry for the delay in getting back to you, but the results are not what I expected and I had to double-check them.
The short story is: I am seeing a lot of regressions, although the number of ADRP/ADR instructions decreased a lot in the final binaries.

Note: for the performance numbers, I have filtered out the tests that run for less than a second and the tests that have similar performance.
Columns:

  • Reference: global merge on external disabled.
  • Test: global merge on external enabled, as well as the alignment change.
  • Expansion: Test/Reference (smaller is better).

Performances

  • O3

Benchmark_ID Reference Test Expansion Percent

ASC_Sequoia/IRSmk/IRSmk 16.6782 16.5785 0.99 -1%
Adobe-C++/loop_unroll 4.9308 5.0514 1.02 +2%
BenchmarkGame/n-body 2.38 2.3966 1.01 +1%
CFP2000/177.mesa/177.me 4.0972 4.1418 1.01 +1%
CINT2000/164.gzip/164.g 22.8601 23.2677 1.02 +2%
CINT2000/186.crafty/186 9.1118 9.2452 1.01 +1%
CINT2000/253.perlbmk/25 13.5299 13.6566 1.01 +1%
CINT2000/254.gap/254.ga 3.8361 3.8128 0.99 -1%
CINT2000/255.vortex/255 4.8632 5.0017 1.03 +3%
CINT2006/403.gcc/403.gc 3.251 3.2304 0.99 -1%
CINT2006/458.sjeng/458. 8.8006 8.7538 0.99 -1%
CINT2006/471.omnetpp/47 1.3495 1.323 0.98 -2%
McGill/queens 4.2418 4.2123 0.99 -1%
Misc-C++/Large/ray 5.2962 5.3265 1.01 +1%
Olden/power/power 2.3012 2.3335 1.01 +1%
SIBsim4/SIBsim4 5.6249 5.5846 0.99 -1%
Shootout-C++/ary3 2.1923 2.2114 1.01 +1%
Shootout-C++/lists1 1.4118 1.4452 1.02 +2%
VersaBench/8b10b/8b10b 13.3247 12.9642 0.97 -3%
VersaBench/ecbdes/ecbde 4.9098 4.9364 1.01 +1%
aha/aha 4.2497 4.3452 1.02 +2%
lambda-0.1.3/lambda 8.8454 8.7103 0.98 -2%
mafft/pairlocalalign 59.696 57.5007 0.96 -4%

siod/siod 3.4691 3.4929 1.01 +1%

Min (24) - - 0.96 -

Max (24) - - 1.03 -

Sum (24) 211 210 0.99 +1%

A.Mean (24) - - 1 +0%

G.Mean 2 (24) - - 1 +0%

Overall there are more regressions (14) than improvements (10) and on average it is neutral.

  • Os

Benchmark_ID Reference Test Expansion Percent

Adobe-C++/loop_unroll 5.0411 5.168 1.03 +3%
CINT2000/164.gzip/164.g 22.5516 23.0424 1.02 +2%
CINT2000/186.crafty/186 9.5044 9.6343 1.01 +1%
CINT2006/400.perlbench/ 15.0506 15.1874 1.01 +1%
CINT2006/456.hmmer/456. 6.088 6.1583 1.01 +1%
CINT2006/462.libquantum 2.6566 2.678 1.01 +1%
McGill/queens 4.2193 4.186 0.99 -1%
Misc-C++/Large/ray 6.5632 6.6263 1.01 +1%
Olden/power/power 3.4379 3.2233 0.94 -6%
Polybench/stencils/fdtd 3.6964 3.718 1.01 +1%
Ptrdist/ft/ft 2.3729 2.3966 1.01 +1%
SIBsim4/SIBsim4 5.6219 5.6644 1.01 +1%
Trimaran/enc-3des/enc-3 3.7445 3.7256 0.99 -1%
Trimaran/enc-pc1/enc-pc 1.6924 1.7176 1.01 +1%
VersaBench/8b10b/8b10b 13.4432 13.1539 0.98 -2%
lambda-0.1.3/lambda 8.681 9.1763 1.06 +6%
mafft/pairlocalalign 59.1719 58.6067 0.99 -1%
povray 6.2992 6.4061 1.02 +2%

siod/siod 4.2536 4.3336 1.02 +2%

Min (19) - - 0.94 -

Max (19) - - 1.06 -

Sum (19) 184 185 1 +0%

A.Mean (19) - - 1.01 +1%

G.Mean 2 (19) - - 1.01 +1%

Overall there are more regressions (14) than improvements (5) and on average it is a regression.

Static Count of ADRP/ADR

This is the number of ADRP/ADR instructions in the final binaries. In other words, the linker optimizations have already taken place.

  • O3

Min (90) - - 0.3 -

Max (90) - - 1.11 -

Sum (90) 160078 152264 0.95 +5%

A.Mean (90) - - 0.88 -12%

G.Mean 2 (90) - - 0.86 -14%

Here are the details for the regressions:

Benchmark_ID Reference Test Expansion Percent

C/Output/globalrefs.sim 9 10 1.11 +11%
CFP2006/444.namd/Output 431 444 1.03 +3%
Misc/Output/flops-1.sim 16 17 1.06 +6%
Misc/Output/flops-3.sim 15 16 1.07 +7%
Misc/Output/flops-8.sim 15 16 1.07 +7%

Shootout-C++/EH/Output/ 16 17 1.06 +6%

Min (6) - - 1.03 -

Max (6) - - 1.11 -

Sum (6) 502 520 1.04 -3%

A.Mean (6) - - 1.07 +7%

G.Mean 2 (6) - - 1.07 +7%

  • Os

Min (87) - - 0.3 -

Max (87) - - 1.11 -

Sum (87) 101239 96762 0.96 +5%

A.Mean (87) - - 0.89 -11%

G.Mean 2 (87) - - 0.86 -14%

Here are the details for the regressions:

Benchmark_ID Reference Test Expansion Percent

C/Output/globalrefs.sim 9 10 1.11 +11%
CFP2006/444.namd/Output 349 385 1.1 +10%
Misc/Output/flops-1.sim 16 17 1.06 +6%
Misc/Output/flops-3.sim 15 16 1.07 +7%
Misc/Output/flops-8.sim 15 16 1.07 +7%

Shootout-C++/EH/Output/ 10 11 1.1 +10%

Min (6) - - 1.06 -

Max (6) - - 1.11 -

Sum (6) 414 455 1.1 -9%

A.Mean (6) - - 1.09 +9%

G.Mean 2 (6) - - 1.08 +8%

Long Story

After the first run, a lot of applications were regressing, so I suspected that somehow my device had gotten noisy. I've rerun all the regressions/improvements 10 times and observed the standard deviation (SD). The SD was close to 0 percent for most tests, i.e., the regressions/improvements were real.

I've investigated a few of them and figured out that it was our lowering using pseudo instructions for ADRP that was biting us. The short story is that this was producing redundant ADRPs, as you initially saw when you started to work on this patch.
I thought I had fixed the problem with a quick patch prior to the first round of experiments, but apparently it was not sufficient.

Anyway, I have made another quick patch where I basically blocked any folding of those instructions, thus maximizing the reuse of ADRPs. This is the current baseline.
However, as the numbers show, even though we remove a lot of ADRPs, there is something else going on that makes this worthless.
I have to dig into that to see what we can do.

In the meantime, I can give you the raw numbers if you want to gather your own statistics.

Thanks,

-Quentin

I think if I don't explicitly set an alignment, it defaults to the natural alignment for both internal and external global variables, right?

I am not sure. IIRC, no alignment means 1-byte aligned.

The original code for internal globals doesn't explicitly set alignment either.

This is not a problem for internals, because the module will see it, as all the globals are locally known. But for externals the problem is different: other modules will not see the new alignment and will thus assume whatever the front-end or the ABI tells the backend.
I feel like there is something wrong here, but I may just be paranoid. I wanted to let you know so you can double-check :).

Now that I've got an A57/A53 big.LITTLE board, I did some experiments, and the data shows there isn't really a performance impact for SPEC2000 int on either A57 or A53 (I believe you understand I can't post the performance data directly here due to legal issues), although I do see the following ADRP instruction reductions for ARM64:

I concur with the ADRP reduction for CINT2000; here are the numbers for O3:

Benchmark_ID Reference Test Expansion Percent

./164.gzip/Output/164.g 655 632 0.96 -4%
./175.vpr/Output/175.vp 1678 1678 1 +0%
./176.gcc/Output/176.gc 14838 14701 0.99 -1%
./181.mcf/Output/181.mc 39 39 1 +0%
./186.crafty/Output/186 3922 3896 0.99 -1%
./197.parser/Output/197 1398 1366 0.98 -2%
./252.eon/Output/252.eo 1162 1160 1 +0%
./253.perlbmk/Output/25 9446 9444 1 +0%
./254.gap/Output/254.ga 6058 5988 0.99 -1%
./255.vortex/Output/255 7811 7566 0.97 -3%
./256.bzip2/Output/256. 590 590 1 +0%

./300.twolf/Output/300. 3244 3244 1 +0%

Min (12) - - 0.96 -

Max (12) - - 1 -

Sum (12) 50841 50304 0.99 +1%

A.Mean (12) - - 0.99 -1%

G.Mean 2 (12) - - 0.99 -1%

However, I do see some performance regressions (see my previous post).

Anyway, I would be extremely appreciative if you can understand the current situation and agree to upstream this patch first.

I am fine with that assuming:

  1. This is disabled by default for the arm64-apple-* triple.
  2. You have double check the alignment stuff :).

Thanks,
-Quentin

Hi Quentin,

Thanks a lot for your performance data; I am also surprised by those regressions. I can reproduce the regressions on ARM64 for the benchmarks below:

CINT2000/164.gzip/164.g 22.8601 23.2677 1.02 +2%
CINT2000/255.vortex/255 4.8632 5.0017 1.03 +3%

If I don't have other higher-priority tasks, I will have a look. But it would be much better if your patch fixing the MOVAddr issue could be shared with me.

Is there any possibility that, after this optimization, the live range of the base address is extended, triggering spills/fills in hot loops by the RA? If that is the case, the rematerialization needs to be tuned, I think.

Anyway, I would be extremely appreciative if you can understand the current situation and agree to upstream this patch first.

I am fine with that assuming:

  1. This is disabled by default for the arm64-apple-* triple.

Actually, both -global-merge-on-external and -global-merge-aligned are false by default, so if the end user doesn't enable them we won't see any performance impact. Therefore, they are disabled by default for all targets.

I was actually hoping your patch could get it enabled at least for ARM64. :-)

  1. You have double check the alignment stuff :).

I checked the alignment behavior. If we don't set an alignment, which is the default behavior both with and without my patch, the following code in AsmPrinter takes effect:

// If the alignment is specified, we *must* obey it.  Overaligning a global
// with a specified alignment is a prompt way to break globals emitted to
// sections and expected to be contiguous (e.g. ObjC metadata).
unsigned AlignLog = getGVAlignmentLog2(GV, *DL);

for (const HandlerInfo &HI : Handlers) {
  NamedRegionTimer T(HI.TimerName, HI.TimerGroupName, TimePassesIsEnabled);
  HI.Handler->setSymbolSize(GVSym, Size);
}

// Handle common and BSS local symbols (.lcomm).
if (GVKind.isCommon() || GVKind.isBSSLocal()) {
  if (Size == 0) Size = 1;   // .comm Foo, 0 is undefined, avoid it.
  unsigned Align = 1 << AlignLog;

So that means RoundUpToPowerOfTwo(sizeof(MergedGlobal)) will be used as the alignment.

The test case test/CodeGen/AArch64/global-merge.ll, which only uses "internal globals", also shows this with llc:

$ ../build/bin/llc -mtriple=arm64-apple-ios -O1 < test/CodeGen/AArch64/global-merge.ll
.section __TEXT,__text,regular,pure_instructions
.globl _f1
.align 2
_f1: ; @f1
.cfi_startproc
; BB#0:
Lloh0:
adrp x8, __MergedGlobals@PAGE
Lloh1:
add x8, x8, __MergedGlobals@PAGEOFF
stp w0, w1, [x8]
ret
.loh AdrpAdd Lloh0, Lloh1
.cfi_endproc

.zerofill __DATA,__bss,__MergedGlobals,8,3 ; @_MergedGlobals

So actually I'm keeping the original behavior.

Anyway, my patch doesn't change the original semantics at all, as long as -global-merge-on-external and -global-merge-aligned are not enabled.

Are you happy with this?

Thanks,
-Jiangning

Thanks,
-Quentin

Hi Jiangning,

If I don't have other higher-priority tasks, I will have a look. But it would be much better if your patch fixing the MOVAddr issue could be shared with me.

Sure! It is more a hack than a patch, but here it is :).

Is there any possibility that, after this optimization, the live range of the base address is extended, triggering spills/fills in hot loops by the RA? If that is the case, the rematerialization needs to be tuned, I think.

I thought about that, and it is possible. Though the current implementation still uses the pseudo instructions, so rematerialization should kick in. That said, it is also possible that the rematerialization algorithm is not perfect.

Actually I was hoping your patch can get it enabled at least for ARM64. :-)

For ARM64, yes, but for arm64-apple-* no :). We need to address the regressions first before enabling it by default for our (Apple) triple, but if you need it for other OSes, I am fine with that, though I cannot test the performance for those.
We can land the patch with it disabled by default for all targets, then enable it for the targets (triples) we care about along the way.

Thanks for checking the alignment thing.

Yes, I am happy with that :).

Thanks,
-Quentin

Hi Quentin,

I committed this patch as r208934, and will analyze those regressions with
your patch.

Thanks,
-Jiangning


I committed this patch as r208934, and will analyze those regressions with your patch.

Thanks!

Let me know if I can help with the regressions.

-Quentin

qcolombet accepted this revision. May 15 2014, 5:02 PM
qcolombet edited edge metadata.

Marking this as accepted to close it.

-Quentin

This revision is now accepted and ready to land. May 15 2014, 5:02 PM
ab closed this revision. Feb 2 2015, 3:57 PM
ab added a subscriber: ab.

Closing, as this was committed a while ago (r208934, r210640).
