This is an archive of the discontinued LLVM Phabricator instance.

Implement ADRP CSE for global symbols
ClosedPublic

Authored by Jiangning on Apr 18 2014, 8:43 PM.

Details

Reviewers

qcolombet
t.p.northover

Summary

This patch depends on another patch "Enable global merge pass." at http://reviews.llvm.org/D3431.

This patch implements ADRP CSE for global symbols defined within the module being compiled.

Event Timeline

Hi Jiangning,

This sounds like a good idea. I've got a couple of smaller comments:

lib/IR/Globals.cpp
277–279	Are these clauses tested? I don't see any aliases or GEPs in your examples. I also vaguely remember Rafael saying that a GEP-based alias wasn't intended to be supported, though I wouldn't swear to it (and I'm fuzzy on exactly why it's a bad idea).
lib/Transforms/Scalar/GlobalMerge.cpp
287–290	This might be clearer split up. I know my brain starts to melt when I see massive nested conditionals like this. Perhaps linkage conditions vs intrinsic blockers?
test/CodeGen/AArch64/global_merge_2.ll
1 ↗	(On Diff #8660)	And speaking of tests, perhaps this would be a good time to promote GlobalMerge to something that can be run through opt and use that to do the testing.

Hi Jiangning,

I copy/paste my comments from the other thread here to ease the tracking.

I wanted to sort out the redundant ADRP stuff on ARM64 while using the pseudo instructions (for address computations) before doing any measurements.
It appears that your patch that adds the external support is not playing nicely with the existing framework (I haven’t looked into why).
I.e.,
Here is my test case (notice the use of internal for the globals):
@x = internal global i32 0, align 4
@y = internal global i32 0, align 4

define void @f1(i32 %a1, i32 %a2) {
  store i32 %a1, i32* @x, align 4
  store i32 %a2, i32* @y, align 4
  ret void
}

Test1, your patch is disabled:
llc -o - -mtriple=arm64-apple-ios test.ll -global-merge-on-external=false -arm64-collect-loh=false

_f1: ; @f1
.cfi_startproc
; BB#0:
adrp x8, MergedGlobals@PAGE
add x8, x8, MergedGlobals@PAGEOFF
stp w0, w1, [x8]
ret
.cfi_endproc

Test2, your patch is enabled:
llc -o - -mtriple=arm64-apple-ios test.ll -global-merge-on-external=true -arm64-collect-loh=false

_f1: ; @f1
.cfi_startproc
; BB#0:
adrp x8, MergedGlobals@PAGE
adrp x9, MergedGlobals@PAGE
add x9, x9, MergedGlobals@PAGEOFF
str w0, [x8, MergedGlobals@PAGEOFF]
str w1, [x9, #4]
ret
.cfi_endproc

Could you take a look please?

Thanks,
-Quentin

This patch fixes the issues raised by Quentin and Tim. The code was cleaned up, and the following changes were made:

Change IsInternalOnly to IsExternal, and add assertion.
The behavior of -global-merge-on-external will not affect static variables any longer.
Split a long if condition expression.

Thanks,
-Jiangning

Jiangning added inline comments.Apr 21 2014, 11:32 PM

lib/IR/Globals.cpp
277–279	Tim, yes, these clauses are tested. The following lines check the aliases:
; CHECK: y = _MergedGlobals_x+4
; CHECK: z = _MergedGlobals_x+8
Without this change, the compiler would fail to run the test case global_merge_2.ll. The GEPs come from ".globl y" and ".globl z":
i32* getelementptr inbounds ({ i32, i32, i32 }* @_MergedGlobals_x, i32 0, i32 1)
i32* getelementptr inbounds ({ i32, i32, i32 }* @_MergedGlobals_x, i32 0, i32 2)
I don't know what Rafael said previously. Can you point me to that email thread if you know it? Then I can look into this issue.
lib/Transforms/Scalar/GlobalMerge.cpp
287–290	Tim, that's OK, and I've split it up in the new version.

Hi Quentin,

Thanks for your testing!

It appears that your patch that adds the external support is not playing nicely (I haven’t looked why) with the existing framework.

I uploaded a new version to fix the issue you mentioned, and now
-global-merge-on-external no longer affects the merging of static variables.

Thanks,
-Jiangning

Jiangning added inline comments.Apr 21 2014, 11:47 PM

test/CodeGen/AArch64/global_merge_2.ll
1 ↗	(On Diff #8660)	Tim, Do you mean we should move it out of PreISel stage and hoist it to middle end which can really be covered by opt? If yes, do you have any suggestions, where we should drop this pass? I mean, which passes should be before/after this global merge pass? Anyway, I can investigate it myself, and I thought it was designed in PreISel on purpose, although I don't know the reason yet.

Hi Jiangning,

Tim, Do you mean we should move it out of PreISel stage and hoist it to middle end which can really be covered by opt? If yes, do you have any suggestions, where we should drop this pass?

I didn't mean any change in the order (I've no idea where it would be
best placed, though "as late as possible" seems reasonable so PreISel
may be right), just that it would be good to be able to run it via opt
instead of llc. I did the same recently to the Atomics pass I've been
working on.

Cheers.

Tim

Hi Jiangning,

I confirm that the changes you did fix the “regression” on ARM64, however I do not think this is the way to go.
Indeed, what fixes the problem, as you know, is the fact that we do not set any alignment information for internal globals. Because of this lack of information, the ARM64 backend does not try to fold the ADDlow into the memory operation since it may not be correct.

That said, if we can set some reasonable alignment, we should do it, and not back this off because of ARM64's current limitation. Nevertheless, we need to be careful not to introduce any regression here, because so far the global merge pass was producing unaligned globals.

Now, regarding a reasonable alignment, I do not think the current setting makes sense.
Setting the alignment to the size of the struct looks wrong to me (think a struct with 20 times the same type). I guess this information should be set using the data layout on MergedTy. After all, all the types are currently compliant with the ABI (otherwise, we excluded the related globals).

I’ve also a couple of minor comments, see the inline comments.

I am looking into making the folding smarter to avoid the redundant ADRP.

Thanks,
-Quentin

lib/Transforms/Scalar/GlobalMerge.cpp
167	Should be hoisted outside of the loop, since we assert this does not change within the loop.
188	Ditto.
203	This is the alignment thing I was talking about.

This new version fixes the following:

Following Tim's feedback, make the global merge pass work with opt, so the test cases under test/Transforms/GlobalMerge are updated to use opt only.
Move the original test under test/Transforms/GlobalMerge to test/CodeGen and keep the llc test. The llc test should still be meaningful, because it guarantees that the global merge pass is really enabled in PreISel at optimization level -O1 and above. The opt test can't cover this.
Following Quentin's feedback, hoist some code out of the loop.

Thanks,
-Jiangning

This patch makes a few more small changes:

Remove file mode change
Remove useless comments

Thanks,
-Jiangning

Jiangning added inline comments.Apr 24 2014, 2:58 AM

lib/Transforms/Scalar/GlobalMerge.cpp

167

Accepted and fixed in the newly uploaded version.

188

Accepted and fixed in the newly uploaded version.

203

Hi Quentin,

I appreciate your further feedback!

I understand your point. I think we have three possible solutions to this ADRP issue. Let me summarize them to make sure we are on the same page.

1) Materialize the base address of the merged global data structure in a single register. MOVAddr in ARM64 does exactly this: it uses "adrp + add" to compute the base of the merged global data structure. This way, the merged global data structure doesn't need structure-size alignment, only its natural alignment. The disadvantage of this solution is that at compile time we don't know whether the offset in the add instruction can be propagated and combined into the load/store instructions, because the combined offset might be larger than a page. With this solution the instruction sequence could look like the following (suppose we have three loads):

adrp r0, MergedGlobal@PAGE
add r1, r0, MergedGlobal@PAGE_OFF
load [r1, var1_offset_within_MergedGlobal]
load [r1, var2_offset_within_MergedGlobal]
load [r1, var3_offset_within_MergedGlobal]

Refer to the chart below as well,

page 1                     page 2

r0
------------------------------r1-----------| // merged global data structure
----------------------------s-------e
var1_offset_within_MergedGlobal = e - s

(MergedGlobal@PAGE_OFF + var1_offset_within_MergedGlobal) > PageSize

2) Don't use a single register to describe the base address of the merged global data structure at all. If we can guarantee that the merged global data structure always lies within a page and doesn't cross a page boundary, then the offset from the page boundary is always smaller than a page, so at compile time the offset from the page boundary can be fused into the load/store instructions directly. Setting the alignment to the maximum merged data structure size guarantees that the merged data structure doesn't cross a page boundary. The disadvantage of this solution is that we would probably increase run-time memory consumption, because there might be some "holes" in the data section. With this solution, we would end up with the following instruction sequence:

adrp r0, MergedGlobal@PAGE
load [r0, var1_offset_within_PAGE]
load [r0, var2_offset_within_PAGE]
load [r0, var3_offset_within_PAGE]

Refer to the chart below as well,

page 1                     page 2

r0
--------------------|----------| // merged global data structure
s----------------------e
var1_offset_within_PAGE = e - s

3) Use a link-time solution to help remove the ADRP and ADD. Link-time optimization might not be free, because the dead instructions can only be replaced with NOP instructions.

Both solution 1) and 2) can remove some ADRPs statically. And the ADD introduced in solution 1) could probably be reduced by solution 3) as well.

Do you think the disadvantage with solution 2) is unacceptable?

Thanks,
-Jiangning

Hi Quentin,

I just realized I probably misunderstood your point about using the data layout. I
think you are correct that the merged size can't simply be accumulated from the
individual global variable sizes, because of the natural alignment requirements
inside the merged data structure.

Sorry about that. I will change my code soon.

Thanks,
-Jiangning

Hi Jiangning,

I think you are correct the merged size can't be simply accumulated by individual global variable size, because of natural alignment requirement inside the merged data structure.

The size is fine, we accumulate for AllocSize, i.e., with padding and such.
What I meant is the alignment of the whole structure does not look reasonable :).

To be more specific, I think we should set the alignment of the merge structure to DL->getABIAlignment(MergedTy), or something like that.

The example I had in mind was:
@a1 = global i32 align 4
@a2 = global i32 align 4
@a3 = global i32 align 4
@a4 = global i32 align 4

=>

struct MergeGlobal {
i32, i32, i32, i32
}
With the current approach we would get an alignment of 4*4 = 16, whereas 4 may have been enough. Anyhow, this depends on the data layout.
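
As a worked check of the numbers in this example, here is a minimal standalone C++ sketch (not LLVM code). The names `Member`, `allocSize`, `naturalAlignment`, and `sizeBasedAlignment` are hypothetical, and the layout rule is an assumption: each member is placed at the next multiple of its own alignment, as a typical C-like data layout would do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one global to be merged (sizes in bytes).
struct Member {
  uint64_t Size;
  uint64_t Align; // assumed to be a power of two
};

// Round Value up to the next multiple of Align.
uint64_t alignTo(uint64_t Value, uint64_t Align) {
  assert(Align != 0 && (Align & (Align - 1)) == 0 && "power of two expected");
  return (Value + Align - 1) & ~(Align - 1);
}

// Smallest power of two that is >= Value.
uint64_t roundUpToPowerOfTwo(uint64_t Value) {
  assert(Value != 0 && Value <= (uint64_t(1) << 63));
  uint64_t Result = 1;
  while (Result < Value)
    Result <<= 1;
  return Result;
}

// Natural alignment of the merged struct: the largest member alignment.
uint64_t naturalAlignment(const std::vector<Member> &Members) {
  uint64_t Align = 1;
  for (const Member &M : Members)
    Align = std::max(Align, M.Align);
  return Align;
}

// Allocation size of the merged struct, including padding.
uint64_t allocSize(const std::vector<Member> &Members) {
  uint64_t Offset = 0;
  for (const Member &M : Members)
    Offset = alignTo(Offset, M.Align) + M.Size;
  return alignTo(Offset, naturalAlignment(Members));
}

// Alignment derived from the struct size, rounded up to a power of two.
uint64_t sizeBasedAlignment(const std::vector<Member> &Members) {
  return roundUpToPowerOfTwo(allocSize(Members));
}
```

For four `{4, 4}` members (the four i32 globals above), `naturalAlignment` gives 4 while `sizeBasedAlignment` gives 16, which is the gap being discussed.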

Thanks for looking into it.

-Quentin

Hi Quentin,

I thought I misunderstood you, but looking at your explanation below, it
seems I wasn't. See my answer below,

2014-04-25 1:06 GMT+08:00 Quentin Colombet <qcolombet@apple.com>:

Hi Jiangning,

I think you are correct the merged size can't be simply accumulated by individual global variable size, because of natural alignment requirement inside the merged data structure.

The size is fine, we accumulate for AllocSize, i.e., with padding and such.
What I meant is the alignment of the whole structure does not look
reasonable :).

To be more specific, I think we should set the alignment of the merge
structure to DL->getABIAlignment(MergedTy), or something like that.

The example I had in mind was:
@a1 = global i32 align 4
@a2 = global i32 align 4
@a3 = global i32 align 4
@a4 = global i32 align 4

=>

struct MergeGlobal {
i32, i32, i32, i32
}
With the current approach we would get an alignment of 4*4 = 16, whereas 4
may have been enough. Anyhow, this depends on the data layout.

No. I think it should be 16; otherwise this MergeGlobal may cross a page
boundary, and the offset could not be fused into the load/store
instructions at compile time.

If we set the alignment to 16, we can guarantee at compile time
that MergeGlobal is always within a page. The load/store instruction
requires that the offset be no larger than an immediate that can index
4096 elements. Conservatively, we can treat this as a page limit,
although in reality that only holds for byte elements.

Could you please take another look at my last reply at
http://reviews.llvm.org/D3432#14, which contains the summary of the three
different solutions? My choice is solution 2). I drew some text charts to
elaborate the idea. For your example, I'm hoping we can generate a single
ADRP and no ADD instruction at all at compile time.

Thanks,
-Jiangning

http://reviews.llvm.org/D3432

Hi Jiangning,

Ok, I see your point now.
This leads me to the following question: why don’t we do the same for internal globals?

Anyway, you should comment why we are setting this value for the alignment :).

When that is fixed, I can give it a shot.

Thanks,
-Quentin

Hi Jiangning,

On second thought, I do not think approach #2 is reasonable.
In your example:
adrp r0, MergedGlobal@PAGE
load [r0, var1_offset_within_PAGE]
load [r0, var2_offset_within_PAGE]
load [r0, var3_offset_within_PAGE]

Is in fact, syntax A
adrp r0, MergedGlobal@PAGE
load [r0, MergedGlobal@PAGEOFF+var1_offset]
load [r0, MergedGlobal@PAGEOFF+var2_offset]
load [r0, MergedGlobal@PAGEOFF+var3_offset]

Or put another way, syntax B
adrp r0, MergedGlobal@PAGE
load [r0, MergedGlobal_var1@PAGEOFF]
load [r0, MergedGlobal_var2@PAGEOFF]
load [r0, MergedGlobal_var3@PAGEOFF]

For syntax A to be valid, you need to be sure that MergedGlobal@PAGEOFF+varX_offset fits the encoding space.
For syntax B to be valid, you need to be sure that MergedGlobal@PAGE == MergedGlobal_varX@PAGE.

Thus, in both cases, this is not generally correct.
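
The two validity conditions can be written down as predicates. This is a minimal sketch, not backend code: `pageOf`, `pageOffOf`, `syntaxAValid`, and `syntaxBValid` are hypothetical names, the page size is the 4 KiB granule that ADRP works with, and the immediate limit for syntax A is conservatively assumed to be one page of bytes.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t PageSize = 4096; // ADRP addresses 4 KiB pages

// What @PAGE and @PAGEOFF resolve to for a given absolute address.
constexpr uint64_t pageOf(uint64_t Addr) { return Addr & ~(PageSize - 1); }
constexpr uint64_t pageOffOf(uint64_t Addr) { return Addr & (PageSize - 1); }

// Syntax A: MergedGlobal@PAGEOFF + VarOffset must fit the immediate field
// (assumed here to be a byte-granular limit of one page).
constexpr bool syntaxAValid(uint64_t Base, uint64_t VarOffset) {
  return pageOffOf(Base) + VarOffset < PageSize;
}

// Syntax B: the field must live in the same page as the start of MergedGlobal.
constexpr bool syntaxBValid(uint64_t Base, uint64_t VarOffset) {
  return pageOf(Base) == pageOf(Base + VarOffset);
}
```

With a 16-byte MergedGlobal placed at the 4-byte-aligned address 0x1FF8, the field at offset 4 passes both checks, but the field at offset 12 lands at 0x2004 in the next page, so both `syntaxAValid(0x1FF8, 12)` and `syntaxBValid(0x1FF8, 12)` are false. Under these assumptions the two predicates always agree.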

To increase the likelihood of this being correct, you’ll need to align the thing on a PAGE, and make sure that the reachable fields are within the encoding space (this should be fine, because of the get max offset thing). I’d say that is not desirable.

Therefore, I think we need to go for approach #1.
That said, if we can set an alignment that makes sense, that sounds good. In fact, for external globals, we may have to produce an accurate alignment for external users to get what they expect based on the original alignment of each field.

What do you think?

Thanks,
-Quentin

Hi Quentin,

Is in fact, syntax A
adrp r0, MergedGlobal@PAGE
load [r0, MergedGlobal@PAGEOFF+var1_offset]
load [r0, MergedGlobal@PAGEOFF+var2_offset]
load [r0, MergedGlobal@PAGEOFF+var3_offset]

Or put another way, syntax B
adrp r0, MergedGlobal@PAGE
load [r0, MergedGlobal_var1@PAGEOFF]
load [r0, MergedGlobal_var2@PAGEOFF]
load [r0, MergedGlobal_var3@PAGEOFF]

For syntax A to be valid, you need to be sure that MergedGlobal@PAGEOFF+varX_offset fits the encoding space.
For syntax B to be valid, you need to be sure that MergedGlobal@PAGE == MergedGlobal_varX@PAGE.

Thus, in both cases, this is not generally correct.

If we don't set MergedGlobal's alignment to RoundUpToPowerOfTwo(sizeof(MergedGlobal)), both syntax A and B are incorrect.

But if we set MergedGlobal's alignment to RoundUpToPowerOfTwo(sizeof(MergedGlobal)), both syntax A and B should be correct. That is, for syntax A, it guarantees (MergedGlobal@PAGEOFF+varX_offset)<4096, which meets the encoding requirement of the load/store instructions. For syntax B, it guarantees that MergedGlobal@PAGE==MergedGlobal_varX@PAGE is true.
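
The claim can be checked by brute force for small cases. The sketch below is an illustration only, with hypothetical helper names, and it assumes a 4096-byte page and a merged structure no larger than one page; for larger structures the guarantee cannot hold.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t PageSize = 4096;

// Smallest power of two that is >= Value.
uint64_t roundUpToPowerOfTwo(uint64_t Value) {
  assert(Value != 0 && Value <= (uint64_t(1) << 63));
  uint64_t Result = 1;
  while (Result < Value)
    Result <<= 1;
  return Result;
}

// True if the byte range [Base, Base + Size) lies within a single page.
bool withinOnePage(uint64_t Base, uint64_t Size) {
  assert(Size != 0);
  return (Base / PageSize) == ((Base + Size - 1) / PageSize);
}

// Try every Align-aligned placement in the first four pages and report
// whether a Size-byte object ever straddles a page boundary.
bool neverCrossesPage(uint64_t Size, uint64_t Align) {
  assert(Align != 0);
  for (uint64_t Base = 0; Base < 4 * PageSize; Base += Align)
    if (!withinOnePage(Base, Size))
      return false;
  return true;
}
```

For example, `neverCrossesPage(16, 16)` holds, while `neverCrossesPage(16, 4)` fails at base 4088. The argument behind the positive case is that a power-of-two alignment of at most 4096 divides the page size, so each aligned slot sits entirely inside one page.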

To increase the likelihood of this being correct, you’ll need to align the thing on a PAGE, and make sure that the reachable fields are within the encoding space (this should be fine, because of the get max offset thing). I’d say that is not desirable.

We don't really need to align MergedGlobal to a PAGE; aligning MergedGlobal to RoundUpToPowerOfTwo(sizeof(MergedGlobal)) is enough.
If you don't think this is true, can you give me a counterexample?

Thanks,
-Jiangning

Hi Jiangning,

If we don't set MergedGlobal's alignment to RoundUpToPowerOfTwo(sizeof(MergedGlobal)), both syntax A and B are incorrect.

Yes, they are incorrect and that’s why I said we should go for your first approach:
adrp r0, MergedGlobal@PAGE
add r0, r0, MergedGlobal@PAGEOFF
load [r0, #var1_offset]
load [r0, #var2_offset]
load [r0, #var3_offset]

But if we set MergedGlobal's alignment to RoundUpToPowerOfTwo(sizeof(MergedGlobal)), both syntax A and B should be correct.

Yes, but my point is, we do not want to have this big alignment. Especially because it may make sense *only* for ARM64. Moreover, the fact that this alignment works is because both ADD and LDR have the same encoding space for the immediate. In my opinion this is a lot of assumptions that would be hard to match for all targets to make that a desirable change.

Therefore, I think we should go for your approach #1 and set a *reasonable* alignment based on the data layout.
Otherwise, you will have to demonstrate that this big alignment is desirable for all targets.

Thanks,
-Quentin

Hi Quentin,

But if we set MergedGlobal's alignment to RoundUpToPowerOfTwo(sizeof(MergedGlobal)), both syntax A and B should be correct.

Yes, but my point is, we do not want to have this big alignment. Especially because it may make sense *only* for ARM64. Moreover, the fact that this alignment works is because both ADD and LDR have the same encoding space for the immediate. In my opinion this is a lot of assumptions that would be hard to match for all targets to make that a desirable change.

I see your point now. I think you are correct, and there is an assumption here. What if we create a target-specific hook and allow the target/back-end to choose the alignment between RoundupToPowerOfTwo(sizeof(MergedGlobal)) and NaturalAlignment(MergedGlobal)?

Thanks,
-Jiangning

Hi Jiangning,

A target hook seems like a good idea.
I would be more aggressive on its scope though.

What about we create a target specific hook and allow target/back-end to choose the alignment between RoundupToPowerOfTwo(sizeof(MergedGlobal)) and NaturalAlignment(MergedGlobal)?

Instead of choosing the alignment between RoundupToPowerOfTwo(sizeof(MergedGlobal)) and NaturalAlignment(MergedGlobal), what about choosing an alignment, period :).
We could then assert that the returned alignment is at least equal to the natural alignment. The default implementation of the target hook could just return the natural alignment.
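
The shape of such a hook can be sketched in standalone C++. This is only an illustration of the idea, not the LLVM interface: `MergeAlignmentHook`, `PowerOfTwoSizeHook`, `getMergedGlobalAlignment`, and `chooseAlignment` are hypothetical names, and the real hook may take different parameters.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a target interface; not the real LLVM class.
struct MergeAlignmentHook {
  virtual ~MergeAlignmentHook() = default;

  // Default implementation: just return the natural alignment.
  virtual uint64_t getMergedGlobalAlignment(uint64_t StructSize,
                                            uint64_t NaturalAlign) const {
    (void)StructSize; // unused by the default
    return NaturalAlign;
  }
};

// Hypothetical target that prefers the size rounded up to a power of two.
struct PowerOfTwoSizeHook : MergeAlignmentHook {
  uint64_t getMergedGlobalAlignment(uint64_t StructSize,
                                    uint64_t NaturalAlign) const override {
    uint64_t Align = 1;
    while (Align < StructSize)
      Align <<= 1;
    return Align < NaturalAlign ? NaturalAlign : Align;
  }
};

// What the merge pass would do: query the hook, then check the contract.
uint64_t chooseAlignment(const MergeAlignmentHook &Hook, uint64_t StructSize,
                         uint64_t NaturalAlign) {
  uint64_t Align = Hook.getMergedGlobalAlignment(StructSize, NaturalAlign);
  assert(Align >= NaturalAlign && "hook must not under-align the merged global");
  return Align;
}
```

With this split, targets that do not care keep the natural alignment for free, and the assertion in the caller catches any override that returns less than the natural alignment.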

What do you think?

Thanks,
-Quentin

Hi Quentin,

Sorry for my late response! I was on vacation and busy with other tasks.

I think I totally agree with you, so I added a target hook returning the alignment required by global merge on external, and in global merge pass there is an assertion to make sure the alignment returned by target hook always meets natural alignment requirement.

Thanks,
-Jiangning

The changes in this version are mainly the following, and nothing else.

Thanks,
-Jiangning

include/llvm/Target/TargetLowering.h
889	This is the newly added target hook that returns the alignment required by each target.
lib/Target/ARM64/ARM64ISelLowering.cpp
640	The specific version for ARM64 returns RoundupToPowerOf2(sizeof(MergedGlobal)).

Hi Jiangning,

Thanks for the update.

I’ll give a shot at the patch as it is and get back to you with performance numbers.

Cheers,

-Quentin

lib/Transforms/Scalar/GlobalMerge.cpp
225	Any reason why we do not want to set an alignment for the internal too?

Hi Jiangning,

I've stopped the benchmarking because I am seeing compiler crashes, with the current patch, when the merge on external is enabled:
Assertion failed: (Globals[i]->hasExternalLinkage() == IsExternal), function doMerge, file lib/Transforms/Scalar/GlobalMerge.cpp, line 199.

The following tests do not compile:
External/SPEC/CINT2000/255.vortex/255.vortex
External/SPEC/CINT2006/445.gobmk/445.gobmk
External/SPEC/CINT95/147.vortex/147.vortex
MultiSource/Applications/oggenc/oggenc
MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame

I am attaching the reduced test case from 445.gobmk.

reduced_testcase.ll (158 KB)

To reproduce:
llc -global-merge-on-external=true reduced_testcase.ll

Could you have a look please?

As soon as this is fixed, I can start over the benchmarking.

Thanks,

Quentin

Hi Quentin,

Sorry for introducing those failures. They are all my fault, and the new patch should fix all of these issues.

There was a logic error around the original assertion: it only checked the first global of each set of globals to be merged. This error was actually introduced by the first refactoring patch.

Anyway, this new patch fixes the following:

The compilation failures you mentioned. A new test case global-merge-3.ll is added to detect this regression. This test intends to mix the "internal global" and "global", and test multiple merged globals.

If "internal globals" and "globals" are mixed in the same merged global, we will use the name of the first "global" as the suffix of the newly created merged global variable.

Make "internal globals" also use the alignment target hook. Yes, you are right, we can also apply this special hook to internal globals. I didn't do this previously because I didn't want to mix the globals that expose names to other symbols with the internal globals, for which we don't need to expose names to other modules at all. I also think applying this alignment hook to "internal globals" should be orthogonal to -global-merge-on-external, so I have now introduced another command line switch, -global-merge-aligned, i.e.

-global-merge-on-external: control if globals excluding "internal globals" will be merged.
-global-merge-aligned: control if the special alignment will be applied to merged globals.

This patch is also formatted with git-clang-format.

Thanks,
-Jiangning

Hi Jiangning,

also I think applying this alignment hook to "internal globals" should be orthogonal with -global-merge-on-external

Makes sense.

I introduced another command line switch -global-merge-aligned.

That’s a good idea, thanks!

That said, shouldn’t we set the natural alignment for external globals by default? I.e., I guess externals uses will assume that at least the natural alignment is set.

Anyway, I’ll give a shot to the new patch.

Thanks,
-Quentin

Hi Quentin,

I introduced another command line switch -global-merge-aligned.

That’s a good idea, thanks!

That said, shouldn’t we set the natural alignment for external globals by default? I.e., I guess externals uses will assume that at least the natural alignment is set.

I think if I don't explicitly set the alignment, it should default to the natural alignment for both internal and external global variables, right? The original code for internal globals doesn't explicitly set alignment either.

Now I've got an A57/A53 big.LITTLE board, so I ran some experiments, and the data shows there isn't really any performance impact for spec2000 int on either A57 or A53 (I believe you understand I can't post the performance data directly here due to legal issues), although I do see the following adrp instruction reductions for ARM64:

bzip2 0.28%
crafty 1.20%
vpr 0.00%
vortex 4.83%
eon 0.05%
mcf 0.00%
gzip 3.04%
perlbmk 0.02%
parser 4.49%
twolf 0.00%
gap 2.08%
gcc 1.39%

I think one of the reasons might be that this optimization only affects global symbol addresses, which are essentially loop invariants and should always be hoisted out of hot loops, so at least for spec2000 we wouldn't be able to see a performance impact.

Once the pseudo instruction MOVAddr issue is solved for ARM64, there might be more adrp instruction reduction, but even after that change I would not expect a performance impact for spec2000, because my measurements for AArch64 didn't show a performance impact for spec2000 either.

As far as this optimization itself is concerned, I now think code size reduction could be the key benefit, and it wouldn't hurt performance.

Considering this patch has been posted for almost 4 weeks and we are approaching the deadline for migrating AArch64 to ARM64, do you agree to commit this patch first? If this can happen before AArch64 is phased out, it will save us a lot of rebasing effort, because I think the final switch from AArch64 to ARM64, and the renaming of ARM64 to AArch64, will be a significant change on TOT.

Anyway, I would be extremely appreciative if you can understand the current situation and agree to upstream this patch first.

Thanks,
-Jiangning

Hi Jiangning,

Sorry for the delay in getting back to you, but the results are not what I expected and I had to double check them.
So, the short story is, I am seeing a lot of regressions, although the number of ADRP/ADR instructions decreased a lot in the final binaries.

Note: for the performance numbers, I have filtered out the tests that run for less than a second and the tests that have similar performance:
Columns:

Reference: Global merge on external disabled.
Test: Global merge on external enabled as well as the alignment thing.
Expansion: Test/Reference (smaller is better).
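
For reference, the Expansion and Percent columns can be recomputed from the Reference and Test columns. This is a minimal sketch with hypothetical function names; rounding to the nearest whole percent is an assumption that happens to match the rows spot-checked here (for instance the 164.gzip and IRSmk rows at O3).

```cpp
#include <cassert>
#include <cmath>

// Expansion as defined above: Test / Reference (smaller is better).
double expansion(double Reference, double Test) {
  assert(Reference > 0.0);
  return Test / Reference;
}

// Percent change relative to Reference, rounded to the nearest whole percent.
long percentChange(double Reference, double Test) {
  return std::lround((expansion(Reference, Test) - 1.0) * 100.0);
}
```

For example, `percentChange(22.8601, 23.2677)` evaluates to 2 and `percentChange(16.6782, 16.5785)` to -1, so a positive percentage means the Test configuration ran slower.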

Performances

O3

Benchmark_ID              Reference  Test     Expansion  Percent
ASC_Sequoia/IRSmk/IRSmk   16.6782    16.5785  0.99       -1%
Adobe-C++/loop_unroll     4.9308     5.0514   1.02       +2%
BenchmarkGame/n-body      2.38       2.3966   1.01       +1%
CFP2000/177.mesa/177.me   4.0972     4.1418   1.01       +1%
CINT2000/164.gzip/164.g   22.8601    23.2677  1.02       +2%
CINT2000/186.crafty/186   9.1118     9.2452   1.01       +1%
CINT2000/253.perlbmk/25   13.5299    13.6566  1.01       +1%
CINT2000/254.gap/254.ga   3.8361     3.8128   0.99       -1%
CINT2000/255.vortex/255   4.8632     5.0017   1.03       +3%
CINT2006/403.gcc/403.gc   3.251      3.2304   0.99       -1%
CINT2006/458.sjeng/458.   8.8006     8.7538   0.99       -1%
CINT2006/471.omnetpp/47   1.3495     1.323    0.98       -2%
McGill/queens             4.2418     4.2123   0.99       -1%
Misc-C++/Large/ray        5.2962     5.3265   1.01       +1%
Olden/power/power         2.3012     2.3335   1.01       +1%
SIBsim4/SIBsim4           5.6249     5.5846   0.99       -1%
Shootout-C++/ary3         2.1923     2.2114   1.01       +1%
Shootout-C++/lists1       1.4118     1.4452   1.02       +2%
VersaBench/8b10b/8b10b    13.3247    12.9642  0.97       -3%
VersaBench/ecbdes/ecbde   4.9098     4.9364   1.01       +1%
aha/aha                   4.2497     4.3452   1.02       +2%
lambda-0.1.3/lambda       8.8454     8.7103   0.98       -2%
mafft/pairlocalalign      59.696     57.5007  0.96       -4%
siod/siod                 3.4691     3.4929   1.01       +1%

Min (24)                  -          -        0.96       -
Max (24)                  -          -        1.03       -
Sum (24)                  211        210      0.99       +1%
A.Mean (24)               -          -        1          +0%
G.Mean 2 (24)             -          -        1          +0%

Overall there are more regressions (14) than improvements (10) and on average it is neutral.

Os

Benchmark_ID              Reference  Test     Expansion  Percent
Adobe-C++/loop_unroll     5.0411     5.168    1.03       +3%
CINT2000/164.gzip/164.g   22.5516    23.0424  1.02       +2%
CINT2000/186.crafty/186   9.5044     9.6343   1.01       +1%
CINT2006/400.perlbench/   15.0506    15.1874  1.01       +1%
CINT2006/456.hmmer/456.   6.088      6.1583   1.01       +1%
CINT2006/462.libquantum   2.6566     2.678    1.01       +1%
McGill/queens             4.2193     4.186    0.99       -1%
Misc-C++/Large/ray        6.5632     6.6263   1.01       +1%
Olden/power/power         3.4379     3.2233   0.94       -6%
Polybench/stencils/fdtd   3.6964     3.718    1.01       +1%
Ptrdist/ft/ft             2.3729     2.3966   1.01       +1%
SIBsim4/SIBsim4           5.6219     5.6644   1.01       +1%
Trimaran/enc-3des/enc-3   3.7445     3.7256   0.99       -1%
Trimaran/enc-pc1/enc-pc   1.6924     1.7176   1.01       +1%
VersaBench/8b10b/8b10b    13.4432    13.1539  0.98       -2%
lambda-0.1.3/lambda       8.681      9.1763   1.06       +6%
mafft/pairlocalalign      59.1719    58.6067  0.99       -1%
povray                    6.2992     6.4061   1.02       +2%
siod/siod                 4.2536     4.3336   1.02       +2%

Min (19)                  -          -        0.94       -
Max (19)                  -          -        1.06       -
Sum (19)                  184        185      1          +0%
A.Mean (19)               -          -        1.01       +1%
G.Mean 2 (19)             -          -        1.01       +1%

Overall there are more regressions (14) than improvements (5) and on average it is a regression.

Static Count of ADRP/ADR

This is the number of ADRP/ADR instructions in the final binaries. In other words, the linker optimizations already took place.

O3

Min (90)                  -          -        0.3        -
Max (90)                  -          -        1.11       -
Sum (90)                  160078     152264   0.95       +5%
A.Mean (90)               -          -        0.88       -12%
G.Mean 2 (90)             -          -        0.86       -14%

Here are the details for the regressions:

Benchmark_ID              Reference  Test     Expansion  Percent
C/Output/globalrefs.sim   9          10       1.11       +11%
CFP2006/444.namd/Output   431        444      1.03       +3%
Misc/Output/flops-1.sim   16         17       1.06       +6%
Misc/Output/flops-3.sim   15         16       1.07       +7%
Misc/Output/flops-8.sim   15         16       1.07       +7%
Shootout-C++/EH/Output/   16         17       1.06       +6%

Min (6)                   -          -        1.03       -
Max (6)                   -          -        1.11       -
Sum (6)                   502        520      1.04       -3%
A.Mean (6)                -          -        1.07       +7%
G.Mean 2 (6)              -          -        1.07       +7%

Os

Min (87)                  -          -        0.3        -
Max (87)                  -          -        1.11       -
Sum (87)                  101239     96762    0.96       +5%
A.Mean (87)               -          -        0.89       -11%
G.Mean 2 (87)             -          -        0.86       -14%

Here are the details for the regressions:

Benchmark_ID              Reference  Test     Expansion  Percent
C/Output/globalrefs.sim   9          10       1.11       +11%
CFP2006/444.namd/Output   349        385      1.1        +10%
Misc/Output/flops-1.sim   16         17       1.06       +6%
Misc/Output/flops-3.sim   15         16       1.07       +7%
Misc/Output/flops-8.sim   15         16       1.07       +7%
Shootout-C++/EH/Output/   10         11       1.1        +10%

Min (6)                   -          -        1.06       -
Max (6)                   -          -        1.11       -
Sum (6)                   414        455      1.1        -9%
A.Mean (6)                -          -        1.09       +9%
G.Mean 2 (6)              -          -        1.08       +8%

Long Story

After the first run, a lot of applications were regressing, so I suspected that my device had somehow gotten noisy. I reran all the regressions/improvements 10 times and observed the standard deviation (SD). The SD was close to 0 percent for most tests, i.e., the regressions/improvements were real.

I investigated a few of them and figured out that it was our lowering using pseudo instructions for ADRP that was biting us. The short story is that this was producing redundant ADRPs, as you initially saw when you started to work on that patch.
I thought I had fixed the problem with a quick patch prior to the first round of experiments, but apparently it was not sufficient.

Anyway, I have made another quick patch where I basically blocked any folding of those instructions, thus maximizing the reuse of ADRPs. This is the current baseline.
However, as the numbers show, even if we remove a lot of ADRPs, there is something else going on that makes this worthless.
I have to dig into that to see what we can do.

In the meantime, I can give you the raw numbers if you want to gather your own statistics.

Thanks,

-Quentin

I think if I don't explicitly set the alignment, it should default to the natural alignment for both internal and external global variables, right?

I am not sure. IIRC, no alignment means 1-byte aligned.

The original code for internal globals doesn't explicitly set alignment either.

This is not a problem because the module will see that, as all the globals are locally known. But for externals the problem is different, as other modules will not see the new alignment and will thus assume whatever the front-end or ABI tells the backend.
I feel like there is something wrong here, but I may just be paranoid. I wanted to let you know so you can double check :).

Now I've got an A57/A53 big.LITTLE board, so I ran some experiments, and the data shows there isn't really any performance impact for spec2000 int on either A57 or A53 (I believe you understand I can't post the performance data directly here due to legal issues), although I do see the following adrp instruction reductions for ARM64:

I can confirm the ADRP reduction for CINT2000; here are the numbers for O3:

Benchmark_ID                 Reference   Test   Expansion   Percent
./164.gzip/Output/164.g            655    632        0.96       -4%
./175.vpr/Output/175.vp           1678   1678        1          +0%
./176.gcc/Output/176.gc          14838  14701        0.99       -1%
./181.mcf/Output/181.mc             39     39        1          +0%
./186.crafty/Output/186           3922   3896        0.99       -1%
./197.parser/Output/197           1398   1366        0.98       -2%
./252.eon/Output/252.eo           1162   1160        1          +0%
./253.perlbmk/Output/25           9446   9444        1          +0%
./254.gap/Output/254.ga           6058   5988        0.99       -1%
./255.vortex/Output/255           7811   7566        0.97       -3%
./256.bzip2/Output/256.            590    590        1          +0%
./300.twolf/Output/300.           3244   3244        1          +0%

Min (12)                             -      -        0.96        -
Max (12)                             -      -        1           -
Sum (12)                         50841  50304        0.99       +1%
A.Mean (12)                          -      -        0.99       -1%
G.Mean 2 (12)                        -      -        0.99       -1%
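For readers who want to recompute the columns: the Expansion and Percent values appear to be Test/Reference and the relative change rounded to the nearest percent. That reading is an assumption inferred from the rows themselves, not something stated in the thread, and the helper names below are made up for illustration. A minimal sketch:

```cpp
#include <cassert>
#include <cmath>

// Expansion column (assumed definition): Test count divided by Reference count.
inline double expansion(double Reference, double Test) {
  assert(Reference > 0 && "Reference count must be positive");
  return Test / Reference;
}

// Percent column (assumed definition): relative change from Reference to Test,
// rounded to the nearest whole percent.
inline long percentChange(double Reference, double Test) {
  assert(Reference > 0 && "Reference count must be positive");
  return std::lround((Test - Reference) / Reference * 100.0);
}
```

With the 164.gzip row, expansion(655, 632) is about 0.965 and percentChange(655, 632) is -4. Applied to the Sum row (50841 and 50304), the same formula gives about -1%.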

However, I do see some performance regressions (see my previous post).

Anyway, I would be extremely appreciative if you can understand the current situation and agree to upstream this patch first.

I am fine with that assuming:

This is disabled by default for the arm64-apple-* triple.
You have double-checked the alignment stuff :).

Thanks,
-Quentin

Hi Quentin,

Thanks a lot for your performance data; I am also surprised by those regressions. But I can reproduce the regressions on ARM64 for the benchmarks below,

CINT2000/164.gzip/164.g 22.8601 23.2677 1.02 +2%
CINT2000/255.vortex/255 4.8632 5.0017 1.03 +3%

If I don't have other higher priority tasks, I will have a look. But it would be much better if you could share your patch that fixes the MOVAddr issue with me.

Is there any possibility that, after this optimization, the live range of the base address is extended and RA ends up spilling/filling in a hot loop? If this is the case, the rematerialization needs to be tuned, I think.

Anyway, I would be extremely appreciative if you can understand the current situation and agree to upstream this patch first.

I am fine with that assuming:

This is disabled by default for the arm64-apple-* triple.

Actually both -global-merge-on-external and -global-merge-aligned are false by default, so if the end user doesn't enable them we won't see any performance impact. Therefore, they are disabled by default for all targets.

Actually I was hoping your patch can get it enabled at least for ARM64. :-)

You have double-checked the alignment stuff :).

I checked the alignment behavior. If we don't set alignment, which is also the default behavior with/without my patch, the following code in AsmPrinter will be effective,

// If the alignment is specified, we *must* obey it.  Overaligning a global
// with a specified alignment is a prompt way to break globals emitted to
// sections and expected to be contiguous (e.g. ObjC metadata).
unsigned AlignLog = getGVAlignmentLog2(GV, *DL);

for (const HandlerInfo &HI : Handlers) {
  NamedRegionTimer T(HI.TimerName, HI.TimerGroupName, TimePassesIsEnabled);
  HI.Handler->setSymbolSize(GVSym, Size);
}

// Handle common and BSS local symbols (.lcomm).
if (GVKind.isCommon() || GVKind.isBSSLocal()) {
  if (Size == 0) Size = 1;   // .comm Foo, 0 is undefined, avoid it.
  unsigned Align = 1 << AlignLog;

So that means RoundupToPowerOf2(sizeof(MergedGlobal)) will be used as the alignment.
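As a standalone illustration of that rounding (not the LLVM code itself), here is a minimal sketch. The helper name roundUpToPowerOf2 is hypothetical. Note that llvm::NextPowerOf2 returns a power of two strictly greater than its argument, which is why the getGlobalMergeAlignment hook in this patch only calls it when the size is not already a power of two.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: smallest power of two that is >= Size.
// A size that is already a power of two is returned unchanged, matching the
// `if (Align & (Align - 1))` guard used in the patch.
inline uint64_t roundUpToPowerOf2(uint64_t Size) {
  assert(Size != 0 && Size <= (uint64_t(1) << 63) && "Size out of range");
  uint64_t Pow = 1;
  while (Pow < Size)
    Pow <<= 1;
  return Pow;
}
```

For the 8-byte merged struct in the test case this yields 8, and a 12-byte struct would be rounded up to 16.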

The test case test/CodeGen/AArch64/global-merge.ll, which only uses "internal globals", also shows the output below with llc,

$ ../build/bin/llc -mtriple=arm64-apple-ios -O1 < test/CodeGen/AArch64/global-merge.ll
	.section	__TEXT,__text,regular,pure_instructions
	.globl	_f1
	.align	2
_f1:                                    ; @f1
	.cfi_startproc
; BB#0:
Lloh0:
	adrp	x8, __MergedGlobals@PAGE
Lloh1:
	add	x8, x8, __MergedGlobals@PAGEOFF
	stp	w0, w1, [x8]
	ret
	.loh AdrpAdd	Lloh0, Lloh1
	.cfi_endproc

.zerofill __DATA,__bss,__MergedGlobals,8,3 ; @_MergedGlobals

So actually I'm keeping the original behavior.

Anyway, my patch doesn't change the original semantics at all, as long as -global-merge-on-external and -global-merge-aligned are not enabled.

Are you happy with this?

Thanks,
-Jiangning

Thanks,
-Quentin

Hi Jiangning,

If I don't have other higher priority tasks, I will have a look. But it would be much better if you could share your patch that fixes the MOVAddr issue with me.

Sure! It is more a hack than a patch, but here it is :).

arm64-block-adrp-folding.patch481 BDownload

Is there any possibility that, after this optimization, the live range of the base address is extended and RA ends up spilling/filling in a hot loop? If this is the case, the rematerialization needs to be tuned, I think.

I thought about that and this is possible. Though the current implementation still uses the pseudo instructions, so the rematerialization should kick in. That said, it is also possible that the rematerialization algorithm is not perfect.

Actually I was hoping your patch can get it enabled at least for ARM64. :-)

For ARM64, yes, but for ARM64-apple-* no :). We need to address the regressions first before enabling it by default for our (apple) triple, but if you need it for other OSes, I am fine with that, though I cannot test the performance for those.
We can land the patch with it disabled by default for all targets, then enable it for the targets (triples) we care about along the way.

Thanks for checking the alignment thing.

Yes, I am happy with that :).

Thanks,
-Quentin

Hi Quentin,

I committed this patch as r208934, and will analyze those regressions with your patch.

Thanks,
-Jiangning

2014-05-16 0:57 GMT+08:00 Quentin Colombet <qcolombet@apple.com>:

http://reviews.llvm.org/D3432

I committed this patch as r208934, and will analyze those regressions with your patch.

Thanks!

Let me know if I can help with the regressions.

-Quentin

Mark this as accepted to close it.

-Quentin

This revision is now accepted and ready to land. May 15 2014, 5:02 PM

Closing, as this was committed a while ago (r208934, r210640).

Revision Contents

Path	Size
include/
  llvm/
    IR/
      GlobalAlias.h	4 lines
    Target/
      TargetLowering.h	7 lines
lib/
  CodeGen/
    AsmPrinter/
      AsmPrinter.cpp	7 lines
  IR/
    Globals.cpp	25 lines
  Target/
    AArch64/
      AArch64ISelLowering.h	4 lines
      AArch64ISelLowering.cpp	14 lines
    ARM64/
      ARM64ISelLowering.h	4 lines
      ARM64ISelLowering.cpp	14 lines
  Transforms/
    Scalar/
      GlobalMerge.cpp	85 lines
      Scalar.cpp	1 line
test/
  CodeGen/
    AArch64/
      global-merge.ll	34 lines
      global_merge_1.ll
    ARM/
      global-merge-1.ll	85 lines
    ARM64/
      global-merge.ll	88 lines
  Transforms/
    GlobalMerge/
      AArch64/
        22 lines
        30 lines
        27 lines
        4 lines
      ARM/
        arm.ll	23 lines
      ARM64/
        arm64.ll	21 lines
tools/
  opt/
    opt.cpp

Diff 9301

include/llvm/IR/GlobalAlias.h

Context not available.
	static inline bool classof(const Value *V) {	static inline bool classof(const Value *V) {
	return V->getValueID() == Value::GlobalAliasVal;	return V->getValueID() == Value::GlobalAliasVal;
	}	}

		// return the constant offset of an expression, with which this global var
		// has alias.
		uint64_t calculateOffset(const DataLayout &DL) const;
	};	};

	template <>	template <>
Context not available.

include/llvm/Target/TargetLowering.h

Property	Old Value	New Value
File Mode	100644	100755

Context not available.
	#include "llvm/IR/CallingConv.h"	#include "llvm/IR/CallingConv.h"
	#include "llvm/IR/InlineAsm.h"	#include "llvm/IR/InlineAsm.h"
	#include "llvm/IR/IRBuilder.h"	#include "llvm/IR/IRBuilder.h"
		#include "llvm/IR/GlobalVariable.h"
	#include "llvm/MC/MCRegisterInfo.h"	#include "llvm/MC/MCRegisterInfo.h"
	#include "llvm/Target/TargetCallingConv.h"	#include "llvm/Target/TargetCallingConv.h"
	#include "llvm/Target/TargetMachine.h"	#include "llvm/Target/TargetMachine.h"
Context not available.
	return 0;	return 0;
	}	}

		/// Returns the alignment required by global merge on external symbols.
		/// By default, returns the natural alignment of merged data structure.
		virtual unsigned getGlobalMergeAlignment(StructType *MergedTy) const {
		Jiangning: This is a newly added target hook that returns the alignment required for different targets.
		return getDataLayout()->getABITypeAlignment(MergedTy);
		}

	/// Returns true if a cast between SrcAS and DestAS is a noop.	/// Returns true if a cast between SrcAS and DestAS is a noop.
	virtual bool isNoopAddrSpaceCast(unsigned SrcAS, unsigned DestAS) const {	virtual bool isNoopAddrSpaceCast(unsigned SrcAS, unsigned DestAS) const {
	return false;	return false;
Context not available.

lib/CodeGen/AsmPrinter/AsmPrinter.cpp

Context not available.
	EmitVisibility(Name, Alias.getVisibility());	EmitVisibility(Name, Alias.getVisibility());

	// Emit the directives as assignments aka .set:	// Emit the directives as assignments aka .set:
	OutStreamer.EmitAssignment(Name,	const MCExpr *Expr = MCSymbolRefExpr::Create(Target, OutContext);
	MCSymbolRefExpr::Create(Target, OutContext));	if (uint64_t Offset = Alias.calculateOffset(*TM.getDataLayout()))
		Expr = MCBinaryExpr::CreateAdd(Expr,
		MCConstantExpr::Create(Offset, OutContext), OutContext);
		OutStreamer.EmitAssignment(Name, Expr);
	}	}
	}	}

Context not available.

lib/IR/Globals.cpp

Property	Old Value	New Value
File Mode	100644	100755

Context not available.
	#include "llvm/IR/GlobalValue.h"	#include "llvm/IR/GlobalValue.h"
	#include "llvm/ADT/SmallPtrSet.h"	#include "llvm/ADT/SmallPtrSet.h"
	#include "llvm/IR/Constants.h"	#include "llvm/IR/Constants.h"
		#include "llvm/IR/DataLayout.h"
	#include "llvm/IR/DerivedTypes.h"	#include "llvm/IR/DerivedTypes.h"
	#include "llvm/IR/GlobalAlias.h"	#include "llvm/IR/GlobalAlias.h"
	#include "llvm/IR/GlobalVariable.h"	#include "llvm/IR/GlobalVariable.h"
		t.p.northover: Are these clauses tested? I don't see any aliases or GEPs in your examples. I also vaguely remember Rafael saying that a GEP-based alias wasn't intended to be supported, though I wouldn't swear to it (and I'm fuzzy on exactly why it's a bad idea).
		Jiangning: Tim, yes, these clauses are tested, and the following are the aliases that get checked:
		; CHECK: y = _MergedGlobals_x+4
		; CHECK: z = _MergedGlobals_x+8
		Without this change, the compiler would fail to run the test case global_merge_2.ll. The GEPs come from ".globl y" and ".globl z":
		i32* getelementptr inbounds ({ i32, i32, i32 }* @_MergedGlobals_x, i32 0, i32 1)
		i32* getelementptr inbounds ({ i32, i32, i32 }* @_MergedGlobals_x, i32 0, i32 2)
		I don't know what Rafael said previously. Can you point me to that email thread if you know it? Then I can consider this issue.
Context not available.
	return GV;	return GV;
	}	}
	}	}

		uint64_t GlobalAlias::calculateOffset(const DataLayout &DL) const {
		uint64_t Offset = 0;
		const Constant *C = this;
		while (C) {
		if (const GlobalAlias *GA = dyn_cast<GlobalAlias>(C)) {
		C = GA->getAliasee();
		} else if (const ConstantExpr *CE = dyn_cast<ConstantExpr>(C)) {
		if (CE->getOpcode() == Instruction::GetElementPtr) {
		std::vector<Value*> Args;
		for (unsigned I = 1; I < CE->getNumOperands(); ++I)
		Args.push_back(CE->getOperand(I));
		Offset += DL.getIndexedOffset(CE->getOperand(0)->getType(), Args);
		}
		C = CE->getOperand(0);
		} else if (isa<GlobalValue>(C)) {
		return Offset;
		} else {
		assert(0 && "Unexpected type in alias chain!");
		return 0;
		}
		}
		return Offset;
		}
Context not available.
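To make the offset arithmetic in calculateOffset above concrete, here is a standalone sketch of how a struct field index maps to a byte offset, which is what DataLayout::getIndexedOffset computes for these GEPs. The Field type and the fieldOffset helper are hypothetical and only model size and ABI alignment; they are not LLVM API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a struct member: allocation size and ABI alignment
// in bytes. Align must be a power of two.
struct Field {
  uint64_t Size;
  uint64_t Align;
};

// Byte offset of member Index in a non-packed struct: each member is placed
// at the next offset that satisfies its own alignment.
inline uint64_t fieldOffset(const std::vector<Field> &Fields, size_t Index) {
  assert(Index < Fields.size() && "field index out of range");
  uint64_t Offset = 0;
  for (size_t I = 0; I <= Index; ++I) {
    const uint64_t Mask = Fields[I].Align - 1;
    Offset = (Offset + Mask) & ~Mask; // pad up to this member's alignment
    if (I == Index)
      break;
    Offset += Fields[I].Size;
  }
  return Offset;
}
```

For a merged struct of three i32 members (size 4, alignment 4 each), members 1 and 2 land at offsets 4 and 8, which matches the "y = _MergedGlobals_x+4" and "z = _MergedGlobals_x+8" lines quoted in the inline comment.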

lib/Target/AArch64/AArch64ISelLowering.h

Property	Old Value	New Value
File Mode	100644	100755

Context not available.
	/// be used for loads / stores from the global.	/// be used for loads / stores from the global.
	unsigned getMaximalGlobalOffset() const override;	unsigned getMaximalGlobalOffset() const override;

		/// getGlobalMergeAlignment - Set alignment to be the max size of merged
		/// global variable data structure, and make it aligned up to power of 2.
		unsigned getGlobalMergeAlignment(StructType *MergedTy) const override;

	protected:	protected:
	std::pair<const TargetRegisterClass*, uint8_t>	std::pair<const TargetRegisterClass*, uint8_t>
	findRepresentativeClass(MVT VT) const override;	findRepresentativeClass(MVT VT) const override;
Context not available.

lib/Target/AArch64/AArch64ISelLowering.cpp

Property	Old Value	New Value
File Mode	100644	100755

Context not available.
	return 4095;	return 4095;
	}	}

		/// getGlobalMergeAlignment - Set alignment to be the max size of merged
		/// global variable data structure, and make it aligned up to power of 2.
		/// This way, we could guarantee the merged global variable data structure
		/// doesn't cross page boundary, because usually OS always allocates page at
		/// 4096-byte aligned boundary.
		unsigned AArch64TargetLowering::getGlobalMergeAlignment(
		StructType *MergedTy) const {
		unsigned Align = getDataLayout()->getTypeAllocSize(MergedTy);
		if (Align & (Align - 1))
		Align = llvm::NextPowerOf2(Align);

		return Align;
		}

Context not available.

lib/Target/ARM64/ARM64ISelLowering.h

Property	Old Value	New Value
File Mode	100644	100755

Context not available.
	/// be used for loads / stores from the global.	/// be used for loads / stores from the global.
	unsigned getMaximalGlobalOffset() const override;	unsigned getMaximalGlobalOffset() const override;

		/// getGlobalMergeAlignment - Set alignment to be the max size of merged
		/// global variable data structure, and make it aligned up to power of 2.
		unsigned getGlobalMergeAlignment(StructType *MergedTy) const override;

	/// Returns true if a cast between SrcAS and DestAS is a noop.	/// Returns true if a cast between SrcAS and DestAS is a noop.
	bool isNoopAddrSpaceCast(unsigned SrcAS, unsigned DestAS) const override {	bool isNoopAddrSpaceCast(unsigned SrcAS, unsigned DestAS) const override {
	// Addrspacecasts are always noops.	// Addrspacecasts are always noops.
Context not available.

lib/Target/ARM64/ARM64ISelLowering.cpp

Property	Old Value	New Value
File Mode	100644	100755

Context not available.
	return 4095;	return 4095;
	}	}

		/// getGlobalMergeAlignment - Set alignment to be the max size of merged
		/// global variable data structure, and make it aligned up to power of 2.
		Jiangning: The specific version for ARM64 returns RoundupToPowerOf2(sizeof(MergedGlobal)).
		/// This way, we could guarantee the merged global variable data structure
		/// doesn't cross page boundary, because usually OS always allocates page at
		/// 4096-byte aligned boundary.
		unsigned ARM64TargetLowering::getGlobalMergeAlignment(
		StructType *MergedTy) const {
		unsigned Align = getDataLayout()->getTypeAllocSize(MergedTy);
		if (Align & (Align - 1))
		Align = llvm::NextPowerOf2(Align);

		return Align;
		}

	FastISel *	FastISel *
	ARM64TargetLowering::createFastISel(FunctionLoweringInfo &funcInfo,	ARM64TargetLowering::createFastISel(FunctionLoweringInfo &funcInfo,
	const TargetLibraryInfo *libInfo) const {	const TargetLibraryInfo *libInfo) const {
Context not available.

lib/Transforms/Scalar/GlobalMerge.cpp

Property	Old Value	New Value
File Mode	100644	100755

Context not available.

#define DEBUG_TYPE "global-merge"

static cl::opt<bool>

EnableGlobalMerge("global-merge", cl::Hidden,

EnableGlobalMerge("enable-global-merge", cl::NotHidden,

cl::desc("Enable global merge pass"),

cl::init(true));

Context not available.

cl::desc("Enable global merge pass on constants"),

cl::init(false));

static cl::opt<bool>

EnableGlobalMergeOnExternal("global-merge-on-external", cl::Hidden,

cl::desc("Enable global merge pass on external linkage"),

cl::init(false));

static cl::opt<bool>

EnableGlobalMergeAligned("global-merge-aligned", cl::Hidden,

cl::desc("Set target specific alignment for global merge pass"),

cl::init(false));

STATISTIC(NumMerged , "Number of globals merged");

namespace {

class GlobalMerge : public FunctionPass {

Context not available.

} // end anonymous namespace

char GlobalMerge::ID = 0;

INITIALIZE_PASS(GlobalMerge, "global-merge",

"Global Merge", false, false)

static void *initializeGlobalMergePassOnce(PassRegistry &Registry) {

PassInfo *PI = new PassInfo(

"Merge global variables",

"global-merge", &GlobalMerge::ID,

PassInfo::NormalCtor_t(callDefaultCtor<GlobalMerge>), false,

false, PassInfo::TargetMachineCtor_t(

callTargetMachineCtor<GlobalMerge>));

Registry.registerPass(*PI, true);

return PI;

}

void llvm::initializeGlobalMergePass(PassRegistry &Registry) {

CALL_ONCE_INITIALIZATION(initializeGlobalMergePassOnce)

}

bool GlobalMerge::doMerge(SmallVectorImpl<GlobalVariable*> &Globals,

Module &M, bool isConst, unsigned AddrSpace) const {

qcolombet: Should be hoisted outside of the loop, since we assert this does not change within the loop.

Jiangning: Accepted and fixed in the new version uploaded.

Context not available.

Type *Int32Ty = Type::getInt32Ty(M.getContext());

assert (Globals.size() > 1);

for (size_t i = 0, e = Globals.size(); i != e; ) {

size_t j = 0;

uint64_t MergedSize = 0;

std::vector<Type*> Tys;

std::vector<Constant*> Inits;

bool HasExternal = false;

GlobalVariable *TheFirstExternal = 0;

qcolombet: Ditto.

Jiangning: Accepted and fixed in the new version uploaded.

for (j = i; j != e; ++j) {

Type *Ty = Globals[j]->getType()->getElementType();

MergedSize += DL->getTypeAllocSize(Ty);

Context not available.

}

Tys.push_back(Ty);

Inits.push_back(Globals[j]->getInitializer());

if (Globals[j]->hasExternalLinkage() && !HasExternal) {

HasExternal = true;

TheFirstExternal = Globals[j];

}

qcolombet: This is the alignment thing I was talking about.

Jiangning:

Hi Quentin,

I appreciate your further feedback!

I understand your point. Actually I think we have three solutions to this ADRP issue. Let me summarize them to make sure our discussion is on the same page.

1) Materialize the base address of the merged global data structure in a single register. MOVAddr in ARM64 does just this. This essentially uses "adrp + add" to compute the base of the merged global data structure. This way, the merged global data structure does not need to be aligned to the structure size; natural alignment is enough. The disadvantage of this solution is that at compile time we don't really know whether the offset in the add instruction can be propagated and folded into the load/store instructions, because the combined offset might be larger than a page size. For this solution the instruction sequence could look like the one below (suppose we have three loads):

adrp r0, MergedGlobal@PAGE
add r1, r0, MergedGlobal@PAGE_OFF
load [r1, var1_offset_within_MergedGlobal]
load [r1, var2_offset_within_MergedGlobal]
load [r1, var3_offset_within_MergedGlobal]

Refer to the chart below as well,

page 1                     page 2

r0
------------------------------r1-----------| // merged global data structure
----------------------------s-------e
var1_offset_within_MergedGlobal = e - s

(MergedGlobal@PAGE_OFF + var1_offset_within_MergedGlobal) > PageSize

2) We don't use a single register to describe the base address of the merged global data structure at all. If we can guarantee that the merged global data structure always lies within a page and doesn't cross a page boundary, then the offset from the page boundary is always smaller than a page, so at compile time the offset from the page boundary can be fused into the load/store instructions directly. Setting the alignment to the maximum merged data structure size guarantees that the merged data structure doesn't cross a page boundary. The disadvantage of this solution is that we would probably increase run-time memory consumption, because there might be some "holes" in the data section. With this solution, we would end up with the following instruction sequence:

adrp r0, MergedGlobal@PAGE
load [r0, var1_offset_within_PAGE]
load [r0, var2_offset_within_PAGE]
load [r0, var3_offset_within_PAGE]

Refer to the chart below as well,

page 1                     page 2

r0
--------------------|----------| // merged global data structure
s----------------------e
var1_offset_within_PAGE = e - s

3) We use a link-time solution to help remove ADRP and ADD. Link-time optimization might not be free, because a dead instruction can only be replaced with a NOP instruction.

Both solutions 1) and 2) can remove some ADRPs statically. And the ADD introduced in solution 1) could probably be removed by solution 3) as well.

Do you think the disadvantage with solution 2) is unacceptable?

Thanks,
-Jiangning
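The page-boundary argument behind solution 2) can be checked with a small standalone sketch. It assumes a 4096-byte page, as the patch comment does, and the helper names are hypothetical. It is a brute-force check over a few pages, not a proof and not part of the patch.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t PageSize = 4096; // assumed page size

// True if the byte range [Addr, Addr + Size) spans more than one page.
inline bool crossesPageBoundary(uint64_t Addr, uint64_t Size) {
  assert(Size != 0 && "empty object");
  return Addr / PageSize != (Addr + Size - 1) / PageSize;
}

// Tries every Align-aligned start address in the first four pages and reports
// whether an object of Size bytes ever straddles a page boundary.
inline bool neverCrossesWhenAligned(uint64_t Size, uint64_t Align) {
  assert(Align != 0 && (Align & (Align - 1)) == 0 &&
         "Align must be a power of two");
  for (uint64_t Addr = 0; Addr < 4 * PageSize; Addr += Align)
    if (crossesPageBoundary(Addr, Size))
      return false;
  return true;
}
```

A 12-byte merged struct aligned to 16 never crosses a page in this check, while the same struct with only 4-byte alignment can, for example when it starts at address 4092.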


// If merged variables doesn't have external linkage, we needn't to expose

// the symbol after merging.

GlobalValue::LinkageTypes Linkage = HasExternal ?

GlobalValue::ExternalLinkage :

GlobalValue::InternalLinkage ;

// If merged variables have external linkage, we use symbol name of the

// first variable merged as the suffix of global symbol name. This would

// be able to avoid the link-time naming conflict for globalm symbols.

Twine MergedGVName = HasExternal ?

"_MergedGlobals_" + TheFirstExternal->getName() :

"_MergedGlobals" ;

StructType *MergedTy = StructType::get(M.getContext(), Tys);

Constant *MergedInit = ConstantStruct::get(MergedTy, Inits);

GlobalVariable *MergedGV = new GlobalVariable(M, MergedTy, isConst,

GlobalValue::InternalLinkage,

Linkage, MergedInit, MergedGVName,

MergedInit, "_MergedGlobals",

nullptr, GlobalVariable::NotThreadLocal,

nullptr,

AddrSpace);

GlobalVariable::NotThreadLocal,

AddrSpace);

if (EnableGlobalMergeAligned) {

qcolombet: Any reason why we do not want to set an alignment for the internal too?

unsigned Align = TLI->getGlobalMergeAlignment(MergedTy);

assert(((Align % DL->getABITypeAlignment(MergedTy)) == 0) &&

"Specified alignment doesn't meet natural alignment requirement.");

MergedGV->setAlignment(Align);

}

for (size_t k = i; k < j; ++k) {

GlobalValue::LinkageTypes Linkage = Globals[k]->getLinkage();

std::string Name = Globals[k]->getName();

Constant *Idx[2] = {

ConstantInt::get(Int32Ty, 0),

ConstantInt::get(Int32Ty, k-i)

Context not available.

Constant *GEP = ConstantExpr::getInBoundsGetElementPtr(MergedGV, Idx);

Globals[k]->replaceAllUsesWith(GEP);

Globals[k]->eraseFromParent();

if (Linkage != GlobalValue::InternalLinkage) {

// Generate a new alias...

new GlobalAlias(GEP->getType(), Linkage, Name, GEP, &M);

}

NumMerged++;

}

i = j;

t.p.northover: This might be clearer split up. I know my brain starts to melt when I see massive nested conditionals like this. Perhaps linkage conditions vs intrinsic blockers?

Jiangning: Tim, that's OK, and I've split it up in the new version.

Context not available.

// Grab all non-const globals.

for (Module::global_iterator I = M.global_begin(),

E = M.global_end(); I != E; ++I) {

// Merge is safe for "normal" internal globals only

// Merge is safe for "normal" internal or external globals only

if (!I->hasLocalLinkage() || I->isThreadLocal() || I->hasSection())

if (I->isDeclaration() || I->isThreadLocal() || I->hasSection())

continue;

if (!(EnableGlobalMergeOnExternal && I->hasExternalLinkage())

&& !I->hasInternalLinkage())

continue;

PointerType *PT = dyn_cast<PointerType>(I->getType());

Context not available.
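The two new filter conditions in the hunk above can be read as one predicate. The sketch below restates only the checks visible in this diff; later checks in the loop are elided as "Context not available" and are not modelled. The GlobalInfo struct and the isMergeCandidate name are hypothetical stand-ins for the GlobalVariable queries.

```cpp
// Hypothetical stand-in for the GlobalVariable properties queried in the diff.
struct GlobalInfo {
  bool IsDeclaration;
  bool IsThreadLocal;
  bool HasSection;
  bool HasInternalLinkage;
  bool HasExternalLinkage;
};

// Mirrors the visible checks: skip declarations, thread-local globals and
// globals with an explicit section; then accept internal globals always and
// external globals only when -global-merge-on-external is set.
inline bool isMergeCandidate(const GlobalInfo &G, bool MergeOnExternal) {
  if (G.IsDeclaration || G.IsThreadLocal || G.HasSection)
    return false;
  if (!(MergeOnExternal && G.HasExternalLinkage) && !G.HasInternalLinkage)
    return false;
  return true;
}
```

With the option left at its default of false, only internal definitions pass, which is the behavior from before the patch.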

lib/Transforms/Scalar/Scalar.cpp

Property	Old Value	New Value
File Mode	100644	100755

Context not available.
	initializeDSEPass(Registry);	initializeDSEPass(Registry);
	initializeGVNPass(Registry);	initializeGVNPass(Registry);
	initializeEarlyCSEPass(Registry);	initializeEarlyCSEPass(Registry);
		initializeGlobalMergePass(Registry);
	initializeIndVarSimplifyPass(Registry);	initializeIndVarSimplifyPass(Registry);
	initializeJumpThreadingPass(Registry);	initializeJumpThreadingPass(Registry);
	initializeLICMPass(Registry);	initializeLICMPass(Registry);
Context not available.

test/CodeGen/AArch64/global-merge.ll

This file was added.

				; RUN: llc < %s -mtriple=aarch64-none-linux-gnu -O0 \| FileCheck --check-prefix=NO-MERGE %s
				; RUN: llc < %s -mtriple=aarch64-none-linux-gnu -O0 -global-merge-on-external=true -global-merge-aligned=true \| FileCheck --check-prefix=NO-MERGE %s

				; RUN: llc < %s -mtriple=arm64-apple-ios -O0 \| FileCheck %s --check-prefix=CHECK-APPLE-IOS-NO-MERGE
				; RUN: llc < %s -mtriple=arm64-apple-ios -O0 -global-merge-on-external=true -global-merge-aligned=false \| FileCheck %s --check-prefix=CHECK-APPLE-IOS-NO-MERGE

				; RUN: llc < %s -mtriple=aarch64-none-linux-gnu -O1 \| FileCheck %s
				; RUN: llc < %s -mtriple=aarch64-none-linux-gnu -O1 -global-merge-on-external=true -global-merge-aligned=true \| FileCheck %s

				; RUN: llc < %s -mtriple=arm64-apple-ios -O1 \| FileCheck %s --check-prefix=CHECK-APPLE-IOS
				; RUN: llc < %s -mtriple=arm64-apple-ios -O1 -global-merge-on-external=true -global-merge-aligned=false \| FileCheck %s --check-prefix=CHECK-APPLE-IOS

				@m = internal global i32 0, align 4
				@n = internal global i32 0, align 4

				define void @f1(i32 %a1, i32 %a2) {
				; CHECK-LABEL: f1:
				; CHECK: adrp x{{[0-9]+}}, _MergedGlobals
				; CHECK-NOT: adrp

				; CHECK-APPLE-IOS-LABEL: f1:
				; CHECK-APPLE-IOS: adrp x{{[0-9]+}}, __MergedGlobals
				; CHECK-APPLE-IOS-NOT: adrp
				store i32 %a1, i32* @m, align 4
				store i32 %a2, i32* @n, align 4
				ret void
				}

				; CHECK: .local _MergedGlobals
				; CHECK: .comm _MergedGlobals,8,8
				; NO-MERGE-NOT: .local _MergedGlobals

				; CHECK-APPLE-IOS: .zerofill __DATA,__bss,__MergedGlobals,8,3
				; CHECK-APPLE-IOS-NO-MERGE-NOT: .zerofill __DATA,__bss,__MergedGlobals,8,3

test/CodeGen/AArch64/global_merge_1.ll

This file was deleted.

	; RUN: llc < %s -mtriple=aarch64-none-linux-gnu \| FileCheck %s

	@m = internal global i32 0, align 4
	@n = internal global i32 0, align 4

	define void @f1(i32 %a1, i32 %a2) {
	; CHECK-LABEL: f1:
	; CHECK: adrp x{{[0-9]+}}, _MergedGlobals
	; CHECK-NOT: adrp
	store i32 %a1, i32* @m, align 4
	store i32 %a2, i32* @n, align 4
	ret void
	}

	; CHECK: .local _MergedGlobals
	; CHECK: .comm _MergedGlobals,8,8

test/CodeGen/ARM/global-merge-1.ll

This file was added.

				; RUN: llc %s -O0 -o - \| FileCheck -check-prefix=NO-MERGE %s
				; RUN: llc %s -O0 -o - -enable-global-merge=false \| FileCheck -check-prefix=NO-MERGE %s
				; RUN: llc %s -O0 -o - -enable-global-merge=true \| FileCheck -check-prefix=NO-MERGE %s
				; RUN: llc %s -O1 -o - \| FileCheck -check-prefix=MERGE %s
				; RUN: llc %s -O1 -o - -enable-global-merge=false \| FileCheck -check-prefix=NO-MERGE %s
				; RUN: llc %s -O1 -o - -enable-global-merge=true \| FileCheck -check-prefix=MERGE %s

				; MERGE-NOT: .zerofill __DATA,__bss,_bar,20,2
				; MERGE-NOT: .zerofill __DATA,__bss,_baz,20,2
				; MERGE-NOT: .zerofill __DATA,__bss,_foo,20,2
				; MERGE: .zerofill __DATA,__bss,__MergedGlobals,60,4
				; MERGE-NOT: .zerofill __DATA,__bss,_bar,20,2
				; MERGE-NOT: .zerofill __DATA,__bss,_baz,20,2
				; MERGE-NOT: .zerofill __DATA,__bss,_foo,20,2

				; NO-MERGE-NOT: .zerofill __DATA,__bss,__MergedGlobals,60,4
				; NO-MERGE: .zerofill __DATA,__bss,_bar,20,2
				; NO-MERGE: .zerofill __DATA,__bss,_baz,20,2
				; NO-MERGE: .zerofill __DATA,__bss,_foo,20,2
				; NO-MERGE-NOT: .zerofill __DATA,__bss,__MergedGlobals,60,4

				target datalayout = "e-p:32:32:32-i1:8:32-i8:8:32-i16:16:32-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:32:64-v128:32:128-a0:0:32-n32-S32"
				target triple = "thumbv7-apple-ios3.0.0"

				@bar = internal global [5 x i32] zeroinitializer, align 4
				@baz = internal global [5 x i32] zeroinitializer, align 4
				@foo = internal global [5 x i32] zeroinitializer, align 4

				; Function Attrs: nounwind ssp
				define internal void @initialize() #0 {
				%1 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #3
				store i32 %1, i32* getelementptr inbounds ([5 x i32]* @bar, i32 0, i32 0), align 4, !tbaa !1
				%2 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #3
				store i32 %2, i32* getelementptr inbounds ([5 x i32]* @baz, i32 0, i32 0), align 4, !tbaa !1
				%3 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #3
				store i32 %3, i32* getelementptr inbounds ([5 x i32]* @bar, i32 0, i32 1), align 4, !tbaa !1
				%4 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #3
				store i32 %4, i32* getelementptr inbounds ([5 x i32]* @baz, i32 0, i32 1), align 4, !tbaa !1
				%5 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #3
				store i32 %5, i32* getelementptr inbounds ([5 x i32]* @bar, i32 0, i32 2), align 4, !tbaa !1
				%6 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #3
				store i32 %6, i32* getelementptr inbounds ([5 x i32]* @baz, i32 0, i32 2), align 4, !tbaa !1
				%7 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #3
				store i32 %7, i32* getelementptr inbounds ([5 x i32]* @bar, i32 0, i32 3), align 4, !tbaa !1
				%8 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #3
				store i32 %8, i32* getelementptr inbounds ([5 x i32]* @baz, i32 0, i32 3), align 4, !tbaa !1
				%9 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #3
				store i32 %9, i32* getelementptr inbounds ([5 x i32]* @bar, i32 0, i32 4), align 4, !tbaa !1
				%10 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #3
				store i32 %10, i32* getelementptr inbounds ([5 x i32]* @baz, i32 0, i32 4), align 4, !tbaa !1
				ret void
				}

				declare i32 @calc(...) #1

				; Function Attrs: nounwind ssp
				define internal void @calculate() #0 {
				%1 = load <4 x i32>* bitcast ([5 x i32]* @bar to <4 x i32>*), align 4
				%2 = load <4 x i32>* bitcast ([5 x i32]* @baz to <4 x i32>*), align 4
				%3 = mul <4 x i32> %2, %1
				store <4 x i32> %3, <4 x i32>* bitcast ([5 x i32]* @foo to <4 x i32>*), align 4
				%4 = load i32* getelementptr inbounds ([5 x i32]* @bar, i32 0, i32 4), align 4, !tbaa !1
				%5 = load i32* getelementptr inbounds ([5 x i32]* @baz, i32 0, i32 4), align 4, !tbaa !1
				%6 = mul nsw i32 %5, %4
				store i32 %6, i32* getelementptr inbounds ([5 x i32]* @foo, i32 0, i32 4), align 4, !tbaa !1
				ret void
				}

				; Function Attrs: nounwind readnone ssp
				define internal i32* @returnFoo() #2 {
				ret i32* getelementptr inbounds ([5 x i32]* @foo, i32 0, i32 0)
				}

				attributes #0 = { nounwind ssp "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
				attributes #1 = { "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
				attributes #2 = { nounwind readnone ssp "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
				attributes #3 = { nounwind }

				!llvm.ident = !{!0}

				!0 = metadata !{metadata !"LLVM version 3.4 "}
				!1 = metadata !{metadata !2, metadata !2, i64 0}
				!2 = metadata !{metadata !"int", metadata !3, i64 0}
				!3 = metadata !{metadata !"omnipotent char", metadata !4, i64 0}
				!4 = metadata !{metadata !"Simple C/C++ TBAA"}

test/CodeGen/ARM64/global-merge.ll

This file was added.

				; RUN: llc %s -O0 -o - | FileCheck -check-prefix=NO-MERGE %s
				; RUN: llc %s -O0 -o - -enable-global-merge=false | FileCheck -check-prefix=NO-MERGE %s
				; RUN: llc %s -O0 -o - -enable-global-merge=true | FileCheck -check-prefix=NO-MERGE %s
				; RUN: llc %s -O1 -o - | FileCheck -check-prefix=MERGE %s
				; RUN: llc %s -O1 -o - -enable-global-merge=false | FileCheck -check-prefix=NO-MERGE %s
				; RUN: llc %s -O1 -o - -enable-global-merge=true | FileCheck -check-prefix=MERGE %s

				; MERGE-NOT: .zerofill __DATA,__bss,_bar,20,2
				; MERGE-NOT: .zerofill __DATA,__bss,_baz,20,2
				; MERGE-NOT: .zerofill __DATA,__bss,_foo,20,2
				; MERGE: .zerofill __DATA,__bss,__MergedGlobals,60,4
				; MERGE-NOT: .zerofill __DATA,__bss,_bar,20,2
				; MERGE-NOT: .zerofill __DATA,__bss,_baz,20,2
				; MERGE-NOT: .zerofill __DATA,__bss,_foo,20,2

				; NO-MERGE-NOT: .zerofill __DATA,__bss,__MergedGlobals,60,4
				; NO-MERGE: .zerofill __DATA,__bss,_bar,20,2
				; NO-MERGE: .zerofill __DATA,__bss,_baz,20,2
				; NO-MERGE: .zerofill __DATA,__bss,_foo,20,2
				; NO-MERGE-NOT: .zerofill __DATA,__bss,__MergedGlobals,60,4

				target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-n32:64-S128"
				target triple = "arm64-apple-ios7.0.0"

				@bar = internal global [5 x i32] zeroinitializer, align 4
				@baz = internal global [5 x i32] zeroinitializer, align 4
				@foo = internal global [5 x i32] zeroinitializer, align 4

				; Function Attrs: nounwind ssp
				define internal void @initialize() #0 {
				%1 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #2
				store i32 %1, i32* getelementptr inbounds ([5 x i32]* @bar, i64 0, i64 0), align 4
				%2 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #2
				store i32 %2, i32* getelementptr inbounds ([5 x i32]* @baz, i64 0, i64 0), align 4
				%3 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #2
				store i32 %3, i32* getelementptr inbounds ([5 x i32]* @bar, i64 0, i64 1), align 4
				%4 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #2
				store i32 %4, i32* getelementptr inbounds ([5 x i32]* @baz, i64 0, i64 1), align 4
				%5 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #2
				store i32 %5, i32* getelementptr inbounds ([5 x i32]* @bar, i64 0, i64 2), align 4
				%6 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #2
				store i32 %6, i32* getelementptr inbounds ([5 x i32]* @baz, i64 0, i64 2), align 4
				%7 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #2
				store i32 %7, i32* getelementptr inbounds ([5 x i32]* @bar, i64 0, i64 3), align 4
				%8 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #2
				store i32 %8, i32* getelementptr inbounds ([5 x i32]* @baz, i64 0, i64 3), align 4
				%9 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #2
				store i32 %9, i32* getelementptr inbounds ([5 x i32]* @bar, i64 0, i64 4), align 4
				%10 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #2
				store i32 %10, i32* getelementptr inbounds ([5 x i32]* @baz, i64 0, i64 4), align 4
				ret void
				}

				declare i32 @calc(...)

				; Function Attrs: nounwind ssp
				define internal void @calculate() #0 {
				%1 = load i32* getelementptr inbounds ([5 x i32]* @bar, i64 0, i64 0), align 4
				%2 = load i32* getelementptr inbounds ([5 x i32]* @baz, i64 0, i64 0), align 4
				%3 = mul nsw i32 %2, %1
				store i32 %3, i32* getelementptr inbounds ([5 x i32]* @foo, i64 0, i64 0), align 4
				%4 = load i32* getelementptr inbounds ([5 x i32]* @bar, i64 0, i64 1), align 4
				%5 = load i32* getelementptr inbounds ([5 x i32]* @baz, i64 0, i64 1), align 4
				%6 = mul nsw i32 %5, %4
				store i32 %6, i32* getelementptr inbounds ([5 x i32]* @foo, i64 0, i64 1), align 4
				%7 = load i32* getelementptr inbounds ([5 x i32]* @bar, i64 0, i64 2), align 4
				%8 = load i32* getelementptr inbounds ([5 x i32]* @baz, i64 0, i64 2), align 4
				%9 = mul nsw i32 %8, %7
				store i32 %9, i32* getelementptr inbounds ([5 x i32]* @foo, i64 0, i64 2), align 4
				%10 = load i32* getelementptr inbounds ([5 x i32]* @bar, i64 0, i64 3), align 4
				%11 = load i32* getelementptr inbounds ([5 x i32]* @baz, i64 0, i64 3), align 4
				%12 = mul nsw i32 %11, %10
				store i32 %12, i32* getelementptr inbounds ([5 x i32]* @foo, i64 0, i64 3), align 4
				%13 = load i32* getelementptr inbounds ([5 x i32]* @bar, i64 0, i64 4), align 4
				%14 = load i32* getelementptr inbounds ([5 x i32]* @baz, i64 0, i64 4), align 4
				%15 = mul nsw i32 %14, %13
				store i32 %15, i32* getelementptr inbounds ([5 x i32]* @foo, i64 0, i64 4), align 4
				ret void
				}

				; Function Attrs: nounwind readnone ssp
				define internal i32* @returnFoo() #1 {
				ret i32* getelementptr inbounds ([5 x i32]* @foo, i64 0, i64 0)
				}

				attributes #0 = { nounwind ssp }
				attributes #1 = { nounwind readnone ssp }
				attributes #2 = { nounwind }

test/Transforms/GlobalMerge/AArch64/global-merge-1.ll

This file was added.

				; RUN: opt %s -mtriple=aarch64-none-linux-gnu -global-merge -S -o - | FileCheck %s
				; RUN: opt %s -mtriple=aarch64-none-linux-gnu -global-merge -global-merge-on-external -global-merge-aligned -S -o - | FileCheck %s

				; RUN: opt %s -mtriple=arm64-linux-gnuabi -global-merge -S -o - | FileCheck %s
				; RUN: opt %s -mtriple=arm64-linux-gnuabi -global-merge -global-merge-on-external -global-merge-aligned -S -o - | FileCheck %s

				; RUN: opt %s -mtriple=arm64-apple-ios -global-merge -S -o - | FileCheck %s
				; RUN: opt %s -mtriple=arm64-apple-ios -global-merge -global-merge-on-external -global-merge-aligned -S -o - | FileCheck %s

				@m = internal global i32 0, align 4
				@n = internal global i32 0, align 4

				; CHECK: @_MergedGlobals = internal global { i32, i32 } zeroinitializer

				define void @f1(i32 %a1, i32 %a2) {
				; CHECK-LABEL: @f1
				; CHECK: getelementptr inbounds ({ i32, i32 }* @_MergedGlobals, i32 0, i32 0)
				; CHECK: getelementptr inbounds ({ i32, i32 }* @_MergedGlobals, i32 0, i32 1)
				store i32 %a1, i32* @m, align 4
				store i32 %a2, i32* @n, align 4
				ret void
				}

test/Transforms/GlobalMerge/AArch64/global-merge-2.ll

This file was added.

				; RUN: opt %s -mtriple=aarch64-none-linux-gnu -global-merge -global-merge-on-external -global-merge-aligned -S -o - | FileCheck %s
				; RUN: opt %s -mtriple=arm64-linux-gnuabi -global-merge -global-merge-on-external -global-merge-aligned -S -o - | FileCheck %s
				; RUN: opt %s -mtriple=arm64-apple-ios -global-merge -global-merge-on-external -global-merge-aligned -S -o - | FileCheck %s

				@x = global i32 0, align 4
				@y = global i32 0, align 4
				@z = global i32 0, align 4

				; CHECK: @_MergedGlobals_x = global { i32, i32, i32 } zeroinitializer, align 16
				; CHECK: @x = alias getelementptr inbounds ({ i32, i32, i32 }* @_MergedGlobals_x, i32 0, i32 0)
				; CHECK: @y = alias getelementptr inbounds ({ i32, i32, i32 }* @_MergedGlobals_x, i32 0, i32 1)
				; CHECK: @z = alias getelementptr inbounds ({ i32, i32, i32 }* @_MergedGlobals_x, i32 0, i32 2)

				define void @f1(i32 %a1, i32 %a2) {
				; CHECK-LABEL: @f1
				; CHECK: getelementptr inbounds ({ i32, i32, i32 }* @_MergedGlobals_x, i32 0, i32 0)
				; CHECK: getelementptr inbounds ({ i32, i32, i32 }* @_MergedGlobals_x, i32 0, i32 1)
				store i32 %a1, i32* @x, align 4
				store i32 %a2, i32* @y, align 4
				ret void
				}

				define void @g1(i32 %a1, i32 %a2) {
				; CHECK-LABEL: @g1
				; CHECK: getelementptr inbounds ({ i32, i32, i32 }* @_MergedGlobals_x, i32 0, i32 1)
				; CHECK: getelementptr inbounds ({ i32, i32, i32 }* @_MergedGlobals_x, i32 0, i32 2)
				store i32 %a1, i32* @y, align 4
				store i32 %a2, i32* @z, align 4
				ret void
				}

test/Transforms/GlobalMerge/AArch64/global-merge-3.ll

This file was added.

				; RUN: opt %s -mtriple=aarch64-none-linux-gnu -global-merge -global-merge-on-external -global-merge-aligned -S -o - | FileCheck %s
				; RUN: opt %s -mtriple=arm64-linux-gnuabi -global-merge -global-merge-on-external -global-merge-aligned -S -o - | FileCheck %s
				; RUN: opt %s -mtriple=arm64-apple-ios -global-merge -global-merge-on-external -global-merge-aligned -S -o - | FileCheck %s

				@x = global [1000 x i32] zeroinitializer, align 1
				@y = global [1000 x i32] zeroinitializer, align 1
				@z = internal global i32 1, align 4

				; CHECK: @_MergedGlobals_x = global { i32, [1000 x i32] } { i32 1, [1000 x i32] zeroinitializer }, align 4096
				; CHECK: @_MergedGlobals_y = global { [1000 x i32] } zeroinitializer, align 4096

				; CHECK: @x = alias getelementptr inbounds ({ i32, [1000 x i32] }* @_MergedGlobals_x, i32 0, i32 1)
				; CHECK: @y = alias getelementptr inbounds ({ [1000 x i32] }* @_MergedGlobals_y, i32 0, i32 0)

				define void @f1(i32 %a1, i32 %a2, i32 %a3) {
				; CHECK-LABEL: @f1
				; CHECK: %x3 = getelementptr inbounds [1000 x i32]* getelementptr inbounds ({ i32, [1000 x i32] }* @_MergedGlobals_x, i32 0, i32 1), i32 0, i64 3
				; CHECK: %y3 = getelementptr inbounds [1000 x i32]* getelementptr inbounds ({ [1000 x i32] }* @_MergedGlobals_y, i32 0, i32 0), i32 0, i64 3
				; CHECK: store i32 %a3, i32* getelementptr inbounds ({ i32, [1000 x i32] }* @_MergedGlobals_x, i32 0, i32 0), align 4

				%x3 = getelementptr inbounds [1000 x i32]* @x, i32 0, i64 3
				%y3 = getelementptr inbounds [1000 x i32]* @y, i32 0, i64 3
				store i32 %a1, i32* %x3, align 4
				store i32 %a2, i32* %y3, align 4
				store i32 %a3, i32* @z, align 4
				ret void
				}
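The `align 4096` in the CHECK lines above matches the 4 KiB page granularity that ADRP works at: ADRP materializes the page base of an address, and the low 12 bits are supplied separately. The following is a minimal Python sketch of that split, not code from the patch. The base address `0x41000` and the helper names `adrp_page` and `lo12` are assumptions made up for illustration; the field offsets assume a 4-byte `i32` with no padding in `{ i32, [1000 x i32] }`.

```python
PAGE_SIZE = 4096  # ADRP operates on 4 KiB pages


def adrp_page(addr: int) -> int:
    """Page base that an ADRP would materialize for addr (hypothetical helper)."""
    return addr & ~(PAGE_SIZE - 1)


def lo12(addr: int) -> int:
    """Low 12 bits of addr, applied separately from the ADRP result."""
    return addr & (PAGE_SIZE - 1)


# Hypothetical, 4096-aligned address for @_MergedGlobals_x (an assumption).
merged_base = 0x41000

# Assumed layout of { i32, [1000 x i32] }: z at offset 0, x at offset 4.
z_addr = merged_base + 0
x3_addr = merged_base + 4 + 3 * 4  # address of x[3]

# Both accesses fall in the same page, so one page base serves both.
pages = {adrp_page(z_addr), adrp_page(x3_addr)}
print(len(pages), hex(lo12(z_addr)), hex(lo12(x3_addr)))
```

Under these assumptions the merged struct is 4004 bytes, so every field shares one page and a single page base can be reused for all of them, which is the situation the ADRP CSE in this revision is aiming for.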

test/Transforms/GlobalMerge/AArch64/lit.local.cfg

This file was added.

				targets = set(config.root.targets_to_build.split())
				if not 'AArch64' in targets:
				    config.unsupported = True

test/Transforms/GlobalMerge/ARM/arm.ll

	; RUN: llc %s -O0 -o - | FileCheck -check-prefix=NO-MERGE %s	; RUN: opt %s -mtriple=arm-linux-gnuabi -global-merge -S -o - | FileCheck %s
	; RUN: llc %s -O0 -o - -global-merge=false | FileCheck -check-prefix=NO-MERGE %s
	; RUN: llc %s -O0 -o - -global-merge=true | FileCheck -check-prefix=NO-MERGE %s
	; RUN: llc %s -O1 -o - | FileCheck -check-prefix=MERGE %s
	; RUN: llc %s -O1 -o - -global-merge=false | FileCheck -check-prefix=NO-MERGE %s
	; RUN: llc %s -O1 -o - -global-merge=true | FileCheck -check-prefix=MERGE %s

	; MERGE-NOT: .zerofill __DATA,__bss,_bar,20,2
	; MERGE-NOT: .zerofill __DATA,__bss,_baz,20,2
	; MERGE-NOT: .zerofill __DATA,__bss,_foo,20,2
	; MERGE: .zerofill __DATA,__bss,__MergedGlobals,60,4
	; MERGE-NOT: .zerofill __DATA,__bss,_bar,20,2
	; MERGE-NOT: .zerofill __DATA,__bss,_baz,20,2
	; MERGE-NOT: .zerofill __DATA,__bss,_foo,20,2

	; NO-MERGE-NOT: .zerofill __DATA,__bss,__MergedGlobals,60,4
	; NO-MERGE: .zerofill __DATA,__bss,_bar,20,2
	; NO-MERGE: .zerofill __DATA,__bss,_baz,20,2
	; NO-MERGE: .zerofill __DATA,__bss,_foo,20,2
	; NO-MERGE-NOT: .zerofill __DATA,__bss,__MergedGlobals,60,4

	target datalayout = "e-p:32:32:32-i1:8:32-i8:8:32-i16:16:32-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:32:64-v128:32:128-a0:0:32-n32-S32"	target datalayout = "e-p:32:32:32-i1:8:32-i8:8:32-i16:16:32-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:32:64-v128:32:128-a0:0:32-n32-S32"
	target triple = "thumbv7-apple-ios3.0.0"	target triple = "thumbv7-apple-ios3.0.0"
Context not available.
	@baz = internal global [5 x i32] zeroinitializer, align 4	@baz = internal global [5 x i32] zeroinitializer, align 4
	@foo = internal global [5 x i32] zeroinitializer, align 4	@foo = internal global [5 x i32] zeroinitializer, align 4

		; CHECK: @_MergedGlobals = internal global { [5 x i32], [5 x i32], [5 x i32] } zeroinitializer

	; Function Attrs: nounwind ssp	; Function Attrs: nounwind ssp
	define internal void @initialize() #0 {	define internal void @initialize() #0 {
	%1 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #3	%1 = tail call i32 bitcast (i32 (...)* @calc to i32 ()*)() #3
Context not available.

test/Transforms/GlobalMerge/ARM64/arm64.ll

	; RUN: llc %s -O0 -o - | FileCheck -check-prefix=NO-MERGE %s			; RUN: opt %s -mtriple=arm64-linux-gnuabi -global-merge -S -o - | FileCheck %s
	; RUN: llc %s -O0 -o - -global-merge=false | FileCheck -check-prefix=NO-MERGE %s
	; RUN: llc %s -O0 -o - -global-merge=true | FileCheck -check-prefix=NO-MERGE %s
	; RUN: llc %s -O1 -o - | FileCheck -check-prefix=MERGE %s
	; RUN: llc %s -O1 -o - -global-merge=false | FileCheck -check-prefix=NO-MERGE %s
	; RUN: llc %s -O1 -o - -global-merge=true | FileCheck -check-prefix=MERGE %s

	; MERGE-NOT: .zerofill __DATA,__bss,_bar,20,2			; CHECK: @_MergedGlobals = internal global { [5 x i32], [5 x i32], [5 x i32] } zeroinitializer
	; MERGE-NOT: .zerofill __DATA,__bss,_baz,20,2
	; MERGE-NOT: .zerofill __DATA,__bss,_foo,20,2
	; MERGE: .zerofill __DATA,__bss,__MergedGlobals,60,4
	; MERGE-NOT: .zerofill __DATA,__bss,_bar,20,2
	; MERGE-NOT: .zerofill __DATA,__bss,_baz,20,2
	; MERGE-NOT: .zerofill __DATA,__bss,_foo,20,2

	; NO-MERGE-NOT: .zerofill __DATA,__bss,__MergedGlobals,60,4
	; NO-MERGE: .zerofill __DATA,__bss,_bar,20,2
	; NO-MERGE: .zerofill __DATA,__bss,_baz,20,2
	; NO-MERGE: .zerofill __DATA,__bss,_foo,20,2
	; NO-MERGE-NOT: .zerofill __DATA,__bss,__MergedGlobals,60,4

	target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-n32:64-S128"			target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-n32:64-S128"
	target triple = "arm64-apple-ios7.0.0"			target triple = "arm64-apple-ios7.0.0"

tools/opt/opt.cpp

Property	Old Value	New Value
File Mode	100644	100755

The contents of this file were not changed.


Performances

Benchmark_ID	Reference	Test	Expansion	Percent
siod/siod	3.4691	3.4929	1.01	+1%
Min (24)	-	-	0.96	-
Max (24)	-	-	1.03	-
Sum (24)	211	210	0.99	+1%
A.Mean (24)	-	-	1	+0%
G.Mean 2 (24)	-	-	1	+0%

Benchmark_ID	Reference	Test	Expansion	Percent
siod/siod	4.2536	4.3336	1.02	+2%
Min (19)	-	-	0.94	-
Max (19)	-	-	1.06	-
Sum (19)	184	185	1	+0%
A.Mean (19)	-	-	1.01	+1%
G.Mean 2 (19)	-	-	1.01	+1%

Static Count of ADRP/ADR

Min (90)	-	-	0.3	-
Max (90)	-	-	1.11	-
Sum (90)	160078	152264	0.95	+5%
A.Mean (90)	-	-	0.88	-12%
G.Mean 2 (90)	-	-	0.86	-14%

Benchmark_ID	Reference	Test	Expansion	Percent
Shootout-C++/EH/Output/	16	17	1.06	+6%
Min (6)	-	-	1.03	-
Max (6)	-	-	1.11	-
Sum (6)	502	520	1.04	-3%
A.Mean (6)	-	-	1.07	+7%
G.Mean 2 (6)	-	-	1.07	+7%

Min (87)	-	-	0.3	-
Max (87)	-	-	1.11	-
Sum (87)	101239	96762	0.96	+5%
A.Mean (87)	-	-	0.89	-11%
G.Mean 2 (87)	-	-	0.86	-14%

Benchmark_ID	Reference	Test	Expansion	Percent
Shootout-C++/EH/Output/	10	11	1.1	+10%
Min (6)	-	-	1.06	-
Max (6)	-	-	1.11	-
Sum (6)	414	455	1.1	-9%
A.Mean (6)	-	-	1.09	+9%
G.Mean 2 (6)	-	-	1.08	+8%

Long Story

Benchmark_ID	Reference	Test	Expansion	Percent
./300.twolf/Output/300.	3244	3244	1	+0%
Min (12)	-	-	0.96	-
Max (12)	-	-	1	-
Sum (12)	50841	50304	0.99	+1%
A.Mean (12)	-	-	0.99	-1%
G.Mean 2 (12)	-	-	0.99	-1%
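
For reading the tables above: the Expansion column appears to be Test divided by Reference, rounded to two decimals, so values below 1 mean the Test build is smaller or faster than the Reference. The report does not state this, so treat it as an assumption. A small Python sketch that reproduces the two siod/siod rows under that assumption is shown here; the `geometric_mean` helper is a generic definition and is not claimed to be how the report's "G.Mean 2" figure was computed.

```python
import math


def expansion(reference: float, test: float) -> float:
    """Assumed definition of the Expansion column: Test / Reference."""
    return test / reference


def geometric_mean(ratios):
    """Generic geometric mean of positive ratios (illustrative only)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# (Reference, Test) pairs copied from the two siod/siod rows above.
rows = {
    "siod/siod (first table)": (3.4691, 3.4929),
    "siod/siod (second table)": (4.2536, 4.3336),
}

for name, (ref, test) in rows.items():
    print(f"{name}: expansion = {expansion(ref, test):.2f}")
```

Both computed ratios round to the values shown in the Expansion column (1.01 and 1.02). The Sum rows do not always agree with the same calculation, possibly because the displayed sums are themselves rounded.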

Revision Contents

Diff 9301

include/llvm/IR/GlobalAlias.h

include/llvm/Target/TargetLowering.h

lib/CodeGen/AsmPrinter/AsmPrinter.cpp

lib/IR/Globals.cpp

lib/Target/AArch64/AArch64ISelLowering.h

lib/Target/AArch64/AArch64ISelLowering.cpp

lib/Target/ARM64/ARM64ISelLowering.h

lib/Target/ARM64/ARM64ISelLowering.cpp

lib/Transforms/Scalar/GlobalMerge.cpp

lib/Transforms/Scalar/Scalar.cpp

test/CodeGen/AArch64/global-merge.ll

test/CodeGen/AArch64/global_merge_1.ll

test/CodeGen/ARM/global-merge-1.ll

test/CodeGen/ARM64/global-merge.ll

test/Transforms/GlobalMerge/AArch64/global-merge-1.ll

test/Transforms/GlobalMerge/AArch64/global-merge-2.ll

test/Transforms/GlobalMerge/AArch64/global-merge-3.ll

test/Transforms/GlobalMerge/AArch64/lit.local.cfg

test/Transforms/GlobalMerge/ARM/arm.ll

test/Transforms/GlobalMerge/ARM64/arm64.ll

tools/opt/opt.cpp
