This patch depends on another patch "Enable global merge pass." at http://reviews.llvm.org/D3431.
This patch is to implement ADRP CSE for global symbols defined within the module being compiled.
Differential D3432
Implement ADRP CSE for global symbols Jiangning on Apr 18 2014, 8:43 PM. Authored by
Details
This patch depends on another patch "Enable global merge pass." at http://reviews.llvm.org/D3431. This patch is to implement ADRP CSE for global symbols defined within the module being compiled.
Diff Detail Event TimelineComment Actions Hi Jiangning, This sounds like a good idea. I've got a couple of smaller comments:
Comment Actions Hi Jiangning, I copy/paste my comments from the other thread here to ease the tracking. I wanted to sort out the redundant ADRP stuff on ARM64 while using the pseudo instructions (for address computations) before doing any measurements. define void @f1(i32 %a1, i32 %a2) { store i32 %a1, i32* @x, align 4 store i32 %a2, i32* @y, align 4 ret void } Test1, your patch is disabled: _f1: ; @f1 Test2, your patch is enabled: _f1: ; @f1 Could you take a look please? Thanks, Comment Actions This patch is to fix the issue raised by Quentin and Tim. The code was cleaned, and the followings are changed
Thanks,
Comment Actions Hi Quenin, Thanks for your testing! It appears that your patch that adds the external support is not playing
I uploaded a new version to fix the issue you mentioned and now Thanks,
Comment Actions Hi Jiangning,
I didn't mean any change in the order (I've no idea where it would be Cheers. Tim Comment Actions Hi Jiangning, I confirm that the changes you did fix the “regression” on ARM64, however I do not think this is the way to go. That said, if we can set some reasonable alignment, we should do it, and not backed this off because of ARM64 current limitation. Nevertheless, we need to be careful not to introduce any regression here because so far the global merge pass was producing unaligned globals. Now, regarding a reasonable alignment, I do not think the current setting make sense. I’ve also a couple of minor comments, see the inline comments. I am looking into making the folding smarter to avoid the redundant ADRP. Thanks,
Comment Actions This new version fixed the followings,
Thanks, Comment Actions This patch make some more little changes,
Thanks,
Comment Actions Hi Quentin, I just realized probably I misunderstood your point of using data layout. I Sorry about that and I will change my code soon. Thanks, Comment Actions Hi Jiangning,
The size is fine, we accumulate for AllocSize, i.e., with padding and such. To be more specific, I think we should set the alignment of the merge structure to DL->getABIAlignment(MergedTy), or something like that. The example I had in mind was: >struct MergeGlobal { Thanks for looking into it. -Quentin Comment Actions Hi Quentin, I thought I misunderstood you, but looking at your explanation below, it 2014-04-25 1:06 GMT+08:00 Quentin Colombet <qcolombet@apple.com>:
No. I think it should be 16, otherwise this MergeGlobal maybe cross page If we set alignment to 16, we would be able to guarantee at compile-time Could you please look into again my last reply at Thanks,
Comment Actions Hi Jiangning, Ok, I see your point now. Anyway, you should comment why we are setting this value for the alignment :). When that is fixed, I can give it a shot. Thanks, Comment Actions Hi Jiangning, On a second though, I do not think approach #2 is reasonable. Is in fact, syntax A Or put another way, syntax B For syntax A to be valid, you need to be sure that MergedGlobal@PAGEOFF+varX_offset fits the encoding space. Thus, in both cases, this is not generally correct. To increase the likelihood of this being correct, you’ll need to align the thing on a PAGE, and make sure that the reachable fields are within the encoding space (this should be fine, because of the get max offset thing). I’d say that is not desirable. Therefore, I think we need to go for approach #1. What do you think? Thanks, Comment Actions Hi Quentin,
If we don't set MergedGlobal's alignment to RoundUpToPowerOfTwo(sizeof(MergedGlobal)), both syntax A and B are incorrect. But if we set MergedGlobal's alignment to RoundUpToPowerOfTwo(sizeof(MergedGlobal)), both syntax A and B should be correct. i.e. For syntax A, it can guarantee (MergedGlobal@PAGEOFF+varX_offset)<4096, which meet the encoding requirement of load/store instructions. For syntax B, it can guarantee MergedGlobal@PAGE==MergedGlobal_varX@PAGE is true.
We don't really need to make MergedGlobal aligned to a PAGE, but asking MergedGlobal aligned to RoundUpToPowerOfTwo(sizeof(MergedGlobal)) is enough. Thanks,
Comment Actions Hi Jiangning,
Yes, they are incorrect and that’s why I said we should go for your first approach:
Yes, but my point is, we do not want to have this big alignment. Especially because it may make sense *only* for ARM64. Moreover, the fact that this alignment works is because both ADD and LDR have the same encoding space for the immediate. In my opinion this is a lot of assumptions that would be hard to match for all targets to make that a desirable change. Therefore, I think we should go for your approach #1 and set a *reasonable* alignment based on the data layout. Thanks, Comment Actions Hi Quentin,
I see your point now. I think you are correct and there is an assumption here. What about we create a target specific hook and allow target/back-end to choose the alignment between RoundupToPowerOfTwo(sizeof(MergedGlobal)) and NaturalAlignment(MergedGlobal)? Thanks, Comment Actions Hi Jiangning, A target hook seems like a good idea.
Instead of choosing the alignment between RoundupToPowerOfTwo(sizeof(MergedGlobal)) and NaturalAlignment(MergedGlobal), what about choosing an alignment period :). What do you think? Thanks, Comment Actions Hi Quentin, Sorry for my late response! I was in vacation and busy at other tasks. I think I totally agree with you, so I added a target hook returning the alignment required by global merge on external, and in global merge pass there is an assertion to make sure the alignment returned by target hook always meets natural alignment requirement. Thanks, Comment Actions The changes of this version are mainly the followings, and no other things. Thanks,
Comment Actions Hi Jiangning, Thanks for the update. I’ll give a shot at the patch as it is and get back to you with performance numbers. Cheers, -Quentin
Comment Actions Hi Jiangning, I've stopped the benchmarking because I am seeing compiler crashes, with the current patch, when the merge on external is enabled: The following tests do not compile: I am attaching the reduced test case from 445.gobmk. To reproduce: llc -global-merge-on-external=true reduced_testcase.ll Could you have a look please? As soon as this is fixed, I can start over the benchmarking. Thanks, Quentin Comment Actions Hi Quentin, Sorry for those failures introduced, and yes they are all my fault, and now the new patch should have all issues fixed. There was a logic error around the original assertion and it was only to check the 1st global for each different global variable merges. And actually this is was newly introduced by the 1st refactoring patch. Anyway this new patch fixed the followings,
If "internal globals" and "globals" are mixed in the same merged globals, we will use the 1st "global" name as the suffix as this newly created merged global variable.
-global-merge-on-external: control if globals excluding "internal globals" will be merged.
Thanks, Comment Actions Hi Jiangning,
Makes sense.
That’s a good idea, thanks! That said, shouldn’t we set the natural alignment for external globals by default? I.e., I guess externals uses will assume that at least the natural alignment is set. Anyway, I’ll give a shot to the new patch. Thanks, Comment Actions Hi Quentin,
I think if I don't explicit set alignment, by default it should be natural alignment for both internal and external global variables, right? The original code for internal globals doesn't explicitly set alignment either. Now I've got an A57/A53 big-little board, so I did some experiment and the data shows there isn't really performance impact for spec2000 int on both A57 and A53 (I believe you understand I can't post the performance data directly here due to legal issue), although I do see the following adrp instruction reductions for ARM64, bzip2 0.28% I think one of reasons might be this optimization only affects global symbol address, but it is a loop invariant essentially, and it should be always hoisted out of hot loops, so at least for spec2000 we wouldn't be able to see performance impact. Once the pseudo instruction MOVAddr issue is solved for ARM64, there might be more adrp instruction reduction, but after this change I would still not expect performance impact for spec2000, because my measurement for AArch64 didn't show performance impact for spec2000 either. As far as this optimization improvement itself concerned, now I think code size reduction could be the key benefit, and it wouldn't hurt any performance. Considering this patch has been posted for almost 4 weeks and we are approaching the deadline of migrating AArch64 to ARM64, do you agree to commit this patch first? If this can happen before phasing out AArch64, it will save us a lot of more efforts to rebase, because I think the final switching from AArch64 to ARM64, and renaming ARM64 to AArch64 would be a significant change on TOT. Anyway, I would be extremely appreciative if you can understand the current situation and agree to upstream this patch first. Thanks,
Comment Actions Hi Jiangning, Sorry for the delay to get back to you but the results are not what I was expected and I had to double check them. Note: for the performances numbers, I have filtered out the tests that run for less than a second and the tests that have similar performance:
Performances
Benchmark_ID Reference Test Expansion PercentASC_Sequoia/IRSmk/IRSmk 16.6782 16.5785 0.99 -1% siod/siod 3.4691 3.4929 1.01 +1%Min (24) - - 0.96 -Max (24) - - 1.03 -Sum (24) 211 210 0.99 +1%A.Mean (24) - - 1 +0%G.Mean 2 (24) - - 1 +0%Overall there are more regressions (14) than improvements (10) and on average it is neutral.
Benchmark_ID Reference Test Expansion PercentAdobe-C++/loop_unroll 5.0411 5.168 1.03 +3% siod/siod 4.2536 4.3336 1.02 +2%Min (19) - - 0.94 -Max (19) - - 1.06 -Sum (19) 184 185 1 +0%A.Mean (19) - - 1.01 +1%G.Mean 2 (19) - - 1.01 +1%Overall there are more regressions (14) than improvements (5) and on average it is a regression. Static Count of ADRP/ADRThis is the number of ADRP/ADR instruction in the final binaries. In other words, the linker optimizations already took place.
Min (90) - - 0.3 -Max (90) - - 1.11 -Sum (90) 160078 152264 0.95 +5%A.Mean (90) - - 0.88 -12%G.Mean 2 (90) - - 0.86 -14%Here are the details for the regressions: Benchmark_ID Reference Test Expansion PercentC/Output/globalrefs.sim 9 10 1.11 +11% Shootout-C++/EH/Output/ 16 17 1.06 +6%Min (6) - - 1.03 -Max (6) - - 1.11 -Sum (6) 502 520 1.04 -3%A.Mean (6) - - 1.07 +7%G.Mean 2 (6) - - 1.07 +7%
Min (87) - - 0.3 -Max (87) - - 1.11 -Sum (87) 101239 96762 0.96 +5%A.Mean (87) - - 0.89 -11%G.Mean 2 (87) - - 0.86 -14%Here are the details for the regressions: Benchmark_ID Reference Test Expansion PercentC/Output/globalrefs.sim 9 10 1.11 +11% Shootout-C++/EH/Output/ 10 11 1.1 +10%Min (6) - - 1.06 -Max (6) - - 1.11 -Sum (6) 414 455 1.1 -9%A.Mean (6) - - 1.09 +9%G.Mean 2 (6) - - 1.08 +8%Long StoryAfter the first run, a lot of applications were regressing, so I suspected that somehow my device got noisy. I've rerun all the regressions/improvements 10x times and observed the standard deviation (SD). The SD was close to 0 percent for most tests, i.e., the regressions/improvements were real. I've investigated a few of them and figured it was our lowering using pseudo instruction for ADRP that was bitting us. The short story is this was producing redundant ADRPs as you initially saw when you started to work on that patch. Anyway, I have made another quick patch where I basically blocked any folding of those instructions, thus maximizing the reuse of ADRPs. This is the current baseline. In the meantime, I can give you the raw numbers if you want to gather your own statistics. Thanks, -Quentin Comment Actions
I am not sure. IIRC, no alignment means 1-byte aligned.
This is not a problem because the module will see that as all the globals are locally known. But for external, the problem is different as the modules will not see the new alignment and thus, assume whatever the front-end or ABI tells to the backend.
I concur the ADRP reduction for CINT2000, here the numbers for O3: Benchmark_ID Reference Test Expansion Percent./164.gzip/Output/164.g 655 632 0.96 -4% ./300.twolf/Output/300. 3244 3244 1 +0%Min (12) - - 0.96 -Max (12) - - 1 -Sum (12) 50841 50304 0.99 +1%A.Mean (12) - - 0.99 -1%G.Mean 2 (12) - - 0.99 -1%However, I do see some performance regressions (see my previous post).
I am fine with that assuming:
Thanks, Comment Actions Hi Quentin, Thanks a lot of your performance data, and it also surprised by those regressions. But I can reproduce the regressions on ARM64 for the benchmarks below, CINT2000/164.gzip/164.g 22.8601 23.2677 1.02 +2% If I don't have other higher priority tasks, I will have a look. But it would be much better if your patch of fixing MOVAddr issue can be shared with me. Is there any possibility after this optimization the live range of base address is extended, and triggered spill/fill in hot loop by RA? If this is the case the rematerialization needs to be tuned, I think.
Actually both -global-merge-on-external and -global-merge-aligned are false by default, so if end-user doesn't enable them we wouldn't see any performance impact. Therefore, they are disabled by default for all targets. Actually I was hoping your patch can get it enabled at least for ARM64. :-)
I checked the alignment behavior. If we don't set alignment, which is also the default behavior with/without my patch, the following code in AsmPrinter will be effective, // If the alignment is specified, we *must* obey it. Overaligning a global // with a specified alignment is a prompt way to break globals emitted to // sections and expected to be contiguous (e.g. ObjC metadata). unsigned AlignLog = getGVAlignmentLog2(GV, *DL); for (const HandlerInfo &HI : Handlers) { NamedRegionTimer T(HI.TimerName, HI.TimerGroupName, TimePassesIsEnabled); HI.Handler->setSymbolSize(GVSym, Size); } // Handle common and BSS local symbols (.lcomm). if (GVKind.isCommon() || GVKind.isBSSLocal()) { if (Size == 0) Size = 1; // .comm Foo, 0 is undefined, avoid it. unsigned Align = 1 << AlignLog; So that means the RoundupToPowerof2(sizeof(MergedGlobal)) will be used as alignment. The test case test/CodeGen/AArch64/global-merge.ll, which only uses "internal globals", also shows below with llc, $ ../build/bin/llc -mtriple=arm64-apple-ios -O1 < test/CodeGen/AArch64/global-merge.ll .zerofill DATA,bss,__MergedGlobals,8,3 ; @_MergedGlobals So actually I'm keeping the original behavior. Anyway, my patch doesn't change the original semantic at all, if only -global-merge-on-external and -global-merge-aligned are not enabled. Are you happy with this? Thanks,
Comment Actions Hi Jiangning,
Sure! It is more a hack than a patch, but here it is :).
I thought about that and this is possible. Though the current implementation still uses the pseudo instructions so the rematerialization should kick in. That said, this is also possible that the rematerialization algorithm is not perfect.
For ARM64, yes, but for ARM64-apple-* no :). We need to address the regressions first before enabling it by default for our (apple) triple, but if you need it for other OS, I am fine with that, though I cannot test the performances for those. Thanks for checking the alignment thing. Yes, I am happy with that :). Thanks, Comment Actions Hi Quentin, I committed this patch as r208934, and will analyze those regression with Thanks, 2014-05-16 0:57 GMT+08:00 Quentin Colombet <qcolombet@apple.com>:
Comment Actions
Thanks! Let me know if I can help for the regressions. -Quentin |
This is newly added target hook to return alignment required for different targets.