Extending findExistingExpansion can use existing value in ExprValueMap.
This patch gives 0.3~0.5% performance improvements on benchmarks(test-suite, spec2000, spec2006, commercial benchmark)
Differential D15559
[SCEVExpander] Make findExistingExpansion smarter flyingforyou on Dec 16 2015, 12:59 AM. Authored by
Details Extending findExistingExpansion can use existing value in ExprValueMap. This patch gives 0.3~0.5% performance improvements on benchmarks(test-suite, spec2000, spec2006, commercial benchmark)
Diff Detail Event TimelineComment Actions This needs a test case (either in test/Transforms/LoopUnroll/AArch64 or in test/Transforms/LoopUnroll). Also, why does this logic belong here and not in Expander.isHighCostExpansion? Comment Actions Hi Junmo, I had a look at A57 in A64 mode and noticed a 4.81% regression in spec2006's hmmer benchmark. Do you have access to this suite? Comment Actions Thanks Charlie for comment.
I will figure it out what's wrong. I set up spec2006, recently.
I also think cheking the unroll factor only is not enough for figuring out that TripCountSC is really expensive or not. Junmo. Comment Actions I've just seen the following regressions for spec2006 on the A57 in T32 mode, spec.cpu2006.ref.456_hmmer 13.34% spec.cpu2006.ref.471_omnetpp 7.87% spec.cpu2006.ref.458_sjeng 3.72% spec.cpu2006.ref.429_mcf 3.48% spec.cpu2006.ref.462_libquantum 3.47% spec.cpu2006.ref.482_sphinx3 2.27% spec.cpu2006.ref.445_gobmk 1.90% spec.cpu2006.ref.453_povray 1.61% spec.cpu2006.ref.447_dealII 1.10% hmmer is clearly a good test case here, must be some register pressure issues. Comment Actions I forgot to mention that I did see the following improvements in A57-T32 for spec2006: spec.cpu2006.ref.470_lbm -10.92% spec.cpu2006.ref.473_astar -9.53% spec.cpu2006.ref.444_namd -1.40% spec.cpu2006.ref.401_bzip2 -1.06% But this is an overall regression on spec2006. Comment Actions Current algorithm doesn't care about where does trip count's division come from. If IV's step is constant likes one or minus one or multiple of 2, we don't need to generate division for computing trip count. I tested this patch on r256132 with test-suite, spec2000,2006, commercial benchmarks. There is no regression on Cortex-A57. I hope that this patch is acceptable for everybody. Junmo.
Comment Actions I'd rather have this sort of logic live in SCEVExpander::isHighCostExpansion than be scattered throughout the codebase. I think the "increment by one" case is covered by findExistingExpansion already; and the "decrement by one" case can be added there as well by teaching it to look at the initial values for the induction phi nodes. Comment Actions Thanks for comment. Sanjoy. test2's TripCountSC is "(1 umax ((23 + %conv7) /u %conv7))". How can we handle this in SCEVExpander::isHighCostExpansion? Please let me know. Junmo. Comment Actions Addressed Sanjoy's comment. Your comment was really helpful for me. I am still not sure that this patch is ok or not. Thanks.
Comment Actions Addressed Sanjoy's comment. Following your suggestion, I realized that checing IV's step is not necessary any more. I really appreciate your review.
Comment Actions I ran some benchmarks against http://reviews.llvm.org/differential/diff/43408/ spec regressions on a Cortex-A57 (A64): spec.cpu2000.ref.253_perlbmk 4.71% spec.cpu2000.ref.256_bzip2 2.07% spec.cpu2006.ref.433_milc 1.24% There's a small improvement in 254_gap spec.cpu2000.ref.254_gap -1.76% (I'm focussing on SPEC because it's less noisy than the benchmarks in test-suite, although I do note a 23% improvement in lnt.MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan and a 7.76% regression in lnt.SingleSource/Benchmarks/Shootout-C++/ackermann) spec regressions on a Cortex-A57 (T32): spec.cpu2006.ref.482_sphinx3 2.31% and an improvement in: spec.cpu2000.ref.181_mcf -2.13%
Did you notice these regressions on your hardware? I find it strange to think this could all be accounted to different revisions of this core. I'm testing on this platform: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dto0038a/index.html) I see some small improvments in a commerical benchmark (1-2%), but it is worth mentioning the SPEC changes, since I wouldn't write those off as no regressions; it's definite regression in SPEC, but whether the other improvements in test-suite balance them in the view of the community is debatable. Comment Actions I ran test-suite, spec on Galaxy S6 Target(CPU 2.1Ghz, MIF 1.5Ghz, Of course it's developtment board. Not exactly same with Galaxy S6). In my test, I got different data.
My test result about Spec (Opt/Ori) spec.cpu2000.ref.253_perlbmk -> I can't get result. I think without modifying source code, we can't compile it. spec.cpu2000.ref.256_bzip2 99.93% spec.cpu2006.ref.433_milc 99.52% spec.cpu2000.ref.254_gap 99.18% My test result about test-suite (Opt/Ori) I got totally different result on this. SingleSource/Benchmarks/Shootout-C++/ackermann 81.35% MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan -> execution time is 0. It's same. I think both of data also can be worth. And This data is not meaningful any more. As you know, the patch was completely changed. I will share the result data base on recent patch soon. PS. Could you share how to build "spec.cpu2000.ref.253_perlbmk", please? Comment Actions I ran some benchmarks against http://reviews.llvm.org/differential/diff/43578/. (Base Revision: r256315)
Other benchmarks in test-suite's Geomean value : 99.73% On commertial benchmark, we can see the 0.1 ~0.2% improvement of overall. Junmo.
Comment Actions Hi Junmo, Please find some comments inline. Thanks,
Comment Actions Michael, I replied your inline comments. Could you check them, please? Sanjoy, Michael want to insert some patch. Could you give him a opinion, please? Junmo.
Comment Actions Addressed Michael's comments. Waiting Sanjoy's comment. Comment Actions This is looking close to ready to land now. Some nitpicky comments inline.
Comment Actions Sanjoy, I am waiting your review. :) Michael, could you make an example which is more complex that current clean up passes can't deal with? Comment Actions
Providing a concrete example here isn't even that important due to arguments I raised earlier, but here you go: @a = global i64 1 declare void @foo() define i32 @test3(i64* %loc, i64 %conv7) { entry: %rem0 = load i64, i64* %loc, align 8 %ExpensiveComputation = udiv i64 %rem0, 42 br label %outer.header outer.header: %i = phi i64 [ 1, %entry ], [ %i.next, %outer.latch ] br label %bb1 bb1: %addr = getelementptr i64, i64* @a, i64 %i %x = load volatile i64, i64* %addr %div11 = udiv i64 %x, %i %cmp.i38 = icmp ugt i64 %div11, 1 %div12 = select i1 %cmp.i38, i64 %div11, i64 1 br label %for.body for.body: %rem1 = phi i64 [ %rem0, %bb1 ], [ %rem2, %for.body ] %k1 = phi i64 [ %div12, %bb1 ], [ %dec, %for.body ] %mul1 = mul i64 %rem1, 48271 %rem2 = urem i64 %mul1, 2147483647 %dec = add i64 %k1, -1 %cmp = icmp eq i64 %dec, 0 br i1 %cmp, label %outer.latch, label %for.body outer.latch: %i.next = add i64 %i, 1 %outer.cmp = icmp eq i64 %i.next, 1000 br i1 %outer.cmp, label %exit, label %outer.header exit: %rem3 = phi i64 [ %rem2, %outer.latch ] store i64 %rem3, i64* %loc, align 8 ret i32 0 } I ran it with opt < test.ll -S -unroll-runtime -loop-unroll -early-cse -instcombine -simplifycfg -licm and always saw a code like this in the output: %x = load volatile i64, i64* %addr, align 8 %div11 = udiv i64 %x, %i %cmp.i38 = icmp ugt i64 %div11, 1 %div12 = select i1 %cmp.i38, i64 %div11, i64 1 %1 = udiv i64 %x, %0 ; <<< Redundant expensive computation %2 = icmp ugt i64 %1, 1 %umax = select i1 %2, i64 %1, i64 1 Probably, you can come up with another set of passes that will be able to clean this up, but then the example might be made even more complicated, so the new set will fail. The problem here is that with this patch compiler thinks that it generates no redundant code to clean-up, while in fact it's not even limited in what it can generate - we expand *any* existing SCEV expression. So, I'd suggest you to address the issue with expand before going further with this one. Michael Comment Actions Looks like the change I've been asking for function has been committed (see D12090). So, probably we just need to update this patch, and it'll be ok to go in.
Comment Actions Hi Junmo, It looks better now, couple of comments/questions inline. Michael
Comment Actions
Oh, I see, makes sense. The patch LGTM, thanks for your patience:) Michael
Comment Actions Michael, Sanjoy, Thank you for your patience, so many great suggestions! It was great time. ! Merged in r260938. Thanks, |
Nit: do you mean I's operands?