Extending findExistingExpansion can use existing value in ExprValueMap.
This patch gives 0.3~0.5% performance improvements on benchmarks(test-suite, spec2000, spec2006, commercial benchmark)
Differential D15559
[SCEVExpander] Make findExistingExpansion smarter flyingforyou on Dec 16 2015, 12:59 AM. Authored by
Details Extending findExistingExpansion can use existing value in ExprValueMap. This patch gives 0.3~0.5% performance improvements on benchmarks(test-suite, spec2000, spec2006, commercial benchmark)
Diff Detail
Event TimelineThere are a very large number of changes, so older changes are hidden. Show Older Changes
Comment Actions I ran some benchmarks against http://reviews.llvm.org/differential/diff/43408/ spec regressions on a Cortex-A57 (A64): spec.cpu2000.ref.253_perlbmk 4.71% spec.cpu2000.ref.256_bzip2 2.07% spec.cpu2006.ref.433_milc 1.24% There's a small improvement in 254_gap spec.cpu2000.ref.254_gap -1.76% (I'm focussing on SPEC because it's less noisy than the benchmarks in test-suite, although I do note a 23% improvement in lnt.MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan and a 7.76% regression in lnt.SingleSource/Benchmarks/Shootout-C++/ackermann) spec regressions on a Cortex-A57 (T32): spec.cpu2006.ref.482_sphinx3 2.31% and an improvement in: spec.cpu2000.ref.181_mcf -2.13%
Did you notice these regressions on your hardware? I find it strange to think this could all be accounted to different revisions of this core. I'm testing on this platform: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dto0038a/index.html) I see some small improvments in a commerical benchmark (1-2%), but it is worth mentioning the SPEC changes, since I wouldn't write those off as no regressions; it's definite regression in SPEC, but whether the other improvements in test-suite balance them in the view of the community is debatable. Comment Actions I ran test-suite, spec on Galaxy S6 Target(CPU 2.1Ghz, MIF 1.5Ghz, Of course it's developtment board. Not exactly same with Galaxy S6). In my test, I got different data.
My test result about Spec (Opt/Ori) spec.cpu2000.ref.253_perlbmk -> I can't get result. I think without modifying source code, we can't compile it. spec.cpu2000.ref.256_bzip2 99.93% spec.cpu2006.ref.433_milc 99.52% spec.cpu2000.ref.254_gap 99.18% My test result about test-suite (Opt/Ori) I got totally different result on this. SingleSource/Benchmarks/Shootout-C++/ackermann 81.35% MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan -> execution time is 0. It's same. I think both of data also can be worth. And This data is not meaningful any more. As you know, the patch was completely changed. I will share the result data base on recent patch soon. PS. Could you share how to build "spec.cpu2000.ref.253_perlbmk", please? Comment Actions I ran some benchmarks against http://reviews.llvm.org/differential/diff/43578/. (Base Revision: r256315)
Other benchmarks in test-suite's Geomean value : 99.73% On commertial benchmark, we can see the 0.1 ~0.2% improvement of overall. Junmo.
Comment Actions Hi Junmo, Please find some comments inline. Thanks,
Comment Actions Michael, I replied your inline comments. Could you check them, please? Sanjoy, Michael want to insert some patch. Could you give him a opinion, please? Junmo.
Comment Actions Addressed Michael's comments. Waiting Sanjoy's comment. Comment Actions This is looking close to ready to land now. Some nitpicky comments inline.
Comment Actions Sanjoy, I am waiting your review. :) Michael, could you make an example which is more complex that current clean up passes can't deal with? Comment Actions
Providing a concrete example here isn't even that important due to arguments I raised earlier, but here you go: @a = global i64 1 declare void @foo() define i32 @test3(i64* %loc, i64 %conv7) { entry: %rem0 = load i64, i64* %loc, align 8 %ExpensiveComputation = udiv i64 %rem0, 42 br label %outer.header outer.header: %i = phi i64 [ 1, %entry ], [ %i.next, %outer.latch ] br label %bb1 bb1: %addr = getelementptr i64, i64* @a, i64 %i %x = load volatile i64, i64* %addr %div11 = udiv i64 %x, %i %cmp.i38 = icmp ugt i64 %div11, 1 %div12 = select i1 %cmp.i38, i64 %div11, i64 1 br label %for.body for.body: %rem1 = phi i64 [ %rem0, %bb1 ], [ %rem2, %for.body ] %k1 = phi i64 [ %div12, %bb1 ], [ %dec, %for.body ] %mul1 = mul i64 %rem1, 48271 %rem2 = urem i64 %mul1, 2147483647 %dec = add i64 %k1, -1 %cmp = icmp eq i64 %dec, 0 br i1 %cmp, label %outer.latch, label %for.body outer.latch: %i.next = add i64 %i, 1 %outer.cmp = icmp eq i64 %i.next, 1000 br i1 %outer.cmp, label %exit, label %outer.header exit: %rem3 = phi i64 [ %rem2, %outer.latch ] store i64 %rem3, i64* %loc, align 8 ret i32 0 } I ran it with opt < test.ll -S -unroll-runtime -loop-unroll -early-cse -instcombine -simplifycfg -licm and always saw a code like this in the output: %x = load volatile i64, i64* %addr, align 8 %div11 = udiv i64 %x, %i %cmp.i38 = icmp ugt i64 %div11, 1 %div12 = select i1 %cmp.i38, i64 %div11, i64 1 %1 = udiv i64 %x, %0 ; <<< Redundant expensive computation %2 = icmp ugt i64 %1, 1 %umax = select i1 %2, i64 %1, i64 1 Probably, you can come up with another set of passes that will be able to clean this up, but then the example might be made even more complicated, so the new set will fail. The problem here is that with this patch compiler thinks that it generates no redundant code to clean-up, while in fact it's not even limited in what it can generate - we expand *any* existing SCEV expression. So, I'd suggest you to address the issue with expand before going further with this one. Michael Comment Actions Looks like the change I've been asking for function has been committed (see D12090). So, probably we just need to update this patch, and it'll be ok to go in.
Comment Actions Hi Junmo, It looks better now, couple of comments/questions inline. Michael
Comment Actions
Oh, I see, makes sense. The patch LGTM, thanks for your patience:) Michael
Comment Actions Michael, Sanjoy, Thank you for your patience, so many great suggestions! It was great time. ! Merged in r260938. Thanks, |