This is an archive of the discontinued LLVM Phabricator instance.

[SCEVExpander] Make findExistingExpansion smarter
ClosedPublic

Authored by flyingforyou on Dec 16 2015, 12:59 AM.

Details

Summary

Extending findExistingExpansion can use existing value in ExprValueMap.

This patch gives 0.3~0.5% performance improvements on benchmarks(test-suite, spec2000, spec2006, commercial benchmark)

Diff Detail

Event Timeline

flyingforyou retitled this revision from to [LoopUnrollRuntime] Do unroll when the unroll factor is 2.
flyingforyou updated this object.
flyingforyou added reviewers: jmolloy, zzheng.
flyingforyou added a subscriber: llvm-commits.

This needs a test case (either in test/Transforms/LoopUnroll/AArch64 or in test/Transforms/LoopUnroll). Also, why does this logic belong here and not in Expander.isHighCostExpansion?

zzheng edited edge metadata.Dec 16 2015, 11:26 AM

I'm OK with this change for AArch64. Like Hal said, please add a test case.

Hi Junmo,

I had a look at A57 in A64 mode and noticed a 4.81% regression in spec2006's hmmer benchmark. Do you have access to this suite?

Thanks Charlie for comment.

I had a look at A57 in A64 mode and noticed a 4.81% regression in spec2006's hmmer benchmark. Do you have access to this suite?

I will figure it out what's wrong. I set up spec2006, recently.

@hfinkel This needs a test case (either in test/Transforms/LoopUnroll/AArch64 or in test/Transforms/LoopUnroll). Also, why does this logic belong here and not in Expander.isHighCostExpansion?

@zzheng I'm OK with this change for AArch64. Like Hal said, please add a test case.

I also think cheking the unroll factor only is not enough for figuring out that TripCountSC is really expensive or not.
So I will upload new logic & test case next week. I hope that could be acceptable.

Junmo.

I've just seen the following regressions for spec2006 on the A57 in T32 mode,

spec.cpu2006.ref.456_hmmer 	13.34%
spec.cpu2006.ref.471_omnetpp 	7.87%
spec.cpu2006.ref.458_sjeng 	3.72%
spec.cpu2006.ref.429_mcf 	3.48%
spec.cpu2006.ref.462_libquantum 	3.47%
spec.cpu2006.ref.482_sphinx3 	2.27%
spec.cpu2006.ref.445_gobmk 	1.90%
spec.cpu2006.ref.453_povray 	1.61%
spec.cpu2006.ref.447_dealII 	1.10%

hmmer is clearly a good test case here, must be some register pressure issues.

I forgot to mention that I did see the following improvements in A57-T32 for spec2006:

spec.cpu2006.ref.470_lbm 	-10.92%
spec.cpu2006.ref.473_astar 	-9.53%
spec.cpu2006.ref.444_namd 	-1.40%
spec.cpu2006.ref.401_bzip2 	-1.06%

But this is an overall regression on spec2006.

flyingforyou retitled this revision from [LoopUnrollRuntime] Do unroll when the unroll factor is 2 to [LoopUnrollRuntime] Do unroll when IV's setp is one or minus one.
flyingforyou updated this object.
flyingforyou edited edge metadata.

Current algorithm doesn't care about where does trip count's division come from.
That means, if TripCount is divided by some value in loop's preheader, compiler will give up doing unrolling. (Even if IV's step is one or minus one.)

If IV's step is constant likes one or minus one or multiple of 2, we don't need to generate division for computing trip count.

I tested this patch on r256132 with test-suite, spec2000,2006, commercial benchmarks. There is no regression on Cortex-A57.

I hope that this patch is acceptable for everybody.

Junmo.

mzolotukhin added inline comments.
lib/Transforms/Utils/LoopUnrollRuntime.cpp
327–333

There is a method getStepRecurrence in AR, which you might want to use here.

Also, I don't think that changing AllowExpensiveTripCount here is a good way to achieve what you try to achieve. Why (and where) is this tripcount considered expensive? To me it looks like we should rather fix those places to recognize cases like this as 'inexpensive'.

flyingforyou added inline comments.Dec 21 2015, 4:52 PM
lib/Transforms/Utils/LoopUnrollRuntime.cpp
327–333

There is a method getStepRecurrence in AR, which you might want to use here.

Thanks, I will change this later.

Why (and where) is this tripcount considered expensive?

When we run test case "high-cost-trip-count-computation.ll", we can get expression of Tripcount SCEV likes below.
test1 func: (1 + ((-1 + (-1 * %v12) + (8193 smax (%v12 + %step))) /u %step))
test2 func: (1 umax ((23 + %conv7) /u %conv7))

test1's division is generated for computing tripcount. Because %step is unknown. This case, we consider this is expensive. We need to generate new division for computing tripcount.
test2's division is generated by original code which is in loop pre header(entry). In this case, isHighCostExpansion can't figure it out this is for computing trip count or it came from original code.
And isHighCostExpansion is also used by IndVarSimplify. It's not only for LoopUnrollRuntime.

If you still think changing AllowExpensiveTripCount is not good way, please let me know your idea.

Junmo.

mzolotukhin added inline comments.Dec 21 2015, 5:40 PM
lib/Transforms/Utils/LoopUnrollRuntime.cpp
327–333

Ok, I think I understood the issue.

So, in isHighCostExpansionHelper routine there is a check

if (At && findExistingExpansion(S, At, L) != nullptr)
    return false;

Ideally, it should find the existing computation and immediately report that the expansion isn't expensive. In practice, this function doesn't always work as we'd like it to work (see e.g. https://llvm.org/bugs/show_bug.cgi?id=24920), but I'd suggest fixing it instead of working around in other places.

test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
45–49

It's definitely possible to simplify this testcase by removing lots of unneeded code from this loop. Please keep only parts that are needed to expose the issue.

flyingforyou added inline comments.Dec 21 2015, 6:27 PM
lib/Transforms/Utils/LoopUnrollRuntime.cpp
327–333

Thanks for sharing bug about "isHighCostExpansionHelper" function.

LoopUnrollRuntime use isHighCostExpansion function without "At". This means it only want to check TripCountSC has expensive expression or not.

So, checking IV's step may not be working around for LoopUnrollRuntime.

Junmo.

test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
45–49

Yes. You're right. I will remove unnecessary codes.
Thanks.

mzolotukhin added inline comments.Dec 21 2015, 6:58 PM
lib/Transforms/Utils/LoopUnrollRuntime.cpp
327–333

Yes, we're not passing At right now, but we can if we need to. If we do need to expand trip count, it'll be inserted into the loop preheader (look for PreHeaderBR in the code further down), so we actually know At.

And, my main concern is that I don't believe that statement "step is 1/-1" is equivalent to statement "trip count expression is cheap to expand". At least, it's not obvious and requires some proof/explanation (i.e. comments in the code).

flyingforyou added a reviewer: sanjoy.

Addressed Michael's comments.

Thanks.

sanjoy requested changes to this revision.Dec 22 2015, 1:12 AM
sanjoy edited edge metadata.

I'd rather have this sort of logic live in SCEVExpander::isHighCostExpansion than be scattered throughout the codebase. I think the "increment by one" case is covered by findExistingExpansion already; and the "decrement by one" case can be added there as well by teaching it to look at the initial values for the induction phi nodes.

This revision now requires changes to proceed.Dec 22 2015, 1:12 AM

Thanks for comment. Sanjoy.

test2's TripCountSC is "(1 umax ((23 + %conv7) /u %conv7))".

How can we handle this in SCEVExpander::isHighCostExpansion?

Please let me know.

Junmo.

flyingforyou edited edge metadata.

Addressed Sanjoy's comment.

Your comment was really helpful for me. I am still not sure that this patch is ok or not.
But comparing SCEV expression between TripCountSC and exit value makes this patch more robust.

Thanks.

sanjoy requested changes to this revision.Dec 22 2015, 11:40 PM
sanjoy edited edge metadata.
sanjoy added inline comments.
lib/Analysis/ScalarEvolutionExpander.cpp
1828

A better name for this function would be IsContainedIn. And this should use SCEVTraversal.

1854

I don't see why you need to match twice? You should just be able to do

match(BB->getTerminator(), m_Br(m_ICmp(Pred, m_Instruction(LHS), m_Value(RHS)))

with Value *RHS. Then you can make your later code conditional on isa<Instruction>(RHS).

1884

I don't understand this part -- how do you conclude that LHS is a correct expansion for S? E.g. what if getSCEV(LHS) is {S,+,1}? Then clearly the expansion of S isn't LHS.

I think you need to do a (recursive) search of the IR level operands of LHS and RHS to see if any of them are congruent to S.

lib/Transforms/Utils/LoopUnrollRuntime.cpp
318–319

I don't think this is the right value for At. At should be the location you want to expand the SCEV expression at, which in this case is PreHeaderBR.

This revision now requires changes to proceed.Dec 22 2015, 11:40 PM
flyingforyou retitled this revision from [LoopUnrollRuntime] Do unroll when IV's setp is one or minus one to [SCEVExpander] Make findExistingExpansion smarter.
flyingforyou updated this object.
flyingforyou edited edge metadata.
flyingforyou marked an inline comment as done.

Addressed Sanjoy's comment.

Following your suggestion, I realized that checing IV's step is not necessary any more.

I really appreciate your review.

mzolotukhin requested changes to this revision.Dec 23 2015, 10:15 AM
mzolotukhin added a reviewer: mzolotukhin.
mzolotukhin added inline comments.
lib/Analysis/ScalarEvolutionExpander.cpp
1841

Shouldn't it be SCEVTraversal<SCEVSearch>?

1846–1847

What are LHS and RHS here?

1858–1860

This loop will always end on the first operand.
Should it be

if (Op)
  return Op;

?

1885–1886

You could do

if (RHS = dyn_cast<Instruction>(RHSV)) {
...

Also, I wonder if we should check for the case where LHS is a ConstantInt and RHS is the instruction.

lib/Transforms/Utils/LoopUnrollRuntime.cpp
318–319

Could we avoid changing anything unless we're confident we're going to unroll? E.g. we can split PH--->Header edge, then split PEnd basic block, and then we can decide to bail out.

346–347

Please make sure that expander also knows how to reuse existing values. From my last investigation of PR24920 I remember that fixing findExistingExpansion alone isn't enough, as we generate expensive trip count anyway, ignoring the fact that we can reuse existing value for it.

Also, you might take a look at my patch for fixing this bug, which I posted in D12765 (see 6th comment from the end).

test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
27–28

I think this description is a bit misleading. We can unroll not because IV's step is minus one, but because tripcount is known to be divisible by %conv7 (which will be unroll factor, right?), and we don't need to generate a new expression for the new tripcount. Is this correct?

53

Nitpick: this metadata is probably unnecessary.

This revision now requires changes to proceed.Dec 23 2015, 10:15 AM
flyingforyou edited edge metadata.
flyingforyou marked 3 inline comments as done.

Addressed Michael's comments.

flyingforyou marked an inline comment as done.Dec 23 2015, 6:43 PM
flyingforyou added inline comments.
lib/Analysis/ScalarEvolutionExpander.cpp
1841

I think current code is ok.
But there is same code in ScalarEvolution class. So I remove this function. Thanks.

1858–1860

Oh...Thanks.

1885–1886

Also, I wonder if we should check for the case where LHS is a ConstantInt and RHS is the instruction.

I am not sure. Isn't it optional??

lib/Transforms/Utils/LoopUnrollRuntime.cpp
318–319

Yes. You're right.

346–347

I think this is another issue about findExistingExpansion . I am not sure that I can fix the issue in this patch.

flyingforyou added inline comments.Dec 23 2015, 6:43 PM
test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
27–28

It will not be unroll factor. I just want to make example that trip count can be divided by original codes. Not belong to IV's step is unknown likes first example. The main point of first example(test) is that if IV's step is unknown, we should make division for getting tripcount for unrolled loop.

we don't need to generate a new expression for the new tripcount. Is this correct?

This is correct.

If you have better explanation about this example, please let me know.

53

This is original code. I'm not sure it's ok to remove.
Sanjoy, Can I remove this?

Current algorithm doesn't care about where does trip count's division come from.
That means, if TripCount is divided by some value in loop's preheader, compiler will give up doing unrolling. (Even if IV's step is one or minus one.)

If IV's step is constant likes one or minus one or multiple of 2, we don't need to generate division for computing trip count.

I ran some benchmarks against http://reviews.llvm.org/differential/diff/43408/

spec regressions on a Cortex-A57 (A64):

spec.cpu2000.ref.253_perlbmk 	4.71%
spec.cpu2000.ref.256_bzip2 	2.07%
spec.cpu2006.ref.433_milc 	1.24%

There's a small improvement in 254_gap

spec.cpu2000.ref.254_gap 	-1.76%

(I'm focussing on SPEC because it's less noisy than the benchmarks in test-suite, although I do note a 23% improvement in lnt.MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan and a 7.76% regression in lnt.SingleSource/Benchmarks/Shootout-C++/ackermann)

spec regressions on a Cortex-A57 (T32):

spec.cpu2006.ref.482_sphinx3 	2.31%

and an improvement in:

spec.cpu2000.ref.181_mcf 	-2.13%

I tested this patch on r256132 with test-suite, spec2000,2006, commercial benchmarks. There is no regression on Cortex-A57.

Did you notice these regressions on your hardware? I find it strange to think this could all be accounted to different revisions of this core. I'm testing on this platform: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dto0038a/index.html)

I see some small improvments in a commerical benchmark (1-2%), but it is worth mentioning the SPEC changes, since I wouldn't write those off as no regressions; it's definite regression in SPEC, but whether the other improvements in test-suite balance them in the view of the community is debatable.

flyingforyou marked an inline comment as done.Dec 24 2015, 8:24 PM

I ran test-suite, spec on Galaxy S6 Target(CPU 2.1Ghz, MIF 1.5Ghz, Of course it's developtment board. Not exactly same with Galaxy S6).
I also can test on Juno. But it's very slow and I want to test on real world environment.

In my test, I got different data.

Charlie's test result
spec on a Cortex-A57 (A64):
spec.cpu2000.ref.253_perlbmk 4.71%
spec.cpu2000.ref.256_bzip2 2.07%
spec.cpu2006.ref.433_milc 1.24%
spec.cpu2000.ref.254_gap -1.76%

(I'm focussing on SPEC because it's less noisy than the benchmarks in test-suite, although I do note a 23% improvement in lnt.MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan and a 7.76% regression in lnt.SingleSource/Benchmarks/Shootout-C++/ackermann)

My test result about Spec (Opt/Ori)
spec.cpu2000.ref.253_perlbmk 	-> I can't get result. I think without modifying source code, we can't compile it.
spec.cpu2000.ref.256_bzip2 	99.93% 
spec.cpu2006.ref.433_milc 	99.52%
spec.cpu2000.ref.254_gap 	99.18%
My test result about test-suite (Opt/Ori)
I got totally different result on this.
SingleSource/Benchmarks/Shootout-C++/ackermann	81.35%
MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan -> execution time is 0. It's same.

I think both of data also can be worth. And This data is not meaningful any more. As you know, the patch was completely changed. I will share the result data base on recent patch soon.

PS. Could you share how to build "spec.cpu2000.ref.253_perlbmk", please?

I ran some benchmarks against http://reviews.llvm.org/differential/diff/43578/. (Base Revision: r256315)
Env: Cortex-A57 2.1Ghz, MIF 1.5Ghz, AArch64
Test: Execute each benchmark 4 times and get average.

Test NameOpt_Exe_Time/Ori_Exe_Time
External/SPEC/CFP2000/177.mesa/177.mesa100.33%
External/SPEC/CFP2000/179.art/179.art102.20%
External/SPEC/CFP2000/183.equake/183.equake99.79%
External/SPEC/CFP2000/188.ammp/188.ammp99.97%
External/SPEC/CFP2006/433.milc/433.milc100.05%
External/SPEC/CFP2006/444.namd/444.namd100.06%
External/SPEC/CFP2006/447.dealII/447.dealII100.08%
External/SPEC/CFP2006/450.soplex/450.soplex100.00%
External/SPEC/CFP2006/470.lbm/470.lbm99.97%
External/SPEC/CINT2000/164.gzip/164.gzip100.13%
External/SPEC/CINT2000/175.vpr/175.vpr99.67%
External/SPEC/CINT2000/176.gcc/176.gcc99.60%
External/SPEC/CINT2000/181.mcf/181.mcf99.93%
External/SPEC/CINT2000/186.crafty/186.crafty99.76%
External/SPEC/CINT2000/197.parser/197.parser100.34%
External/SPEC/CINT2000/254.gap/254.gap100.34%
External/SPEC/CINT2000/255.vortex/255.vortex99.61%
External/SPEC/CINT2000/256.bzip2/256.bzip299.71%
External/SPEC/CINT2006/400.perlbench/400.perlbench100.21%
External/SPEC/CINT2006/401.bzip2/401.bzip299.87%
External/SPEC/CINT2006/403.gcc/403.gcc99.74%
External/SPEC/CINT2006/429.mcf/429.mcf100.25%
External/SPEC/CINT2006/445.gobmk/445.gobmk100.89%
External/SPEC/CINT2006/456.hmmer/456.hmmer101.26%
External/SPEC/CINT2006/458.sjeng/458.sjeng99.88%
External/SPEC/CINT2006/462.libquantum/462.libquantum99.75%
External/SPEC/CINT2006/471.omnetpp/471.omnetpp98.04%
External/SPEC/CINT2006/473.astar/473.astar99.80%
External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk98.48%
Geomean99.99%

Other benchmarks in test-suite's Geomean value : 99.73%

On commertial benchmark, we can see the 0.1 ~0.2% improvement of overall.

Junmo.

sanjoy requested changes to this revision.Jan 17 2016, 1:12 PM
sanjoy edited edge metadata.
sanjoy added inline comments.
lib/Analysis/ScalarEvolutionExpander.cpp
1825

Nit: do you mean I's operands?

1828

Nit: LLVM convention is Depth.

1833

Might want to check SE.isSCEVAble(I->getType()) before calling getSCEV to avoid creating too many SCEVUnknown objects unnecessarily.

1836

Use a range for here.

1838

I think it will be cleaner to handle the non-instruction case in the caller (and remove the I == nullptr check above), since there is no reason to have the initial caller (first level of recursion) pass in nullptr for I.

I don't think depth++ is correct -- you'll blow the stack if I is a PHI node with itself as its first operand. At the very least, it needs to be ++depth. Also, since you're calling the counter depth, it'd make more sense to pass in depth + 1 here, and not ++depth. If you do want to increment depth in place, it might be better to call it Iter.

1875

Why do you care about RHS being a constant?

I think it is better to have the logic be:

  1. Check LHS as usual
  2. If RHSV is an Instruction then check RHS as well
  3. Use FindSameSCEVInst to do a deeper search on LHS

Is there any reason why you don't need to call FindSameSCEVInst on RHS as well (when it is an instruction)?

test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
53

Reading the code for @test, I think the metadata is required to tell LLVM that it is okay to generate a divide by %step if it thinks that is profitable to do (and so the test case actually checks LLVM's heuristics, not correctness).

This revision now requires changes to proceed.Jan 17 2016, 1:12 PM
flyingforyou marked 5 inline comments as done and an inline comment as not done.Jan 17 2016, 10:40 PM
flyingforyou added inline comments.
lib/Analysis/ScalarEvolutionExpander.cpp
1825

Yep.

1838

Sure, I will remove` I == nullptr`, add nullptr check when calling this function.

Thanks. I will change ++depth to Depth + 1.

1875

There is no critical reason that I care about RHS beging a constant.

I will modify the logic what you said. It seems more proper way for findExistingExpansion.

Thanks.

test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
53

Thanks. I'll remain the code.

flyingforyou edited edge metadata.
flyingforyou marked 2 inline comments as done.

Addressed Sanjoy's comments.

mzolotukhin requested changes to this revision.Jan 20 2016, 4:17 PM
mzolotukhin edited edge metadata.

Hi Junmo,

Please find some comments inline.

Thanks,
Michael

lib/Analysis/ScalarEvolutionExpander.cpp
1863–1864

I don't understand why we need to break symmetry in this code. Anyway, if we do this, we should either add an assert that the opposite situation never happens, or properly handle both cases. The fact that we've only seen one case is not a justification for not handling the other one.

IOW, why can't LHS be Value, and RHS be Instruction?

lib/Transforms/Utils/LoopUnrollRuntime.cpp
346–347

This should be in the same patch, otherwise the change will only paper over the issue, and lead to actually worse code in the end. If expandCodeFor isn't updated, it'll fail to reuse the existing value, so it'll generate yet another division. The idea of the patch is to teach SCEV to reuse existing expressions, not just hack around the high cost of division. The division should still be considered expensive, so we definitely shouldn't generate redundant ones.

test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
27–28

Maybe something like this?

Though SCEV for loop tripcount contains division, it shouldn't be considered expensive, since the division already exists in the code and we don't need to expand it once more. Thus, it shouldn't prevent us from unrolling the loop.

Also, I think I don't completely understand why we care so much about step being -1. Why is it important?

53

This metadata is used as !range !0 in @test and as !tbaa !0 in @test2. Are you sure this is correct?

This revision now requires changes to proceed.Jan 20 2016, 4:17 PM
flyingforyou added inline comments.Jan 20 2016, 9:19 PM
lib/Analysis/ScalarEvolutionExpander.cpp
1863–1864

I think LHS can be Value. But what I know is above pattern can cover most of cases.

if (match(BB->getTerminator(),
           m_Br(m_ICmp(Pred, m_Value(LHSV), m_Value(RHSV)), TrueBB,
                FalseBB))) {
  dbgs() << "m_Value(LHSV), m_Value(RHSV) pattern\n";
  if (match(BB->getTerminator(),
             m_Br(m_ICmp(Pred, m_Instruction(LHS), m_Instruction(RHS)), TrueBB,
                  FalseBB)))
    dbgs() << "m_Instruction(LHS), m_Instruction(RHS) pattern\n";
  else if (match(BB->getTerminator(),
             m_Br(m_ICmp(Pred, m_Instruction(LHS), m_Value(RHSV)), TrueBB,
                  FalseBB)))
    dbgs() << "m_Instruction(LHS), m_Value(RHSV) pattern\n";
  else
    dbgs() << "unknown(LHS), unknown(RHS) pattern\n";

}

When I test above code, I can't find "unknown(LHS), unknown(RHS) pattern\n" on building big benchmarks programs including LLVM itself.

m_Value(LHSV), m_Value(RHSV) pattern50650
m_Instruction(LHS), m_Instruction(RHS) pattern30178
m_Instruction(LHS), m_Value(RHSV) pattern20472

So I think it's no necessary to consider LHS is Value now.

Actually, we don't make compare instruction likes cmp constant, reg...

lib/Transforms/Utils/LoopUnrollRuntime.cpp
346–347

Michael, I am not an expert of SCEV. I think you might want to get review from Sanjoy.

Sanjoy, How about applying below patch on this?

-    if (SE.getSCEV(LHS) == S && SE.DT.dominates(LHS, At))
+    const SCEV *LHSSE = SE.getSCEV(LHS);
+    if (LHSSE->getType() == S->getType() &&
+        isa<SCEVConstant>(SE.getMinusSCEV(LHSSE, S)) &&
+        SE.DT.dominates(LHS, At))
       return LHS;
 
     if (Instruction *RHS = dyn_cast<Instruction>(RHSV)) {
-      if (SE.getSCEV(RHS) == S && SE.DT.dominates(RHS, At))
+      const SCEV *RHSSE = SE.getSCEV(RHS);
+      if (RHSSE->getType() == S->getType() &&
+          isa<SCEVConstant>(SE.getMinusSCEV(RHSSE, S)) &&
+          SE.DT.dominates(RHS, At))
         return RHS;
     }
test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
27–28

Cool. Michael.
I will modify comments as you said.

-1 indicates that tripcount doesn't need to be divided for calculation when loop unrolling.
Anyway thanks for good sugesstion.

53

I don't think that makes errors. But I got your point. Thanks.
I will modify !tbaa !0 to !tbaa !1
and add !1 = !{i64 0}

Is this ok?

Michael, I replied your inline comments. Could you check them, please?

Sanjoy, Michael want to insert some patch. Could you give him a opinion, please?

Junmo.

mzolotukhin added inline comments.Jan 25 2016, 12:09 PM
lib/Analysis/ScalarEvolutionExpander.cpp
1863–1864

My concern isn't about us missing some real cases right now, it's more about future maintenance of this code. If later on someone finds such a case, it'll be hard to figure out, why isn't it covered in the first place, when the code was committed.

test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
53

This is ok (or you can just remove !tbaa !1 and !1 = !{i64 0} at all). Thanks!

flyingforyou updated this object.
flyingforyou edited edge metadata.
flyingforyou marked 5 inline comments as done.

Addressed Michael's comments.
Thanks for good description example.

Waiting Sanjoy's comment.

sanjoy requested changes to this revision.Jan 27 2016, 12:40 PM
sanjoy edited edge metadata.

This is looking close to ready to land now. Some nitpicky comments inline.

lib/Analysis/ScalarEvolutionExpander.cpp
1827

How about renaming this to FindCongruentInst, or FindInstWithSameSCEV. FindSameSCEVInst does not sound okay grammatically.

1833

Minor: I'd add a small comment that you're doing the pre-check on using isSCEVable to avoid creating too many SCEVUnknown expressions.

1834

I'm worried that we're possibly doing too much work here (getSCEV can be expensive). Can we further filter on the type of I? For instance, if the width of I->getType() and S->getType() are different, then we don't need to bother calling getSCEV on I.

Edit: I'm much less worried about doing too much work here, after seeing the guard on SE.hasOperand in the caller; but I still think filtering on the type is a good idea.

1868

These two identical blocks should definitely be combined into a lambda.

1869

I'd "CSE" the two calls to getSCEV(LHS) to avoid the extra dictionary lookup.

1871

Please don't reuse the LHS variable here. Perhaps auto *ExistingInst = ..?

This revision now requires changes to proceed.Jan 27 2016, 12:40 PM
mzolotukhin requested changes to this revision.Jan 27 2016, 4:40 PM
mzolotukhin edited edge metadata.
mzolotukhin added inline comments.
lib/Transforms/Utils/LoopUnrollRuntime.cpp
346–347

We can wait for Sanjoy's opinion too, but the patch can't be landed without this issue fixed first anyway.

I slightly modified your testcase to expose the issue:

define i32 @test2(i64* %loc, i64 %conv7) {
entry:
  %rem0 = load i64, i64* %loc, align 8
  %ExpensiveComputation = udiv i64 %rem0, 42 ; <<< Extra computations are added to the trip-count expression
  br label %bb1
bb1:
  %div11 = udiv i64 %ExpensiveComputation, %conv7
  %cmp.i38 = icmp ugt i64 %div11, 1
  %div12 = select i1 %cmp.i38, i64 %div11, i64 1
  br label %for.body
for.body:
  %rem1 = phi i64 [ %rem0, %bb1 ], [ %rem2, %for.body ]
  %k1 = phi i64 [ %div12, %bb1 ], [ %dec, %for.body ]
  %mul1 = mul i64 %rem1, 48271
  %rem2 = urem i64 %mul1, 2147483647
  %dec = add i64 %k1, -1
  %cmp = icmp eq i64 %dec, 0
  br i1 %cmp, label %exit, label %for.body
exit:
  %rem3 = phi i64 [ %rem2, %for.body ]
  store i64 %rem3, i64* %loc, align 8
  ret i32 0
}

If we apply the patch and run this test thru opt, the loop will be successfully unrolled, *but* we'll generate the code for "expensive" computations. This is IR after unrolling:

...
entry:
  %rem0 = load i64, i64* %loc, align 8
  %ExpensiveComputation = udiv i64 %rem0, 42
  br label %bb1

bb1:                                              ; preds = %entry
  %div11 = udiv i64 %ExpensiveComputation, %conv7
  %cmp.i38 = icmp ugt i64 %div11, 1
  %div12 = select i1 %cmp.i38, i64 %div11, i64 1
  %0 = udiv i64 %rem0, 42      ; <<<< This is our %ExpensiveComputation, and we just duplicated it!
  %1 = udiv i64 %0, %conv7     ; <<<
  %2 = icmp ugt i64 %1, 1
  %umax = select i1 %2, i64 %1, i64 1
  %3 = add i64 %umax, -1
  %xtraiter = and i64 %umax, 7
  %lcmp.mod = icmp ne i64 %xtraiter, 0
  br i1 %lcmp.mod, label %for.body.prol, label %bb1.split
...

Please note inserted instruction %0 and %1 - they're completely redundant (as we compute the same value, that we have in %div11).

So, as I said, it's not sufficient to just fix findExistingExpransion, we also need to fix SCEVExpander::expand.

sanjoy added inline comments.Jan 27 2016, 5:43 PM
lib/Transforms/Utils/LoopUnrollRuntime.cpp
346–347

I think the original idea was that if we know that a certain value is already being computed, then we can emit code to compute it again, and rely on later passes to get rid of the redundancy (so in your example above we'd expect %0 and %1 to be CSE'ed with %ExpensiveComputation and %div11).

But I understand your concern that this is a pessimization unless we run some cleanup passes afterwards, and even then we rely on the cleanup passes being smart enough to undo our mess. Perhaps SCEVExpander::expand can be fixed in a separate change (since the scope of that fix is bigger than what is addressed here) that goes in before this one?

mzolotukhin added inline comments.Jan 27 2016, 6:00 PM
lib/Transforms/Utils/LoopUnrollRuntime.cpp
346–347

My example was pretty simple, just to demonstrate the problem, and in this case indeed a simple clean-up would be sufficient to remove the introduced redundancy. But it's easy to construct a more complicated test, which existing clean-up passes might have problems to deal with, and since SCEV expressions might be very complicated, expanding all of them might be not a good idea, especially taking into account that some of them are actually expensive.

As for separating the patches - I'm entirely for it, but I just want to first make sure that there are no fundamental issues that can prevent us from fixing expand. My biggest concern is that this change leads to somewhat inconsistent state, where findExistingExpansion and expand have different assumptions. I'll feel much easier about it once I see the expand part of the patch.

flyingforyou marked 6 inline comments as done.Jan 27 2016, 9:47 PM
flyingforyou added inline comments.
lib/Analysis/ScalarEvolutionExpander.cpp
1834

Thanks for good sugesstion.

1871

Thanks for guiding this. As you said, using lambda could be more clear than before. Great!.

flyingforyou edited edge metadata.
flyingforyou marked 2 inline comments as done.

Addressed Sanjoy's comments.

Thanks,
Junmo.

Sanjoy, I am waiting your review. :)

Michael, could you make an example which is more complex that current clean up passes can't deal with?
I checked your previous example's expensive expression is cleaned up.

mzolotukhin requested changes to this revision.Feb 3 2016, 3:53 PM
mzolotukhin edited edge metadata.

Michael, could you make an example which is more complex that current clean up passes can't deal with? I checked your previous example's expensive expression is cleaned up.

Providing a concrete example here isn't even that important due to arguments I raised earlier, but here you go:

@a = global i64 1

declare void @foo()

define i32 @test3(i64* %loc, i64 %conv7) {
entry:
  %rem0 = load i64, i64* %loc, align 8
  %ExpensiveComputation = udiv i64 %rem0, 42
  br label %outer.header

outer.header:
  %i = phi i64 [ 1, %entry ], [ %i.next, %outer.latch ]
  br label %bb1

bb1:
  %addr = getelementptr i64, i64* @a, i64 %i
  %x = load volatile i64, i64* %addr
  %div11 = udiv i64 %x, %i
  %cmp.i38 = icmp ugt i64 %div11, 1
  %div12 = select i1 %cmp.i38, i64 %div11, i64 1
  br label %for.body

for.body:
  %rem1 = phi i64 [ %rem0, %bb1 ], [ %rem2, %for.body ]
  %k1 = phi i64 [ %div12, %bb1 ], [ %dec, %for.body ]
  %mul1 = mul i64 %rem1, 48271
  %rem2 = urem i64 %mul1, 2147483647
  %dec = add i64 %k1, -1
  %cmp = icmp eq i64 %dec, 0
  br i1 %cmp, label %outer.latch, label %for.body

outer.latch:
  %i.next = add i64 %i, 1
  %outer.cmp = icmp eq i64 %i.next, 1000
  br i1 %outer.cmp, label %exit, label %outer.header

exit:
  %rem3 = phi i64 [ %rem2, %outer.latch ]
  store i64 %rem3, i64* %loc, align 8
  ret i32 0
}

I ran it with

opt < test.ll -S -unroll-runtime -loop-unroll -early-cse -instcombine -simplifycfg -licm

and always saw a code like this in the output:

%x = load volatile i64, i64* %addr, align 8
%div11 = udiv i64 %x, %i
%cmp.i38 = icmp ugt i64 %div11, 1
%div12 = select i1 %cmp.i38, i64 %div11, i64 1
%1 = udiv i64 %x, %0                           ; <<< Redundant expensive computation
%2 = icmp ugt i64 %1, 1
%umax = select i1 %2, i64 %1, i64 1

Probably, you can come up with another set of passes that will be able to clean this up, but then the example might be made even more complicated, so the new set will fail. The problem here is that with this patch compiler thinks that it generates no redundant code to clean-up, while in fact it's not even limited in what it can generate - we expand *any* existing SCEV expression.

So, I'd suggest you to address the issue with expand before going further with this one.

Michael

This revision now requires changes to proceed.Feb 3 2016, 3:53 PM

Looks like the change I've been asking for function has been committed (see D12090). So, probably we just need to update this patch, and it'll be ok to go in.

flyingforyou edited edge metadata.

Addressed Michael's comment.

It's very good timing.!

Thanks.

mzolotukhin added inline comments.Feb 5 2016, 11:42 AM
lib/Analysis/ScalarEvolutionExpander.cpp
1874

I think FindCongruentInst isn't needed right now. The logic in findExistingExpansion should basically mirror the logic from expand: expand just looks at the hash map, so we should do the same here. (And that should do the trick in our example here)

flyingforyou edited edge metadata.
flyingforyou marked an inline comment as done.

Addressed Michael's comment.

Hi Junmo,

It looks better now, couple of comments/questions inline.

Michael

lib/Analysis/ScalarEvolutionExpander.cpp
1858–1865

Is there any point from changing LHS and RHS from Instruction to Value? Looks like later on we cast it to Instruction anyway..

1888–1889

Why do we need these conditions? Isn't dominance sufficient?

Also, would it make sense to factor out this code block to a separate function (and use it both here and in expand)?

flyingforyou marked an inline comment as done.Feb 15 2016, 4:19 PM
flyingforyou added a subscriber: sunfish.
flyingforyou added inline comments.
lib/Analysis/ScalarEvolutionExpander.cpp
1888–1889

Attached Wei's comment in r259815

Fix a regression for r259736.

When SCEV expansion tries to reuse an existing value, it is needed to ensure
that using the Value at the InsertPt will not break LCSSA. The fix adds a
check that InsertPt is either inside the candidate Value's parent loop, or
the candidate Value's parent loop is nullptr.

would it make sense to factor out this code block to a separate function (and use it both here and in expand)?

Sure.

flyingforyou updated this object.

Addressed Michael's comments.
Thanks.

mzolotukhin accepted this revision.Feb 15 2016, 4:41 PM
mzolotukhin edited edge metadata.

Attached Wei's comment in r259815

Fix a regression for r259736.

When SCEV expansion tries to reuse an existing value, it is needed to ensure
that using the Value at the InsertPt will not break LCSSA. The fix adds a
check that InsertPt is either inside the candidate Value's parent loop, or
the candidate Value's parent loop is nullptr.

Oh, I see, makes sense. The patch LGTM, thanks for your patience:)

Michael

include/llvm/Analysis/ScalarEvolutionExpander.h
266 ↗(On Diff #48035)

Please add a comment for the new function.

This revision was automatically updated to reflect the committed changes.
flyingforyou marked an inline comment as done.

Michael, Sanjoy,

Thank you for your patience, so many great suggestions!

It was great time. !

Merged in r260938.

Thanks,
Junmo.