Download Raw Diff

Details

Reviewers

mzolotukhin
jmolloy
zzheng
sanjoy

Commits

rG6ebdc14cf1b2: [SCEVExpander] Make findExistingExpansion smarter
rL260938: [SCEVExpander] Make findExistingExpansion smarter

Summary

Extending findExistingExpansion can use existing value in ExprValueMap.

This patch gives 0.3~0.5% performance improvements on benchmarks(test-suite, spec2000, spec2006, commercial benchmark)

Diff Detail

Repository: rL LLVM

Event Timeline

flyingforyou updated this revision to Diff 42961.Dec 16 2015, 12:59 AM

flyingforyou retitled this revision from to [LoopUnrollRuntime] Do unroll when the unroll factor is 2.

flyingforyou updated this object.

flyingforyou added reviewers: jmolloy, zzheng.

flyingforyou added a subscriber: llvm-commits.

This needs a test case (either in test/Transforms/LoopUnroll/AArch64 or in test/Transforms/LoopUnroll). Also, why does this logic belong here and not in Expander.isHighCostExpansion?

I'm OK with this change for AArch64. Like Hal said, please add a test case.

Hi Junmo,

I had a look at A57 in A64 mode and noticed a 4.81% regression in spec2006's hmmer benchmark. Do you have access to this suite?

Thanks Charlie for comment.

I had a look at A57 in A64 mode and noticed a 4.81% regression in spec2006's hmmer benchmark. Do you have access to this suite?

I will figure it out what's wrong. I set up spec2006, recently.

@hfinkel This needs a test case (either in test/Transforms/LoopUnroll/AArch64 or in test/Transforms/LoopUnroll). Also, why does this logic belong here and not in Expander.isHighCostExpansion?

@zzheng I'm OK with this change for AArch64. Like Hal said, please add a test case.

I also think cheking the unroll factor only is not enough for figuring out that TripCountSC is really expensive or not.
So I will upload new logic & test case next week. I hope that could be acceptable.

Junmo.

I've just seen the following regressions for spec2006 on the A57 in T32 mode,

spec.cpu2006.ref.456_hmmer 	13.34%
spec.cpu2006.ref.471_omnetpp 	7.87%
spec.cpu2006.ref.458_sjeng 	3.72%
spec.cpu2006.ref.429_mcf 	3.48%
spec.cpu2006.ref.462_libquantum 	3.47%
spec.cpu2006.ref.482_sphinx3 	2.27%
spec.cpu2006.ref.445_gobmk 	1.90%
spec.cpu2006.ref.453_povray 	1.61%
spec.cpu2006.ref.447_dealII 	1.10%

hmmer is clearly a good test case here, must be some register pressure issues.

I forgot to mention that I did see the following improvements in A57-T32 for spec2006:

spec.cpu2006.ref.470_lbm 	-10.92%
spec.cpu2006.ref.473_astar 	-9.53%
spec.cpu2006.ref.444_namd 	-1.40%
spec.cpu2006.ref.401_bzip2 	-1.06%

But this is an overall regression on spec2006.

Current algorithm doesn't care about where does trip count's division come from.
That means, if TripCount is divided by some value in loop's preheader, compiler will give up doing unrolling. (Even if IV's step is one or minus one.)

If IV's step is constant likes one or minus one or multiple of 2, we don't need to generate division for computing trip count.

I tested this patch on r256132 with test-suite, spec2000,2006, commercial benchmarks. There is no regression on Cortex-A57.

I hope that this patch is acceptable for everybody.

Junmo.

mzolotukhin added a subscriber: mzolotukhin.Dec 21 2015, 4:12 PM

mzolotukhin added inline comments.

lib/Transforms/Utils/LoopUnrollRuntime.cpp
325–331 ↗	(On Diff #43408)	There is a method `getStepRecurrence` in `AR`, which you might want to use here. Also, I don't think that changing `AllowExpensiveTripCount` here is a good way to achieve what you try to achieve. Why (and where) is this tripcount considered expensive? To me it looks like we should rather fix those places to recognize cases like this as 'inexpensive'.

flyingforyou added inline comments.Dec 21 2015, 4:52 PM

lib/Transforms/Utils/LoopUnrollRuntime.cpp
325–331 ↗	(On Diff #43408)	There is a method getStepRecurrence in AR, which you might want to use here. Thanks, I will change this later. Why (and where) is this tripcount considered expensive? When we run test case "high-cost-trip-count-computation.ll", we can get expression of Tripcount SCEV likes below. test1 func: (1 + ((-1 + (-1 * %v12) + (8193 smax (%v12 + %step))) /u %step)) test2 func: (1 umax ((23 + %conv7) /u %conv7)) test1's division is generated for computing tripcount. Because %step is unknown. This case, we consider this is expensive. We need to generate new division for computing tripcount. test2's division is generated by original code which is in loop pre header(entry). In this case, isHighCostExpansion can't figure it out this is for computing trip count or it came from original code. And isHighCostExpansion is also used by IndVarSimplify. It's not only for LoopUnrollRuntime. If you still think changing AllowExpensiveTripCount is not good way, please let me know your idea. Junmo.

mzolotukhin added inline comments.Dec 21 2015, 5:40 PM

lib/Transforms/Utils/LoopUnrollRuntime.cpp
325–331 ↗	(On Diff #43408)	Ok, I think I understood the issue. So, in `isHighCostExpansionHelper` routine there is a check if (At && findExistingExpansion(S, At, L) != nullptr) return false; Ideally, it should find the existing computation and immediately report that the expansion isn't expensive. In practice, this function doesn't always work as we'd like it to work (see e.g. https://llvm.org/bugs/show_bug.cgi?id=24920), but I'd suggest fixing it instead of working around in other places.
test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
45–49 ↗	(On Diff #43408)	It's definitely possible to simplify this testcase by removing lots of unneeded code from this loop. Please keep only parts that are needed to expose the issue.

flyingforyou added inline comments.Dec 21 2015, 6:27 PM

lib/Transforms/Utils/LoopUnrollRuntime.cpp
325–331 ↗	(On Diff #43408)	Thanks for sharing bug about "isHighCostExpansionHelper" function. LoopUnrollRuntime use isHighCostExpansion function without "At". This means it only want to check TripCountSC has expensive expression or not. So, checking IV's step may not be working around for LoopUnrollRuntime. Junmo.
test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
45–49 ↗	(On Diff #43408)	Yes. You're right. I will remove unnecessary codes. Thanks.

mzolotukhin added inline comments.Dec 21 2015, 6:58 PM

lib/Transforms/Utils/LoopUnrollRuntime.cpp
325–331 ↗	(On Diff #43408)	Yes, we're not passing `At` right now, but we can if we need to. If we do need to expand trip count, it'll be inserted into the loop preheader (look for `PreHeaderBR` in the code further down), so we actually know `At`. And, my main concern is that I don't believe that statement "step is 1/-1" is equivalent to statement "trip count expression is cheap to expand". At least, it's not obvious and requires some proof/explanation (i.e. comments in the code).

Addressed Michael's comments.

Thanks.

I'd rather have this sort of logic live in SCEVExpander::isHighCostExpansion than be scattered throughout the codebase. I think the "increment by one" case is covered by findExistingExpansion already; and the "decrement by one" case can be added there as well by teaching it to look at the initial values for the induction phi nodes.

This revision now requires changes to proceed.Dec 22 2015, 1:12 AM

Thanks for comment. Sanjoy.

test2's TripCountSC is "(1 umax ((23 + %conv7) /u %conv7))".

How can we handle this in SCEVExpander::isHighCostExpansion?

Please let me know.

Junmo.

Addressed Sanjoy's comment.

Your comment was really helpful for me. I am still not sure that this patch is ok or not.
But comparing SCEV expression between TripCountSC and exit value makes this patch more robust.

Thanks.

sanjoy requested changes to this revision.Dec 22 2015, 11:40 PM

sanjoy edited edge metadata.

sanjoy added inline comments.

lib/Analysis/ScalarEvolutionExpander.cpp
1828 ↗	(On Diff #43507)	A better name for this function would be `IsContainedIn`. And this should use `SCEVTraversal`.
1854 ↗	(On Diff #43507)	I don't see why you need to `match` twice? You should just be able to do match(BB->getTerminator(), m_Br(m_ICmp(Pred, m_Instruction(LHS), m_Value(RHS))) with `Value *RHS`. Then you can make your later code conditional on `isa<Instruction>(RHS)`.
1884 ↗	(On Diff #43507)	I don't understand this part -- how do you conclude that `LHS` is a correct expansion for `S`? E.g. what if `getSCEV(LHS)` is `{S,+,1}`? Then clearly the expansion of `S` isn't `LHS`. I think you need to do a (recursive) search of the IR level operands of `LHS` and `RHS` to see if any of them are congruent to `S`.
lib/Transforms/Utils/LoopUnrollRuntime.cpp
316 ↗	(On Diff #43507)	I don't think this is the right value for `At`. `At` should be the location you want to expand the SCEV expression at, which in this case is `PreHeaderBR`.

This revision now requires changes to proceed.Dec 22 2015, 11:40 PM

Addressed Sanjoy's comment.

Following your suggestion, I realized that checing IV's step is not necessary any more.

I really appreciate your review.

Herald added a subscriber: sanjoy. · View Herald TranscriptDec 23 2015, 3:16 AM

mzolotukhin requested changes to this revision.Dec 23 2015, 10:15 AM

mzolotukhin added a reviewer: mzolotukhin.

mzolotukhin added inline comments.

lib/Analysis/ScalarEvolutionExpander.cpp
1841 ↗	(On Diff #43517)	Shouldn't it be `SCEVTraversal<SCEVSearch>`?
1846–1847 ↗	(On Diff #43517)	What are `LHS` and `RHS` here?
1858–1860 ↗	(On Diff #43517)	This loop will always end on the first operand. Should it be if (Op) return Op; ?
1885–1886 ↗	(On Diff #43517)	You could do if (RHS = dyn_cast<Instruction>(RHSV)) { ... Also, I wonder if we should check for the case where LHS is a ConstantInt and RHS is the instruction.
lib/Transforms/Utils/LoopUnrollRuntime.cpp
318–319 ↗	(On Diff #43517)	Could we avoid changing anything unless we're confident we're going to unroll? E.g. we can split `PH`--->`Header` edge, then split `PEnd` basic block, and then we can decide to bail out.
346–347 ↗	(On Diff #43517)	Please make sure that expander also knows how to reuse existing values. From my last investigation of PR24920 I remember that fixing `findExistingExpansion` alone isn't enough, as we generate expensive trip count anyway, ignoring the fact that we can reuse existing value for it. Also, you might take a look at my patch for fixing this bug, which I posted in D12765 (see 6th comment from the end).
test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
27–28 ↗	(On Diff #43517)	I think this description is a bit misleading. We can unroll not because IV's step is minus one, but because tripcount is known to be divisible by `%conv7` (which will be unroll factor, right?), and we don't need to generate a new expression for the new tripcount. Is this correct?
53 ↗	(On Diff #43517)	Nitpick: this metadata is probably unnecessary.

This revision now requires changes to proceed.Dec 23 2015, 10:15 AM

Addressed Michael's comments.

flyingforyou marked an inline comment as done.Dec 23 2015, 6:43 PM

flyingforyou added inline comments.

lib/Analysis/ScalarEvolutionExpander.cpp
1841 ↗	(On Diff #43517)	I think current code is ok. But there is same code in ScalarEvolution class. So I remove this function. Thanks.
1858–1860 ↗	(On Diff #43517)	Oh...Thanks.
1885–1886 ↗	(On Diff #43517)	Also, I wonder if we should check for the case where LHS is a ConstantInt and RHS is the instruction. I am not sure. Isn't it optional??
lib/Transforms/Utils/LoopUnrollRuntime.cpp
318–319 ↗	(On Diff #43517)	Yes. You're right.
346–347 ↗	(On Diff #43517)	I think this is another issue about `findExistingExpansion` . I am not sure that I can fix the issue in this patch.

flyingforyou added inline comments.Dec 23 2015, 6:43 PM

test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
27–28 ↗	(On Diff #43517)	It will not be unroll factor. I just want to make example that trip count can be divided by original codes. Not belong to IV's step is unknown likes first example. The main point of first example(test) is that if IV's step is unknown, we should make division for getting tripcount for unrolled loop. we don't need to generate a new expression for the new tripcount. Is this correct? This is correct. If you have better explanation about this example, please let me know.
53 ↗	(On Diff #43517)	This is original code. I'm not sure it's ok to remove. Sanjoy, Can I remove this?

In D15559#315149, @flyingforyou wrote:

Current algorithm doesn't care about where does trip count's division come from.
That means, if TripCount is divided by some value in loop's preheader, compiler will give up doing unrolling. (Even if IV's step is one or minus one.)

If IV's step is constant likes one or minus one or multiple of 2, we don't need to generate division for computing trip count.

I ran some benchmarks against http://reviews.llvm.org/differential/diff/43408/

spec regressions on a Cortex-A57 (A64):

spec.cpu2000.ref.253_perlbmk 	4.71%
spec.cpu2000.ref.256_bzip2 	2.07%
spec.cpu2006.ref.433_milc 	1.24%

There's a small improvement in 254_gap

spec.cpu2000.ref.254_gap 	-1.76%

(I'm focussing on SPEC because it's less noisy than the benchmarks in test-suite, although I do note a 23% improvement in lnt.MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan and a 7.76% regression in lnt.SingleSource/Benchmarks/Shootout-C++/ackermann)

spec regressions on a Cortex-A57 (T32):

spec.cpu2006.ref.482_sphinx3 	2.31%

and an improvement in:

spec.cpu2000.ref.181_mcf 	-2.13%

I tested this patch on r256132 with test-suite, spec2000,2006, commercial benchmarks. There is no regression on Cortex-A57.

Did you notice these regressions on your hardware? I find it strange to think this could all be accounted to different revisions of this core. I'm testing on this platform: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dto0038a/index.html)

I see some small improvments in a commerical benchmark (1-2%), but it is worth mentioning the SPEC changes, since I wouldn't write those off as no regressions; it's definite regression in SPEC, but whether the other improvements in test-suite balance them in the view of the community is debatable.

I ran test-suite, spec on Galaxy S6 Target(CPU 2.1Ghz, MIF 1.5Ghz, Of course it's developtment board. Not exactly same with Galaxy S6).
I also can test on Juno. But it's very slow and I want to test on real world environment.

In my test, I got different data.

Charlie's test result
spec on a Cortex-A57 (A64):
spec.cpu2000.ref.253_perlbmk 4.71%
spec.cpu2000.ref.256_bzip2 2.07%
spec.cpu2006.ref.433_milc 1.24%
spec.cpu2000.ref.254_gap -1.76%

(I'm focussing on SPEC because it's less noisy than the benchmarks in test-suite, although I do note a 23% improvement in lnt.MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan and a 7.76% regression in lnt.SingleSource/Benchmarks/Shootout-C++/ackermann)

My test result about Spec (Opt/Ori)
spec.cpu2000.ref.253_perlbmk 	-> I can't get result. I think without modifying source code, we can't compile it.
spec.cpu2000.ref.256_bzip2 	99.93% 
spec.cpu2006.ref.433_milc 	99.52%
spec.cpu2000.ref.254_gap 	99.18%

My test result about test-suite (Opt/Ori)
I got totally different result on this.
SingleSource/Benchmarks/Shootout-C++/ackermann	81.35%
MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan -> execution time is 0. It's same.

I think both of data also can be worth. And This data is not meaningful any more. As you know, the patch was completely changed. I will share the result data base on recent patch soon.

PS. Could you share how to build "spec.cpu2000.ref.253_perlbmk", please?

I ran some benchmarks against http://reviews.llvm.org/differential/diff/43578/. (Base Revision: r256315)
Env: Cortex-A57 2.1Ghz, MIF 1.5Ghz, AArch64
Test: Execute each benchmark 4 times and get average.

Test Name	Opt_Exe_Time/Ori_Exe_Time
External/SPEC/CFP2000/177.mesa/177.mesa	100.33%
External/SPEC/CFP2000/179.art/179.art	102.20%
External/SPEC/CFP2000/183.equake/183.equake	99.79%
External/SPEC/CFP2000/188.ammp/188.ammp	99.97%
External/SPEC/CFP2006/433.milc/433.milc	100.05%
External/SPEC/CFP2006/444.namd/444.namd	100.06%
External/SPEC/CFP2006/447.dealII/447.dealII	100.08%
External/SPEC/CFP2006/450.soplex/450.soplex	100.00%
External/SPEC/CFP2006/470.lbm/470.lbm	99.97%
External/SPEC/CINT2000/164.gzip/164.gzip	100.13%
External/SPEC/CINT2000/175.vpr/175.vpr	99.67%
External/SPEC/CINT2000/176.gcc/176.gcc	99.60%
External/SPEC/CINT2000/181.mcf/181.mcf	99.93%
External/SPEC/CINT2000/186.crafty/186.crafty	99.76%
External/SPEC/CINT2000/197.parser/197.parser	100.34%
External/SPEC/CINT2000/254.gap/254.gap	100.34%
External/SPEC/CINT2000/255.vortex/255.vortex	99.61%
External/SPEC/CINT2000/256.bzip2/256.bzip2	99.71%
External/SPEC/CINT2006/400.perlbench/400.perlbench	100.21%
External/SPEC/CINT2006/401.bzip2/401.bzip2	99.87%
External/SPEC/CINT2006/403.gcc/403.gcc	99.74%
External/SPEC/CINT2006/429.mcf/429.mcf	100.25%
External/SPEC/CINT2006/445.gobmk/445.gobmk	100.89%
External/SPEC/CINT2006/456.hmmer/456.hmmer	101.26%
External/SPEC/CINT2006/458.sjeng/458.sjeng	99.88%
External/SPEC/CINT2006/462.libquantum/462.libquantum	99.75%
External/SPEC/CINT2006/471.omnetpp/471.omnetpp	98.04%
External/SPEC/CINT2006/473.astar/473.astar	99.80%
External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk	98.48%
Geomean	99.99%

Other benchmarks in test-suite's Geomean value : 99.73%

On commertial benchmark, we can see the 0.1 ~0.2% improvement of overall.

Junmo.

junbuml added a subscriber: junbuml.Dec 25 2015, 8:25 PM

sanjoy requested changes to this revision.Jan 17 2016, 1:12 PM

sanjoy edited edge metadata.

sanjoy added inline comments.

lib/Analysis/ScalarEvolutionExpander.cpp
1825 ↗	(On Diff #43578)	Nit: do you mean `I's operands`?
1828 ↗	(On Diff #43578)	Nit: LLVM convention is `Depth`.
1833 ↗	(On Diff #43578)	Might want to check `SE.isSCEVAble(I->getType())` before calling `getSCEV` to avoid creating too many `SCEVUnknown` objects unnecessarily.
1836 ↗	(On Diff #43578)	Use a range `for` here.
1838 ↗	(On Diff #43578)	I think it will be cleaner to handle the non-instruction case in the caller (and remove the `I == nullptr` check above), since there is no reason to have the initial caller (first level of recursion) pass in `nullptr` for `I`. I don't think `depth++` is correct -- you'll blow the stack if `I` is a PHI node with itself as its first operand. At the very least, it needs to be `++depth`. Also, since you're calling the counter `depth`, it'd make more sense to pass in `depth + 1` here, and not `++depth`. If you do want to increment `depth` in place, it might be better to call it `Iter`.
1871 ↗	(On Diff #43578)	Why do you care about `RHS` being a constant? I think it is better to have the logic be: Check `LHS` as usual If `RHSV` is an `Instruction` then check `RHS` as well Use `FindSameSCEVInst` to do a deeper search on `LHS` Is there any reason why you don't need to call `FindSameSCEVInst` on `RHS` as well (when it is an instruction)?
test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
53 ↗	(On Diff #43578)	Reading the code for `@test`, I think the metadata is required to tell LLVM that it is okay to generate a divide by `%step` if it thinks that is profitable to do (and so the test case actually checks LLVM's heuristics, not correctness).

This revision now requires changes to proceed.Jan 17 2016, 1:12 PM

flyingforyou marked 5 inline comments as done and an inline comment as not done.Jan 17 2016, 10:40 PM

flyingforyou added inline comments.

lib/Analysis/ScalarEvolutionExpander.cpp
1825 ↗	(On Diff #43578)	Yep.
1838 ↗	(On Diff #43578)	Sure, I will remove` I == nullptr`, add nullptr check when calling this function. Thanks. I will change `++depth` to `Depth + 1`.
1871 ↗	(On Diff #43578)	There is no critical reason that I care about RHS beging a constant. I will modify the logic what you said. It seems more proper way for `findExistingExpansion`. Thanks.
test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
53 ↗	(On Diff #43578)	Thanks. I'll remain the code.

Addressed Sanjoy's comments.

Hi Junmo,

Please find some comments inline.

Thanks,
Michael

lib/Analysis/ScalarEvolutionExpander.cpp
1863–1864 ↗	(On Diff #45155)	I don't understand why we need to break symmetry in this code. Anyway, if we do this, we should either add an assert that the opposite situation never happens, or properly handle both cases. The fact that we've only seen one case is not a justification for not handling the other one. IOW, why can't LHS be Value, and RHS be Instruction?
lib/Transforms/Utils/LoopUnrollRuntime.cpp
346–347 ↗	(On Diff #45155)	This should be in the same patch, otherwise the change will only paper over the issue, and lead to actually worse code in the end. If `expandCodeFor` isn't updated, it'll fail to reuse the existing value, so it'll generate yet another division. The idea of the patch is to teach SCEV to reuse existing expressions, not just hack around the high cost of division. The division should still be considered expensive, so we definitely shouldn't generate redundant ones.
test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
27–28 ↗	(On Diff #45155)	Maybe something like this? Though SCEV for loop tripcount contains division, it shouldn't be considered expensive, since the division already exists in the code and we don't need to expand it once more. Thus, it shouldn't prevent us from unrolling the loop. Also, I think I don't completely understand why we care so much about step being -1. Why is it important?
53 ↗	(On Diff #45155)	This metadata is used as `!range !0` in `@test` and as `!tbaa !0` in `@test2`. Are you sure this is correct?

This revision now requires changes to proceed.Jan 20 2016, 4:17 PM

flyingforyou added inline comments.Jan 20 2016, 9:19 PM

lib/Analysis/ScalarEvolutionExpander.cpp

1863–1864 ↗

(On Diff #45155)

I think LHS can be Value. But what I know is above pattern can cover most of cases.

if (match(BB->getTerminator(),
           m_Br(m_ICmp(Pred, m_Value(LHSV), m_Value(RHSV)), TrueBB,
                FalseBB))) {
  dbgs() << "m_Value(LHSV), m_Value(RHSV) pattern\n";
  if (match(BB->getTerminator(),
             m_Br(m_ICmp(Pred, m_Instruction(LHS), m_Instruction(RHS)), TrueBB,
                  FalseBB)))
    dbgs() << "m_Instruction(LHS), m_Instruction(RHS) pattern\n";
  else if (match(BB->getTerminator(),
             m_Br(m_ICmp(Pred, m_Instruction(LHS), m_Value(RHSV)), TrueBB,
                  FalseBB)))
    dbgs() << "m_Instruction(LHS), m_Value(RHSV) pattern\n";
  else
    dbgs() << "unknown(LHS), unknown(RHS) pattern\n";

}

When I test above code, I can't find "unknown(LHS), unknown(RHS) pattern\n" on building big benchmarks programs including LLVM itself.

m_Value(LHSV), m_Value(RHSV) pattern	50650
m_Instruction(LHS), m_Instruction(RHS) pattern	30178
m_Instruction(LHS), m_Value(RHSV) pattern	20472

So I think it's no necessary to consider LHS is Value now.

Actually, we don't make compare instruction likes cmp constant, reg...

lib/Transforms/Utils/LoopUnrollRuntime.cpp

346–347 ↗

(On Diff #45155)

Michael, I am not an expert of SCEV. I think you might want to get review from Sanjoy.

Sanjoy, How about applying below patch on this?

-    if (SE.getSCEV(LHS) == S && SE.DT.dominates(LHS, At))
+    const SCEV *LHSSE = SE.getSCEV(LHS);
+    if (LHSSE->getType() == S->getType() &&
+        isa<SCEVConstant>(SE.getMinusSCEV(LHSSE, S)) &&
+        SE.DT.dominates(LHS, At))
       return LHS;
 
     if (Instruction *RHS = dyn_cast<Instruction>(RHSV)) {
-      if (SE.getSCEV(RHS) == S && SE.DT.dominates(RHS, At))
+      const SCEV *RHSSE = SE.getSCEV(RHS);
+      if (RHSSE->getType() == S->getType() &&
+          isa<SCEVConstant>(SE.getMinusSCEV(RHSSE, S)) &&
+          SE.DT.dominates(RHS, At))
         return RHS;
     }

test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll

27–28 ↗

(On Diff #45155)

Cool. Michael.
I will modify comments as you said.

-1 indicates that tripcount doesn't need to be divided for calculation when loop unrolling.
Anyway thanks for good sugesstion.

53 ↗

(On Diff #45155)

I don't think that makes errors. But I got your point. Thanks.
I will modify !tbaa !0 to !tbaa !1
and add !1 = !{i64 0}

Is this ok?

Michael, I replied your inline comments. Could you check them, please?

Sanjoy, Michael want to insert some patch. Could you give him a opinion, please?

Junmo.

mzolotukhin added inline comments.Jan 25 2016, 12:09 PM

lib/Analysis/ScalarEvolutionExpander.cpp
1863–1864 ↗	(On Diff #45155)	My concern isn't about us missing some real cases right now, it's more about future maintenance of this code. If later on someone finds such a case, it'll be hard to figure out, why isn't it covered in the first place, when the code was committed.
test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
53 ↗	(On Diff #45155)	This is ok (or you can just remove `!tbaa !1` and `!1 = !{i64 0}` at all). Thanks!

Addressed Michael's comments.
Thanks for good description example.

Waiting Sanjoy's comment.

This is looking close to ready to land now. Some nitpicky comments inline.

lib/Analysis/ScalarEvolutionExpander.cpp
1827 ↗	(On Diff #46105)	How about renaming this to `FindCongruentInst`, or `FindInstWithSameSCEV`. `FindSameSCEVInst` does not sound okay grammatically.
1834 ↗	(On Diff #46105)	I'm worried that we're possibly doing too much work here (`getSCEV` can be expensive). Can we further filter on the type of `I`? For instance, if the width of `I->getType()` and `S->getType()` are different, then we don't need to bother calling `getSCEV` on `I`. Edit: I'm much less worried about doing too much work here, after seeing the guard on `SE.hasOperand` in the caller; but I still think filtering on the type is a good idea.
1865 ↗	(On Diff #46105)	These two identical blocks should definitely be combined into a lambda.
1866 ↗	(On Diff #46105)	I'd "CSE" the two calls to `getSCEV(LHS)` to avoid the extra dictionary lookup.
1870 ↗	(On Diff #46105)	Please don't reuse the `LHS` variable here. Perhaps `auto *ExistingInst = ..`?
1833 ↗	(On Diff #45155)	Minor: I'd add a small comment that you're doing the pre-check on using `isSCEVable` to avoid creating too many `SCEVUnknown` expressions.

This revision now requires changes to proceed.Jan 27 2016, 12:40 PM

mzolotukhin requested changes to this revision.Jan 27 2016, 4:40 PM

mzolotukhin edited edge metadata.

mzolotukhin added inline comments.

lib/Transforms/Utils/LoopUnrollRuntime.cpp

346–347 ↗

(On Diff #46105)

We can wait for Sanjoy's opinion too, but the patch can't be landed without this issue fixed first anyway.

I slightly modified your testcase to expose the issue:

define i32 @test2(i64* %loc, i64 %conv7) {
entry:
  %rem0 = load i64, i64* %loc, align 8
  %ExpensiveComputation = udiv i64 %rem0, 42 ; <<< Extra computations are added to the trip-count expression
  br label %bb1
bb1:
  %div11 = udiv i64 %ExpensiveComputation, %conv7
  %cmp.i38 = icmp ugt i64 %div11, 1
  %div12 = select i1 %cmp.i38, i64 %div11, i64 1
  br label %for.body
for.body:
  %rem1 = phi i64 [ %rem0, %bb1 ], [ %rem2, %for.body ]
  %k1 = phi i64 [ %div12, %bb1 ], [ %dec, %for.body ]
  %mul1 = mul i64 %rem1, 48271
  %rem2 = urem i64 %mul1, 2147483647
  %dec = add i64 %k1, -1
  %cmp = icmp eq i64 %dec, 0
  br i1 %cmp, label %exit, label %for.body
exit:
  %rem3 = phi i64 [ %rem2, %for.body ]
  store i64 %rem3, i64* %loc, align 8
  ret i32 0
}

If we apply the patch and run this test thru opt, the loop will be successfully unrolled, *but* we'll generate the code for "expensive" computations. This is IR after unrolling:

...
entry:
  %rem0 = load i64, i64* %loc, align 8
  %ExpensiveComputation = udiv i64 %rem0, 42
  br label %bb1

bb1:                                              ; preds = %entry
  %div11 = udiv i64 %ExpensiveComputation, %conv7
  %cmp.i38 = icmp ugt i64 %div11, 1
  %div12 = select i1 %cmp.i38, i64 %div11, i64 1
  %0 = udiv i64 %rem0, 42      ; <<<< This is our %ExpensiveComputation, and we just duplicated it!
  %1 = udiv i64 %0, %conv7     ; <<<
  %2 = icmp ugt i64 %1, 1
  %umax = select i1 %2, i64 %1, i64 1
  %3 = add i64 %umax, -1
  %xtraiter = and i64 %umax, 7
  %lcmp.mod = icmp ne i64 %xtraiter, 0
  br i1 %lcmp.mod, label %for.body.prol, label %bb1.split
...

Please note inserted instruction %0 and %1 - they're completely redundant (as we compute the same value, that we have in %div11).

So, as I said, it's not sufficient to just fix findExistingExpransion, we also need to fix SCEVExpander::expand.

sanjoy added inline comments.Jan 27 2016, 5:43 PM

lib/Transforms/Utils/LoopUnrollRuntime.cpp
346–347 ↗	(On Diff #46105)	I think the original idea was that if we know that a certain value is already being computed, then we can emit code to compute it again, and rely on later passes to get rid of the redundancy (so in your example above we'd expect `%0` and `%1` to be CSE'ed with `%ExpensiveComputation` and `%div11`). But I understand your concern that this is a pessimization unless we run some cleanup passes afterwards, and even then we rely on the cleanup passes being smart enough to undo our mess. Perhaps `SCEVExpander::expand` can be fixed in a separate change (since the scope of that fix is bigger than what is addressed here) that goes in before this one?

mzolotukhin added inline comments.Jan 27 2016, 6:00 PM

lib/Transforms/Utils/LoopUnrollRuntime.cpp
346–347 ↗	(On Diff #46105)	My example was pretty simple, just to demonstrate the problem, and in this case indeed a simple clean-up would be sufficient to remove the introduced redundancy. But it's easy to construct a more complicated test, which existing clean-up passes might have problems to deal with, and since SCEV expressions might be very complicated, expanding all of them might be not a good idea, especially taking into account that some of them are actually expensive. As for separating the patches - I'm entirely for it, but I just want to first make sure that there are no fundamental issues that can prevent us from fixing `expand`. My biggest concern is that this change leads to somewhat inconsistent state, where `findExistingExpansion` and `expand` have different assumptions. I'll feel much easier about it once I see the `expand` part of the patch.

flyingforyou marked 6 inline comments as done.Jan 27 2016, 9:47 PM

flyingforyou added inline comments.

lib/Analysis/ScalarEvolutionExpander.cpp
1834 ↗	(On Diff #46105)	Thanks for good sugesstion.
1870 ↗	(On Diff #46105)	Thanks for guiding this. As you said, using lambda could be more clear than before. Great!.

Addressed Sanjoy's comments.

Thanks,
Junmo.

Sanjoy, I am waiting your review. :)

Michael, could you make an example which is more complex that current clean up passes can't deal with?
I checked your previous example's expensive expression is cleaned up.

Michael, could you make an example which is more complex that current clean up passes can't deal with? I checked your previous example's expensive expression is cleaned up.

Providing a concrete example here isn't even that important due to arguments I raised earlier, but here you go:

@a = global i64 1

declare void @foo()

define i32 @test3(i64* %loc, i64 %conv7) {
entry:
  %rem0 = load i64, i64* %loc, align 8
  %ExpensiveComputation = udiv i64 %rem0, 42
  br label %outer.header

outer.header:
  %i = phi i64 [ 1, %entry ], [ %i.next, %outer.latch ]
  br label %bb1

bb1:
  %addr = getelementptr i64, i64* @a, i64 %i
  %x = load volatile i64, i64* %addr
  %div11 = udiv i64 %x, %i
  %cmp.i38 = icmp ugt i64 %div11, 1
  %div12 = select i1 %cmp.i38, i64 %div11, i64 1
  br label %for.body

for.body:
  %rem1 = phi i64 [ %rem0, %bb1 ], [ %rem2, %for.body ]
  %k1 = phi i64 [ %div12, %bb1 ], [ %dec, %for.body ]
  %mul1 = mul i64 %rem1, 48271
  %rem2 = urem i64 %mul1, 2147483647
  %dec = add i64 %k1, -1
  %cmp = icmp eq i64 %dec, 0
  br i1 %cmp, label %outer.latch, label %for.body

outer.latch:
  %i.next = add i64 %i, 1
  %outer.cmp = icmp eq i64 %i.next, 1000
  br i1 %outer.cmp, label %exit, label %outer.header

exit:
  %rem3 = phi i64 [ %rem2, %outer.latch ]
  store i64 %rem3, i64* %loc, align 8
  ret i32 0
}

I ran it with

opt < test.ll -S -unroll-runtime -loop-unroll -early-cse -instcombine -simplifycfg -licm

and always saw a code like this in the output:

%x = load volatile i64, i64* %addr, align 8
%div11 = udiv i64 %x, %i
%cmp.i38 = icmp ugt i64 %div11, 1
%div12 = select i1 %cmp.i38, i64 %div11, i64 1
%1 = udiv i64 %x, %0                           ; <<< Redundant expensive computation
%2 = icmp ugt i64 %1, 1
%umax = select i1 %2, i64 %1, i64 1

Probably, you can come up with another set of passes that will be able to clean this up, but then the example might be made even more complicated, so the new set will fail. The problem here is that with this patch compiler thinks that it generates no redundant code to clean-up, while in fact it's not even limited in what it can generate - we expand *any* existing SCEV expression.

So, I'd suggest you to address the issue with expand before going further with this one.

Michael

This revision now requires changes to proceed.Feb 3 2016, 3:53 PM

Looks like the change I've been asking for function has been committed (see D12090). So, probably we just need to update this patch, and it'll be ok to go in.

Addressed Michael's comment.

It's very good timing.!

Thanks.

mzolotukhin added inline comments.Feb 5 2016, 11:42 AM

lib/Analysis/ScalarEvolutionExpander.cpp
1905 ↗	(On Diff #46993)	I think `FindCongruentInst` isn't needed right now. The logic in `findExistingExpansion` should basically mirror the logic from `expand`: `expand` just looks at the hash map, so we should do the same here. (And that should do the trick in our example here)

Addressed Michael's comment.

Hi Junmo,

It looks better now, couple of comments/questions inline.

Michael

lib/Analysis/ScalarEvolutionExpander.cpp
1865–1870 ↗	(On Diff #47946)	Is there any point from changing `LHS` and `RHS` from `Instruction` to `Value`? Looks like later on we cast it to `Instruction` anyway..
1901–1902 ↗	(On Diff #47946)	Why do we need these conditions? Isn't dominance sufficient? Also, would it make sense to factor out this code block to a separate function (and use it both here and in `expand`)?

flyingforyou marked an inline comment as done.Feb 15 2016, 4:19 PM

flyingforyou added a subscriber: sunfish.

flyingforyou added inline comments.

lib/Analysis/ScalarEvolutionExpander.cpp

1901–1902 ↗

(On Diff #47946)

Attached Wei's comment in r259815

Fix a regression for r259736.

When SCEV expansion tries to reuse an existing value, it is needed to ensure
that using the Value at the InsertPt will not break LCSSA. The fix adds a
check that InsertPt is either inside the candidate Value's parent loop, or
the candidate Value's parent loop is nullptr.

would it make sense to factor out this code block to a separate function (and use it both here and in expand)?

Sure.

Addressed Michael's comments.
Thanks.

Attached Wei's comment in r259815

Fix a regression for r259736.

When SCEV expansion tries to reuse an existing value, it is needed to ensure
that using the Value at the InsertPt will not break LCSSA. The fix adds a
check that InsertPt is either inside the candidate Value's parent loop, or
the candidate Value's parent loop is nullptr.

Oh, I see, makes sense. The patch LGTM, thanks for your patience:)

Michael

include/llvm/Analysis/ScalarEvolutionExpander.h
266 ↗	(On Diff #48035)	Please add a comment for the new function.

Closed by commit rL260938: [SCEVExpander] Make findExistingExpansion smarter (authored by flyingforyou). · Explain WhyFeb 15 2016, 10:51 PM

This revision was automatically updated to reflect the committed changes.

flyingforyou marked an inline comment as done.

Michael, Sanjoy,

Thank you for your patience, so many great suggestions!

It was great time. !

Merged in r260938.

Thanks,
Junmo.

This is an archive of the discontinued LLVM Phabricator instance.

[SCEVExpander] Make findExistingExpansion smarter
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 48045

llvm/trunk/include/llvm/Analysis/ScalarEvolutionExpander.h

llvm/trunk/lib/Analysis/ScalarEvolutionExpander.cpp

llvm/trunk/lib/Transforms/Utils/LoopUnrollRuntime.cpp

llvm/trunk/test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll

This is an archive of the discontinued LLVM Phabricator instance.

[SCEVExpander] Make findExistingExpansion smarterClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 48045

llvm/trunk/include/llvm/Analysis/ScalarEvolutionExpander.h

llvm/trunk/lib/Analysis/ScalarEvolutionExpander.cpp

llvm/trunk/lib/Transforms/Utils/LoopUnrollRuntime.cpp

llvm/trunk/test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll

[SCEVExpander] Make findExistingExpansion smarter
ClosedPublic