This is an archive of the discontinued LLVM Phabricator instance.

This needs a test case (either in test/Transforms/LoopUnroll/AArch64 or in test/Transforms/LoopUnroll). Also, why does this logic belong here and not in Expander.isHighCostExpansion?

I'm OK with this change for AArch64. Like Hal said, please add a test case.

Hi Junmo,

I had a look at A57 in A64 mode and noticed a 4.81% regression in spec2006's hmmer benchmark. Do you have access to this suite?

Thanks Charlie for comment.

I had a look at A57 in A64 mode and noticed a 4.81% regression in spec2006's hmmer benchmark. Do you have access to this suite?

I will figure it out what's wrong. I set up spec2006, recently.

@hfinkel This needs a test case (either in test/Transforms/LoopUnroll/AArch64 or in test/Transforms/LoopUnroll). Also, why does this logic belong here and not in Expander.isHighCostExpansion?

@zzheng I'm OK with this change for AArch64. Like Hal said, please add a test case.

I also think cheking the unroll factor only is not enough for figuring out that TripCountSC is really expensive or not.
So I will upload new logic & test case next week. I hope that could be acceptable.

Junmo.

I've just seen the following regressions for spec2006 on the A57 in T32 mode,

spec.cpu2006.ref.456_hmmer 	13.34%
spec.cpu2006.ref.471_omnetpp 	7.87%
spec.cpu2006.ref.458_sjeng 	3.72%
spec.cpu2006.ref.429_mcf 	3.48%
spec.cpu2006.ref.462_libquantum 	3.47%
spec.cpu2006.ref.482_sphinx3 	2.27%
spec.cpu2006.ref.445_gobmk 	1.90%
spec.cpu2006.ref.453_povray 	1.61%
spec.cpu2006.ref.447_dealII 	1.10%

hmmer is clearly a good test case here, must be some register pressure issues.

I forgot to mention that I did see the following improvements in A57-T32 for spec2006:

spec.cpu2006.ref.470_lbm 	-10.92%
spec.cpu2006.ref.473_astar 	-9.53%
spec.cpu2006.ref.444_namd 	-1.40%
spec.cpu2006.ref.401_bzip2 	-1.06%

But this is an overall regression on spec2006.

Current algorithm doesn't care about where does trip count's division come from.
That means, if TripCount is divided by some value in loop's preheader, compiler will give up doing unrolling. (Even if IV's step is one or minus one.)

If IV's step is constant likes one or minus one or multiple of 2, we don't need to generate division for computing trip count.

I tested this patch on r256132 with test-suite, spec2000,2006, commercial benchmarks. There is no regression on Cortex-A57.

I hope that this patch is acceptable for everybody.

Junmo.

mzolotukhin added a subscriber: mzolotukhin.Dec 21 2015, 4:12 PM

mzolotukhin added inline comments.

lib/Transforms/Utils/LoopUnrollRuntime.cpp
327–333	There is a method `getStepRecurrence` in `AR`, which you might want to use here. Also, I don't think that changing `AllowExpensiveTripCount` here is a good way to achieve what you try to achieve. Why (and where) is this tripcount considered expensive? To me it looks like we should rather fix those places to recognize cases like this as 'inexpensive'.

flyingforyou added inline comments.Dec 21 2015, 4:52 PM

lib/Transforms/Utils/LoopUnrollRuntime.cpp
327–333	There is a method getStepRecurrence in AR, which you might want to use here. Thanks, I will change this later. Why (and where) is this tripcount considered expensive? When we run test case "high-cost-trip-count-computation.ll", we can get expression of Tripcount SCEV likes below. test1 func: (1 + ((-1 + (-1 * %v12) + (8193 smax (%v12 + %step))) /u %step)) test2 func: (1 umax ((23 + %conv7) /u %conv7)) test1's division is generated for computing tripcount. Because %step is unknown. This case, we consider this is expensive. We need to generate new division for computing tripcount. test2's division is generated by original code which is in loop pre header(entry). In this case, isHighCostExpansion can't figure it out this is for computing trip count or it came from original code. And isHighCostExpansion is also used by IndVarSimplify. It's not only for LoopUnrollRuntime. If you still think changing AllowExpensiveTripCount is not good way, please let me know your idea. Junmo.

mzolotukhin added inline comments.Dec 21 2015, 5:40 PM

lib/Transforms/Utils/LoopUnrollRuntime.cpp
327–333	Ok, I think I understood the issue. So, in `isHighCostExpansionHelper` routine there is a check if (At && findExistingExpansion(S, At, L) != nullptr) return false; Ideally, it should find the existing computation and immediately report that the expansion isn't expensive. In practice, this function doesn't always work as we'd like it to work (see e.g. https://llvm.org/bugs/show_bug.cgi?id=24920), but I'd suggest fixing it instead of working around in other places.
test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
45–49	It's definitely possible to simplify this testcase by removing lots of unneeded code from this loop. Please keep only parts that are needed to expose the issue.

flyingforyou added inline comments.Dec 21 2015, 6:27 PM

lib/Transforms/Utils/LoopUnrollRuntime.cpp
327–333	Thanks for sharing bug about "isHighCostExpansionHelper" function. LoopUnrollRuntime use isHighCostExpansion function without "At". This means it only want to check TripCountSC has expensive expression or not. So, checking IV's step may not be working around for LoopUnrollRuntime. Junmo.
test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
45–49	Yes. You're right. I will remove unnecessary codes. Thanks.

mzolotukhin added inline comments.Dec 21 2015, 6:58 PM

lib/Transforms/Utils/LoopUnrollRuntime.cpp
327–333	Yes, we're not passing `At` right now, but we can if we need to. If we do need to expand trip count, it'll be inserted into the loop preheader (look for `PreHeaderBR` in the code further down), so we actually know `At`. And, my main concern is that I don't believe that statement "step is 1/-1" is equivalent to statement "trip count expression is cheap to expand". At least, it's not obvious and requires some proof/explanation (i.e. comments in the code).

Addressed Michael's comments.

Thanks.

I'd rather have this sort of logic live in SCEVExpander::isHighCostExpansion than be scattered throughout the codebase. I think the "increment by one" case is covered by findExistingExpansion already; and the "decrement by one" case can be added there as well by teaching it to look at the initial values for the induction phi nodes.

This revision now requires changes to proceed.Dec 22 2015, 1:12 AM

Thanks for comment. Sanjoy.

test2's TripCountSC is "(1 umax ((23 + %conv7) /u %conv7))".

How can we handle this in SCEVExpander::isHighCostExpansion?

Please let me know.

Junmo.

Addressed Sanjoy's comment.

Your comment was really helpful for me. I am still not sure that this patch is ok or not.
But comparing SCEV expression between TripCountSC and exit value makes this patch more robust.

Thanks.

sanjoy requested changes to this revision.Dec 22 2015, 11:40 PM

sanjoy edited edge metadata.

sanjoy added inline comments.

lib/Analysis/ScalarEvolutionExpander.cpp
1828	A better name for this function would be `IsContainedIn`. And this should use `SCEVTraversal`.
1854	I don't see why you need to `match` twice? You should just be able to do match(BB->getTerminator(), m_Br(m_ICmp(Pred, m_Instruction(LHS), m_Value(RHS))) with `Value *RHS`. Then you can make your later code conditional on `isa<Instruction>(RHS)`.
1884	I don't understand this part -- how do you conclude that `LHS` is a correct expansion for `S`? E.g. what if `getSCEV(LHS)` is `{S,+,1}`? Then clearly the expansion of `S` isn't `LHS`. I think you need to do a (recursive) search of the IR level operands of `LHS` and `RHS` to see if any of them are congruent to `S`.
lib/Transforms/Utils/LoopUnrollRuntime.cpp
318–319	I don't think this is the right value for `At`. `At` should be the location you want to expand the SCEV expression at, which in this case is `PreHeaderBR`.

This revision now requires changes to proceed.Dec 22 2015, 11:40 PM

Addressed Sanjoy's comment.

Following your suggestion, I realized that checing IV's step is not necessary any more.

I really appreciate your review.

Herald added a subscriber: sanjoy. · View Herald TranscriptDec 23 2015, 3:16 AM

mzolotukhin requested changes to this revision.Dec 23 2015, 10:15 AM

mzolotukhin added a reviewer: mzolotukhin.

mzolotukhin added inline comments.

lib/Analysis/ScalarEvolutionExpander.cpp
1841	Shouldn't it be `SCEVTraversal<SCEVSearch>`?
1846–1847	What are `LHS` and `RHS` here?
1858–1860	This loop will always end on the first operand. Should it be if (Op) return Op; ?
1885–1886	You could do if (RHS = dyn_cast<Instruction>(RHSV)) { ... Also, I wonder if we should check for the case where LHS is a ConstantInt and RHS is the instruction.
lib/Transforms/Utils/LoopUnrollRuntime.cpp
318–319	Could we avoid changing anything unless we're confident we're going to unroll? E.g. we can split `PH`--->`Header` edge, then split `PEnd` basic block, and then we can decide to bail out.
346–347	Please make sure that expander also knows how to reuse existing values. From my last investigation of PR24920 I remember that fixing `findExistingExpansion` alone isn't enough, as we generate expensive trip count anyway, ignoring the fact that we can reuse existing value for it. Also, you might take a look at my patch for fixing this bug, which I posted in D12765 (see 6th comment from the end).
test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
27–28	I think this description is a bit misleading. We can unroll not because IV's step is minus one, but because tripcount is known to be divisible by `%conv7` (which will be unroll factor, right?), and we don't need to generate a new expression for the new tripcount. Is this correct?
53	Nitpick: this metadata is probably unnecessary.

This revision now requires changes to proceed.Dec 23 2015, 10:15 AM

Addressed Michael's comments.

flyingforyou marked an inline comment as done.Dec 23 2015, 6:43 PM

flyingforyou added inline comments.

lib/Analysis/ScalarEvolutionExpander.cpp
1841	I think current code is ok. But there is same code in ScalarEvolution class. So I remove this function. Thanks.
1858–1860	Oh...Thanks.
1885–1886	Also, I wonder if we should check for the case where LHS is a ConstantInt and RHS is the instruction. I am not sure. Isn't it optional??
lib/Transforms/Utils/LoopUnrollRuntime.cpp
318–319	Yes. You're right.
346–347	I think this is another issue about `findExistingExpansion` . I am not sure that I can fix the issue in this patch.

flyingforyou added inline comments.Dec 23 2015, 6:43 PM

test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
27–28	It will not be unroll factor. I just want to make example that trip count can be divided by original codes. Not belong to IV's step is unknown likes first example. The main point of first example(test) is that if IV's step is unknown, we should make division for getting tripcount for unrolled loop. we don't need to generate a new expression for the new tripcount. Is this correct? This is correct. If you have better explanation about this example, please let me know.
53	This is original code. I'm not sure it's ok to remove. Sanjoy, Can I remove this?

In D15559#315149, @flyingforyou wrote:

Current algorithm doesn't care about where does trip count's division come from.
That means, if TripCount is divided by some value in loop's preheader, compiler will give up doing unrolling. (Even if IV's step is one or minus one.)

If IV's step is constant likes one or minus one or multiple of 2, we don't need to generate division for computing trip count.

I ran some benchmarks against http://reviews.llvm.org/differential/diff/43408/

spec regressions on a Cortex-A57 (A64):

spec.cpu2000.ref.253_perlbmk 	4.71%
spec.cpu2000.ref.256_bzip2 	2.07%
spec.cpu2006.ref.433_milc 	1.24%

There's a small improvement in 254_gap

spec.cpu2000.ref.254_gap 	-1.76%

(I'm focussing on SPEC because it's less noisy than the benchmarks in test-suite, although I do note a 23% improvement in lnt.MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan and a 7.76% regression in lnt.SingleSource/Benchmarks/Shootout-C++/ackermann)

spec regressions on a Cortex-A57 (T32):

spec.cpu2006.ref.482_sphinx3 	2.31%

and an improvement in:

spec.cpu2000.ref.181_mcf 	-2.13%

I tested this patch on r256132 with test-suite, spec2000,2006, commercial benchmarks. There is no regression on Cortex-A57.

Did you notice these regressions on your hardware? I find it strange to think this could all be accounted to different revisions of this core. I'm testing on this platform: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dto0038a/index.html)

I see some small improvments in a commerical benchmark (1-2%), but it is worth mentioning the SPEC changes, since I wouldn't write those off as no regressions; it's definite regression in SPEC, but whether the other improvements in test-suite balance them in the view of the community is debatable.

I ran test-suite, spec on Galaxy S6 Target(CPU 2.1Ghz, MIF 1.5Ghz, Of course it's developtment board. Not exactly same with Galaxy S6).
I also can test on Juno. But it's very slow and I want to test on real world environment.

In my test, I got different data.

Charlie's test result
spec on a Cortex-A57 (A64):
spec.cpu2000.ref.253_perlbmk 4.71%
spec.cpu2000.ref.256_bzip2 2.07%
spec.cpu2006.ref.433_milc 1.24%
spec.cpu2000.ref.254_gap -1.76%

(I'm focussing on SPEC because it's less noisy than the benchmarks in test-suite, although I do note a 23% improvement in lnt.MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan and a 7.76% regression in lnt.SingleSource/Benchmarks/Shootout-C++/ackermann)

My test result about Spec (Opt/Ori)
spec.cpu2000.ref.253_perlbmk 	-> I can't get result. I think without modifying source code, we can't compile it.
spec.cpu2000.ref.256_bzip2 	99.93% 
spec.cpu2006.ref.433_milc 	99.52%
spec.cpu2000.ref.254_gap 	99.18%

My test result about test-suite (Opt/Ori)
I got totally different result on this.
SingleSource/Benchmarks/Shootout-C++/ackermann	81.35%
MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan -> execution time is 0. It's same.

I think both of data also can be worth. And This data is not meaningful any more. As you know, the patch was completely changed. I will share the result data base on recent patch soon.

PS. Could you share how to build "spec.cpu2000.ref.253_perlbmk", please?

I ran some benchmarks against http://reviews.llvm.org/differential/diff/43578/. (Base Revision: r256315)
Env: Cortex-A57 2.1Ghz, MIF 1.5Ghz, AArch64
Test: Execute each benchmark 4 times and get average.

Test Name	Opt_Exe_Time/Ori_Exe_Time
External/SPEC/CFP2000/177.mesa/177.mesa	100.33%
External/SPEC/CFP2000/179.art/179.art	102.20%
External/SPEC/CFP2000/183.equake/183.equake	99.79%
External/SPEC/CFP2000/188.ammp/188.ammp	99.97%
External/SPEC/CFP2006/433.milc/433.milc	100.05%
External/SPEC/CFP2006/444.namd/444.namd	100.06%
External/SPEC/CFP2006/447.dealII/447.dealII	100.08%
External/SPEC/CFP2006/450.soplex/450.soplex	100.00%
External/SPEC/CFP2006/470.lbm/470.lbm	99.97%
External/SPEC/CINT2000/164.gzip/164.gzip	100.13%
External/SPEC/CINT2000/175.vpr/175.vpr	99.67%
External/SPEC/CINT2000/176.gcc/176.gcc	99.60%
External/SPEC/CINT2000/181.mcf/181.mcf	99.93%
External/SPEC/CINT2000/186.crafty/186.crafty	99.76%
External/SPEC/CINT2000/197.parser/197.parser	100.34%
External/SPEC/CINT2000/254.gap/254.gap	100.34%
External/SPEC/CINT2000/255.vortex/255.vortex	99.61%
External/SPEC/CINT2000/256.bzip2/256.bzip2	99.71%
External/SPEC/CINT2006/400.perlbench/400.perlbench	100.21%
External/SPEC/CINT2006/401.bzip2/401.bzip2	99.87%
External/SPEC/CINT2006/403.gcc/403.gcc	99.74%
External/SPEC/CINT2006/429.mcf/429.mcf	100.25%
External/SPEC/CINT2006/445.gobmk/445.gobmk	100.89%
External/SPEC/CINT2006/456.hmmer/456.hmmer	101.26%
External/SPEC/CINT2006/458.sjeng/458.sjeng	99.88%
External/SPEC/CINT2006/462.libquantum/462.libquantum	99.75%
External/SPEC/CINT2006/471.omnetpp/471.omnetpp	98.04%
External/SPEC/CINT2006/473.astar/473.astar	99.80%
External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk	98.48%
Geomean	99.99%

Other benchmarks in test-suite's Geomean value : 99.73%

On commertial benchmark, we can see the 0.1 ~0.2% improvement of overall.

Junmo.

junbuml added a subscriber: junbuml.Dec 25 2015, 8:25 PM

sanjoy requested changes to this revision.Jan 17 2016, 1:12 PM

sanjoy edited edge metadata.

sanjoy added inline comments.

lib/Analysis/ScalarEvolutionExpander.cpp
1825	Nit: do you mean `I's operands`?
1828	Nit: LLVM convention is `Depth`.
1833	Might want to check `SE.isSCEVAble(I->getType())` before calling `getSCEV` to avoid creating too many `SCEVUnknown` objects unnecessarily.
1836	Use a range `for` here.
1838	I think it will be cleaner to handle the non-instruction case in the caller (and remove the `I == nullptr` check above), since there is no reason to have the initial caller (first level of recursion) pass in `nullptr` for `I`. I don't think `depth++` is correct -- you'll blow the stack if `I` is a PHI node with itself as its first operand. At the very least, it needs to be `++depth`. Also, since you're calling the counter `depth`, it'd make more sense to pass in `depth + 1` here, and not `++depth`. If you do want to increment `depth` in place, it might be better to call it `Iter`.
1871	Why do you care about `RHS` being a constant? I think it is better to have the logic be: Check `LHS` as usual If `RHSV` is an `Instruction` then check `RHS` as well Use `FindSameSCEVInst` to do a deeper search on `LHS` Is there any reason why you don't need to call `FindSameSCEVInst` on `RHS` as well (when it is an instruction)?
test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
53	Reading the code for `@test`, I think the metadata is required to tell LLVM that it is okay to generate a divide by `%step` if it thinks that is profitable to do (and so the test case actually checks LLVM's heuristics, not correctness).

This revision now requires changes to proceed.Jan 17 2016, 1:12 PM

flyingforyou marked 5 inline comments as done and an inline comment as not done.Jan 17 2016, 10:40 PM

flyingforyou added inline comments.

lib/Analysis/ScalarEvolutionExpander.cpp
1825	Yep.
1838	Sure, I will remove` I == nullptr`, add nullptr check when calling this function. Thanks. I will change `++depth` to `Depth + 1`.
1871	There is no critical reason that I care about RHS beging a constant. I will modify the logic what you said. It seems more proper way for `findExistingExpansion`. Thanks.
test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
53	Thanks. I'll remain the code.

Addressed Sanjoy's comments.

Hi Junmo,

Please find some comments inline.

Thanks,
Michael

lib/Analysis/ScalarEvolutionExpander.cpp
1861–1863	I don't understand why we need to break symmetry in this code. Anyway, if we do this, we should either add an assert that the opposite situation never happens, or properly handle both cases. The fact that we've only seen one case is not a justification for not handling the other one. IOW, why can't LHS be Value, and RHS be Instruction?
lib/Transforms/Utils/LoopUnrollRuntime.cpp
346–347	This should be in the same patch, otherwise the change will only paper over the issue, and lead to actually worse code in the end. If `expandCodeFor` isn't updated, it'll fail to reuse the existing value, so it'll generate yet another division. The idea of the patch is to teach SCEV to reuse existing expressions, not just hack around the high cost of division. The division should still be considered expensive, so we definitely shouldn't generate redundant ones.
test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
27–28	Maybe something like this? Though SCEV for loop tripcount contains division, it shouldn't be considered expensive, since the division already exists in the code and we don't need to expand it once more. Thus, it shouldn't prevent us from unrolling the loop. Also, I think I don't completely understand why we care so much about step being -1. Why is it important?
53	This metadata is used as `!range !0` in `@test` and as `!tbaa !0` in `@test2`. Are you sure this is correct?

This revision now requires changes to proceed.Jan 20 2016, 4:17 PM

flyingforyou added inline comments.Jan 20 2016, 9:19 PM

lib/Analysis/ScalarEvolutionExpander.cpp

1861–1863

I think LHS can be Value. But what I know is above pattern can cover most of cases.

if (match(BB->getTerminator(),
           m_Br(m_ICmp(Pred, m_Value(LHSV), m_Value(RHSV)), TrueBB,
                FalseBB))) {
  dbgs() << "m_Value(LHSV), m_Value(RHSV) pattern\n";
  if (match(BB->getTerminator(),
             m_Br(m_ICmp(Pred, m_Instruction(LHS), m_Instruction(RHS)), TrueBB,
                  FalseBB)))
    dbgs() << "m_Instruction(LHS), m_Instruction(RHS) pattern\n";
  else if (match(BB->getTerminator(),
             m_Br(m_ICmp(Pred, m_Instruction(LHS), m_Value(RHSV)), TrueBB,
                  FalseBB)))
    dbgs() << "m_Instruction(LHS), m_Value(RHSV) pattern\n";
  else
    dbgs() << "unknown(LHS), unknown(RHS) pattern\n";

}

When I test above code, I can't find "unknown(LHS), unknown(RHS) pattern\n" on building big benchmarks programs including LLVM itself.

m_Value(LHSV), m_Value(RHSV) pattern	50650
m_Instruction(LHS), m_Instruction(RHS) pattern	30178
m_Instruction(LHS), m_Value(RHSV) pattern	20472

So I think it's no necessary to consider LHS is Value now.

Actually, we don't make compare instruction likes cmp constant, reg...

lib/Transforms/Utils/LoopUnrollRuntime.cpp

346–347

Michael, I am not an expert of SCEV. I think you might want to get review from Sanjoy.

Sanjoy, How about applying below patch on this?

-    if (SE.getSCEV(LHS) == S && SE.DT.dominates(LHS, At))
+    const SCEV *LHSSE = SE.getSCEV(LHS);
+    if (LHSSE->getType() == S->getType() &&
+        isa<SCEVConstant>(SE.getMinusSCEV(LHSSE, S)) &&
+        SE.DT.dominates(LHS, At))
       return LHS;
 
     if (Instruction *RHS = dyn_cast<Instruction>(RHSV)) {
-      if (SE.getSCEV(RHS) == S && SE.DT.dominates(RHS, At))
+      const SCEV *RHSSE = SE.getSCEV(RHS);
+      if (RHSSE->getType() == S->getType() &&
+          isa<SCEVConstant>(SE.getMinusSCEV(RHSSE, S)) &&
+          SE.DT.dominates(RHS, At))
         return RHS;
     }

test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll

27–28

Cool. Michael.
I will modify comments as you said.

-1 indicates that tripcount doesn't need to be divided for calculation when loop unrolling.
Anyway thanks for good sugesstion.

I don't think that makes errors. But I got your point. Thanks.
I will modify !tbaa !0 to !tbaa !1
and add !1 = !{i64 0}

Is this ok?

Michael, I replied your inline comments. Could you check them, please?

Sanjoy, Michael want to insert some patch. Could you give him a opinion, please?

Junmo.

mzolotukhin added inline comments.Jan 25 2016, 12:09 PM

lib/Analysis/ScalarEvolutionExpander.cpp
1861–1863	My concern isn't about us missing some real cases right now, it's more about future maintenance of this code. If later on someone finds such a case, it'll be hard to figure out, why isn't it covered in the first place, when the code was committed.
test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll
53	This is ok (or you can just remove `!tbaa !1` and `!1 = !{i64 0}` at all). Thanks!

Addressed Michael's comments.
Thanks for good description example.

Waiting Sanjoy's comment.

This is looking close to ready to land now. Some nitpicky comments inline.

lib/Analysis/ScalarEvolutionExpander.cpp
1827	How about renaming this to `FindCongruentInst`, or `FindInstWithSameSCEV`. `FindSameSCEVInst` does not sound okay grammatically.
1833	Minor: I'd add a small comment that you're doing the pre-check on using `isSCEVable` to avoid creating too many `SCEVUnknown` expressions.
1834	I'm worried that we're possibly doing too much work here (`getSCEV` can be expensive). Can we further filter on the type of `I`? For instance, if the width of `I->getType()` and `S->getType()` are different, then we don't need to bother calling `getSCEV` on `I`. Edit: I'm much less worried about doing too much work here, after seeing the guard on `SE.hasOperand` in the caller; but I still think filtering on the type is a good idea.
1864–1865	These two identical blocks should definitely be combined into a lambda.
1865	I'd "CSE" the two calls to `getSCEV(LHS)` to avoid the extra dictionary lookup.
1879	Please don't reuse the `LHS` variable here. Perhaps `auto *ExistingInst = ..`?

This revision now requires changes to proceed.Jan 27 2016, 12:40 PM

mzolotukhin requested changes to this revision.Jan 27 2016, 4:40 PM

mzolotukhin edited edge metadata.

mzolotukhin added inline comments.

lib/Transforms/Utils/LoopUnrollRuntime.cpp

346–347

We can wait for Sanjoy's opinion too, but the patch can't be landed without this issue fixed first anyway.

I slightly modified your testcase to expose the issue:

define i32 @test2(i64* %loc, i64 %conv7) {
entry:
  %rem0 = load i64, i64* %loc, align 8
  %ExpensiveComputation = udiv i64 %rem0, 42 ; <<< Extra computations are added to the trip-count expression
  br label %bb1
bb1:
  %div11 = udiv i64 %ExpensiveComputation, %conv7
  %cmp.i38 = icmp ugt i64 %div11, 1
  %div12 = select i1 %cmp.i38, i64 %div11, i64 1
  br label %for.body
for.body:
  %rem1 = phi i64 [ %rem0, %bb1 ], [ %rem2, %for.body ]
  %k1 = phi i64 [ %div12, %bb1 ], [ %dec, %for.body ]
  %mul1 = mul i64 %rem1, 48271
  %rem2 = urem i64 %mul1, 2147483647
  %dec = add i64 %k1, -1
  %cmp = icmp eq i64 %dec, 0
  br i1 %cmp, label %exit, label %for.body
exit:
  %rem3 = phi i64 [ %rem2, %for.body ]
  store i64 %rem3, i64* %loc, align 8
  ret i32 0
}

If we apply the patch and run this test thru opt, the loop will be successfully unrolled, *but* we'll generate the code for "expensive" computations. This is IR after unrolling:

...
entry:
  %rem0 = load i64, i64* %loc, align 8
  %ExpensiveComputation = udiv i64 %rem0, 42
  br label %bb1

bb1:                                              ; preds = %entry
  %div11 = udiv i64 %ExpensiveComputation, %conv7
  %cmp.i38 = icmp ugt i64 %div11, 1
  %div12 = select i1 %cmp.i38, i64 %div11, i64 1
  %0 = udiv i64 %rem0, 42      ; <<<< This is our %ExpensiveComputation, and we just duplicated it!
  %1 = udiv i64 %0, %conv7     ; <<<
  %2 = icmp ugt i64 %1, 1
  %umax = select i1 %2, i64 %1, i64 1
  %3 = add i64 %umax, -1
  %xtraiter = and i64 %umax, 7
  %lcmp.mod = icmp ne i64 %xtraiter, 0
  br i1 %lcmp.mod, label %for.body.prol, label %bb1.split
...

Please note inserted instruction %0 and %1 - they're completely redundant (as we compute the same value, that we have in %div11).

So, as I said, it's not sufficient to just fix findExistingExpransion, we also need to fix SCEVExpander::expand.

sanjoy added inline comments.Jan 27 2016, 5:43 PM

lib/Transforms/Utils/LoopUnrollRuntime.cpp
346–347	I think the original idea was that if we know that a certain value is already being computed, then we can emit code to compute it again, and rely on later passes to get rid of the redundancy (so in your example above we'd expect `%0` and `%1` to be CSE'ed with `%ExpensiveComputation` and `%div11`). But I understand your concern that this is a pessimization unless we run some cleanup passes afterwards, and even then we rely on the cleanup passes being smart enough to undo our mess. Perhaps `SCEVExpander::expand` can be fixed in a separate change (since the scope of that fix is bigger than what is addressed here) that goes in before this one?

mzolotukhin added inline comments.Jan 27 2016, 6:00 PM

lib/Transforms/Utils/LoopUnrollRuntime.cpp
346–347	My example was pretty simple, just to demonstrate the problem, and in this case indeed a simple clean-up would be sufficient to remove the introduced redundancy. But it's easy to construct a more complicated test, which existing clean-up passes might have problems to deal with, and since SCEV expressions might be very complicated, expanding all of them might be not a good idea, especially taking into account that some of them are actually expensive. As for separating the patches - I'm entirely for it, but I just want to first make sure that there are no fundamental issues that can prevent us from fixing `expand`. My biggest concern is that this change leads to somewhat inconsistent state, where `findExistingExpansion` and `expand` have different assumptions. I'll feel much easier about it once I see the `expand` part of the patch.

flyingforyou marked 6 inline comments as done.Jan 27 2016, 9:47 PM

flyingforyou added inline comments.

lib/Analysis/ScalarEvolutionExpander.cpp
1834	Thanks for good sugesstion.
1879	Thanks for guiding this. As you said, using lambda could be more clear than before. Great!.

Addressed Sanjoy's comments.

Thanks,
Junmo.

Sanjoy, I am waiting your review. :)

Michael, could you make an example which is more complex that current clean up passes can't deal with?
I checked your previous example's expensive expression is cleaned up.

Michael, could you make an example which is more complex that current clean up passes can't deal with? I checked your previous example's expensive expression is cleaned up.

Providing a concrete example here isn't even that important due to arguments I raised earlier, but here you go:

@a = global i64 1

declare void @foo()

define i32 @test3(i64* %loc, i64 %conv7) {
entry:
  %rem0 = load i64, i64* %loc, align 8
  %ExpensiveComputation = udiv i64 %rem0, 42
  br label %outer.header

outer.header:
  %i = phi i64 [ 1, %entry ], [ %i.next, %outer.latch ]
  br label %bb1

bb1:
  %addr = getelementptr i64, i64* @a, i64 %i
  %x = load volatile i64, i64* %addr
  %div11 = udiv i64 %x, %i
  %cmp.i38 = icmp ugt i64 %div11, 1
  %div12 = select i1 %cmp.i38, i64 %div11, i64 1
  br label %for.body

for.body:
  %rem1 = phi i64 [ %rem0, %bb1 ], [ %rem2, %for.body ]
  %k1 = phi i64 [ %div12, %bb1 ], [ %dec, %for.body ]
  %mul1 = mul i64 %rem1, 48271
  %rem2 = urem i64 %mul1, 2147483647
  %dec = add i64 %k1, -1
  %cmp = icmp eq i64 %dec, 0
  br i1 %cmp, label %outer.latch, label %for.body

outer.latch:
  %i.next = add i64 %i, 1
  %outer.cmp = icmp eq i64 %i.next, 1000
  br i1 %outer.cmp, label %exit, label %outer.header

exit:
  %rem3 = phi i64 [ %rem2, %outer.latch ]
  store i64 %rem3, i64* %loc, align 8
  ret i32 0
}

I ran it with

opt < test.ll -S -unroll-runtime -loop-unroll -early-cse -instcombine -simplifycfg -licm

and always saw a code like this in the output:

%x = load volatile i64, i64* %addr, align 8
%div11 = udiv i64 %x, %i
%cmp.i38 = icmp ugt i64 %div11, 1
%div12 = select i1 %cmp.i38, i64 %div11, i64 1
%1 = udiv i64 %x, %0                           ; <<< Redundant expensive computation
%2 = icmp ugt i64 %1, 1
%umax = select i1 %2, i64 %1, i64 1

Probably, you can come up with another set of passes that will be able to clean this up, but then the example might be made even more complicated, so the new set will fail. The problem here is that with this patch compiler thinks that it generates no redundant code to clean-up, while in fact it's not even limited in what it can generate - we expand *any* existing SCEV expression.

So, I'd suggest you to address the issue with expand before going further with this one.

Michael

This revision now requires changes to proceed.Feb 3 2016, 3:53 PM

Looks like the change I've been asking for function has been committed (see D12090). So, probably we just need to update this patch, and it'll be ok to go in.

Addressed Michael's comment.

It's very good timing.!

Thanks.

mzolotukhin added inline comments.Feb 5 2016, 11:42 AM

lib/Analysis/ScalarEvolutionExpander.cpp
1870	I think `FindCongruentInst` isn't needed right now. The logic in `findExistingExpansion` should basically mirror the logic from `expand`: `expand` just looks at the hash map, so we should do the same here. (And that should do the trick in our example here)

Addressed Michael's comment.

Hi Junmo,

It looks better now, couple of comments/questions inline.

Michael

lib/Analysis/ScalarEvolutionExpander.cpp
1856–1863	Is there any point from changing `LHS` and `RHS` from `Instruction` to `Value`? Looks like later on we cast it to `Instruction` anyway..
1884–1885	Why do we need these conditions? Isn't dominance sufficient? Also, would it make sense to factor out this code block to a separate function (and use it both here and in `expand`)?

flyingforyou marked an inline comment as done.Feb 15 2016, 4:19 PM

flyingforyou added a subscriber: sunfish.

flyingforyou added inline comments.

lib/Analysis/ScalarEvolutionExpander.cpp

1884–1885

Attached Wei's comment in r259815

Fix a regression for r259736.

When SCEV expansion tries to reuse an existing value, it is needed to ensure
that using the Value at the InsertPt will not break LCSSA. The fix adds a
check that InsertPt is either inside the candidate Value's parent loop, or
the candidate Value's parent loop is nullptr.

would it make sense to factor out this code block to a separate function (and use it both here and in expand)?

Sure.

Addressed Michael's comments.
Thanks.

Attached Wei's comment in r259815

Fix a regression for r259736.

When SCEV expansion tries to reuse an existing value, it is needed to ensure
that using the Value at the InsertPt will not break LCSSA. The fix adds a
check that InsertPt is either inside the candidate Value's parent loop, or
the candidate Value's parent loop is nullptr.

Oh, I see, makes sense. The patch LGTM, thanks for your patience:)

Michael

include/llvm/Analysis/ScalarEvolutionExpander.h
266 ↗	(On Diff #48035)	Please add a comment for the new function.

Closed by commit rL260938: [SCEVExpander] Make findExistingExpansion smarter (authored by flyingforyou). · Explain WhyFeb 15 2016, 10:51 PM

This revision was automatically updated to reflect the committed changes.

flyingforyou marked an inline comment as done.

Michael, Sanjoy,

Thank you for your patience, so many great suggestions!

It was great time. !

Merged in r260938.

Thanks,
Junmo.

Revision Contents

Path

Size

lib/

Analysis/

ScalarEvolutionExpander.cpp

46 lines

Transforms/

Utils/

LoopUnrollRuntime.cpp

8 lines

test/

Transforms/

LoopUnroll/

high-cost-trip-count-computation.ll

26 lines

Diff 43578

lib/Analysis/ScalarEvolutionExpander.cpp

Show First 20 Lines • Show All 1,816 Lines • ▼ Show 20 Lines if (OrigPhiRef->getType() != Phi->getType()) {

NewIV = Builder.CreateTruncOrBitCast(OrigPhiRef, Phi->getType(), IVName); NewIV = Builder.CreateTruncOrBitCast(OrigPhiRef, Phi->getType(), IVName);

} }

Phi->replaceAllUsesWith(NewIV); Phi->replaceAllUsesWith(NewIV);

DeadInsts.emplace_back(Phi); DeadInsts.emplace_back(Phi);

} }

return NumElim; return NumElim;

} }

// Recursive search of the IR's operands to see if any of them are congruent to

sanjoyUnsubmitted

Done

Nit: do you mean I's operands?

sanjoy: Nit: do you mean `I's operands`?

flyingforyouAuthorUnsubmitted

Not Done

Yep.

flyingforyou: Yep.

// S. This function is called by findExistingExpansion.

static Instruction *FindSameSCEVInst(ScalarEvolution &SE, const SCEV *S,

sanjoyUnsubmitted

Done

How about renaming this to FindCongruentInst, or FindInstWithSameSCEV. FindSameSCEVInst does not sound okay grammatically.

sanjoy: How about renaming this to `FindCongruentInst`, or `FindInstWithSameSCEV`. `FindSameSCEVInst`…

Instruction *I, unsigned depth) {

sanjoyUnsubmitted

Not Done

A better name for this function would be IsContainedIn. And this should use SCEVTraversal.

sanjoy: A better name for this function would be `IsContainedIn`. And this should use `SCEVTraversal`.

sanjoyUnsubmitted

Done

Nit: LLVM convention is Depth.

sanjoy: Nit: LLVM convention is `Depth`.

// Limit our recursion depth.

if (I == nullptr || depth > 3)

return nullptr;

if (SE.getSCEV(I) == S)

sanjoyUnsubmitted

Done

Might want to check SE.isSCEVAble(I->getType()) before calling getSCEV to avoid creating too many SCEVUnknown objects unnecessarily.

sanjoy: Might want to check `SE.isSCEVAble(I->getType())` before calling `getSCEV` to avoid creating…

sanjoyUnsubmitted

Done

Minor: I'd add a small comment that you're doing the pre-check on using isSCEVable to avoid creating too many SCEVUnknown expressions.

sanjoy: Minor: I'd add a small comment that you're doing the pre-check on using `isSCEVable` to avoid…

return I;

sanjoyUnsubmitted

Done

I'm worried that we're possibly doing too much work here (getSCEV can be expensive). Can we further filter on the type of I? For instance, if the width of I->getType() and S->getType() are different, then we don't need to bother calling getSCEV on I.

Edit: I'm much less worried about doing too much work here, after seeing the guard on SE.hasOperand in the caller; but I still think filtering on the type is a good idea.

sanjoy: I'm worried that we're possibly doing too much work here (`getSCEV` can be expensive). Can we…

flyingforyouAuthorUnsubmitted

Not Done

Thanks for good sugesstion.

flyingforyou: Thanks for good sugesstion.

for (unsigned i = 0; i < I->getNumOperands(); ++i) {

sanjoyUnsubmitted

Done

Use a range for here.

sanjoy: Use a range `for` here.

Instruction *Op = FindSameSCEVInst(

SE, S, dyn_cast<Instruction>(I->getOperand(i)), ++depth);

sanjoyUnsubmitted

Done

I think it will be cleaner to handle the non-instruction case in the caller (and remove the I == nullptr check above), since there is no reason to have the initial caller (first level of recursion) pass in nullptr for I.

I don't think depth++ is correct -- you'll blow the stack if I is a PHI node with itself as its first operand. At the very least, it needs to be ++depth. Also, since you're calling the counter depth, it'd make more sense to pass in depth + 1 here, and not ++depth. If you do want to increment depth in place, it might be better to call it Iter.

sanjoy: I think it will be cleaner to handle the non-instruction case in the caller (and remove the `I…

flyingforyouAuthorUnsubmitted

Not Done

Sure, I will remove` I == nullptr`, add nullptr check when calling this function.

Thanks. I will change ++depth to Depth + 1.

flyingforyou: Sure, I will remove` I == nullptr`, add nullptr check when calling this function. Thanks. I…

if (Op != nullptr)

return Op;

}

mzolotukhinUnsubmitted

Done

Shouldn't it be SCEVTraversal<SCEVSearch>?

mzolotukhin: Shouldn't it be `SCEVTraversal<SCEVSearch>`?

flyingforyouAuthorUnsubmitted

Not Done

I think current code is ok.
But there is same code in ScalarEvolution class. So I remove this function. Thanks.

flyingforyou: I think current code is ok. But there is same code in ScalarEvolution class. So I remove this…

return nullptr;

}

Value *SCEVExpander::findExistingExpansion(const SCEV *S, Value *SCEVExpander::findExistingExpansion(const SCEV *S,

const Instruction *At, Loop *L) { const Instruction *At, Loop *L) {

mzolotukhinUnsubmitted

Done

What are LHS and RHS here?

mzolotukhin: What are `LHS` and `RHS` here?

using namespace llvm::PatternMatch; using namespace llvm::PatternMatch;

SmallVector<BasicBlock *, 4> ExitingBlocks; SmallVector<BasicBlock *, 4> ExitingBlocks;

L->getExitingBlocks(ExitingBlocks); L->getExitingBlocks(ExitingBlocks);

// Look for suitable value in simple conditions at the loop exits. // Look for suitable value in simple conditions at the loop exits.

for (BasicBlock *BB : ExitingBlocks) { for (BasicBlock *BB : ExitingBlocks) {

sanjoyUnsubmitted

Not Done

I don't see why you need to match twice? You should just be able to do

match(BB->getTerminator(), m_Br(m_ICmp(Pred, m_Instruction(LHS), m_Value(RHS)))

with Value *RHS. Then you can make your later code conditional on isa<Instruction>(RHS).

sanjoy: I don't see why you need to `match` twice? You should just be able to do ``` match(BB…

ICmpInst::Predicate Pred; ICmpInst::Predicate Pred;

Instruction *LHS, *RHS; Instruction *LHS;

Value *RHSV;

BasicBlock *TrueBB, *FalseBB; BasicBlock *TrueBB, *FalseBB;

if (!match(BB->getTerminator(), if (!match(BB->getTerminator(),

mzolotukhinUnsubmitted

Done

This loop will always end on the first operand.
Should it be

if (Op)
  return Op;

mzolotukhin: This loop will always end on the first operand. Should it be ``` if (Op) return Op; ``` ?

flyingforyouAuthorUnsubmitted

Not Done

Oh...Thanks.

flyingforyou: Oh...Thanks.

m_Br(m_ICmp(Pred, m_Instruction(LHS), m_Instruction(RHS)), m_Br(m_ICmp(Pred, m_Instruction(LHS), m_Value(RHSV)), TrueBB,

TrueBB, FalseBB))) FalseBB)))

continue; continue;

mzolotukhinUnsubmitted

Not Done

I don't understand why we need to break symmetry in this code. Anyway, if we do this, we should either add an assert that the opposite situation never happens, or properly handle both cases. The fact that we've only seen one case is not a justification for not handling the other one.

IOW, why can't LHS be Value, and RHS be Instruction?

mzolotukhin: I don't understand why we need to break symmetry in this code. Anyway, if we do this, we should…

flyingforyouAuthorUnsubmitted

Not Done

I think LHS can be Value. But what I know is above pattern can cover most of cases.

if (match(BB->getTerminator(),
           m_Br(m_ICmp(Pred, m_Value(LHSV), m_Value(RHSV)), TrueBB,
                FalseBB))) {
  dbgs() << "m_Value(LHSV), m_Value(RHSV) pattern\n";
  if (match(BB->getTerminator(),
             m_Br(m_ICmp(Pred, m_Instruction(LHS), m_Instruction(RHS)), TrueBB,
                  FalseBB)))
    dbgs() << "m_Instruction(LHS), m_Instruction(RHS) pattern\n";
  else if (match(BB->getTerminator(),
             m_Br(m_ICmp(Pred, m_Instruction(LHS), m_Value(RHSV)), TrueBB,
                  FalseBB)))
    dbgs() << "m_Instruction(LHS), m_Value(RHSV) pattern\n";
  else
    dbgs() << "unknown(LHS), unknown(RHS) pattern\n";

}

When I test above code, I can't find "unknown(LHS), unknown(RHS) pattern\n" on building big benchmarks programs including LLVM itself.

m_Value(LHSV), m_Value(RHSV) pattern	50650
m_Instruction(LHS), m_Instruction(RHS) pattern	30178
m_Instruction(LHS), m_Value(RHSV) pattern	20472

So I think it's no necessary to consider LHS is Value now.

Actually, we don't make compare instruction likes cmp constant, reg...

flyingforyou: I think LHS can be Value. But what I know is above pattern can cover most of cases. ```…

mzolotukhinUnsubmitted

Not Done

My concern isn't about us missing some real cases right now, it's more about future maintenance of this code. If later on someone finds such a case, it'll be hard to figure out, why isn't it covered in the first place, when the code was committed.

mzolotukhin: My concern isn't about us missing some real cases right now, it's more about future maintenance…

mzolotukhinUnsubmitted

Done

Is there any point from changing LHS and RHS from Instruction to Value? Looks like later on we cast it to Instruction anyway..

mzolotukhin: Is there any point from changing `LHS` and `RHS` from `Instruction` to `Value`? Looks like…

if (Instruction *RHS = dyn_cast<Instruction>(RHSV)) {

sanjoyUnsubmitted

Done

I'd "CSE" the two calls to getSCEV(LHS) to avoid the extra dictionary lookup.

sanjoy: I'd "CSE" the two calls to `getSCEV(LHS)` to avoid the extra dictionary lookup.

sanjoyUnsubmitted

Done

These two identical blocks should definitely be combined into a lambda.

sanjoy: These two identical blocks should definitely be combined into a lambda.

if (SE.getSCEV(LHS) == S && SE.DT.dominates(LHS, At)) if (SE.getSCEV(LHS) == S && SE.DT.dominates(LHS, At))

return LHS; return LHS;

if (SE.getSCEV(RHS) == S && SE.DT.dominates(RHS, At)) if (SE.getSCEV(RHS) == S && SE.DT.dominates(RHS, At))

return RHS; return RHS;

mzolotukhinUnsubmitted

Done

I think FindCongruentInst isn't needed right now. The logic in findExistingExpansion should basically mirror the logic from expand: expand just looks at the hash map, so we should do the same here. (And that should do the trick in our example here)

mzolotukhin: I think `FindCongruentInst` isn't needed right now. The logic in `findExistingExpansion` should…

} else if (isa<ConstantInt>(RHSV)) {

sanjoyUnsubmitted

Not Done

Why do you care about RHS being a constant?

I think it is better to have the logic be:

Check LHS as usual
If RHSV is an Instruction then check RHS as well
Use FindSameSCEVInst to do a deeper search on LHS

Is there any reason why you don't need to call FindSameSCEVInst on RHS as well (when it is an instruction)?

sanjoy: Why do you care about `RHS` being a constant? I think it is better to have the logic be: #…

flyingforyouAuthorUnsubmitted

Not Done

There is no critical reason that I care about RHS beging a constant.

I will modify the logic what you said. It seems more proper way for findExistingExpansion.

Thanks.

flyingforyou: There is no critical reason that I care about RHS beging a constant. I will modify the logic…

if (SE.hasOperand(SE.getSCEV(LHS), S)) {

LHS = FindSameSCEVInst(SE, S, LHS, 0);

if (LHS != nullptr && SE.DT.dominates(LHS, At))

return LHS;

}

} }

sanjoyUnsubmitted

Done

Please don't reuse the LHS variable here. Perhaps auto *ExistingInst = ..?

sanjoy: Please don't reuse the `LHS` variable here. Perhaps `auto *ExistingInst = ..`?

flyingforyouAuthorUnsubmitted

Not Done

Thanks for guiding this. As you said, using lambda could be more clear than before. Great!.

flyingforyou: Thanks for guiding this. As you said, using lambda could be more clear than before. Great!.

// There is potential to make this significantly smarter, but this simple // There is potential to make this significantly smarter, but this simple

// heuristic already gets some interesting cases. // heuristic already gets some interesting cases.

// Can not find suitable value. // Can not find suitable value.

return nullptr; return nullptr;

sanjoyUnsubmitted

Not Done

I don't understand this part -- how do you conclude that LHS is a correct expansion for S? E.g. what if getSCEV(LHS) is {S,+,1}? Then clearly the expansion of S isn't LHS.

I think you need to do a (recursive) search of the IR level operands of LHS and RHS to see if any of them are congruent to S.

sanjoy: I don't understand this part -- how do you conclude that `LHS` is a correct expansion for `S`?

} }

mzolotukhinUnsubmitted

Not Done

Why do we need these conditions? Isn't dominance sufficient?

Also, would it make sense to factor out this code block to a separate function (and use it both here and in expand)?

mzolotukhin: Why do we need these conditions? Isn't dominance sufficient? Also, would it make sense to…

flyingforyouAuthorUnsubmitted

Not Done

Attached Wei's comment in r259815

Fix a regression for r259736.

When SCEV expansion tries to reuse an existing value, it is needed to ensure
that using the Value at the InsertPt will not break LCSSA. The fix adds a
check that InsertPt is either inside the candidate Value's parent loop, or
the candidate Value's parent loop is nullptr.

would it make sense to factor out this code block to a separate function (and use it both here and in expand)?

Sure.

flyingforyou: Attached Wei's comment in r259815 > Fix a regression for r259736. > > When SCEV…

mzolotukhinUnsubmitted

Not Done

You could do

if (RHS = dyn_cast<Instruction>(RHSV)) {
...

Also, I wonder if we should check for the case where LHS is a ConstantInt and RHS is the instruction.

mzolotukhin: You could do ``` if (RHS = dyn_cast<Instruction>(RHSV)) { ... ``` Also, I wonder if we should…

flyingforyouAuthorUnsubmitted

Not Done

Also, I wonder if we should check for the case where LHS is a ConstantInt and RHS is the instruction.

I am not sure. Isn't it optional??

flyingforyou: > Also, I wonder if we should check for the case where LHS is a ConstantInt and RHS is the…

bool SCEVExpander::isHighCostExpansionHelper( bool SCEVExpander::isHighCostExpansionHelper(

const SCEV *S, Loop *L, const Instruction *At, const SCEV *S, Loop *L, const Instruction *At,

SmallPtrSetImpl<const SCEV *> &Processed) { SmallPtrSetImpl<const SCEV *> &Processed) {

// If we can find an existing value for this scev avaliable at the point "At" // If we can find an existing value for this scev avaliable at the point "At"

// then consider the expression cheap. // then consider the expression cheap.

if (At && findExistingExpansion(S, At, L) != nullptr) if (At && findExistingExpansion(S, At, L) != nullptr)

return false; return false;

▲ Show 20 Lines • Show All 162 Lines • Show Last 20 Lines

lib/Transforms/Utils/LoopUnrollRuntime.cpp

Show First 20 Lines • Show All 305 Lines • ▼ Show 20 Lines	bool llvm::UnrollRuntimeLoopProlog(Loop *L, unsigned Count,

// Add 1 since the backedge count doesn't include the first loop iteration		// Add 1 since the backedge count doesn't include the first loop iteration
const SCEV *TripCountSC =		const SCEV *TripCountSC =
SE->getAddExpr(BECountSC, SE->getConstant(BECountSC->getType(), 1));		SE->getAddExpr(BECountSC, SE->getConstant(BECountSC->getType(), 1));
if (isa<SCEVCouldNotCompute>(TripCountSC))		if (isa<SCEVCouldNotCompute>(TripCountSC))
return false;		return false;

BasicBlock *Header = L->getHeader();		BasicBlock *Header = L->getHeader();
		BasicBlock *PH = L->getLoopPreheader();
		BranchInst *PreHeaderBR = cast<BranchInst>(PH->getTerminator());
const DataLayout &DL = Header->getModule()->getDataLayout();		const DataLayout &DL = Header->getModule()->getDataLayout();
SCEVExpander Expander(*SE, DL, "loop-unroll");		SCEVExpander Expander(*SE, DL, "loop-unroll");
if (!AllowExpensiveTripCount && Expander.isHighCostExpansion(TripCountSC, L))		if (!AllowExpensiveTripCount &&
		Expander.isHighCostExpansion(TripCountSC, L, PreHeaderBR))
		sanjoyUnsubmitted Not Done Reply Inline Actions I don't think this is the right value for `At`. `At` should be the location you want to expand the SCEV expression at, which in this case is `PreHeaderBR`. sanjoy: I don't think this is the right value for `At`. `At` should be the location you want to expand…
		mzolotukhinUnsubmitted Done Reply Inline Actions Could we avoid changing anything unless we're confident we're going to unroll? E.g. we can split `PH`--->`Header` edge, then split `PEnd` basic block, and then we can decide to bail out. mzolotukhin: Could we avoid changing anything unless we're confident we're going to unroll? E.g. we can…
		flyingforyouAuthorUnsubmitted Not Done Reply Inline Actions Yes. You're right. flyingforyou: Yes. You're right.
return false;		return false;

// We only handle cases when the unroll factor is a power of 2.		// We only handle cases when the unroll factor is a power of 2.
// Count is the loop unroll factor, the number of extra copies added + 1.		// Count is the loop unroll factor, the number of extra copies added + 1.
if (!isPowerOf2_32(Count))		if (!isPowerOf2_32(Count))
return false;		return false;

// This constraint lets us deal with an overflowing trip count easily; see the		// This constraint lets us deal with an overflowing trip count easily; see the
// comment on ModVal below.		// comment on ModVal below.
if (Log2_32(Count) > BEWidth)		if (Log2_32(Count) > BEWidth)
return false;		return false;

// If this loop is nested, then the loop unroller changes the code in		// If this loop is nested, then the loop unroller changes the code in
// parent loop, so the Scalar Evolution pass needs to be run again		// parent loop, so the Scalar Evolution pass needs to be run again
		mzolotukhinUnsubmitted Not Done Reply Inline Actions There is a method `getStepRecurrence` in `AR`, which you might want to use here. Also, I don't think that changing `AllowExpensiveTripCount` here is a good way to achieve what you try to achieve. Why (and where) is this tripcount considered expensive? To me it looks like we should rather fix those places to recognize cases like this as 'inexpensive'. mzolotukhin: There is a method `getStepRecurrence` in `AR`, which you might want to use here. Also, I don't…
		flyingforyouAuthorUnsubmitted Not Done Reply Inline Actions There is a method getStepRecurrence in AR, which you might want to use here. Thanks, I will change this later. Why (and where) is this tripcount considered expensive? When we run test case "high-cost-trip-count-computation.ll", we can get expression of Tripcount SCEV likes below. test1 func: (1 + ((-1 + (-1 * %v12) + (8193 smax (%v12 + %step))) /u %step)) test2 func: (1 umax ((23 + %conv7) /u %conv7)) test1's division is generated for computing tripcount. Because %step is unknown. This case, we consider this is expensive. We need to generate new division for computing tripcount. test2's division is generated by original code which is in loop pre header(entry). In this case, isHighCostExpansion can't figure it out this is for computing trip count or it came from original code. And isHighCostExpansion is also used by IndVarSimplify. It's not only for LoopUnrollRuntime. If you still think changing AllowExpensiveTripCount is not good way, please let me know your idea. Junmo. flyingforyou: > There is a method getStepRecurrence in AR, which you might want to use here. Thanks, I will…
		mzolotukhinUnsubmitted Not Done Reply Inline Actions Ok, I think I understood the issue. So, in `isHighCostExpansionHelper` routine there is a check if (At && findExistingExpansion(S, At, L) != nullptr) return false; Ideally, it should find the existing computation and immediately report that the expansion isn't expensive. In practice, this function doesn't always work as we'd like it to work (see e.g. https://llvm.org/bugs/show_bug.cgi?id=24920), but I'd suggest fixing it instead of working around in other places. mzolotukhin: Ok, I think I understood the issue. So, in `isHighCostExpansionHelper` routine there is a…
		flyingforyouAuthorUnsubmitted Not Done Reply Inline Actions Thanks for sharing bug about "isHighCostExpansionHelper" function. LoopUnrollRuntime use isHighCostExpansion function without "At". This means it only want to check TripCountSC has expensive expression or not. So, checking IV's step may not be working around for LoopUnrollRuntime. Junmo. flyingforyou: Thanks for sharing bug about "isHighCostExpansionHelper" function. LoopUnrollRuntime use…
		mzolotukhinUnsubmitted Not Done Reply Inline Actions Yes, we're not passing `At` right now, but we can if we need to. If we do need to expand trip count, it'll be inserted into the loop preheader (look for `PreHeaderBR` in the code further down), so we actually know `At`. And, my main concern is that I don't believe that statement "step is 1/-1" is equivalent to statement "trip count expression is cheap to expand". At least, it's not obvious and requires some proof/explanation (i.e. comments in the code). mzolotukhin: Yes, we're not passing `At` right now, but we can if we need to. If we do need to expand trip…
if (Loop *ParentLoop = L->getParentLoop())		if (Loop *ParentLoop = L->getParentLoop())
SE->forgetLoop(ParentLoop);		SE->forgetLoop(ParentLoop);

BasicBlock *PH = L->getLoopPreheader();
BasicBlock *Latch = L->getLoopLatch();		BasicBlock *Latch = L->getLoopLatch();
// It helps to splits the original preheader twice, one for the end of the		// It helps to splits the original preheader twice, one for the end of the
// prolog code and one for a new loop preheader		// prolog code and one for a new loop preheader
BasicBlock *PEnd = SplitEdge(PH, Header, DT, LI);		BasicBlock *PEnd = SplitEdge(PH, Header, DT, LI);
BasicBlock *NewPH = SplitBlock(PEnd, PEnd->getTerminator(), DT, LI);		BasicBlock *NewPH = SplitBlock(PEnd, PEnd->getTerminator(), DT, LI);
BranchInst *PreHeaderBR = cast<BranchInst>(PH->getTerminator());		PreHeaderBR = cast<BranchInst>(PH->getTerminator());

// Compute the number of extra iterations required, which is:		// Compute the number of extra iterations required, which is:
// extra iterations = run-time trip count % (loop unroll factor + 1)		// extra iterations = run-time trip count % (loop unroll factor + 1)
Value *TripCount = Expander.expandCodeFor(TripCountSC, TripCountSC->getType(),		Value *TripCount = Expander.expandCodeFor(TripCountSC, TripCountSC->getType(),
PreHeaderBR);		PreHeaderBR);
		mzolotukhinUnsubmitted Not Done Reply Inline Actions Please make sure that expander also knows how to reuse existing values. From my last investigation of PR24920 I remember that fixing `findExistingExpansion` alone isn't enough, as we generate expensive trip count anyway, ignoring the fact that we can reuse existing value for it. Also, you might take a look at my patch for fixing this bug, which I posted in D12765 (see 6th comment from the end). mzolotukhin: Please make sure that expander also knows how to reuse existing values. From my last…
		flyingforyouAuthorUnsubmitted Not Done Reply Inline Actions I think this is another issue about `findExistingExpansion` . I am not sure that I can fix the issue in this patch. flyingforyou: I think this is another issue about `findExistingExpansion` . I am not sure that I can fix the…
		mzolotukhinUnsubmitted Not Done Reply Inline Actions This should be in the same patch, otherwise the change will only paper over the issue, and lead to actually worse code in the end. If `expandCodeFor` isn't updated, it'll fail to reuse the existing value, so it'll generate yet another division. The idea of the patch is to teach SCEV to reuse existing expressions, not just hack around the high cost of division. The division should still be considered expensive, so we definitely shouldn't generate redundant ones. mzolotukhin: This should be in the same patch, otherwise the change will only paper over the issue, and lead…
		flyingforyouAuthorUnsubmitted Not Done Reply Inline Actions Michael, I am not an expert of SCEV. I think you might want to get review from Sanjoy. Sanjoy, How about applying below patch on this? - if (SE.getSCEV(LHS) == S && SE.DT.dominates(LHS, At)) + const SCEV LHSSE = SE.getSCEV(LHS); + if (LHSSE->getType() == S->getType() && + isa<SCEVConstant>(SE.getMinusSCEV(LHSSE, S)) && + SE.DT.dominates(LHS, At)) return LHS; if (Instruction RHS = dyn_cast<Instruction>(RHSV)) { - if (SE.getSCEV(RHS) == S && SE.DT.dominates(RHS, At)) + const SCEV RHSSE = SE.getSCEV(RHS); + if (RHSSE->getType() == S->getType() && + isa<SCEVConstant>(SE.getMinusSCEV(RHSSE, S)) && + SE.DT.dominates(RHS, At)) return RHS; } flyingforyou:* Michael, I am not an expert of SCEV. I think you might want to get review from Sanjoy. Sanjoy…
		mzolotukhinUnsubmitted Not Done Reply Inline Actions We can wait for Sanjoy's opinion too, but the patch can't be landed without this issue fixed first anyway. I slightly modified your testcase to expose the issue: define i32 @test2(i64* %loc, i64 %conv7) { entry: %rem0 = load i64, i64* %loc, align 8 %ExpensiveComputation = udiv i64 %rem0, 42 ; <<< Extra computations are added to the trip-count expression br label %bb1 bb1: %div11 = udiv i64 %ExpensiveComputation, %conv7 %cmp.i38 = icmp ugt i64 %div11, 1 %div12 = select i1 %cmp.i38, i64 %div11, i64 1 br label %for.body for.body: %rem1 = phi i64 [ %rem0, %bb1 ], [ %rem2, %for.body ] %k1 = phi i64 [ %div12, %bb1 ], [ %dec, %for.body ] %mul1 = mul i64 %rem1, 48271 %rem2 = urem i64 %mul1, 2147483647 %dec = add i64 %k1, -1 %cmp = icmp eq i64 %dec, 0 br i1 %cmp, label %exit, label %for.body exit: %rem3 = phi i64 [ %rem2, %for.body ] store i64 %rem3, i64* %loc, align 8 ret i32 0 } If we apply the patch and run this test thru `opt`, the loop will be successfully unrolled, but we'll generate the code for "expensive" computations. This is IR after unrolling: ... entry: %rem0 = load i64, i64* %loc, align 8 %ExpensiveComputation = udiv i64 %rem0, 42 br label %bb1 bb1: ; preds = %entry %div11 = udiv i64 %ExpensiveComputation, %conv7 %cmp.i38 = icmp ugt i64 %div11, 1 %div12 = select i1 %cmp.i38, i64 %div11, i64 1 %0 = udiv i64 %rem0, 42 ; <<<< This is our %ExpensiveComputation, and we just duplicated it! %1 = udiv i64 %0, %conv7 ; <<< %2 = icmp ugt i64 %1, 1 %umax = select i1 %2, i64 %1, i64 1 %3 = add i64 %umax, -1 %xtraiter = and i64 %umax, 7 %lcmp.mod = icmp ne i64 %xtraiter, 0 br i1 %lcmp.mod, label %for.body.prol, label %bb1.split ... Please note inserted instruction `%0` and `%1` - they're completely redundant (as we compute the same value, that we have in `%div11`). So, as I said, it's not sufficient to just fix `findExistingExpransion`, we also need to fix `SCEVExpander::expand`. mzolotukhin: We can wait for Sanjoy's opinion too, but the patch can't be landed without this issue fixed…
		sanjoyUnsubmitted Not Done Reply Inline Actions I think the original idea was that if we know that a certain value is already being computed, then we can emit code to compute it again, and rely on later passes to get rid of the redundancy (so in your example above we'd expect `%0` and `%1` to be CSE'ed with `%ExpensiveComputation` and `%div11`). But I understand your concern that this is a pessimization unless we run some cleanup passes afterwards, and even then we rely on the cleanup passes being smart enough to undo our mess. Perhaps `SCEVExpander::expand` can be fixed in a separate change (since the scope of that fix is bigger than what is addressed here) that goes in before this one? sanjoy: I think the original idea was that if we know that a certain value is already being computed…
		mzolotukhinUnsubmitted Not Done Reply Inline Actions My example was pretty simple, just to demonstrate the problem, and in this case indeed a simple clean-up would be sufficient to remove the introduced redundancy. But it's easy to construct a more complicated test, which existing clean-up passes might have problems to deal with, and since SCEV expressions might be very complicated, expanding all of them might be not a good idea, especially taking into account that some of them are actually expensive. As for separating the patches - I'm entirely for it, but I just want to first make sure that there are no fundamental issues that can prevent us from fixing `expand`. My biggest concern is that this change leads to somewhat inconsistent state, where `findExistingExpansion` and `expand` have different assumptions. I'll feel much easier about it once I see the `expand` part of the patch. mzolotukhin: My example was pretty simple, just to demonstrate the problem, and in this case indeed a simple…
Value *BECount = Expander.expandCodeFor(BECountSC, BECountSC->getType(),		Value *BECount = Expander.expandCodeFor(BECountSC, BECountSC->getType(),
PreHeaderBR);		PreHeaderBR);

IRBuilder<> B(PreHeaderBR);		IRBuilder<> B(PreHeaderBR);
Value *ModVal = B.CreateAnd(TripCount, Count - 1, "xtraiter");		Value *ModVal = B.CreateAnd(TripCount, Count - 1, "xtraiter");

// If ModVal is zero, we know that either		// If ModVal is zero, we know that either
// 1. there are no iteration to be run in the prologue loop		// 1. there are no iteration to be run in the prologue loop
▲ Show 20 Lines • Show All 62 Lines • Show Last 20 Lines

test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll

Show All 18 Lines	loop: ; preds = %entry, %loop
%v15 = add nuw nsw i64 %k.015, %step		%v15 = add nuw nsw i64 %k.015, %step
%v16 = icmp slt i64 %v15, 8193		%v16 = icmp slt i64 %v15, 8193
br i1 %v16, label %loop, label %loopexit		br i1 %v16, label %loop, label %loopexit

loopexit: ; preds = %loop		loopexit: ; preds = %loop
ret i32 0		ret i32 0
}		}

		;; We expect this loop to be unrolled, because IV's step is minus one.
		;; In this case, we don't need to generate dvision for computing trip count.
		mzolotukhinUnsubmitted Not Done Reply Inline Actions I think this description is a bit misleading. We can unroll not because IV's step is minus one, but because tripcount is known to be divisible by `%conv7` (which will be unroll factor, right?), and we don't need to generate a new expression for the new tripcount. Is this correct? mzolotukhin: I think this description is a bit misleading. We can unroll not because IV's step is minus one…
		flyingforyouAuthorUnsubmitted Not Done Reply Inline Actions It will not be unroll factor. I just want to make example that trip count can be divided by original codes. Not belong to IV's step is unknown likes first example. The main point of first example(test) is that if IV's step is unknown, we should make division for getting tripcount for unrolled loop. we don't need to generate a new expression for the new tripcount. Is this correct? This is correct. If you have better explanation about this example, please let me know. flyingforyou: It will not be unroll factor. I just want to make example that trip count can be divided by…
		mzolotukhinUnsubmitted Done Reply Inline Actions Maybe something like this? Though SCEV for loop tripcount contains division, it shouldn't be considered expensive, since the division already exists in the code and we don't need to expand it once more. Thus, it shouldn't prevent us from unrolling the loop. Also, I think I don't completely understand why we care so much about step being -1. Why is it important? mzolotukhin: Maybe something like this? ``` Though SCEV for loop tripcount contains division, it shouldn't…
		flyingforyouAuthorUnsubmitted Done Reply Inline Actions Cool. Michael. I will modify comments as you said. `-1` indicates that tripcount doesn't need to be divided for calculation when loop unrolling. Anyway thanks for good sugesstion. flyingforyou: Cool. Michael. I will modify comments as you said. `-1` indicates that tripcount doesn't need…

		define i32 @test2(i64* %loc, i64 %conv7) {
		; CHECK-LABEL: @test2(
		; CHECK-LABEL: for.body.prol
		entry:
		%rem0 = load i64, i64* %loc, align 8, !tbaa !0
		%div11 = udiv i64 %rem0, %conv7
		%cmp.i38 = icmp ugt i64 %div11, 1
		%div12 = select i1 %cmp.i38, i64 %div11, i64 1
		br label %for.body
		for.body:
		%rem1 = phi i64 [ %rem0, %entry ], [ %rem2, %for.body ]
		%k1 = phi i64 [ %div12, %entry ], [ %dec, %for.body ]
		%mul1 = mul i64 %rem1, 48271
		%rem2 = urem i64 %mul1, 2147483647
		%dec = add i64 %k1, -1
		%cmp = icmp eq i64 %dec, 0
		br i1 %cmp, label %exit, label %for.body
		exit:
		%rem3 = phi i64 [ %rem2, %for.body ]
		store i64 %rem3, i64* %loc, align 8, !tbaa !0
		mzolotukhinUnsubmitted Not Done Reply Inline Actions It's definitely possible to simplify this testcase by removing lots of unneeded code from this loop. Please keep only parts that are needed to expose the issue. mzolotukhin: It's definitely possible to simplify this testcase by removing lots of unneeded code from this…
		flyingforyouAuthorUnsubmitted Not Done Reply Inline Actions Yes. You're right. I will remove unnecessary codes. Thanks. flyingforyou: Yes. You're right. I will remove unnecessary codes. Thanks.
		ret i32 0
		}

!0 = !{i64 1, i64 100}		!0 = !{i64 1, i64 100}
		mzolotukhinUnsubmitted Not Done Reply Inline Actions Nitpick: this metadata is probably unnecessary. mzolotukhin: Nitpick: this metadata is probably unnecessary.
		flyingforyouAuthorUnsubmitted Not Done Reply Inline Actions This is original code. I'm not sure it's ok to remove. Sanjoy, Can I remove this? flyingforyou: This is original code. I'm not sure it's ok to remove. Sanjoy, Can I remove this?
		sanjoyUnsubmitted Not Done Reply Inline Actions Reading the code for `@test`, I think the metadata is required to tell LLVM that it is okay to generate a divide by `%step` if it thinks that is profitable to do (and so the test case actually checks LLVM's heuristics, not correctness). sanjoy: Reading the code for `@test`, I think the metadata is required to tell LLVM that it is okay to…
		flyingforyouAuthorUnsubmitted Not Done Reply Inline Actions Thanks. I'll remain the code. flyingforyou: Thanks. I'll remain the code.
		mzolotukhinUnsubmitted Done Reply Inline Actions This metadata is used as `!range !0` in `@test` and as `!tbaa !0` in `@test2`. Are you sure this is correct? mzolotukhin: This metadata is used as `!range !0` in `@test` and as `!tbaa !0` in `@test2`. Are you sure…
		flyingforyouAuthorUnsubmitted Done Reply Inline Actions I don't think that makes errors. But I got your point. Thanks. I will modify `!tbaa !0` to `!tbaa !1` and add `!1 = !{i64 0}` Is this ok? flyingforyou: I don't think that makes errors. But I got your point. Thanks. I will modify `!tbaa !0 ` to `!
		mzolotukhinUnsubmitted Done Reply Inline Actions This is ok (or you can just remove `!tbaa !1` and `!1 = !{i64 0}` at all). Thanks! mzolotukhin: This is ok (or you can just remove `!tbaa !1` and `!1 = !{i64 0}` at all). Thanks!

This is an archive of the discontinued LLVM Phabricator instance.

[SCEVExpander] Make findExistingExpansion smarterClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 43578

lib/Analysis/ScalarEvolutionExpander.cpp

lib/Transforms/Utils/LoopUnrollRuntime.cpp

test/Transforms/LoopUnroll/high-cost-trip-count-computation.ll

[SCEVExpander] Make findExistingExpansion smarter
ClosedPublic