Synced with the LLDB team. LGTM.
Jul 17 2020
May 12 2020
It looks a bit weird to check for the FP constant again (see line 2802). The crux is that constant folding for the CE fires only when size > 8 on line 2825. If folding happened at this point independent of size, then you would not need your checks in 2840ff. So could the size check simply be removed for the fix?
Feb 13 2020
+ Duncan, Michael for clang.
+ Amara, David for another opinion. Seems straightforward & LGTM.
Especially for inliner changes I suggest a) sharing performance numbers on standard benchmarks with O2 & O2 (thin) LTO, and b) offering a flag to toggle individual changes. Performance here includes run-time performance, compile time, and code size. A simple change can have a detrimental impact on any of the key metrics folks might be watching, and that impact could be different depending on the targets.
Feb 6 2020
Dec 12 2019
A question and a nit to get it started.
Oops, I removed the original reviewers?! Please add anyone else I missed; I only remember Kristof. Sorry for this nuisance!
Dec 5 2019
Dec 3 2019
I have one suggestion and one more question.
Nov 22 2019
Some food for thought & perhaps you could add a test (or more) for FP types other than double?
Nov 12 2019
Aug 21 2019
Jun 26 2019
+ reviewers to move this forward hopefully
Apr 16 2019
Could you separate the checks for which size one is OK from the checks that require even powers of 2? That smells like an inconsistency worth calling out explicitly.
Mar 18 2019
Looks reasonable. Added more reviewers for second opinions.
Dec 17 2018
+ Akira, Quentin for their driver / x86_64h experience for a quick double check. Fwiw, LGTM.
Dec 14 2018
This is a straightforward way to pass along alignment tags, and it fixes at least one critical bug. LGTM.
Dec 13 2018
Dec 12 2018
Oct 23 2018
Jul 24 2018
+ David, Zia
Jun 13 2018
May 8 2018
The test case could possibly be shorter, but the change LGTM.
Feb 22 2018
Yes, this LGTM. I could not find a way to perform updates properly in the example. If recomputing the dependencies results in a compile-time issue we will have to dig deeper into MemDep.
Feb 21 2018
Jan 31 2018
I assume you will take care of the comment. LGTM.
Jan 30 2018
This is very close now. Could you add explicit examples (e.g. show the IR) of which initializations remain slow (Complex) and which are fast now? This should also address the spirit of Adrian's question, I think.
Jan 29 2018
I like the spirit of the idea. What made you look into this? Some more questions and suggestions below, but LGTM as is.
Thank you for drilling into this! I have a few questions below. Also, could you comment on the time savings you measured for your implementation?
Jan 4 2018
+ Matthias for review & thoughts about the unit test.
Jan 2 2018
Thanks! This LGTM then.
I see all issues that came up in this thread covered by the last patch. Before this can be committed I still want to check that all paths/issues are tested/covered wrt fast-isel:
a) how do we guarantee that there is no fallback from GISel to FastISel (when GISel is supported)? This is probably a nit since it is obvious to everyone deeper in the implementation than I am.
b) for all the tests where fast-isel was added, shouldn't there be equivalent tests for GISel, too? Even if the tests target fast-isel-specific issues, how do we make sure GISel does not expose similar/same bugs?
Dec 5 2017
It looks pretty straightforward, but I'd ask (at least) Duncan or Akira, and Matthias to review this more carefully.
I think your commit makes the accounting for both cases - the instructions inserted and the instructions deleted - consistent. From that angle this looks OK. I also have some food for thought: in both cases the code assumes that it inserts a dependent instruction chain; only then is it correct to simply add instruction latencies. I have not checked that this assumption holds for all cases, so I suggest adding a comment about it. Assuming also that you don't see any notable perf regression, you can go ahead and commit from my perspective: LGTM.
Oct 11 2017
Thanks for working on this. LGTM!
Sep 6 2017
LGTM. Just some food for thought for possible follow up commits.
Sep 5 2017
I added a few minor remarks that you might want to consider before commit. Again, thanks for working on this and your patience!
Aug 25 2017
Thank you for sharing your numbers. I was hoping you had that show-off case.
I just have a few minor comments about readability and asserts. I still need a bit more time to convince myself that all the pieces fit together. I was wondering if you could add a test that dumps the depths with and without the incremental updates (compile with full combination + dump depths, compile with a threshold of 1 + dump depths, diff the dumps)?
Aug 21 2017
Jul 27 2017
Jul 19 2017
It is not clear to me if and how the original questions have been answered by this patch yet. Could you elaborate and add comments, please? Much appreciated!
Jul 18 2017
Thanks for working on this! Please add comments and show off one or two examples that highlight the benefits of your extensions.
Jul 12 2017
Jul 6 2017
This is interesting. Unfortunately these heuristic changes are always tricky and never satisfying. Is this change motivated entirely by limitations of the current shrink-wrapping algorithm, or do you see gains from better allocation as well?
Jun 29 2017
Just a first superficial review. I haven't thought about the underlying concept itself yet.
May 11 2017
May 2 2017
Apr 20 2017
I don't know if this patch makes the situation better or worse. It seems to touch the tip of an iceberg. The underlying problem is: what can or should happen when a loop is irreducible? If the answer is "don't unroll", then this is the fix you're actually looking for. Obviously this loop is irreducible since it has a retreating edge from sink to body2, but body2 does not dominate sink. I'm also curious whether this loop is in the source code already or whether some pass in the compiler actually generated it. FWIW, unless the compiler has a problem, irreducible loops should be rare. And when the user writes an irreducible loop, I think it is OK for compiler optimizations to turn conservative.
Apr 16 2017
I would like to understand this better. Could somebody explain what assumptions SCEV makes about its clients? Which assumption(s) is broken by SLP? It seems to me that this issue potentially touches fundamental design decisions/questions and I don't see any verifiers in place.
Apr 12 2017
Please double check your code one more time. Thank you!
Mar 28 2017
I'd love to see some comments and get your thoughts about a verifier.
Feb 9 2017
Thanks for following up on this.
Jan 26 2017
I think the only issue that needs to be addressed is (finally!) sharing perf data. This has been raised at least 3 times. The possible compile-time implication, the speciality of the application (fast-math) etc are well understood.
Jan 24 2017
Hans, you were right. The computeKnownBits etc. parts should have been moved also. Sanjay committed the fix proper in r289442. I only kept the regression test in narrow-switch.ll (r293018).
Jan 18 2017
I'm leaning towards an LGTM since you addressed basically all my issues, but more people have mushroomed in and are curious about your performance data. So I think you can't dodge that question anymore and need to share some good data for the benchmark(s) you are tuning for before you get the nod.
Jan 2 2017
From my perspective the implementation is close and only requires a few minor changes.
Dec 22 2016
The "automatic" generation of patterns, e.g. with TableGen, is on my longer-term wish list, not a requirement for this patch. Sorry if my wording was confusing.
Do you have performance numbers?
Dec 21 2016
A few general remarks:
- I'm very much in favor of the MC combiner approach
- But I'm getting increasingly concerned about the MC code quality. I felt the FMA/AArch64 support started looking crabby (I'm the culprit) and this doesn't look much better, I'm afraid. The approach is in need of more automation for better maintainability. This is meant as food for thought. Since I'm to blame for the approach I can't be too harsh in reviews :-)
- I'm also concerned about the compile time, in particular since we don't track x86-specific issues (or any backend other than AArch64) - or at least I'm not aware that anyone is watching closely. Could you share some specific data about your perf gains and compile-time measurements? However, I think this optimization is for fast-math only, and compile time is probably less of an issue in that mode. One way to deal with compile-time issues is to wrap some MC under an option.
- Perhaps I missed it, but I expected the optimization to kick in only under fast math. I saw 'fast' in the test cases but didn't see a check in the code.
Dec 20 2016
Dec 13 2016
Dec 12 2016
Dec 9 2016
Michael, could you also lend this your expert eye? Do you agree with the extra memory (pair vs vector) and compile-time (look at the loop tree) investment for this fix? Shortening the types may not be worth it at least on some architectures.
Dec 8 2016
Good catch. LGTM.
Dec 1 2016
Nov 18 2016
The Oz case looks interesting. Can you share more details/insights about the "inaccuracies" with specific examples? I'm wondering if that can be fixed in general or whether it is triggered by some specific trunk characteristics. But this is just something we can think about and discuss while moving on and celebrating the ct/cs recoveries :-). So LGTM!
Nov 17 2016
Nov 14 2016
Thanks for following up!
Nov 1 2016
I thought I understood this until about the middle of the review. Now I could use some help, perhaps with variable names and comments that reflect more clearly the expression(s) you simplify. I think this is what Renato is looking for, too.
Oct 27 2016
Oct 26 2016
Could you provide more background on this idea? What is your motivational use case? When the trip count is low why optimize? If the profile is wrong and it actually is a hot loop for a regular/different input set peeling could hurt. There are also side effects on code size, register pressure etc. that could hurt performance.
Oct 14 2016
This looks weird - checking for null and then ignoring it?! Could you check why this IR is generated? I suspect there is either a bug or a missing optimization. In that case I suggest getting that fixed and adding an assertion in the select optimization.
Sep 15 2016
My concern is this would increase memory consumption in all backends. Maybe you can hide most of it with an opaque pointer in the machine operand.
Sep 8 2016
Jul 28 2016
Reduced test case.
Jul 27 2016
Nope, I don't see the tail call. Anyway, I'll simplify my test case. Don't worry about it.
Hm, FWIW, this test doesn't reproduce with our compiler that exposes the problem at stage 2. And I still think that when there are no uniforms within the loop there is no need to traverse the PHI nodes later.
Since the patch clearly fixes an obvious problem I'm OK with a commit even w/o a test case - LGTM.
Could you add the options you used to compile? I can certainly shrink the test case a bit before I commit.
Jul 7 2016
Ahmed pointed out the clang commit causing the issue is r244207 - Mark calls in thunk functions as tail-call optimization candidates
Jul 6 2016
Jul 5 2016
Jul 1 2016
Jun 24 2016
I'll also take a more in-depth look into the load and store merge routines. At first glance it seems one could just add a few MSSA hooks rather than copy-paste-modify the code base.
I also took a (late :-() look at the core MSSA design. Could you outline the high-level MSSA design in more detail than http://reviews.llvm.org/D7864? E.g. how does it compare/relate to http://www.airs.com/dnovillo/Papers/mem-ssa.pdf? Thanks!
Jun 22 2016
May 27 2016
Thanks, Geoff. I take back my previous comment. I was worried that when A and B are allocated at stack offset x and this optimization changes x, it would only change it for A but not for B. But looking deeper into the code this is not the case.
Since I couldn't spot this in the code: in the scenario where there is one empty slot, does the optimization reallocate both of them to the empty slot, or give up after it has reallocated the first object?