LGTM, but please wait a day in case @arsenm has more comments.
Thu, Oct 22
Those numbers don't look too bad, but like you say it's probably worth looking into what x264_r is doing, just to see what is going on. Sanne ran some other numbers from the Burst compiler and they were about the same: some small improvements, a couple of small losses, but overall OK. That gives us confidence that big out-of-order cores are not going to hate this.
The original tests were on an in-order core, I believe, which from the optimization guide looks like a sensible target for this. And the option doesn't seem to be messing anything up especially.
Hazard recognisers are not really my area, but this looks like a straightforward refactoring to me. Two nits: one is a style issue inlined, the other is just a quick question: do you plan to use this extra flexibility any time soon?
Many thanks for reviewing!
This looks consistent with the other pipeline, it's opt-in, and gives a nice uplift, so LGTM.
Wed, Oct 21
FWIW, I really dislike these pipeline tests because some of them are actually very tricky to update and I doubt they provide any useful information. But agreed with Dave that for consistency such a test would probably be best (in case someone finds it useful).
Cheers, typo fixed: C2->isNonPositive() -> C1->isNonPositive(). And the test case added.
I have added support for the other SLT case and have added tests, which I have not precommitted this time (hopefully the new cases are easy to spot), so the SLT case is now completely covered. After this, I can quickly follow up to address the other predicates, if that is okay.
- Precommitted 3 tests for testing commuting the operands in rGe86a70ce3def,
- And rebased this on that again.
- Precommitted the extra tests in rG782b8f0d38c9.
- Rebased this on that.
Tue, Oct 20
Yep, cheers, hopefully SPEC is better and more conclusive. The 1.2% uplift in one benchmark was on bare-metal AArch64; I will check if I can run some more things on that too.
Okay, so my last results basically show noise, as I had forgotten to do a rebuild between the runs. The results now look less convincing... looking into it.
Ah, wait a minute, I am doing the last experiment again, just double-checking that I haven't made a mistake running them.
Just curious if there's something in particular you are concerned about? I am just asking because then I can focus on that.
Thanks for your help.
- This is using your function, the pattern matching, which is indeed much nicer,
- and I think all tests are there, positive and negative. For example, for the positive tests, I have added all the combos and the different data types, including a vector test; and for the negative tests, there are indeed tests with a different predicate, with the nsw on the 'wrong' add, etc.
I've run CTMark, filtering out the tests that run very briefly with --filter-short, and see the same trend, i.e. no regressions and some okay improvements:
Mon, Oct 19
Cheers guys, that's fair, will give the llvm test suite a try too.
Fixed precondition, and added a test case for that.
This seems to be missing some other preconditions on the constant values:
Sat, Oct 17
Fri, Oct 16
Forgot about this one, but agree with this.
Can we add tests for this?
I've got a slightly different proposal. This moves the loop flatten pass into IndVarSimplify for several reasons:
- loop-flatten is best run just before IndVarSimplify because IndVarSimplify can promote induction variables. For the overflow analysis that decides whether loop flattening is legal, it's best if induction variables haven't been promoted yet.
- When the induction variables of a loop nest don't use the maximum legal integer type, we promote them to the widest type, so we know loop flattening is safe and can avoid the overflow analysis. Promoting induction variables is what IndVarSimplify was already doing, so this reuses that.
- Last but not least, for the loops that loop-flattening supports, induction variable simplification is exactly the point of this transform, so IndVarSimplify looks like a good home for it. This also avoids quite some churn: modifications to LoopUtils (where refactored/shared code could live) and to both of the passes.
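For reference, the transform under discussion rewrites a two-level loop nest into a single loop over the product of the trip counts. A minimal sketch of the before/after shapes (plain Python, not the pass itself; the legality concern is that the linear index i*N+j must not overflow, which is why widening the induction variables first helps):

```python
def nested(M, N):
    # Original loop nest: outer induction variable i, inner j.
    # The pass looks for the linear index i * N + j.
    out = []
    for i in range(M):
        for j in range(N):
            out.append(i * N + j)
    return out

def flattened(M, N):
    # Flattened form: a single induction variable over M * N iterations.
    # Legal only if i * N + j cannot overflow the induction variable type.
    return list(range(M * N))

assert nested(3, 4) == flattened(3, 4)
```

The assertion just demonstrates that both forms visit the same linearised index sequence.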
Thu, Oct 15
I think Florian answered those questions; that indeed looks like the most sensible way forward then.
I do see the elegance of just feeding the epilogue to the vectoriser again, but I also have sympathy for not pushing the responsibility for the cleanup down the line to something else, especially if it is non-trivial. But to progress this discussion, I was wondering if we can say something more about this:
Wed, Oct 14
I guess this is similar to the widening that IndVarSimplify does? Can we just re-use the stuff from there or have IndVarSimplify just do it for us?
We might be able to salvage this by adding a check that the GEP dominates the loop latch?
Tue, Oct 13
Thanks for reviewing.
- I have precommitted the tests in rG66f22411e1bb, and
- use APInts in the comparisons.
Fri, Oct 9
Insert loop start at end of block in more cases
Hmm. Just a quick check - do we want that? I can see it improves some tail predication cases, which is good. But do we want it in general? The DLS instructions have a latency like any other, and earlier is better from that perspective. Or are we assuming that latency will never matter by the time we reach the LE instruction?
Thu, Oct 8
I think it would be really good if we could avoid implementing all this overflow detection from scratch.
Is there nothing in SCEV already that does this?
Cheers, looks like a good change to me. Perhaps wait a day with committing just in case there are more/other opinions on this.
Wed, Oct 7
Just some minor questions inline.
Tue, Oct 6
Apologies for this. The fix is just a matter of updating the two failing tests. What is the process for getting the change back in?
If it's just a trivial update of a test, you can just go ahead and recommit it. If you're in doubt, you can always put it for review again.
Mon, Oct 5
Fri, Oct 2
Thanks for reviewing!
Thu, Oct 1
Thanks Sam. And in addition to what you wrote, this is not enabled by default.
I plan to iterate on this in-tree, and then we can start thinking about the default, but that's TBD I guess.
I haven't looked in much detail at this patch, but this looks like some straightforward lowering of llvm.experimental.vector.reduce.add. Absolutely nothing wrong with that, but I am curious who's going to produce this intrinsic? The vectoriser, the matrix pass? In other words, any ideas on the bigger picture?
Wed, Sep 30
I've added the test cases from PR40581. Test v0 does not trigger yet, test v1 triggers. I propose adding support for v0 once we've got something in-tree.
Looks like a good fix to me.
Arg, silly! Thanks for letting me know.
With Dave's and Oliver's permission I am commandeering this because I really would like to see this getting committed soonish and I have some bandwidth to progress this.
Tue, Sep 29
If Sam has no further questions, this looks good to me.
Mon, Sep 28
Thanks Dave. With D88086 committed now, I don't think there's anything in our way anymore.
Fri, Sep 25
I am so happy that this approach works! I.e., this determines equality of TC and ElementCount by calculating two SCEV expressions, subtracting them, and testing the result for 0. A check for the base of the AddRec has also been added now, so I think this addresses all comments.
Thu, Sep 24
I wanted to write the new checks in a separate patch as I thought it would be a new lump of code, and wanted to get this cleanup out of the way first, but given our last idea it is probably best to continue here. I.e., the TC == (ElemCount+VW-1) / VW check is hopefully just a minor addition.
Sep 24 2020
Okidoki, nice one
Thanks, perfectly clear, LGTM.
Looks good, but beyond the nits I have one question inlined asking for an explanation of why we are doing this; I'd be interested to have a read first.
Actually, I guess if you could prove that the tripcount is precisely equal to (ElementCount + VectorWidth - 1)/VectorWidth, you could also use that to prove the subtraction doesn't overflow.
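The arithmetic behind that claim can be sanity-checked numerically. A throwaway Python sketch (the names are mine, not from the patch): if TC is exactly the ceiling division, then the number of elements left for the final iteration always lies in [1, VW], so repeatedly subtracting VW from the remaining element count never wraps below zero before the loop exits.

```python
def vector_trip_count(elem_count, vw):
    # TC = (ElementCount + VectorWidth - 1) / VectorWidth, i.e. ceil division.
    return (elem_count + vw - 1) // vw

# If TC is precisely this ceiling, the element count handled by the last
# iteration, elem_count - (TC - 1) * vw, is always in [1, vw]; every
# earlier iteration still has at least vw + 1 elements remaining, so the
# running subtraction of vw cannot underflow.
for ec in range(1, 200):
    for vw in (2, 4, 8, 16):
        tc = vector_trip_count(ec, vw)
        last = ec - (tc - 1) * vw
        assert 1 <= last <= vw
```

This is only a numeric illustration of why the TC identity implies the subtraction is overflow-free, not the SCEV-based proof itself.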
This sounds like the same suggestion that I made many moons ago... I suggested taking these values, substituting them into the expected SCEV expression, and then performing some SCEV algebra on it and the vector TC expression, until hopefully they both just reduce to ElementCount == ElementCount. My quick prototype 'worked', but I don't know if that says much.
Sep 23 2020
Thanks for looking Eli.
Sep 22 2020
Sorry, I wrote a reply at the end of last week, but apparently forgot to push submit. Please see my reply inline; I will open a new review soon, where it's probably best to continue this discussion.