- User Since
- May 24 2016, 8:35 AM (138 w, 18 h)
Mon, Jan 14
Sounds good to me. Just adding Sjoerd/Sam who might know more about this code.
Wed, Jan 9
Looks sensible to me.
Tue, Jan 8
Some of D56386 seems to be mixed up in here I think?
LGTM, with a couple of minor points.
Mon, Jan 7
Perhaps we add a MIR test? To test this more directly.
Mon, Dec 31
Great stuff. This looks like it will be very useful. I have a few thoughts/questions/nits.
Fri, Dec 21
Tue, Dec 18
Looks good to me. Apologies for the delay.
Dec 13 2018
This looks find to me then, especially at aggressive opt level. IIRC, there is a test for the pass pipeline I would expect needs updating.
Dec 11 2018
Do you have any compile time numbers for this?
Dec 10 2018
Oh, yeah, that looks wrong! It should be testing a powerpc target.
Dec 7 2018
Dec 6 2018
This looks fine, of course, to go alongside D49281.
Dec 4 2018
When @dmgreen is happy with the unrolling changes, I think you're good to go.
Dec 2 2018
I ran a few bootstraps to check compile time (on AArch64). It didn't seem to make any noticable difference there, and this certainly looks like its good for codesize.
Nov 25 2018
Nov 23 2018
Thanks for the fix!
Nov 22 2018
Thanks for the fix!
Hello, Another one for you, this time to do with LCSSA not being preserved, run with "opt -loop-simplifycfg -indvars simplified.ll -S":
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128" target triple = "aarch64-arm-none-eabi"
Nov 19 2018
Hello. I have an error from this, which I think may well be a knock-on affect for the code later in the pass. Something like this:
Nov 13 2018
I'm happy to change stuff around. I believe it's the way it is at the moment because of the place this is called, half way into a constructor (making virtuals difficult) and some archs requiring different pieces of info (like aarch64 being different for JIT's). I figured if in doubt, don't change too much at once.
Hello. I agree that this looks like an improvement. Can you add a testcase?
Nov 7 2018
Cleanup using tablegen classes.
Nov 6 2018
Unfortunately, this also increased codesize a little at -Oz, which I will have to look into.
Nov 5 2018
(but can cause problems for cases where the blocks are not in the form they will appear in assembly).
I'm not sure what sort of issue you're running into here?
Nov 4 2018
Nov 3 2018
Hello! Drive by comments. I've not looked in any depth.
Nov 1 2018
Oct 31 2018
Oct 25 2018
I've managed to convince myself that this looks OK. A couple of nits depending on what you think of them.
Nice cleanup. LGTM
I was thinking about how this might affect other little cores like the A53/A55, especially around the dual issue on q registers. I don't think it will make much difference though, and the CSE benefits look like a bigger win.
Oct 24 2018
Now ignores loops that will never be executed. I also have some code that uses SCEV to calculate if the backedge count is <= 1 and allow inlining there. It doesn't seem to come up very often though and needed some plumbing to get SE's/TLI's in the right places.
Oct 22 2018
Anyway, this looks good to me.
Interesting. I didn't realise it worked like that. I had presumed that lot of passes would have to be taught about the optional defs, as opposed to them not being marked as dead correctly.
Yep, this came up a few times in the tests I ran. Looks like a nice improvement, and I don't think it complicate things too much.
Oct 18 2018
Oct 16 2018
This looks like something that would be useful for us, where some of our benchmarks are very low noise. I currently have a couple of test-ish LNT instances with changes similar to this, either changing this value or setting ignore_small=False from the daily report page.
Oct 15 2018
I think we should deal with do while in another patch.
Yeah, defo. I just need to come up with a sensible way to fix it. I feel some of this is pushing against the edges of what instcombine should be doing, but I'll keep pushing until someone tells me to stop.
Oct 14 2018
Interesting. Nice improvements. What about small trees? It would seem that any tree less that 75 nodes would always be recalculated. Do the timings you ran include things to show that this is better? Or was that just looking at larger trees at the time?
Oct 12 2018
Yeah, I agree. LGTM.
Nice idea. The rtabi seems to says:
Write functions return the value written, read functions the value read.
Yeah, that looks like similar IR to what I was looking at. The vectorised version on Skylake (https://godbolt.org/z/RBS2Os) has a lot of shuffling, perhaps that's deemed unprofitable on Goldmont?
Oct 11 2018
Oh, no. That's not what I wanted to hear. I presume we are looking at the same bit of code!
Whilst we're here, can anyone think of a good way to simplify:
(S + -32) - (-32 & (S + umax(31 - S, -32)))
That's the "do" case. I think if we distribute the -32& through the add, that with the rest of instcombine + cse + instcombine again does get us down to:
S & 31
OK, now one bit ol' matcher. Thanks for the suggestions.
This is the page I look at as reference for alive coding:
Oct 10 2018
I ran some benchmarks for thumb1, they look great.
Just added a new test, and changed some of the others around a little.
I've taken the demand parts parts out of this, and added some extra tests for the do / while and signed /unsigned cases.
Luckily, it appears that Alive can do countLeadingOnes too (although I don't see it anywhere in the sources I have).
Updated to use activeBits. Thanks for the suggestions.
Oct 9 2018
Ah, I must have been looking at the wrong bit of code. Thanks!
It turns out the other case I ran into above ((S + -32) - (32 & (S + umax(31 - S, -32)))) was from do loops, not while loops. Signed will also be different to unsigned, with signed cases not having quite as small simplified forms.
These are the cases:
With some possible simplifications:
Hello, see the part about context in https://llvm.org/docs/Phabricator.html#phabricator-request-review-web. It's easier to review things if we can see the code around the patch as well as the code in the patch.
Hello. Very nice. I don't think I can speak to much of the detail here, especially the Hexagon parts, but can you:
- Add full context to the patch (-U99999)
- Replace the copyright headers to be more "llvmy"
Oct 5 2018
Oct 2 2018
I've added a new memcpy test from the original reproducer. It's a byte memcpy (people seem to love writing those), which I think is worth focusing on because its small, but still increases codesize. It expands to:
Unfortunately, since I did this everything seems to have changed and we now end up with something like:
(S + -32) - (32 & (S + umax(31 - S, -32)))
Let me know if you see anything funny from this patch.
Oct 1 2018
For loop-noinline.ll the "Simplified" instructions (the ones that cost nothing) appear to be:
3 x phis
one of the geps in the loop
the gep outside the loop
Sep 30 2018
Yep sure, with 52177, I no longer have a motivating case for this.
Sep 28 2018