- User Since: May 11 2015, 7:59 AM
Tue, Apr 13
Mon, Apr 12
Makes sense to me.
Tue, Apr 6
Thu, Apr 1
Cheers! And no thanks - I've had commit access for a few years.
Wed, Mar 31
- Moved PatLeaf to top-level file.
- Added tests for missing FP conditions.
- Added switch tests.
Can you add tests that do explicit xors in the IR to show that it is not folded out where it shouldn't be?
Well this is what I mean, that I don't think this is possible. Writing a test in IR, the br has to take an i1, though this is expanded to i32 during codegen. I could write some IR tests with xors, but they should all be valid. AFAIK, I'd need a way of writing a test input in selection dag form to write a negative test. I have missed some FP conditions in the existing tests, so I'm going to add those.
Wed, Mar 24
Thanks all. I've added a PatLeaf to detect a boolean, but I wasn't sure how to write a negative test using LLVM IR considering that the branch always takes an i1. Any ideas?
Tue, Mar 23
Mar 10 2021
Feb 17 2021
The X86 test is indeed large! Considering that it's just an unrolled loop with the same sequence repeated MANY times, could you not just delete 3/4 of the loop and update the indvar accordingly?
Feb 16 2021
Maybe? Not something that I'll be continuing with though.
No, don't think I have the time/energy to possibly introduce a bug :)
Feb 15 2021
Nothing else from me.
Feb 12 2021
But why not just order the logic so that it is an NFC?
Shouldn't this be an NFC? Looks like we should be favouring post-incs for MVE, even at optsize.
Sounds good. How about unifying it to one call to get the type of preferred indexing: none, pre and post?
Feb 10 2021
Thanks! Yes, agreed. I've also added a couple of tests before committing, just to check that unrolling is indeed disabled at -Os and -Oz.
Feb 3 2021
I've spent the day looking at lzma and my conclusions are that some loops are vectorized and/or unrolled but they're not on the hot path, and now I believe the improvement I saw on native was purely from secondary side effects. Loop unrolled code is ~4% larger and -msimd128 enabled is ~2.5%, so I'm still surprised you're not seeing a change there.
My node results are far more stable than wasmtime, on both x86_64 and aarch64 with and without liftoff. Those results correlate with yours and show no benefit of loop unrolling, but I'm seeing ~4% binary size increase too. So I'm surprised and confused and I guess I will investigate further.
And to qualify that, native execution runs vary by 10s of msec and -fno-unroll-loops is ~9% slower.
Thanks @tlively. I just took a look at lzma and yes, it looks like it should benefit from unrolling... What runtime are you running on? I've just done a few runs on wasmtime and, frankly, the benchmark looks far too noisy to be useful: my execution times are varying between ~6.5-8 seconds!
Feb 2 2021
Gentle ping... and I'd like to clarify my point about code size: Wasm is actually no different to any other backend in that respect, where the user can dictate whether it is the most important factor through optimisation levels. The ARM backend is a good example of this where we spend a considerable amount of time tuning our microcontrollers for both -Oz and -O3. The usage of optimisation levels nearly always requires some education, but I think this is easier than re-implementing loop unrolling elsewhere and/or letting users figure out how to enable sensible unrolling via the command line.
Jan 27 2021
Jan 26 2021
But if we do this I'd recommend we at least update changelogs and docs in preparation, to try to minimize the annoyance for users.
Sure and users should definitely be recommended to compile at -Os if code size is their most important metric. From my, admittedly limited, benchmarking the difference between the optimisation levels is quite negligible, yet -Os can reduce size without suffering the performance impact often associated with -Oz.
Jan 25 2021
I guess these costs look more reasonable compared to their scalar counterparts, though I don't think we have cttz for MVE. But as a general query, if these are being scalarized why isn't the cost 16? It looks like it could just be the accumulation costs of the scalar parts. Or have I missed something?
I'm inclined to merge this and leave it to users for whom the code size increase is unacceptable to use -Os/-Oz/-fno-unroll-loops to disable it.
The unrolling would be disabled at -Os and -Oz, see
Jan 22 2021
Thanks for taking a look!
Jan 21 2021
Jan 18 2021
Jan 15 2021
Sorry for the big delay Yvan, I haven't really been working on LLVM recently.
Dec 17 2020
Dec 11 2020
Dec 10 2020
Dec 8 2020
Please move the bailing-out logic (if (isa<ScalableVectorType>(RetTy)) return BaseT::getIntrinsicInstrCost(ICA, CostKind);) into the beginning of each switch branch that does not work or is not currently tested. These can be removed on a case-by-case basis in the patches that fix these branches.
Dec 3 2020
Dec 2 2020
Dec 1 2020
Okay, fair enough.
Nov 30 2020
I'd prefer if we didn't knowingly create broken paths here... does it currently just crash in the AArch64 backend or are there also failures in the generic parts? I'm thinking that we could just have a separate switch statement which checks for supported opcodes for scalable vectors and protects us from going down bad paths. A TODO comment could also be added to document why and what needs fixing up.
Nov 26 2020
I think it would be a good idea to test that the paths through getCmpSelInstrCost and getArithmeticInstrCost work too.
Nov 24 2020
Intuitively, it sounds unlikely for getExitCount to return a pointer type but I'm not sure.
Indeed, thanks for checking. LGTM.
Nov 23 2020
This looks sensible to me and a step in the right direction. Have you run any benchmarks? It would be good to know, at least, that this isn't causing any obvious regressions.
Nov 20 2020
I'd be happy with an assertion that DoLoopStart is connected properly, but WhileLoopStart is still my real concern with this logic.
So my thoughts:
- For the case where LRDef is not null, we should be able to check that this same value is used by 'Dec'. That way we have proved explicitly that the use-def chain is correct instead of relying on the check of the live-out value.
- For the case where LRDef is null, I'm pretty sure we just need to try harder to find it - something needs to be setting LR for loop entry.
- My issue with isSafeToDefRegAt is that it is a local check. If WLS is in the preheader predecessor, we could decide that defining LR is safe (locally), but then LR is redefined in the preheader.
Nov 19 2020
I wonder whether the zero-extend should also be moved to isHardwareLoopCandidate at this point.
Nice, sounds good to me.
Out of curiosity, did you try performing the +1 in HardwareLoopInfo::isHardwareLoopCandidate instead? Looking back through the code, I think it would be more clear to rename 'ExitCount' to 'TripCount' and perform this addition outside of the pass.
Nov 18 2020
I'm happy if Dave has no other suggestions.
Sounds good to me, I would have been in favour of just completely disabling wls for now :)
Nov 17 2020
Nov 16 2020
But now I understand that this isn't an NFC! So can you add a test please?
Thanks. It's okay to commit cleanup NFCs without review, so don't worry about putting these kinds of patches up for review in the future.
Nov 12 2020
Ok, cheers! LGTM
Nov 11 2020
Getting phis to use the right register class sounds like a good generic change to make. If creating them like that is a pain, how about just moving most of this logic into OptimizePhis?
Explicit sounds good.
Nov 10 2020
Nov 9 2020
Nov 6 2020
I think that the code would still need a t2DoLoopStartTP instruction, so those changes would all have to stay.
Agreed, and I think this pseudo is a very good idea.
Nov 5 2020
and the MVE costs should be our problem not yours :)
Indeed! Thanks both for the explanations.
That's all about different patches though! Any comments on this one?
Sorry, it's just difficult to gauge this without a view on the wider implementation idea. For the other stuff that you're talking about, of course we need that in MI but, like I said at the start, we could currently achieve this patch with probably 10x less code at the IR level.
With the MVE instructions taking LR, maybe you wouldn't need to make the terminator change, have you tried it yet? I guess I'm naively assuming that the register allocator will be much less likely to spill something that is used by most / all of the MVE instructions within the loop.
Nov 4 2020
I would like to start here with combining t2LoopDec and t2LoopEnd.
Ok, fair enough.
Nov 3 2020
Okay, fair enough! I was getting confused that this is still throughput only.
Thanks for doing this. But now I'm surprised to not see any Arm changes?
Things like reverting calls is easier to do at this point.
What calls do you mean..?
What specifically do you need from the pre-RA MachineInstr form that means this shouldn't be done at the IR level? The search here is certainly no simpler or more manageable than doing this in MVETailPredication, where the components are also found, and handling phis at this level is just an added pain.
This sounds like a good idea to me, but if we're going to do stuff earlier, to make our lives easier, maybe we should do this at the IR level and use a target-specific intrinsic and free ourselves completely from searching at the MI level?
Cost += 2 * thisT()->getCmpSelInstrCost(BinaryOperator::ICmp, OverflowTy, OverflowTy, CmpInst::BAD_ICMP_PREDICATE, CostKind);
It looks like this second ICmp should be a select.