- User Since
- May 11 2015, 7:59 AM (296 w, 5 d)
Sorry for the big delay Yvan, I haven't really been working on LLVM recently.
Dec 17 2020
Dec 11 2020
Dec 10 2020
Dec 8 2020
Please move the if (isa<ScalableVectorType>(RetTy)) return BaseT::getIntrinsicInstrCost(ICA, CostKind); bailing out logic into the beginning of each switch branch that does not work/is not currently tested. These can be removed on a case-by-case basis in the patches that fix these branches.
Dec 3 2020
Dec 2 2020
Dec 1 2020
Okay, fair enough.
Nov 30 2020
I'd prefer if we didn't knowingly create broken paths here... does it currently just crash in the AArch64 backend or are there also failures in the generic parts? I'm thinking that we could just have a separate switch statement which checks for supported opcodes for scalable vectors and protects us from going down bad paths. A TODO comment could also be added to document why and what needs fixing up.
Nov 26 2020
I think it would be a good idea to test that the paths through getCmpSelInstrCost and getArithmeticInstrCost work too.
Nov 24 2020
Intuitively, it sounds unlikely for getExitCount to return a pointer type but I'm not sure.
Indeed, thanks for checking. LGTM, thanks.
Nov 23 2020
This looks sensible to me and a step in the right direction. Have you ran any benchmarks? Would be good to know, at least, that this isn't causing any obvious regressions.
Nov 20 2020
I'd be happy with an assertion that DoLoopStart is connected properly, but WhileLoopStart is still my real concern with this logic.
So my thoughts:
- For the case where LRDef is not null, we should be able to check that this same value is used by 'Dec'. That way we have proved explicitly the the use-def chain is correct instead of relying on the check of the live-out value.
- For any the case where LRDef is null, I'm pretty sure we just need to try harder to find it - something needs to be setting LR for loop entry.
- My issue with the isSafeToDefRegAt is that is a local check. If WLS is in the preheader predecessor, we could decide we that defining LR is safe (locally), but then LR is redefined in the preheader.
Nov 19 2020
I wonder whether the zero-extend should also be moved to isHardwareLoopCandidate at this point.
Nice, sounds good to me.
Out of curiosity, did you try performing the +1 in HardwareLoopInfo::isHardwareLoopCandidate instead? Looking back through the code, I think it would be more clear to rename 'ExitCount' to 'TripCount' and perform this addition outside of the pass.
Nov 18 2020
I'm happy if Dave has no other suggestions.
Sounds good to me, I would have been in favour of just completely disabling wls for now :)
Nov 17 2020
Nov 16 2020
But now I understand that this isn't an NFC! So can you add a test please?
Thanks. It's okay to commit cleanup NFCs without review, so don't worry about these kind of patches up for review in the future.
Nov 12 2020
Ok, cheers! LGTM
Nov 11 2020
Getting phis to use the right register class sounds like a good generic change to make. If creating them like that is a pain, how about just moving most of this logic into OptimizePhis?
Explicit sounds good.
Nov 10 2020
Nov 9 2020
Nov 6 2020
I think that the code would still need a t2DoLoopStartTP instruction, so those change would all have to stay
Agreed, and I think this pseudo is very good idea.
Nov 5 2020
and the MVE costs should be our problem not yours :)
Indeed! Thanks both for the explanations.
That's all about different patches though! Any comments on this one?
Sorry, it's just difficult to gauge this without a view on the wider implementation idea. For the other stuff that you're talking about, of course we need that in MI but, like I said at the start, we could currently achieve this patch with probably 10x less code at the IR level.
With the MVE instructions taking LR, maybe you wouldn't need to make the terminator change, have you tried it yet? I guess I'm naively assuming that the register allocator will be much less likely to spill something that is used by most / all of the MVE instructions within the loop.
Nov 4 2020
I would like to put here start with combining t2LoopDec and t2LoopEnd
Ok, fair enough.
Nov 3 2020
Okay, fair enough! I was getting confused that this is still throughput only.
Thanks for doing this. But now I'm surprised to not see any Arm changes?
Things like reverting calls is easier to do at this point.
What calls do you mean..?
What specifically do you need from the pre-RA MachineInstr form that means this shouldn't be done at IR level? The search here is certainly not more simple and manageable than doing this MVETailPredication, where the components are also found, and handling phis at this level is just an added pain.
This sounds like a good idea to me, but if we're going to do stuff earlier, to make our lives easier, maybe we should do this at the IR level and use a target-specific intrinsic and free ourselves completely from searching at the MI level?
Cost += 2 * thisT()->getCmpSelInstrCost( BinaryOperator::ICmp, OverflowTy, OverflowTy, CmpInst::BAD_ICMP_PREDICATE, CostKind);
It looks like this second ICmp should be a select/
Nov 2 2020
Looks like the Arm backend makes almost no attempt to cost intrinsics, so I'm assuming that BasicTTI is basically producing everything here... So, it looks like there's some scalarization cost missing in this nebulous code?
Oct 28 2020
Oct 27 2020
Oct 26 2020
Probably best to wait a little while for other's feedback, but this LGTM.
Oct 23 2020
I'm kind of surprised these sort of hints aren't handled in tablegen, but I guess not much c++ is needed. Looks like it works well enough too, cheers.
Oct 22 2020
Many thanks Dave, LGTM
Oct 21 2020
Having separate loops sounds good to me, but please remove the extra conditional before committing. Thanks.
A heroic effort to change all those tests! Are you planning on also updating WLS and the test_set intrinsic too..?
$lr = t2DoLoopStart renamable $rn
Oct 19 2020
How about results from the LLVM test suite? This should benefit size and performance?
Yeah, I'm happy to do that for you.
Oct 15 2020
I'm sorry for not finishing what I started with this intrinsic nonsense... Any simplification of these winding paths sounds great to me.
Sorry, I didn't realise this had been updated.
Oct 9 2020
I wasn't so much concerned about latencies, but I was guessing that by moving the instruction, the goal is to get some more tail-predication "for free".
Indeed, this greatly simplifies the effort required to ensure that we can generate the LSTP version. And I can't see the latency of DLS having any real affect on the performance of a loop, especially not compared to all the other things that can go bad!
Oct 1 2020
Can we rename the function to something like "ClobbersPredicate", if that's what we actually care about here?
+1 from me. I also wonder renaming IncludesRemovable to something like Ignore/SkipDead?
Maybe we could use SCEV for the overflow checks? ;P
Code looks good to me and we've been running this downstream for a few years.
Removed the code that later tries to move stuff around.
Sep 30 2020
Checking result of TryRemove.