Yeah, nice. Sounds good to me.
Fri, Aug 7
Thu, Aug 6
Sounds good to me then, thanks.
Wed, Aug 5
Thanks. Sorry about that, it appears clang doesn't like this code as much as gcc did and then my internet went out whilst I was trying to figure out what was wrong.
Registers are free unless you go over a limit, and I wouldn't expect these individual cost functions to be the ones attempting to guess whether we are over that limit.
Do you mean it's the cost of a phi that is altering things? They certainly sound like they should be free most of the time. Even for codesize I would expect them to be folded away a lot of the time. You just have to get the inputs to share a register, after all.
I kind of think this should be the default (plus it's perhaps a little strange for -march=thumbv8.1-m to have a different branch cost to -march=thumbv8.1-m+mve).
Tue, Aug 4
Sounds good. Let me know if you find anything.
OK. Let's give this another go. There is one additional test update where we manage to convert a loop to a memcpy call.
Thanks. LGTM, in that this works the same as the DAGCombiner.
I was looking at this pass for something else recently. I'm not sure if the legality checks are really necessary in here. We might want to change it to just check for active lane masks and convert them to vctp's (when legal). I think that will always be better in terms of code quality than the expansion of the active lane mask, no matter if we end up transforming to a tail predicated loop or not.
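To illustrate the conversion being suggested, here is a hedged IR sketch (intrinsic names from upstream LLVM; the exact operand forms here are illustrative and may not match the pass's output exactly):

```llvm
; Before: the generic active lane mask, which would otherwise be
; expanded into a splat + vector icmp sequence.
%mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %n)

; After (when legal for MVE): the target form, selectable as a single
; VCTP32 instruction producing the equivalent tail predicate.
%elts = sub i32 %n, %index
%mask = call <4 x i1> @llvm.arm.mve.vctp32(i32 %elts)
```

The point being that the vctp form should be no worse than the expansion whether or not the loop later becomes tail predicated.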
Mon, Aug 3
The last time Sjoerd and I talked about it, we figured this wouldn't fix the issue exactly (it only fuses them during scheduling; you can still spill between the t2LoopDec and the End), and as we need a proper solution anyway this might not end up being useful on its own.
Thanks. LGTM with a very minor nitpick
Quick question - what is the expected behaviour? Do we just never expect to see a bf16 add, and if we do it's a fatal error? Or is some form of automatic promotion expected to happen?
Sun, Aug 2
Oh right. Sorry I saw that bug but didn't put two and two together somehow.
Sat, Aug 1
Thanks to Adenilson and Hans, I believe chromium has now fixed the issue on their end. I have also run another bootstrap and the llvm-test-suite, which are both still doing OK.
Fri, Jul 31
I think we are still at the behest of LSR's cost modelling, I'm afraid. This can just do slightly better at fixing up the results afterwards; it can slightly improve things when ISel comes up with something suboptimal.
Now includes a quick codesize metric, to try and detect cases where a t2LDRi12 can be shrunk to tLDRi.
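For reference, this is the kind of shrink the metric is trying to catch (a hand-written sketch; the constraints are from the Thumb encodings in the Arm ARM):

```
ldr.w r0, [r1, #4]   @ t2LDRi12: 32-bit Thumb-2 encoding, imm12 offset
ldr   r0, [r1, #4]   @ tLDRi: 16-bit Thumb encoding; needs low registers
                     @ and a word-aligned offset fitting the imm5 field
```

So keeping the base and result in low registers with a small offset halves the instruction size.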
Thu, Jul 30
Wed, Jul 29
Tue, Jul 28
Thanks. LGTM, but please give others a day in case they have comments.
Mon, Jul 27
Thu, Jul 23
Wed, Jul 22
OK. Hopefully fixed in 411eb87c7962ec817ab6bf7aa3c737a3159d2d4e. Thanks for the report/reproducer, and sorry I didn't catch it earlier.
Yeah OK. This should be simple enough. It's missing the predicate off VMUL_qr. I'll just run a quick check-all and submit the fix.
Oh. I must have missed a requires clause I suspect. I'll take a look now, and revert if I can't find the problem.
Tue, Jul 21
Cleaned up some of MVE/NEONimmAllZerosV and friends. They are now called ARMimmAllZerosV and made a little simpler where possible.
I think you could argue the cost of selects in many ways on ARM. A lot of them will be free (folded into another instruction), a lot will cost 1, some will cost 2 because of the IT or even higher on a T1 core. I think in the end, whatever looks best on benchmarks is probably the best way to go.
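As a rough sketch of why the answer varies (hand-written Thumb-2 assembly, not taken from any particular benchmark):

```
@ select i1 %c, i32 %a, i32 %b on a Thumb-2 core:
cmp   r2, #0
it    ne            @ the IT instruction is the extra cost here...
movne r0, r1        @ ...so a standalone select costs ~2; if the IT
                    @ can be shared or the move folds into a
                    @ predicated instruction that was needed anyway,
                    @ it is closer to free. A T1-only core has no IT,
                    @ so the select becomes a branch and costs more.
```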
We might want to look at migrating the target-specific code into the target, similar to what we're doing for target-specific intrinsics in instcombine.
Mon, Jul 20
Seems like a good idea to me; I can't see this causing any harm, but do you have an example of where it enables LOBs?
Add extra VMLALVps + add -> VMLALVAps folds.
Is the idea to turn this option on by default for MVE? Maybe by changing the preferPredicateOverEpilogue call?
Sun, Jul 19
Moved to constant folding.
Ah. Sure, I think I can move it there. For some reason I was under the impression that constant folding did not handle target intrinsics, but I see there are some x86 and amd intrinsics in there already.