Fri, Apr 9
Thu, Apr 8
I think Dave also argued that this patch makes a lot of sense. So I think all that is left to do is addressing the previous nits.
Wed, Apr 7
Thanks for your comments. I will take over this work, but will first address the remarks in D93762. After that, I will return to this and then try to address your remarks asap.
Sorry for the delay. Mostly nits inlined, plus one question about missing f16 tests.
Tue, Apr 6
Thu, Apr 1
Ah okay, thanks, got it! Makes perfect sense to me.
I will start looking into that.
Thanks, and I will wait a few days with committing.
Adjust this to D99710, which uses movi d0 to zero 64 bits rather than 128 bits, which enables this as a default for all cores.
By just looking at this patch I find it a bit difficult to get an overview of all moving parts involved. I.e., this probably makes sense:
I know you've worked on this for a while and investigated different strategies, but I think we also need to argue here why we would like to emit a memcpy loop instead of e.g. having optimised versions in the clib. In other words, is this the best we can do for all different alignments, sizes, etc.?
Now using "movi d0".
I think the exact suggestion was to use MOVID instead. I'm not sure how much it matters, but it may be a simpler instruction for some cores. This would then match what GCC emits.
Nice one, thanks.
Wed, Mar 31
After some more discussions, it turns out the original revision was doing the right thing, except that we should be using the .2s variant as that may be more efficient on some cores.
Rebase to get this applied and compiling again.
Taking over this work.
This sets FeatureNoZCZeroingFP for some older cores.
Tue, Mar 30
Fair enough, let's refrain from micro-architectural details. But the point is that zero-cost zeroing idioms are supported on integer operations, which is why this is preferred. This should always give the same or better performance, but it looks like you found a bit of a corner case with dual issuing, which is a bit surprising but perhaps makes some sense for smaller in-order cores. I will add FeatureNoZCZeroingFP to the A55's description.
Just a query on the context of this work: this wasn't enabled at that time because of some regressions. How does that look now? Does this work rely on some fixes to address that, or has the picture changed?
Wed, Mar 24
Hi Stelios, many thanks for putting this together, good stuff.
I will do a code review a bit later, but as there's potential for some corner cases here, first a testing question. Did you do a bootstrap build and, e.g., run the LLVM test suite?
Yep, agreed, seems reasonable. Thanks for fixing.
I guess that must have been the C89 programmer in us... ;-)
Tue, Mar 23
Sorry for the delayed response. Please look at the latest changes, which contain the cost model. I have also shared the SPEC CPU 2017 benchmark numbers for the various optimization modes we have added.
Mon, Mar 22
Thanks Dave, forgot about the unsigned variants, but have added them now as well as the predicates.
Fri, Mar 19
Thu, Mar 18
Yeah, thanks Dave, I think I will be looking a bit more in this area, but this is a start...
Wed, Mar 17
This now generates cinc, which is even better.
Thanks for the suggestion Dave. Just did the pen and paper exercise and agree that:
Tue, Mar 16
This looks very reasonable to me.
Looks reasonable to me. Perhaps wait a day in case @RKSimon has more comments.
And the i64 costs go up because they are not legal ops? LGTM.
Okay, it's a bit of an indirect way, but fair enough I think.
Makes sense to me, but I was just wondering whether you have seen any regressions. If the constant is hoisted, it could contribute to higher register pressure and spilling?
Mon, Mar 15
Thanks, nice one, LGTM.
Mar 12 2021
Mar 11 2021
Thanks, that definitely helped.
Mar 10 2021
I have also become interested in this work as this regularly shows up as a difference between clang and gcc. For the first attempt to add a function specialisation pass, in D36432, questions were quite rightly asked about compile times and code size, and these questions were repeated for this work. The table https://reviews.llvm.org/D36432#836883 shows good numbers, and I think in order to progress this work we need something similar. I.e. the approach here is a bit different and things may have changed. But this is also mentioned in the description:
Thanks, looks very reasonable to me.
Mar 9 2021
Mar 5 2021
However, we would like to have slightly different behavior for fp16 than ACLE: the evaluation format of fp16 is set to be the same as for _Float16, which means no promotion is performed if there is no hardware half-precision support.
This is fixing https://bugs.llvm.org/show_bug.cgi?id=47829
Yeah, that's fine, if they are missing, we could do that separately.
Sorry, one more question. This implements the vector variants. How about the scalar ones? Have they been implemented already?
Mar 4 2021
Looks okay to me.
Mar 3 2021
Mar 2 2021
Ah, Dave's remark is a good one. These intrinsics are available in both AArch64 and ARM, and I missed that you covered only ARM here; Dave meant to add these tests for AArch64 too.
This is the Clang part. Do we need to add anything for LLVM, for example tests?
Mar 1 2021
I am abandoning this in favour of D89693, which I have repurposed to address this, because most of the discussions happened there.
This is changing our approach to preferring pre-indexed addressing modes, because:
- this is what we want for runtime unrolled loops, which is what we want to address next.
- post-indexed is currently better for some cases, but that's mainly because we miss an opportunity in the load/store optimiser. With that fixed, the expectation is that pre-indexed gives the same or better perf than post-indexed.
- for what it is worth, pre-indexed is also the default for ARM.
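As a rough illustration of the two addressing modes discussed above (not taken from the patch), the C loops below show the access patterns that typically map onto them. The function names `sum_pre` and `sum_post` are hypothetical, and the AArch64-style mnemonics in the comments are only examples; which form the backend actually selects depends on the target and on later passes such as the load/store optimiser.

```c
#include <stddef.h>

/* Advance the pointer, then load: a natural fit for a pre-indexed
   load, e.g. "ldr w8, [x0, #4]!" in AArch64 syntax. */
int sum_pre(const int *p, size_t n) {
    if (n == 0)
        return 0;
    int sum = *p;                 /* first element, no update needed */
    for (size_t i = 1; i < n; ++i)
        sum += *++p;              /* update the address, then access */
    return sum;
}

/* Load, then advance the pointer: a natural fit for a post-indexed
   load, e.g. "ldr w8, [x0], #4" in AArch64 syntax. */
int sum_post(const int *p, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += *p++;              /* access, then update the address */
    return sum;
}
```

Both functions compute the same result; they differ only in where the address update sits relative to the memory access, which is the choice the addressing-mode preference influences.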
Feb 25 2021
Thanks for commenting Dave. I will have another go at this, and try to come up with a better analysis, at least one we understand.
Feb 23 2021
Sorry, I missed the earlier message, but this has now been committed.
Feb 22 2021
This abandons the idea of looking at the IV, and incorporates D89894 to look at the pointer uses.