- User Since
- Jan 19 2017, 12:49 AM (349 w, 2 d)
Oct 12 2022
Oct 10 2022
Aug 19 2022
Aug 17 2022
Aug 16 2022
Hehe, I don't see why not but that feels like another potential improvement rather than a fix ;) I'll go ahead and commit this to unblock @paulwalker-arm if that's OK?
Fix posted in https://reviews.llvm.org/D131948
Aug 15 2022
Aug 11 2022
Fixed it in a NFC.
Aug 10 2022
Had not run this for ARM and there is a testcase there that also benefits from this change moving the zext up to the loads outside of the loop.
Is that somewhat clearer? I suspect the reason we never see a failure is because we 'replace' the input, rather than create a new ZExt. But like I said in the comment, it get's deleted. And as it is the trunc would just stay there dangling and get in the way of things.
I had forgotten to build for ARM too, which had similar testisms as AArch64 where this patch removes unnecessary Truncates. Updated tests.
I'm just gonna rebase this to see if I also get the failure on TypePromotion/ARM/calls.ll that the bot is reporting.
Aug 9 2022
Aug 8 2022
May 23 2022
I believe that 'scheduling doesn't matter for Out of Order cores' is a long-standing myth that we've seen not to be correct. Yes, scheduling is definitely different for out of order cores, the problem shifts towards thinking about a sliding window of instructions that go into specific pipelines and dispatch queues and the likes. And you now find yourself trying to optimize the utilisation of pipelines, avoiding bubbles, rather than looking for the 'perfect sequence'. Out of order execution has some other limits and in some cases it helps if the compiler can lend the core a hand. In this case the Neoverse N1 prefers ascending STP Q's and an updated Neoverse N1 Software Optimization Guide will be reflecting this.
May 17 2022
May 12 2022
May 11 2022
Feb 7 2022
Dec 17 2021
I'm working on an aarch64 optimised version and I came across something that might be of use to you too. I found that the Repeated implementation of Move was yielding sub-optimal code in the large loop, it would load a _64 element in reverse (last 32-bytes first), I believe this was a side-effect of how it was stacking the loads and stores in opposite order like:
Load (src + 8)
Load (src + 16)
Load (src + 32)
Store (src + 32)
Store (src + 16)
Dec 6 2021
Thanks for the suggestion @samparker . I hadn't look too much into it after I found that the ICmp promotion leaves the code in a somewhat 'dirty' state.
For instance if you take the motivating example from Snappy, the BB before the one with the PHI + ZEXT has an icmp that feeds into the relevant PHI node and it ends up yielding:
%tag.0.in10 = phi i32 [ %2, %for.body ], [ %0, %for.body.preheader ] %1 = trunc i32 %tag.0.in10 to i8 %tag.0 = zext i8 %1 to i64
Nov 30 2021
I forgot to mention this address https://bugs.llvm.org/show_bug.cgi?id=51317 which is a suboptimal codegen issue encountered in snappy. Currently Snappy uses an inline asm workaround to make sure it doesn't end up with an extra 'and', this patch removes the need for that workaround.
Nov 18 2021
Coming back to this as I abandoned the approach in D112300, since that was deemed too early to be consulting target information.
Abandoning this in favour of reworking D111237.
Nov 1 2021
Oct 25 2021
Fair enough, out of curiosity, where would you think this type of sinking should live in the optimization pipeline?
@spatel Yeah that's where I first made the change, but as I was working on it I noticed the original transformation was being done in InstCombine and as a practice I prefer to prevent transformations rather than undo them. I am guessing we are not in agreement on that. I've started reworking the original approach to do it in TypePromotion.
@lebedev.ri what do you suggest I do? I am not entirely sure I agree that sinking instructions like this is canonicalization, but that aside, separating these Loads and Exts is making codegen worse for targets that support widening loads. Targets that don't would probably be better off with the sinking transformation. This is why I'd like to use TTI here.
Oct 22 2021
Abandoning this in favor of D112300, I think that it would be better to try to avoid the sinking of the Exts rather than trying to undo them this late.
Oct 21 2021
Has been applied but I forgot to include the Differential Revision.
Ah ... and I forgot to add the Differential Review to the commit... So I'll close this manually.
Oct 20 2021
Took me three attempts, will apply after the pre-merge checks and build are done.
Oct 19 2021
Oct 18 2021
Oct 11 2021
I found that other than the motivation example I showed above coming up with C-level examples, especially for cases where I don't think transforming is a good idea is difficult. I'll see if I have better luck with IR tests.
Oct 6 2021
Sep 23 2021
Sep 17 2021
Are dc zva and mrs broadly available on aarch64 or should they be guarded by some #define?
They are available for any AArch64 ISA, so no define required since this is for AArch64 only.
Aug 12 2021
Yeah no problem!
Aug 10 2021
Aug 4 2021
Aug 3 2021
Jul 29 2021
Jul 23 2021
Jul 7 2021
I think I've addressed all the comments and style issues.
Jul 5 2021
May 6 2021
May 5 2021
Apr 28 2021
Apr 23 2021
Alright I'll give it a wait, I was currently trying to figure out why I can't build the libc_string_unittests, getting a build error around elements_test.cpp saying there's an issue with the 'cstring' header file 'no member named 'strcmp' in the global namespace' which is an odd one ... What libc++ library do you use to build this normally?
I'm going to try to rewrite my memcmp implementation using these now and report findings as I go through it :).
Apr 22 2021
Feb 4 2021
Accidentally linked my commit of https://reviews.llvm.org/D92236 to this revision. Apologies.
I've committed the patch in https://reviews.llvm.org/rG369f7de3135a517a69c45084d4b175f7b0d5e6f5 but it seems when copying the revision link I may have hit ctrl + x in vim and linked it to D92235 :(