LGTM with a nit. Can you also remove FPRegs16Pat from ARMInstrFormats.td now that it is no longer used?
Jun 23 2020
Hi Simon, thanks for working on this. Looks good overall. A few remarks inline.
Jun 18 2020
Addressed last round's review comments.
Jun 17 2020
Changes from last revision:
- the code generation relies on fullfp16 being present,
- the unit test also checks the codegen for soft float abi
Jun 16 2020
Jun 11 2020
Jun 9 2020
Jun 8 2020
Hey Oliver, thanks for looking at this.
I believe the codegen patterns for vmov and load/store half are incorrect for the bf16 type. Can someone suggest the right approach?
Jun 4 2020
Jun 3 2020
Jun 1 2020
May 28 2020
May 26 2020
Should poly128_t be available on AArch32 too? I don't see anything in the ACLE version you linked restricting it to AArch64 only, and the intrinsics reference has a number of intrinsics available for both ISAs using it.
It should, but it is not that simple. The reason it is not available is that __int128_t is not supported in AArch32. I think that is future work, since this patch unblocks the bfloat reinterpret_cast patch, which btw is annotated with TODO comments regarding the poly128_t type for AArch32.
May 20 2020
Jan 20 2020
Sep 29 2019
ARM and AArch64 have a way to list the implied target features using the TargetParser but we can't directly use that in CodeGenModule because it's tied to the backend.
Sep 27 2019
However, passing the AArch64 architecture names in target-cpu isn't supported by LLVM
The Clang documentation suggests that arch is used to override the CPU, not the Architecture (which is rather confusing if you ask me). GCC makes more sense, having separate target attributes for CPU and Architecture (see the equivalent GCC documentation). I think target-cpu should remain generic when it is not explicitly specified either on the command line (-mcpu) or as a function attribute (i.e. target("arch=cortex-a57")). However, if the function attribute specifies an Architecture (i.e. target("arch=armv8.4a")), I agree we should favor the subtarget features corresponding to armv8.4 over those of the command line. Similarly, we should favor the subtarget features corresponding to cortex-a57 (not sure if we do so atm - I think we don't). ARM and AArch64 have a way to list the implied target features using the TargetParser, but we can't directly use that in CodeGenModule because it's tied to the backend.
Sep 26 2019
Updated the FileCheck labels as suggested.
Sep 25 2019
Sep 23 2019
I think Clang is involved there too, in horribly non-obvious ways (for example I think that's the only way to get the actual libcalls you want rather than legacy ones). Either way, that's a change that would need pretty careful coordination. Since all of our CPUs are Cyclone or above we could probably just skip the libcalls entirely at Apple without ABI breakage (which, unintentionally, is what this patch does).
I am not sure I am following here. According to https://llvm.org/docs/Atomics.html the AtomicExpandPass will translate atomic operations on data sizes above MaxAtomicSizeInBitsSupported into calls to atomic libcalls. The docs say that even though the libcalls share the same names with clang builtins they are not directly related to them. Indeed, I hacked the AArch64 backend to disallow codegen for 128-bit atomics and as a result LLVM emitted calls to __atomic_store_16 and __atomic_load_16. Are those legacy names? I also tried emitting IR for the clang builtins and I saw atomic load/store IR instructions (like those in your tests), no libcalls. Anyhow, my concern here is that if sometime in the future we replace the broken CAS loop with a libcall, the current patch will break ABI compatibility between v8.4 objects with atomic ldp/stp and v8.X objects without the extension. Moreover, this ABI incompatibility already exists between objects built with LLVM and GCC. Any thoughts?
Sep 20 2019
Hi Tim, thanks for looking into this optimization opportunity. I have a few remarks regarding this change:
- First, it appears that the current codegen (CAS loop) for 128-bit atomic accesses is broken based on this comment: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70814#c3. There are two problematic cases as far as I understand: (1) const and (2) volatile atomic objects. Const objects disallow write access to the underlying memory; volatile objects mandate that each byte of the underlying memory shall be accessed exactly once according to the AAPCS. The CAS loop violates both.
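To make the const/volatile concern above concrete, here is a minimal sketch of a "load" built from a compare-exchange. It is an analogy only: load_via_cas is a hypothetical helper, and it uses a 64-bit std::atomic so that it runs without libatomic, whereas the review is about 128-bit accesses and the backend's actual instruction sequence.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not LLVM code): reads an atomic object by issuing a
// compare-exchange instead of a plain load. A 64-bit object is used here as a
// stand-in; the review discusses 128-bit objects.
std::uint64_t load_via_cas(std::atomic<std::uint64_t> &obj) {
    std::uint64_t expected = 0;
    // If obj holds 0, the exchange succeeds and stores 0 back (a write).
    // Otherwise it fails and copies the current value into `expected`.
    // Either way `expected` ends up holding the current value.
    obj.compare_exchange_strong(expected, 0);
    return expected;
}

// Small demonstration: the value is read correctly and left unchanged.
bool demo_load_via_cas() {
    std::atomic<std::uint64_t> a{42};
    const std::uint64_t seen = load_via_cas(a);
    assert(a.load() == 42 && "the CAS must not change the stored value");
    return seen == 42;
}
```

Note that the helper has to take a non-const reference: compare_exchange_strong is not a const member function, because the operation may write. That is the const problem in miniature. A retrying variant of the same idea may also touch the memory more than once, which is the volatile problem.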
Jul 14 2019
Jul 4 2019
Added the dependency of mve on dsp and some missing tests to cover those cases.
Jul 3 2019
Jul 2 2019
Jul 1 2019
I've split the patch.
The second change this patch makes
Could this be split into two patches?
Jun 28 2019
Dec 17 2018
Committed as https://reviews.llvm.org/rL349338
IIRC, there is a test for the pass pipeline I would expect needs updating.
Dec 12 2018
I've tested the patch with native builds of the llvm-test-suite on an AArch64 Cortex-A72 and couldn't spot anything interesting in terms of compilation time.
Dec 11 2018
Dec 4 2018
Nov 30 2018
Looks fine. Thanks!
It may be worthwhile allowing scalar PRE on GEPs that we know won't be combined into the addressing mode of a load/store, i.e. those where TargetTransformInfo::isLegalAddressingMode returns false.
Nov 28 2018
Nov 6 2018
This looks like a bunch of separate changes which should be split into multiple patches, especially the changes to DAGCombine and InstCombiner::visitZExt.
(For reference: I was wondering why x86 doesn't show any diffs for this change; it looks like there's custom code in X86ISelLowering that already does the same thing.)
Nov 2 2018
Rebased and clang-formatted.
Oct 29 2018
I've autogenerated the FileCheck lines to show the diff compared to the trunk codegen. To make sure we never fall through to the next block having changed the CC but not swapped (N2, N3), I've moved all the preconditions to the beginning of the block (instead of moving the block into a helper function).
Oct 24 2018
Is the motivating case integer or FP?
I'm asking because we have a canonicalization for integer cmp+sel for the IR in these tests, but we're missing the corresponding FP transform.
If we add the FP canonicalization in IR, would there still be a need for this backend patch? I.e., is something generating this select code in the DAG itself?
Oct 15 2018
Oct 12 2018
Sep 26 2018
Not ready for review. Using this as a reference for an RFC on llvm-dev.
Sep 12 2018
Sep 7 2018
Aug 28 2018
Do you need help with pushing the changes?
Apologies for delaying this, I was out of office. I'll rebase and push it asap.
Aug 10 2018
Aug 9 2018
So, is everyone happy with this change?
Aug 7 2018
Aug 6 2018
This got reverted because of an out-of-memory error on an ubsan buildbot. Details and fix here -> https://reviews.llvm.org/D50323. I'll update the tests upon rebase.
Jul 30 2018
Did you test it with some benchmarks? Results?
I am running lnt, spec2000 and spec2006 on AArch64 at the moment. I'll post results soon.
Jul 26 2018
Jul 23 2018
Jul 20 2018
Changes from the prior revision:
- Removed the update loop for PhiOps and used TrackingVH<MemoryAccess> instead.
- Replaced the Bitcode reproducer with IR using -preserve-ll-uselistorder.
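As a rough illustration of why a tracking handle can replace a manual update loop, here is a toy model. ToyValue, ToyRegistry and ToyTrackingHandle are hypothetical names invented for this sketch; it only mimics the idea of a handle that follows a replace-all-uses operation and is not how llvm::TrackingVH is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for an IR value; NOT llvm::Value.
struct ToyValue {
    std::string name;
};

class ToyRegistry;

// Toy handle that is re-pointed whenever its value is replaced.
class ToyTrackingHandle {
public:
    ToyTrackingHandle(ToyRegistry &reg, ToyValue *v);
    ~ToyTrackingHandle();
    ToyTrackingHandle(const ToyTrackingHandle &) = delete;
    ToyTrackingHandle &operator=(const ToyTrackingHandle &) = delete;
    ToyValue *get() const { return value; }

private:
    friend class ToyRegistry;
    ToyRegistry &registry;
    ToyValue *value;
};

// Hypothetical registry that knows about every live handle.
class ToyRegistry {
public:
    void add(ToyTrackingHandle *h) { handles.push_back(h); }
    void remove(ToyTrackingHandle *h) {
        handles.erase(std::remove(handles.begin(), handles.end(), h),
                      handles.end());
    }
    // Re-point every handle that refers to `from` so it refers to `to`.
    void replaceAllUsesWith(ToyValue *from, ToyValue *to) {
        for (ToyTrackingHandle *h : handles)
            if (h->value == from)
                h->value = to;
    }

private:
    std::vector<ToyTrackingHandle *> handles;
};

inline ToyTrackingHandle::ToyTrackingHandle(ToyRegistry &reg, ToyValue *v)
    : registry(reg), value(v) {
    assert(v && "toy handle expects a non-null value");
    registry.add(this);
}

inline ToyTrackingHandle::~ToyTrackingHandle() { registry.remove(this); }

// Compares a raw pointer with a tracking handle across a replacement.
std::string demo_tracking() {
    ToyRegistry reg;
    ToyValue oldPhi{"phi.old"};
    ToyValue newPhi{"phi.new"};
    ToyValue *raw = &oldPhi;                   // stale after the replacement
    ToyTrackingHandle tracked(reg, &oldPhi);   // follows the replacement
    reg.replaceAllUsesWith(&oldPhi, &newPhi);
    return raw->name + "," + tracked.get()->name;
}
```

In this model the raw pointer still names the old value after the replacement, while the handle names the new one, so no separate loop is needed to patch up stored operands. In the toy the old value stays alive; in a real pass it may be deleted, which is where a stale raw pointer becomes a use-after-free.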
Jul 19 2018
If the bitcode is crashing but the textual IR isn't, you're probably getting bitten by use-list ordering. You can use the preserve-ll-uselistorder option for "opt" to preserve it in IR.
Jul 18 2018
Does the original test-case crash reliably as IR for you? If so, please use that instead. (Phab won't let me download the attached bitcode, but with asan, I see use-after-free crashes 100% of the time in the original repro).
It does, but with opt -S -O3 ./tc_memphi_gvnhoist.ll -enable-gvn-hoist. Running bugpoint on that command produces the bitcode I uploaded.
Jul 17 2018
A few remarks:
- SmallVector<WeakVH, 8> PhiOps fixes the bug on its own (without the rest of the changes) and I am wondering why.
- When we mark a block as visited why do we cache it? When the recursion ends we might trivially remove the Phi. In that case the second cache insertion for the same key block should fail, no?
- Do we ever reach the PHIExistsButNeedsUpdate case? Is it when a Phi existed beforehand, meaning we did not create it? I can't think of another way to reach that state.
- Interestingly enough the reproducer only made opt crash in bitcode form and not in IR form.