Yesterday
Thanks for the updates.
Thu, May 26
So it looks like roughly a 0.2/126 = 0.16% increase in compile time? I think that is OK for -O3, even though the expected gains will likely not be very large. It did come up as an improvement in the tests I ran though.
Thanks. LGTM
From the publicly available ARM software optimization guides, it looks like the indexed/non-indexed variants should give the same performance on A57/A75, but the indexed variants have lower throughput on A55. A quick glance at the scheduling model for A55 seems to indicate that this difference is not reflected in the model though.
It also looks like the same issue would exist for other indexed variants handled there. If issues show up, investigating handling this more granularly sounds like a good idea. We could also limit the sinking to cores where indexed and non-indexed variants have the same performance, if needed.
Hello - We were having a discussion about a very similar patch in D125755. I think the outcome for this patch is that either:
- We need to do this later (maybe in CodeGenPrepare).
- We need to do this unconditionally without the call to TTI.preferCTTZLowering() and have the reverse transform later for targets that do not have a cheaper alternative.
- We need to argue some more :)
Wed, May 25
I think that non-legal types will be legalized to legal types where a splat can still be used for lane indexing.
And it seems like the first two operands of an fma are commutative?
And I don't think the splat index can be out of range, if we consider illegal types to be legalized.
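A minimal IR sketch of the case being discussed (function name and types are made up): a splat feeding one of the commutative fma operands, which should be selectable as a lane-indexed multiply-accumulate once the types are legalized.
```
define <4 x float> @fmla_lane(<4 x float> %acc, <4 x float> %a, <4 x float> %b) {
  ; splat of lane 0 of %b; the first two fma operands are the multiplicands,
  ; so the splat can sit in either position
  %splat = shufflevector <4 x float> %b, <4 x float> poison, <4 x i32> zeroinitializer
  %fma = call <4 x float> @llvm.fma.v4f32(<4 x float> %a, <4 x float> %splat, <4 x float> %acc)
  ret <4 x float> %fma
}
declare <4 x float> @llvm.fma.v4f32(<4 x float>, <4 x float>, <4 x float>)
```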
Thanks - do you have AArch64 numbers? The MachineCombiner does seem to be used on X86, but to a much smaller degree than on AArch64, where there are many more patterns. The reassociation might be useful, but it is the madd to mul+add conversion and the fmla changes that are more worrying.
Thanks for getting the numbers.
Would it be better for this be done later, by the backend? I worry that the BETC would not be calculable any more for the loop, and the new structure would be more difficult to analyze in general. Handling it separately from the vectorizer would also allow other loops to be transformed. What happens at the moment for loops with ACLE intrinsics, for example?
The primary downside of target-specific transforms is that they go against canonicalization: if a combine is expecting a specific form of IR, that form will only show up on specific targets. So we either miss some transforms on some targets, or we write code to match the same thing in each possible form.
It's always been a bit of a spectrum; transforms like inlining are fundamentally driven by heuristics, and those heuristics are going to lead to different IR on different targets. But we want to encourage using canonical forms, even when we sometimes end up transforming from A->B, then later end up transforming B->A in the target's code generator. This shapes the way we define IR to some extent; for example, llvm.cttz has an "is_zero_poison" flag so we can use the same intrinsic on all targets.
This isn't to say we can never make a target-specific decision early, but we should explore alternatives that allow making target-specific decisions later, where possible.
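For what it's worth, a small IR example of the llvm.cttz point above (illustrative only):
```
define i32 @cttz_defined(i32 %x) {
  ; is_zero_poison = false: the result is well defined (32) for %x == 0, so the
  ; same intrinsic can be used on targets whose instruction handles zero and on
  ; targets that need an explicit zero check inserted later.
  %r = call i32 @llvm.cttz.i32(i32 %x, i1 false)
  ret i32 %r
}
declare i32 @llvm.cttz.i32(i32, i1)
```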
Mon, May 23
I hadn't expected MachineCopyPropagation would be too expensive - it sounds like a simple concept and seems to already be run twice in the default pipeline. This is only adding an extra run at -O3, so hopefully won't be too bad for compile time.
Sat, May 21
Fri, May 20
Thanks for splitting this out. I've been trying to run some tests.
Thanks for checking. Let's give this another go then. LGTM
OK thanks. LGTM
Thu, May 19
So long as there are no issues from @fhahn, this LGTM.
Sounds good to me.
It might be worth teaching AArch64TTIImpl::getRegisterClassForType that fp regs and vector regs are the same thing. That would require some testing to make sure it actually produces better results though.
I think for the case of D118979 it makes sense to prevent maximizing the vector bandwidth for fixed-length SVE. The larger vectors will already be wide enough, and as far as I understand they don't benefit from the wider types in the same way that NEON does.
Thanks for the comments.
If I haven't got this wrong, this suggests we could just always zext the constant from a sub:
https://alive2.llvm.org/ce/z/kTAyt2
https://alive2.llvm.org/ce/z/a6HHq4
That sounds like it would produce simpler constants too, although I'm not sure there are any tests that show it.
Wed, May 18
I was taking a look at some of the ARM/Thumb tests that were failing if this used TII.isCopyInstr. There are definitely some problems happening in there, with undefined registers.
Tue, May 17
Yeah - it needs to be earlier than that, before vectorization and preferably before unrolling, to get the costs correct.
I think AArch64 should be able to do the And efficiently in most cases. There is an instcombine equivalent fold that doesn't check one-use, so from that perspective it will only be things that come up from DAG combine that this alters. I see two win vararg tests changing, with code that comes from AArch64TargetLowering::LowerWindowsDYNAMIC_STACKALLOC.
Do you know what makes the AArch64::shouldFoldConstantShiftPairToMask necessary? Is it just something about those tests, or something fundamental to the architecture?
Mon, May 16
Fri, May 13
Thanks. LGTM
Thu, May 12
Thanks for the update. Have you tried a bootstrap to make sure it passes now?
Cheers
Thanks. LGTM
Do we need --asm-show-inst, or is it enough to just show the instructions? It seems quite verbose for the information it gives.
I noticed that for signed conversions, some non-NEON instruction sequences are shorter. I don't know if the longer one is still faster on current architectures (the patterns date back to the initial backend import).
Hello. This looks like two different patches.
In order sounds sensible to me. We may need a subtarget feature for this; it depends on what the Apple folks think, but we can always add one later if needed.
Wed, May 11
I thought you said you were just going to write a new pass for this? :)
Yeah I think the issue is that @cttztrunc should be doing and 0x3f, not and 0x1f.
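As a guess at the shape of that testcase (the actual @cttztrunc body may differ), the point is that an i64 cttz result ranges over 0..63, so masking with 0x1f loses bit 5 where 0x3f would not:
```
define i32 @cttztrunc(i64 %x) {
  %ct = call i64 @llvm.cttz.i64(i64 %x, i1 true)
  %t = trunc i64 %ct to i32
  ; 0x3f keeps the full 0..63 range; 0x1f would wrap results >= 32
  %m = and i32 %t, 63
  ret i32 %m
}
declare i64 @llvm.cttz.i64(i64, i1)
```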
Tue, May 10
Oh yeah, I missed that testcase. I'll revert for now. Thanks for the report.
Sounds good to me.
Sounds good to me.
Mon, May 9
Rebase
No, it tries to do the same as your patch - find reused scalars in the previously built nodes and emit shuffles, if possible, instead of a bunch of extracts.
The problem is that we need to track the dependencies between nodes; that's why it is so big.
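A hypothetical IR illustration of the shuffles-instead-of-extracts idea (names and types invented):
```
define <2 x float> @reuse_scalars(<4 x float> %v) {
  ; one shufflevector picks the reused lanes of an already-vectorized value ...
  %s = shufflevector <4 x float> %v, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  ret <2 x float> %s
  ; ... instead of an extract/insert chain such as:
  ;   %e0 = extractelement <4 x float> %v, i32 0
  ;   %e1 = extractelement <4 x float> %v, i32 2
  ;   %i0 = insertelement <2 x float> poison, float %e0, i32 0
  ;   %i1 = insertelement <2 x float> %i0, float %e1, i32 1
}
```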