If the fma operates on a legal vector type, the indexed variants can be used if the second operand is a splat of a lane with a valid index.
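For illustration (not taken from the patch itself), the pattern in question looks roughly like this when written with NEON intrinsics; the single-lane dup feeding the fma is what allows the indexed encoding:

```
// Illustrative only: an fma whose multiplicand is a splat of one lane.
// If the dup is sunk next to the fma, instruction selection can pick an
// indexed FMLA (fmla v0.4s, v1.4s, v2.s[3]) instead of a separate dup + fmla.
#include <arm_neon.h>

float32x4_t fma_by_lane(float32x4_t Acc, float32x4_t A, float32x4_t B) {
  float32x4_t Splat = vdupq_laneq_f32(B, 3); // splat of lane 3 of B
  return vfmaq_f32(Acc, A, Splat);           // Acc + A * Splat
}
```

Without the sinking, the dup can end up in a different block (for example hoisted out of a loop), and instruction selection then cannot fold it into the indexed form.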
I think that non-legal types will be legalized to legal types where a splat can still be used for lane indexing.
And it seems like the first two operands of a fma are commutative?
And I don't think the splat index can be out of range, if we consider illegal types to be legalized.
Which means we might be able to simplify it to this, maybe with a check for fullfp16 and that the scalar types are f16/f32/f64?
  case Intrinsic::fma:
  case Intrinsic::aarch64_neon_sqdmull:
  case Intrinsic::aarch64_neon_sqdmulh:
  case Intrinsic::aarch64_neon_sqrdmulh:
    // Sink splats for index lane variants
    if (isSplatShuffle(II->getOperand(0)))
      Ops.push_back(&II->getOperandUse(0));
    if (isSplatShuffle(II->getOperand(1)))
      Ops.push_back(&II->getOperandUse(1));
    return !Ops.empty();
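For context, a minimal sketch of what an isSplatShuffle-style check could look like (illustrative only; the actual in-tree helper may handle undef lanes or other details differently):

```
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/Instructions.h"

// Sketch: treat V as a splat shuffle if it is a shufflevector whose mask
// selects the same source lane for every result element.
static bool isSplatShuffleSketch(llvm::Value *V) {
  if (auto *Shuf = llvm::dyn_cast<llvm::ShuffleVectorInst>(V)) {
    llvm::ArrayRef<int> Mask = Shuf->getShuffleMask();
    return !Mask.empty() &&
           llvm::all_of(Mask, [&](int M) { return M == Mask[0]; });
  }
  return false;
}
```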
Note that I have also heard of cases where the lane-indexed fma instructions are slightly slower than a normal fma, as they get cracked into a dup and a fma micro-op. That would make leaving them outside of the loop a better idea, and might make the Machine Combiner a better place for this if it worked. But I don't know where that actually happens, and there doesn't seem to be an obvious place in-tree where the latencies are different, except for one old core and this Cyclone difference which doesn't make a lot of sense: https://godbolt.org/z/63eWsEqT1. So I think sinking them unconditionally is probably fine if it saves the register pressure. We just might need to change that in the future if we find cases where it does matter.
Thanks for taking a look!
That's a good point; the original checks are probably more paranoid than necessary. I updated the patch to use the common code with aarch64_neon_sqdmul*, with a check on the element type for fp16.
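Roughly, the kind of fp16 guard meant here, written in the same style as the snippet above (a sketch only; the exact condition in the updated patch may differ):

```
  case Intrinsic::fma:
    // Sketch: only sink splats for half-precision fma when the subtarget
    // actually has fullfp16; otherwise fall into the common code below.
    if (isa<VectorType>(II->getType()) &&
        cast<VectorType>(II->getType())->getElementType()->isHalfTy() &&
        !ST->hasFullFP16())
      return false;
    [[fallthrough]];
  case Intrinsic::aarch64_neon_sqdmull:
  case Intrinsic::aarch64_neon_sqdmulh:
  case Intrinsic::aarch64_neon_sqrdmulh:
    // Sink splats for index lane variants
    if (isSplatShuffle(II->getOperand(0)))
      Ops.push_back(&II->getOperandUse(0));
    if (isSplatShuffle(II->getOperand(1)))
      Ops.push_back(&II->getOperandUse(1));
    return !Ops.empty();
```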
From the publicly available ARM software optimization guides, it looks like the indexed/non-indexed variants should give the same performance on A57/A75, but the indexed variants have less throughput on A55. A quick glance at the scheduling model for A55 seems to indicate that this difference is not reflected in the model though.
It also looks like the same issue would exist for the other indexed variants handled there. If issues show up, investigating handling this at a more granular level sounds like a good idea. We could also limit the sinking to cores where the indexed and non-indexed variants have the same performance, if needed.
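If limiting ever turns out to be needed, the gate could look roughly like this (purely hypothetical; hasSlowIndexedFMA is a made-up tuning hook, not an existing subtarget feature):

```
  case Intrinsic::fma:
    // Hypothetical tuning hook: skip the sinking on cores where the indexed
    // fma is slower. hasSlowIndexedFMA() does not exist in-tree; it stands
    // in for whatever subtarget feature would be added if this shows up.
    if (ST->hasSlowIndexedFMA())
      return false;
    [[fallthrough]];
  case Intrinsic::aarch64_neon_sqdmull:
    // ... splat sinking as in the earlier snippet ...
```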
The testing I usually run included some A55 runs, and I wasn't seeing any differences in performance. It turns out that the test case I have for fma+dup would come through to the MachineCombiner as FADD+FMUL+DUP though, not llvm.fma. It was the MachineCombiner that did the conversion in the end (at least for one of the two fmas; the other becomes a non-lane-indexed version because the pattern is actually FADD(FMUL(COPY(DUP(..)))), and I don't think the Machine Combiner is ignoring no-op copies).
Anyway - having fixed that, I still do not see any difference between the indexed and non-indexed variants of FMA on the A55 when the indexed versions are forced off, so I think performance-wise this is OK. The indexed variants seem to perform better in general, even if they do have a lower throughput.
So LGTM. Cheers