In the TTI calculation of vector insert and extract costs, we have an optimization that returns a cost of zero when inserting into or extracting from vector lane zero. All other inserts and extracts cost the base amount specified by the subtarget. However, the lane-zero optimization only makes sense for floating-point types (i.e., within-class moves). For integer types, we should incur a cost for moving data between vector and general-purpose registers, even for lane zero.
This patch restricts the lane-zero optimization to floating-point types. Additionally, we now fall back to the base TTI implementation for all other floating-point inserts and extracts. The existing subtarget-specified insert/extract costs are used only for cross-class moves, which I think was the original intent. Because the existing code looked like a bug to me, I checked the X86 target, which implements something similar to this patch.
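To make the before/after behavior concrete, here is a minimal standalone sketch of the decision logic described above. It is not the actual patch and does not use LLVM headers: `ElementKind`, `CostParams`, `oldVectorInstrCost`, `newVectorInstrCost`, and `runExamples` are hypothetical names invented for illustration, and the numeric costs are arbitrary placeholders rather than real subtarget values.

```cpp
#include <cassert>

// Hypothetical stand-ins for this sketch; these are not LLVM types or APIs.
enum class ElementKind { Integer, FloatingPoint };

struct CostParams {
  int SubtargetInsertExtractCost; // assumed subtarget-specified cost
  int BaseTTICost;                // assumed result of the base TTI fallback
};

// Behavior before the patch, as described: lane zero is free for every
// element type, and every other lane costs the subtarget-specified amount.
int oldVectorInstrCost(ElementKind /*Kind*/, unsigned Lane,
                       const CostParams &P) {
  return Lane == 0 ? 0 : P.SubtargetInsertExtractCost;
}

// Behavior after the patch, as described: only floating-point lane zero is
// free, other floating-point lanes use the base TTI cost, and integer
// (cross-class) moves always pay the subtarget-specified cost.
int newVectorInstrCost(ElementKind Kind, unsigned Lane, const CostParams &P) {
  if (Kind == ElementKind::FloatingPoint)
    return Lane == 0 ? 0 : P.BaseTTICost;
  return P.SubtargetInsertExtractCost;
}

// Small usage example with placeholder costs (2 and 1 are arbitrary).
void runExamples() {
  const CostParams P{/*SubtargetInsertExtractCost=*/2, /*BaseTTICost=*/1};
  // Integer lane zero was free before and is charged after.
  assert(oldVectorInstrCost(ElementKind::Integer, 0, P) == 0);
  assert(newVectorInstrCost(ElementKind::Integer, 0, P) == 2);
  // Floating-point lane zero stays free.
  assert(newVectorInstrCost(ElementKind::FloatingPoint, 0, P) == 0);
}
```

The only case whose cost changes from zero to nonzero in this sketch is integer lane zero, which is the cross-class move the description argues should not be free.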
I've added a new cost model test case in Analysis/CostModel/AArch64/inserts-extracts.ll. All other test case changes are trivial (e.g., they lower the SLP threshold to ensure tests still vectorize).