According to the ARM Neoverse-N1 Software Optimization Guide, the
extract instructions have a latency of 2 and a throughput of 2.
TTI returns a cost of 3, which seems too high.
For comparison, PEXTR on x86 has a latency of 3 and a throughput of 1 on Skylake,
according to https://www.agner.org/optimize/instruction_tables.pdf,
yet TTI returns a cost of 1.
This patch sets the vector extract cost to 1.
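For reference, the value TTI reports can be inspected with opt's cost-model printing pass. The IR below is only a hypothetical sample for such a check (the file name is made up, and the exact pass invocation may differ between LLVM versions).

```llvm
; extract-cost.ll -- hypothetical sample for inspecting the TTI cost.
; Something like:
;   opt -passes='print<cost-model>' -disable-output extract-cost.ll
; prints a "Cost Model: Found an estimated cost of N" line per instruction.
target triple = "aarch64-unknown-linux-gnu"

; Lane extract: the cost of this extractelement is the number under discussion.
define float @extract_lane1(<4 x float> %v) {
  %e = extractelement <4 x float> %v, i32 1
  ret float %e
}

; Lane insert: insertelement goes through the same vector insert/extract costing.
define <4 x float> @insert_lane2(<4 x float> %v, float %x) {
  %i = insertelement <4 x float> %v, float %x, i32 2
  ret <4 x float> %i
}
```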
nit: this changes the cost for both extract and insert, while the summary mostly mentions the EXT instruction cost. It might be good to call out that INS has a latency of 2 and a throughput of 2 (unless it is a common assumption that the extract and insert instructions have the same cost).
Also, from the studies in D128302, I think the cost of extract/insert is better modeled by taking the user instruction into account (e.g., if the user instruction can access the lane directly, the extract can be folded into the user in the emitted code and have no cost). Nevertheless, my gut feeling is that 3 is a high number for instructions with a latency of 2 and a throughput of 2; I am not sure whether 1 is too small.
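To make that concrete, here is a toy example of mine (not from this patch): a floating-point extract whose user can address the lane directly can be folded into the user, while an extract feeding a GPR operation still needs an explicit lane move, so any single fixed cost will be off for one of the two cases.

```llvm
target triple = "aarch64-unknown-linux-gnu"

; The extract can fold into its user: this can lower to a single by-element
; multiply such as "fmul s0, s1, v0.s[1]", so the extractelement itself adds
; no instruction to the emitted code.
define float @fold_into_user(<4 x float> %v, float %s) {
  %e = extractelement <4 x float> %v, i32 1
  %r = fmul float %s, %e
  ret float %r
}

; Here the user is an integer add in a GPR, so the lane has to be moved out
; first (e.g. "mov w8, v0.s[1]"), and the extract does cost an instruction.
define i32 @needs_lane_move(<4 x i32> %v, i32 %a) {
  %e = extractelement <4 x i32> %v, i32 1
  %r = add i32 %a, %e
  ret i32 %r
}
```

Running the file through llc with an AArch64 triple shows the difference in the emitted code for the two functions.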