The instruction regexp "^INSv", intended for the insert from a general register to a vector element, was also matching the element-to-element insert. According to the Software Optimization Guide [1], the element-to-element form has a latency of 2, not 5, so the model was reporting the wrong latency for it.
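To illustrate the over-match, here is a minimal Python sketch rather than the TableGen change itself. The opcode list is an assumption based on the INS variant names in the AArch64 backend, and the tightened pattern is a hypothetical example of a stricter regexp, not necessarily the one used in this patch.

```python
import re

# Assumed opcode names for the two INS forms (general register source
# versus vector lane source); listed here for illustration only.
OPCODES = [
    "INSvi8gpr", "INSvi16gpr", "INSvi32gpr", "INSvi64gpr",
    "INSvi8lane", "INSvi16lane", "INSvi32lane", "INSvi64lane",
]


def matching(pattern):
    """Return the opcodes from OPCODES that the pattern matches."""
    rx = re.compile(pattern)
    return [op for op in OPCODES if rx.match(op)]


# The broad pattern matches both the gpr and the lane forms.
old = matching(r"^INSv")

# Hypothetical tightened pattern: only the gpr forms remain.
new = matching(r"^INSvi(8|16|32|64)gpr$")

print("old:", old)
print("new:", new)
```

With the broad pattern, the lane forms pick up the 5-cycle latency meant only for the gpr forms; restricting the match to the gpr names leaves the lane forms to whatever default write applies to them.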
I haven't done any performance runs with this change, both because I don't have access to N2 hardware and because the fix is hopefully obvious enough. My use case is llvm-mca, which currently produces incorrect results because of this.
[1] https://developer.arm.com/documentation/PJDOC-466751330-18256/latest/
I don't think this line needs to be added, as it will already be handled by WriteV being N2Write_2cyc_1V. The tighter regex on INSv..gpr should be enough. I believe that is what is meant above in: