The insertelement IR instruction can lower to several different instructions. One option is INS ("ASIMD insert, element to element"), which is cheap: it has a latency of only 2 on modern cores such as the N1, N2 and V1. We currently model it with a cost of 3, which is perhaps slightly higher than needed, but that is for another time.
This patch is about another variant: an indexed LD1 ("ASIMD load, 1 element, one lane, B/H/S"), which loads a value and inserts it as an element into a vector. This is an expensive instruction, with a latency of 8 on modern cores. We generate an indexed LD1 when an insertelement instruction has a load as an operand. This patch recognises that case and assigns it a cost of 4, up from the previous 3. The new cost of 4 is fairly arbitrary; the point is that it makes this form more expensive than a plain insert.
It might be worth using ST->getVectorInsertExtractBaseCost() + 1 rather than a hard-coded 4.
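For illustration, the base-cost-plus-one idea could look roughly like the sketch below. This is not the patch itself: `SubtargetSketch`, `OperandKind` and `insertElementCost` are hypothetical stand-ins written so the snippet compiles without LLVM headers, and the default base cost of 3 is simply the current cost mentioned in the description.

```cpp
// Hypothetical, simplified stand-ins; these are NOT the real LLVM classes.
enum class OperandKind { Load, Other };

struct SubtargetSketch {
  // Assumed default: the cost of 3 currently used for insertelement.
  int VectorInsertExtractBaseCost = 3;

  // Plays the role of ST->getVectorInsertExtractBaseCost().
  constexpr int getVectorInsertExtractBaseCost() const {
    return VectorInsertExtractBaseCost;
  }
};

// Cost of an insertelement, given what kind of value is being inserted.
// An inserted value that comes from a load is expected to become an
// indexed LD1, so it is charged one more than the base cost.
constexpr int insertElementCost(const SubtargetSketch &ST,
                                OperandKind InsertedValue) {
  return InsertedValue == OperandKind::Load
             ? ST.getVectorInsertExtractBaseCost() + 1
             : ST.getVectorInsertExtractBaseCost();
}

// With the assumed base cost of 3 this reproduces the numbers above.
static_assert(insertElementCost(SubtargetSketch{}, OperandKind::Other) == 3,
              "plain insert keeps the base cost");
static_assert(insertElementCost(SubtargetSketch{}, OperandKind::Load) == 4,
              "insert of a loaded value costs base + 1");
```

Deriving the cost from the base keeps the load variant more expensive than a plain insert even if a subtarget overrides the base cost, which a fixed 4 would not guarantee.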