Instead of always spilling the vector and then performing the insert in memory, insert directly into the correct half of the newly split vector when possible.
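To make the idea concrete, here is a minimal standalone sketch of the index arithmetic. It is not the patch itself: `SplitVec`, `splitVector`, and `insertIntoSplit` are hypothetical names, plain `std::vector<int>` stands in for the legalizer's values, and it assumes the insert index is known up front and that the vector has an even element count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a vector that has been split into two halves.
struct SplitVec {
  std::vector<int> Lo;
  std::vector<int> Hi;
};

// Split Vec into a low half and a high half of equal size.
SplitVec splitVector(const std::vector<int> &Vec) {
  assert(Vec.size() % 2 == 0 && "sketch assumes an even element count");
  std::size_t LoNumElts = Vec.size() / 2;
  return {std::vector<int>(Vec.begin(), Vec.begin() + LoNumElts),
          std::vector<int>(Vec.begin() + LoNumElts, Vec.end())};
}

// Insert Elt at index Idx directly into whichever half holds that lane.
// Returns false when Idx is out of range, in which case a caller would
// fall back to the existing spill-and-insert-in-memory path.
bool insertIntoSplit(SplitVec &V, std::size_t Idx, int Elt) {
  std::size_t LoNumElts = V.Lo.size();
  if (Idx < LoNumElts) {
    V.Lo[Idx] = Elt; // low lanes keep their index
    return true;
  }
  if (Idx - LoNumElts < V.Hi.size()) {
    V.Hi[Idx - LoNumElts] = Elt; // high lanes are rebased by LoNumElts
    return true;
  }
  return false;
}
```

With an 8-element vector, an insert at index 0 lands in lane 0 of the low half, and an insert at index 6 lands in lane 2 of the high half. The second case is the high-lane path.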
Unfortunately, aside from the SAD test, this direct-insert path doesn't really seem to get hit, and I'm not sure how to construct a test case for the high lanes. One alternative is to make this safer by only inserting directly when Idx is 0 (this is tested, and is probably the common case). Let me know if you think that would be better.