The cost model was not accounting for the fact that we can generate a dual vrgather + an index expression sequence instead of scalarizing.
A few cases to call out:
- I did not model the difference between vrgather and vrgatherei16. As a result, the constant pool cost can be slightly understated on RV32. I don't think we care, but if someone disagrees, this would be easy to add.
- Our current codegen for i8 vectors longer than 256 elements (which is the limit of what this costs) has some room for improvement.
- As indicated by the *regression* in reported cost for <2 x iN> vectors, our current vector lowering is missing support for a sub-case where scalarize-and-insert is actually faster than the generic fallback path.
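
For readers less familiar with the lowering, here is a rough scalar model of what a two-source shuffle computes when it is expressed as two gathers plus a select. This is only a sketch under stated assumptions: `gather` and `twoSourceShuffle` are hypothetical names for illustration, not functions from the backend, and the out-of-range-reads-zero behavior mirrors the `vrgather.vv` definition in the vector spec rather than anything specific to this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of vrgather.vv: out[i] = src[idx[i]], or 0 when idx[i] is
// out of range. (Hypothetical helper, for illustration only.)
std::vector<uint8_t> gather(const std::vector<uint8_t>& src,
                            const std::vector<size_t>& idx) {
  std::vector<uint8_t> out(idx.size(), 0);
  for (size_t i = 0; i < idx.size(); ++i)
    if (idx[i] < src.size())
      out[i] = src[idx[i]];
  return out;
}

// Two-source permute modeled as two gathers plus a per-lane select.
// mask[i] in [0, n) picks from a; mask[i] in [n, 2n) picks from b.
// (Hypothetical helper; assumes both sources and the mask have n lanes.)
std::vector<uint8_t> twoSourceShuffle(const std::vector<uint8_t>& a,
                                      const std::vector<uint8_t>& b,
                                      const std::vector<size_t>& mask) {
  const size_t n = a.size();
  assert(b.size() == n && mask.size() == n);

  // Split the shuffle mask into one index vector per source.
  std::vector<size_t> idxA(n, 0), idxB(n, 0);
  std::vector<bool> fromB(n, false);
  for (size_t i = 0; i < n; ++i) {
    assert(mask[i] < 2 * n);
    fromB[i] = mask[i] >= n;
    if (fromB[i])
      idxB[i] = mask[i] - n;
    else
      idxA[i] = mask[i];
  }

  // One gather per source, then merge the lanes.
  const std::vector<uint8_t> ga = gather(a, idxA);
  const std::vector<uint8_t> gb = gather(b, idxB);
  std::vector<uint8_t> out(n);
  for (size_t i = 0; i < n; ++i)
    out[i] = fromB[i] ? gb[i] : ga[i];
  return out;
}
```

The two index vectors in this model are the kind of data that would have to be materialized (for example from the constant pool) alongside the gathers themselves, which is why the index width matters for the vrgather vs. vrgatherei16 point above.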