As reported on PR51075, we fail to make use of dereferenceable 128-bit vector loads when a float2 load is subsequently widened for float4 operations, which prevents a useful load-fold.
We already do a similar fold for insert_subvector patterns of 128-bit loads with 256-bit dereferenceable pointers.
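
For illustration only, here is a rough scalar model of the pattern in portable C++. It is not the actual DAG combine or the reproducer from the bug report; the names Float4, load_float2_widened, load_float4, add4 and low_lanes_equal are hypothetical. It assumes the upper two lanes of the widened value are don't-care, so reading all 16 bytes is equivalent whenever the pointer is known to be dereferenceable for 16 bytes.

```cpp
#include <array>
#include <cassert>
#include <cstring>

// Hypothetical stand-in for a 128-bit <4 x float> vector register.
using Float4 = std::array<float, 4>;

// Narrow form: read only 8 bytes (a float2), then widen to float4.
// The upper two lanes are zero-filled here; in the real pattern they are undef.
Float4 load_float2_widened(const float *p) {
    assert(p != nullptr);
    Float4 v{};
    std::memcpy(v.data(), p, 2 * sizeof(float));
    return v;
}

// Wide form: read all 16 bytes in one go. This is only legal when the
// pointer is known to be dereferenceable for the full 16 bytes.
Float4 load_float4(const float *p) {
    assert(p != nullptr);
    Float4 v;
    std::memcpy(v.data(), p, 4 * sizeof(float));
    return v;
}

// A float4 operation that consumes the load; with the wide form the load
// can be folded into the operation as a memory operand.
Float4 add4(const Float4 &a, const Float4 &b) {
    Float4 r;
    for (int i = 0; i < 4; ++i)
        r[i] = a[i] + b[i];
    return r;
}

// Only the low two lanes are observable in the float2 case.
bool low_lanes_equal(const Float4 &a, const Float4 &b) {
    return a[0] == b[0] && a[1] == b[1];
}
```

Both forms agree on the low two lanes, which is why the wider load is a safe replacement when the extra 8 bytes are known to be readable.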
Please can you precommit this case change?