This generalizes the existing support for doing this with 128-bit loads into 256-bit vectors.
New patterns are added to support 8-bit and 16-bit elements and v8f32->v16f32 without DQI instructions, along with a fallback for when the load can't be folded.
Update the comment?