This patch extends bf16 vector load/store support to the non-AVX512VL case.
We already supported bf16 vector load/store with AVX512VL, but a load/store just moves a register to or from memory.
If we can load/store a 128/256-bit vector, we should also support that when the vector is a bf16 vector.
In fact, this shouldn't be limited by element type at all.
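As a rough illustration of the reasoning above (not code from this patch), the sketch below models a 128-bit vector load/store as a plain 16-byte copy. The names vec128, load128, store128 and roundtrip_bf16 are hypothetical, and the bf16 lanes are held as raw uint16_t bit patterns. The move never inspects the lanes, so the element type does not matter.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit vector register: 16 raw bytes. */
typedef struct { unsigned char bytes[16]; } vec128;

/* A vector load only copies bits from memory; it never interprets lanes. */
static vec128 load128(const void *src) {
    vec128 v;
    memcpy(v.bytes, src, sizeof v.bytes);
    return v;
}

/* A vector store only copies bits back to memory. */
static void store128(void *dst, vec128 v) {
    memcpy(dst, v.bytes, sizeof v.bytes);
}

/* Round-trip eight bf16 lanes, kept as raw uint16_t bit patterns.
   Returns 1 if all 16 bytes survive the load/store unchanged. */
static int roundtrip_bf16(const uint16_t in[8], uint16_t out[8]) {
    store128(out, load128(in));
    return memcmp(in, out, 16) == 0;
}
```

The same two helpers would serve 8 x i16 or 8 x half lanes unchanged, which is the sense in which the move is type-agnostic.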
TODO: There may be more load/store types that need extending. Anyway, let's fix bf16 first, because
we have hit an urgent customer bug with it.
align 16 ?