The front end bitcasts <256 x i32> to x86_amx and generates load/store of
<256 x i32>*. The instruction combine pass then transforms the load/store of
<256 x i32>* into load/store of x86_amx*. In the AMX type lowering pass, we
lower these load/store instructions to AMX load/store intrinsics with a
stride value of 64.
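
To illustrate why a stride of 64 fits here, the following is a minimal
software model in C, not the actual lowering code. It assumes the stride is
counted in bytes and that a full tile is 16 rows of 64 bytes (1024 bytes, the
same size as <256 x i32>). The helper names model_tile_load and
stride64_matches_contiguous are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed tile geometry: 16 rows x 64 bytes = 1024 bytes = 256 x i32. */
enum { TILE_MAX_ROWS = 16, TILE_ROW_BYTES = 64, VEC_ELEMS = 256 };

/* Hypothetical model of a strided tile load: copies `rows` rows of
 * `row_bytes` bytes each from `base`, advancing `stride` bytes per row. */
static void model_tile_load(uint8_t *tile, const void *base, size_t rows,
                            size_t row_bytes, size_t stride)
{
    const uint8_t *src = (const uint8_t *)base;
    assert(stride >= row_bytes); /* rows must not overlap in memory */
    for (size_t r = 0; r < rows; ++r)
        memcpy(tile + r * row_bytes, src + r * stride, row_bytes);
}

/* Returns 1 when a stride-64 load of a contiguous 256 x i32 buffer
 * reproduces that buffer byte for byte, 0 otherwise. */
static int stride64_matches_contiguous(void)
{
    int32_t vec[VEC_ELEMS];
    uint8_t tile[TILE_MAX_ROWS * TILE_ROW_BYTES];

    for (int i = 0; i < VEC_ELEMS; ++i)
        vec[i] = i;

    model_tile_load(tile, vec, TILE_MAX_ROWS, TILE_ROW_BYTES, 64);
    return memcmp(tile, vec, sizeof vec) == 0;
}
```

Under these assumptions, each 64-byte row begins exactly where the previous
one ends, so a tile load/store with stride 64 touches the same bytes as the
original load/store of the contiguous <256 x i32> vector.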