This patch addresses two issues. (If it is too big to review, I can split it into two, but the two changes are connected.)
- The current lowering of masked load/store for <2 x i32> and <2 x f32> is incorrect. I fix this in the type legalizer and in the subsequent "combine" in X86.
- I added cost estimation for masked operations. It shows that (1) masked load/store for these vector types is very expensive (due to expanding loads and truncating stores), and (2) the maskmov operation itself is not as cheap as a regular vector load/store.
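
To make the "expanding loads and truncating stores" point concrete, here is a small scalar model of what these operations have to compute. It is only a sketch for reviewers, not code from the patch: the function names are hypothetical, and it assumes the <2 x i32> case is legalized by promoting each element to i64, with sign extension chosen arbitrarily for the load.

```cpp
#include <array>
#include <cstdint>

// Reference semantics of a masked load of <2 x i32>: lanes with a clear mask
// bit keep the pass-through value, and their memory is never read.
std::array<int32_t, 2> maskedLoad2xi32(const int32_t *ptr,
                                       std::array<bool, 2> mask,
                                       std::array<int32_t, 2> passThru) {
  std::array<int32_t, 2> result = passThru;
  for (int lane = 0; lane < 2; ++lane)
    if (mask[lane])
      result[lane] = ptr[lane];
  return result;
}

// Assumed promoted form: the value lives in <2 x i64>, but memory still holds
// two 32-bit elements, so the load must extend each enabled lane.
std::array<int64_t, 2> maskedExtLoad2xi32(const int32_t *ptr,
                                          std::array<bool, 2> mask,
                                          std::array<int64_t, 2> passThru) {
  std::array<int64_t, 2> result = passThru;
  for (int lane = 0; lane < 2; ++lane)
    if (mask[lane])
      result[lane] = static_cast<int64_t>(ptr[lane]); // sign extension
  return result;
}

// Assumed promoted form of the store: each enabled 64-bit lane is truncated
// to 32 bits; disabled lanes must leave memory untouched.
void maskedTruncStore2xi32(int32_t *ptr, std::array<bool, 2> mask,
                           std::array<int64_t, 2> value) {
  for (int lane = 0; lane < 2; ++lane)
    if (mask[lane])
      ptr[lane] = static_cast<int32_t>(value[lane]);
}
```

The extra extend or truncate step per lane, on top of the masking itself, is the reason the cost model rates these types as expensive.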