Currently, we generate broadcast instructions on CPUs with AVX2 to load some constant splat vectors.
This patch should preserve all existing behavior at regular optimization levels, but also use broadcasts whenever possible when optimizing for *size* on any CPU with AVX or AVX2.
The tradeoff is up to 5 extra instruction bytes for the broadcast instruction to save at least 8 bytes (up to 31 bytes) of constant pool data.
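As a rough check of the 8 to 31 byte range, here is a minimal sketch of the per-case arithmetic. It assumes the constant pool entry shrinks from a full vector to a single unpadded scalar element and ignores alignment padding. The names (poolBytesSaved, savingsRange) are hypothetical and not part of the patch, and this is a per-case bound, not a reproduction of the measured totals.

```cpp
#include <algorithm>
#include <cassert>

// Element sizes in bytes for f32, f64, i8, i16, i32, i64.
static const int kElementBytes[] = {4, 8, 1, 2, 4, 8};
// Vector sizes in bytes for 128-bit and 256-bit vectors.
static const int kVectorBytes[] = {16, 32};

// Constant pool bytes saved when a full splat vector is replaced by one
// scalar element that gets broadcast at run time (assumption: no padding).
int poolBytesSaved(int vectorBytes, int elementBytes) {
    assert(elementBytes > 0 && vectorBytes % elementBytes == 0);
    return vectorBytes - elementBytes;
}

struct Range {
    int min;
    int max;
};

// Smallest and largest savings across all 12 vector/element combinations.
Range savingsRange() {
    int first = poolBytesSaved(kVectorBytes[0], kElementBytes[0]);
    Range r = {first, first};
    for (int v : kVectorBytes) {
        for (int e : kElementBytes) {
            int saved = poolBytesSaved(v, e);
            r.min = std::min(r.min, saved);
            r.max = std::max(r.max, saved);
        }
    }
    return r;
}
```

Under these assumptions, savingsRange() gives a minimum of 8 (128-bit vector of 64-bit elements) and a maximum of 31 (256-bit vector of i8), so even the worst case outweighs the up to 5 extra instruction bytes.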
With -Os (function attribute "optsize"), the size change for the included testcase file, which covers all 12 AVX2 vector data type cases (f32, f64, i8, i16, i32, i64 for 128-bit and 256-bit vectors), is:
AVX: +29 inst -112 data = 83 bytes saved
AVX2: +29 inst -106 data = 77 bytes saved
Note: Is there any optimization pass in LLVM that merges constant pool data from different functions? Could this also be done at link time? If such merging exists, it might change the criteria for generating a broadcast, because we might not want to generate extra instructions when the same constant data is loaded multiple times.