It adds a shuffle_16x16 strategy LowerVectorTranspose and renames shuffle to shuffle_1d. The idea is similar to 8x8 cases in x86Vector::avx2. The general algorithm is:
interleave 32-bit lanes using
8x _mm512_unpacklo_epi32
8x _mm512_unpackhi_epi32
interleave 64-bit lanes using
8x _mm512_unpacklo_epi64
8x _mm512_unpackhi_epi64
permute 128-bit lanes using
16x _mm512_shuffle_i32x4
permute 256-bit lanes using again
16x _mm512_shuffle_i32x4After the first stage, they got transposed to
0 16 1 17 4 20 5 21 8 24 9 25 12 28 13 29 2 18 3 19 6 22 7 23 10 26 11 27 14 30 15 31 32 48 33 49 ... 34 50 35 51 ... 64 80 65 81 ... ...
After the second stage, they got transposed to
0 16 32 48 ... 1 17 33 49 ... 2 18 34 49 ... 3 19 35 51 ... 64 80 96 112 ... 65 81 97 114 ... 66 82 98 113 ... 67 83 99 115 ... ...
After the thrid stage, they got transposed to
0 16 32 48 8 24 40 56 64 80 96 112 ... 1 17 33 49 ... 2 18 34 50 ... 3 19 35 51 ... 4 20 36 52 ... 5 21 37 53 ... 6 22 38 54 ... 7 23 39 55 ... 128 144 160 176 ... 129 145 161 177 ... ...
After the last stage, they got transposed to
0 16 32 48 64 80 96 112 ... 240 1 17 33 49 66 81 97 113 ... 241 2 18 34 50 67 82 98 114 ... 242 ... 15 31 47 63 79 96 111 127 ... 255
It might pay to make number of elements an argument (or template parameter) and then reusing this for mm256* variants.