The patch below transforms truncations between vectors of integers into X86ISD::PACKUS/PACKSS operations during DAG combine.
http://reviews.llvm.org/D14588
It improves code generation for the code below, saving 22 instructions.
define void @truncate_v16i32_to_v16i8(<16 x i32> %a) {
  %1 = trunc <16 x i32> %a to <16 x i8>
  store <16 x i8> %1, <16 x i8>* undef, align 4
  ret void
}
With that patch we generate better code for truncate, so the cost of truncate needs to change as well, but it looks like it was changed only in the SSE2 table.
The change also applies to SSE4.1, so we should update the cost of truncate there too.
Prior to that patch, with SSE4.1 we used to generate the code below:
pextrb $4, %xmm0, %eax
pextrb $8, %xmm0, %ecx
pextrb $12, %xmm0, %edx
pinsrb $1, %eax, %xmm0
pinsrb $2, %ecx, %xmm0
pinsrb $3, %edx, %xmm0
pextrb $0, %xmm1, %eax
pinsrb $4, %eax, %xmm0
pextrb $4, %xmm1, %eax
pinsrb $5, %eax, %xmm0
pextrb $8, %xmm1, %eax
pinsrb $6, %eax, %xmm0
pextrb $12, %xmm1, %eax
pinsrb $7, %eax, %xmm0
pextrb $0, %xmm2, %eax
pinsrb $8, %eax, %xmm0
pextrb $4, %xmm2, %eax
pinsrb $9, %eax, %xmm0
pextrb $8, %xmm2, %eax
pinsrb $10, %eax, %xmm0
pextrb $12, %xmm2, %eax
pinsrb $11, %eax, %xmm0
pextrb $0, %xmm3, %eax
pinsrb $12, %eax, %xmm0
pextrb $4, %xmm3, %eax
pinsrb $13, %eax, %xmm0
pextrb $8, %xmm3, %eax
pinsrb $14, %eax, %xmm0
pextrb $12, %xmm3, %eax
pinsrb $15, %eax, %xmm0
movdqu %xmm0, (%rax)
retq
After that patch we generate better code:
movdqa .LCPI0_0(%rip), %xmm4 # xmm4 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
pand %xmm4, %xmm3
pand %xmm4, %xmm2
packuswb %xmm3, %xmm2
pand %xmm4, %xmm1
pand %xmm4, %xmm0
packuswb %xmm1, %xmm0
packuswb %xmm2, %xmm0
movdqu %xmm0, (%rax)
retq
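The shorter sequence masks each 32-bit lane down to its low byte and then packs twice with unsigned saturation, which cannot clamp anything once the lanes are masked. A minimal sketch of the same idea in C++ with SSE2 intrinsics follows. The function name truncate16 and the in-memory layout are assumptions made for illustration, and this is not the code LLVM emits, only a mirror of the pand/packuswb structure. A scalar fallback is used where SSE2 is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: truncate 16 x u32 to 16 x u8 (keep the low byte of each lane).
void truncate16(const std::uint32_t* in, std::uint8_t* out) {
#if defined(__SSE2__)
  const __m128i mask = _mm_set1_epi32(0xFF);  // [255,0,0,0] repeated per lane
  const __m128i* p = reinterpret_cast<const __m128i*>(in);
  __m128i v0 = _mm_and_si128(_mm_loadu_si128(p + 0), mask);  // pand
  __m128i v1 = _mm_and_si128(_mm_loadu_si128(p + 1), mask);  // pand
  __m128i v2 = _mm_and_si128(_mm_loadu_si128(p + 2), mask);  // pand
  __m128i v3 = _mm_and_si128(_mm_loadu_si128(p + 3), mask);  // pand
  // After masking, every 16-bit word is in [0, 255], so packuswb never saturates.
  __m128i lo = _mm_packus_epi16(v0, v1);   // packuswb: words now hold elements 0..7
  __m128i hi = _mm_packus_epi16(v2, v3);   // packuswb: words now hold elements 8..15
  __m128i r  = _mm_packus_epi16(lo, hi);   // packuswb: 16 result bytes
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
#else
  // Scalar fallback with the same result.
  for (std::size_t i = 0; i < 16; ++i)
    out[i] = static_cast<std::uint8_t>(in[i]);
#endif
}
```

The four masks and three packs correspond to the four pand and three packuswb instructions in the listing above.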
I propose reducing the cost of “TRUNCATE v16i32 to v16i8” from 30 to 7 in the SSE4.1 table.
This will enable better vectorization, since “TRUNCATE v16i32 to v16i8” is no longer very expensive.
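For illustration, a narrowing loop of the following shape is the kind of source where the vectorizer has to price a vector truncate from i32 to i8. This is only a sketch: the function name narrow is hypothetical, and whether the loop is actually vectorized (and at what width) depends on the target, the flags, and the rest of the cost model.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical example: each iteration truncates one 32-bit value to 8 bits.
void narrow(const std::uint32_t* src, std::uint8_t* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    dst[i] = static_cast<std::uint8_t>(src[i]);  // per-element i32 -> i8 truncate
}
```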
To stop this happening in the future it may be better to just remove the entry from the SSE41 table to let it 'fall through' to the SSE2 entry. I think the MVT::v16i8/MVT::v16i16 entry can go as well.
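The 'fall through' idea can be sketched with a simplified model of per-feature cost tables that are consulted from the most specific feature level down. Everything here (CostEntry, findEntry, truncCost, the string-typed keys, and the default cost of 1) is hypothetical and is not LLVM's actual cost-table API. The SSE2 value of 7 is assumed to match the cost proposed above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a conversion-cost table entry.
struct CostEntry {
  std::string op, dst, src;
  unsigned cost;
};

// Return the first matching entry in a table, or nullptr if there is none.
const CostEntry* findEntry(const std::vector<CostEntry>& tbl, const std::string& op,
                           const std::string& dst, const std::string& src) {
  for (const CostEntry& e : tbl)
    if (e.op == op && e.dst == dst && e.src == src)
      return &e;
  return nullptr;
}

// Consult the SSE4.1 table first; if it has no entry, fall through to SSE2.
unsigned truncCost(const std::vector<CostEntry>& sse41, const std::vector<CostEntry>& sse2,
                   const std::string& dst, const std::string& src) {
  if (const CostEntry* e = findEntry(sse41, "TRUNCATE", dst, src))
    return e->cost;
  if (const CostEntry* e = findEntry(sse2, "TRUNCATE", dst, src))
    return e->cost;
  return 1;  // hypothetical default when no table has an entry
}
```

With a stale SSE4.1 entry of 30 the lookup returns 30 even after the SSE2 entry is updated. With the SSE4.1 entry removed, the same lookup returns the SSE2 value, so the two tables cannot drift apart.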