
X86 TRUNCATE (v16i32 to v16i8) cost change in SSE4.1 mode

Authored by ashutosh.nema on Apr 19 2016, 11:27 PM.



An earlier patch transforms truncations between vectors of integers into X86ISD::PACKUS/PACKSS operations during DAG combine.

That change improves code generation for the function below, saving 22 instructions.

define void @truncate_v16i32_to_v16i8(<16 x i32> %a) {
  %1 = trunc <16 x i32> %a to <16 x i8>
  store <16 x i8> %1, <16 x i8>* undef, align 4
  ret void
}

With the mentioned patch we generate better code for truncate, so the cost of truncate needed to change, but it looks like it was changed only in the SSE2 table.
The change also applies to SSE4.1, so the cost of truncate should be updated there as well.

Prior to the mentioned patch, in SSE4.1 mode we generated the following code:

        pextrb  $4, %xmm0, %eax
        pextrb  $8, %xmm0, %ecx
        pextrb  $12, %xmm0, %edx
        pinsrb  $1, %eax, %xmm0
        pinsrb  $2, %ecx, %xmm0
        pinsrb  $3, %edx, %xmm0
        pextrb  $0, %xmm1, %eax
        pinsrb  $4, %eax, %xmm0
        pextrb  $4, %xmm1, %eax
        pinsrb  $5, %eax, %xmm0
        pextrb  $8, %xmm1, %eax
        pinsrb  $6, %eax, %xmm0
        pextrb  $12, %xmm1, %eax
        pinsrb  $7, %eax, %xmm0
        pextrb  $0, %xmm2, %eax
        pinsrb  $8, %eax, %xmm0
        pextrb  $4, %xmm2, %eax
        pinsrb  $9, %eax, %xmm0
        pextrb  $8, %xmm2, %eax
        pinsrb  $10, %eax, %xmm0
        pextrb  $12, %xmm2, %eax
        pinsrb  $11, %eax, %xmm0
        pextrb  $0, %xmm3, %eax
        pinsrb  $12, %eax, %xmm0
        pextrb  $4, %xmm3, %eax
        pinsrb  $13, %eax, %xmm0
        pextrb  $8, %xmm3, %eax
        pinsrb  $14, %eax, %xmm0
        pextrb  $12, %xmm3, %eax
        pinsrb  $15, %eax, %xmm0
        movdqu  %xmm0, (%rax)
        retq

After that patch we generate better code:

        movdqa  .LCPI0_0(%rip), %xmm4   # xmm4 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
        pand    %xmm4, %xmm3
        pand    %xmm4, %xmm2
        packuswb        %xmm3, %xmm2
        pand    %xmm4, %xmm1
        pand    %xmm4, %xmm0
        packuswb        %xmm1, %xmm0
        packuswb        %xmm2, %xmm0
        movdqu  %xmm0, (%rax)
        retq
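To see why the mask-and-pack sequence above is a valid truncation, here is a small portable C++ model of it. It is only a sketch: the `Xmm` type and the helper names `fromDwords`, `pand`, `packuswb` and `truncateV16i32ToV16i8` are made up for illustration and emulate the instruction semantics in scalar code; they are not LLVM code or compiler intrinsics. Masking each 32-bit lane with 255 first guarantees that the unsigned saturation done by `packuswb` never changes a value, so the three packs simply gather the low bytes in order.

```cpp
#include <array>
#include <cstdint>

// Hypothetical model of a 128-bit XMM register as 16 little-endian bytes.
using Xmm = std::array<std::uint8_t, 16>;

// Load four 32-bit lanes into the modeled register.
Xmm fromDwords(const std::uint32_t *v) {
  Xmm r{};
  for (int lane = 0; lane < 4; ++lane)
    for (int byte = 0; byte < 4; ++byte)
      r[4 * lane + byte] = static_cast<std::uint8_t>(v[lane] >> (8 * byte));
  return r;
}

// pand: bitwise AND of two registers.
Xmm pand(const Xmm &a, const Xmm &b) {
  Xmm r{};
  for (int i = 0; i < 16; ++i)
    r[i] = a[i] & b[i];
  return r;
}

// packuswb: read each operand as 8 signed 16-bit words, saturate every word
// to [0, 255]; the destination fills the low 8 bytes, the source the high 8.
Xmm packuswb(const Xmm &dst, const Xmm &src) {
  auto saturate = [](const Xmm &reg, int word) -> std::uint8_t {
    const std::uint16_t bits =
        static_cast<std::uint16_t>(reg[2 * word] | (reg[2 * word + 1] << 8));
    const std::int16_t value = static_cast<std::int16_t>(bits);
    if (value < 0) return 0;
    if (value > 255) return 255;
    return static_cast<std::uint8_t>(value);
  };
  Xmm r{};
  for (int i = 0; i < 8; ++i) {
    r[i] = saturate(dst, i);
    r[8 + i] = saturate(src, i);
  }
  return r;
}

// Mirrors the generated code: 4 pand + 3 packuswb, i.e. 7 vector operations.
std::array<std::uint8_t, 16>
truncateV16i32ToV16i8(const std::array<std::uint32_t, 16> &a) {
  const std::uint32_t maskLanes[4] = {255, 255, 255, 255};
  const Xmm mask = fromDwords(maskLanes);
  const Xmm x0 = pand(fromDwords(&a[0]), mask);
  const Xmm x1 = pand(fromDwords(&a[4]), mask);
  const Xmm x2 = pand(fromDwords(&a[8]), mask);
  const Xmm x3 = pand(fromDwords(&a[12]), mask);
  const Xmm lo = packuswb(x0, x1); // elements 0..7, one per 16-bit word
  const Xmm hi = packuswb(x2, x3); // elements 8..15, one per 16-bit word
  return packuswb(lo, hi);         // all 16 truncated bytes
}
```

The four `pand` and three `packuswb` steps in the model line up with the cost of 7 discussed in this review; the constant load and the store are not counted there.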

I propose reducing the cost of “TRUNCATE v16i32 to v16i8” from 30 to 7 in the SSE4.1 table.
This should enable better vectorization, since “TRUNCATE v16i32 to v16i8” is no longer very expensive.


Event Timeline

ashutosh.nema retitled this revision from to X86 TRUNCATE (v16i32 to v16i8) cost change in SSE4.1 mode.
ashutosh.nema updated this object.
ashutosh.nema set the repository for this revision to rL LLVM.
ashutosh.nema added a subscriber: llvm-commits.
RKSimon added inline comments. Apr 20 2016, 7:17 AM

To stop this happening in the future it may be better to just remove the entry from the SSE41 table to let it 'fall through' to the SSE2 entry. I think the MVT::v16i8/MVT::v16i16 entry can go as well.
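The fall-through idea can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the actual X86 cost-model code: `CostEntry`, `CostTable`, `lookupCost` and `getTruncCost` are hypothetical names, the tables hold only the one entry discussed here, and the SSE2 entry is assumed to carry the cost of 7 mentioned in this review. The point is only that a stale entry in the more specific table shadows the updated one, while removing it lets the lookup reach the SSE2 table.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a conversion cost-table entry (not LLVM's type).
struct CostEntry {
  std::string Opcode;
  std::string DstTy;
  std::string SrcTy;
  unsigned Cost;
};
using CostTable = std::vector<CostEntry>;

// Return the cost of the first matching entry, or nothing if none matches.
std::optional<unsigned> lookupCost(const CostTable &Table,
                                   const std::string &Opcode,
                                   const std::string &DstTy,
                                   const std::string &SrcTy) {
  for (const CostEntry &E : Table)
    if (E.Opcode == Opcode && E.DstTy == DstTy && E.SrcTy == SrcTy)
      return E.Cost;
  return std::nullopt;
}

// Consult the more specific SSE4.1 table first, then fall through to SSE2.
unsigned getTruncCost(const CostTable &SSE41Table, const CostTable &SSE2Table) {
  if (auto C = lookupCost(SSE41Table, "TRUNCATE", "v16i8", "v16i32"))
    return *C;
  if (auto C = lookupCost(SSE2Table, "TRUNCATE", "v16i8", "v16i32"))
    return *C;
  return 1; // Placeholder default, chosen only for this sketch.
}
```

With a stale SSE4.1 entry of 30 the lookup stops there; with that entry removed, the same query returns the SSE2 cost, so a future update to the SSE2 table cannot be missed for SSE4.1.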

Incorporated comments from Simon.

RKSimon accepted this revision. Apr 21 2016, 5:20 AM
RKSimon edited edge metadata.


This revision is now accepted and ready to land. Apr 21 2016, 5:20 AM

Thanks, Simon, for the review. This change landed at revision 267123.