Page MenuHomePhabricator

[X86][SSE] Transform truncation from v8i32/v16i32 to v8i8/v16i8 into bitand and X86ISD::PACKUS operations during DAG combine.
ClosedPublic

Authored by congh on Nov 11 2015, 2:11 PM.

Details

Summary

This patch transforms truncation from v8i32/v16i32 to v8i8/v16i8 into bitand and X86ISD::PACKUS operations during DAG combine. We don't do it in lowering phase because after type legalization, the original truncation will be turned into a BUILD_VECTOR with each element that is extracted from a vector and then truncated, and from them it is difficult to do this optimization. This greatly improves the performance of those two truncations. For example, for the following IR:

define void @truncate_v16i32_to_v16i8(<16 x i32> %a) {

%1 = trunc <16 x i32> %a to <16 x i8>
store <16 x i8> %1, <16 x i8>* undef, align 4
ret void

}

On SSE2 previously it will be compiled into 33 instructions:

movdqa %xmm3, -24(%rsp)
movdqa %xmm1, -56(%rsp)
movdqa %xmm2, -40(%rsp)
movdqa %xmm0, -72(%rsp)
punpcklbw %xmm3, %xmm1
punpcklbw %xmm2, %xmm0
punpcklbw %xmm1, %xmm0
movd -20(%rsp), %xmm1
movd -52(%rsp), %xmm2
movd -16(%rsp), %xmm3
movd -48(%rsp), %xmm4
punpcklbw %xmm3, %xmm4
movd -36(%rsp), %xmm3
movd -68(%rsp), %xmm5
movd -32(%rsp), %xmm6
movd -64(%rsp), %xmm7
punpcklbw %xmm6, %xmm7
punpcklbw %xmm4, %xmm7
punpcklbw %xmm7, %xmm0
punpcklbw %xmm1, %xmm2
punpcklbw %xmm3, %xmm5
punpcklbw %xmm2, %xmm5
movd -12(%rsp), %xmm1
movd -44(%rsp), %xmm2
punpcklbw %xmm1, %xmm2
movd -28(%rsp), %xmm1
movd -60(%rsp), %xmm3
punpcklbw %xmm1, %xmm3
punpcklbw %xmm2, %xmm3
punpcklbw %xmm3, %xmm5
punpcklbw %xmm5, %xmm0
movdqu %xmm0, (%rax)
retq

and now it is compiled into 10 instructions:

movdqa LCPI0_0(%rip), %xmm4
pand %xmm4, %xmm3
pand %xmm4, %xmm2
packuswb %xmm3, %xmm2
pand %xmm4, %xmm1
pand %xmm4, %xmm0
packuswb %xmm1, %xmm0
packuswb %xmm2, %xmm0
movdqu %xmm0, (%rax)
retq

which saves 22 instructions (many of them are memops).

Diff Detail

Repository
rL LLVM

Event Timeline

congh updated this revision to Diff 39971.Nov 11 2015, 2:11 PM
congh retitled this revision from to [X86][SSE] Transform truncation from v8i32/v16i32 to v8i8/v16i8 into bitand and X86ISD::PACKUS operations during DAG combine..
congh updated this object.
congh added reviewers: hfinkel, dexonsmith, RKSimon, davidxl.
congh added a subscriber: llvm-commits.
RKSimon edited edge metadata.Nov 14 2015, 7:34 AM

I've added the current codegen to the vector-trunc.ll tests for comparison so please can you rebase against that?

I wonder if it would be better to combine to bitcast/shuffle pairs instead of specific X86ISD nodes? And then focus on improving the existing shuffle lowering with PACKUS (e.g. I don't think we're making use of PACKUSDW at all yet).

I've added the current codegen to the vector-trunc.ll tests for comparison so please can you rebase against that?

I have rebased my repo and all tests are passes with this patch.

I wonder if it would be better to combine to bitcast/shuffle pairs instead of specific X86ISD nodes? And then focus on improving the existing shuffle lowering with PACKUS (e.g. I don't think we're making use of PACKUSDW at all yet).

Without this patch, the truncation from v16i32 to v16i8 is first converted to many extract-element, truncation on scalars, and a build_vector after type legalization, which is very difficult to be lowered into PACKUS (at least the lowering is far from elegant). So what you are suggesting is first combining it into a shuffle on v64i8 that is bitcast from v16i32, right? I need to try this method and see if it is easy to lower the instructions after type legalization.

I have attempted to combine the truncation from v16i32 to v16i8 into bitcasts and shuffles, and found that after type legalization they are still converted into extract_vector_elt and BUILD_VECTOR (shown below). In order to do the same optimization, we have to check the pattern again on them and I think the resulted code is tedious. So why not do the same thing earlier with a simpler approach?

t0: ch = EntryToken
  t6: v4i32,ch = CopyFromReg t0, Register:v4i32 %vreg2
t33: v16i8 = bitcast t6
  t8: v4i32,ch = CopyFromReg t0, Register:v4i32 %vreg3
t34: v16i8 = bitcast t8
  t2: v4i32,ch = CopyFromReg t0, Register:v4i32 %vreg0
t35: v16i8 = bitcast t2
  t4: v4i32,ch = CopyFromReg t0, Register:v4i32 %vreg1
t36: v16i8 = bitcast t4
      t37: i8 = extract_vector_elt t35, Constant:i64<0>
      t39: i8 = extract_vector_elt t35, Constant:i64<4>
      t41: i8 = extract_vector_elt t35, Constant:i64<8>
      t43: i8 = extract_vector_elt t35, Constant:i64<12>
      t44: i8 = extract_vector_elt t36, Constant:i64<0>
      t45: i8 = extract_vector_elt t36, Constant:i64<4>
      t46: i8 = extract_vector_elt t36, Constant:i64<8>
      t47: i8 = extract_vector_elt t36, Constant:i64<12>
      t48: i8 = extract_vector_elt t33, Constant:i64<0>
      t49: i8 = extract_vector_elt t33, Constant:i64<4>
      t50: i8 = extract_vector_elt t33, Constant:i64<8>
      t51: i8 = extract_vector_elt t33, Constant:i64<12>
      t52: i8 = extract_vector_elt t34, Constant:i64<0>
      t53: i8 = extract_vector_elt t34, Constant:i64<4>
      t54: i8 = extract_vector_elt t34, Constant:i64<8>
      t55: i8 = extract_vector_elt t34, Constant:i64<12>
    t56: v16i8 = BUILD_VECTOR t37, t39, t41, t43, t44, t45, t46, t47, t48, t49, t50, t51, t52, t53, t54, t55

I've added the current codegen to the vector-trunc.ll tests for comparison so please can you rebase against that?

I have rebased my repo and all tests are passes with this patch.

Your patch still doesn't appear to be based on rL253132

congh updated this revision to Diff 40334.Nov 16 2015, 1:36 PM
congh edited edge metadata.

Update the patch after merging with trunk.

congh added a comment.Nov 16 2015, 1:37 PM

I've added the current codegen to the vector-trunc.ll tests for comparison so please can you rebase against that?

I have rebased my repo and all tests are passes with this patch.

Your patch still doesn't appear to be based on rL253132

Ah, sorry for forgetting to update the patch. Updated.

congh updated this revision to Diff 40335.Nov 16 2015, 1:39 PM

Fix commented code.

I think you need to rebase after rL253952

You should be able to support packing with packssdw as well - it would allow us to truncate vXi32 -> vXi16 on SSE2 targets

test/CodeGen/X86/vector-trunc.ll
270 ↗(On Diff #40335)

Aren't the SSE2/SSSE3/SSSE41 checks the same now? Shouldn't they just use SSE-CHECK ? Try to use update_llc_test_checks.py if you can.

congh updated this revision to Diff 41113.Nov 25 2015, 12:13 AM

Update the patch according to Simon's comments.

The optimization in this patch now covers more truncations with vXi32/vXi64 to vXi8/vXi16 for X >= 8.

Update the patch according to Simon's comments.

The optimization in this patch now covers more truncations with vXi32/vXi64 to vXi8/vXi16 for X >= 8.

Agree

I think you need to rebase after rL253952

You should be able to support packing with packssdw as well - it would allow us to truncate vXi32 -> vXi16 on SSE2 targets

Agree. I have updated the patch to support this truncation as well. PTAL.

test/CodeGen/X86/vector-trunc.ll
270 ↗(On Diff #40335)

Thanks so much for the advice of using update_llc_test_checks.py! It automatically combines tests of SSE2/SSSE3/SSSE41 into SSE test only.

Please can you rebase this - your PAVG patches will have affected PerformTRUNCATECombine etc.

Please can you rebase this - your PAVG patches will have affected PerformTRUNCATECombine etc.

OK. I just found my implementation of using PACKSS is incorrect in this patch. I need to use shift-left then shift-right to set the bits of the higher part for negative values before using PACKSS. I will fix it soon.

congh updated this revision to Diff 41368.Nov 30 2015, 12:26 AM

Fix the bug when truncating v4i32 to v2i16 using PACKSS: we need to set proper sign bits by using shift-left plus ashift-right.

When truncating from v2i64, first use SHUFP to extract lower i32 parts from 2 x v2i64 to 1 x v4i32.

Please can you rebase this - your PAVG patches will have affected PerformTRUNCATECombine etc.

I have rebased my patch and also fixed the bug in the patch (see the description of the patch update). PTAL. Thanks!

RKSimon added inline comments.Nov 30 2015, 5:36 AM
lib/Target/X86/X86ISelLowering.cpp
26229 ↗(On Diff #41368)

I understand that packus(ymm) won't do what we want but - won't AVX2 still benefit for cases where packus(xmm) is used? Why not just early out if (!hasSSE2 || VT.getSizeInBits() > 128) ?

26230 ↗(On Diff #41368)

Please can you add AVX512 as a test target to prove that its using vpmovdb etc.

26269 ↗(On Diff #41368)

Please don't use domain switches, they can cause massive stalls on pipes. Why not just use DAG.getVectorShuffle()?

congh added inline comments.Nov 30 2015, 12:23 PM
lib/Target/X86/X86ISelLowering.cpp
26229 ↗(On Diff #41368)

I have tested the instructions generated on AVX2 by using your suggested method and found we can save one instruction for v16i32 to v16i8 truncation. However, I found in this case we could still use ymm to get better results: what we need to do is doing packus on ymm but later shuffle the result. This can save us more than 40% instructions . I think this can be a follow-up patch.

26230 ↗(On Diff #41368)

OK. I have updated the test file.

26269 ↗(On Diff #41368)

Thanks for pointing out this problem! Unfortunately, there is no similar instruction as shufps for integers. Therefore, I could not find any better solution for vXi64 to vXi16 truncation and hence have removed this case from this patch.

congh updated this revision to Diff 41424.Nov 30 2015, 12:29 PM

Stop using SHUFP as it is for floating points and can cause domain switch.

Also update the test case by adding AVX512 test and v16i64 to v16i8 truncation.

Almost there I think - just a suggestion for the trunc8i64_8i8 test

test/CodeGen/X86/vector-trunc.ll
240 ↗(On Diff #41424)

Please can you change this to a store <8 x i8> instead? Legalizing a <8 x i8> return means that its technically a <8 x i16> with undef upper bytes - which makes the truncation shuffle code hardware to track, especially on pre SSE41.

congh updated this revision to Diff 42356.Dec 9 2015, 4:49 PM

Update a test case as suggested by Simon.

congh updated this revision to Diff 42357.Dec 9 2015, 4:50 PM

The previous update is incorrect. Update it again.

congh added inline comments.Dec 9 2015, 4:52 PM
test/CodeGen/X86/vector-trunc.ll
240 ↗(On Diff #41424)

You are right. I have update this test case.

Ping again?

RKSimon accepted this revision.Dec 19 2015, 9:01 AM
RKSimon edited edge metadata.

LGTM with a couple of (optional) minor corrections.

lib/Target/X86/X86ISelLowering.cpp
26141 ↗(On Diff #42357)

Maybe use the same naming convention for the out / in VTs?

OutVT + OutSVT
InVT + InSVT

Makes it easier to track.

26146 ↗(On Diff #42357)

Can you please assert that OutSVT is a i8 / i16 before this? Maybe use APInt to create the mask?

test/CodeGen/X86/vector-trunc.ll
156 ↗(On Diff #42357)

Its a pity that the SSSE3/SSE41 codegen is still so poor - can you add a FIXME explaining what still needs doing?

This revision is now accepted and ready to land.Dec 19 2015, 9:01 AM
congh marked 2 inline comments as done.Dec 21 2015, 11:38 AM

Thanks a lot for the review, Simon!

test/CodeGen/X86/vector-trunc.ll
156 ↗(On Diff #42357)

I'll adjust the conditions to let this SSSE3/SSE41 codegen be as good as SSE2.

This revision was automatically updated to reflect the committed changes.