This is an archive of the discontinued LLVM Phabricator instance.

[PATCH, PR24373] Combine shifts for x86
ClosedPublic

Authored by evstupac on Sep 25 2015, 5:01 AM.

Details

Summary

The patch folds

(ashr (shl, a, [56,48,32,24,16]), SarConst)

into

(shl, (sext (a), [56,48,32,24,16] - SarConst))

or into

(lshr, (sext (a), SarConst - [56,48,32,24,16]))

depending on sign of (SarConst - [56,48,32,24,16])

sexts in X86 are MOVs. The MOVs have the same code size as above SHIFTs (only SHIFT on 1 has lower code size).
However the MOVs have 2 advantages to SHIFTs on x86:

  1. MOVs can write to a register that differs from source
  2. MOVs accept memory operands

Diff Detail

Repository
rL LLVM

Event Timeline

evstupac updated this revision to Diff 35714.Sep 25 2015, 5:01 AM
evstupac retitled this revision from to [PATCH, PR24373] Combine shifts for x86.
evstupac updated this object.
evstupac added reviewers: llvm-commits, nadav, majnemer.
evstupac set the repository for this revision to rL LLVM.
evstupac added a subscriber: llvm-commits.
RKSimon edited edge metadata.Sep 29 2015, 11:55 AM

Do you have any before/after perf timings?

test/CodeGen/X86/sar_fold.ll
3

Load tests?

Possibly regenerate this using update_llc_test_checks.py?

test/CodeGen/X86/sar_fold64.ll
2

Load tests?

Possibly regenerate this using update_llc_test_checks.py?

Do you have any before/after perf timings?

Yes. Spec2000 performance is almost flat.
Unit performance tests get up to 40% gain.
The patch fixes regression in PR24373.

evstupac added inline comments.Sep 29 2015, 3:18 PM
test/CodeGen/X86/sar_fold.ll
3

The test checks if "(a<<16)>>17" is folded to movswl and any of possible variant of "<<1". It could be "add %eax, %eax", "shl %eax" or even "lea". The test is to check only folding to movswl. Regenerating the test using update_llc_test_checks.py will make it less flexible.

test/CodeGen/X86/sar_fold64.ll
2

Yes.

I do wonder if this could be beneficial for other targets - possibly moving this to DAGCombiner and using a a test against isExtFree() or similar?

Also, please can you add tests against the load-execute versions movs*?

I do wonder if this could be beneficial for other targets - possibly moving this to DAGCombiner and using a a test against isExtFree() or similar?

That could be, however that is not obvious. I know that for Arm shifts could go in addition to the logic instructions. That they it could be better to leave shifts. Anyway keeping IR target independent is better. Note that at some point IR expand sext to pair of shifts (this, I think, simplify IR for further optimizations):

InstCombineCasts.cpp, visitSExt function// We need to emit a shl + ashr to do the sign extend.

Also, please can you add tests against the load-execute versions movs*?

Do you mean segment moves? If so there are no sar/shl pair that could be folded to such mov.

Also, please can you add tests against the load-execute versions movs*?

Do you mean segment moves? If so there are no sar/shl pair that could be folded to such mov.

I meant something that tested that a load + shl + ashr pattern gets combined to a folded movs** instruction

Also, please can you add tests against the load-execute versions movs*?

Do you mean segment moves? If so there are no sar/shl pair that could be folded to such mov.

I meant something that tested that a load + shl + ashr pattern gets combined to a folded movs** instruction

Ok. Good point. Actually that is what "test/CodeGen/X86/sar_fold.ll" test now, as in 32 bit mode parameter "i32 %a" goes from stack. Let's add CHECK-NEXT after BB0:
; CHECK: # BB#0:
; CHECK-NEXT: movswl {{[0-9]+}}(%esp), %eax
to be sure that there is no simple "movl" from stack.

evstupac updated this revision to Diff 36619.Oct 6 2015, 6:34 AM
evstupac edited edge metadata.

Add check for the tests.

RKSimon accepted this revision.Oct 11 2015, 7:02 AM
RKSimon edited edge metadata.

LGTM

This revision is now accepted and ready to land.Oct 11 2015, 7:02 AM
evstupac updated this revision to Diff 37349.Oct 14 2015, 7:15 AM
evstupac edited edge metadata.

During final check 2 new tests failed on x86:

  1. sar_fold64.ll failed on Windows. Fixed by using regular expression for parameter register:

"movswq %{{[cd][xi]}}, %rax" as parameter could be "cx" or "di"
and
"movsbq %{{[cdi]*l}}, %rax" as parameter could be "cl" or "dil"

  1. recently modified "vector-sext.ll" also failed as contains appropriate for folding shr/shl combinations.

This case is less obvious as for some reason the folding influence on scheduling pass.
I'm not happy with such dramatic change on scheduler pass, however I believe that should be fixed by another patch, not related to instructions combine. I've also checked the performance of changed function "load_sext_16i1_to_16i16" and it is the same with and without the patch.
I'll submit corresponding bug report after commit.

Is the patch still ok?

During final check 2 new tests failed on x86:

  1. sar_fold64.ll failed on Windows. Fixed by using regular expression for parameter register:

"movswq %{{[cd][xi]}}, %rax" as parameter could be "cx" or "di"
and
"movsbq %{{[cdi]*l}}, %rax" as parameter could be "cl" or "dil"

The easier way to fix those problems is to force a target triple to the tests.
What if you explicitly force -mtriple=x86_64-unknown-unknown to test sar_fold64.ll and -mtriple=i686-unknown-unknown to sar_fold.ll?

About the vector-sext.ll failures,
are those failures only related to the last RUN line in the file (the i686 run line)? Does the problem disappear if you replace -mcpu=i686 with -mcpu=generic ?

The easier way to fix those problems is to force a target triple to the tests.
What if you explicitly force -mtriple=x86_64-unknown-unknown to test sar_fold64.ll and -mtriple=i686-unknown-unknown to sar_fold.ll?

Yes. This also works.

About the vector-sext.ll failures,
are those failures only related to the last RUN line in the file (the i686 run line)? Does the problem disappear if you replace -mcpu=i686 with -mcpu=generic ?

No. The are related to AVX and AVX2 x86-64 lines. That way replace -mcpu=i686 by -mcpu=generic does not help.

RKSimon requested changes to this revision.Oct 14 2015, 10:46 AM
RKSimon edited edge metadata.

OK - it looks like some additional changes are required - comments below. I agree that the scheduling issue isn't directly tied to this patch, but you need to at least create a bugzilla with a minimal repro.

lib/Target/X86/X86ISelLowering.cpp
23489

I think you will need to add a test for N0.hasOneUse() as well here - otherwise there is a likely chance that the SHL will still need to be performed.

test/CodeGen/X86/sar_fold.ll
4

Please can you replace the march with a mtriple?

test/CodeGen/X86/sar_fold64.ll
3

Please can you replace the march with a mtriple?

test/CodeGen/X86/vector-sext.ll
1615 ↗(On Diff #37349)

Annotating with nounwind readnone should help here.

This revision now requires changes to proceed.Oct 14 2015, 10:46 AM
evstupac updated this revision to Diff 37398.Oct 14 2015, 3:18 PM
evstupac edited edge metadata.

replace march by mtriple in tests
add "nounwind readnone" to "@load_sext_16i1_to_16i16" function in vector-sext.ll test
add "!N0.hasOneUse()" to early exit from the folding

My concern is the massive increase in register pressure in load_sext_16i1_to_16i16.

We can certainly improve vXi1 -> vXiY sign extension lowering (it should be vectorizable using a broadcast + variable shl/mul + immediate sra) but I'm worried that there will be other similar cases that we just don't see in the tests.

test/CodeGen/X86/sar_fold.ll
2

Remove the -O2

test/CodeGen/X86/sar_fold64.ll
2

Remove the -O2

test/CodeGen/X86/vector-sext.ll
1615 ↗(On Diff #37398)

Turns out it didn't help ;-(

evstupac updated this revision to Diff 38234.Oct 23 2015, 6:22 AM
evstupac edited edge metadata.

"-O2" removed from sar_fold* tests.
vector-sext updated without "nounwind" attribute

delena added a subscriber: delena.Oct 31 2015, 9:17 AM
nadav edited edge metadata.Nov 5 2015, 9:04 AM

Did you measure the performance impact of this patch on the llvm test suite (or SPEC or other test suite?). Is this a win?

Yes. Spec2000 performance is almost flat.
Unit performance tests get up to 40% gain.
The patch fixes regression in PR24373.

nadav added a subscriber: nadav.Nov 12 2015, 9:20 AM

I did not get a chance to review this patch carefully. Andrea, Simon, David, Elena, Sanjay, did you get a chance to review the patch? Does it look okay? I did not see a LGTM in the thread.

Simon accepted this patch Oct 11 2015, 7:02 AM with LGTM.
But during the review a test requiring changes was added to LLVM tests. I've fixed it and now waiting for new approve.

andreadb accepted this revision.Nov 12 2015, 11:00 AM
andreadb added a reviewer: andreadb.

The change looks good to me.

Thanks for measuring the performances after this change.
I agree with you that the poor codegen caused by a suboptimal scheduling of instructions in test 'load_sext_16i1_to_16i16' can be addressed by a later patch.
However, please file a bug for it so that we don't lose track of that problem.

lib/Target/X86/X86ISelLowering.cpp
23516

You can remove this else after return.

Can you review the tests please - extra vXi1 sextload tests have been added to vector-sext.ll recently

Can you review the tests please - extra vXi1 sextload tests have been added to vector-sext.ll recently

Yes there is valuable change in the test.
I'll update the it before commit.
The issue with scheduling could be hided by adding "-enable-misched=0" option.
Anyway I'm going to file a bug after commit.

evstupac updated this revision to Diff 40651.Nov 19 2015, 8:19 AM
evstupac edited edge metadata.

Patch with updated "vector-sext.ll" that will be committed,

evstupac updated this revision to Diff 40654.Nov 19 2015, 8:24 AM

miss new "test/CodeGen/X86/sar_fold.ll" and "test/CodeGen/X86/sar_fold64.ll" while updating patch.

RKSimon accepted this revision.Nov 22 2015, 10:08 AM
RKSimon edited edge metadata.

LGTM

BTW I have a patch to improve the vXi1 sext codegen, I'll be putting it up for review soon.

This revision is now accepted and ready to land.Nov 22 2015, 10:08 AM
This revision was automatically updated to reflect the committed changes.