Download Raw Diff

Details

Reviewers

t.p.northover
greened
anhtuyen
efriedma
dmgreen
samparker
SjoerdMeijer
gchatelet

Commits

rG85342c27a303: [ARM] Optimize immediate selection

Summary

Optimize some specific immediates selection via breaking to two parts
against loading them from the constant pool.

Diff Detail

Event Timeline

benshi001 created this revision.Jul 13 2020, 10:34 PM

Herald added a project: Restricted Project. · View Herald TranscriptJul 13 2020, 10:34 PM

Herald added subscribers: llvm-commits, danielkiss, hiraditya, kristof.beyls. · View Herald Transcript

Harbormaster failed remote builds in B64102: Diff 277667!Jul 13 2020, 11:10 PM

benshi001 updated this revision to Diff 277678.Jul 14 2020, 12:04 AM

benshi001 updated this revision to Diff 277681.Jul 14 2020, 12:14 AM

gchatelet resigned from this revision.Jul 15 2020, 1:23 AM

I am not entirely sure anymore, but I thought I had looked into once, or at least something similar. I think the problem here is that if the constant is used only once like in the test, it is a clear win, but has soon as there are multiple uses than it's not better and probably worse. Also, with only one use, the code-size difference is neutral, but that won't be the case with multiple uses. So, I guess this work needs benchmark numbers, unless other reviewers that I've added can immediately tell if this is good or bad.

benshi001 added a comment.Jul 15 2020, 7:04 AM

This comment was removed by benshi001.

benshi001 added a comment.Jul 15 2020, 7:13 AM

This comment was removed by benshi001.

Is there a benchmark test suite I can try?

If you think my change will be worse if there are multiple uses of the constant, then the same thing will happen for plus constant matches ARM_AM::isSOImmTwoPartVal. I think both positive and negative SOImmTwoPartVal constant should be handled in the same way.

My change also benefits the following code:

%x=%y - 0x2323
previous llvm will generate 12 bytes:
ldr ...
sub ...
an item in the constant pool

but my change simplifies it to only 8 bytes:
sub %x, %y, 0x2300
sub %x, %x, 0x23
(I did not add a test case for that)

In D83745#2152988, @SjoerdMeijer wrote:

I am not entirely sure anymore, but I thought I had looked into once, or at least something similar. I think the problem here is that if the constant is used only once like in the test, it is a clear win, but has soon as there are multiple uses than it's not better and probably worse. Also, with only one use, the code-size difference is neutral, but that won't be the case with multiple uses. So, I guess this work needs benchmark numbers, unless other reviewers that I've added can immediately tell if this is good or bad.

for C code

unsigned int f0(unsigned int a)
{
        return a-0x2323;
}
unsigned int f1(unsigned int a) 
{
        return a+0x2323;
}

GCC generates

f0:
      sub     r0, r0, #8960
      sub     r0, r0, #35
      bx      lr
f1:
      add     r0, r0, #8960
      add     r0, r0, #35
      bx      lr

but llvm generates

f0:
      ldr     r1, .LCPI0_0
      add     r0, r0, r1
      bx      lr
.LCPI0_0:
      .long   4294958301                      @ 0xffffdcdd
f1:
      add     r0, r0, #35
      add     r0, r0, #8960
      bx      lr

And my patch will make llvm follows GCC's behavior.

Okay, I had only a quick first look, perhaps multiple uses isn't a problem.
If it's not too much trouble, confirming this with the llvm test-suite would be good.

benshi001 added a comment.Jul 15 2020, 9:14 PM

This comment was removed by benshi001.

In D83745#2153885, @SjoerdMeijer wrote:

Okay, I had only a quick first look, perhaps multiple uses isn't a problem.
If it's not too much trouble, confirming this with the llvm test-suite would be good.

I have added a new test case, https://reviews.llvm.org/D83928.

We can merge that test first, then we will see how this patch improves armv6's code quality.

benshi001 updated this revision to Diff 279073.Jul 19 2020, 5:13 AM

benshi001 edited the summary of this revision. (Show Details)

ping

Eye balling the one test that was changed this indeed makes sense. But I would prefer to see some performance numbers first just to check we haven't missed anything.

In D83745#2175668, @SjoerdMeijer wrote:

Eye balling the one test that was changed this indeed makes sense. But I would prefer to see some performance numbers first just to check we haven't missed anything.

Is there a performance test suite for llvm/arm I can try? Or just a small piece of code running 1000 times is OK enough?
Actually I have made similar optimization for golang, golang's official go1 benchmark score shows slight improvement.

https://github.com/golang/go/commit/6897030fe3de43bbed48adb72f21a6c2d00042cd

In D83745#2175668, @SjoerdMeijer wrote:

Eye balling the one test that was changed this indeed makes sense. But I would prefer to see some performance numbers first just to check we haven't missed anything.

Another improve in my patch is that

define i32 @sub2(i32 %0) {
  %2 = sub i32 %0, 8995
  ret i32 %2
}

Current llvm will put the imm 8995 in the constant pool and generate a LDR.

My patch will optimize it to

; CHECK-NEXT:    sub r0, r0, #35
; CHECK-NEXT:    sub r0, r0, #8960

That can be seen if https://reviews.llvm.org/D83928 is merged.

In D83745#2175933, @benshi001 wrote:

In D83745#2175668, @SjoerdMeijer wrote:

Eye balling the one test that was changed this indeed makes sense. But I would prefer to see some performance numbers first just to check we haven't missed anything.

Is there a performance test suite for llvm/arm I can try? Or just a small piece of code running 1000 times is OK enough?
Actually I have made similar optimization for golang, golang's official go1 benchmark score shows slight improvement.

https://github.com/golang/go/commit/6897030fe3de43bbed48adb72f21a6c2d00042cd

Briefly looking at those numbers, that shows a bit of up and down behaviour. And while the geomean's show a small improvement, I don't find it quite convincing... i.e., not yet. Hence my request for some numbers. But first of all I am curious to know what causes the regressions? In other words, what are we missing with this patch, and can that be mitigated, or what kind of knock on effects does this have?
There is the llvm test-suite, see e.g. https://llvm.org/docs/TestSuiteGuide.html. If you can run that, that would be good.

And kind of obsolete benchmarks, but easy to run are coremark and dhrystone. That would be good as a first finger on the pulse too.

I understood from your message to the llvm dev list that generating numbers isn't entirely straightforward for you.
I will take this patch, and generate some numbers, will see if I can do that today/tomorrow, and will let you know.

In D83745#2178363, @SjoerdMeijer wrote:

I understood from your message to the llvm dev list that generating numbers isn't entirely straightforward for you.
I will take this patch, and generate some numbers, will see if I can do that today/tomorrow, and will let you know.

Thanks so much for your help. I have another patch https://reviews.llvm.org/D84100 also does immediate optimization.

I should combine them into one, but smaller changes are easy to review and test.

So I quite appreciate you if you can also have a look at it.

In D83745#2178363, @SjoerdMeijer wrote:

I understood from your message to the llvm dev list that generating numbers isn't entirely straightforward for you.
I will take this patch, and generate some numbers, will see if I can do that today/tomorrow, and will let you know.

What's more, I do not expect there is any performance improvement by my patch, but I do expect there should be
code size reduction of the test suite by my patch. That's also a benefit to llvm-arm.

Think I shot myself a little bit in the foot here ;-). Our group focuses mostly on M-cores, and running A32 code wasn't as straightforward as I'd have hoped. I eyeballed codegen differences, and didn't see anything concerning. So I don't have data for making further objections.

So LGTM for now.

This revision is now accepted and ready to land.Jul 28 2020, 11:17 AM

Closed by commit rG85342c27a303: [ARM] Optimize immediate selection (authored by SjoerdMeijer). · Explain WhyJul 29 2020, 5:30 AM

This revision was automatically updated to reflect the committed changes.

SjoerdMeijer added a commit: rG85342c27a303: [ARM] Optimize immediate selection.

SjoerdMeijer mentioned this in D83928: [ARM][TEST] Add a new test case of add-imm & sub-imm.Jul 29 2020, 5:32 AM

This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Optimize immediate selection
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 277678

llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp

llvm/lib/Target/ARM/ARMExpandPseudoInsts.cpp

llvm/lib/Target/ARM/ARMInstrInfo.td

llvm/test/CodeGen/ARM/select-imm.ll

This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Optimize immediate selectionClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 277678

llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp

llvm/lib/Target/ARM/ARMExpandPseudoInsts.cpp

llvm/lib/Target/ARM/ARMInstrInfo.td

llvm/test/CodeGen/ARM/select-imm.ll

[ARM] Optimize immediate selection
ClosedPublic