This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Optimize immediate selection
ClosedPublic

Authored by benshi001 on Jul 13 2020, 10:34 PM.

Details

Summary

Optimize the selection of some specific immediates by splitting them into two parts,
instead of loading them from the constant pool.

Diff Detail

Event Timeline

benshi001 created this revision.Jul 13 2020, 10:34 PM
gchatelet resigned from this revision.Jul 15 2020, 1:23 AM
SjoerdMeijer added a subscriber: SjoerdMeijer.

I am not entirely sure anymore, but I thought I had looked into this once, or at least something similar. I think the problem here is that if the constant is used only once, like in the test, it is a clear win, but as soon as there are multiple uses it is not better and probably worse. Also, with only one use the code-size difference is neutral, but that won't be the case with multiple uses. So I guess this work needs benchmark numbers, unless the other reviewers that I've added can immediately tell if this is good or bad.
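As a hypothetical illustration of that concern (the function below is made up for this discussion, not taken from the patch or its tests): when the same constant has multiple uses, a single constant-pool load can be reused for all of them, while the two-part split repeats its two instructions at every use.

unsigned int multi_use(unsigned int a, unsigned int b)
{
        // Both subtractions use 0x2323. With a constant-pool load, the
        // loaded register can be shared between the two uses; with the
        // two-part split, each use costs two sub instructions of its own.
        return (a - 0x2323) ^ (b - 0x2323);
}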

This comment was removed by benshi001.
This comment was removed by benshi001.
  1. Is there a benchmark test suite I can try?
  2. If you think my change will be worse when there are multiple uses of the constant, then the same issue already applies to positive constants that match ARM_AM::isSOImmTwoPartVal. I think both positive and negative SOImmTwoPartVal constants should be handled in the same way (see the standalone sketch after the example below).
  3. My change also benefits the following code:

%x = %y - 0x2323
Previously, llvm would generate 12 bytes:
ldr ...
sub ...
plus an item in the constant pool.

But my change simplifies it to only 8 bytes:
sub %x, %y, 0x2300
sub %x, %x, 0x23
(I did not add a test case for that.)
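For reference, here is a minimal standalone sketch of the kind of check being discussed. It re-implements the idea for illustration only; in the tree the real helpers are ARM_AM::getSOImmVal, ARM_AM::isSOImmTwoPartVal, ARM_AM::getSOImmTwoPartFirst and ARM_AM::getSOImmTwoPartSecond, whose exact algorithm differs.

#include <cstdint>
#include <cstdio>

// Rotate a 32-bit value right by N bits (N in [0, 31]).
static uint32_t rotr32(uint32_t V, unsigned N) {
  return N == 0 ? V : (V >> N) | (V << (32 - N));
}

// True if V fits the ARM "modified immediate" form: an 8-bit value
// rotated right by an even amount (the in-tree check is ARM_AM::getSOImmVal).
static bool isModifiedImm(uint32_t V) {
  for (unsigned Rot = 0; Rot < 32; Rot += 2)
    if ((V & ~rotr32(0xFFu, Rot)) == 0)
      return true;
  return false;
}

// Illustrative only (not the in-tree algorithm): try to express V as the
// sum of two modified immediates by peeling off one 8-bit field.
static bool splitTwoPart(uint32_t V, uint32_t &First, uint32_t &Second) {
  for (unsigned Rot = 0; Rot < 32; Rot += 2) {
    uint32_t Part = V & rotr32(0xFFu, Rot);
    if (Part != 0 && isModifiedImm(V - Part)) {
      First = Part;
      Second = V - Part;
      return true;
    }
  }
  return false;
}

int main() {
  uint32_t First, Second;
  // 0x2323 is not a single modified immediate, but it splits into
  // 0x23 + 0x2300, so two sub (or add) instructions can replace the
  // constant-pool load.
  if (!isModifiedImm(0x2323) && splitTwoPart(0x2323, First, Second))
    std::printf("0x2323 = 0x%x + 0x%x\n", (unsigned)First, (unsigned)Second);
  return 0;
}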

I am not entirely sure anymore, but I thought I had looked into this once, or at least something similar. I think the problem here is that if the constant is used only once, like in the test, it is a clear win, but as soon as there are multiple uses it is not better and probably worse. Also, with only one use the code-size difference is neutral, but that won't be the case with multiple uses. So I guess this work needs benchmark numbers, unless the other reviewers that I've added can immediately tell if this is good or bad.

For the following C code:

unsigned int f0(unsigned int a)
{
        return a-0x2323;
}
unsigned int f1(unsigned int a) 
{
        return a+0x2323;
}

GCC generates

f0:
      sub     r0, r0, #8960
      sub     r0, r0, #35
      bx      lr
f1:
      add     r0, r0, #8960
      add     r0, r0, #35
      bx      lr

but llvm generates

f0:
      ldr     r1, .LCPI0_0
      add     r0, r0, r1
      bx      lr
.LCPI0_0:
      .long   4294958301                      @ 0xffffdcdd
f1:
      add     r0, r0, #35
      add     r0, r0, #8960
      bx      lr

And my patch makes llvm follow GCC's behavior.
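For context on why f0 is compiled as an add of 0xffffdcdd (4294958301): llvm has rewritten the subtraction as an addition of the negated constant, and that pool value is simply the two's-complement of 0x2323. A tiny standalone check of the arithmetic (illustrative only):

#include <cassert>
#include <cstdint>

int main() {
  // The .LCPI0_0 constant is -0x2323 modulo 2^32.
  assert(static_cast<uint32_t>(0) - 0x2323u == 0xffffdcddu);
  // So "a - 0x2323" and "a + 0xffffdcdd" compute the same 32-bit result.
  uint32_t a = 0x12345678u;
  assert(a - 0x2323u == a + 0xffffdcddu);
  return 0;
}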

Okay, I had only a quick first look, perhaps multiple uses isn't a problem.
If it's not too much trouble, confirming this with the llvm test-suite would be good.

This comment was removed by benshi001.

Okay, I had only a quick first look, perhaps multiple uses isn't a problem.
If it's not too much trouble, confirming this with the llvm test-suite would be good.

I have added a new test case, https://reviews.llvm.org/D83928.

We can merge that test first; then we will see how this patch improves armv6's code quality.

benshi001 updated this revision to Diff 279073.Jul 19 2020, 5:13 AM
benshi001 edited the summary of this revision. (Show Details)

Eyeballing the one test that was changed, this indeed makes sense. But I would prefer to see some performance numbers first, just to check we haven't missed anything.

Eyeballing the one test that was changed, this indeed makes sense. But I would prefer to see some performance numbers first, just to check we haven't missed anything.

Is there a performance test suite for llvm/arm that I can try? Or is a small piece of code run 1000 times good enough?
Actually, I have made a similar optimization for golang; golang's official go1 benchmark shows a slight improvement.

https://github.com/golang/go/commit/6897030fe3de43bbed48adb72f21a6c2d00042cd

benshi001 added a comment.EditedJul 27 2020, 7:42 AM

Eyeballing the one test that was changed, this indeed makes sense. But I would prefer to see some performance numbers first, just to check we haven't missed anything.

Another improvement from my patch is the following:

define i32 @sub2(i32 %0) {
  %2 = sub i32 %0, 8995
  ret i32 %2
}

Current llvm will put the immediate 8995 in the constant pool and generate an LDR.

My patch will optimize it to

; CHECK-NEXT:    sub r0, r0, #35
; CHECK-NEXT:    sub r0, r0, #8960

That can be seen if https://reviews.llvm.org/D83928 is merged.

Eyeballing the one test that was changed, this indeed makes sense. But I would prefer to see some performance numbers first, just to check we haven't missed anything.

Is there a performance test suite for llvm/arm that I can try? Or is a small piece of code run 1000 times good enough?
Actually, I have made a similar optimization for golang; golang's official go1 benchmark shows a slight improvement.

https://github.com/golang/go/commit/6897030fe3de43bbed48adb72f21a6c2d00042cd

Briefly looking at those numbers, they show a bit of up-and-down behaviour. And while the geomeans show a small improvement, I don't find it quite convincing... i.e., not yet. Hence my request for some numbers. But first of all, I am curious to know what causes the regressions. In other words, what are we missing with this patch, can that be mitigated, and what kind of knock-on effects does this have?
There is the llvm test-suite, see e.g. https://llvm.org/docs/TestSuiteGuide.html. If you can run that, that would be good.

Coremark and Dhrystone are somewhat obsolete benchmarks, but they are easy to run. They would be good as a first finger on the pulse too.

I understood from your message to the llvm dev list that generating numbers isn't entirely straightforward for you.
I will take this patch, and generate some numbers, will see if I can do that today/tomorrow, and will let you know.

I understood from your message to the llvm dev list that generating numbers isn't entirely straightforward for you.
I will take this patch, and generate some numbers, will see if I can do that today/tomorrow, and will let you know.

Thanks so much for your help. I have another patch, https://reviews.llvm.org/D84100, which also does immediate optimization.

I should combine them into one, but smaller changes are easier to review and test.

So I would appreciate it if you could also have a look at it.

I understood from your message to the llvm dev list that generating numbers isn't entirely straightforward for you.
I will take this patch, and generate some numbers, will see if I can do that today/tomorrow, and will let you know.

What's more, I do not expect any performance improvement from my patch, but I do expect a code size reduction across the test suite. That is also a benefit to llvm-arm.

SjoerdMeijer accepted this revision.Jul 28 2020, 11:17 AM

Think I shot myself a little bit in the foot here ;-). Our group focuses mostly on M-cores, and running A32 code wasn't as straightforward as I'd have hoped. I eyeballed codegen differences, and didn't see anything concerning. So I don't have data for making further objections.

So LGTM for now.

This revision is now accepted and ready to land.Jul 28 2020, 11:17 AM
This revision was automatically updated to reflect the committed changes.