Optimize some specific immediates selection via breaking to two parts
against loading them from the constant pool.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I am not entirely sure anymore, but I thought I had looked into once, or at least something similar. I think the problem here is that if the constant is used only once like in the test, it is a clear win, but has soon as there are multiple uses than it's not better and probably worse. Also, with only one use, the code-size difference is neutral, but that won't be the case with multiple uses. So, I guess this work needs benchmark numbers, unless other reviewers that I've added can immediately tell if this is good or bad.
- Is there a benchmark test suite I can try?
- If you think my change will be worse if there are multiple uses of the constant, then the same thing will happen for plus constant matches ARM_AM::isSOImmTwoPartVal. I think both positive and negative SOImmTwoPartVal constant should be handled in the same way.
- My change also benefits the following code:
%x=%y - 0x2323
previous llvm will generate 12 bytes:
ldr ...
sub ...
an item in the constant pool
but my change simplifies it to only 8 bytes:
sub %x, %y, 0x2300
sub %x, %x, 0x23
(I did not add a test case for that)
for C code
unsigned int f0(unsigned int a) { return a-0x2323; } unsigned int f1(unsigned int a) { return a+0x2323; }
GCC generates
f0: sub r0, r0, #8960 sub r0, r0, #35 bx lr f1: add r0, r0, #8960 add r0, r0, #35 bx lr
but llvm generates
f0: ldr r1, .LCPI0_0 add r0, r0, r1 bx lr .LCPI0_0: .long 4294958301 @ 0xffffdcdd f1: add r0, r0, #35 add r0, r0, #8960 bx lr
And my patch will make llvm follows GCC's behavior.
Okay, I had only a quick first look, perhaps multiple uses isn't a problem.
If it's not too much trouble, confirming this with the llvm test-suite would be good.
I have added a new test case, https://reviews.llvm.org/D83928.
We can merge that test first, then we will see how this patch improves armv6's code quality.
Eye balling the one test that was changed this indeed makes sense. But I would prefer to see some performance numbers first just to check we haven't missed anything.
Is there a performance test suite for llvm/arm I can try? Or just a small piece of code running 1000 times is OK enough?
Actually I have made similar optimization for golang, golang's official go1 benchmark score shows slight improvement.
https://github.com/golang/go/commit/6897030fe3de43bbed48adb72f21a6c2d00042cd
Another improve in my patch is that
define i32 @sub2(i32 %0) { %2 = sub i32 %0, 8995 ret i32 %2 }
Current llvm will put the imm 8995 in the constant pool and generate a LDR.
My patch will optimize it to
; CHECK-NEXT: sub r0, r0, #35 ; CHECK-NEXT: sub r0, r0, #8960
That can be seen if https://reviews.llvm.org/D83928 is merged.
Briefly looking at those numbers, that shows a bit of up and down behaviour. And while the geomean's show a small improvement, I don't find it quite convincing... i.e., not yet. Hence my request for some numbers. But first of all I am curious to know what causes the regressions? In other words, what are we missing with this patch, and can that be mitigated, or what kind of knock on effects does this have?
There is the llvm test-suite, see e.g. https://llvm.org/docs/TestSuiteGuide.html. If you can run that, that would be good.
And kind of obsolete benchmarks, but easy to run are coremark and dhrystone. That would be good as a first finger on the pulse too.
I understood from your message to the llvm dev list that generating numbers isn't entirely straightforward for you.
I will take this patch, and generate some numbers, will see if I can do that today/tomorrow, and will let you know.
Thanks so much for your help. I have another patch https://reviews.llvm.org/D84100 also does immediate optimization.
I should combine them into one, but smaller changes are easy to review and test.
So I quite appreciate you if you can also have a look at it.
What's more, I do not expect there is any performance improvement by my patch, but I do expect there should be
code size reduction of the test suite by my patch. That's also a benefit to llvm-arm.
Think I shot myself a little bit in the foot here ;-). Our group focuses mostly on M-cores, and running A32 code wasn't as straightforward as I'd have hoped. I eyeballed codegen differences, and didn't see anything concerning. So I don't have data for making further objections.
So LGTM for now.