
[ARMISelLowering] Better handling of NEON load/store for sequential memory regions
Needs ReviewPublic

Authored by evgeny777 on Oct 30 2017, 5:41 AM.

Details

Summary

Consider sample code that copies a 4x4 matrix row by row (see cascade-vld-vst.ll). The current revision generates the following code (AArch32):

mov r2, #48
mov r3, r0
vld1.32 {d16, d17}, [r3], r2
vld1.64 {d18, d19}, [r3]
add r3, r0, #32
add r0, r0, #16
vld1.64 {d22, d23}, [r0]
add r0, r1, #16
vld1.64 {d20, d21}, [r3]
vst1.64 {d22, d23}, [r0]
add r0, r1, #32
vst1.64 {d20, d21}, [r0]
vst1.32 {d16, d17}, [r1], r2
vst1.64 {d18, d19}, [r1]
mov pc, lr

After this patch is applied:

vld1.32 {d16, d17}, [r0]!
vld1.32 {d18, d19}, [r0]!
vld1.32 {d20, d21}, [r0]!
vld1.64 {d22, d23}, [r0]
vst1.32 {d16, d17}, [r1]!
vst1.32 {d18, d19}, [r1]!
vst1.32 {d20, d21}, [r1]!
vst1.64 {d22, d23}, [r1]
mov pc, lr

It also speeds up our matrix multiplication function by 15%. Some of the existing LLVM test cases now have approx. 25% fewer instructions than before.

The improvement is based on two major changes to CombineBaseUpdate:

  1. When selecting an address-increment instruction to fold, we prefer one that is equal to the access size of the load/store.
  2. If no such address increment is bound to the address operand of the current load/store, we walk up the SelectionDAG chain and try to borrow the address-increment instruction bound to the address operand of a parent VST{X}_UPD or VLD{X}_UPD that we processed earlier.
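As an illustration of change (1), the increment-selection preference can be sketched roughly as follows. This is a hypothetical standalone sketch, not the actual LLVM code; the function name and signature are made up for illustration:

```cpp
#include <cstdint>
#include <vector>

// Among the candidate constant increments attached to a load/store
// address, prefer the one equal to the access size, because that folds
// into the post-increment form "vld1 {...}, [rN]!" without tying up an
// extra register. If no candidate matches, fall back to the first one.
int64_t pickIncrement(const std::vector<int64_t> &Candidates,
                      int64_t AccessSize) {
  int64_t First = -1;
  for (int64_t C : Candidates) {
    if (C == AccessSize)
      return C; // exact match: combine to the writeback ("!") form
    if (First == -1)
      First = C;
  }
  return First; // -1 means no candidate at all
}
```

For example, a 128-bit vld1 accesses 16 bytes; given candidate increments {48, 16}, the 16 would be preferred, allowing `vld1.32 {d16, d17}, [r0]!` instead of a separate register-based increment.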

Diff Detail

Event Timeline

evgeny777 created this revision.Oct 30 2017, 5:41 AM

So, this is an interesting change, on the TODO list for a while, but the code and the tests change too much in areas that are not clearly related.

Furthermore, you have replaced a single pass with a multi-pass over the DAG, which may have several unintended consequences, including a compile-time increase.

As for your results, you have not shown your matrix benchmark, nor numbers on standard compiler benchmarks and the LLVM test-suite, nor have you said which architecture (AArch32 variant) you're benchmarking.

Given that this will affect *all* AArch32 (ARMv7/8), I think we need to be cautious and extensive in our approach.

I'm adding more Arm folks to get a wider review as well as testing, but I am worried about the compilation time as well as the impact on many other code patterns than just matrix multiply.

cheers,
--renato

lib/Target/ARM/ARMISelLowering.cpp
11064

I'm worried about this... there should be a guaranteed exit condition here.

test/CodeGen/ARM/cascade-vld-vst.ll
30

Use regexp as above:

{{{d[0-9]+}}, {{d[0-9]+}}}
test/CodeGen/ARM/misched-fusion-aes.ll
79

why would these change?

test/CodeGen/ARM/vector-load.ll
258

you can't guarantee r0 as this returns void.

evgeny777 added inline comments.Oct 30 2017, 7:10 AM
lib/Target/ARM/ARMISelLowering.cpp
11064

What about reducing the number of iterations to, let's say, 16? This should be more than enough for most cases, and it would also bound the compile-time increase.

test/CodeGen/ARM/misched-fusion-aes.ll
79

The IR is compiled into a mixture of aesXX and vldXX instructions, which are scheduled differently after this patch is applied. I checked manually that the parameters passed to each aesXX instruction are identical (though I can't guarantee that I didn't make any mistake, of course :)

rengolin added inline comments.Oct 30 2017, 7:26 AM
lib/Target/ARM/ARMISelLowering.cpp
11064

I was hoping for an approach that didn't need O(n^2), even as a lower bound. If your increment is 16x, then the loop will be 16x less efficient than previously. Granted, it's doing more, but that's a big impact that I'd rather avoid.

But I have been away from isel for quite a while now, so I'll wait for some current isel developers (@qcolombet @rovka) to shed some light.

test/CodeGen/ARM/misched-fusion-aes.ll
79

Right, I expected as much. And you can't just ignore those lines because the QD, QE etc. will not match.

Hopefully, simplifying the IR could make the extra AES instructions unnecessary and we end up with a static pattern.


Sharing my matrix multiplication example, if anyone's interested.

rengolin added a comment.EditedOct 30 2017, 10:58 AM


Sharing my matrix multiplication example, if anyone's interested.

Thanks! Can you also share the platform you're running your tests? And run some benchmarks to show run time and compile time differences? Also the source where this comes from (C/Fortran), so we can fiddle / compare to others.

cheers,
--renato

javed.absar edited edge metadata.Oct 30 2017, 11:01 AM

Hi Eugene:
Thanks for the work. However, in its current form the implementation is a bit hard to follow. Would it be possible to describe:

  1. Your overall approach in the implementation.
  2. Above each function, what it is trying to do (e.g. checkedGetIncrement, which by the way looks like a non-intuitive name).

You do have comments everywhere, but it is hard to follow the train of thought in its current form.

lib/Target/ARM/ARMISelLowering.cpp
10996

But you are checking only VLD1/VST1, so you may want to rename the function accordingly, e.g. VLD1OrVST1.

11006

Seems unnecessary, as it has a single use.

11065

Nitpick: better to use a range-based loop.

@rengolin

  1. Compilation times

With patch:

real	1m40.086s
user	0m57.040s
sys	0m27.444s

W/o patch:

real	1m40.619s
user	0m58.120s
sys	0m25.944s

Those measurements were done with this bash script:

#!/bin/bash
LLC=/data/llvm/build_ninja_Release/bin/llc
for ((i=1;i<=10000;i++)); do 
   $LLC -mtriple=arm-eabi -float-abi=soft -mattr=+neon mat_mul_4x4.ll 
done

An interesting fact is that the execution time of the patched llc is consistently slightly lower than
that of the non-patched version (both were run 3 times in a row). Not sure what the reason
is (maybe fewer SD nodes after DAGCombine). My machine specs are:
Core i5-2500K, 8GB RAM, Ubuntu 16.04

  2. Execution times of the matrix multiplication example (ARMv8, 32-bit) on an ARM Cortex-A57, 2GHz:

With patch:

MI scheduler: 2549066 usec
SD scheduler: 2647092 usec

W/o patch:

MI scheduler: 3039261 usec
SD scheduler: 2843175 usec

We're using the MI scheduler model added in D28152. With the SD scheduler the improvement is smaller, but still
noticeable.

With patch:

real	1m40.086s

W/o patch:

real	1m40.619s

@evgeny777, while this is encouraging, you're still only running the one case where you know this is good.

Those measurements were done with this bash script:

#!/bin/bash
LLC=/data/llvm/build_ninja_Release/bin/llc
for ((i=1;i<=10000;i++)); do 
   $LLC -mtriple=arm-eabi -float-abi=soft -mattr=+neon mat_mul_4x4.ll 
done

This is very far from ideal, as kernel caching, bash, and whatever else is running on your system could be affecting your timing more than the difference itself.

An interesting fact is that the execution time of the patched llc is consistently slightly lower than
that of the non-patched version (both were run 3 times in a row). Not sure what the reason
is (maybe fewer SD nodes after DAGCombine). My machine specs are:
Core i5-2500K, 8GB RAM, Ubuntu 16.04

Honestly, I don't think your results carry any statistical significance, the variability of the runs could be due to too many factors. I wouldn't worry about those fluctuations right now.

With patch:

MI scheduler: 2549066 usec
SD scheduler: 2647092 usec

W/o patch:

MI scheduler: 3039261 usec
SD scheduler: 2843175 usec

We're using the MI scheduler model added in D28152. With the SD scheduler the improvement is smaller, but still
noticeable.

Again, this is the case that was tailored and reduced, so of course, the numbers will look very inflated.

What ARM machine have you run these on? This makes a huge difference. From your comments, I'm guessing this is ARMv8, which is by far *not* the common case for AArch32.

Until you can show results on real benchmarks on at least one core for both compile and run times, there isn't much else we can do.

Once that's available, I'd encourage other people to do the same on their cores, as well as have a deeper look into the code.

From @javed.absar's comment, I'm not the only one finding this patch non-intuitive, even with the good amount of comments.

cheers,
--renato

eastig added a subscriber: eastig.Oct 31 2017, 3:31 AM

Hi,

I'll do AArch32/AArch64 benchmark runs on our Juno boards. It would be worth running CTMark to check compilation time.

Thanks,
Evgeny Astigeevich
The Arm Compiler Optimisation team

fhahn added a subscriber: fhahn.Oct 31 2017, 3:48 AM
fhahn added inline comments.
test/CodeGen/ARM/misched-fusion-aes.ll
79

I think we could replace the last 3 AESE instructions (lines 57, 59, 61), which only combine the results of the previous computations and have no AESMC instructions to pair with. But as long as the number of AESE/AESMC pairs does not change, the position of the 3 left-over AESE instructions is nothing to be too concerned about, IMO.

I'll do AArch32/AArch64 benchmark runs on our Juno boards. It would be worth running CTMark to check compilation time.

Hi,

There is no point in testing AArch64, as this is a change to the AArch32 back-end. Can you run the same benchmarks on an ARMv7 board? That is the most likely target using the ARM back-end.

cheers,
--renato

rengolin added inline comments.Oct 31 2017, 5:13 AM
test/CodeGen/ARM/misched-fusion-aes.ll
79

Sure, I'm not worried about their position, just trying to future-proof the test, so we don't need to keep shuffling them on code-gen changes.

@rengolin

What ARM machine have you ran these?

This is NDA-covered and I'm not going to share specs. My testing abilities are limited, but here are a few more
devices you should be familiar with:

Samsung Galaxy Nexus (TI OMAP 4460 dual-core Cortex-A9, 1.2 GHz):
With patch:

4x3 Matrix multiplication: 7436829 usec
4x4 Matrix multiplication: 8205384 usec

W/o patch:

4x3 Matrix multiplication: 7797547 usec
4x4 Matrix multiplication: 8580291 usec

Huawei Honor 8 (Cortex A72, 2.3GHz):
With patch:

4x3 Matrix multiplication: 1020747 usec
4x4 Matrix multiplication: 1130021 usec

W/o patch:

4x3 Matrix multiplication: 1193994 usec
4x4 Matrix multiplication: 1290402 usec

From @javed.absar's comment, I'm not the only one finding this patch non-intuitive, even with the good amount of comments.

I'm preparing an algorithm description, so I suggest returning to this discussion later.

Samsung Galaxy Nexus (TI OMAP 4460 dual-core Cortex-A9, 1.2 GHz):
Huawei Honor 8 (Cortex-A72, 2.3GHz):

Right, those are good platforms. But we need more detailed benchmarks (which @eastig is going to run, but I encourage @evandro and @rovka to do the same on others) before we can settle the regression question.

From @javed.absar's comment, I'm not the only one finding this patch non-intuitive, even with the good amount of comments.

I'm preparing an algorithm description, so I suggest returning to this discussion later.

Excellent, thank you!

@javed.absar @rengolin
I've created a PowerPoint file describing the algorithm. Hope this helps.

@javed.absar @rengolin
I've created a PowerPoint file describing the algorithm. Hope this helps.

Now as a PDF. :)

https://reviews.llvm.org/F5459847

efriedma added inline comments.
lib/Target/ARM/ARMISelLowering.cpp
11064

I'll have to think about this a bit more... but I'll just briefly note that the existing CombineBaseUpdate has shown up in the past as a compile-time hog for large basic blocks. (I think it's worst-case O(N^3) or something like that.)

rengolin added inline comments.Oct 31 2017, 12:28 PM
lib/Target/ARM/ARMISelLowering.cpp
11064

Thanks, Eli. I just want to make sure that we only do that if we really need it, i.e. the benefits outweigh the costs/risks and there is no other way.

rovka edited edge metadata.Nov 1 2017, 6:24 AM

Hi,

I tried to give this a run [1] on top of r317072 on an NVIDIA TK1 (Cortex-A15) and I'm getting some failures in the test-suite (in 42 of the benchmarks) along the lines of [2]. I think we should fix those before worrying about performance numbers...

Let me know if you have trouble reproducing these results.

Thanks,
Diana

[1] With these flags: sandbox/bin/python sandbox/bin/lnt runtest test-suite --sandbox sandbox --test-suite test-suite --cc $LLVM_BLD/bin/clang --cxx $LLVM_BLD/bin/clang++ --use-lit $LLVM_BLD/bin/llvm-lit --cppflags '-O3 -mcpu=cortex-a15 -fomit-frame-pointer' --threads=1 --build-threads=4 --use-perf=all --run-under 'taskset -c 0' --benchmarking-only --exec-multisample=3

[2] clang-6.0: llvm/include/llvm/Support/Casting.h:106: static bool llvm::isa_impl_cl<llvm::ConstantSDNode, const llvm::SDNode *>::doit(const From *) [To = llvm::ConstantSDNode, From = const llvm::SDNode *]: Assertion `Val && "isa<> used on a null pointer"' failed.
Stack dump:
0. Program arguments: build/bin/clang-6.0 -cc1 -triple armv7-unknown-linux-gnueabihf -emit-obj -disable-free -main-file-name ffbench.c -mrelocation-model static -mthread-model posix -fmath-errno -masm-verbose -mconstructor-aliases -fuse-init-array -target-cpu cortex-a15 -target-feature -crc -target-feature +dsp -target-feature -ras -target-feature -dotprod -target-feature +hwdiv-arm -target-feature +hwdiv -target-abi aapcs-linux -mfloat-abi hard -fallow-half-arguments-and-returns -dwarf-column-info -debugger-tuning=gdb -coverage-notes-file sandbox/test-2017-11-01_11-50-24/SingleSource/Benchmarks/Misc/CMakeFiles/ffbench.dir/ffbench.c.gcno -resource-dir build/lib/clang/6.0.0 -D NDEBUG -internal-isystem /usr/local/include -internal-isystem build/lib/clang/6.0.0/include -internal-externc-isystem /usr/include/arm-linux-gnueabihf -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -Werror=date-time -w -fdebug-compilation-dir sandbox/test-2017-11-01_11-50-24/SingleSource/Benchmarks/Misc -ferror-limit 19 -fmessage-length 0 -fno-signed-char -fobjc-runtime=gcc -fdiagnostics-show-option -vectorize-loops -vectorize-slp -o CMakeFiles/ffbench.dir/ffbench.c.o -x c test-suite/SingleSource/Benchmarks/Misc/ffbench.c

  1. <eof> parser at end of file
  2. Code generation
  3. Running pass 'Function Pass Manager' on module 'test-suite/SingleSource/Benchmarks/Misc/ffbench.c'.
  4. Running pass 'ARM Instruction Selection' on function '@fourn'

clang-6.0: error: unable to execute command: Aborted
clang-6.0: error: clang frontend command failed due to signal (use -v to see invocation)

@rovka Looking at it ...

evgeny777 updated this revision to Diff 121306.Nov 2 2017, 8:07 AM
  • Fixed bug caused by incorrect negative increment handling. Added test case.
  • If we can't select best possible increment value we now return the first one instead of the last one
  • Addressed some of review comments
  • Fixed issues in ivchain-ARM.ll test case

@rovka: I've run the LLVM test suite (with --compile-only) and it worked fine for me (no assertion failures, etc.). I haven't executed the tests yet, though.

evgeny777 added inline comments.Nov 2 2017, 8:09 AM
lib/Target/ARM/ARMISelLowering.cpp
11065

I need iterator here to call UI.getUse(), not just SDNode*

After looking at your PDF, it sounds as though you don't need the while loop at all.

Just iterate once through all users, collecting the increment sizes.
Then sort them and make sure each one is a step away from the other (step = inc[1]-inc[0]).
If the end is reached and they're all the same size, combine to post-inc.

An optimisation over that would be to find a sub-set of the list which has constant increment and only combine those, but that's a very small edge case inside an already small case and I wouldn't worry about it.
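The suggested single-pass check above could be sketched as follows. This is a hypothetical standalone sketch (the function name and signature are made up, not LLVM API): collect the increment constants seen across the address users, sort them, and only combine to post-increment when they form a single arithmetic progression whose step equals the access size.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when the collected increments form one arithmetic
// progression with step = inc[1] - inc[0] equal to the access size,
// i.e. when combining to post-increment addressing is safe.
bool canCombineToPostInc(std::vector<int64_t> Incs, int64_t AccessSize) {
  if (Incs.empty())
    return false;
  if (Incs.size() == 1)
    return Incs[0] == AccessSize;
  std::sort(Incs.begin(), Incs.end());
  int64_t Step = Incs[1] - Incs[0]; // step = inc[1] - inc[0], as above
  if (Step != AccessSize)
    return false;
  for (std::size_t I = 2; I < Incs.size(); ++I)
    if (Incs[I] - Incs[I - 1] != Step)
      return false;
  return true;
}
```

For the 4x4 matrix example, the four 16-byte accesses at offsets {16, 32, 48} from the base form such a progression, so all of them can use the writeback form.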

Also, unless you can run the check-all and the test-suite and make sure there are no breakages on your own, it's going to be hard to spend time looking at your patch again.

Adding David (combine code owner) for a better look at the algorithm.

cheers,
--renato

evgeny777 added a comment.EditedNov 2 2017, 9:03 AM

After looking at your PDF, it sounds as though you don't need the while loop at all.

Can you please elaborate? Given this type of chain

[ Store(Value, Addr) ]  ------------> [ VST1_UPD ]

The only user of the Addr operand (VST1_UPD) is Store(Value, Addr). What should I do next?
Also, sorting (qsort) introduces additional O(N log N) complexity in the average case and O(N^2) in the worst case.

Also, unless you can run the check-all

I ran check-llvm and also I ran LLVM test-suite, as described here:
http://llvm.org/docs/lnt/quickstart.html

However, I could only cross-compile it and check that there are no more assertions. I don't have any Linux ARM devices, and running the test suite on Android looks difficult.
I'll try a jailbroken iPad tomorrow, which will probably be easier.

The only user of Addr operand (VST1_UPD) is Store(Value, Addr). What should I do next?

All three stores chain up to the same load, with different offsets. Can't you look for that pattern?

@eli.friedman / @majnemer should really give their input here, as I'm reaching the limit of my combine knowledge.

Also, sorting (qsort) introduces additional O(N log N) complexity in the average case and O(N^2) in the worst case.

First, there is no requirement for standard libraries to use qsort. I believe libstdc++ uses introsort (worst-case N log N); not sure about libc++.

Still, you only sort a very short list of integers IFF the pattern is matched, paying the (small) cost only for the rare cases where it does.

Let's assume you have the absurd amount of 10 integers. That's 100 passes for N^2 in the worst possible case, ~33 for nlogn, only when the list is non-empty.

Meanwhile, iterating through the use chain and dereferencing pointers through N for-loops will grant you cache misses, a huge number of loads, and potential branch-prediction failures, for *all* cases.

I'll let you guess which one is worse.

also I ran LLVM test-suite

No you didn't. You only compiled the test suite.

However I could only cross-compile it and check that there are no more asserions. I don't have any linux ARM devices and running test suite on Android looks difficult.
I'll try jailbroken iPad tomorrow which might probably be easier

You can't push a patch that will affect every Arm device on the planet by testing on NDA hardware with NDA operating system and hope all is well. But it gets worse, because you have only "tested" your small matrix multiply snippet and claimed victory.

In reality, you have to demonstrate, from your own effort, that you have done your homework.

In general, your homework is at least:

  1. Build and check-all, making sure you built the Arm back-end (so Lit tests run on Arm code-gen). Check.
  2. Convince people your algorithm is the best we can possibly do in this case. We're doing that, but I'm not convinced yet.

Meanwhile, we have demonstrated that you have broken the test-suite on Arm (the architecture your patch targets). In this case you *must*:

  1. *Run* the test-suite in test mode, on Arm hardware (Linux, Apple, even Android would be fine). Everything must pass. Not check.

Since this is a performance improvement with serious risk of compile time issues, the additional items are required:

  3. Run standard compiler benchmarks (SPEC, EEMBC, Coremark) on Arm hardware (Linux, Apple, even Android would be fine). Not check.
     a. Report results for anything greater than 1% change (up and down) on both run time and compile time.
     b. Report the geomeans of all benchmarks, discuss the results.
     c. A nice touch would be running the test-suite in benchmark mode, which does that more or less automagically.
  4. Invite other people to run benchmarks on other machines which the patch targets, discuss results. I've done that, but they will only do it when the algorithm is right.

In this particular case, the Arm back-end targets a lot more ARMv7 cores than ARMv8, so at least some benchmarking has to be done on ARMv7 hardware.

It would be fine for you to benchmark on ARMv8 and wait for other people to do the same on ARMv7.

But before anyone else gets involved in their own benchmarks, you *have* to have finished, and presented, steps 1~4 above.

cheers,
--renato

Let's assume you have the absurd amount of 10 integers. That's 100 passes for N^2 in the worst possible case, ~33 for nlogn, only when the list is non-empty.

I believe the algorithm complexity is O(N), where N is the number of nodes in the SelectionDAG (it's about finding a minimum value in a list of values). You're suggesting O(N log N); why?

You can't push a patch that will affect every Arm device on the planet by testing on ...

Like I said before, my testing capacity is limited, and running the test suite on a single device (even a popular one) doesn't provide as much data as you probably expect. Meanwhile, someone else might want to give it a try, so I updated the diff.

I believe algorithm complexity is O(N), where N is a number of nodes in Selection DAG (it's about finding a minimum value in a list of values). You're suggesting O(N*logN), why?

I'm lost. What algorithm is O(N)? You have nested loops, the outer one is unbound, the inner one is O(N).

Like I said before, my testing capacity is limited

How do you know your change is good for all cases the compiler will have to handle, and not just the one you care about?

and running the test suite on a single device (even a popular one) doesn't provide as much data as you probably expect.

Testing on a single device is infinitely better than not testing it at all, I hope you can see that.

Meanwhile someone might also want to give it a try, so I updated the diff.

So you are expecting other people to test your code for you?

I'm lost. What algorithm is O(N)? You have nested loops, the outer one is unbound, the inner one is O(N).

Having nested loops doesn't necessarily mean you have O(N^2) complexity. My algorithm:

  • Doesn't examine any node in the DAG twice
  • N is the number of nodes in the DAG
  • At maximum, it can examine all nodes in the DAG

So the complexity is naturally O(N)

Testing on a single device is infinitely better than not testing it at all, I hope you can see that.

I've shown benchmarks on a few Android phones, if you remember. I'll also try running the full test suite, but this can take some time.

So you are expecting other people to test your code for you?

I'm not expecting anything. I just posted the fix.

When you're considering the overall complexity, there are a couple points to keep in mind:

  1. DAGCombine calls CombineBaseUpdate for each relevant node in the DAG.
  2. isPredecessorOf is O(N) in the size of the DAG.
  1. DAGCombine calls CombineBaseUpdate for each relevant node in the DAG.
  2. isPredecessorOf is O(N) in the size of the DAG.
  1. That's true; I see no difference between the old and new approaches in this regard.
  2. I didn't know that, thanks! I'll check if it is possible to optimize.

I've shown benchmarks on a few Android phones, if you remember.

No you haven't. You have shown one little snippet, not benchmarks.

I'll also try running the full test suite, but this can take some time.

Sure, once you're finished, and all tests are green, we can start again.

eastig added a comment.Nov 3 2017, 5:09 AM

Hi,

I've got the first results of the benchmark runs: the LNT test suite + private benchmark 01. I used the latest patch.
The configuration is a Juno board Cortex-A57/A53, v8-a, AArch32, Thumb2.
Options: -O3 -mcpu=cortex-a57 -mthumb -fomit-frame-pointer
The runs passed without errors.

Improvements:

MultiSource/Applications/JM/lencod/lencod: 2.57%
Private benchmark 01: 9.5%

Regressions:

SingleSource/Benchmarks/Misc/salsa20: 3.41%

Thanks,
Evgeny Astigeevich

I've got the first results of the benchmark runs: the LNT test suite + private benchmark 01. I used the latest patch.
The configuration is a Juno board Cortex-A57/A53, v8-a, AArch32, Thumb2.
Options: -O3 -mcpu=cortex-a57 -mthumb -fomit-frame-pointer
The runs passed without errors.

Thanks Evgeny.

Improvements:

MultiSource/Applications/JM/lencod/lencod: 2.57%
Private benchmark 01: 9.5%

Regressions:

SingleSource/Benchmarks/Misc/salsa20: 3.41%

Not very convincing numbers. I guess efficient pipelining can make most of the difference wash away.

Maybe older ARM cores, like A8, will get better improvement?

What about compile time? And code size?

eastig added a comment.Nov 3 2017, 7:05 AM

Not very convincing numbers. I guess efficient pipelining can make most of the difference wash away.

Agree. Maybe there will be more interesting data from other benchmarks.

Maybe older ARM cores, like A8, will get better improvement?

We have Cortex-A9 Panda boards. I'll use them.

What about compile time? And code size?

The LNT server does not provide compile-time reports. I am not sure whether they are simply not displayed or whether no compile-time data is available.
Also, compilation happens on the target; this is not cross-compilation. I don't know how stable compile times are on Juno boards.
I think Evgeny could try cross-compilation runs of the LNT test suite on x86. It would be worth running CTMark; Apple says it's quite good at catching compile-time regressions.

Attached the picture of code size improvements (sorry for not providing them as plain table data):

The LNT server does not provide compile-time reports. I am not sure whether they are simply not displayed or whether no compile-time data is available.
Also, compilation happens on the target; this is not cross-compilation. I don't know how stable compile times are on Juno boards.

LNT does provide compile time data, but since you're cross-compiling, it's probably being lost somewhere in the process.

Attached the picture of code size improvements (sorry for not providing them as plain table data):

Interesting. I'm guessing all TSVC reductions are due to the same function.

It's also interesting that TSVC was the code that changed the most, and had no visible performance benefits.

Maybe hinting that the variation in other benchmarks could have been a side effect, not the use of post-increment, including the matrix multiply. Neither lencod nor salsa show up in code size changes.

It's quite likely that your A9 won't show much better results (or even different random results), given that it's OOO and pipelined between memory, ALU and vector operations.

eastig added a comment.Nov 3 2017, 8:10 AM

Interesting. I'm guessing all TSVC reductions are due to the same function.

It's also interesting that TSVC was the code that changed the most, and had no visible performance benefits.

Actually it had some, but not many.
There are some other regressions/improvements. I have not listed them because the hottest code is not changed.
Attached full tables:


MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt has a 1.86% execution-time regression and a 41.88% code-size improvement.
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt has a 1.48% execution-time improvement and a 39.61% code-size improvement.
MultiSource/Benchmarks/TSVC/Equivalencing-flt/Equivalencing-flt has a 1.26% execution-time improvement and a 41.61% code-size improvement.

Maybe hinting that the variation in other benchmarks could have been a side effect, not the use of post-increment, including the matrix multiply. Neither lencod nor salsa show up in code size changes.

It's quite likely that your A9 won't show much better results (or even different random results), given that it's OOO and pipelined between memory, ALU and vector operations.

What about running on Cortex-A53? Its pipeline is in-order.

Actually it had some, but not many.
There are some other regressions/improvements. I have not listed them because the hottest code is not changed.

Right, and the fact that TSVC appears on both improvements and regressions is a hint that there are other factors at play.

MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt has a 1.86% execution-time regression and a 41.88% code-size improvement.
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt has a 1.48% execution-time improvement and a 39.61% code-size improvement.
MultiSource/Benchmarks/TSVC/Equivalencing-flt/Equivalencing-flt has a 1.26% execution-time improvement and a 41.61% code-size improvement.

That's pretty consistent, again, probably the same code. But at least you didn't have regressions in non-affected code, which means the early exits are working as expected.

What about running on Cortex-A53? Its pipeline is in-order.

I don't think out-of-order vs. in-order will make a big difference (maybe just different noise). Because this is ALU vs. load/vector, the pipelining will have a much bigger impact than the dispatcher.

eastig added a comment.Nov 3 2017, 8:28 AM

Cortex-A57 results from another private benchmark (50 sub-benchmarks):

3 sub-benchmarks: 2% score regression
2 sub-benchmarks: 1% code size improvement (no changes in their scores)
1 sub-benchmark: 5.56% score improvement
1 sub-benchmark: 1.15% score improvement

CTMark benchmarks (Baseline corresponds to non-patched version)


CFLAGS:

-target armv7-unknown-linux-gnueabihf  -O3 -mcpu=cortex-a15 -fomit-frame-pointer

CTMark benchmarks (Baseline corresponds to non-patched version)


CFLAGS:

-target armv7-unknown-linux-gnueabihf  -O3 -mcpu=cortex-a15 -fomit-frame-pointer

Thanks! Unfortunately, ClamAV is dominated by I/O, so probably just noise.

The ones that probably responded are lencod and kc, but they responded in opposite directions, equally well/badly.

So far, this change has been good for code size, but not much else. If/when this change is accepted, it will probably have to rely on the size-optimization profile being chosen.

eastig added a comment.Nov 8 2017, 3:30 AM

Hi,

I've got Spec2006 results for Cortex-A57:

401.bzip2: 5.54% code size increase
471.omnetpp: 1.22% execution time improvement

401.bzip2: 5.54% code size increase

That's really strange, any idea how this could happen?

eastig added a comment.Nov 9 2017, 3:25 AM

401.bzip2: 5.54% code size increase

That's really strange, any idea how this could happen?

Sorry, I have not looked at it yet. Busy with evaluation of the partial inliner.

Code size improvements on Jetson TK1 (cortex-a15). CFLAGS:

-mcpu=cortex-a15 -O3 -fomit-frame-pointer

There were no size regressions

I could finally get execution performance results for the Jetson TK1 (Cortex-A15), which look meaningful.
What I did so far:

  1. Created a subset of the LLVM test suite which contains only the tests that changed between compilations (patched vs. non-patched). I used a script that compared executable file hash sums to do that. Overall there are 168 such tests in the SingleSource and MultiSource dirs.
  2. Used a multisample run with 10 passes (--exec-multisample=10).
  3. Used perf (timeit.sh) instead of timeit, because it shows much less variance in execution times for me.
  4. Each time before running the test suite, the device was rebooted (2 times).

Here are the results:

There were no performance regressions.

@eastig I've tested the patch on bzip2. I don't have SPEC CPU 2006, so I used plain bzip2 1.0.3.
Here are sizes of code section I got for different runs.

  1. Cortex-A57 and SDNode scheduler (-Wall -Winline -target arm-none-linux-gnueabihf -mfloat-abi=hard -O3 -mcpu=cortex-a57 -fomit-frame-pointer)
Before: 84188
After: 83172
  2. Cortex-A57 and MI scheduler (-Wall -Winline -target arm-none-linux-gnueabihf -mfloat-abi=hard -O3 -mcpu=cortex-a57 -Xclang -target-feature -Xclang +use-misched -fomit-frame-pointer)
Before: 80088
After: 79440
  3. Cortex-A15 (-Wall -Winline -target arm-none-linux-gnueabihf -mfloat-abi=hard -O3 -mcpu=cortex-a15 -fomit-frame-pointer)
Before: 81672
After: 81320

@evgeny777: I'll try to find some time to look at 401.bzip2.

I think I know the reason for the code-section growth: you've likely compiled 401.bzip2 with -mthumb.
For some reason I get twice as many register spills/reloads in BZ2_decompress in the patched version.
Strangely, all other functions are shorter or equal in size. Looking for the source of the problem now.

mcrosier resigned from this revision.Apr 10 2018, 1:10 PM