Consider sample code which copies a 4x4 matrix row by row (see cascade-vld-vst.ll); a C++ sketch of this pattern follows the two listings below. The current revision generates the following code (AArch32):
  mov     r2, #48
  mov     r3, r0
  vld1.32 {d16, d17}, [r3], r2
  vld1.64 {d18, d19}, [r3]
  add     r3, r0, #32
  add     r0, r0, #16
  vld1.64 {d22, d23}, [r0]
  add     r0, r1, #16
  vld1.64 {d20, d21}, [r3]
  vst1.64 {d22, d23}, [r0]
  add     r0, r1, #32
  vst1.64 {d20, d21}, [r0]
  vst1.32 {d16, d17}, [r1], r2
  vst1.64 {d18, d19}, [r1]
  mov     pc, lr
After this patch is applied:
  vld1.32 {d16, d17}, [r0]!
  vld1.32 {d18, d19}, [r0]!
  vld1.32 {d20, d21}, [r0]!
  vld1.64 {d22, d23}, [r0]
  vst1.32 {d16, d17}, [r1]!
  vst1.32 {d18, d19}, [r1]!
  vst1.32 {d20, d21}, [r1]!
  vst1.64 {d22, d23}, [r1]
  mov     pc, lr
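For reference, a C++ analogue of the pattern the test exercises might look like the sketch below. This is a hypothetical reconstruction: the actual test is the LLVM IR in cascade-vld-vst.ll, and the function name copyMatrix is invented here.

  // Hypothetical C++ analogue of cascade-vld-vst.ll: copy a 4x4 matrix of
  // 32-bit elements row by row, i.e. four contiguous 16-byte rows.
  #include <cstdint>
  #include <cstring>

  void copyMatrix(uint32_t (&dst)[4][4], const uint32_t (&src)[4][4]) {
    for (int row = 0; row < 4; ++row)
      std::memcpy(dst[row], src[row], sizeof(dst[row])); // 16 bytes per row
  }

Each 16-byte row maps to one VLD1/VST1 pair, which is why an increment equal to the 16-byte access size folds into the post-indexed [r0]!/[r1]! forms above.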
It also speeds up our matrix multiplication function by 15%. Some of the existing LLVM test cases now contain approximately 25% fewer instructions than before.
The improvement is based on two major changes to CombineBaseUpdate:
- When we select an address increment instruction to fold, we prefer one whose increment is equal to the access size of the load/store.
- If we can't find such an address increment bound to the current load/store instruction's address operand, we walk up the SelectionDAG chain and try to borrow the address increment bound to the address operand of a parent VST{X}_UPD or VLD{X}_UPD that we processed earlier. A simplified sketch of both heuristics follows this list.
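The sketch below models both heuristics with toy data structures. The real code operates on SelectionDAG nodes (SDNode/SDValue) inside CombineBaseUpdate in ARMISelLowering.cpp, so every type and function name here (MemOp, pickLocalInc, pickIncrement) is illustrative, not the actual implementation.

  #include <cstdint>
  #include <optional>
  #include <vector>

  // Toy stand-in for a VLD1/VST1 node and its address-arithmetic context.
  struct MemOp {
    uint64_t AccessSize;                   // bytes transferred by this VLD1/VST1
    std::vector<uint64_t> AddrIncs;        // constant increments of its address operand
    const MemOp *CombinedParent = nullptr; // VLD{X}_UPD/VST{X}_UPD combined earlier
  };

  // Change 1: among the increments bound to this op's address operand, prefer
  // one equal to the access size; it folds to the register-free "[rN]!" form.
  std::optional<uint64_t> pickLocalInc(const MemOp &Op) {
    for (uint64_t Inc : Op.AddrIncs)
      if (Inc == Op.AccessSize)
        return Inc;
    return std::nullopt;
  }

  // Change 2: if no suitable increment is bound to the current op, walk up
  // the chain of parent _UPD nodes processed earlier and borrow one.
  std::optional<uint64_t> pickIncrement(const MemOp &Op) {
    if (auto Inc = pickLocalInc(Op))
      return Inc;
    for (const MemOp *P = Op.CombinedParent; P; P = P->CombinedParent)
      for (uint64_t Inc : P->AddrIncs)
        if (Inc == Op.AccessSize)
          return Inc;
    return std::nullopt; // no profitable fold; leave the add/mov as-is
  }

In the 4x4 copy, each VLD1/VST1 moves 16 bytes and a matching 16-byte increment can always be found or borrowed, so all the standalone add/mov address arithmetic disappears.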
But you are checking only VLD1/VST1, so you may want to change the function name accordingly (e.g. to mention VLD1OrVST1).