This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Applying load pair optimization for volatile load/store
AbandonedPublic

Authored by flyingforyou on Nov 8 2015, 6:22 PM.

Details

Summary

If the second load is just right next to first load/store, it can be merged into load/store pair instruction. There is no semantic difference.

Diff Detail

Event Timeline

flyingforyou retitled this revision from to [AArch64] Applying load pair optimization for volatile load.
flyingforyou updated this object.
flyingforyou added a subscriber: llvm-commits.
mcrosier edited edge metadata.Nov 9 2015, 6:50 AM

Please add a few test cases.

This seems reasonable, but I would really like to hear from Tim or another reviewer.

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
818

Please create a static helper function with comments suggesting why it's safe to merge adjacent loads/stores that are marked volatile. Also, I don't think this check is sufficient. The order of the offsets determine the ordering of the pairs. For example,

ldr x1, [x0, #4] <--- volatile
ldr x2, [x0]

would create a ldp with the volatile load being reordering, if I'm not mistaken.

1150

Please add a few comments about what's going on here.

t.p.northover edited edge metadata.Nov 9 2015, 11:18 AM

I agree the underlying transformation looks sound. Stopping the scan at 1 seems a bit hacky though, and is there any particular reason you limit the transformation to loads?

Hi, Chad, Tim.

I concern too much about merging store instruction. I was just worried about that some hardware communicated through memory interface. Now, I change the code that can merge adjacent volatile store also.

If you think something is weired or insufficent, please tell me.

Thanks for reviewing.

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
818

Hi Chad.
I am really appreciate your review.

As you said, it has a possiblity of reordering.
But I think it's not harmful. After merging, it changed to ldp instruction that has no memory reference information. This means ldp instruction is treated like volatile instruction.

I will create a static helper function. If you feel something is missing, please tell me.

Thanks.

1150

I will do that.
Thanks.

flyingforyou retitled this revision from [AArch64] Applying load pair optimization for volatile load to [AArch64] Applying load pair optimization for volatile load/store.Nov 10 2015, 5:22 PM
flyingforyou updated this object.
flyingforyou edited edge metadata.

Thanks Chad, Tim.

This change has below things.
Modify existing tests, Add comments and tests, support volatile store.

Stopping the scan at 1 seems a bit hacky though, and is there any particular reason you limit the transformation to loads?

I agree with Tim. Setting the scan limit to 1 isn't the right solution. You should be able to scan until you encounter the first memory operation.

Hi Chad.

Could you give me an idea that how to change the code without limiting scan value to 1?

Should I make another function for merging adjacent volatile ldr/str?

Or move "hasOrderedMemoryRef" function to inside of "findMatchingInsn" and
change the Limit value 1 if the first instruction is volatile.

First one makes many duplication of codes. You may not be satisfied with second one.
If you have any other opinion, tell me please.

Thanks.

junbuml edited edge metadata.Nov 12 2015, 7:41 AM
lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
189

You can use clang-format to check your coding style.

195–198

You can replace this with return Limit == 1;

842

Do you need to find a paired until you encounter the first memory operation? Or did you intend to check just right next instruction?

Looks like you try to see just right next instruction. Right? If that's the case you could use Count as well.

1160–1161

I think you can move this check in findMatchingInsn().

1176–1178

Can you add little bit more detail about why you want to check just the right next instruction ?

Thanks JunBum.

Your comment is very helpful for me.
I think that next commit is also insufficient, too.
So please review the next commit also, if you have time.

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
189

I will fix it.
I didn't know clang-format. It's very convinient.

Thanks Junbum.

195–198

Yes. You're right.

842

I just want to check right next instruction.
I think adjacent or just next to.. these expressions are not enough for explanation of my patch.

I will change comments and function name also.

1160–1161

Ok. I will.

1176–1178

After moving "hasOrderedMemoryRef", I will add more comment inside of findMatchingInsn function.

flyingforyou updated this object.
flyingforyou edited edge metadata.

Thanks Junbum for the review.
I addressed your comments.
If you have more opinions, please tell me.

junbuml added inline comments.Nov 13 2015, 7:29 AM
lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
189

Do you need to check this again? If this function is called only in findMatchingInsn, assert will be better.

195

Please remove this.

849–853

Why merging with the next instruction is safe. Is this also safe for the narrow load merge case where we concatenate MOMs ?

854

I doubt if isRightNextLdStMerge is the right function name? Something specifically volatile will be better ?

test/CodeGen/AArch64/arm64-ldp.ll
364

It will be good to add more tests to cover all other types of loads/stores impacted by this patch, not just i64.

Could you give me an idea that how to change the code without limiting scan value to 1?

Here's an idea: http://reviews.llvm.org/D14659 ;)

I also wanted to point out that this patch is awesome in that it also allows instructions whos MMOs have been dropped to be merged (hasOrderedMemoryRef makes a conservative assumption when MMOs are missing). Therefore, this solution will transform more that just volatile loads/stores. Of course the correct fix is to probably not drop MMO (as is don't in tail merging)...

After a bit of digging I'm fairly concerned this isn't a valid optimization. The hasOrderedMemoryRef not only models volatiles, but it also models atomics.

It still unclear to me that ldp/stp preserve ordering between the loads/stores. If the ordering isn't preserved, we might be able to get away with reordering volatiles in practice, but I can't imagine this would be the case for atomics.

Does anyone agree/disagree? I'm also wondering if the risk is worth the reward.

It still unclear to me that ldp/stp preserve ordering between the loads/stores.

I think the principle is sound, but we need to be very careful about what gets combined (for example I don't think we could extend the "strb wzr;strb wzr -> strh wzr" optimisation to volatiles). I think the key point in the ARM ARM is how the access is written for LDP/STP:

Mem[address + 0, dbytes, acctype] = data1;
Mem[address + dbytes, dbytes, acctype] = data2;

That seems to me exactly what you'd write for an operation that did guarantee it would preserve the memory characteristics of the split operations.

If the ordering isn't preserved, we might be able to get away with reordering volatiles in practice, but I can't imagine this would be the case for atomics.

I don't think you can get away with reordering volatiles at all. Their defining property is that the order is preserved, and they are one of the directly observable facets of C's abstract machine.

Atomics, less so (at least memory_order_relaxed ones); you might even be able to reorder seq_cst ones with a sophisticated enough application of the "as-if" rule, though the backend would never do that.

Even though we carefully decide what to combine, I doubt if the order of accessing memory in ldp / stp is not micro-architecturally dependent ?

I'm fine with deferring to your opinion, Tim, and thanks for the clarification. If you're comfortable with the transformation then I'm on board..

However, if I understand you correctly the below is *not* a safe transformation (assuming one of the loads is volatile):

ldr x8, [x0, #8]
ldr x9, [x0]

into

ldp x9, x8, [x0]

correct?

For "Normal" memory the uArch has quite a bit of freedom to do this kind of optimisation, but that's probably fine because you can't detect it. "Device" memory is designed to turn off that kind of thing though, and that's the primary valid use for volatile.

The main point that now worries me is this note I've just discovered in B2.8.2 (Device memory):

An instruction that generates a sequence of accesses as described in Atomicity in the ARM architecture on page B2-81 might be abandoned as a result of an exception being taken during the sequence of accesses. On return from the exception the instruction is restarted, and therefore one or more of the memory locations might be accessed multiple times. This can result in repeated accesses to a location where the program only defines a single access. For this reason, ARM strongly recommends that no accesses to Device memory are performed from a single instruction that spans the boundary of a translation granule or which in some other way could lead to some of the accesses being aborted.

I think ldp/stp operations could be aborted like that (B2-81 specifically says they generate 2 single-copy atomic accesses), so Chad's probably right that we can't do anything for volatile. Well done Chad!

FWIW, I think both loads would have to be volatile to prohibit that if the underlying principle was sound. We're allowed to reorder volatile accesses with non-volatile, but not 2 volatile accesses.

I'm not sure if I understand correctly. It will be helpful if you clarify below four cases ?
1.I guess this must be safe if ldp read [x0, #0] and then [x0, #8] ?

ldr x9, [x0] 
ldr x8, [x0, #8] (volatile) 
  ; into
ldp x9, x8, [x0]

2.Is this safe ?

ldr x8, [x0, #8] (volatile) 
ldr x9, [x0] 
  ; into
ldp x9, x8, [x0]

3.Is this not safe ?

ldr x9, [x0] (volatile)
ldr x8, [x0, #8] (volatile) 
  ; into
ldp x9, x8, [x0]

4.what about this case?

ldr x9, [x0] 
ldr x10, [x1]
ldr x8, [x0, #8] (volatile) 
  ; into
ldr x10, [x1]
ldp x9, x9, [x0]

Reordering the examples slightly:

3.Is this not safe ?

ldr x9, [x0] (volatile)
ldr x8, [x0, #8] (volatile) 
  ; into
ldp x9, x8, [x0]

I think this is unsafe because the ldp can be interrupted after loading x9. It will then re-execute the entire instruction, causing 2 loads from [x0].

1.I guess this must be safe if ldp read [x0, #0] and then [x0, #8] ?

ldr x9, [x0] 
ldr x8, [x0, #8] (volatile) 
  ; into
ldp x9, x8, [x0]

This is OK (assuming no atomics are around too). It's in-order and if the ldp is interrupted only x9 would be loaded twice, which isn't volatile so there's no problem repeating the access.

I'm really not sure it's worth bothering with though, generally either all or no accesses in some region are volatile (for accessing memory-mapped devices), so I rather doubt it would ever produce noticeable benefit.

2.Is this safe ?

ldr x8, [x0, #8] (volatile) 
ldr x9, [x0] 
  ; into
ldp x9, x8, [x0]

This is also safe. You're allowed to reorder the loads since only one is volatile, and the volatile access ends up last in the ldp again, so re-executing would be harmless.

4.what about this case?

ldr x9, [x0] 
ldr x10, [x1]
ldr x8, [x0, #8] (volatile) 
  ; into
ldr x10, [x1]
ldp x9, x8, [x0]

Also OK, again because the volatile access is last in the ldp.

Thanks Tim for the clarification. As long as we are not accessing volatile first in ldp, reordering volatile with non-volatile is fine. right ? So, I may want to confirm what I understand using below two more cases.

5.This must be fine ?

ldr x9, [x0] 
ldr x10, [x1] (volatile) 
ldr x8, [x0, #8] 
  ; into
ldr x10, [x1] (volatile) 
ldp x9, x8, [x0]

6.This must be not safe?

ldr x9, [x0]   (volatile) 
ldr x8, [x0, #8] 
  ; into
ldp x9, x8, [x0]

That looks like a reasonable description of the situation to me and both of your new examples are right.

Tim.

Thanks Tim for the great detail !

This is good discusson.
Thanks Tim, Chad, Junbum.

To my knowledge, The ARM architecture is a weakly ordered memory architecture that supports out of order completion. (B2.7.3)
This means

ex1)
 ldr x1, [x0]       (volatile)
 ldr x2, [x0, #8]   (volatile)
ex2)
 ldr x1, [x0, #8]   (volatile)
 ldr x2, [x0]       (volatile)

Both of examples are not guaranteed which instruction is executed for the first time on the ARM Architecture.
So I think they can be merged into single ldp instruction.
From the language side, volatile variable means it should be loaded and should be stored.

https://en.wikipedia.org/wiki/Volatile_(computer_programming)
In computer programming, particularly in the C, C++, C#, and Java programming languages, the volatile keyword indicates that a value may change between different accesses, even if it does not appear to be modified. This keyword prevents an optimizing compiler from optimizing away subsequent reads or writes and thus incorrectly reusing a stale value or omitting writes. Volatile values primarily arise in hardware access (memory-mapped I/O), where reading from or writing to memory is used to communicate with peripheral devices, and in threading, where a different thread may have modified a value.

For the ordering, (B2.7.2)

Ordering can be achieved by using a DMB or DSB barrier. For more information on DMB and DSB
instructions, see Memory barriers on page B2-85.

For Device Memory(B2.8.2), ARM sugessts using DMB, Load-Acquire, Store-Release(B2-87)

Each Load-Acquire Exclusive and Store-Release Exclusive instruction is essentially a variant of the
equivalent Load-Exclusive or Store-Exclusive instruction. All usage restrictions and single-copy atomicity
properties:
— That apply to the Load-Exclusive instructions also apply to the Load-Acquire Exclusive instructions.
— That apply to the Store-Exclusive instructions also apply to the Store-Release Exclusive instructions.
• The Load-Acquire/Store-Release instructions can remove the requirement to use the explicit DMB memory
barrier instruction.

Tim's mention

I think ldp/stp operations could be aborted like that (B2-81 specifically says they generate 2 single-copy atomic accesses)

I think most of ARM architecture, ldp's exectuion timing is same with single ldr instruction. so we don't have to worry about this so much.

Chad's mention

I also wanted to point out that this patch is awesome in that it also allows instructions whos MMOs have been dropped to be merged (hasOrderedMemoryRef makes a conservative assumption when MMOs are missing). Therefore, this solution will transform more that just volatile loads/stores. Of course the correct fix is to probably not drop MMO (as is don't in tail merging)...

I agree with this. But before discuss this, I want to finish "volatile load/store merge".

I think that if nobody can agree with applying both of examples what I mentioned, this commit is not worth for non-volatile & volatile type merging.

Junmo.

To my knowledge, The ARM architecture is a weakly ordered memory architecture that supports out of order completion. (B2.7.3)

I think you're not realising just how special Device memory is. It's not just memory we happen to know is mapped to some peripheral, it's a real set of attributes you can put into the page tables that prevent the core from messing around with your memory accesses in various ways.

And since the primary valid use for volatile is accessing this kind of memory, we need to make sure we don't break it.

For the ordering, (B2.7.2) [...] For Device Memory(B2.8.2), ARM sugessts using DMB, Load-Acquire, Store-Release(B2-87)

This is an incomplete picture. See B2-98, where it's explicitly called out that you need barriers for Device-nGRE memory. There would be no need to mention that if barriers were always needed.

I believe what actually happens is that Device-nGnRnE is completely ordered under all circumstances, Device-nGnrE is ordered within itself but can be reordered with accesses to Normal memory, and the others are even weaker.

I think we need volatile to be useful for the strongest of those memory regimes: Device-nGnRnE. That means we can't reorder anything and need to be careful about exceptions.

I think most of ARM architecture, ldp's exectuion timing is same with single ldr instruction. so we don't have to worry about this so much.

I don't think latencies come into it. The question is what a conforming ARMv8 CPU is allowed to do, and I think aborting mid-ldp/stp is allowed.

Cheers.

Tim.

I believe what actually happens is that Device-nGnRnE is completely ordered under all circumstances, Device-nGnEE is ordered within itself but can be reordered with accesses to Normal memory, and the others are even weaker.

Oops, this seems wrong on a more thorough reading. I don't think nGnRnE is actually that strong, though I still think the description of nGnRE is probably roughly right and prevents us from reordering volatiles. E.g. on the Reordering attribute:

For all memory types with the non-Reordering attribute, the order of memory accesses arriving at a single peripheral of IMPLEMENTATION DEFINED size, as defined by the peripheral, must be the same order that occurs in a simple sequential execution of the program. That is, the accesses appear in program order.

Thanks Tim.

Oops, this seems wrong on a more thorough reading. I don't think nGnRnE is actually that strong, though I still think the description of nGnRE is probably roughly right and prevents us from reordering volatiles. E.g. on the Reordering attribute:

I agree with your opinion. I learned a lot from this commit.
Especially, I really appreciate Tim, Chad, Junbum for your time reviewing.

If we can't merge two volatile load, I think this commit is not worth for non-volatile and volatile load merge. How do you think about this?
If you agree with this, I'll abandon this commit.

Junmo.

flyingforyou abandoned this revision.Nov 15 2015, 6:57 PM

Thanks for review.

Junmo.