This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Applying load pair optimization for volatile load/store
AbandonedPublic

Authored by flyingforyou on Nov 8 2015, 6:22 PM.

Download Raw Diff

Details

Reviewers

qcolombet
t.p.northover
ab
jmolloy
junbuml
mcrosier

Summary

If the second load is just right next to first load/store, it can be merged into load/store pair instruction. There is no semantic difference.

Diff Detail

Event Timeline

flyingforyou updated this revision to Diff 39658.Nov 8 2015, 6:22 PM

flyingforyou retitled this revision from to [AArch64] Applying load pair optimization for volatile load.

flyingforyou updated this object.

flyingforyou added reviewers: t.p.northover, mcrosier, junbuml, qcolombet, jmolloy, ab.

flyingforyou added a subscriber: llvm-commits.

Herald added subscribers: rengolin, aemerson. · View Herald TranscriptNov 8 2015, 6:22 PM

Please add a few test cases.

This seems reasonable, but I would really like to hear from Tim or another reviewer.

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
818	Please create a static helper function with comments suggesting why it's safe to merge adjacent loads/stores that are marked volatile. Also, I don't think this check is sufficient. The order of the offsets determine the ordering of the pairs. For example, ldr x1, [x0, #4] <--- volatile ldr x2, [x0] would create a ldp with the volatile load being reordering, if I'm not mistaken.
1150	Please add a few comments about what's going on here.

jevinskie added a subscriber: jevinskie.Nov 9 2015, 9:38 AM

I agree the underlying transformation looks sound. Stopping the scan at 1 seems a bit hacky though, and is there any particular reason you limit the transformation to loads?

Hi, Chad, Tim.

I concern too much about merging store instruction. I was just worried about that some hardware communicated through memory interface. Now, I change the code that can merge adjacent volatile store also.

If you think something is weired or insufficent, please tell me.

Thanks for reviewing.

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
818	Hi Chad. I am really appreciate your review. As you said, it has a possiblity of reordering. But I think it's not harmful. After merging, it changed to ldp instruction that has no memory reference information. This means ldp instruction is treated like volatile instruction. I will create a static helper function. If you feel something is missing, please tell me. Thanks.
1150	I will do that. Thanks.

flyingforyou retitled this revision from [AArch64] Applying load pair optimization for volatile load to [AArch64] Applying load pair optimization for volatile load/store.Nov 10 2015, 5:22 PM

flyingforyou updated this object.

flyingforyou edited edge metadata.

Thanks Chad, Tim.

This change has below things.
Modify existing tests, Add comments and tests, support volatile store.

In D14489#285326, @t.p.northover wrote:

Stopping the scan at 1 seems a bit hacky though, and is there any particular reason you limit the transformation to loads?

I agree with Tim. Setting the scan limit to 1 isn't the right solution. You should be able to scan until you encounter the first memory operation.

Hi Chad.

Could you give me an idea that how to change the code without limiting scan value to 1?

Should I make another function for merging adjacent volatile ldr/str?

Or move "hasOrderedMemoryRef" function to inside of "findMatchingInsn" and
change the Limit value 1 if the first instruction is volatile.

First one makes many duplication of codes. You may not be satisfied with second one.
If you have any other opinion, tell me please.

Thanks.

Can you please upload the diff with full context.
http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
189	You can use clang-format to check your coding style.
195–198	You can replace this with return Limit == 1;
842	Do you need to find a paired until you encounter the first memory operation? Or did you intend to check just right next instruction? Looks like you try to see just right next instruction. Right? If that's the case you could use Count as well.
1160–1161	I think you can move this check in findMatchingInsn().
1176–1178	Can you add little bit more detail about why you want to check just the right next instruction ?

Thanks JunBum.

Your comment is very helpful for me.
I think that next commit is also insufficient, too.
So please review the next commit also, if you have time.

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
189	I will fix it. I didn't know clang-format. It's very convinient. Thanks Junbum.
195–198	Yes. You're right.
842	I just want to check right next instruction. I think adjacent or just next to.. these expressions are not enough for explanation of my patch. I will change comments and function name also.
1160–1161	Ok. I will.
1176–1178	After moving "hasOrderedMemoryRef", I will add more comment inside of findMatchingInsn function.

Thanks Junbum for the review.
I addressed your comments.
If you have more opinions, please tell me.

junbuml added inline comments.Nov 13 2015, 7:29 AM

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
189	Do you need to check this again? If this function is called only in findMatchingInsn, assert will be better.
195	Please remove this.
849–853	Why merging with the next instruction is safe. Is this also safe for the narrow load merge case where we concatenate MOMs ?
854	I doubt if isRightNextLdStMerge is the right function name? Something specifically volatile will be better ?
test/CodeGen/AArch64/arm64-ldp.ll
364	It will be good to add more tests to cover all other types of loads/stores impacted by this patch, not just i64.

In D14489#287676, @flyingforyou wrote:

Could you give me an idea that how to change the code without limiting scan value to 1?

Here's an idea: http://reviews.llvm.org/D14659 ;)

I also wanted to point out that this patch is awesome in that it also allows instructions whos MMOs have been dropped to be merged (hasOrderedMemoryRef makes a conservative assumption when MMOs are missing). Therefore, this solution will transform more that just volatile loads/stores. Of course the correct fix is to probably not drop MMO (as is don't in tail merging)...

After a bit of digging I'm fairly concerned this isn't a valid optimization. The hasOrderedMemoryRef not only models volatiles, but it also models atomics.

It still unclear to me that ldp/stp preserve ordering between the loads/stores. If the ordering isn't preserved, we might be able to get away with reordering volatiles in practice, but I can't imagine this would be the case for atomics.

Does anyone agree/disagree? I'm also wondering if the risk is worth the reward.

It still unclear to me that ldp/stp preserve ordering between the loads/stores.

I think the principle is sound, but we need to be very careful about what gets combined (for example I don't think we could extend the "strb wzr;strb wzr -> strh wzr" optimisation to volatiles). I think the key point in the ARM ARM is how the access is written for LDP/STP:

Mem[address + 0, dbytes, acctype] = data1;
Mem[address + dbytes, dbytes, acctype] = data2;

That seems to me exactly what you'd write for an operation that did guarantee it would preserve the memory characteristics of the split operations.

If the ordering isn't preserved, we might be able to get away with reordering volatiles in practice, but I can't imagine this would be the case for atomics.

I don't think you can get away with reordering volatiles at all. Their defining property is that the order is preserved, and they are one of the directly observable facets of C's abstract machine.

Atomics, less so (at least memory_order_relaxed ones); you might even be able to reorder seq_cst ones with a sophisticated enough application of the "as-if" rule, though the backend would never do that.

Even though we carefully decide what to combine, I doubt if the order of accessing memory in ldp / stp is not micro-architecturally dependent ?

I'm fine with deferring to your opinion, Tim, and thanks for the clarification. If you're comfortable with the transformation then I'm on board..

However, if I understand you correctly the below is *not* a safe transformation (assuming one of the loads is volatile):

ldr x8, [x0, #8]
ldr x9, [x0]

into

ldp x9, x8, [x0]

correct?

For "Normal" memory the uArch has quite a bit of freedom to do this kind of optimisation, but that's probably fine because you can't detect it. "Device" memory is designed to turn off that kind of thing though, and that's the primary valid use for volatile.

The main point that now worries me is this note I've just discovered in B2.8.2 (Device memory):

An instruction that generates a sequence of accesses as described in Atomicity in the ARM architecture on page B2-81 might be abandoned as a result of an exception being taken during the sequence of accesses. On return from the exception the instruction is restarted, and therefore one or more of the memory locations might be accessed multiple times. This can result in repeated accesses to a location where the program only defines a single access. For this reason, ARM strongly recommends that no accesses to Device memory are performed from a single instruction that spans the boundary of a translation granule or which in some other way could lead to some of the accesses being aborted.

I think ldp/stp operations could be aborted like that (B2-81 specifically says they generate 2 single-copy atomic accesses), so Chad's probably right that we can't do anything for volatile. Well done Chad!

FWIW, I think both loads would have to be volatile to prohibit that if the underlying principle was sound. We're allowed to reorder volatile accesses with non-volatile, but not 2 volatile accesses.

I'm not sure if I understand correctly. It will be helpful if you clarify below four cases ?
1.I guess this must be safe if ldp read [x0, #0] and then [x0, #8] ?

ldr x9, [x0] 
ldr x8, [x0, #8] (volatile) 
  ; into
ldp x9, x8, [x0]

2.Is this safe ?

ldr x8, [x0, #8] (volatile) 
ldr x9, [x0] 
  ; into
ldp x9, x8, [x0]

3.Is this not safe ?

ldr x9, [x0] (volatile)
ldr x8, [x0, #8] (volatile) 
  ; into
ldp x9, x8, [x0]

4.what about this case?

ldr x9, [x0] 
ldr x10, [x1]
ldr x8, [x0, #8] (volatile) 
  ; into
ldr x10, [x1]
ldp x9, x9, [x0]

Reordering the examples slightly:

3.Is this not safe ?

ldr x9, [x0] (volatile)
ldr x8, [x0, #8] (volatile) 
  ; into
ldp x9, x8, [x0]

I think this is unsafe because the ldp can be interrupted after loading x9. It will then re-execute the entire instruction, causing 2 loads from [x0].

1.I guess this must be safe if ldp read [x0, #0] and then [x0, #8] ?

ldr x9, [x0] 
ldr x8, [x0, #8] (volatile) 
  ; into
ldp x9, x8, [x0]

This is OK (assuming no atomics are around too). It's in-order and if the ldp is interrupted only x9 would be loaded twice, which isn't volatile so there's no problem repeating the access.

I'm really not sure it's worth bothering with though, generally either all or no accesses in some region are volatile (for accessing memory-mapped devices), so I rather doubt it would ever produce noticeable benefit.

2.Is this safe ?

ldr x8, [x0, #8] (volatile) 
ldr x9, [x0] 
  ; into
ldp x9, x8, [x0]

This is also safe. You're allowed to reorder the loads since only one is volatile, and the volatile access ends up last in the ldp again, so re-executing would be harmless.

4.what about this case?

ldr x9, [x0] 
ldr x10, [x1]
ldr x8, [x0, #8] (volatile) 
  ; into
ldr x10, [x1]
ldp x9, x8, [x0]

Also OK, again because the volatile access is last in the ldp.

Thanks Tim for the clarification. As long as we are not accessing volatile first in ldp, reordering volatile with non-volatile is fine. right ? So, I may want to confirm what I understand using below two more cases.

5.This must be fine ?

ldr x9, [x0] 
ldr x10, [x1] (volatile) 
ldr x8, [x0, #8] 
  ; into
ldr x10, [x1] (volatile) 
ldp x9, x8, [x0]

6.This must be not safe?

ldr x9, [x0]   (volatile) 
ldr x8, [x0, #8] 
  ; into
ldp x9, x8, [x0]

That looks like a reasonable description of the situation to me and both of your new examples are right.

Tim.

Thanks Tim for the great detail !

This is good discusson.
Thanks Tim, Chad, Junbum.

To my knowledge, The ARM architecture is a weakly ordered memory architecture that supports out of order completion. (B2.7.3)
This means

ex1)
 ldr x1, [x0]       (volatile)
 ldr x2, [x0, #8]   (volatile)
ex2)
 ldr x1, [x0, #8]   (volatile)
 ldr x2, [x0]       (volatile)

Both of examples are not guaranteed which instruction is executed for the first time on the ARM Architecture.
So I think they can be merged into single ldp instruction.
From the language side, volatile variable means it should be loaded and should be stored.

https://en.wikipedia.org/wiki/Volatile_(computer_programming)
In computer programming, particularly in the C, C++, C#, and Java programming languages, the volatile keyword indicates that a value may change between different accesses, even if it does not appear to be modified. This keyword prevents an optimizing compiler from optimizing away subsequent reads or writes and thus incorrectly reusing a stale value or omitting writes. Volatile values primarily arise in hardware access (memory-mapped I/O), where reading from or writing to memory is used to communicate with peripheral devices, and in threading, where a different thread may have modified a value.

For the ordering, (B2.7.2)

Ordering can be achieved by using a DMB or DSB barrier. For more information on DMB and DSB
instructions, see Memory barriers on page B2-85.

For Device Memory(B2.8.2), ARM sugessts using DMB, Load-Acquire, Store-Release(B2-87)

Each Load-Acquire Exclusive and Store-Release Exclusive instruction is essentially a variant of the
equivalent Load-Exclusive or Store-Exclusive instruction. All usage restrictions and single-copy atomicity
properties:
— That apply to the Load-Exclusive instructions also apply to the Load-Acquire Exclusive instructions.
— That apply to the Store-Exclusive instructions also apply to the Store-Release Exclusive instructions.
• The Load-Acquire/Store-Release instructions can remove the requirement to use the explicit DMB memory
barrier instruction.

Tim's mention

I think ldp/stp operations could be aborted like that (B2-81 specifically says they generate 2 single-copy atomic accesses)

I think most of ARM architecture, ldp's exectuion timing is same with single ldr instruction. so we don't have to worry about this so much.

Chad's mention

I also wanted to point out that this patch is awesome in that it also allows instructions whos MMOs have been dropped to be merged (hasOrderedMemoryRef makes a conservative assumption when MMOs are missing). Therefore, this solution will transform more that just volatile loads/stores. Of course the correct fix is to probably not drop MMO (as is don't in tail merging)...

I agree with this. But before discuss this, I want to finish "volatile load/store merge".

I think that if nobody can agree with applying both of examples what I mentioned, this commit is not worth for non-volatile & volatile type merging.

Junmo.

To my knowledge, The ARM architecture is a weakly ordered memory architecture that supports out of order completion. (B2.7.3)

I think you're not realising just how special Device memory is. It's not just memory we happen to know is mapped to some peripheral, it's a real set of attributes you can put into the page tables that prevent the core from messing around with your memory accesses in various ways.

And since the primary valid use for volatile is accessing this kind of memory, we need to make sure we don't break it.

For the ordering, (B2.7.2) [...] For Device Memory(B2.8.2), ARM sugessts using DMB, Load-Acquire, Store-Release(B2-87)

This is an incomplete picture. See B2-98, where it's explicitly called out that you need barriers for Device-nGRE memory. There would be no need to mention that if barriers were always needed.

I believe what actually happens is that Device-nGnRnE is completely ordered under all circumstances, Device-nGnrE is ordered within itself but can be reordered with accesses to Normal memory, and the others are even weaker.

I think we need volatile to be useful for the strongest of those memory regimes: Device-nGnRnE. That means we can't reorder anything and need to be careful about exceptions.

I think most of ARM architecture, ldp's exectuion timing is same with single ldr instruction. so we don't have to worry about this so much.

I don't think latencies come into it. The question is what a conforming ARMv8 CPU is allowed to do, and I think aborting mid-ldp/stp is allowed.

Cheers.

Tim.

I believe what actually happens is that Device-nGnRnE is completely ordered under all circumstances, Device-nGnEE is ordered within itself but can be reordered with accesses to Normal memory, and the others are even weaker.

Oops, this seems wrong on a more thorough reading. I don't think nGnRnE is actually that strong, though I still think the description of nGnRE is probably roughly right and prevents us from reordering volatiles. E.g. on the Reordering attribute:

For all memory types with the non-Reordering attribute, the order of memory accesses arriving at a single peripheral of IMPLEMENTATION DEFINED size, as defined by the peripheral, must be the same order that occurs in a simple sequential execution of the program. That is, the accesses appear in program order.

Thanks Tim.

Oops, this seems wrong on a more thorough reading. I don't think nGnRnE is actually that strong, though I still think the description of nGnRE is probably roughly right and prevents us from reordering volatiles. E.g. on the Reordering attribute:

I agree with your opinion. I learned a lot from this commit.
Especially, I really appreciate Tim, Chad, Junbum for your time reviewing.

If we can't merge two volatile load, I think this commit is not worth for non-volatile and volatile load merge. How do you think about this?
If you agree with this, I'll abandon this commit.

Junmo.

Thanks for review.

Junmo.

Revision Contents

Path

Size

lib/

Target/

AArch64/

AArch64LoadStoreOptimizer.cpp

37 lines

test/

CodeGen/

AArch64/

arm64-ldp.ll

19 lines

arm64-platform-reg.ll

12 lines

arm64-stp.ll

17 lines

arm64-volatile.ll

27 lines

Diff 40116

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp

	Show First 20 Lines • Show All 132 Lines • ▼ Show 20 Lines
	return true;			return true;
	// FIXME: Add other instructions (e.g, LDRBBui, LDURSHWi, LDRSHWui, etc.).			// FIXME: Add other instructions (e.g, LDRBBui, LDURSHWi, LDRSHWui, etc.).
	}			}
	}			}
	static bool isSmallTypeLdMerge(MachineInstr *MI) {			static bool isSmallTypeLdMerge(MachineInstr *MI) {
	return isSmallTypeLdMerge(MI->getOpcode());			return isSmallTypeLdMerge(MI->getOpcode());
	}			}

				// This function is called by findMatchingInsn.
				// Only check the target instruction is load/store and ScanLimit is 1 or not.
				static bool isRightNextLdStMerge(MachineInstr *MI, unsigned Limit) {
				if (!MI->mayStore() && !MI->mayLoad() && !MI->isCall() &&
				junbumlUnsubmitted Not Done Reply Inline Actions You can use clang-format to check your coding style. junbuml: You can use clang-format to check your coding style.
				flyingforyouAuthorUnsubmitted Not Done Reply Inline Actions I will fix it. I didn't know clang-format. It's very convinient. Thanks Junbum. flyingforyou: I will fix it. I didn't know clang-format. It's very convinient. Thanks Junbum.
				junbumlUnsubmitted Not Done Reply Inline Actions Do you need to check this again? If this function is called only in findMatchingInsn, assert will be better. junbuml: Do you need to check this again? If this function is called only in findMatchingInsn, assert…
				!MI->hasUnmodeledSideEffects())
				return false;

				return Limit == 1;

				return false;
				junbumlUnsubmitted Not Done Reply Inline Actions Please remove this. junbuml: Please remove this.
				}

	// Scaling factor for unscaled load or store.			// Scaling factor for unscaled load or store.
				junbumlUnsubmitted Not Done Reply Inline Actions You can replace this with return Limit == 1; junbuml: You can replace this with return Limit == 1;
				flyingforyouAuthorUnsubmitted Not Done Reply Inline Actions Yes. You're right. flyingforyou: Yes. You're right.
	static int getMemScale(MachineInstr *MI) {			static int getMemScale(MachineInstr *MI) {
	switch (MI->getOpcode()) {			switch (MI->getOpcode()) {
	default:			default:
	llvm_unreachable("Opcode has unknown scale!");			llvm_unreachable("Opcode has unknown scale!");
	case AArch64::LDRBBui:			case AArch64::LDRBBui:
	case AArch64::STRBBui:			case AArch64::STRBBui:
	return 1;			return 1;
	case AArch64::LDRHHui:			case AArch64::LDRHHui:
	▲ Show 20 Lines • Show All 263 Lines • ▼ Show 20 Lines

	unsigned Opc = FirstMI->getOpcode();			unsigned Opc = FirstMI->getOpcode();
	bool MayLoad = FirstMI->mayLoad();			bool MayLoad = FirstMI->mayLoad();
	bool IsUnscaled = isUnscaledLdSt(FirstMI);			bool IsUnscaled = isUnscaledLdSt(FirstMI);
	unsigned Reg = getLdStRegOp(FirstMI).getReg();			unsigned Reg = getLdStRegOp(FirstMI).getReg();
	unsigned BaseReg = getLdStBaseOp(FirstMI).getReg();			unsigned BaseReg = getLdStBaseOp(FirstMI).getReg();
	int Offset = getLdStOffsetOp(FirstMI).getImm();			int Offset = getLdStOffsetOp(FirstMI).getImm();

				// If first instruction is volatile load/store, we only check just
				// right next instruction. For this, we just set Limit value to 1.
				// That means we just check right next instruction for merging.
				// e.g.,
				// ldr x0, [x2] -> volatile
				// ldr x1, [x2, #8] -> this is just right next instruction of first.
				// ; becomes
				// ldp x0, x1, [x2] -> if memory reference information is empty,
				// it is also treated volatile access.
				// Opposite situation or both of ldr/str instructions are volatile is
				// also ok.
				if (FirstMI->hasOrderedMemoryRef())
				Limit = 1;

	// Early exit if the first instruction modifies the base register.			// Early exit if the first instruction modifies the base register.
	// e.g., ldr x0, [x0]			// e.g., ldr x0, [x0]
	if (FirstMI->modifiesRegister(BaseReg, TRI))			if (FirstMI->modifiesRegister(BaseReg, TRI))
	return E;			return E;

	// Early exit if the offset if not possible to match. (6 bits of positive			// Early exit if the offset if not possible to match. (6 bits of positive
	// range, plus allow an extra one in case we find a later insn that matches			// range, plus allow an extra one in case we find a later insn that matches
	// with Offset-1)			// with Offset-1)
	Show All 17 Lines
	// optimization by changing how far we scan.			// optimization by changing how far we scan.
	if (MI->isDebugValue())			if (MI->isDebugValue())
	continue;			continue;

	// Now that we know this is a real instruction, count it.			// Now that we know this is a real instruction, count it.
	++Count;			++Count;

	bool CanMergeOpc = Opc == MI->getOpcode();			bool CanMergeOpc = Opc == MI->getOpcode();
	Flags.setSExtIdx(-1);			Flags.setSExtIdx(-1);
				mcrosierUnsubmitted Not Done Reply Inline Actions Please create a static helper function with comments suggesting why it's safe to merge adjacent loads/stores that are marked volatile. Also, I don't think this check is sufficient. The order of the offsets determine the ordering of the pairs. For example, ldr x1, [x0, #4] <--- volatile ldr x2, [x0] would create a ldp with the volatile load being reordering, if I'm not mistaken. mcrosier: Please create a static helper function with comments suggesting why it's safe to merge adjacent…
				flyingforyouAuthorUnsubmitted Not Done Reply Inline Actions Hi Chad. I am really appreciate your review. As you said, it has a possiblity of reordering. But I think it's not harmful. After merging, it changed to ldp instruction that has no memory reference information. This means ldp instruction is treated like volatile instruction. I will create a static helper function. If you feel something is missing, please tell me. Thanks. flyingforyou: Hi Chad. I am really appreciate your review. As you said, it has a possiblity of reordering.
	if (!CanMergeOpc) {			if (!CanMergeOpc) {
	bool IsValidLdStrOpc;			bool IsValidLdStrOpc;
	unsigned NonSExtOpc = getMatchingNonSExtOpcode(Opc, &IsValidLdStrOpc);			unsigned NonSExtOpc = getMatchingNonSExtOpcode(Opc, &IsValidLdStrOpc);
	assert(IsValidLdStrOpc &&			assert(IsValidLdStrOpc &&
	"Given Opc should be a Load or Store with an immediate");			"Given Opc should be a Load or Store with an immediate");
	// Opc will be the first instruction in the pair.			// Opc will be the first instruction in the pair.
	Flags.setSExtIdx(NonSExtOpc == (unsigned)Opc ? 1 : 0);			Flags.setSExtIdx(NonSExtOpc == (unsigned)Opc ? 1 : 0);
	CanMergeOpc = NonSExtOpc == getMatchingNonSExtOpcode(MI->getOpcode());			CanMergeOpc = NonSExtOpc == getMatchingNonSExtOpcode(MI->getOpcode());
	}			}

	if (CanMergeOpc && getLdStOffsetOp(MI).isImm()) {			if (CanMergeOpc && getLdStOffsetOp(MI).isImm()) {
	assert(MI->mayLoadOrStore() && "Expected memory operation.");			assert(MI->mayLoadOrStore() && "Expected memory operation.");
	// If we've found another instruction with the same opcode, check to see			// If we've found another instruction with the same opcode, check to see
	// if the base and offset are compatible with our starting instruction.			// if the base and offset are compatible with our starting instruction.
	// These instructions all have scaled immediate operands, so we just			// These instructions all have scaled immediate operands, so we just
	// check for +1/-1. Make sure to check the new instruction offset is			// check for +1/-1. Make sure to check the new instruction offset is
	// actually an immediate and not a symbolic reference destined for			// actually an immediate and not a symbolic reference destined for
	// a relocation.			// a relocation.
	//			//
	// Pairwise instructions have a 7-bit signed offset field. Single insns			// Pairwise instructions have a 7-bit signed offset field. Single insns
	// have a 12-bit unsigned offset field. To be a valid combine, the			// have a 12-bit unsigned offset field. To be a valid combine, the
	// final offset must be in range.			// final offset must be in range.
	unsigned MIBaseReg = getLdStBaseOp(MI).getReg();			unsigned MIBaseReg = getLdStBaseOp(MI).getReg();
	int MIOffset = getLdStOffsetOp(MI).getImm();			int MIOffset = getLdStOffsetOp(MI).getImm();
				junbumlUnsubmitted Not Done Reply Inline Actions Do you need to find a paired until you encounter the first memory operation? Or did you intend to check just right next instruction? Looks like you try to see just right next instruction. Right? If that's the case you could use Count as well. junbuml: Do you need to find a paired until you encounter the first memory operation? Or did you intend…
				flyingforyouAuthorUnsubmitted Not Done Reply Inline Actions I just want to check right next instruction. I think adjacent or just next to.. these expressions are not enough for explanation of my patch. I will change comments and function name also. flyingforyou: I just want to check right next instruction. I think adjacent or just next to.. these…
	if (BaseReg == MIBaseReg && ((Offset == MIOffset + OffsetStride) \|\|			if (BaseReg == MIBaseReg && ((Offset == MIOffset + OffsetStride) \|\|
	(Offset + OffsetStride == MIOffset))) {			(Offset + OffsetStride == MIOffset))) {
	int MinOffset = Offset < MIOffset ? Offset : MIOffset;			int MinOffset = Offset < MIOffset ? Offset : MIOffset;
	// If this is a volatile load/store that otherwise matched, stop looking			// If this is a volatile load/store that otherwise matched, stop looking
	// as something is going on that we don't have enough information to			// as something is going on that we don't have enough information to
	// safely transform. Similarly, stop if we see a hint to avoid pairs.			// safely transform. Similarly, stop if we see a hint to avoid pairs.
	if (MI->hasOrderedMemoryRef() \|\| TII->isLdStPairSuppressed(MI))			// But if the Limit is 1 which means we try to merge just right next
				// load/store instruction, we can merge volatile load/store.
				// There is no semantic difference even if the order is changed.
				// Pairwise instruction doesn't have memory reference information
				// which treated volatile access.
				junbumlUnsubmitted Not Done Reply Inline Actions Why merging with the next instruction is safe. Is this also safe for the narrow load merge case where we concatenate MOMs ? junbuml: Why merging with the next instruction is safe. Is this also safe for the narrow load merge…
				if ((!isRightNextLdStMerge(MI, Limit) && MI->hasOrderedMemoryRef()) \|\|
				junbumlUnsubmitted Not Done Reply Inline Actions I doubt if isRightNextLdStMerge is the right function name? Something specifically volatile will be better ? junbuml: I doubt if isRightNextLdStMerge is the right function name? Something specifically volatile…
				TII->isLdStPairSuppressed(MI))
	return E;			return E;
	// If the resultant immediate offset of merging these instructions			// If the resultant immediate offset of merging these instructions
	// is out of range for a pairwise instruction, bail and keep looking.			// is out of range for a pairwise instruction, bail and keep looking.
	bool MIIsUnscaled = isUnscaledLdSt(MI);			bool MIIsUnscaled = isUnscaledLdSt(MI);
	bool IsSmallTypeLd = isSmallTypeLdMerge(MI->getOpcode());			bool IsSmallTypeLd = isSmallTypeLdMerge(MI->getOpcode());
	if (!IsSmallTypeLd &&			if (!IsSmallTypeLd &&
	!inBoundsForPair(MIIsUnscaled, MinOffset, OffsetStride)) {			!inBoundsForPair(MIIsUnscaled, MinOffset, OffsetStride)) {
	trackRegDefsUses(MI, ModifiedRegs, UsedRegs, TRI);			trackRegDefsUses(MI, ModifiedRegs, UsedRegs, TRI);
	▲ Show 20 Lines • Show All 244 Lines • ▼ Show 20 Lines
	MachineInstr *MI = MBBI;			MachineInstr *MI = MBBI;
	// Skip DBG_VALUE instructions. Otherwise debug info can affect the			// Skip DBG_VALUE instructions. Otherwise debug info can affect the
	// optimization by changing how far we scan.			// optimization by changing how far we scan.
	if (MI->isDebugValue())			if (MI->isDebugValue())
	continue;			continue;

	// Now that we know this is a real instruction, count it.			// Now that we know this is a real instruction, count it.
	++Count;			++Count;

				mcrosierUnsubmitted Not Done Reply Inline Actions Please add a few comments about what's going on here. mcrosier: Please add a few comments about what's going on here.
				flyingforyouAuthorUnsubmitted Not Done Reply Inline Actions I will do that. Thanks. flyingforyou: I will do that. Thanks.
	// If we found a match, return it.			// If we found a match, return it.
	if (isMatchingUpdateInsn(I, MI, BaseReg, Offset))			if (isMatchingUpdateInsn(I, MI, BaseReg, Offset))
	return MBBI;			return MBBI;

	// Update the status of what the instruction clobbered and used.			// Update the status of what the instruction clobbered and used.
	trackRegDefsUses(MI, ModifiedRegs, UsedRegs, TRI);			trackRegDefsUses(MI, ModifiedRegs, UsedRegs, TRI);

	// Otherwise, if the base register is used or modified, we have no match, so			// Otherwise, if the base register is used or modified, we have no match, so
	// return early.			// return early.
	if (ModifiedRegs[BaseReg] \|\| UsedRegs[BaseReg])			if (ModifiedRegs[BaseReg] \|\| UsedRegs[BaseReg])
	return E;			return E;
				junbumlUnsubmitted Not Done Reply Inline Actions I think you can move this check in findMatchingInsn(). junbuml: I think you can move this check in findMatchingInsn().
				flyingforyouAuthorUnsubmitted Not Done Reply Inline Actions Ok. I will. flyingforyou: Ok. I will.
	}			}
	return E;			return E;
	}			}

	bool AArch64LoadStoreOpt::tryToMergeLdStInst(			bool AArch64LoadStoreOpt::tryToMergeLdStInst(
	MachineBasicBlock::iterator &MBBI) {			MachineBasicBlock::iterator &MBBI) {
	MachineInstr *MI = MBBI;			MachineInstr *MI = MBBI;
	MachineBasicBlock::iterator E = MI->getParent()->end();			MachineBasicBlock::iterator E = MI->getParent()->end();
	// If this is a volatile load/store, don't mess with it.
	if (MI->hasOrderedMemoryRef())
	return false;

	// Make sure this is a reg+imm (as opposed to an address reloc).			// Make sure this is a reg+imm (as opposed to an address reloc).
	if (!getLdStOffsetOp(MI).isImm())			if (!getLdStOffsetOp(MI).isImm())
	return false;			return false;

	// Check if this load/store has a hint to avoid pair formation.			// Check if this load/store has a hint to avoid pair formation.
	// MachineMemOperands hints are set by the AArch64StorePairSuppress pass.			// MachineMemOperands hints are set by the AArch64StorePairSuppress pass.
	if (TII->isLdStPairSuppressed(MI))			if (TII->isLdStPairSuppressed(MI))
	return false;			return false;
				junbumlUnsubmitted Not Done Reply Inline Actions Can you add little bit more detail about why you want to check just the right next instruction ? junbuml: Can you add little bit more detail about why you want to check just the right next instruction ?
				flyingforyouAuthorUnsubmitted Not Done Reply Inline Actions After moving "hasOrderedMemoryRef", I will add more comment inside of findMatchingInsn function. flyingforyou: After moving "hasOrderedMemoryRef", I will add more comment inside of findMatchingInsn function.

	// Look ahead up to ScanLimit instructions for a pairable instruction.			// Look ahead up to ScanLimit instructions for a pairable instruction.
	LdStPairFlags Flags;			LdStPairFlags Flags;
	MachineBasicBlock::iterator Paired = findMatchingInsn(MBBI, Flags, ScanLimit);			MachineBasicBlock::iterator Paired = findMatchingInsn(MBBI, Flags, ScanLimit);
	if (Paired != E) {			if (Paired != E) {
	if (isSmallTypeLdMerge(MI)) {			if (isSmallTypeLdMerge(MI)) {
	++NumSmallTypeMerged;			++NumSmallTypeMerged;
	} else {			} else {
	▲ Show 20 Lines • Show All 123 Lines • Show Last 20 Lines

test/CodeGen/AArch64/arm64-ldp.ll

	Show First 20 Lines • Show All 132 Lines • ▼ Show 20 Lines
	%sexttmp = sext i32 %tmp to i64	%sexttmp = sext i32 %tmp to i64
	%sexttmp1 = sext i32 %tmp1 to i64	%sexttmp1 = sext i32 %tmp1 to i64
	%ptr = getelementptr inbounds i32, i32* %add.ptr, i64 1	%ptr = getelementptr inbounds i32, i32* %add.ptr, i64 1
	call void @use-ptr(i32* %ptr)	call void @use-ptr(i32* %ptr)
	%add = add nsw i64 %sexttmp1, %sexttmp	%add = add nsw i64 %sexttmp1, %sexttmp
	ret i64 %add	ret i64 %add
	}	}

		; Test the load/store optimizer---combine adjacent volatile ldr
		; CHECK-LABEL: volatile_ldp_long
		; CHECK: ldr
		; CHECK: ldp
		; CHECK: ldr
		define i64 @volatile_ldp_long(i64* %p) nounwind {
		junbumlUnsubmitted Not Done Reply Inline Actions It will be good to add more tests to cover all other types of loads/stores impacted by this patch, not just i64. junbuml: It will be good to add more tests to cover all other types of loads/stores impacted by this…
		%add.ptr = getelementptr inbounds i64, i64* %p, i64 0
		%tmp = load volatile i64, i64* %add.ptr, align 8
		%add.ptr1 = getelementptr inbounds i64, i64* %p, i64 2
		%tmp1 = load volatile i64, i64* %add.ptr1, align 8
		%add.ptr2 = getelementptr inbounds i64, i64* %p, i64 3
		%tmp2 = load volatile i64, i64* %add.ptr2, align 8
		%add.ptr3 = getelementptr inbounds i64, i64* %p, i64 1
		%tmp3 = load volatile i64, i64* %add.ptr3, align 8
		%add = add nsw i64 %tmp1, %tmp
		%add2 = add nsw i64 %tmp2, %add
		%add3 = add nsw i64 %tmp3, %add2
		ret i64 %add3
		}
Context not available.

test/CodeGen/AArch64/arm64-platform-reg.ll

	; RUN: llc -mtriple=arm64-apple-ios -mattr=+reserve-x18 -o - %s \| FileCheck %s --check-prefix=CHECK-RESERVE-X18			; RUN: llc -mtriple=arm64-apple-ios -mattr=+reserve-x18 -o - %s \| FileCheck %s --check-prefix=CHECK-RESERVE-X18
	; RUN: llc -mtriple=arm64-freebsd-gnu -mattr=+reserve-x18 -o - %s \| FileCheck %s --check-prefix=CHECK-RESERVE-X18			; RUN: llc -mtriple=arm64-freebsd-gnu -mattr=+reserve-x18 -o - %s \| FileCheck %s --check-prefix=CHECK-RESERVE-X18
	; RUN: llc -mtriple=arm64-linux-gnu -o - %s \| FileCheck %s			; RUN: llc -mtriple=arm64-linux-gnu -o - %s \| FileCheck %s

	; x18 is reserved as a platform register on Darwin but not on other			; x18 is reserved as a platform register on Darwin but not on other
	; systems. Create loads of register pressure and make sure this is respected.			; systems. Create loads of register pressure and make sure this is respected.

	; Also, fp must always refer to a valid frame record, even if it's not the one			; Also, fp must always refer to a valid frame record, even if it's not the one
	; of the current function, so it shouldn't be used either.			; of the current function, so it shouldn't be used either.

	@var = global [30 x i64] zeroinitializer			@var = global [30 x i64] zeroinitializer

	define void @keep_live() {			define void @keep_live() {
	%val = load volatile [30 x i64], [30 x i64]* @var			%val = load volatile [30 x i64], [30 x i64]* @var
	store volatile [30 x i64] %val, [30 x i64]* @var			store volatile [30 x i64] %val, [30 x i64]* @var

	; CHECK: ldr x18			; CHECK: ldp x17, x18
	; CHECK: str x18			; CHECK: stp x17, x18

	; CHECK-RESERVE-X18-NOT: ldr fp			; CHECK-RESERVE-X18-NOT: ldp x17, fp
	; CHECK-RESERVE-X18-NOT: ldr x18			; CHECK-RESERVE-X18-NOT: ldp x17, x18
	; CHECK-RESERVE-X18: Spill			; CHECK-RESERVE-X18: Spill
	; CHECK-RESERVE-X18-NOT: ldr fp			; CHECK-RESERVE-X18-NOT: ldp x17, fp
	; CHECK-RESERVE-X18-NOT: ldr x18			; CHECK-RESERVE-X18-NOT: ldp x17, x18
	; CHECK-RESERVE-X18: ret			; CHECK-RESERVE-X18: ret
	ret void			ret void
	}			}

test/CodeGen/AArch64/arm64-stp.ll

Show First 20 Lines • Show All 123 Lines • ▼ Show 20 Lines	define i32 @stp_int_rar_hazard_after(i32 %w0, i32 %a, i32 %b, i32* nocapture %p) nounwind {
store i32 %a, i32* %p, align 4		store i32 %a, i32* %p, align 4
%ld.ptr = getelementptr inbounds i32, i32* %p, i64 1		%ld.ptr = getelementptr inbounds i32, i32* %p, i64 1
%tmp = load i32, i32* %ld.ptr, align 4		%tmp = load i32, i32* %ld.ptr, align 4
%tmp2 = add i32 %tmp, %b		%tmp2 = add i32 %tmp, %b
%add.ptr = getelementptr inbounds i32, i32* %p, i64 1		%add.ptr = getelementptr inbounds i32, i32* %p, i64 1
store i32 %b, i32* %add.ptr, align 4		store i32 %b, i32* %add.ptr, align 4
ret i32 %tmp2		ret i32 %tmp2
}		}

		; Test the load/store optimizer---combine adjacent volatile str
		; CHECK-LABEL: volatile_stp_long
		; CHECK: str x0, [x4]
		; CHECK: stp x1, x2, [x4, #16]
		; CHECK: str x3, [x4, #8]
		define void @volatile_stp_long(i64 %a, i64 %b, i64 %c, i64 %d,i64* nocapture %p) nounwind {
		%add.ptr = getelementptr inbounds i64, i64* %p, i64 0
		store volatile i64 %a, i64* %add.ptr, align 8
		%add.ptr1 = getelementptr inbounds i64, i64* %p, i64 2
		store volatile i64 %b, i64* %add.ptr1, align 8
		%add.ptr2 = getelementptr inbounds i64, i64* %p, i64 3
		store volatile i64 %c, i64* %add.ptr2, align 8
		%add.ptr3 = getelementptr inbounds i64, i64* %p, i64 1
		store volatile i64 %d, i64* %add.ptr3, align 8
		ret void
		}

test/CodeGen/AArch64/arm64-volatile.ll

	; RUN: llc < %s -march=arm64 \| FileCheck %s			; RUN: llc < %s -march=arm64 \| FileCheck %s
	define i64 @normal_load(i64* nocapture %bar) nounwind readonly {			define i64 @normal_load(i64* nocapture %bar) nounwind readonly {
	; CHECK: normal_load			; CHECK: normal_load
	; CHECK: ldp			; CHECK: ldp
				; CHECK: ldp
				; CHECK-NEXT: add
				; CHECK-NEXT: add
	; CHECK-NEXT: add			; CHECK-NEXT: add
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%add.ptr = getelementptr inbounds i64, i64* %bar, i64 1			%add.ptr = getelementptr inbounds i64, i64* %bar, i64 1
	%tmp = load i64, i64* %add.ptr, align 8			%tmp = load i64, i64* %add.ptr, align 8
	%add.ptr1 = getelementptr inbounds i64, i64* %bar, i64 2			%add.ptr1 = getelementptr inbounds i64, i64* %bar, i64 5
	%tmp1 = load i64, i64* %add.ptr1, align 8			%tmp1 = load i64, i64* %add.ptr1, align 8
				%add.ptr2 = getelementptr inbounds i64, i64* %bar, i64 2
				%tmp2 = load i64, i64* %add.ptr2, align 8
				%add.ptr3 = getelementptr inbounds i64, i64* %bar, i64 4
				%tmp3 = load i64, i64* %add.ptr3, align 8
	%add = add nsw i64 %tmp1, %tmp			%add = add nsw i64 %tmp1, %tmp
	ret i64 %add			%add2 = add nsw i64 %tmp3, %tmp2
				%add3 = add nsw i64 %add, %add2
				ret i64 %add3
	}			}

	define i64 @volatile_load(i64* nocapture %bar) nounwind {			define i64 @volatile_load(i64* nocapture %bar) nounwind {
	; CHECK: volatile_load			; CHECK: volatile_load
	; CHECK: ldr			; CHECK: ldr
	; CHECK-NEXT: ldr			; CHECK-NEXT: ldr
				; CHECK-NEXT: ldr
				; CHECK-NEXT: ldr
				; CHECK-NEXT: add
				; CHECK-NEXT: add
	; CHECK-NEXT: add			; CHECK-NEXT: add
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%add.ptr = getelementptr inbounds i64, i64* %bar, i64 1			%add.ptr = getelementptr inbounds i64, i64* %bar, i64 1
	%tmp = load volatile i64, i64* %add.ptr, align 8			%tmp = load volatile i64, i64* %add.ptr, align 8
	%add.ptr1 = getelementptr inbounds i64, i64* %bar, i64 2			%add.ptr1 = getelementptr inbounds i64, i64* %bar, i64 5
	%tmp1 = load volatile i64, i64* %add.ptr1, align 8			%tmp1 = load volatile i64, i64* %add.ptr1, align 8
				%add.ptr2 = getelementptr inbounds i64, i64* %bar, i64 2
				%tmp2 = load volatile i64, i64* %add.ptr2, align 8
				%add.ptr3 = getelementptr inbounds i64, i64* %bar, i64 4
				%tmp3 = load volatile i64, i64* %add.ptr3, align 8
	%add = add nsw i64 %tmp1, %tmp			%add = add nsw i64 %tmp1, %tmp
	ret i64 %add			%add2 = add nsw i64 %tmp3, %tmp2
				%add3 = add nsw i64 %add, %add2
				ret i64 %add3
	}			}

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Applying load pair optimization for volatile load/storeAbandonedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 40116

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp

test/CodeGen/AArch64/arm64-ldp.ll

test/CodeGen/AArch64/arm64-platform-reg.ll

test/CodeGen/AArch64/arm64-stp.ll

test/CodeGen/AArch64/arm64-volatile.ll

[AArch64] Applying load pair optimization for volatile load/store
AbandonedPublic