This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Improve load/store optimizer to handle LDUR + LDR.
ClosedPublic

Authored by ab on Aug 18 2015, 2:12 PM.

Details

Summary

This patch allows the mixing of scaled and unscaled load/stores to form
load/store pairs.

PR24465

Diff Detail

Event Timeline

mcrosier updated this revision to Diff 32449.Aug 18 2015, 2:12 PM
mcrosier retitled this revision from to [AArch64] Improve load/store optimizer to handle LDUR + LDR..
mcrosier updated this object.
mcrosier added reviewers: ab, gberry, mssimpso.
mcrosier set the repository for this revision to rL LLVM.

I haven't done any testing beyond basic lit tests. Hopefully, I'll have correctness/perf testing in flight by tomorrow. Regardless, I think this is in good enough shape to review.

kristof.beyls added inline comments.Aug 19 2015, 3:26 AM
test/CodeGen/AArch64/ldp-stp-scaled-unscaled-pairs.ll
2

My preference is to use -march=aarch64 instead of -march=arm64 as much as possible, as aarch64 is the official name for the architecture.
Is there a specific reason -mcpu=cortex-a57 needs to be added to the test line? I don't have a strong opinion on whether it should be there or not, just wondering.

82

Looking at this test case, I see that before this patch, the following code is produced:

ldur    x8, [x0, #-8]
ldr     x9, [x0]

If I'm not mistaken, ldur x8, [x0, #-8] has the same functionality as ldr x8, [x0, #-1]? If so, wouldn't it be better to make sure we produce ldr instead of ldur in the first place?
If we did that, and there were still a good reason to have special code to convert LDR + LDUR into LDP, I guess none of the above test cases really shows that (although I haven't investigated every single test case in detail)?

kristof.beyls added inline comments.Aug 19 2015, 4:08 AM
test/CodeGen/AArch64/ldp-stp-scaled-unscaled-pairs.ll
82

D'oh! The ARMARM clearly states that the scaled immediate offsets in ldr x8, [x0, #imm] can only be positive/unsigned. Please ignore my comment above!

mcrosier updated this revision to Diff 32819.Aug 21 2015, 6:07 AM
mcrosier added a reviewer: mzolotukhin.

Addressed Kristof's comments. Included full diff.

mcrosier updated this revision to Diff 32820.Aug 21 2015, 6:17 AM

Fix test7, which I modified/broke while doing additional testing.

mcrosier updated this revision to Diff 32821.Aug 21 2015, 6:19 AM
mzolotukhin edited edge metadata.Aug 22 2015, 12:04 PM

Hi Chad,

I checked on a small testcase, and with this patch we do merge STUR and STR. There is one potential issue though: in some cases we intentionally split STP into two STUR/STR, as there is a big performance penalty if STP crosses a cache line (see e.g. performSTORECombine in AArch64ISelLowering.cpp and getMemoryOpCost in AArch64TargetTransformInfo.cpp). So, I think we might want to perform this combining only for non-temporal or known-to-be-well-aligned memory accesses. What do you think, does it make sense?

Thanks,
Michael

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
395

This looks strange. Is it expected that the scaled and unscaled offsets differ by MemSize^2? (In one case you multiply by MemSize, in the other you divide.)

656

Same here.

Hi Michael,
So, you're saying that performSTORECombine splits 16B stores that cross cache-line boundaries for performance reasons. AFAICT, this combine is only enabled for Cyclone. Later, the AArch64 load/store optimizer runs and combines these split stores, undoing the optimization performed during ISelLowering.

If I understand things correctly, I think this makes sense. However, I'd probably implement this in a separate patch and test it on other subtargets (e.g., A57). Sound reasonable?

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
395

My assumption is that the scaled offset is always the unscaled offset divided by the MemSize and, conversely, the unscaled offset is always the scaled offset multiplied by the MemSize; I believe this is valid.

I.e.,
Unscaled Offset = Scaled Offset * MemSize;
Scaled Offset = Unscaled Offset / MemSize;

What did I miss? I'm trying to simplify the logic so that both offsets are either scaled or unscaled, but not a mix of the two. I'm guessing you're concerned my conversion is incorrect.
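
In other words, here is a minimal sketch of the conversion I have in mind (the helper names are hypothetical and not the actual code in AArch64LoadStoreOptimizer.cpp):

#include <cassert>

// Illustrative only: MemSize is the access size in bytes (e.g. 8 for an
// X-register load/store).
static int unscaledFromScaled(int ScaledOffset, int MemSize) {
  return ScaledOffset * MemSize; // Unscaled Offset = Scaled Offset * MemSize
}

static int scaledFromUnscaled(int UnscaledOffset, int MemSize) {
  return UnscaledOffset / MemSize; // Scaled Offset = Unscaled Offset / MemSize
}

int main() {
  // With 8-byte accesses, an unscaled (byte) offset of 16 corresponds to a
  // scaled offset of 2, and converting back gives 16 again.
  assert(scaledFromUnscaled(16, 8) == 2);
  assert(unscaledFromScaled(2, 8) == 16);
  return 0;
}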

Hi Chad,

As far as I understand, currently ISelLowering splits the unaligned stores, from which we happen to get STUR and STR, which we can't combine to STP without this patch. With this patch, we'll be able to merge them back, so we'll undo that optimization.

I think that being able to combine STUR/STR to STP is good by itself, but from the patch alone we'll probably get regressions (AFAIR, the biggest one was matmul from the LNT test suite). That's my only concern, but I don't mind landing the patch and addressing this issue later if we actually observe it.

Also, I can't provide a high-quality review of this code, as I'm not very familiar with it, so I'd appreciate other people's feedback on the patch.

Thanks!

Michael

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
395

Yep, I was concerned that it was incorrect, but I just read the code incorrectly. Sorry for that!
(I thought that we do PairedOffset = UnscaledOffset * MemSize in one case and PairedOffset = UnscaledOffset / MemSize in the other.)

ab edited edge metadata.Aug 31 2015, 6:00 PM

This LGTM (modulo nits), but let's see what the testsuite says first.

As far as I understand, currently ISelLowering splits the unaligned stores, from which we happen to get STUR and STR, which we can't combine to STP without this patch. With this patch, we'll be able to merge them back, so we'll undo that optimization.

But this issue isn't specific to this patch, right? If the unaligned store was split to STR+STR we would have generated an STP even before this change. I agree we'll need to do something about this though, but separately, and for both mixed and non-mixed STR/STUR pairs.

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
390

MI -> Paired, FirstMI -> I ?

557–560

What about removing CanMergeOpc? It's assigned several times with different values, and all of those assignments could be folded into ifs/returns.
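
Something along these lines, purely as an illustration (the condition and helper are invented, not the real code at this location):

// Stand-in for whatever opcode relationship the real code checks.
static bool isUnscaledVariantOf(unsigned Opc, unsigned PairOpc) {
  return Opc + 1 == PairOpc;
}

// Before: a flag is assigned several times and only checked afterwards.
static bool canMergeBefore(unsigned Opc, unsigned PairOpc) {
  bool CanMergeOpc = (Opc == PairOpc);
  if (!CanMergeOpc)
    CanMergeOpc = isUnscaledVariantOf(Opc, PairOpc);
  return CanMergeOpc;
}

// After: fold each condition into the return and drop the flag entirely.
static bool canMergeAfter(unsigned Opc, unsigned PairOpc) {
  return Opc == PairOpc || isUnscaledVariantOf(Opc, PairOpc);
}

int main() {
  return canMergeBefore(1, 2) == canMergeAfter(1, 2) ? 0 : 1;
}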

569–570

Same: fold the condition into the if, remove CanMergeOpc?

577–579

I see this is the only use of the get*From* helpers. What about:

getMatchingPairOpcode(Opc) == getMatchingPairOpcode(PairOpc)

That'll let us get rid of the added functions.

580–583

Same: fold the condition into the return, remove CanMergeOpc?

677

To make sure I understand: the previous code was "incorrect" in checking MI rather than FirstMI, but it didn't make a difference because we didn't allow mismatched opcodes, right?

test/CodeGen/AArch64/ldp-stp-scaled-unscaled-pairs.ll
9

Kill the "entry:"s ?

Thanks for the feedback, Ahmed.

In D12116#236954, @ab wrote:

This LGTM (modulo nits), but let's see what the testsuite says first.

Unfortunately, I don't have a setup that can test with the testsuite at the moment. That is in the works...

This transformation is very narrow and did not hit anything in SPEC2000. Therefore, I'm going to move on to other work with a higher ROI. However, feel free to push this one along.

As far as I understand, currently ISelLowering splits the unaligned stores, from which we happen to get STUR and STR, which we can't combine to STP without this patch. With this patch, we'll be able to merge them back, so we'll undo that optimization.

But this issue isn't specific to this patch, right? If the unaligned store was split to STR+STR we would have generated an STP even before this change. I agree we'll need to do something about this though, but separately, and for both mixed and non-mixed STR/STUR pairs.

That is absolutely correct! This patch only applies to the very narrow case of STR/STUR. The common case is when we're pairing STUR/STUR, which has already been committed.

mcrosier added inline comments.Sep 1 2015, 8:36 AM
lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
677

I'm not sure the checking was "incorrect", but your latter comment is correct; because we didn't allow mismatched opcodes, it didn't make a difference. However, because we're converting MI to be either Scaled or Unscaled to match FirstMI, we need to use IsUnscaled in the context of this patch.

Hi Chad, Ahmed,

I ran some testing on this patch and found a bug. The issue is that when we scale the offset, we need to make sure that the original value is divisible by the scale factor. Here is a test to illustrate the issue:

define <2 x i64> @test4(i64* %p) nounwind {
  %a1 = bitcast i64* %p to <2 x i64>*
  %tmp1 = load <2 x i64>, <2 x i64>* %a1, align 8            ; load <p[0], p[1]>

  %add.ptr2 = getelementptr inbounds i64, i64* %p, i64 3
  %a2 = bitcast i64* %add.ptr2 to <2 x i64>*
  %tmp2 = load <2 x i64>, <2 x i64>* %a2, align 8            ; load <p[3], p[4]>
  %add = add nsw <2 x i64> %tmp1, %tmp2
  ret <2 x i64> %add
}

The current patch will combine these two loads, which is incorrect: the second load starts 24 bytes after the first, and 24 is not a multiple of the 16-byte access size, so the truncating division by the scale factor makes the loads appear adjacent when they are not.

Michael

ab commandeered this revision.Sep 2 2015, 8:12 AM
ab edited reviewers, added: mcrosier; removed: ab.

Thanks Michael. Chad, I'll fix that and resubmit, thanks for the patch!

ab updated this revision to Diff 33810.Sep 2 2015, 8:13 AM
ab edited edge metadata.
ab removed rL LLVM as the repository for this revision.

And this should catch the unscaled offset not being a multiple of the scale (with abundant asserts).
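
Roughly, the guard amounts to something like the following sketch (names and structure are invented here; the committed code may differ):

#include <cassert>

// Only produce a scaled offset when the unscaled (byte) offset is a multiple
// of the access size; otherwise the instructions must not be paired.
static bool getScaledPairOffset(int UnscaledOffset, int MemSize,
                                int &ScaledOffset) {
  if (UnscaledOffset % MemSize != 0)
    return false;
  ScaledOffset = UnscaledOffset / MemSize;
  return true;
}

int main() {
  int Scaled = 0;
  // test4 above: the second load is 24 bytes in, 24 % 16 != 0, so do not pair.
  assert(!getScaledPairOffset(24, 16, Scaled));
  assert(getScaledPairOffset(16, 16, Scaled) && Scaled == 1);
  return 0;
}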

mcrosier edited edge metadata.Sep 2 2015, 8:14 AM

Thanks, Ahmed.

ab marked 14 inline comments as done.Sep 2 2015, 8:15 AM

Hi,

I just finished an LNT+externals run and saw no failures. Thanks for the fix!

Michael

mcrosier accepted this revision.Sep 3 2015, 5:30 AM
mcrosier edited edge metadata.

Thanks for seeing this one through, Ahmed. LGTM.

This revision is now accepted and ready to land.Sep 3 2015, 5:30 AM
mcrosier closed this revision.Sep 3 2015, 7:44 AM

Committed r246769.

test/CodeGen/AArch64/ldp-stp-scaled-unscaled-pairs.ll
1

BTW, I added -aarch64-neon-syntax=apple to the final commit, so that this will pass on all platforms.

I speculatively reverted r246769 in r246782, as the compiler looks to be ICEing on MultiSource/Benchmarks/tramp3d-v4.