Phabricator

AArch64: use ldp/stp for atomic & volatile 128-bit where appropriate.
Needs ReviewPublic

Authored by t.p.northover on Sep 12 2019, 2:38 AM.

Details

Reviewers
labrinea
Summary

From v8.4a onwards, aligned 128-bit ldp and stp instructions are guaranteed to be single-copy atomic, so they are going to be a lot more efficient than the CAS loop we used to implement "load atomic" and "store atomic" before, even if we do need a DMB sometimes. Additionally, some earlier CPUs had this property anyway, so it makes sense to use it there too.

Finally, even before this guarantee there are machine-specific circumstances where a 128-bit ldp/stp makes sense but doesn't really fit an atomic profile. So this change extends the selection to volatile accesses, even unaligned ones (presumably the coder knows what they're doing). The one exception for volatile is when -mstrict-align is in force; that should take precedence.

Diff Detail

Event Timeline

t.p.northover created this revision. Sep 12 2019, 2:38 AM

Hi Tim, thanks for looking into this optimization opportunity. I have a few remarks regarding this change:

  • First, it appears that the current codegen (CAS loop) for 128-bit atomic accesses is broken, based on this comment: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70814#c3. There are two problematic cases as far as I understand: (1) const and (2) volatile atomic objects. Const objects disallow write access to the underlying memory, while volatile objects mandate that each byte of the underlying memory shall be accessed exactly once according to the AAPCS. The CAS loop violates both.
  • Maybe the solution is to follow GCC here, at least for the general case (architectures prior to v8.4), meaning to expand atomic operations into a call to the appropriate __atomic library function. I believe this is what the AtomicExpandPass does in LLVM. In that case we need to provide an implementation of those functions.
  • My concern about using ldp/stp is that the specification promises single-copy atomicity provided that accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory. Is there such a guarantee to the compiler? Maybe that's the default for Operating Systems but not for bare-metal? Would it make sense to enable this optimization for certain target triples, i.e. when sys != none?
  • In certain code sequences where you have two consecutive atomic stores, or an atomic load followed by an atomic store, you'll end up with redundant memory barriers. Is there a way to get rid of them?
  • Enabling the corresponding subtarget feature on cyclone doesn't seem safe to me. If we ever implement -mtune, then a command line like clang -march=armv8a -mtune=cyclone should mean "generate correct code for the v8.0 architecture, but optimize for cyclone". Adding this feature to cyclone as is would probably result in the above command line producing code that isn't architecturally correct for v8.0.

First, it appears that the current codegen (CAS loop) for 128-bit atomic accesses is broken, based on this comment: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70814#c3. There are two problematic cases as far as I understand: (1) const and (2) volatile atomic objects. Const objects disallow write access to the underlying memory, while volatile objects mandate that each byte of the underlying memory shall be accessed exactly once according to the AAPCS. The CAS loop violates both.

Maybe the solution is to follow GCC here, at least for the general case (architectures prior to v8.4), meaning to expand atomic operations into a call to the appropriate __atomic library function. I believe this is what the AtomicExpandPass does in LLVM. In that case we need to provide an implementation of those functions.

I think Clang is involved there too, in horribly non-obvious ways (for example I think that's the only way to get the actual libcalls you want rather than legacy ones). Either way, that's a change that would need pretty careful coordination. Since all of our CPUs are Cyclone or above we could probably just skip the libcalls entirely at Apple without ABI breakage (which, unintentionally, is what this patch does).

My concern about using ldp/stp is that the specification promises single-copy atomicity provided that accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory. Is there such a guarantee to the compiler? Maybe that's the default for Operating Systems but not for bare-metal? Would it make sense to enable this optimization for certain target triples, i.e. when sys != none?

I don't think anyone has written down a guarantee, but we've pretty much always assumed we're accessing reasonably normal memory. dmb instructions are always ish for example. I've never had any comments from our more embedded developers on that front (or seen anyone try to do general atomics in another realm). I suspect they go to assembly for the few spots it might matter.

In certain code sequences where you have two consecutive atomic stores, or an atomic load followed by an atomic store, you'll end up with redundant memory barriers. Is there a way to get rid of them?

ARM has a pass designed to merge adjacent barriers, though I've seen it miss some cases. We might think about porting it to AArch64, or maybe doing some work in AtomicExpansion in a generic way.

Enabling the corresponding subtarget feature on cyclone doesn't seem safe to me. If we ever implement -mtune, then a command line like clang -march=armv8a -mtune=cyclone should mean "generate correct code for the v8.0 architecture, but optimize for cyclone". Adding this feature to cyclone as is would probably result in the above command line producing code that isn't architecturally correct for v8.0.

I don't think Clang really does anything with -mtune yet. The most it could do based on the way CPUs are implemented in LLVM now would be something like applying the scheduling model. Almost all of the features in the list are going to break older CPUs.

Enabling the corresponding subtarget feature on cyclone doesn't seem safe to me. If we ever implement -mtune, then a command line like clang -march=armv8a -mtune=cyclone should mean "generate correct code for the v8.0 architecture, but optimize for cyclone". Adding this feature to cyclone as is would probably result in the above command line producing code that isn't architecturally correct for v8.0.

I don't think Clang really does anything with -mtune yet. The most it could do based on the way CPUs are implemented in LLVM now would be something like applying the scheduling model. Almost all of the features in the list are going to break older CPUs.

I'm afraid I mentioned to Alexandros that I wondered how this would interact with a potential future enabling for an -mtune feature, leading to the above question.
But you're right, if we do end up implementing support for -mtune, we'd need to categorize subtarget features into either architectural or tuning features, and only enable the tuning features for a subtarget when -mtune is used. Clearly, FeatureAtomicLDPSTP is an architectural feature. So this part of the patch is just fine, I think.
Sorry for the noise.

I think Clang is involved there too, in horribly non-obvious ways (for example I think that's the only way to get the actual libcalls you want rather than legacy ones). Either way, that's a change that would need pretty careful coordination. Since all of our CPUs are Cyclone or above we could probably just skip the libcalls entirely at Apple without ABI breakage (which, unintentionally, is what this patch does).

I am not sure I am following here. According to https://llvm.org/docs/Atomics.html the AtomicExpandPass will translate atomic operations on data sizes above MaxAtomicSizeInBitsSupported into calls to atomic libcalls. The docs say that even though the libcalls share the same names with clang builtins, they are not directly related to them. Indeed, I hacked the AArch64 backend to disallow codegen for 128-bit atomics and as a result LLVM emitted calls to __atomic_store_16 and __atomic_load_16. Are those legacy names? I also tried emitting IR for the clang builtins and I saw atomic load/store IR instructions (like those in your tests), no libcalls. Anyhow, my concern here is that if sometime in the future we replace the broken CAS loop with a libcall, the current patch will break ABI compatibility between v8.4 objects with atomic ldp/stp and v8.X objects without the extension. Moreover, this ABI incompatibility already exists between objects built with LLVM and GCC. Any thoughts?

I don't think anyone has written down a guarantee, but we've pretty much always assumed we're accessing reasonably normal memory. dmb instructions are always ish for example. I've never had any comments from our more embedded developers on that front (or seen anyone try to do general atomics in another realm). I suspect they go to assembly for the few spots it might matter.

Fair enough.

ARM has a pass designed to merge adjacent barriers, though I've seen it miss some cases. We might think about porting it to AArch64, or maybe doing some work in AtomicExpansion in a generic way.

Sounds good. On that note, can you add some test cases (like the sequences I mentioned) with a FIXME label as motivating examples for such an optimization in future work?