Adding to the work done in D131773, here we add support for 128-bit non-temporal loads.
My understanding is that the !nontemporal metadata is a hint to the backend that the loaded data will not be reused, so caching it is unlikely to be useful for performance. It isn't a mandate. Similarly, the ldnp pairwise loads are a hint to the microarchitecture that the loaded data is likely not going to be reused and so does not need to be cached. Again, the microarchitecture is free to ignore the hint.
With that, is it beneficial to force the use of 128-bit non-temporal loads if it leads to more instructions overall? For i256 it was almost certainly a good thing to do, but for i128 it sounds like it might end up slowing things down more than it helps. I'm not sure where the balance point lies.
I agree that this seems more uarch-specific. If the selected uarch just ignores the hints, we shouldn't try too hard to generate LDNP. But on some uarchs, avoiding cache pollution can outweigh the drawbacks of having to issue more load instructions (and the overhead of a few extra movs should be negligible on most beefier uarchs). I think codegen could also be improved for the cases where we have input types that need further legalization.
In most cases the hint comes directly from the user, who hopefully knows how to use it.
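For illustration, this is roughly how a user requests it from C/C++ with Clang's non-temporal builtins; the function and type names below are just an example, but the builtins are the standard Clang ones and attach !nontemporal metadata to the generated 128-bit loads and stores:

// Example only: a streaming copy where the source is read once and never
// reused, so caching it is pointless. Whether the tagged loads/stores become
// LDNP/STNP is up to the backend support discussed in this patch.
typedef double f64x2 __attribute__((vector_size(16))); // 128-bit vector type

void stream_copy(f64x2 *dst, f64x2 *src, long n) {
  for (long i = 0; i < n; ++i) {
    f64x2 v = __builtin_nontemporal_load(&src[i]);
    __builtin_nontemporal_store(v, &dst[i]);
  }
}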
llvm/test/CodeGen/AArch64/nontemporal-load.ll
Lines 327–328: I guess we would also use the Q-register LDNP variant here. I assume the reason we don't is that <17 x float> will get broken down into <4 x float> pieces during legalization. @dmgreen do you by any chance have any ideas on where best to improve this?
I don't think they ever get ignored by the hardware, but it is only a hint. I would be surprised if the extra instructions are better than a normal load, but it will depend on how much pressure there is on the cache at the time. I don't think there is a lot in it either way though, and if non-temporal loads are being used the CPU is more likely to have high memory usage with lower computation, meaning the extra instructions are less of an issue.
llvm/test/CodeGen/AArch64/nontemporal-load.ll
Lines 327–328: I'm not sure, I'm afraid. I believe that loads get split into legal parts (not in half like other operations). If we don't expect <17 x float> non-temporal loads very often, it may not be too important to fix. The loop vectorizer will always pick powers of 2, after all.
Running this locally causes many test failures for me. @zainja could you double-check that all AArch64 codegen tests pass with the patch applied to current main? Also, could you rebase the patch onto current main so the precommit tests run against the latest main?
The issue is related to:
setOperationAction(ISD::LOAD, MVT::vxixx, Custom);
Looking at the failing test arm64-neon-vector-shuffle-extract.ll, the stack trace indicates that the last function called was LowerLOAD. Before the patch, that function wasn't called for this case, which means the setOperationAction calls are the issue.
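For context, a sketch of what that kind of registration looks like if the 128-bit NEON vector types are listed out (the concrete type list here is an assumption, not copied from the patch):

// Illustrative only: marking these types Custom routes *every* load of them
// through AArch64TargetLowering::LowerLOAD, not just the non-temporal ones.
for (MVT VT : {MVT::v16i8, MVT::v8i16, MVT::v4i32, MVT::v2i64, MVT::v4f32,
               MVT::v2f64})
  setOperationAction(ISD::LOAD, VT, Custom);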
Address some of the failing tests. The problem we have here is that all 128-bit vector loads will go to the LowerLOAD function.
Most of the failing test cases were triggered by the following assert:
assert((VT == MVT::v4i16 || VT == MVT::v4i32) && "Expected v4i16 or v4i32");
To address this I added a block before it to return an empty SDValue node. Some tests still fail after this fix.
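The added block is presumably of roughly this shape (a sketch of what is described above, not the literal diff):

// Hypothetical early-out before the assert: loads we don't handle here are
// returned as an empty SDValue, leaving the original (already legal) load
// untouched.
if (VT != MVT::v4i16 && VT != MVT::v4i32)
  return SDValue();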
Fix patch for failing tests. The problem originated from 128-bit loads being handled incorrectly when they weren't non-temporal loads. I added two checks to address the issue:
First, if the load is of a floating-point type we preserve the behaviour provided by
setOperationPromotedToType(ISD::LOAD, VT, PromoteTo);
Otherwise we return an empty SDValue().
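A rough sketch of the shape of those two checks at the top of LowerLOAD (LowerAsPromotedLoad is a hypothetical helper standing in for the old promoted lowering; the exact conditions in the patch may differ):

LoadSDNode *LoadNode = cast<LoadSDNode>(Op);
if (!LoadNode->isNonTemporal()) {
  // 128-bit floating-point vectors keep the promoted (integer) lowering that
  // setOperationPromotedToType used to provide.
  if (Op.getValueType().isFloatingPoint())
    return LowerAsPromotedLoad(Op, DAG); // hypothetical helper for the old path
  // Everything else is already legal; an empty SDValue leaves the load as-is.
  return SDValue();
}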
This should be fine now; could you rebase this patch on top of D133421? Codegen of test_ldnp_v33f64 should be improved by this, right?
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
Line 802: Would be good to use the more descriptive comment from above, just for 128 bits.
Line 5445: nit: would be good to have a newline before this line.
Lines 20893–20896: unrelated change?
Thanks for getting rid of the excessive codegen regressions from the earlier version! IIUC we should now get at most one extra mov instruction to combine the two loaded values into a single vector. IMO this is a reasonable trade-off between the user's request and codegen quality.
@dmgreen WDYT? If there still are concerns about the extra mov instruction, we could make this opt-in with a target feature.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
Lines 5466–5467: nit: this needs formatting.
Hello. I've had reports that this is changing the lowering of unaligned vector loads when strict-align is present. https://godbolt.org/z/bT4cnvn4a.
Custom lowering is often more pain than it is worth. This doesn't seem like something that was intended, and it can lead to access violations if unaligned accesses are not enabled.
That is very interesting; I will look into what is triggering this issue.
OK, thanks. Reverting the patch in the meantime is probably the best idea; I should probably have done that yesterday. Things are pretty broken at the moment and the fix might take a little time.
I will revert it now.
The cause of the issue is that we specify a Custom lowering action for v2i64:
setOperationAction(ISD::LOAD, MVT::v2i64, Custom);
This prevents LegalizeLoadOps from creating the necessary expansion for unaligned loads.
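For what it's worth, a guard along these lines in the custom lowering would be one possible way to keep strict-align working (a sketch using the generic TargetLowering helpers, with LoadNode being the LoadSDNode under lowering; this is not the fix that was applied here — the patch was reverted instead):

// Sketch: if the target does not allow this access as-is (e.g. under
// strict-align with an under-aligned pointer), expand the load here instead
// of emitting LDNP.
if (!allowsMemoryAccess(*DAG.getContext(), DAG.getDataLayout(),
                        LoadNode->getMemoryVT(), *LoadNode->getMemOperand())) {
  auto [Result, Chain] = expandUnalignedLoad(LoadNode, DAG);
  return DAG.getMergeValues({Result, Chain}, SDLoc(Op));
}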
Yeah thanks - I was just running the tests to check the revert. It took a while, but looks like they passed in the end.