This is an archive of the discontinued LLVM Phabricator instance.

Differential D133421

[AArch64] break non-temporal loads over 256 into 256-loads and a smaller load
ClosedPublic

Authored by zjaffal on Sep 7 2022, 6:12 AM.

Details

Reviewers

fhahn
hiraditya
kristof.beyls
dmgreen
SjoerdMeijer
t.p.northover

Commits

rG2d3c260362a2: [AArch64] break non-temporal loads over 256 into 256-loads and a smaller load

Summary

Currently, non-temporal loads wider than 256 bits are split inefficiently. For example, v17i32 gets broken into two 128-bit loads. It is better if we can use
256-bit loads instead.
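
The splitting idea in the summary can be sketched outside of LLVM as plain arithmetic. The following is a minimal, standalone illustration and not the patch itself: `LoadPiece` and `planSplit` are hypothetical names introduced here, and the sketch only computes where each piece would start and how wide it would be.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One piece of the split: where it starts (in bytes) and how wide it is (in bits).
struct LoadPiece {
  uint64_t OffsetBytes;
  uint64_t WidthBits;
};

// Hypothetical helper (not part of LLVM): plan the split of a load of
// TotalBits into as many 256-bit pieces as fit, plus one smaller tail piece.
std::vector<LoadPiece> planSplit(uint64_t TotalBits) {
  std::vector<LoadPiece> Pieces;
  const uint64_t Num256 = TotalBits / 256;
  for (uint64_t I = 0; I < Num256; ++I)
    Pieces.push_back({I * 32, 256}); // 256 bits == 32 bytes per piece.
  const uint64_t Remaining = TotalBits % 256;
  if (Remaining != 0)
    Pieces.push_back({Num256 * 32, Remaining}); // Tail starts after the last full piece.
  return Pieces;
}
```

For a 544-bit vector such as v17i32, `planSplit(544)` yields two 256-bit pieces at byte offsets 0 and 32 and a 32-bit tail at byte offset 64.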

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

zjaffal created this revision. Sep 7 2022, 6:12 AM

Herald added a project: Restricted Project. Sep 7 2022, 6:12 AM

zjaffal requested review of this revision. Sep 7 2022, 6:12 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B185399: Diff 458436. Sep 7 2022, 6:37 AM

Thanks for sharing the patch. As is, it looks like there are multiple test failures that are caused by this patch and need addressing (some comments inline about correctness issues)

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
54	Is this needed?
73	Is this needed?
18089	move all variables that are used after the early exit down after the early exit.
18094	We should also only do this for non-temporal loads.
18098	Could you add a comment here illustrating what kind of DAG nodes we create to replace the original load? It would also be good to document the motivation for special handling for non-temporal loads here.
18105	I don't think that's correct, you are passing the offset in bits, but I think it should be in bytes. Looking at some of the test changes, the offsets of the loads are wrong, e.g. `ldnp q0, q2, [x0, #256]`.
18136	Do you need to create `NullVector` explicitly here? Could you just use `DAG.getUNDEF(NewVT)` here?
20132	This should be kept I think.
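
The bits-versus-bytes remark on line 18105 can be illustrated with a tiny standalone sketch. The helper names below are hypothetical and not LLVM API; they only show the unit conversion for the I-th 256-bit chunk.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, not LLVM API.
// Offset of the I-th 256-bit chunk expressed in bits.
constexpr uint64_t chunkOffsetInBits(uint64_t I) { return I * 256; }

// The same offset in bytes, which is the unit an address offset needs.
constexpr uint64_t chunkOffsetInBytes(uint64_t I) {
  return chunkOffsetInBits(I) / 8;
}
```

For the second chunk (`I == 1`) the bit offset is 256 while the byte offset is 32. Using the bit value as an address offset would be consistent with the `#256` seen in the flagged `ldnp`, assuming that load was the second chunk.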

This revision now requires changes to proceed. Sep 7 2022, 7:08 AM

zjaffal added inline comments. Sep 8 2022, 7:17 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18098	Would the motivation for the special handling for non-temporal loads be something like: We have 256-bit non-temporal load so it might be good to utilise them when having large load instructions?
18136	Yes I will do that

Fix the issues related to the scalable loads.

zjaffal marked 7 inline comments as done. Sep 8 2022, 8:15 AM

zjaffal added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18094	Checks are added now for non-temporal loads
18105	This should be fixed now

zjaffal marked 3 inline comments as done. Sep 8 2022, 8:17 AM

fhahn added inline comments. Sep 8 2022, 8:17 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18098	> We have 256-bit non-temporal load so it might be good to utilise them when having large load instructions?
	Yes, something like that, maybe: Try to break up non-temporal loads into blocks of 256 bits early, so the LDNPQ can be selected. Might be good to add this as a comment for the whole function.

zjaffal added inline comments. Sep 8 2022, 8:22 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18098	You can check the comments I added. I will add a comment for the whole function as well

Add comments for function

Harbormaster completed remote builds in B185638: Diff 458760. Sep 8 2022, 9:54 AM

Thanks for the update, this looks like a nice improvement and should avoid regressions for those cases with D132559. A few more mostly stylistic comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18082	It would be good to be more concrete about what large means here: loads of more than 256 bits will be split into blocks of 256 bits. nit: comments should be full sentences ending with a period (`.`)
18088	Thinking about it a bit more, I think we should also not handle `volatile` and `atomic` loads here to be safe. Could you add tests and the extra conditions for early exit here? I think it would be fine to just add the tests directly in the patch.
18094	`LD->isNonTemporal()` is already checked above
18099	nit: `BasePtr` would be more accurate.
18110	nit: used as unsigned, so maybe make this unsigned too?
18120	nit: comments should be full sentences ending with a period (`.`) `bits in the load operation` -> `bits of the load operation`? It might also help with readability if you add a newline before the comment.
18123	nit: comments should be full sentences ending with a period (.)
18136	MaskVector not used?
18137	`UndefVector` would be more accurate?
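
The early-exit conditions requested on line 18088 can be sketched as a standalone predicate. This is a simplified illustration under stated assumptions: `LoadInfo` and `shouldSplitInto256BitLoads` are hypothetical names, the atomic check follows the reviewer's suggestion, and the size checks are modeled on the ones visible in the diff for this revision.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified description of a load; not an LLVM type.
struct LoadInfo {
  uint64_t SizeInBits;       // Total width of the loaded vector.
  uint64_t ScalarSizeInBits; // Width of one element; assumed non-zero.
  bool IsScalableVector;
  bool IsVolatile;
  bool IsAtomic;
  bool IsNonTemporal;
};

// Returns true only when splitting into 256-bit pieces plus a tail applies.
bool shouldSplitInto256BitLoads(const LoadInfo &LD) {
  // Be conservative: leave volatile, atomic, and ordinary loads alone.
  if (LD.IsVolatile || LD.IsAtomic || !LD.IsNonTemporal)
    return false;
  // Skip scalable vectors, loads that already fit in 256 bits, exact
  // multiples of 256 bits, and element sizes that do not divide 256.
  if (LD.IsScalableVector || LD.SizeInBits <= 256 ||
      LD.SizeInBits % 256 == 0 || 256 % LD.ScalarSizeInBits != 0)
    return false;
  return true;
}
```

With this predicate a non-temporal 544-bit vector of 32-bit elements qualifies, while the same load marked volatile, or a 512-bit vector, does not.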

Add early exit for volatile loads and test cases to check for it.
Address some style issues.

Harbormaster completed remote builds in B185838: Diff 459056. Sep 9 2022, 8:04 AM

Removing checking if both operands are negative at the same time since it is handled by default.

It looks like you uploaded the wrong patch here?

Fix non-temporal clause

Re-add the volatile test.

t.p.northover added inline comments. Sep 9 2022, 8:55 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18132	Shouldn't the second operand of this call be `PtrOffset` again?
18139–18141	Is this (and the implementation generally) big-endian correct? I don't know the answer here, I can never remember what's supposed to happen. But someone should definitely try it on an `aarch64_be` target and at least eyeball the assembly to check the offsets and so on.
18153	The `Chain` here is the input chain, I think you need to `TokenFactor` all the loads' output chains together to make sure nothing gets reordered with them.
20211–20212	This looks like it takes precedence over `performLOADCombine` and disables `ldnp` formation if the TBI feature is enabled.

Harbormaster completed remote builds in B185859: Diff 459083. Sep 9 2022, 9:37 AM

zjaffal added inline comments. Sep 12 2022, 1:40 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20211–20212	Removing the checks and calling `performLOADCombine` caused many tests to fail. Maybe we can check if the load is non-temporal here?

zjaffal added inline comments. Sep 12 2022, 2:05 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18132	Yes, it should be. It is worth noting that the offsets in the generated assembly didn't change when I changed from using `MemVT.getSizeInBits()` to `PtrOffset`.
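
One possible explanation for that observation, assuming the call in question is the `commonAlignment` computation and the load has `align 8` as in the test case discussed in this review: the second operand only affects the alignment recorded for the new load, not its address, and both values happen to give the same alignment. The sketch below is a standalone model, not LLVM's implementation; `commonAlignmentModel` is a hypothetical name, and it assumes the result is the largest power of two dividing both the known alignment and the offset.

```cpp
#include <cassert>
#include <cstdint>

// Standalone model (an assumption about the behaviour, not LLVM code):
// the largest power of two that divides both AlignBytes and Offset.
// AlignBytes is assumed to be a non-zero power of two.
constexpr uint64_t commonAlignmentModel(uint64_t AlignBytes, uint64_t Offset) {
  const uint64_t Bits = AlignBytes | Offset;
  return Bits & (~Bits + 1); // Isolate the lowest set bit.
}
```

For a 544-bit vector with `align 8`, both `commonAlignmentModel(8, 544)` (the size in bits) and `commonAlignmentModel(8, 64)` (the tail's byte offset) evaluate to 8 in this model, so the generated code would look the same even though only the second form is meaningful.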

zjaffal added inline comments. Sep 12 2022, 2:19 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

18139–18141

This is the generated assembly for big-endian, using the following test case:

define <17 x float> @test_ldnp_v17f32(<17 x float>* %A) {
  %lv = load <17 x float>, <17 x float>* %A, align 8, !nontemporal !0
  ret <17 x float> %lv
}

!0 = !{i32 1}

test_ldnp_v17f32:                       // @test_ldnp_v17f32
	.cfi_startproc
// %bb.0:
	ldnp	q0, q1, [x0, #32]
	add	x9, x8, #48
	add	x10, x8, #32
	ldnp	q2, q3, [x0]
	add	x11, x8, #16
	ldr	s4, [x0, #64]
	st1	{ v2.4s }, [x8]
	st1	{ v1.4s }, [x9]
	st1	{ v0.4s }, [x10]
	st1	{ v3.4s }, [x11]
	str	s4, [x8, #64]
	ret

Looking at https://godbolt.org, I think there were more load instructions before the loads were broken up this way.

t.p.northover added inline comments. Sep 12 2022, 3:08 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18139–18141	It's coming back to me now, and I think that output is incorrect. Basically, the `ld1` and `st1` instructions would load each element big-endian but preserve the order (i.e. `v0[0]` is still loaded from `&ptr[0]` not `&ptr[3]`). On the other hand `ldr` and `ldnp` load the whole vector-register in big-endian order which loads `v0[0]` from `&ptr[3]`. This output mixes the two so it'll rearrange elements. But I think the real bug is in https://reviews.llvm.org/rG7155ed4289 that you committed earlier, or at least that needs fixing before any bug here is obvious. A proper fix would be to put an `AArch64ISD::REV<n>` (where `<n>` is the element size) after each `ldnp` to restore what an `ld1` would have done, though I'd be OK with disabling the optimization for big-endian instead. I care if we break it, I'm less bothered if it isn't as fast as little-endian.
20211–20212	The check is needed because otherwise we assume TBI even when we shouldn't, but it probably shouldn't cause an early return. Both optimizations should get the opportunity to run if they might be applicable.
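
The element-order problem described for lines 18139–18141 can be pictured with a toy lane model. This is a deliberately simplified sketch that follows the reviewer's description only: it tracks which memory element ends up in which lane and ignores byte order within an element. All function names are hypothetical, and `reverseLanes` merely stands in for the suggested fix-up rather than modeling a specific instruction.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

using Lanes = std::array<uint32_t, 4>;

// Toy model of an element-wise load (ld1-like): lane i comes from Ptr[i].
Lanes loadElementwise(const uint32_t *Ptr) {
  return {Ptr[0], Ptr[1], Ptr[2], Ptr[3]};
}

// Toy model of a whole-register big-endian load (ldr/ldnp-like, per the
// description above): lane 0 comes from Ptr[3].
Lanes loadWholeRegisterBE(const uint32_t *Ptr) {
  return {Ptr[3], Ptr[2], Ptr[1], Ptr[0]};
}

// Reverse the lane order, standing in for the fix-up suggested above.
Lanes reverseLanes(Lanes V) {
  std::reverse(V.begin(), V.end());
  return V;
}
```

In this model, loading with `loadWholeRegisterBE` and storing as if the value came from `loadElementwise` rearranges the elements, while applying `reverseLanes` after the whole-register load gives back the element-wise order.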

Change Align to use PtrOffset
Make sure that performTBISimplification optimisation can run
Add TokenFactor node

Harbormaster completed remote builds in B186120: Diff 459418. Sep 12 2022, 4:50 AM

zjaffal marked 15 inline comments as done. Sep 12 2022, 8:47 AM

rebase on top of main

Harbormaster completed remote builds in B188194: Diff 462202. Sep 22 2022, 9:25 AM

LGTM, thanks for the latest changes. Some small stylistic comments inline.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18082	nit: I think this is easier to read if you keep `nontemporal loads` together. Also, period at end of sentence.
18108	nit: utilise -> utilize (American spelling)
18125	nit: add newline after here, to separate blocks.
18127	nit: UNDEF vector instead of `null value vector`?
18149	nit: use `ConcatVT` to match other names here.
18163	Add a newline before the new function.
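
The UNDEF vector and `ConcatVT` mentioned in these comments belong to the tail handling: the short remaining load is widened to the chunk width, concatenated with the full chunks, and the original width is extracted again. The following standalone sketch models that data flow on `std::vector`; it is an illustration only, `reassemble` is a hypothetical name, and a zero placeholder stands in for undefined lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: rebuild NumElts elements from full chunks of
// EltsPerChunk elements plus a shorter tail chunk.
// Assumes Tail.size() <= EltsPerChunk and NumElts fits in the concatenation.
std::vector<int> reassemble(const std::vector<std::vector<int>> &FullChunks,
                            const std::vector<int> &Tail,
                            std::size_t EltsPerChunk, std::size_t NumElts) {
  // Widen the tail to a full chunk; the padding stands in for UNDEF lanes.
  std::vector<int> PaddedTail(EltsPerChunk, 0);
  for (std::size_t I = 0; I < Tail.size(); ++I)
    PaddedTail[I] = Tail[I]; // Like INSERT_SUBVECTOR at index 0.

  // Like CONCAT_VECTORS: every operand now has the same width.
  std::vector<int> Concat;
  for (const std::vector<int> &Chunk : FullChunks)
    Concat.insert(Concat.end(), Chunk.begin(), Chunk.end());
  Concat.insert(Concat.end(), PaddedTail.begin(), PaddedTail.end());

  // Like EXTRACT_SUBVECTOR at index 0: keep only the original elements.
  Concat.resize(NumElts);
  return Concat;
}
```

For a 17-element vector with 8 elements per chunk, two full chunks and a one-element tail are concatenated into 24 lanes and then trimmed back to 17.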

This revision is now accepted and ready to land. Sep 26 2022, 5:45 AM

fhahn mentioned this in D132559: [AArch64] Add support for 128-bit non temporal loads. Sep 26 2022, 5:51 AM

Address stylistic comments

Harbormaster completed remote builds in B188708: Diff 462908. Sep 26 2022, 9:01 AM

Matt added a subscriber: Matt. Sep 26 2022, 9:22 AM

Thanks for the update. 2 more small comments, but I'll fix them before committing the change on your behalf.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18082	Looks like there's another typo: `nontermporal -> nontemporal`
18125	The newline was meant to go after the `}` to separate it from the next block, not before it.

This revision was landed with ongoing or failed builds. Sep 28 2022, 7:21 AM

Closed by commit rG2d3c260362a2: [AArch64] break non-temporal loads over 256 into 256-loads and a smaller load (authored by fhahn).

This revision was automatically updated to reflect the committed changes.

fhahn added a commit: rG2d3c260362a2: [AArch64] break non-temporal loads over 256 into 256-loads and a smaller load.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	88 lines
llvm/test/CodeGen/AArch64/nontemporal-load.ll	52 lines

Diff 463547

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

[45 lines not shown]
#include "llvm/CodeGen/MachineRegisterInfo.h"		#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/RuntimeLibcalls.h"		#include "llvm/CodeGen/RuntimeLibcalls.h"
#include "llvm/CodeGen/SelectionDAG.h"		#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"		#include "llvm/CodeGen/SelectionDAGNodes.h"
#include "llvm/CodeGen/TargetCallingConv.h"		#include "llvm/CodeGen/TargetCallingConv.h"
#include "llvm/CodeGen/TargetInstrInfo.h"		#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/ValueTypes.h"		#include "llvm/CodeGen/ValueTypes.h"
#include "llvm/IR/Attributes.h"		#include "llvm/IR/Attributes.h"
#include "llvm/IR/Constants.h"		#include "llvm/IR/Constants.h"
#include "llvm/IR/DataLayout.h"		#include "llvm/IR/DataLayout.h"
#include "llvm/IR/DebugLoc.h"		#include "llvm/IR/DebugLoc.h"
#include "llvm/IR/DerivedTypes.h"		#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"		#include "llvm/IR/Function.h"
#include "llvm/IR/GetElementPtrTypeIterator.h"		#include "llvm/IR/GetElementPtrTypeIterator.h"
#include "llvm/IR/GlobalValue.h"		#include "llvm/IR/GlobalValue.h"
#include "llvm/IR/IRBuilder.h"		#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instruction.h"		#include "llvm/IR/Instruction.h"
#include "llvm/IR/Instructions.h"		#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"		#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Intrinsics.h"		#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/IntrinsicsAArch64.h"		#include "llvm/IR/IntrinsicsAArch64.h"
#include "llvm/IR/Module.h"		#include "llvm/IR/Module.h"
#include "llvm/IR/OperandTraits.h"		#include "llvm/IR/OperandTraits.h"
#include "llvm/IR/PatternMatch.h"		#include "llvm/IR/PatternMatch.h"
#include "llvm/IR/Type.h"		#include "llvm/IR/Type.h"
#include "llvm/IR/Use.h"		#include "llvm/IR/Use.h"
#include "llvm/IR/Value.h"		#include "llvm/IR/Value.h"
#include "llvm/MC/MCRegisterInfo.h"		#include "llvm/MC/MCRegisterInfo.h"
#include "llvm/Support/Casting.h"		#include "llvm/Support/Casting.h"
#include "llvm/Support/CodeGen.h"		#include "llvm/Support/CodeGen.h"
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Compiler.h"		#include "llvm/Support/Compiler.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Support/ErrorHandling.h"		#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/InstructionCost.h"		#include "llvm/Support/InstructionCost.h"
#include "llvm/Support/KnownBits.h"		#include "llvm/Support/KnownBits.h"
[812 lines not shown; context: #undef LCALLNAME5]
setTargetDAGCombine(ISD::SETCC);		setTargetDAGCombine(ISD::SETCC);

setTargetDAGCombine(ISD::INTRINSIC_WO_CHAIN);		setTargetDAGCombine(ISD::INTRINSIC_WO_CHAIN);

setTargetDAGCombine({ISD::ANY_EXTEND, ISD::ZERO_EXTEND, ISD::SIGN_EXTEND,		setTargetDAGCombine({ISD::ANY_EXTEND, ISD::ZERO_EXTEND, ISD::SIGN_EXTEND,
ISD::VECTOR_SPLICE, ISD::SIGN_EXTEND_INREG,		ISD::VECTOR_SPLICE, ISD::SIGN_EXTEND_INREG,
ISD::CONCAT_VECTORS, ISD::EXTRACT_SUBVECTOR,		ISD::CONCAT_VECTORS, ISD::EXTRACT_SUBVECTOR,
ISD::INSERT_SUBVECTOR, ISD::STORE, ISD::BUILD_VECTOR});		ISD::INSERT_SUBVECTOR, ISD::STORE, ISD::BUILD_VECTOR});
if (Subtarget->supportsAddressTopByteIgnored())
setTargetDAGCombine(ISD::LOAD);		setTargetDAGCombine(ISD::LOAD);

setTargetDAGCombine(ISD::MSTORE);		setTargetDAGCombine(ISD::MSTORE);

setTargetDAGCombine(ISD::MUL);		setTargetDAGCombine(ISD::MUL);

setTargetDAGCombine({ISD::SELECT, ISD::VSELECT});		setTargetDAGCombine({ISD::SELECT, ISD::VSELECT});

setTargetDAGCombine({ISD::INTRINSIC_VOID, ISD::INTRINSIC_W_CHAIN,		setTargetDAGCombine({ISD::INTRINSIC_VOID, ISD::INTRINSIC_W_CHAIN,
[17,163 lines not shown; context: if (Store->getMemoryVT() != Orig.getValueType())]
return SDValue();		return SDValue();
return DAG.getStore(Store->getChain(), SDLoc(Store), Orig,		return DAG.getStore(Store->getChain(), SDLoc(Store), Orig,
Store->getBasePtr(), Store->getMemOperand());		Store->getBasePtr(), Store->getMemOperand());
}		}

return SDValue();		return SDValue();
}		}

		// Perform TBI simplification if supported by the target and try to break up nontemporal loads larger than 256-bits loads for odd types so LDNPQ 256-bit load instructions can be selected.
		static SDValue performLOADCombine(SDNode *N,
		TargetLowering::DAGCombinerInfo &DCI,
		SelectionDAG &DAG,
		const AArch64Subtarget *Subtarget) {
		if (Subtarget->supportsAddressTopByteIgnored())
		performTBISimplification(N->getOperand(1), DCI, DAG);

		LoadSDNode *LD = cast<LoadSDNode>(N);
		EVT MemVT = LD->getMemoryVT();
		if (LD->isVolatile() || !LD->isNonTemporal() || !Subtarget->isLittleEndian())
		return SDValue(N, 0);

		if (MemVT.isScalableVector() || MemVT.getSizeInBits() <= 256 ||
		MemVT.getSizeInBits() % 256 == 0 ||
		256 % MemVT.getScalarSizeInBits() != 0)
		return SDValue(N, 0);

		SDLoc DL(LD);
		SDValue Chain = LD->getChain();
		SDValue BasePtr = LD->getBasePtr();
		SDNodeFlags Flags = LD->getFlags();
		SmallVector<SDValue, 4> LoadOps;
		SmallVector<SDValue, 4> LoadOpsChain;
		// Replace any non temporal load over 256-bit with a series of 256 bit loads
		// and a scalar/vector load less than 256. This way we can utilize 256-bit
		// loads and reduce the amount of load instructions generated.
		MVT NewVT =
		MVT::getVectorVT(MemVT.getVectorElementType().getSimpleVT(),
		256 / MemVT.getVectorElementType().getSizeInBits());
		unsigned Num256Loads = MemVT.getSizeInBits() / 256;
		// Create all 256-bit loads starting from offset 0 and up to Num256Loads-1*32.
		for (unsigned I = 0; I < Num256Loads; I++) {
		unsigned PtrOffset = I * 32;
		SDValue NewPtr = DAG.getMemBasePlusOffset(
		BasePtr, TypeSize::Fixed(PtrOffset), DL, Flags);
		Align NewAlign = commonAlignment(LD->getAlign(), PtrOffset);
		SDValue NewLoad = DAG.getLoad(
		NewVT, DL, Chain, NewPtr, LD->getPointerInfo().getWithOffset(PtrOffset),
		NewAlign, LD->getMemOperand()->getFlags(), LD->getAAInfo());
		LoadOps.push_back(NewLoad);
		LoadOpsChain.push_back(SDValue(cast<SDNode>(NewLoad), 1));
		}

		// Process remaining bits of the load operation.
		// This is done by creating an UNDEF vector to match the size of the
		// 256-bit loads and inserting the remaining load to it. We extract the
		// original load type at the end using EXTRACT_SUBVECTOR instruction.
		unsigned BitsRemaining = MemVT.getSizeInBits() % 256;
		unsigned PtrOffset = (MemVT.getSizeInBits() - BitsRemaining) / 8;
		MVT RemainingVT = MVT::getVectorVT(
		MemVT.getVectorElementType().getSimpleVT(),
		BitsRemaining / MemVT.getVectorElementType().getSizeInBits());
		SDValue NewPtr =
		DAG.getMemBasePlusOffset(BasePtr, TypeSize::Fixed(PtrOffset), DL, Flags);
		Align NewAlign = commonAlignment(LD->getAlign(), PtrOffset);
		SDValue RemainingLoad =
		DAG.getLoad(RemainingVT, DL, Chain, NewPtr,
		LD->getPointerInfo().getWithOffset(PtrOffset), NewAlign,
		LD->getMemOperand()->getFlags(), LD->getAAInfo());
		SDValue UndefVector = DAG.getUNDEF(NewVT);
		SDValue InsertIdx = DAG.getVectorIdxConstant(0, DL);
		SDValue ExtendedReminingLoad =
		DAG.getNode(ISD::INSERT_SUBVECTOR, DL, NewVT,
		{UndefVector, RemainingLoad, InsertIdx});
		LoadOps.push_back(ExtendedReminingLoad);
		LoadOpsChain.push_back(SDValue(cast<SDNode>(RemainingLoad), 1));
		EVT ConcatVT =
		EVT::getVectorVT(*DAG.getContext(), MemVT.getScalarType(),
		LoadOps.size() * NewVT.getVectorNumElements());
		SDValue ConcatVectors =
		DAG.getNode(ISD::CONCAT_VECTORS, DL, ConcatVT, LoadOps);
		// Extract the original vector type size.
		SDValue ExtractSubVector =
		DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MemVT,
		{ConcatVectors, DAG.getVectorIdxConstant(0, DL)});
		SDValue TokenFactor =
		DAG.getNode(ISD::TokenFactor, DL, MVT::Other, LoadOpsChain);
		return DAG.getMergeValues({ExtractSubVector, TokenFactor}, DL);
		}

static SDValue performSTORECombine(SDNode *N,		static SDValue performSTORECombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI,		TargetLowering::DAGCombinerInfo &DCI,
SelectionDAG &DAG,		SelectionDAG &DAG,
const AArch64Subtarget *Subtarget) {		const AArch64Subtarget *Subtarget) {
StoreSDNode *ST = cast<StoreSDNode>(N);		StoreSDNode *ST = cast<StoreSDNode>(N);
SDValue Chain = ST->getChain();		SDValue Chain = ST->getChain();
SDValue Value = ST->getValue();		SDValue Value = ST->getValue();
SDValue Ptr = ST->getBasePtr();		SDValue Ptr = ST->getBasePtr();

[2,031 lines not shown; context: SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N,]
case ISD::INSERT_SUBVECTOR:		case ISD::INSERT_SUBVECTOR:
return performInsertSubvectorCombine(N, DCI, DAG);		return performInsertSubvectorCombine(N, DCI, DAG);
case ISD::SELECT:		case ISD::SELECT:
return performSelectCombine(N, DCI);		return performSelectCombine(N, DCI);
case ISD::VSELECT:		case ISD::VSELECT:
return performVSelectCombine(N, DCI.DAG);		return performVSelectCombine(N, DCI.DAG);
case ISD::SETCC:		case ISD::SETCC:
return performSETCCCombine(N, DCI, DAG);		return performSETCCCombine(N, DCI, DAG);
case ISD::LOAD:		case ISD::LOAD:
if (performTBISimplification(N->getOperand(1), DCI, DAG))		return performLOADCombine(N, DCI, DAG, Subtarget);
return SDValue(N, 0);
break;
case ISD::STORE:		case ISD::STORE:
return performSTORECombine(N, DCI, DAG, Subtarget);		return performSTORECombine(N, DCI, DAG, Subtarget);
case ISD::MSTORE:		case ISD::MSTORE:
return performMSTORECombine(N, DCI, DAG, Subtarget);		return performMSTORECombine(N, DCI, DAG, Subtarget);
case ISD::MGATHER:		case ISD::MGATHER:
case ISD::MSCATTER:		case ISD::MSCATTER:
return performMaskedGatherScatterCombine(N, DCI, DAG);		return performMaskedGatherScatterCombine(N, DCI, DAG);
case ISD::VECTOR_SPLICE:		case ISD::VECTOR_SPLICE:
[2,371 lines not shown]

llvm/test/CodeGen/AArch64/nontemporal-load.ll

	[314 lines not shown]
	; CHECK-BE-NEXT: ret			; CHECK-BE-NEXT: ret
	%lv = load <16 x float>, <16 x float>* %A, align 8, !nontemporal !0			%lv = load <16 x float>, <16 x float>* %A, align 8, !nontemporal !0
	ret <16 x float> %lv			ret <16 x float> %lv
	}			}

	define <17 x float> @test_ldnp_v17f32(<17 x float>* %A) {			define <17 x float> @test_ldnp_v17f32(<17 x float>* %A) {
	; CHECK-LABEL: test_ldnp_v17f32:			; CHECK-LABEL: test_ldnp_v17f32:
	; CHECK: ; %bb.0:			; CHECK: ; %bb.0:
	; CHECK-NEXT: ldp q1, q2, [x0, #32]			; CHECK-NEXT: ldnp q0, q1, [x0, #32]
	; CHECK-NEXT: ldp q3, q4, [x0]			; CHECK-NEXT: ldnp q2, q3, [x0]
	; CHECK-NEXT: ldr s0, [x0, #64]			; CHECK-NEXT: ldr s4, [x0, #64]
	; CHECK-NEXT: stp q3, q4, [x8]			; CHECK-NEXT: stp q0, q1, [x8, #32]
	; CHECK-NEXT: stp q1, q2, [x8, #32]			; CHECK-NEXT: stp q2, q3, [x8]
	; CHECK-NEXT: str s0, [x8, #64]			; CHECK-NEXT: str s4, [x8, #64]
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	;			;
	; CHECK-BE-LABEL: test_ldnp_v17f32:			; CHECK-BE-LABEL: test_ldnp_v17f32:
	; CHECK-BE: // %bb.0:			; CHECK-BE: // %bb.0:
	; CHECK-BE-NEXT: add x9, x0, #32			; CHECK-BE-NEXT: add x9, x0, #32
	; CHECK-BE-NEXT: ld1 { v1.4s }, [x0]			; CHECK-BE-NEXT: ld1 { v1.4s }, [x0]
	; CHECK-BE-NEXT: add x10, x0, #16			; CHECK-BE-NEXT: add x10, x0, #16
	; CHECK-BE-NEXT: ldr s2, [x0, #64]			; CHECK-BE-NEXT: ldr s2, [x0, #64]
	[12 lines not shown]
	; CHECK-BE-NEXT: ret			; CHECK-BE-NEXT: ret
	%lv = load <17 x float>, <17 x float>* %A, align 8, !nontemporal !0			%lv = load <17 x float>, <17 x float>* %A, align 8, !nontemporal !0
	ret <17 x float> %lv			ret <17 x float> %lv
	}			}

	define <33 x double> @test_ldnp_v33f64(<33 x double>* %A) {			define <33 x double> @test_ldnp_v33f64(<33 x double>* %A) {
	; CHECK-LABEL: test_ldnp_v33f64:			; CHECK-LABEL: test_ldnp_v33f64:
	; CHECK: ; %bb.0:			; CHECK: ; %bb.0:
	; CHECK-NEXT: ldp q0, q1, [x0]			; CHECK-NEXT: ldnp q0, q1, [x0]
	; CHECK-NEXT: ldp q2, q3, [x0, #32]			; CHECK-NEXT: ldnp q2, q3, [x0, #32]
	; CHECK-NEXT: ldp q4, q5, [x0, #64]			; CHECK-NEXT: ldnp q4, q5, [x0, #64]
	; CHECK-NEXT: ldp q6, q7, [x0, #96]			; CHECK-NEXT: ldnp q6, q7, [x0, #96]
	; CHECK-NEXT: ldp q16, q17, [x0, #128]			; CHECK-NEXT: ldnp q16, q17, [x0, #128]
	; CHECK-NEXT: ldp q18, q19, [x0, #160]			; CHECK-NEXT: ldnp q18, q19, [x0, #224]
	; CHECK-NEXT: ldp q21, q22, [x0, #224]			; CHECK-NEXT: ldnp q20, q21, [x0, #192]
	; CHECK-NEXT: ldp q23, q24, [x0, #192]			; CHECK-NEXT: ldnp q22, q23, [x0, #160]
	; CHECK-NEXT: ldr d20, [x0, #256]			; CHECK-NEXT: ldr d24, [x0, #256]
	; CHECK-NEXT: stp q0, q1, [x8]			; CHECK-NEXT: stp q0, q1, [x8]
	; CHECK-NEXT: stp q2, q3, [x8, #32]			; CHECK-NEXT: stp q2, q3, [x8, #32]
	; CHECK-NEXT: stp q4, q5, [x8, #64]			; CHECK-NEXT: stp q4, q5, [x8, #64]
	; CHECK-NEXT: str d20, [x8, #256]
	; CHECK-NEXT: stp q6, q7, [x8, #96]			; CHECK-NEXT: stp q6, q7, [x8, #96]
	; CHECK-NEXT: stp q16, q17, [x8, #128]			; CHECK-NEXT: stp q16, q17, [x8, #128]
	; CHECK-NEXT: stp q18, q19, [x8, #160]			; CHECK-NEXT: stp q22, q23, [x8, #160]
	; CHECK-NEXT: stp q23, q24, [x8, #192]			; CHECK-NEXT: stp q20, q21, [x8, #192]
	; CHECK-NEXT: stp q21, q22, [x8, #224]			; CHECK-NEXT: stp q18, q19, [x8, #224]
				; CHECK-NEXT: str d24, [x8, #256]
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	;			;
	; CHECK-BE-LABEL: test_ldnp_v33f64:			; CHECK-BE-LABEL: test_ldnp_v33f64:
	; CHECK-BE: // %bb.0:			; CHECK-BE: // %bb.0:
	; CHECK-BE-NEXT: add x9, x0, #16			; CHECK-BE-NEXT: add x9, x0, #16
	; CHECK-BE-NEXT: add x10, x0, #32			; CHECK-BE-NEXT: add x10, x0, #32
	; CHECK-BE-NEXT: ld1 { v21.2d }, [x0]			; CHECK-BE-NEXT: ld1 { v21.2d }, [x0]
	; CHECK-BE-NEXT: add x11, x8, #208			; CHECK-BE-NEXT: add x11, x8, #208
	[60 lines not shown]
	; CHECK-BE-NEXT: ret			; CHECK-BE-NEXT: ret
	%lv = load <33 x double>, <33 x double>* %A, align 8, !nontemporal !0			%lv = load <33 x double>, <33 x double>* %A, align 8, !nontemporal !0
	ret <33 x double> %lv			ret <33 x double> %lv
	}			}

	define <33 x i8> @test_ldnp_v33i8(<33 x i8>* %A) {			define <33 x i8> @test_ldnp_v33i8(<33 x i8>* %A) {
	; CHECK-LABEL: test_ldnp_v33i8:			; CHECK-LABEL: test_ldnp_v33i8:
	; CHECK: ; %bb.0:			; CHECK: ; %bb.0:
	; CHECK-NEXT: ldp q1, q0, [x0]			; CHECK-NEXT: ldnp q0, q1, [x0]
	; CHECK-NEXT: ldrb w9, [x0, #32]			; CHECK-NEXT: add x9, x8, #32
	; CHECK-NEXT: stp q1, q0, [x8]			; CHECK-NEXT: ldr b2, [x0, #32]
	; CHECK-NEXT: strb w9, [x8, #32]			; CHECK-NEXT: stp q0, q1, [x8]
				; CHECK-NEXT: st1.b { v2 }[0], [x9]
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	;			;
	; CHECK-BE-LABEL: test_ldnp_v33i8:			; CHECK-BE-LABEL: test_ldnp_v33i8:
	; CHECK-BE: // %bb.0:			; CHECK-BE: // %bb.0:
	; CHECK-BE-NEXT: add x9, x0, #16			; CHECK-BE-NEXT: add x9, x0, #16
	; CHECK-BE-NEXT: ld1 { v0.16b }, [x0]			; CHECK-BE-NEXT: ld1 { v0.16b }, [x0]
	; CHECK-BE-NEXT: add x10, x8, #16			; CHECK-BE-NEXT: add x10, x8, #16
	; CHECK-BE-NEXT: ld1 { v1.16b }, [x9]			; CHECK-BE-NEXT: ld1 { v1.16b }, [x9]
	[88 lines not shown]
	; CHECK-BE-NEXT: ret			; CHECK-BE-NEXT: ret
	%lv = load <4 x i63>, <4 x i63>* %A, align 8, !nontemporal !0			%lv = load <4 x i63>, <4 x i63>* %A, align 8, !nontemporal !0
	ret <4 x i63> %lv			ret <4 x i63> %lv
	}			}

	define <5 x double> @test_ldnp_v5f64(<5 x double>* %A) {			define <5 x double> @test_ldnp_v5f64(<5 x double>* %A) {
	; CHECK-LABEL: test_ldnp_v5f64:			; CHECK-LABEL: test_ldnp_v5f64:
	; CHECK: ; %bb.0:			; CHECK: ; %bb.0:
	; CHECK-NEXT: ldp q0, q2, [x0]			; CHECK-NEXT: ldnp q0, q2, [x0]
				; CHECK-NEXT: ldr d4, [x0, #32]
	; CHECK-NEXT: ext.16b v1, v0, v0, #8			; CHECK-NEXT: ext.16b v1, v0, v0, #8
	; CHECK-NEXT: ; kill: def $d0 killed $d0 killed $q0			; CHECK-NEXT: ; kill: def $d0 killed $d0 killed $q0
	; CHECK-NEXT: ; kill: def $d1 killed $d1 killed $q1			; CHECK-NEXT: ; kill: def $d1 killed $d1 killed $q1
	; CHECK-NEXT: ext.16b v3, v2, v2, #8			; CHECK-NEXT: ext.16b v3, v2, v2, #8
	; CHECK-NEXT: ldr d4, [x0, #32]
	; CHECK-NEXT: ; kill: def $d2 killed $d2 killed $q2			; CHECK-NEXT: ; kill: def $d2 killed $d2 killed $q2
	; CHECK-NEXT: ; kill: def $d3 killed $d3 killed $q3			; CHECK-NEXT: ; kill: def $d3 killed $d3 killed $q3
	; CHECK-NEXT: ; kill: def $d4 killed $d4 killed $q4
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	;			;
	; CHECK-BE-LABEL: test_ldnp_v5f64:			; CHECK-BE-LABEL: test_ldnp_v5f64:
	; CHECK-BE: // %bb.0:			; CHECK-BE: // %bb.0:
	; CHECK-BE-NEXT: add x8, x0, #16			; CHECK-BE-NEXT: add x8, x0, #16
	; CHECK-BE-NEXT: ld1 { v0.2d }, [x0]			; CHECK-BE-NEXT: ld1 { v0.2d }, [x0]
	; CHECK-BE-NEXT: ldr d4, [x0, #32]			; CHECK-BE-NEXT: ldr d4, [x0, #32]
	; CHECK-BE-NEXT: ld1 { v2.2d }, [x8]			; CHECK-BE-NEXT: ld1 { v2.2d }, [x8]
	[77 lines not shown]
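
The little-endian CHECK lines in these tests follow a simple splitting rule: as many 256-bit pieces as fit, then one smaller load for the remainder, with every address offset counted in bytes rather than bits. A minimal sketch of that arithmetic, assuming a hypothetical helper `splitNonTemporalLoad` and struct `LoadChunk` (neither exists in LLVM or in this patch) and assuming the total width is a whole number of bytes:

```cpp
#include <cstdint>
#include <vector>

// One piece of a split load: where it starts and how wide it is.
struct LoadChunk {
  uint64_t OffsetBytes; // byte offset from the base pointer
  uint64_t SizeBits;    // width of this piece in bits
};

// Hypothetical helper: split a load of TotalBits into 256-bit pieces
// followed by at most one smaller remainder piece.
std::vector<LoadChunk> splitNonTemporalLoad(uint64_t TotalBits) {
  constexpr uint64_t ChunkBits = 256;
  constexpr uint64_t ChunkBytes = ChunkBits / 8; // offsets advance by 32 bytes
  std::vector<LoadChunk> Chunks;
  const uint64_t NumFull = TotalBits / ChunkBits;
  for (uint64_t I = 0; I < NumFull; ++I)
    Chunks.push_back({I * ChunkBytes, ChunkBits});
  // Whatever does not fill a whole 256-bit piece becomes the last load.
  if (uint64_t RemBits = TotalBits % ChunkBits)
    Chunks.push_back({NumFull * ChunkBytes, RemBits});
  return Chunks;
}
```

For `<17 x float>` (544 bits) this gives 256-bit pieces at byte offsets 0 and 32 plus a 32-bit remainder at offset 64, which lines up with the two `ldnp` instructions and the `ldr s4, [x0, #64]` in `test_ldnp_v17f32`. The sketch ignores type legalization and instruction ordering, which the real lowering also has to handle.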

Diff 463547
