This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Convert gather/scatter with a stride of 2 to contiguous loads/stores
Changes Planned · Public

Authored by kmclaughlin on Mar 3 2022, 8:11 AM.

Details

Summary

This patch extends performMaskedGatherScatterCombine to find gathers
& scatters with a stride of two in their indices, which can be converted
to a pair of contiguous loads or stores with zips & uzps and the
appropriate predicates.

There were no performance improvements found when using this combine for scatter
stores of 64-bit data, so we simply return SDValue() in that case.
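
For illustration only, a minimal sketch of how a stride-2 index might be recognised in such a combine (the helper name and structure are assumptions, not the patch itself):

// Hypothetical helper: returns true if Index is a step_vector with a step of
// 2, i.e. it produces the offsets { 0, 2, 4, 6, ... } of a stride-2 access.
static bool isStrideTwoIndex(SDValue Index) {
  if (Index.getOpcode() != ISD::STEP_VECTOR)
    return false;
  auto *Step = dyn_cast<ConstantSDNode>(Index.getOperand(0));
  return Step && Step->getSExtValue() == 2;
}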

Diff Detail

Event Timeline

kmclaughlin created this revision. · Mar 3 2022, 8:11 AM
kmclaughlin requested review of this revision. · Mar 3 2022, 8:11 AM
Herald added a project: Restricted Project. · Mar 3 2022, 8:11 AM

Hi @kmclaughlin, this looks like a nice improvement! I've not reviewed all of it so far, but I'll leave a few comments in the bits I have reviewed.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16934

Perhaps worth adding a comment here explaining that the following code will attempt to transform stride=2 gathers/scatters into contiguous loads/stores?

16935

It might look nicer if you can avoid unnecessary indentation here by rewriting this to return early, i.e.

if (Index->getOpcode() != ISD::STEP_VECTOR ||
    Mask->getOpcode() == ISD::EXTRACT_SUBVECTOR ||
    IndexType != ISD::SIGNED_SCALED)
  return SDValue();

... rest of code with less indentation ...
16940

I think you can just write if (Step != 2) here.

16944

Are you assuming this is a legal/simple type here? For example, the type could be <vscale x 12 x i32>, which will be an extended type and getSimpleVT will assert. Could you add a test case for this? I think you can avoid this either by only doing this after legalisation or by rewriting as a sequence of if-else statements, i.e.

if (MemVT == MVT::nxv2i64 || MemVT == MVT::nxv2f64
...
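
For illustration, one possible early-return form of that check (the exact set of supported types here is an assumption):

// Hypothetical guard: bail out unless MemVT is one of the legal SVE types the
// combine handles, which avoids calling getSimpleVT() on extended types such
// as <vscale x 12 x i32>.
if (MemVT != MVT::nxv2i64 && MemVT != MVT::nxv2f64 &&
    MemVT != MVT::nxv4i32 && MemVT != MVT::nxv4f32)
  return SDValue();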

Did you try using ld2/st2? I guess the problem is that it accesses too many bytes?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16936

Why the restriction on the mask opcode?

16954

I guess you're not handling i16/i8 because you'd have to worry about extension?

If possible can you move the performMaskedGatherScatterCombine changes into their own function, something like tryCombineToMaskedLoadStore?

kmclaughlin marked 5 inline comments as done.
  • Moved the changes to performMaskedGatherScatterCombine into a new function, tryCombineToMaskedLoadStore.
  • Removed the switch statement and use of getSimpleVT(), using if-else statements which check MemVT for supported types instead.
  • Removed the restriction that the mask opcode of the gather/scatter is not an extract_subvector.

Did you try using ld2/st2? I guess the problem is that it accesses too many bytes?

Hi @efriedma, I did investigate whether I could use ld2/st2 instead and found that the performance was very similar to contiguous loads & stores. The problem with ld2/st2 is that, as you suggested, we could end up accessing too many bytes. For that reason we instead use contiguous loads/stores and always interleave the predicate with pfalse, to ensure that we don't attempt to access beyond the last element which the gather/scatter would have accessed.
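
For illustration only, a minimal sketch of that predicate interleaving in DAG terms (variable names such as Mask, MaskVT, DL and DAG are assumptions, not the patch's actual code):

// Hypothetical sketch: interleave the original mask with an all-false
// predicate so the two contiguous accesses never touch lanes the
// gather/scatter would not have accessed.
SDValue PFalse = DAG.getConstant(0, DL, MaskVT);              // all-false predicate
SDValue LoPred = DAG.getNode(AArch64ISD::ZIP1, DL, MaskVT, Mask, PFalse);
SDValue HiPred = DAG.getNode(AArch64ISD::ZIP2, DL, MaskVT, Mask, PFalse);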

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16936

This restriction was added downstream as a workaround for an issue which I have not been able to reproduce. I'm removing it as I believe it is no longer required.

16944

Hi @david-arm, I have changed this to be a sequence of if-else statements checking MemVT as suggested. I tried adding a test using types such as nxv12i32, but found that this crashes with "Do not know how to widen the result of this operator!" when the stepvector has an illegal type.

16954

This patch was only for handling legal types, though I think i8/i16 types could be supported in future if this would be beneficial.

Hi @kmclaughlin, this patch looks a lot better now thanks! Most of my remaining comments are minor, but I do think we should add some explicit checks for truncating scatters and extending gathers.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16785

nit: Is it worth adding a comment here that Index should always be a STEP_VECTOR, which is the reason you are assuming the first operand is a ConstantSDNode?

16796

I think you may need some extra checks here for extending gather loads and truncating scatter stores, bailing out in both cases; otherwise we may attempt to do the DAG combine for extloads of nxv4i32 -> nxv4i64.
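
For example, the guard might look something like the following (MGT/MSC are assumed to be the gather/scatter nodes, as in the sketch further below):

// Hypothetical checks: bail out for extending gathers and truncating scatters.
if (MGT && MGT->getExtensionType() != ISD::NON_EXTLOAD)
  return SDValue();
if (MSC && MSC->isTruncatingStore())
  return SDValue();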

16805

Given that you're creating new nodes above and then potentially dropping them, it might be better to only create nodes once you're guaranteed to do the transformation, i.e. something like

auto *MGT = dyn_cast<MaskedGatherSDNode>(MGS);
if (MGT && !MGT->getPassThru().isUndef())
  return SDValue();

// Using contiguous stores is not beneficial over 64-bit scatters
auto *MSC = dyn_cast<MaskedScatterSDNode>(MGS);
if (MSC && MSC->getValue().getValueType().getScalarSizeInBits() == 64)
  return SDValue();

// Split the mask into two parts and interleave with false
...

if (MGT) {
  .. gather case ..
}
16809

nit: I think you also calculate this for the scatter case. Maybe it's worth moving this to above the if statement and then re-using the value for both gather and scatter cases?

16824

I'm not sure if you can simply reuse the same memory operand here, i.e. if you look at DAGTypeLegalizer::SplitVecRes_MLOAD you'll see an example of what we do for scalable vectors. I think it's because we're loading from a different offset now.
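
A rough sketch of what building a fresh memory operand might look like, loosely following SplitVecRes_MLOAD (the flags, size and variable names are assumptions):

// Hypothetical: create a new MachineMemOperand for the second, offset access
// instead of reusing the gather/scatter's operand.
MachineFunction &MF = DAG.getMachineFunction();
MachineMemOperand *HiMMO = MF.getMachineMemOperand(
    MachinePointerInfo(MGS->getPointerInfo().getAddrSpace()),
    MachineMemOperand::MOLoad, MemoryLocation::UnknownSize,
    MGS->getOriginalAlign(), MGS->getAAInfo());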

16866

Again, I think you need to create a new mem operand here.

llvm/test/CodeGen/AArch64/sve-gather-scatter-to-contiguous.ll
7

I think it's worth having a FP gather test and FP scatter test too, since we support them

I find the performance claims interesting. I've asked SVE hardware engineers before which is faster -- gather, versus load and shuffle, versus load2 -- and the answer was essentially "it depends on your loop". If you're using up ports to shuffle that could otherwise be used for computation, it seems like this would be a loser. If that analysis is correct then IMO this decision should be made in LV and the backend should honor the gather. However, if it stays in the backend, I still have some comments:

  • 64b scatters being as fast/faster than either contiguous stores or st2 is amazing if true, but is that always going to be true for all SVE targets? I'm guessing some sort of "preferScatter()" or "preferGather()" on a per target basis will (eventually?) be needed for this, but I don't have access to different SVE capable chips.
  • I understand why the contiguous sequence is needed over ld2/st2, but what about when you know it is safe to use ld2/st2? If the contiguous sequence you have here is indeed faster, would existing combines that match to ld2/st2 be combined to this instead?
  • Should this combine also apply to the "odd" elements, which are also stride 2 { 1, 3, 5, 7... }?
  • Does this change still allow a match to ld2/st2 when all the elements are accessed by a pair of gathers/scatters?
kmclaughlin marked 7 inline comments as done.
  • Added the preferGatherScatter function to AArch64Subtarget, which checks if a gather/scatter with a given type & stride is preferable to contiguous loads/stores on the current target.
  • Created new machine memory operands for the contiguous loads/stores instead of reusing the mem operand from the gather/scatter.
  • Added tests for floating-point gathers & scatters, and extending & truncating gathers/scatters.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16796

I think in the case of extending gather loads and truncating scatter stores, the index opcode would be a sign/zero-extend or trunc instead of a stepvector? I've added some tests to check that we don't perform the combine in these cases.

Hi @rscottmanley, thank you for taking a look at this patch.

  • 64b scatters being as fast/faster than either contiguous stores or st2 is amazing if true, but is that always going to be true for all SVE targets? I'm guessing some sort of "preferScatter()" or "preferGather()" on a per target basis will (eventually?) be needed for this, but I don't have access to different SVE capable chips.

We expect it to be faster on Neoverse-v1 for 32b/64b gathers & 64b scatters with a stride of two, so those are the cases we're currently optimising for. You're right that it makes sense to tune this per subtarget and I've added this now.

  • I understand why the contiguous sequence is needed over ld2/st2, but what about when you know it is safe to use ld2/st2? If the contiguous sequence you have here is indeed faster, would existing combines that match to ld2/st2 be combined to this instead?
  • Should this combine also apply to the "odd" elements, which are also stride 2 { 1, 3, 5, 7... }?
  • Does this change still allow a match to ld2/st2 when all the elements are accessed by a pair of gathers/scatters?

This approach is largely a stop-gap until we have proper (de)interleaving in the loop vectoriser, or until the InterleavedAccess pass supports scalable vectors. We would expect those transformations to use an appropriate cost model to decide whether to use gathers/scatters, ld2/st2 or explicit (de)interleaving intrinsics. All of those transformations would happen before legalisation, so if it ends up here as a gather, that is either because it wasn't safe to use ld2/st2 or because the cost model argued against it. That makes this a bit of a last-resort DAGCombine fold to improve code quality. Until we have proper (de)interleaving support in the passes mentioned above, this mechanism will at least already improve performance for the subtargets where it is enabled.

Thanks for making all the changes @kmclaughlin! I think it looks pretty good now. I just had a few more minor comments ...

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16818

nit: Looks like this is indented too much here.

16830

nit: I think you can also move this above the if statement and then reuse for the store case too, and use MGS->getPointerInfo() instead.

llvm/lib/Target/AArch64/AArch64Subtarget.cpp
364

nit: Can you fix the formatting issue? Thanks!

llvm/lib/Target/AArch64/AArch64Subtarget.h
672

I think it's worth being more explicit here about what VecTy actually is, because it could be different to the loaded or stored data type. Do you think it's worth renaming this to MemTy to be clear? Or you can keep the same variable name, but add a comment saying that it is the type of the data in memory. The choice is yours!

@kmclaughlin Thanks for the reply and adding the subtarget check. I figured this might have something to do with the scalable vectorization and this is a reasonable stopgap.

kmclaughlin marked 4 inline comments as done.
  • Renamed VecVT to MemVT in the preferGatherScatter function.
  • Fixed formatting issues.

LGTM (with formatting issues fixed)! This looks great now thanks @kmclaughlin. :)

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16773

nit: Can you fix all the formatting issues in your new code before merging please?

david-arm accepted this revision. · Mar 14 2022, 2:20 AM
This revision is now accepted and ready to land. · Mar 14 2022, 2:20 AM

Perhaps I've misunderstood something, but I'd like to take a step back to make sure we're not overcomplicating things here and, at the same time, making it harder to handle more cases in the future.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16801–16806

Is this a temporary restriction? I ask because when I asked about doing this in IR instead of CodeGen, one of the answers was to support illegal types.

To be clear, I'm not against solving this in CodeGen, but my assumption is that one of the biggest wins from this transform is when trying to gather/scatter bytes or halfwords, which are not supported and thus require splitting. However, they can be handled more efficiently with masked loads and stores.

16816–16818

Can these use ISD::EXTRACT_SUBVECTOR instead? I ask because those will emit better code (removing the need for pfalse) and also allow DAGCombine to perhaps optimise further (e.g. when Mask is a constant).

With that said, ...

16827–16845

...sorry if this is a silly question, as I've not scrutinised the code, but is this manual splitting and recombining necessary? It looks a little like manual type legalisation, but this combine is something that is generally best done before type legalisation, so you should be able to create a single big load and let common code split it if necessary. This might also resolve my "bytes" and "halfwords" comment above.

16857–16858

As above, using ISD::EXTRACT_SUBVECTOR here seems preferable and also means the scatter case doesn't require any target specific nodes.
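
As a generic illustration only (the mask type and variable names are assumptions, e.g. an nxv4i1 predicate split into two nxv2i1 halves), extracting the low and high halves of a scalable predicate with ISD::EXTRACT_SUBVECTOR needs no target-specific nodes:

// Hypothetical: take the low and high halves of the mask.
EVT HalfMaskVT = MaskVT.getHalfNumVectorElementsVT(*DAG.getContext());
SDValue LoMask = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfMaskVT, Mask,
                             DAG.getVectorIdxConstant(0, DL));
SDValue HiMask = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfMaskVT, Mask,
                             DAG.getVectorIdxConstant(
                                 HalfMaskVT.getVectorMinNumElements(), DL));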

16914–16915

I think these tests really belong inside tryCombineToMaskedLoadStore because it is that function that has the restrictions, and thus the same function that might later remove them.

kmclaughlin planned changes to this revision. · Mar 16 2022, 6:34 AM
Matt added a subscriber: Matt. · Mar 17 2022, 5:49 PM